linux-unionfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] overlay: Implement volatile-specific fsync error behaviour
@ 2020-12-02  9:27 Sargun Dhillon
  2020-12-02 10:25 ` Amir Goldstein
  2020-12-02 15:07 ` Vivek Goyal
  0 siblings, 2 replies; 28+ messages in thread
From: Sargun Dhillon @ 2020-12-02  9:27 UTC (permalink / raw)
  Cc: Sargun Dhillon, Amir Goldstein, linux-fsdevel, linux-unionfs,
	Jeff Layton, Miklos Szeredi, Vivek Goyal

Overlayfs's volatile option allows the user to bypass all forced sync calls
to the upperdir filesystem. This comes at the cost of safety. We can never
ensure that the user's data is intact, but we can make a best effort to
expose whether or not the data is likely to be in a bad state.

We decided[1] that the best way to handle this in the time being is that if
an overlayfs's upperdir experiences an error after a volatile mount occurs,
that error will be returned on fsync, fdatasync, sync, and syncfs. This is
contradictory to the traditional behaviour of VFS which fails the call
once, and only raises an error if a subsequent fsync error has occured,
and been raised by the filesystem.

One awkward aspect of the patch is that we have to manually set the
superblock's errseq_t after the sync_fs callback as opposed to just
returning an error from syncfs. This is because the call chain looks
something like this:

sys_syncfs ->
	sync_filesystem ->
		__sync_filesystem ->
			/* The return value is ignored here
			sb->s_op->sync_fs(sb)
			_sync_blockdev
		/* Where the VFS fetches the error to raise to userspace */
		errseq_check_and_advance

Because of this we call errseq_set every time the sync_fs callback occurs.

[1]: https://lore.kernel.org/linux-fsdevel/36d820394c3e7cd1faa1b28a8135136d5001dadd.camel@redhat.com/T/#u

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-unionfs@vger.kernel.org
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 Documentation/filesystems/overlayfs.rst |  8 ++++++++
 fs/overlayfs/file.c                     |  5 +++--
 fs/overlayfs/overlayfs.h                |  1 +
 fs/overlayfs/ovl_entry.h                |  2 ++
 fs/overlayfs/readdir.c                  |  5 +++--
 fs/overlayfs/super.c                    | 24 +++++++++++++++-------
 fs/overlayfs/util.c                     | 27 +++++++++++++++++++++++++
 7 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 580ab9a0fe31..3af569cea6a7 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -575,6 +575,14 @@ without significant effort.
 The advantage of mounting with the "volatile" option is that all forms of
 sync calls to the upper filesystem are omitted.
 
+In order to avoid a giving a false sense of safety, the syncfs (and fsync)
+semantics of volatile mounts are slightly different than that of the rest of
+VFS.  If any error occurs on the upperdir's filesystem after a volatile mount
+takes place, all sync functions will return the last error observed on the
+upperdir filesystem.  Once this condition is reached, the filesystem will not
+recover, and every subsequent sync call will return an error, even if the
+upperdir has not experience a new error since the last sync call.
+
 When overlay is mounted with "volatile" option, the directory
 "$workdir/work/incompat/volatile" is created.  During next mount, overlay
 checks for this directory and refuses to mount if present. This is a strong
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 802259f33c28..2479b297a966 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -445,8 +445,9 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	const struct cred *old_cred;
 	int ret;
 
-	if (!ovl_should_sync(OVL_FS(file_inode(file)->i_sb)))
-		return 0;
+	ret = ovl_check_sync(OVL_FS(file_inode(file)->i_sb));
+	if (ret <= 0)
+		return ret;
 
 	ret = ovl_real_fdget_meta(file, &real, !datasync);
 	if (ret)
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index f8880aa2ba0e..af79c3a2392e 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -322,6 +322,7 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct dentry *dentry);
 bool ovl_is_metacopy_dentry(struct dentry *dentry);
 char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
 			     int padding);
+int ovl_check_sync(struct ovl_fs *ofs);
 
 static inline bool ovl_is_impuredir(struct super_block *sb,
 				    struct dentry *dentry)
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 1b5a2094df8e..9460a52abea3 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -79,6 +79,8 @@ struct ovl_fs {
 	atomic_long_t last_ino;
 	/* Whiteout dentry cache */
 	struct dentry *whiteout;
+	/* snapshot of upperdir's errseq_t at mount time for volatile mounts */
+	errseq_t upper_errseq;
 };
 
 static inline struct vfsmount *ovl_upper_mnt(struct ovl_fs *ofs)
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 01620ebae1bd..f7f1a29e290f 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -909,8 +909,9 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
 	struct file *realfile;
 	int err;
 
-	if (!ovl_should_sync(OVL_FS(file->f_path.dentry->d_sb)))
-		return 0;
+	err = ovl_check_sync(OVL_FS(file->f_path.dentry->d_sb));
+	if (err <= 0)
+		return err;
 
 	realfile = ovl_dir_real_file(file, true);
 	err = PTR_ERR_OR_ZERO(realfile);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..82a096a05bce 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -261,11 +261,18 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
 	struct super_block *upper_sb;
 	int ret;
 
-	if (!ovl_upper_mnt(ofs))
-		return 0;
+	ret = ovl_check_sync(ofs);
+	/*
+	 * We have to always set the err, because the return value isn't
+	 * checked, and instead VFS looks at the writeback errseq after
+	 * this call.
+	 */
+	if (ret < 0)
+		errseq_set(&sb->s_wb_err, ret);
+
+	if (!ret)
+		return ret;
 
-	if (!ovl_should_sync(ofs))
-		return 0;
 	/*
 	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
 	 * All the super blocks will be iterated, including upper_sb.
@@ -1927,6 +1934,8 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_op = &ovl_super_operations;
 
 	if (ofs->config.upperdir) {
+		struct super_block *upper_mnt_sb;
+
 		if (!ofs->config.workdir) {
 			pr_err("missing 'workdir'\n");
 			goto out_err;
@@ -1943,9 +1952,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 		if (!ofs->workdir)
 			sb->s_flags |= SB_RDONLY;
 
-		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
-		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
-
+		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
+		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
+		sb->s_time_gran = upper_mnt_sb->s_time_gran;
+		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
 	}
 	oe = ovl_get_lowerstack(sb, splitlower, numlower, ofs, layers);
 	err = PTR_ERR(oe);
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 23f475627d07..9b460cd7b151 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -950,3 +950,30 @@ char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
 	kfree(buf);
 	return ERR_PTR(res);
 }
+
+/*
+ * ovl_check_sync provides sync checking, and safety for volatile mounts
+ *
+ * Returns 1 if sync required.
+ *
+ * Returns 0 if syncing can be skipped because mount is volatile, and no errors
+ * have occurred on the upperdir since the mount.
+ *
+ * Returns -errno if it is a volatile mount, and the error that occurred since
+ * the last mount. If the error code changes, it'll return the latest error
+ * code.
+ */
+
+int ovl_check_sync(struct ovl_fs *ofs)
+{
+	struct vfsmount *mnt;
+
+	if (ovl_should_sync(ofs))
+		return 1;
+
+	mnt = ovl_upper_mnt(ofs);
+	if (!mnt)
+		return 0;
+
+	return errseq_check(&mnt->mnt_sb->s_wb_err, ofs->upper_errseq);
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02  9:27 [PATCH] overlay: Implement volatile-specific fsync error behaviour Sargun Dhillon
@ 2020-12-02 10:25 ` Amir Goldstein
  2020-12-02 15:07 ` Vivek Goyal
  1 sibling, 0 replies; 28+ messages in thread
From: Amir Goldstein @ 2020-12-02 10:25 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: linux-fsdevel, overlayfs, Jeff Layton, Miklos Szeredi, Vivek Goyal

On Wed, Dec 2, 2020 at 11:27 AM Sargun Dhillon <sargun@sargun.me> wrote:
>
> Overlayfs's volatile option allows the user to bypass all forced sync calls
> to the upperdir filesystem. This comes at the cost of safety. We can never
> ensure that the user's data is intact, but we can make a best effort to
> expose whether or not the data is likely to be in a bad state.
>
> We decided[1] that the best way to handle this in the time being is that if
> an overlayfs's upperdir experiences an error after a volatile mount occurs,
> that error will be returned on fsync, fdatasync, sync, and syncfs. This is
> contradictory to the traditional behaviour of VFS which fails the call
> once, and only raises an error if a subsequent fsync error has occured,
> and been raised by the filesystem.
>
> One awkward aspect of the patch is that we have to manually set the
> superblock's errseq_t after the sync_fs callback as opposed to just
> returning an error from syncfs. This is because the call chain looks
> something like this:
>
> sys_syncfs ->
>         sync_filesystem ->
>                 __sync_filesystem ->
>                         /* The return value is ignored here
>                         sb->s_op->sync_fs(sb)
>                         _sync_blockdev
>                 /* Where the VFS fetches the error to raise to userspace */
>                 errseq_check_and_advance
>
> Because of this we call errseq_set every time the sync_fs callback occurs.
>
> [1]: https://lore.kernel.org/linux-fsdevel/36d820394c3e7cd1faa1b28a8135136d5001dadd.camel@redhat.com/T/#u
>
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Suggested-by: Amir Goldstein <amir73il@gmail.com>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> Cc: Jeff Layton <jlayton@redhat.com>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> ---

Looks safe :-)

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

We should consider sending this to stable, but maybe let's merge first
and let it
run in master for a while before because it is not a clear and immediate danger
and if anyone is using volatile already I hope they read all the
warnings on the box.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02  9:27 [PATCH] overlay: Implement volatile-specific fsync error behaviour Sargun Dhillon
  2020-12-02 10:25 ` Amir Goldstein
@ 2020-12-02 15:07 ` Vivek Goyal
  2020-12-02 17:02   ` Jeff Layton
  2020-12-02 17:41   ` Amir Goldstein
  1 sibling, 2 replies; 28+ messages in thread
From: Vivek Goyal @ 2020-12-02 15:07 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Amir Goldstein, linux-fsdevel, linux-unionfs, Jeff Layton,
	Miklos Szeredi

On Wed, Dec 02, 2020 at 01:27:20AM -0800, Sargun Dhillon wrote:
> Overlayfs's volatile option allows the user to bypass all forced sync calls
> to the upperdir filesystem. This comes at the cost of safety. We can never
> ensure that the user's data is intact, but we can make a best effort to
> expose whether or not the data is likely to be in a bad state.
> 
> We decided[1] that the best way to handle this in the time being is that if
> an overlayfs's upperdir experiences an error after a volatile mount occurs,
> that error will be returned on fsync, fdatasync, sync, and syncfs. This is
> contradictory to the traditional behaviour of VFS which fails the call
> once, and only raises an error if a subsequent fsync error has occured,
> and been raised by the filesystem.
> 
> One awkward aspect of the patch is that we have to manually set the
> superblock's errseq_t after the sync_fs callback as opposed to just
> returning an error from syncfs. This is because the call chain looks
> something like this:
> 
> sys_syncfs ->
> 	sync_filesystem ->
> 		__sync_filesystem ->
> 			/* The return value is ignored here
> 			sb->s_op->sync_fs(sb)
> 			_sync_blockdev
> 		/* Where the VFS fetches the error to raise to userspace */
> 		errseq_check_and_advance
> 
> Because of this we call errseq_set every time the sync_fs callback occurs.
> 
> [1]: https://lore.kernel.org/linux-fsdevel/36d820394c3e7cd1faa1b28a8135136d5001dadd.camel@redhat.com/T/#u
> 
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Suggested-by: Amir Goldstein <amir73il@gmail.com>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> Cc: Jeff Layton <jlayton@redhat.com>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> ---
>  Documentation/filesystems/overlayfs.rst |  8 ++++++++
>  fs/overlayfs/file.c                     |  5 +++--
>  fs/overlayfs/overlayfs.h                |  1 +
>  fs/overlayfs/ovl_entry.h                |  2 ++
>  fs/overlayfs/readdir.c                  |  5 +++--
>  fs/overlayfs/super.c                    | 24 +++++++++++++++-------
>  fs/overlayfs/util.c                     | 27 +++++++++++++++++++++++++
>  7 files changed, 61 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> index 580ab9a0fe31..3af569cea6a7 100644
> --- a/Documentation/filesystems/overlayfs.rst
> +++ b/Documentation/filesystems/overlayfs.rst
> @@ -575,6 +575,14 @@ without significant effort.
>  The advantage of mounting with the "volatile" option is that all forms of
>  sync calls to the upper filesystem are omitted.
>  
> +In order to avoid a giving a false sense of safety, the syncfs (and fsync)
> +semantics of volatile mounts are slightly different than that of the rest of
> +VFS.  If any error occurs on the upperdir's filesystem after a volatile mount
> +takes place, all sync functions will return the last error observed on the
> +upperdir filesystem.  Once this condition is reached, the filesystem will not
> +recover, and every subsequent sync call will return an error, even if the
> +upperdir has not experience a new error since the last sync call.
> +
>  When overlay is mounted with "volatile" option, the directory
>  "$workdir/work/incompat/volatile" is created.  During next mount, overlay
>  checks for this directory and refuses to mount if present. This is a strong
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index 802259f33c28..2479b297a966 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -445,8 +445,9 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>  	const struct cred *old_cred;
>  	int ret;
>  
> -	if (!ovl_should_sync(OVL_FS(file_inode(file)->i_sb)))
> -		return 0;
> +	ret = ovl_check_sync(OVL_FS(file_inode(file)->i_sb));
> +	if (ret <= 0)
> +		return ret;
>  
>  	ret = ovl_real_fdget_meta(file, &real, !datasync);
>  	if (ret)
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index f8880aa2ba0e..af79c3a2392e 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -322,6 +322,7 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct dentry *dentry);
>  bool ovl_is_metacopy_dentry(struct dentry *dentry);
>  char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
>  			     int padding);
> +int ovl_check_sync(struct ovl_fs *ofs);
>  
>  static inline bool ovl_is_impuredir(struct super_block *sb,
>  				    struct dentry *dentry)
> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> index 1b5a2094df8e..9460a52abea3 100644
> --- a/fs/overlayfs/ovl_entry.h
> +++ b/fs/overlayfs/ovl_entry.h
> @@ -79,6 +79,8 @@ struct ovl_fs {
>  	atomic_long_t last_ino;
>  	/* Whiteout dentry cache */
>  	struct dentry *whiteout;
> +	/* snapshot of upperdir's errseq_t at mount time for volatile mounts */
> +	errseq_t upper_errseq;
>  };
>  
>  static inline struct vfsmount *ovl_upper_mnt(struct ovl_fs *ofs)
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 01620ebae1bd..f7f1a29e290f 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -909,8 +909,9 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
>  	struct file *realfile;
>  	int err;
>  
> -	if (!ovl_should_sync(OVL_FS(file->f_path.dentry->d_sb)))
> -		return 0;
> +	err = ovl_check_sync(OVL_FS(file->f_path.dentry->d_sb));
> +	if (err <= 0)
> +		return err;
>  
>  	realfile = ovl_dir_real_file(file, true);
>  	err = PTR_ERR_OR_ZERO(realfile);
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 290983bcfbb3..82a096a05bce 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -261,11 +261,18 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
>  	struct super_block *upper_sb;
>  	int ret;
>  
> -	if (!ovl_upper_mnt(ofs))
> -		return 0;
> +	ret = ovl_check_sync(ofs);
> +	/*
> +	 * We have to always set the err, because the return value isn't
> +	 * checked, and instead VFS looks at the writeback errseq after
> +	 * this call.
> +	 */
> +	if (ret < 0)
> +		errseq_set(&sb->s_wb_err, ret);

I was wondering that why errseq_set() will result in returning error
all the time. Then realized that last syncfs() call must have set
ERRSEQ_SEEN flag and that will mean errseq_set() will increment
counter and that means this syncfs() will will return error too. Cool.

> +
> +	if (!ret)
> +		return ret;
>  
> -	if (!ovl_should_sync(ofs))
> -		return 0;
>  	/*
>  	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
>  	 * All the super blocks will be iterated, including upper_sb.
> @@ -1927,6 +1934,8 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>  	sb->s_op = &ovl_super_operations;
>  
>  	if (ofs->config.upperdir) {
> +		struct super_block *upper_mnt_sb;
> +
>  		if (!ofs->config.workdir) {
>  			pr_err("missing 'workdir'\n");
>  			goto out_err;
> @@ -1943,9 +1952,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>  		if (!ofs->workdir)
>  			sb->s_flags |= SB_RDONLY;
>  
> -		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> -		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> -
> +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);

I asked this question in last email as well. errseq_sample() will return
0 if current error has not been seen yet. That means next time a sync
call comes for volatile mount, it will return an error. But that's
not what we want. When we mounted a volatile overlay, if there is an
existing error (seen/unseen), we don't care. We only care if there
is a new error after the volatile mount, right?

I guess we will need another helper similar to errseq_smaple() which
just returns existing value of errseq. And then we will have to
do something about errseq_check() to not return an error if "since"
and "eseq" differ only by "seen" bit.

Otherwise in current form, volatile mount will always return error
if upperdir has error and it has not been seen by anybody.

How did you finally end up testing the error case. Want to simualate
error aritificially and test it.

>  	}
>  	oe = ovl_get_lowerstack(sb, splitlower, numlower, ofs, layers);
>  	err = PTR_ERR(oe);
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index 23f475627d07..9b460cd7b151 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -950,3 +950,30 @@ char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
>  	kfree(buf);
>  	return ERR_PTR(res);
>  }
> +
> +/*
> + * ovl_check_sync provides sync checking, and safety for volatile mounts
> + *
> + * Returns 1 if sync required.
> + *
> + * Returns 0 if syncing can be skipped because mount is volatile, and no errors
> + * have occurred on the upperdir since the mount.
> + *
> + * Returns -errno if it is a volatile mount, and the error that occurred since
> + * the last mount. If the error code changes, it'll return the latest error
> + * code.
> + */
> +
> +int ovl_check_sync(struct ovl_fs *ofs)
> +{
> +	struct vfsmount *mnt;
> +
> +	if (ovl_should_sync(ofs))
> +		return 1;
> +
> +	mnt = ovl_upper_mnt(ofs);
> +	if (!mnt)
> +		return 0;
> +
> +	return errseq_check(&mnt->mnt_sb->s_wb_err, ofs->upper_errseq);

I guess we can do another patch later to set one global flag in overlayfs
super block and use that flag to return errors on other paths like 
read/write etc. But that's for a separate patch later.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02 15:07 ` Vivek Goyal
@ 2020-12-02 17:02   ` Jeff Layton
  2020-12-02 17:29     ` Vivek Goyal
  2020-12-02 17:41   ` Amir Goldstein
  1 sibling, 1 reply; 28+ messages in thread
From: Jeff Layton @ 2020-12-02 17:02 UTC (permalink / raw)
  To: Vivek Goyal, Sargun Dhillon
  Cc: Amir Goldstein, linux-fsdevel, linux-unionfs, Miklos Szeredi

On Wed, 2020-12-02 at 10:07 -0500, Vivek Goyal wrote:
> On Wed, Dec 02, 2020 at 01:27:20AM -0800, Sargun Dhillon wrote:
> > Overlayfs's volatile option allows the user to bypass all forced sync calls
> > to the upperdir filesystem. This comes at the cost of safety. We can never
> > ensure that the user's data is intact, but we can make a best effort to
> > expose whether or not the data is likely to be in a bad state.
> > 
> > We decided[1] that the best way to handle this in the time being is that if
> > an overlayfs's upperdir experiences an error after a volatile mount occurs,
> > that error will be returned on fsync, fdatasync, sync, and syncfs. This is
> > contradictory to the traditional behaviour of VFS which fails the call
> > once, and only raises an error if a subsequent fsync error has occured,
> > and been raised by the filesystem.
> > 
> > One awkward aspect of the patch is that we have to manually set the
> > superblock's errseq_t after the sync_fs callback as opposed to just
> > returning an error from syncfs. This is because the call chain looks
> > something like this:
> > 
> > sys_syncfs ->
> > 	sync_filesystem ->
> > 		__sync_filesystem ->
> > 			/* The return value is ignored here
> > 			sb->s_op->sync_fs(sb)
> > 			_sync_blockdev
> > 		/* Where the VFS fetches the error to raise to userspace */
> > 		errseq_check_and_advance
> > 
> > Because of this we call errseq_set every time the sync_fs callback occurs.
> > 
> > [1]: https://lore.kernel.org/linux-fsdevel/36d820394c3e7cd1faa1b28a8135136d5001dadd.camel@redhat.com/T/#u
> > 
> > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > Suggested-by: Amir Goldstein <amir73il@gmail.com>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-unionfs@vger.kernel.org
> > Cc: Jeff Layton <jlayton@redhat.com>
> > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > Cc: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  Documentation/filesystems/overlayfs.rst |  8 ++++++++
> >  fs/overlayfs/file.c                     |  5 +++--
> >  fs/overlayfs/overlayfs.h                |  1 +
> >  fs/overlayfs/ovl_entry.h                |  2 ++
> >  fs/overlayfs/readdir.c                  |  5 +++--
> >  fs/overlayfs/super.c                    | 24 +++++++++++++++-------
> >  fs/overlayfs/util.c                     | 27 +++++++++++++++++++++++++
> >  7 files changed, 61 insertions(+), 11 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > index 580ab9a0fe31..3af569cea6a7 100644
> > --- a/Documentation/filesystems/overlayfs.rst
> > +++ b/Documentation/filesystems/overlayfs.rst
> > @@ -575,6 +575,14 @@ without significant effort.
> >  The advantage of mounting with the "volatile" option is that all forms of
> >  sync calls to the upper filesystem are omitted.
> >  
> > 
> > 
> > 
> > +In order to avoid a giving a false sense of safety, the syncfs (and fsync)
> > +semantics of volatile mounts are slightly different than that of the rest of
> > +VFS.  If any error occurs on the upperdir's filesystem after a volatile mount
> > +takes place, all sync functions will return the last error observed on the
> > +upperdir filesystem.  Once this condition is reached, the filesystem will not
> > +recover, and every subsequent sync call will return an error, even if the
> > +upperdir has not experience a new error since the last sync call.
> > +
> >  When overlay is mounted with "volatile" option, the directory
> >  "$workdir/work/incompat/volatile" is created.  During next mount, overlay
> >  checks for this directory and refuses to mount if present. This is a strong
> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > index 802259f33c28..2479b297a966 100644
> > --- a/fs/overlayfs/file.c
> > +++ b/fs/overlayfs/file.c
> > @@ -445,8 +445,9 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> >  	const struct cred *old_cred;
> >  	int ret;
> >  
> > 
> > 
> > 
> > -	if (!ovl_should_sync(OVL_FS(file_inode(file)->i_sb)))
> > -		return 0;
> > +	ret = ovl_check_sync(OVL_FS(file_inode(file)->i_sb));
> > +	if (ret <= 0)
> > +		return ret;
> >  
> > 
> > 
> > 
> >  	ret = ovl_real_fdget_meta(file, &real, !datasync);
> >  	if (ret)
> > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > index f8880aa2ba0e..af79c3a2392e 100644
> > --- a/fs/overlayfs/overlayfs.h
> > +++ b/fs/overlayfs/overlayfs.h
> > @@ -322,6 +322,7 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct dentry *dentry);
> >  bool ovl_is_metacopy_dentry(struct dentry *dentry);
> >  char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
> >  			     int padding);
> > +int ovl_check_sync(struct ovl_fs *ofs);
> >  
> > 
> > 
> > 
> >  static inline bool ovl_is_impuredir(struct super_block *sb,
> >  				    struct dentry *dentry)
> > diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> > index 1b5a2094df8e..9460a52abea3 100644
> > --- a/fs/overlayfs/ovl_entry.h
> > +++ b/fs/overlayfs/ovl_entry.h
> > @@ -79,6 +79,8 @@ struct ovl_fs {
> >  	atomic_long_t last_ino;
> >  	/* Whiteout dentry cache */
> >  	struct dentry *whiteout;
> > +	/* snapshot of upperdir's errseq_t at mount time for volatile mounts */
> > +	errseq_t upper_errseq;
> >  };
> >  
> > 
> > 
> > 
> >  static inline struct vfsmount *ovl_upper_mnt(struct ovl_fs *ofs)
> > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > index 01620ebae1bd..f7f1a29e290f 100644
> > --- a/fs/overlayfs/readdir.c
> > +++ b/fs/overlayfs/readdir.c
> > @@ -909,8 +909,9 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
> >  	struct file *realfile;
> >  	int err;
> >  
> > 
> > 
> > 
> > -	if (!ovl_should_sync(OVL_FS(file->f_path.dentry->d_sb)))
> > -		return 0;
> > +	err = ovl_check_sync(OVL_FS(file->f_path.dentry->d_sb));
> > +	if (err <= 0)
> > +		return err;
> >  
> > 
> > 
> > 
> >  	realfile = ovl_dir_real_file(file, true);
> >  	err = PTR_ERR_OR_ZERO(realfile);
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index 290983bcfbb3..82a096a05bce 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -261,11 +261,18 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> >  	struct super_block *upper_sb;
> >  	int ret;
> >  
> > 
> > 
> > 
> > -	if (!ovl_upper_mnt(ofs))
> > -		return 0;
> > +	ret = ovl_check_sync(ofs);
> > +	/*
> > +	 * We have to always set the err, because the return value isn't
> > +	 * checked, and instead VFS looks at the writeback errseq after
> > +	 * this call.
> > +	 */
> > +	if (ret < 0)
> > +		errseq_set(&sb->s_wb_err, ret);
> 
> I was wondering that why errseq_set() will result in returning error
> all the time. Then realized that last syncfs() call must have set
> ERRSEQ_SEEN flag and that will mean errseq_set() will increment
> counter and that means this syncfs() will will return error too. Cool.
> 
> > +
> > +	if (!ret)
> > +		return ret;
> >  
> > 
> > 
> > 
> > -	if (!ovl_should_sync(ofs))
> > -		return 0;
> >  	/*
> >  	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
> >  	 * All the super blocks will be iterated, including upper_sb.
> > @@ -1927,6 +1934,8 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> >  	sb->s_op = &ovl_super_operations;
> >  
> > 
> > 
> > 
> >  	if (ofs->config.upperdir) {
> > +		struct super_block *upper_mnt_sb;
> > +
> >  		if (!ofs->config.workdir) {
> >  			pr_err("missing 'workdir'\n");
> >  			goto out_err;
> > @@ -1943,9 +1952,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> >  		if (!ofs->workdir)
> >  			sb->s_flags |= SB_RDONLY;
> >  
> > 
> > 
> > 
> > -		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> > -		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> > -
> > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> 
> I asked this question in last email as well. errseq_sample() will return
> 0 if current error has not been seen yet. That means next time a sync
> call comes for volatile mount, it will return an error. But that's
> not what we want. When we mounted a volatile overlay, if there is an
> existing error (seen/unseen), we don't care. We only care if there
> is a new error after the volatile mount, right?
> 
> I guess we will need another helper similar to errseq_smaple() which
> just returns existing value of errseq. And then we will have to
> do something about errseq_check() to not return an error if "since"
> and "eseq" differ only by "seen" bit.
> 
> Otherwise in current form, volatile mount will always return error
> if upperdir has error and it has not been seen by anybody.
> 
> How did you finally end up testing the error case. Want to simualate
> error aritificially and test it.
> 

If you don't want to see errors that occurred before you did the mount,
then you probably can just resurrect and rename the original version of
errseq_sample. Something like this, but with a different name:

+errseq_t errseq_sample(errseq_t *eseq)
+{
+       errseq_t old = READ_ONCE(*eseq);
+       errseq_t new = old;
+
+       /*
+        * For the common case of no errors ever having been set, we can skip
+        * marking the SEEN bit. Once an error has been set, the value will
+        * never go back to zero.
+        */
+       if (old != 0) {
+               new |= ERRSEQ_SEEN;
+               if (old != new)
+                       cmpxchg(eseq, old, new);
+       }
+       return new;
+}


>  	}
>  	oe = ovl_get_lowerstack(sb, splitlower, numlower, ofs, layers);
>  	err = PTR_ERR(oe);
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index 23f475627d07..9b460cd7b151 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -950,3 +950,30 @@ char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
>  	kfree(buf);
>  	return ERR_PTR(res);
>  }
> +
> +/*
> + * ovl_check_sync provides sync checking, and safety for volatile mounts
> + *
> + * Returns 1 if sync required.
> + *
> + * Returns 0 if syncing can be skipped because mount is volatile, and no errors
> + * have occurred on the upperdir since the mount.
> + *
> + * Returns -errno if it is a volatile mount, and the error that occurred since
> + * the last mount. If the error code changes, it'll return the latest error
> + * code.
> + */
> +
> +int ovl_check_sync(struct ovl_fs *ofs)
> +{
> +	struct vfsmount *mnt;
> +
> +	if (ovl_should_sync(ofs))
> +		return 1;
> +
> +	mnt = ovl_upper_mnt(ofs);
> +	if (!mnt)
> +		return 0;
> +
> +	return errseq_check(&mnt->mnt_sb->s_wb_err, ofs->upper_errseq);

I guess we can do another patch later to set one global flag in overlayfs
super block and use that flag to return errors on other paths like 
read/write etc. But that's for a separate patch later.

Thanks
Vivek


-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02 17:02   ` Jeff Layton
@ 2020-12-02 17:29     ` Vivek Goyal
  2020-12-02 18:22       ` Jeff Layton
  2020-12-02 18:49       ` Sargun Dhillon
  0 siblings, 2 replies; 28+ messages in thread
From: Vivek Goyal @ 2020-12-02 17:29 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Sargun Dhillon, Amir Goldstein, linux-fsdevel, linux-unionfs,
	Miklos Szeredi

On Wed, Dec 02, 2020 at 12:02:43PM -0500, Jeff Layton wrote:

[..]
> > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > index 290983bcfbb3..82a096a05bce 100644
> > > --- a/fs/overlayfs/super.c
> > > +++ b/fs/overlayfs/super.c
> > > @@ -261,11 +261,18 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > >  	struct super_block *upper_sb;
> > >  	int ret;
> > >  
> > > 
> > > 
> > > 
> > > -	if (!ovl_upper_mnt(ofs))
> > > -		return 0;
> > > +	ret = ovl_check_sync(ofs);
> > > +	/*
> > > +	 * We have to always set the err, because the return value isn't
> > > +	 * checked, and instead VFS looks at the writeback errseq after
> > > +	 * this call.
> > > +	 */
> > > +	if (ret < 0)
> > > +		errseq_set(&sb->s_wb_err, ret);
> > 
> > I was wondering that why errseq_set() will result in returning error
> > all the time. Then realized that last syncfs() call must have set
> > ERRSEQ_SEEN flag and that will mean errseq_set() will increment
> > counter and that means this syncfs() will will return error too. Cool.
> > 
> > > +
> > > +	if (!ret)
> > > +		return ret;
> > >  
> > > 
> > > 
> > > 
> > > -	if (!ovl_should_sync(ofs))
> > > -		return 0;
> > >  	/*
> > >  	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
> > >  	 * All the super blocks will be iterated, including upper_sb.
> > > @@ -1927,6 +1934,8 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > >  	sb->s_op = &ovl_super_operations;
> > >  
> > > 
> > > 
> > > 
> > >  	if (ofs->config.upperdir) {
> > > +		struct super_block *upper_mnt_sb;
> > > +
> > >  		if (!ofs->config.workdir) {
> > >  			pr_err("missing 'workdir'\n");
> > >  			goto out_err;
> > > @@ -1943,9 +1952,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > >  		if (!ofs->workdir)
> > >  			sb->s_flags |= SB_RDONLY;
> > >  
> > > 
> > > 
> > > 
> > > -		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> > > -		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> > > -
> > > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > 
> > I asked this question in last email as well. errseq_sample() will return
> > 0 if current error has not been seen yet. That means next time a sync
> > call comes for volatile mount, it will return an error. But that's
> > not what we want. When we mounted a volatile overlay, if there is an
> > existing error (seen/unseen), we don't care. We only care if there
> > is a new error after the volatile mount, right?
> > 
> > I guess we will need another helper similar to errseq_smaple() which
> > just returns existing value of errseq. And then we will have to
> > do something about errseq_check() to not return an error if "since"
> > and "eseq" differ only by "seen" bit.
> > 
> > Otherwise in current form, volatile mount will always return error
> > if upperdir has error and it has not been seen by anybody.
> > 
> > How did you finally end up testing the error case. Want to simualate
> > error aritificially and test it.
> > 
> 
> If you don't want to see errors that occurred before you did the mount,
> then you probably can just resurrect and rename the original version of
> errseq_sample. Something like this, but with a different name:
> 
> +errseq_t errseq_sample(errseq_t *eseq)
> +{
> +       errseq_t old = READ_ONCE(*eseq);
> +       errseq_t new = old;
> +
> +       /*
> +        * For the common case of no errors ever having been set, we can skip
> +        * marking the SEEN bit. Once an error has been set, the value will
> +        * never go back to zero.
> +        */
> +       if (old != 0) {
> +               new |= ERRSEQ_SEEN;
> +               if (old != new)
> +                       cmpxchg(eseq, old, new);
> +       }
> +       return new;
> +}

Yes, a helper like this should solve the issue at hand. We are not
interested in previous errors. This also sets the ERRSEQ_SEEN on 
sample and it will also solve the other issue when after sampling
if error gets seen, we don't want errseq_check() to return error.

Thinking of some possible names for new function.

errseq_sample_seen()
errseq_sample_set_seen()
errseq_sample_consume_unseen()
errseq_sample_current()

Thanks
Vivek


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02 15:07 ` Vivek Goyal
  2020-12-02 17:02   ` Jeff Layton
@ 2020-12-02 17:41   ` Amir Goldstein
  1 sibling, 0 replies; 28+ messages in thread
From: Amir Goldstein @ 2020-12-02 17:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Sargun Dhillon, linux-fsdevel, overlayfs, Jeff Layton, Miklos Szeredi

> I asked this question in last email as well. errseq_sample() will return
> 0 if current error has not been seen yet. That means next time a sync
> call comes for volatile mount, it will return an error. But that's
> not what we want. When we mounted a volatile overlay, if there is an
> existing error (seen/unseen), we don't care. We only care if there
> is a new error after the volatile mount, right?
>
> I guess we will need another helper similar to errseq_smaple() which
> just returns existing value of errseq. And then we will have to
> do something about errseq_check() to not return an error if "since"
> and "eseq" differ only by "seen" bit.
>
> Otherwise in current form, volatile mount will always return error
> if upperdir has error and it has not been seen by anybody.
>
> How did you finally end up testing the error case. Want to simualate
> error aritificially and test it.
>

Good spotting!

Besides the specialized test for sync error,
I wonder if anybody ever tested "volatile" setup with xfstests or unionmount?

In xfsftest can set envvar OVERLAY_MOUNT_OPTIONS="-o volatile"

In unionmount I have a branch [1] with support for envvar
UNIONMOUNT_MNTOPTIONS.

I did not merge this change to master because nobody (but me) tested
it, so that would be a good opportunity (hint hint)

Thanks,
Amir.

[1] https://github.com/amir73il/unionmount-testsuite/commits/envvars

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02 17:29     ` Vivek Goyal
@ 2020-12-02 18:22       ` Jeff Layton
  2020-12-02 18:56         ` Vivek Goyal
  2020-12-02 18:49       ` Sargun Dhillon
  1 sibling, 1 reply; 28+ messages in thread
From: Jeff Layton @ 2020-12-02 18:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Sargun Dhillon, Amir Goldstein, linux-fsdevel, linux-unionfs,
	Miklos Szeredi

On Wed, 2020-12-02 at 12:29 -0500, Vivek Goyal wrote:
> On Wed, Dec 02, 2020 at 12:02:43PM -0500, Jeff Layton wrote:
> 
> [..]
> > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > index 290983bcfbb3..82a096a05bce 100644
> > > > --- a/fs/overlayfs/super.c
> > > > +++ b/fs/overlayfs/super.c
> > > > @@ -261,11 +261,18 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > >  	struct super_block *upper_sb;
> > > >  	int ret;
> > > >  
> > > > 
> > > > 
> > > > 
> > > > -	if (!ovl_upper_mnt(ofs))
> > > > -		return 0;
> > > > +	ret = ovl_check_sync(ofs);
> > > > +	/*
> > > > +	 * We have to always set the err, because the return value isn't
> > > > +	 * checked, and instead VFS looks at the writeback errseq after
> > > > +	 * this call.
> > > > +	 */
> > > > +	if (ret < 0)
> > > > +		errseq_set(&sb->s_wb_err, ret);
> > > 
> > > I was wondering that why errseq_set() will result in returning error
> > > all the time. Then realized that last syncfs() call must have set
> > > ERRSEQ_SEEN flag and that will mean errseq_set() will increment
> > > counter and that means this syncfs() will will return error too. Cool.
> > > 
> > > > +
> > > > +	if (!ret)
> > > > +		return ret;
> > > >  
> > > > 
> > > > 
> > > > 
> > > > -	if (!ovl_should_sync(ofs))
> > > > -		return 0;
> > > >  	/*
> > > >  	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
> > > >  	 * All the super blocks will be iterated, including upper_sb.
> > > > @@ -1927,6 +1934,8 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > >  	sb->s_op = &ovl_super_operations;
> > > >  
> > > > 
> > > > 
> > > > 
> > > >  	if (ofs->config.upperdir) {
> > > > +		struct super_block *upper_mnt_sb;
> > > > +
> > > >  		if (!ofs->config.workdir) {
> > > >  			pr_err("missing 'workdir'\n");
> > > >  			goto out_err;
> > > > @@ -1943,9 +1952,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > >  		if (!ofs->workdir)
> > > >  			sb->s_flags |= SB_RDONLY;
> > > >  
> > > > 
> > > > 
> > > > 
> > > > -		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> > > > -		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> > > > -
> > > > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > 
> > > I asked this question in last email as well. errseq_sample() will return
> > > 0 if current error has not been seen yet. That means next time a sync
> > > call comes for volatile mount, it will return an error. But that's
> > > not what we want. When we mounted a volatile overlay, if there is an
> > > existing error (seen/unseen), we don't care. We only care if there
> > > is a new error after the volatile mount, right?
> > > 
> > > I guess we will need another helper similar to errseq_smaple() which
> > > just returns existing value of errseq. And then we will have to
> > > do something about errseq_check() to not return an error if "since"
> > > and "eseq" differ only by "seen" bit.
> > > 
> > > Otherwise in current form, volatile mount will always return error
> > > if upperdir has error and it has not been seen by anybody.
> > > 
> > > How did you finally end up testing the error case. Want to simualate
> > > error aritificially and test it.
> > > 
> > 
> > If you don't want to see errors that occurred before you did the mount,
> > then you probably can just resurrect and rename the original version of
> > errseq_sample. Something like this, but with a different name:
> > 
> > +errseq_t errseq_sample(errseq_t *eseq)
> > +{
> > +       errseq_t old = READ_ONCE(*eseq);
> > +       errseq_t new = old;
> > +
> > +       /*
> > +        * For the common case of no errors ever having been set, we can skip
> > +        * marking the SEEN bit. Once an error has been set, the value will
> > +        * never go back to zero.
> > +        */
> > +       if (old != 0) {
> > +               new |= ERRSEQ_SEEN;
> > +               if (old != new)
> > +                       cmpxchg(eseq, old, new);
> > +       }
> > +       return new;
> > +}
> 
> Yes, a helper like this should solve the issue at hand. We are not
> interested in previous errors. This also sets the ERRSEQ_SEEN on 
> sample and it will also solve the other issue when after sampling
> if error gets seen, we don't want errseq_check() to return error.
> 
> Thinking of some possible names for new function.
> 
> errseq_sample_seen()
> errseq_sample_set_seen()
> errseq_sample_consume_unseen()
> errseq_sample_current()
> 

errseq_sample_consume_unseen() sounds good, though maybe it should be
"ignore_unseen"? IDK, naming this stuff is the hardest part.

If you don't want to add a new helper, I think you'd probably also be
able to do something like this in fill_super:

    errseq_sample()
    errseq_check_and_advance()


...and just ignore the error returned by the check and advance. At that
point, the cursor should be caught up and any subsequent syncfs call
should return 0 until you record another error. It's a little less
efficient, but only slightly so.
-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02 17:29     ` Vivek Goyal
  2020-12-02 18:22       ` Jeff Layton
@ 2020-12-02 18:49       ` Sargun Dhillon
  2020-12-02 19:10         ` Jeff Layton
  2020-12-03 10:36         ` Amir Goldstein
  1 sibling, 2 replies; 28+ messages in thread
From: Sargun Dhillon @ 2020-12-02 18:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jeff Layton, Amir Goldstein, linux-fsdevel, linux-unionfs,
	Miklos Szeredi

On Wed, Dec 02, 2020 at 12:29:06PM -0500, Vivek Goyal wrote:
> On Wed, Dec 02, 2020 at 12:02:43PM -0500, Jeff Layton wrote:
> 
> [..]
> > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > index 290983bcfbb3..82a096a05bce 100644
> > > > --- a/fs/overlayfs/super.c
> > > > +++ b/fs/overlayfs/super.c
> > > > @@ -261,11 +261,18 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > >  	struct super_block *upper_sb;
> > > >  	int ret;
> > > >  
> > > > 
> > > > 
> > > > 
> > > > -	if (!ovl_upper_mnt(ofs))
> > > > -		return 0;
> > > > +	ret = ovl_check_sync(ofs);
> > > > +	/*
> > > > +	 * We have to always set the err, because the return value isn't
> > > > +	 * checked, and instead VFS looks at the writeback errseq after
> > > > +	 * this call.
> > > > +	 */
> > > > +	if (ret < 0)
> > > > +		errseq_set(&sb->s_wb_err, ret);
> > > 
> > > I was wondering that why errseq_set() will result in returning error
> > > all the time. Then realized that last syncfs() call must have set
> > > ERRSEQ_SEEN flag and that will mean errseq_set() will increment
> > > counter and that means this syncfs() will will return error too. Cool.
> > > 
> > > > +
> > > > +	if (!ret)
> > > > +		return ret;
> > > >  
> > > > 
> > > > 
> > > > 
> > > > -	if (!ovl_should_sync(ofs))
> > > > -		return 0;
> > > >  	/*
> > > >  	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
> > > >  	 * All the super blocks will be iterated, including upper_sb.
> > > > @@ -1927,6 +1934,8 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > >  	sb->s_op = &ovl_super_operations;
> > > >  
> > > > 
> > > > 
> > > > 
> > > >  	if (ofs->config.upperdir) {
> > > > +		struct super_block *upper_mnt_sb;
> > > > +
> > > >  		if (!ofs->config.workdir) {
> > > >  			pr_err("missing 'workdir'\n");
> > > >  			goto out_err;
> > > > @@ -1943,9 +1952,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > >  		if (!ofs->workdir)
> > > >  			sb->s_flags |= SB_RDONLY;
> > > >  
> > > > 
> > > > 
> > > > 
> > > > -		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> > > > -		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> > > > -
> > > > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > 
> > > I asked this question in last email as well. errseq_sample() will return
> > > 0 if current error has not been seen yet. That means next time a sync
> > > call comes for volatile mount, it will return an error. But that's
> > > not what we want. When we mounted a volatile overlay, if there is an
> > > existing error (seen/unseen), we don't care. We only care if there
> > > is a new error after the volatile mount, right?
> > > 
> > > I guess we will need another helper similar to errseq_smaple() which
> > > just returns existing value of errseq. And then we will have to
> > > do something about errseq_check() to not return an error if "since"
> > > and "eseq" differ only by "seen" bit.
> > > 
> > > Otherwise in current form, volatile mount will always return error
> > > if upperdir has error and it has not been seen by anybody.
> > > 
> > > How did you finally end up testing the error case. Want to simualate
> > > error aritificially and test it.
> > > 

I used the blockdev error injection layer. It only works with ext2, because
ext4 (and other filesystems) will error and go into readonly.

dd if=/dev/zero of=/tmp/loop bs=1M count=100
losetup /dev/loop8 /tmp/loop 
mkfs.ext2 /dev/loop8
mount -o errors=continue /dev/loop8 /mnt/loop/
mkdir -p /mnt/loop/{upperdir,workdir}
mount -t overlay -o volatile,index=off,lowerdir=/root/lowerdir,upperdir=/mnt/loop/upperdir,workdir=/mnt/loop/workdir none /mnt/foo/
echo 1 > /sys/block/loop8/make-it-fail
echo 100 > /sys/kernel/debug/fail_make_request/probability
echo 1 > /sys/kernel/debug/fail_make_request/times
dd if=/dev/zero of=/mnt/foo/zero bs=1M count=1
sync

I tried to get XFS tests working, but I was unable to get a simpler repro than 
above. This is also easy enough to do with a simple kernel module. Maybe it'd be 
neat to be able to inject in errseq increments via the fault injection API one 
day? I have no idea what the VFS's approach here is.

> > 
> > If you don't want to see errors that occurred before you did the mount,
> > then you probably can just resurrect and rename the original version of
> > errseq_sample. Something like this, but with a different name:
> > 
> > +errseq_t errseq_sample(errseq_t *eseq)
> > +{
> > +       errseq_t old = READ_ONCE(*eseq);
> > +       errseq_t new = old;
> > +
> > +       /*
> > +        * For the common case of no errors ever having been set, we can skip
> > +        * marking the SEEN bit. Once an error has been set, the value will
> > +        * never go back to zero.
> > +        */
> > +       if (old != 0) {
> > +               new |= ERRSEQ_SEEN;
> > +               if (old != new)
> > +                       cmpxchg(eseq, old, new);
> > +       }
> > +       return new;
> > +}
> 
> Yes, a helper like this should solve the issue at hand. We are not
> interested in previous errors. This also sets the ERRSEQ_SEEN on 
> sample and it will also solve the other issue when after sampling
> if error gets seen, we don't want errseq_check() to return error.
> 
> Thinking of some possible names for new function.
> 
> errseq_sample_seen()
> errseq_sample_set_seen()
> errseq_sample_consume_unseen()
> errseq_sample_current()
> 
> Thanks
> Vivek
> 

I think we can just replace the code in super.c with:
ofs->upper_errseq = READ_ONCE(&upper_mnt_sb->s_wb_err);

And then add an errseq helper which checks:
int errseq_check_ignore_seen(errseq_t *eseq, errseq_t since)
{
	errseq_t cur = READ_ONCE(*eseq);

	if ((cur == since) || (cur == since | ERRSEQ_SEEN))
		return 0;

	return -(cur & MAX_ERRNO);
}

--- 

This extra (cur == since | ERRSEQ_SEEN) ignores the situation where cur has 
"been seen". We do not want to do the cmpxchg I think because that would hide 
the situation from the user where if they do a syncfs we hide the error from
the user. 

If the since had seen already set, but cur does not have seen set, it means
we've wrapped.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02 18:22       ` Jeff Layton
@ 2020-12-02 18:56         ` Vivek Goyal
  2020-12-02 19:03           ` Sargun Dhillon
  2020-12-02 19:26           ` Jeff Layton
  0 siblings, 2 replies; 28+ messages in thread
From: Vivek Goyal @ 2020-12-02 18:56 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Sargun Dhillon, Amir Goldstein, linux-fsdevel, linux-unionfs,
	Miklos Szeredi

On Wed, Dec 02, 2020 at 01:22:09PM -0500, Jeff Layton wrote:
> On Wed, 2020-12-02 at 12:29 -0500, Vivek Goyal wrote:
> > On Wed, Dec 02, 2020 at 12:02:43PM -0500, Jeff Layton wrote:
> > 
> > [..]
> > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > > index 290983bcfbb3..82a096a05bce 100644
> > > > > --- a/fs/overlayfs/super.c
> > > > > +++ b/fs/overlayfs/super.c
> > > > > @@ -261,11 +261,18 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > > >  	struct super_block *upper_sb;
> > > > >  	int ret;
> > > > >  
> > > > > 
> > > > > 
> > > > > 
> > > > > -	if (!ovl_upper_mnt(ofs))
> > > > > -		return 0;
> > > > > +	ret = ovl_check_sync(ofs);
> > > > > +	/*
> > > > > +	 * We have to always set the err, because the return value isn't
> > > > > +	 * checked, and instead VFS looks at the writeback errseq after
> > > > > +	 * this call.
> > > > > +	 */
> > > > > +	if (ret < 0)
> > > > > +		errseq_set(&sb->s_wb_err, ret);
> > > > 
> > > > I was wondering that why errseq_set() will result in returning error
> > > > all the time. Then realized that last syncfs() call must have set
> > > > ERRSEQ_SEEN flag and that will mean errseq_set() will increment
> > > > counter and that means this syncfs() will will return error too. Cool.
> > > > 
> > > > > +
> > > > > +	if (!ret)
> > > > > +		return ret;
> > > > >  
> > > > > 
> > > > > 
> > > > > 
> > > > > -	if (!ovl_should_sync(ofs))
> > > > > -		return 0;
> > > > >  	/*
> > > > >  	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
> > > > >  	 * All the super blocks will be iterated, including upper_sb.
> > > > > @@ -1927,6 +1934,8 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > > >  	sb->s_op = &ovl_super_operations;
> > > > >  
> > > > > 
> > > > > 
> > > > > 
> > > > >  	if (ofs->config.upperdir) {
> > > > > +		struct super_block *upper_mnt_sb;
> > > > > +
> > > > >  		if (!ofs->config.workdir) {
> > > > >  			pr_err("missing 'workdir'\n");
> > > > >  			goto out_err;
> > > > > @@ -1943,9 +1952,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > > >  		if (!ofs->workdir)
> > > > >  			sb->s_flags |= SB_RDONLY;
> > > > >  
> > > > > 
> > > > > 
> > > > > 
> > > > > -		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> > > > > -		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> > > > > -
> > > > > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > 
> > > > I asked this question in last email as well. errseq_sample() will return
> > > > 0 if current error has not been seen yet. That means next time a sync
> > > > call comes for volatile mount, it will return an error. But that's
> > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > existing error (seen/unseen), we don't care. We only care if there
> > > > is a new error after the volatile mount, right?
> > > > 
> > > > I guess we will need another helper similar to errseq_smaple() which
> > > > just returns existing value of errseq. And then we will have to
> > > > do something about errseq_check() to not return an error if "since"
> > > > and "eseq" differ only by "seen" bit.
> > > > 
> > > > Otherwise in current form, volatile mount will always return error
> > > > if upperdir has error and it has not been seen by anybody.
> > > > 
> > > > How did you finally end up testing the error case. Want to simualate
> > > > error aritificially and test it.
> > > > 
> > > 
> > > If you don't want to see errors that occurred before you did the mount,
> > > then you probably can just resurrect and rename the original version of
> > > errseq_sample. Something like this, but with a different name:
> > > 
> > > +errseq_t errseq_sample(errseq_t *eseq)
> > > +{
> > > +       errseq_t old = READ_ONCE(*eseq);
> > > +       errseq_t new = old;
> > > +
> > > +       /*
> > > +        * For the common case of no errors ever having been set, we can skip
> > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > +        * never go back to zero.
> > > +        */
> > > +       if (old != 0) {
> > > +               new |= ERRSEQ_SEEN;
> > > +               if (old != new)
> > > +                       cmpxchg(eseq, old, new);
> > > +       }
> > > +       return new;
> > > +}
> > 
> > Yes, a helper like this should solve the issue at hand. We are not
> > interested in previous errors. This also sets the ERRSEQ_SEEN on 
> > sample and it will also solve the other issue when after sampling
> > if error gets seen, we don't want errseq_check() to return error.
> > 
> > Thinking of some possible names for new function.
> > 
> > errseq_sample_seen()
> > errseq_sample_set_seen()
> > errseq_sample_consume_unseen()
> > errseq_sample_current()
> > 
> 
> errseq_sample_consume_unseen() sounds good, though maybe it should be
> "ignore_unseen"? IDK, naming this stuff is the hardest part.
> 
> If you don't want to add a new helper, I think you'd probably also be
> able to do something like this in fill_super:
> 
>     errseq_sample()
>     errseq_check_and_advance()
> 
> 
> ...and just ignore the error returned by the check and advance. At that
> point, the cursor should be caught up and any subsequent syncfs call
> should return 0 until you record another error. It's a little less
> efficient, but only slightly so.

This seems even better.

Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
sample. In our case, that would mean that we consumed an unseen error but
never reported it back to user space. And then somebody might complain.

This kind of reminds me posgresql's fsync issues where they did
writes using one fd and another thread opened another fd and
did sync and they expected any errors to be reported.

Similary what if an unseen error is present on superblock on upper
and if we mount volatile overlay and mark the error SEEN, then
if another process opens a file on upper and did syncfs(), it will
complain that exisiting error was not reported to it.

Overlay use case seems to be that we just want to check if an error
has happened on upper superblock since we sampled it and don't
want to consume that error as such. Will it make sense to introduce
two helpers for error sampling and error checking which mask the
SEEN bit and don't do anything with it. For example, following compile
tested only patch.

Now we will not touch SEEN bit at all. And even if SEEN gets set
since we sampled, errseq_check_mask_seen() will not flag it as
error.

Thanks
Vivek

---
 lib/errseq.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

Index: redhat-linux/lib/errseq.c
===================================================================
--- redhat-linux.orig/lib/errseq.c	2020-06-09 08:59:29.712836019 -0400
+++ redhat-linux/lib/errseq.c	2020-12-02 13:40:08.085775647 -0500
@@ -130,6 +130,12 @@ errseq_t errseq_sample(errseq_t *eseq)
 }
 EXPORT_SYMBOL(errseq_sample);
 
+errseq_t errseq_sample_mask_seen(errseq_t *eseq)
+{
+	return READ_ONCE(*eseq) & (~ERRSEQ_SEEN);
+}
+EXPORT_SYMBOL(errseq_sample_mask_seen);
+
 /**
  * errseq_check() - Has an error occurred since a particular sample point?
  * @eseq: Pointer to errseq_t value to be checked.
@@ -151,6 +157,17 @@ int errseq_check(errseq_t *eseq, errseq_
 }
 EXPORT_SYMBOL(errseq_check);
 
+int errseq_check_mask_seen(errseq_t *eseq, errseq_t since)
+{
+	errseq_t cur = READ_ONCE(*eseq) & (~ERRSEQ_SEEN);
+
+	since &= ~ERRSEQ_SEEN;
+	if (likely(cur == since))
+		return 0;
+	return -(cur & MAX_ERRNO);
+}
+EXPORT_SYMBOL(errseq_check_mask_seen);
+
 /**
  * errseq_check_and_advance() - Check an errseq_t and advance to current value.
  * @eseq: Pointer to value being checked and reported.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02 18:56         ` Vivek Goyal
@ 2020-12-02 19:03           ` Sargun Dhillon
  2020-12-02 19:26           ` Jeff Layton
  1 sibling, 0 replies; 28+ messages in thread
From: Sargun Dhillon @ 2020-12-02 19:03 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jeff Layton, Amir Goldstein, linux-fsdevel, linux-unionfs,
	Miklos Szeredi

On Wed, Dec 02, 2020 at 01:56:01PM -0500, Vivek Goyal wrote:
> On Wed, Dec 02, 2020 at 01:22:09PM -0500, Jeff Layton wrote:
> > On Wed, 2020-12-02 at 12:29 -0500, Vivek Goyal wrote:
> > > On Wed, Dec 02, 2020 at 12:02:43PM -0500, Jeff Layton wrote:
> > > 
> > > [..]
> > > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > > > index 290983bcfbb3..82a096a05bce 100644
> > > > > > --- a/fs/overlayfs/super.c
> > > > > > +++ b/fs/overlayfs/super.c
> > > > > > @@ -261,11 +261,18 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > > > >  	struct super_block *upper_sb;
> > > > > >  	int ret;
> > > > > >  
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > -	if (!ovl_upper_mnt(ofs))
> > > > > > -		return 0;
> > > > > > +	ret = ovl_check_sync(ofs);
> > > > > > +	/*
> > > > > > +	 * We have to always set the err, because the return value isn't
> > > > > > +	 * checked, and instead VFS looks at the writeback errseq after
> > > > > > +	 * this call.
> > > > > > +	 */
> > > > > > +	if (ret < 0)
> > > > > > +		errseq_set(&sb->s_wb_err, ret);
> > > > > 
> > > > > I was wondering that why errseq_set() will result in returning error
> > > > > all the time. Then realized that last syncfs() call must have set
> > > > > ERRSEQ_SEEN flag and that will mean errseq_set() will increment
> > > > > counter and that means this syncfs() will will return error too. Cool.
> > > > > 
> > > > > > +
> > > > > > +	if (!ret)
> > > > > > +		return ret;
> > > > > >  
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > -	if (!ovl_should_sync(ofs))
> > > > > > -		return 0;
> > > > > >  	/*
> > > > > >  	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
> > > > > >  	 * All the super blocks will be iterated, including upper_sb.
> > > > > > @@ -1927,6 +1934,8 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > > > >  	sb->s_op = &ovl_super_operations;
> > > > > >  
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > >  	if (ofs->config.upperdir) {
> > > > > > +		struct super_block *upper_mnt_sb;
> > > > > > +
> > > > > >  		if (!ofs->config.workdir) {
> > > > > >  			pr_err("missing 'workdir'\n");
> > > > > >  			goto out_err;
> > > > > > @@ -1943,9 +1952,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > > > >  		if (!ofs->workdir)
> > > > > >  			sb->s_flags |= SB_RDONLY;
> > > > > >  
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > -		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> > > > > > -		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> > > > > > -
> > > > > > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > 
> > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > call comes for volatile mount, it will return an error. But that's
> > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > is a new error after the volatile mount, right?
> > > > > 
> > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > just returns existing value of errseq. And then we will have to
> > > > > do something about errseq_check() to not return an error if "since"
> > > > > and "eseq" differ only by "seen" bit.
> > > > > 
> > > > > Otherwise in current form, volatile mount will always return error
> > > > > if upperdir has error and it has not been seen by anybody.
> > > > > 
> > > > > How did you finally end up testing the error case. Want to simualate
> > > > > error aritificially and test it.
> > > > > 
> > > > 
> > > > If you don't want to see errors that occurred before you did the mount,
> > > > then you probably can just resurrect and rename the original version of
> > > > errseq_sample. Something like this, but with a different name:
> > > > 
> > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > +{
> > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > +       errseq_t new = old;
> > > > +
> > > > +       /*
> > > > +        * For the common case of no errors ever having been set, we can skip
> > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > +        * never go back to zero.
> > > > +        */
> > > > +       if (old != 0) {
> > > > +               new |= ERRSEQ_SEEN;
> > > > +               if (old != new)
> > > > +                       cmpxchg(eseq, old, new);
> > > > +       }
> > > > +       return new;
> > > > +}
> > > 
> > > Yes, a helper like this should solve the issue at hand. We are not
> > > interested in previous errors. This also sets the ERRSEQ_SEEN on 
> > > sample and it will also solve the other issue when after sampling
> > > if error gets seen, we don't want errseq_check() to return error.
> > > 
> > > Thinking of some possible names for new function.
> > > 
> > > errseq_sample_seen()
> > > errseq_sample_set_seen()
> > > errseq_sample_consume_unseen()
> > > errseq_sample_current()
> > > 
> > 
> > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > 
> > If you don't want to add a new helper, I think you'd probably also be
> > able to do something like this in fill_super:
> > 
> >     errseq_sample()
> >     errseq_check_and_advance()
> > 
> > 
> > ...and just ignore the error returned by the check and advance. At that
> > point, the cursor should be caught up and any subsequent syncfs call
> > should return 0 until you record another error. It's a little less
> > efficient, but only slightly so.
> 
> This seems even better.
> 
> Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> sample. In our case, that would mean that we consumed an unseen error but
> never reported it back to user space. And then somebody might complain.
> 
> This kind of reminds me posgresql's fsync issues where they did
> writes using one fd and another thread opened another fd and
> did sync and they expected any errors to be reported.
> 
> Similary what if an unseen error is present on superblock on upper
> and if we mount volatile overlay and mark the error SEEN, then
> if another process opens a file on upper and did syncfs(), it will
> complain that exisiting error was not reported to it.
> 
> Overlay use case seems to be that we just want to check if an error
> has happened on upper superblock since we sampled it and don't
> want to consume that error as such. Will it make sense to introduce
> two helpers for error sampling and error checking which mask the
> SEEN bit and don't do anything with it. For example, following compile
> tested only patch.
> 
> Now we will not touch SEEN bit at all. And even if SEEN gets set
> since we sampled, errseq_check_mask_seen() will not flag it as
> error.
> 
> Thanks
> Vivek
> 
> ---
>  lib/errseq.c |   17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> Index: redhat-linux/lib/errseq.c
> ===================================================================
> --- redhat-linux.orig/lib/errseq.c	2020-06-09 08:59:29.712836019 -0400
> +++ redhat-linux/lib/errseq.c	2020-12-02 13:40:08.085775647 -0500
> @@ -130,6 +130,12 @@ errseq_t errseq_sample(errseq_t *eseq)
>  }
>  EXPORT_SYMBOL(errseq_sample);
>  
> +errseq_t errseq_sample_mask_seen(errseq_t *eseq)
> +{
> +	return READ_ONCE(*eseq) & (~ERRSEQ_SEEN);
> +}
> +EXPORT_SYMBOL(errseq_sample_mask_seen);
> +
If below, we're doing since &= ~ERRSEQ_SEEN;, I see no reason
to remove it here, and just not use READ_ONCE directly.

>  /**
>   * errseq_check() - Has an error occurred since a particular sample point?
>   * @eseq: Pointer to errseq_t value to be checked.
> @@ -151,6 +157,17 @@ int errseq_check(errseq_t *eseq, errseq_
>  }
>  EXPORT_SYMBOL(errseq_check);
>  
> +int errseq_check_mask_seen(errseq_t *eseq, errseq_t since)
> +{
> +	errseq_t cur = READ_ONCE(*eseq) & (~ERRSEQ_SEEN);
> +
> +	since &= ~ERRSEQ_SEEN;
> +	if (likely(cur == since))
> +		return 0;
> +	return -(cur & MAX_ERRNO);
> +}
> +EXPORT_SYMBOL(errseq_check_mask_seen);
> +
This ignores the wrapping case, where cur has SEEN not set on it,
but since does.

>  /**
>   * errseq_check_and_advance() - Check an errseq_t and advance to current value.
>   * @eseq: Pointer to value being checked and reported.
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02 18:49       ` Sargun Dhillon
@ 2020-12-02 19:10         ` Jeff Layton
  2020-12-03 10:36         ` Amir Goldstein
  1 sibling, 0 replies; 28+ messages in thread
From: Jeff Layton @ 2020-12-02 19:10 UTC (permalink / raw)
  To: Sargun Dhillon, Vivek Goyal
  Cc: Amir Goldstein, linux-fsdevel, linux-unionfs, Miklos Szeredi

On Wed, 2020-12-02 at 18:49 +0000, Sargun Dhillon wrote:
> On Wed, Dec 02, 2020 at 12:29:06PM -0500, Vivek Goyal wrote:
> > On Wed, Dec 02, 2020 at 12:02:43PM -0500, Jeff Layton wrote:
> > 
> > [..]
> > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > > index 290983bcfbb3..82a096a05bce 100644
> > > > > --- a/fs/overlayfs/super.c
> > > > > +++ b/fs/overlayfs/super.c
> > > > > @@ -261,11 +261,18 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > > >  	struct super_block *upper_sb;
> > > > >  	int ret;
> > > > >  
> > > > > 
> > > > > 
> > > > > 
> > > > > -	if (!ovl_upper_mnt(ofs))
> > > > > -		return 0;
> > > > > +	ret = ovl_check_sync(ofs);
> > > > > +	/*
> > > > > +	 * We have to always set the err, because the return value isn't
> > > > > +	 * checked, and instead VFS looks at the writeback errseq after
> > > > > +	 * this call.
> > > > > +	 */
> > > > > +	if (ret < 0)
> > > > > +		errseq_set(&sb->s_wb_err, ret);
> > > > 
> > > > I was wondering that why errseq_set() will result in returning error
> > > > all the time. Then realized that last syncfs() call must have set
> > > > ERRSEQ_SEEN flag and that will mean errseq_set() will increment
> > > > counter and that means this syncfs() will will return error too. Cool.
> > > > 
> > > > > +
> > > > > +	if (!ret)
> > > > > +		return ret;
> > > > >  
> > > > > 
> > > > > 
> > > > > 
> > > > > -	if (!ovl_should_sync(ofs))
> > > > > -		return 0;
> > > > >  	/*
> > > > >  	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
> > > > >  	 * All the super blocks will be iterated, including upper_sb.
> > > > > @@ -1927,6 +1934,8 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > > >  	sb->s_op = &ovl_super_operations;
> > > > >  
> > > > > 
> > > > > 
> > > > > 
> > > > >  	if (ofs->config.upperdir) {
> > > > > +		struct super_block *upper_mnt_sb;
> > > > > +
> > > > >  		if (!ofs->config.workdir) {
> > > > >  			pr_err("missing 'workdir'\n");
> > > > >  			goto out_err;
> > > > > @@ -1943,9 +1952,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > > >  		if (!ofs->workdir)
> > > > >  			sb->s_flags |= SB_RDONLY;
> > > > >  
> > > > > 
> > > > > 
> > > > > 
> > > > > -		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> > > > > -		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> > > > > -
> > > > > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > 
> > > > I asked this question in last email as well. errseq_sample() will return
> > > > 0 if current error has not been seen yet. That means next time a sync
> > > > call comes for volatile mount, it will return an error. But that's
> > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > existing error (seen/unseen), we don't care. We only care if there
> > > > is a new error after the volatile mount, right?
> > > > 
> > > > I guess we will need another helper similar to errseq_smaple() which
> > > > just returns existing value of errseq. And then we will have to
> > > > do something about errseq_check() to not return an error if "since"
> > > > and "eseq" differ only by "seen" bit.
> > > > 
> > > > Otherwise in current form, volatile mount will always return error
> > > > if upperdir has error and it has not been seen by anybody.
> > > > 
> > > > How did you finally end up testing the error case. Want to simualate
> > > > error aritificially and test it.
> > > > 
> 
> I used the blockdev error injection layer. It only works with ext2, because
> ext4 (and other filesystems) will error and go into readonly.
> 
> dd if=/dev/zero of=/tmp/loop bs=1M count=100
> losetup /dev/loop8 /tmp/loop 
> mkfs.ext2 /dev/loop8
> mount -o errors=continue /dev/loop8 /mnt/loop/
> mkdir -p /mnt/loop/{upperdir,workdir}
> mount -t overlay -o volatile,index=off,lowerdir=/root/lowerdir,upperdir=/mnt/loop/upperdir,workdir=/mnt/loop/workdir none /mnt/foo/
> echo 1 > /sys/block/loop8/make-it-fail
> echo 100 > /sys/kernel/debug/fail_make_request/probability
> echo 1 > /sys/kernel/debug/fail_make_request/times
> dd if=/dev/zero of=/mnt/foo/zero bs=1M count=1
> sync
> 
> I tried to get XFS tests working, but I was unable to get a simpler repro than 
> above. This is also easy enough to do with a simple kernel module. Maybe it'd be 
> neat to be able to inject in errseq increments via the fault injection API one 
> day? I have no idea what the VFS's approach here is.
> 
> > > 
> > > If you don't want to see errors that occurred before you did the mount,
> > > then you probably can just resurrect and rename the original version of
> > > errseq_sample. Something like this, but with a different name:
> > > 
> > > +errseq_t errseq_sample(errseq_t *eseq)
> > > +{
> > > +       errseq_t old = READ_ONCE(*eseq);
> > > +       errseq_t new = old;
> > > +
> > > +       /*
> > > +        * For the common case of no errors ever having been set, we can skip
> > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > +        * never go back to zero.
> > > +        */
> > > +       if (old != 0) {
> > > +               new |= ERRSEQ_SEEN;
> > > +               if (old != new)
> > > +                       cmpxchg(eseq, old, new);
> > > +       }
> > > +       return new;
> > > +}
> > 
> > Yes, a helper like this should solve the issue at hand. We are not
> > interested in previous errors. This also sets the ERRSEQ_SEEN on 
> > sample and it will also solve the other issue when after sampling
> > if error gets seen, we don't want errseq_check() to return error.
> > 
> > Thinking of some possible names for new function.
> > 
> > errseq_sample_seen()
> > errseq_sample_set_seen()
> > errseq_sample_consume_unseen()
> > errseq_sample_current()
> > 
> > Thanks
> > Vivek
> > 
> 
> I think we can just replace the code in super.c with:
> ofs->upper_errseq = READ_ONCE(&upper_mnt_sb->s_wb_err);
> 
> And then add an errseq helper which checks:
> int errseq_check_ignore_seen(errseq_t *eseq, errseq_t since)
> {
> 	errseq_t cur = READ_ONCE(*eseq);
> 
> 	if ((cur == since) || (cur == since | ERRSEQ_SEEN))
> 		return 0;
> 
> 	return -(cur & MAX_ERRNO);
> }
> 
> --- 
> 
> This extra (cur == since | ERRSEQ_SEEN) ignores the situation where cur has 
> "been seen". We do not want to do the cmpxchg I think because that would hide 
> the situation from the user where if they do a syncfs we hide the error from
> the user. 
> 

> If the since had seen already set, but cur does not have seen set, it
> means we've wrapped.
> 

That looks wrong to me. If you don't mark the SEEN bit when sampling you
can't be sure that you'll see errors that occur after your sample.
Nothing will change if nothing has seen it and the error is the same as
the last one. You can't really hold an unseen error "in reserve" in the
hopes that someone will scrape it later.

Also, you'd only be hiding the error from syncfs callers that did not
have the file open prior to the error occurring. See commit
b4678df184b314 (errseq: Always report a writeback error once) for the
rationale. I'm not inclined to be too sympathetic here about reporting
these errors anyway, given that they were only visible very recently
anyway.


-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02 18:56         ` Vivek Goyal
  2020-12-02 19:03           ` Sargun Dhillon
@ 2020-12-02 19:26           ` Jeff Layton
  2020-12-02 21:34             ` Vivek Goyal
  1 sibling, 1 reply; 28+ messages in thread
From: Jeff Layton @ 2020-12-02 19:26 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Sargun Dhillon, Amir Goldstein, linux-fsdevel, linux-unionfs,
	Miklos Szeredi

On Wed, 2020-12-02 at 13:56 -0500, Vivek Goyal wrote:
> On Wed, Dec 02, 2020 at 01:22:09PM -0500, Jeff Layton wrote:
> > On Wed, 2020-12-02 at 12:29 -0500, Vivek Goyal wrote:
> > > On Wed, Dec 02, 2020 at 12:02:43PM -0500, Jeff Layton wrote:
> > > 
> > > [..]
> > > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > > > index 290983bcfbb3..82a096a05bce 100644
> > > > > > --- a/fs/overlayfs/super.c
> > > > > > +++ b/fs/overlayfs/super.c
> > > > > > @@ -261,11 +261,18 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > > > >  	struct super_block *upper_sb;
> > > > > >  	int ret;
> > > > > >  
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > -	if (!ovl_upper_mnt(ofs))
> > > > > > -		return 0;
> > > > > > +	ret = ovl_check_sync(ofs);
> > > > > > +	/*
> > > > > > +	 * We have to always set the err, because the return value isn't
> > > > > > +	 * checked, and instead VFS looks at the writeback errseq after
> > > > > > +	 * this call.
> > > > > > +	 */
> > > > > > +	if (ret < 0)
> > > > > > +		errseq_set(&sb->s_wb_err, ret);
> > > > > 
> > > > > I was wondering that why errseq_set() will result in returning error
> > > > > all the time. Then realized that last syncfs() call must have set
> > > > > ERRSEQ_SEEN flag and that will mean errseq_set() will increment
> > > > > counter and that means this syncfs() will will return error too. Cool.
> > > > > 
> > > > > > +
> > > > > > +	if (!ret)
> > > > > > +		return ret;
> > > > > >  
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > -	if (!ovl_should_sync(ofs))
> > > > > > -		return 0;
> > > > > >  	/*
> > > > > >  	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
> > > > > >  	 * All the super blocks will be iterated, including upper_sb.
> > > > > > @@ -1927,6 +1934,8 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > > > >  	sb->s_op = &ovl_super_operations;
> > > > > >  
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > >  	if (ofs->config.upperdir) {
> > > > > > +		struct super_block *upper_mnt_sb;
> > > > > > +
> > > > > >  		if (!ofs->config.workdir) {
> > > > > >  			pr_err("missing 'workdir'\n");
> > > > > >  			goto out_err;
> > > > > > @@ -1943,9 +1952,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > > > >  		if (!ofs->workdir)
> > > > > >  			sb->s_flags |= SB_RDONLY;
> > > > > >  
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > -		sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> > > > > > -		sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> > > > > > -
> > > > > > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > 
> > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > call comes for volatile mount, it will return an error. But that's
> > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > is a new error after the volatile mount, right?
> > > > > 
> > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > just returns existing value of errseq. And then we will have to
> > > > > do something about errseq_check() to not return an error if "since"
> > > > > and "eseq" differ only by "seen" bit.
> > > > > 
> > > > > Otherwise in current form, volatile mount will always return error
> > > > > if upperdir has error and it has not been seen by anybody.
> > > > > 
> > > > > How did you finally end up testing the error case. Want to simualate
> > > > > error aritificially and test it.
> > > > > 
> > > > 
> > > > If you don't want to see errors that occurred before you did the mount,
> > > > then you probably can just resurrect and rename the original version of
> > > > errseq_sample. Something like this, but with a different name:
> > > > 
> > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > +{
> > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > +       errseq_t new = old;
> > > > +
> > > > +       /*
> > > > +        * For the common case of no errors ever having been set, we can skip
> > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > +        * never go back to zero.
> > > > +        */
> > > > +       if (old != 0) {
> > > > +               new |= ERRSEQ_SEEN;
> > > > +               if (old != new)
> > > > +                       cmpxchg(eseq, old, new);
> > > > +       }
> > > > +       return new;
> > > > +}
> > > 
> > > Yes, a helper like this should solve the issue at hand. We are not
> > > interested in previous errors. This also sets the ERRSEQ_SEEN on 
> > > sample and it will also solve the other issue when after sampling
> > > if error gets seen, we don't want errseq_check() to return error.
> > > 
> > > Thinking of some possible names for new function.
> > > 
> > > errseq_sample_seen()
> > > errseq_sample_set_seen()
> > > errseq_sample_consume_unseen()
> > > errseq_sample_current()
> > > 
> > 
> > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > 
> > If you don't want to add a new helper, I think you'd probably also be
> > able to do something like this in fill_super:
> > 
> >     errseq_sample()
> >     errseq_check_and_advance()
> > 
> > 
> > ...and just ignore the error returned by the check and advance. At that
> > point, the cursor should be caught up and any subsequent syncfs call
> > should return 0 until you record another error. It's a little less
> > efficient, but only slightly so.
> 
> This seems even better.
> 
> Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> sample. In our case, that would mean that we consumed an unseen error but
> never reported it back to user space. And then somebody might complain.
> 
> This kind of reminds me posgresql's fsync issues where they did
> writes using one fd and another thread opened another fd and
> did sync and they expected any errors to be reported.
> 

> Similary what if an unseen error is present on superblock on upper
> and if we mount volatile overlay and mark the error SEEN, then
> if another process opens a file on upper and did syncfs(), it will
> complain that exisiting error was not reported to it.
> 
> Overlay use case seems to be that we just want to check if an error
> has happened on upper superblock since we sampled it and don't
> want to consume that error as such. Will it make sense to introduce
> two helpers for error sampling and error checking which mask the
> SEEN bit and don't do anything with it. For example, following compile
> tested only patch.
> 
> Now we will not touch SEEN bit at all. And even if SEEN gets set
> since we sampled, errseq_check_mask_seen() will not flag it as
> error.
> 
> Thanks
> Vivek
> 

Again, you're not really hiding this from anyone doing something _sane_.
You're only hiding an error from someone who opens the file after an
error occurs and expects to see an error.

That was the behavior for fsync before we switched to errseq_t, and we
had to change errseq_sample for applications that relied on that. syncfs
reporting these errors is pretty new however. I don't think we
necessarily need to make the same guarantees there.

The solution to all of these problems is to ensure that you open the
files early you're issuing syncfs on and keep them open. Then you'll
always see any subsequent errors.

> ---
>  lib/errseq.c |   17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> Index: redhat-linux/lib/errseq.c
> ===================================================================
> --- redhat-linux.orig/lib/errseq.c	2020-06-09 08:59:29.712836019 -0400
> +++ redhat-linux/lib/errseq.c	2020-12-02 13:40:08.085775647 -0500
> @@ -130,6 +130,12 @@ errseq_t errseq_sample(errseq_t *eseq)
>  }
>  EXPORT_SYMBOL(errseq_sample);
>  
> +errseq_t errseq_sample_mask_seen(errseq_t *eseq)
> +{
> +	return READ_ONCE(*eseq) & (~ERRSEQ_SEEN);
> +}
> +EXPORT_SYMBOL(errseq_sample_mask_seen);
> +
>  /**
>   * errseq_check() - Has an error occurred since a particular sample point?
>   * @eseq: Pointer to errseq_t value to be checked.
> @@ -151,6 +157,17 @@ int errseq_check(errseq_t *eseq, errseq_
>  }
>  EXPORT_SYMBOL(errseq_check);
>  
> +int errseq_check_mask_seen(errseq_t *eseq, errseq_t since)
> +{
> +	errseq_t cur = READ_ONCE(*eseq) & (~ERRSEQ_SEEN);
> +
> +	since &= ~ERRSEQ_SEEN;
> +	if (likely(cur == since))
> +		return 0;
> +	return -(cur & MAX_ERRNO);
> +}
> +EXPORT_SYMBOL(errseq_check_mask_seen);
> +
> /**
>   * errseq_check_and_advance() - Check an errseq_t and advance to current value.
>   * @eseq: Pointer to value being checked and reported.
> 

NAK. If you do that, then you may not see an error that happens after
your mount occurred. If nothing sets the SEEN bit, then subsequent
occurrences of the same error will not be recorded. See the logic in
errseq_set().

-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02 19:26           ` Jeff Layton
@ 2020-12-02 21:34             ` Vivek Goyal
  2020-12-02 21:52               ` Jeff Layton
  0 siblings, 1 reply; 28+ messages in thread
From: Vivek Goyal @ 2020-12-02 21:34 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Sargun Dhillon, Amir Goldstein, linux-fsdevel, linux-unionfs,
	Miklos Szeredi

On Wed, Dec 02, 2020 at 02:26:23PM -0500, Jeff Layton wrote:
[..]
> > > > > > > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > > 
> > > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > > call comes for volatile mount, it will return an error. But that's
> > > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > > is a new error after the volatile mount, right?
> > > > > > 
> > > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > > just returns existing value of errseq. And then we will have to
> > > > > > do something about errseq_check() to not return an error if "since"
> > > > > > and "eseq" differ only by "seen" bit.
> > > > > > 
> > > > > > Otherwise in current form, volatile mount will always return error
> > > > > > if upperdir has error and it has not been seen by anybody.
> > > > > > 
> > > > > > How did you finally end up testing the error case. Want to simualate
> > > > > > error aritificially and test it.
> > > > > > 
> > > > > 
> > > > > If you don't want to see errors that occurred before you did the mount,
> > > > > then you probably can just resurrect and rename the original version of
> > > > > errseq_sample. Something like this, but with a different name:
> > > > > 
> > > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > > +{
> > > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > > +       errseq_t new = old;
> > > > > +
> > > > > +       /*
> > > > > +        * For the common case of no errors ever having been set, we can skip
> > > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > > +        * never go back to zero.
> > > > > +        */
> > > > > +       if (old != 0) {
> > > > > +               new |= ERRSEQ_SEEN;
> > > > > +               if (old != new)
> > > > > +                       cmpxchg(eseq, old, new);
> > > > > +       }
> > > > > +       return new;
> > > > > +}
> > > > 
> > > > Yes, a helper like this should solve the issue at hand. We are not
> > > > interested in previous errors. This also sets the ERRSEQ_SEEN on 
> > > > sample and it will also solve the other issue when after sampling
> > > > if error gets seen, we don't want errseq_check() to return error.
> > > > 
> > > > Thinking of some possible names for new function.
> > > > 
> > > > errseq_sample_seen()
> > > > errseq_sample_set_seen()
> > > > errseq_sample_consume_unseen()
> > > > errseq_sample_current()
> > > > 
> > > 
> > > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > > 
> > > If you don't want to add a new helper, I think you'd probably also be
> > > able to do something like this in fill_super:
> > > 
> > >     errseq_sample()
> > >     errseq_check_and_advance()
> > > 
> > > 
> > > ...and just ignore the error returned by the check and advance. At that
> > > point, the cursor should be caught up and any subsequent syncfs call
> > > should return 0 until you record another error. It's a little less
> > > efficient, but only slightly so.
> > 
> > This seems even better.
> > 
> > Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> > sample. In our case, that would mean that we consumed an unseen error but
> > never reported it back to user space. And then somebody might complain.
> > 
> > This kind of reminds me posgresql's fsync issues where they did
> > writes using one fd and another thread opened another fd and
> > did sync and they expected any errors to be reported.
> > 
> 
> > Similary what if an unseen error is present on superblock on upper
> > and if we mount volatile overlay and mark the error SEEN, then
> > if another process opens a file on upper and did syncfs(), it will
> > complain that exisiting error was not reported to it.
> > 
> > Overlay use case seems to be that we just want to check if an error
> > has happened on upper superblock since we sampled it and don't
> > want to consume that error as such. Will it make sense to introduce
> > two helpers for error sampling and error checking which mask the
> > SEEN bit and don't do anything with it. For example, following compile
> > tested only patch.
> > 
> > Now we will not touch SEEN bit at all. And even if SEEN gets set
> > since we sampled, errseq_check_mask_seen() will not flag it as
> > error.
> > 
> > Thanks
> > Vivek
> > 
> 
> Again, you're not really hiding this from anyone doing something _sane_.
> You're only hiding an error from someone who opens the file after an
> error occurs and expects to see an error.
> 
> That was the behavior for fsync before we switched to errseq_t, and we
> had to change errseq_sample for applications that relied on that. syncfs
> reporting these errors is pretty new however. I don't think we
> necessarily need to make the same guarantees there.
> 
> The solution to all of these problems is to ensure that you open the
> files early you're issuing syncfs on and keep them open. Then you'll
> always see any subsequent errors.

Ok. I guess we will have to set SEEN bit during error_sample otherwise,
we miss errors. I had missed this point.

So mounting a volatile overlay instance will become somewhat
equivalent of as if somebody did a syncfs on upper, consumed
error and did not do anything about it.

If a user cares about not losing such errors, they need to keep an
fd open on upper. 

/me hopes that this does not become an issue for somebody. Even
if it does, one workaround can be don't do volatile overlay or
don't share overlay upper with other conflicting workload.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02 21:34             ` Vivek Goyal
@ 2020-12-02 21:52               ` Jeff Layton
  2020-12-03 10:42                 ` Sargun Dhillon
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Layton @ 2020-12-02 21:52 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Sargun Dhillon, Amir Goldstein, linux-fsdevel, linux-unionfs,
	Miklos Szeredi

On Wed, 2020-12-02 at 16:34 -0500, Vivek Goyal wrote:
> On Wed, Dec 02, 2020 at 02:26:23PM -0500, Jeff Layton wrote:
> [..]
> > > > > > > > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > > > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > > > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > > > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > > > 
> > > > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > > > call comes for volatile mount, it will return an error. But that's
> > > > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > > > is a new error after the volatile mount, right?
> > > > > > > 
> > > > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > > > just returns existing value of errseq. And then we will have to
> > > > > > > do something about errseq_check() to not return an error if "since"
> > > > > > > and "eseq" differ only by "seen" bit.
> > > > > > > 
> > > > > > > Otherwise in current form, volatile mount will always return error
> > > > > > > if upperdir has error and it has not been seen by anybody.
> > > > > > > 
> > > > > > > How did you finally end up testing the error case. Want to simualate
> > > > > > > error aritificially and test it.
> > > > > > > 
> > > > > > 
> > > > > > If you don't want to see errors that occurred before you did the mount,
> > > > > > then you probably can just resurrect and rename the original version of
> > > > > > errseq_sample. Something like this, but with a different name:
> > > > > > 
> > > > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > > > +{
> > > > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > > > +       errseq_t new = old;
> > > > > > +
> > > > > > +       /*
> > > > > > +        * For the common case of no errors ever having been set, we can skip
> > > > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > > > +        * never go back to zero.
> > > > > > +        */
> > > > > > +       if (old != 0) {
> > > > > > +               new |= ERRSEQ_SEEN;
> > > > > > +               if (old != new)
> > > > > > +                       cmpxchg(eseq, old, new);
> > > > > > +       }
> > > > > > +       return new;
> > > > > > +}
> > > > > 
> > > > > Yes, a helper like this should solve the issue at hand. We are not
> > > > > interested in previous errors. This also sets the ERRSEQ_SEEN on 
> > > > > sample and it will also solve the other issue when after sampling
> > > > > if error gets seen, we don't want errseq_check() to return error.
> > > > > 
> > > > > Thinking of some possible names for new function.
> > > > > 
> > > > > errseq_sample_seen()
> > > > > errseq_sample_set_seen()
> > > > > errseq_sample_consume_unseen()
> > > > > errseq_sample_current()
> > > > > 
> > > > 
> > > > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > > > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > > > 
> > > > If you don't want to add a new helper, I think you'd probably also be
> > > > able to do something like this in fill_super:
> > > > 
> > > >     errseq_sample()
> > > >     errseq_check_and_advance()
> > > > 
> > > > 
> > > > ...and just ignore the error returned by the check and advance. At that
> > > > point, the cursor should be caught up and any subsequent syncfs call
> > > > should return 0 until you record another error. It's a little less
> > > > efficient, but only slightly so.
> > > 
> > > This seems even better.
> > > 
> > > Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> > > sample. In our case, that would mean that we consumed an unseen error but
> > > never reported it back to user space. And then somebody might complain.
> > > 
> > > This kind of reminds me posgresql's fsync issues where they did
> > > writes using one fd and another thread opened another fd and
> > > did sync and they expected any errors to be reported.
> > > 
> > 
> > > Similary what if an unseen error is present on superblock on upper
> > > and if we mount volatile overlay and mark the error SEEN, then
> > > if another process opens a file on upper and did syncfs(), it will
> > > complain that exisiting error was not reported to it.
> > > 
> > > Overlay use case seems to be that we just want to check if an error
> > > has happened on upper superblock since we sampled it and don't
> > > want to consume that error as such. Will it make sense to introduce
> > > two helpers for error sampling and error checking which mask the
> > > SEEN bit and don't do anything with it. For example, following compile
> > > tested only patch.
> > > 
> > > Now we will not touch SEEN bit at all. And even if SEEN gets set
> > > since we sampled, errseq_check_mask_seen() will not flag it as
> > > error.
> > > 
> > > Thanks
> > > Vivek
> > > 
> > 
> > Again, you're not really hiding this from anyone doing something _sane_.
> > You're only hiding an error from someone who opens the file after an
> > error occurs and expects to see an error.
> > 
> > That was the behavior for fsync before we switched to errseq_t, and we
> > had to change errseq_sample for applications that relied on that. syncfs
> > reporting these errors is pretty new however. I don't think we
> > necessarily need to make the same guarantees there.
> > 
> > The solution to all of these problems is to ensure that you open the
> > files early you're issuing syncfs on and keep them open. Then you'll
> > always see any subsequent errors.
> 
> Ok. I guess we will have to set SEEN bit during error_sample otherwise,
> we miss errors. I had missed this point.
> 
> So mounting a volatile overlay instance will become somewhat
> equivalent of as if somebody did a syncfs on upper, consumed
> error and did not do anything about it.
> 
> If a user cares about not losing such errors, they need to keep an
> fd open on upper. 
> 
> /me hopes that this does not become an issue for somebody. Even
> if it does, one workaround can be don't do volatile overlay or
> don't share overlay upper with other conflicting workload.
> 

Yeah, there are limits to what we can do with 32 bits.

It's not pretty, but I guess you could pr_warn at mount time if you find
an unseen error. That would at least not completely drop it on the
floor.

-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02 18:49       ` Sargun Dhillon
  2020-12-02 19:10         ` Jeff Layton
@ 2020-12-03 10:36         ` Amir Goldstein
  1 sibling, 0 replies; 28+ messages in thread
From: Amir Goldstein @ 2020-12-03 10:36 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Vivek Goyal, Jeff Layton, linux-fsdevel, overlayfs, Miklos Szeredi

> > > > How did you finally end up testing the error case. Want to simualate
> > > > error aritificially and test it.
> > > >
>
> I used the blockdev error injection layer. It only works with ext2, because
> ext4 (and other filesystems) will error and go into readonly.
>
> dd if=/dev/zero of=/tmp/loop bs=1M count=100
> losetup /dev/loop8 /tmp/loop
> mkfs.ext2 /dev/loop8
> mount -o errors=continue /dev/loop8 /mnt/loop/
> mkdir -p /mnt/loop/{upperdir,workdir}
> mount -t overlay -o volatile,index=off,lowerdir=/root/lowerdir,upperdir=/mnt/loop/upperdir,workdir=/mnt/loop/workdir none /mnt/foo/
> echo 1 > /sys/block/loop8/make-it-fail
> echo 100 > /sys/kernel/debug/fail_make_request/probability
> echo 1 > /sys/kernel/debug/fail_make_request/times
> dd if=/dev/zero of=/mnt/foo/zero bs=1M count=1
> sync
>
> I tried to get XFS tests working, but I was unable to get a simpler repro than
> above. This is also easy enough to do with a simple kernel module. Maybe it'd be
> neat to be able to inject in errseq increments via the fault injection API one
> day? I have no idea what the VFS's approach here is.
>

Here you go. A simple xfstest:

https://github.com/amir73il/xfstests/commits/ovl-volatile

It passes with check -overlay on xfs/ext4 and it fails if you uncomment
the -o volatile mount option

Just move it to the tests/overlay/ tests and add syncfs (I wasn't sure
about how to do that).

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-02 21:52               ` Jeff Layton
@ 2020-12-03 10:42                 ` Sargun Dhillon
  2020-12-03 12:06                   ` Jeff Layton
  2020-12-03 14:27                   ` Vivek Goyal
  0 siblings, 2 replies; 28+ messages in thread
From: Sargun Dhillon @ 2020-12-03 10:42 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Vivek Goyal, Amir Goldstein, linux-fsdevel, linux-unionfs,
	Miklos Szeredi

On Wed, Dec 02, 2020 at 04:52:33PM -0500, Jeff Layton wrote:
> On Wed, 2020-12-02 at 16:34 -0500, Vivek Goyal wrote:
> > On Wed, Dec 02, 2020 at 02:26:23PM -0500, Jeff Layton wrote:
> > [..]
> > > > > > > > > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > > > > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > > > > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > > > > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > > > > 
> > > > > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > > > > call comes for volatile mount, it will return an error. But that's
> > > > > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > > > > is a new error after the volatile mount, right?
> > > > > > > > 
> > > > > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > > > > just returns existing value of errseq. And then we will have to
> > > > > > > > do something about errseq_check() to not return an error if "since"
> > > > > > > > and "eseq" differ only by "seen" bit.
> > > > > > > > 
> > > > > > > > Otherwise in current form, volatile mount will always return error
> > > > > > > > if upperdir has error and it has not been seen by anybody.
> > > > > > > > 
> > > > > > > > How did you finally end up testing the error case. Want to simualate
> > > > > > > > error aritificially and test it.
> > > > > > > > 
> > > > > > > 
> > > > > > > If you don't want to see errors that occurred before you did the mount,
> > > > > > > then you probably can just resurrect and rename the original version of
> > > > > > > errseq_sample. Something like this, but with a different name:
> > > > > > > 
> > > > > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > > > > +{
> > > > > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > > > > +       errseq_t new = old;
> > > > > > > +
> > > > > > > +       /*
> > > > > > > +        * For the common case of no errors ever having been set, we can skip
> > > > > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > > > > +        * never go back to zero.
> > > > > > > +        */
> > > > > > > +       if (old != 0) {
> > > > > > > +               new |= ERRSEQ_SEEN;
> > > > > > > +               if (old != new)
> > > > > > > +                       cmpxchg(eseq, old, new);
> > > > > > > +       }
> > > > > > > +       return new;
> > > > > > > +}
> > > > > > 
> > > > > > Yes, a helper like this should solve the issue at hand. We are not
> > > > > > interested in previous errors. This also sets the ERRSEQ_SEEN on 
> > > > > > sample and it will also solve the other issue when after sampling
> > > > > > if error gets seen, we don't want errseq_check() to return error.
> > > > > > 
> > > > > > Thinking of some possible names for new function.
> > > > > > 
> > > > > > errseq_sample_seen()
> > > > > > errseq_sample_set_seen()
> > > > > > errseq_sample_consume_unseen()
> > > > > > errseq_sample_current()
> > > > > > 
> > > > > 
> > > > > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > > > > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > > > > 
> > > > > If you don't want to add a new helper, I think you'd probably also be
> > > > > able to do something like this in fill_super:
> > > > > 
> > > > >     errseq_sample()
> > > > >     errseq_check_and_advance()
> > > > > 
> > > > > 
> > > > > ...and just ignore the error returned by the check and advance. At that
> > > > > point, the cursor should be caught up and any subsequent syncfs call
> > > > > should return 0 until you record another error. It's a little less
> > > > > efficient, but only slightly so.
> > > > 
> > > > This seems even better.
> > > > 
> > > > Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> > > > sample. In our case, that would mean that we consumed an unseen error but
> > > > never reported it back to user space. And then somebody might complain.
> > > > 
> > > > This kind of reminds me posgresql's fsync issues where they did
> > > > writes using one fd and another thread opened another fd and
> > > > did sync and they expected any errors to be reported.
> > > > 
> > > 
> > > > Similary what if an unseen error is present on superblock on upper
> > > > and if we mount volatile overlay and mark the error SEEN, then
> > > > if another process opens a file on upper and did syncfs(), it will
> > > > complain that exisiting error was not reported to it.
> > > > 
> > > > Overlay use case seems to be that we just want to check if an error
> > > > has happened on upper superblock since we sampled it and don't
> > > > want to consume that error as such. Will it make sense to introduce
> > > > two helpers for error sampling and error checking which mask the
> > > > SEEN bit and don't do anything with it. For example, following compile
> > > > tested only patch.
> > > > 
> > > > Now we will not touch SEEN bit at all. And even if SEEN gets set
> > > > since we sampled, errseq_check_mask_seen() will not flag it as
> > > > error.
> > > > 
> > > > Thanks
> > > > Vivek
> > > > 
> > > 
> > > Again, you're not really hiding this from anyone doing something _sane_.
> > > You're only hiding an error from someone who opens the file after an
> > > error occurs and expects to see an error.
> > > 
> > > That was the behavior for fsync before we switched to errseq_t, and we
> > > had to change errseq_sample for applications that relied on that. syncfs
> > > reporting these errors is pretty new however. I don't think we
> > > necessarily need to make the same guarantees there.
> > > 
> > > The solution to all of these problems is to ensure that you open the
> > > files early you're issuing syncfs on and keep them open. Then you'll
> > > always see any subsequent errors.
> > 
> > Ok. I guess we will have to set SEEN bit during error_sample otherwise,
> > we miss errors. I had missed this point.
> > 
> > So mounting a volatile overlay instance will become somewhat
> > equivalent of as if somebody did a syncfs on upper, consumed
> > error and did not do anything about it.
> > 
> > If a user cares about not losing such errors, they need to keep an
> > fd open on upper. 
> > 
> > /me hopes that this does not become an issue for somebody. Even
> > if it does, one workaround can be don't do volatile overlay or
> > don't share overlay upper with other conflicting workload.
> > 
> 
> Yeah, there are limits to what we can do with 32 bits.
> 
> It's not pretty, but I guess you could pr_warn at mount time if you find
> an unseen error. That would at least not completely drop it on the
> floor.
> 
> -- 
> Jeff Layton <jlayton@redhat.com>
> 

If I may enumerate our choices to help my own understanding, and
come up with a decent decision on how to proceed:

1. If the filesystem has an unseen error, pr_warn.
2. If the filesystem has an unseen error, refuse to mount it until
   the user clears the error (via syncfs?).
3. Ignore the beginning state of the upperdir
4. Increment the errseq_t.
5. A combination of #1, and #2 and require the user to mount
   -o reallyvolatile or smoe such.

Now the downsides of each of these options:

1. The user probably won't look at these errors. Especially,
   if the application is a container runtime, and these are
   happening on behalf of the application in an automated fashion.
2. Forcing a syncfs on most filesystems is a massively costly
   operation that we want to avoid with the volatile operation.
   Also, go back to #1. Until we implement the new FS API, we
   can't easily give meaningful warnings to users that they
   can programatically act on (unless we use some special errno).
3. This is a noop.
4. We can hide errors from other users of the upperdir if they
   rely on syncfs semantics rather than per-fd fsync semantics
   to check if the filesystem is "clean".
5. See the issues with #1 and #2.

I'm also curious as to how the patchset that allows for partial
sync is going to deal with this problem [1].

There is one other proposal I have, which is we define errseq_t
as two structures:
-errseq_t errseq_set(errseq_t *eseq, int err);
+/* For use on the publishing-side of errseq */
+struct errseq_publisher {
+        atomic_t        errors;
+        errseq_t        errseq_t
+};
+
+errseq_t errseq_set(struct errseq_publisher *eseq, int err);

And errseq_publisher is on the superblock, and errors is always incremented no 
matter what. We risk wrapping, but I think this falls into Jeff's "sane" test -- 
if there are 2**32+ errors without someone doing an fsync, or noticing, you 
might have other problems.

This has two (and a half) downsides:
1. It is a potential performance concern to introduce an atomic here.
2. It takes more space on the superblock.

1 can be mitigated by using a percpu variable, but that makes #2 far worse.

Opinions?

[1]: https://lwn.net/Articles/837133/



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-03 10:42                 ` Sargun Dhillon
@ 2020-12-03 12:06                   ` Jeff Layton
  2020-12-03 14:27                   ` Vivek Goyal
  1 sibling, 0 replies; 28+ messages in thread
From: Jeff Layton @ 2020-12-03 12:06 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Vivek Goyal, Amir Goldstein, linux-fsdevel, linux-unionfs,
	Miklos Szeredi

On Thu, 2020-12-03 at 10:42 +0000, Sargun Dhillon wrote:
> On Wed, Dec 02, 2020 at 04:52:33PM -0500, Jeff Layton wrote:
> > On Wed, 2020-12-02 at 16:34 -0500, Vivek Goyal wrote:
> > > On Wed, Dec 02, 2020 at 02:26:23PM -0500, Jeff Layton wrote:
> > > [..]
> > > > > > > > > > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > > > > > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > > > > > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > > > > > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > > > > > 
> > > > > > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > > > > > call comes for volatile mount, it will return an error. But that's
> > > > > > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > > > > > is a new error after the volatile mount, right?
> > > > > > > > > 
> > > > > > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > > > > > just returns existing value of errseq. And then we will have to
> > > > > > > > > do something about errseq_check() to not return an error if "since"
> > > > > > > > > and "eseq" differ only by "seen" bit.
> > > > > > > > > 
> > > > > > > > > Otherwise in current form, volatile mount will always return error
> > > > > > > > > if upperdir has error and it has not been seen by anybody.
> > > > > > > > > 
> > > > > > > > > How did you finally end up testing the error case. Want to simualate
> > > > > > > > > error aritificially and test it.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > If you don't want to see errors that occurred before you did the mount,
> > > > > > > > then you probably can just resurrect and rename the original version of
> > > > > > > > errseq_sample. Something like this, but with a different name:
> > > > > > > > 
> > > > > > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > > > > > +{
> > > > > > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > > > > > +       errseq_t new = old;
> > > > > > > > +
> > > > > > > > +       /*
> > > > > > > > +        * For the common case of no errors ever having been set, we can skip
> > > > > > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > > > > > +        * never go back to zero.
> > > > > > > > +        */
> > > > > > > > +       if (old != 0) {
> > > > > > > > +               new |= ERRSEQ_SEEN;
> > > > > > > > +               if (old != new)
> > > > > > > > +                       cmpxchg(eseq, old, new);
> > > > > > > > +       }
> > > > > > > > +       return new;
> > > > > > > > +}
> > > > > > > 
> > > > > > > Yes, a helper like this should solve the issue at hand. We are not
> > > > > > > interested in previous errors. This also sets the ERRSEQ_SEEN on 
> > > > > > > sample and it will also solve the other issue when after sampling
> > > > > > > if error gets seen, we don't want errseq_check() to return error.
> > > > > > > 
> > > > > > > Thinking of some possible names for new function.
> > > > > > > 
> > > > > > > errseq_sample_seen()
> > > > > > > errseq_sample_set_seen()
> > > > > > > errseq_sample_consume_unseen()
> > > > > > > errseq_sample_current()
> > > > > > > 
> > > > > > 
> > > > > > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > > > > > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > > > > > 
> > > > > > If you don't want to add a new helper, I think you'd probably also be
> > > > > > able to do something like this in fill_super:
> > > > > > 
> > > > > >     errseq_sample()
> > > > > >     errseq_check_and_advance()
> > > > > > 
> > > > > > 
> > > > > > ...and just ignore the error returned by the check and advance. At that
> > > > > > point, the cursor should be caught up and any subsequent syncfs call
> > > > > > should return 0 until you record another error. It's a little less
> > > > > > efficient, but only slightly so.
> > > > > 
> > > > > This seems even better.
> > > > > 
> > > > > Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> > > > > sample. In our case, that would mean that we consumed an unseen error but
> > > > > never reported it back to user space. And then somebody might complain.
> > > > > 
> > > > > This kind of reminds me posgresql's fsync issues where they did
> > > > > writes using one fd and another thread opened another fd and
> > > > > did sync and they expected any errors to be reported.
> > > > > 
> > > > 
> > > > > Similary what if an unseen error is present on superblock on upper
> > > > > and if we mount volatile overlay and mark the error SEEN, then
> > > > > if another process opens a file on upper and did syncfs(), it will
> > > > > complain that exisiting error was not reported to it.
> > > > > 
> > > > > Overlay use case seems to be that we just want to check if an error
> > > > > has happened on upper superblock since we sampled it and don't
> > > > > want to consume that error as such. Will it make sense to introduce
> > > > > two helpers for error sampling and error checking which mask the
> > > > > SEEN bit and don't do anything with it. For example, following compile
> > > > > tested only patch.
> > > > > 
> > > > > Now we will not touch SEEN bit at all. And even if SEEN gets set
> > > > > since we sampled, errseq_check_mask_seen() will not flag it as
> > > > > error.
> > > > > 
> > > > > Thanks
> > > > > Vivek
> > > > > 
> > > > 
> > > > Again, you're not really hiding this from anyone doing something _sane_.
> > > > You're only hiding an error from someone who opens the file after an
> > > > error occurs and expects to see an error.
> > > > 
> > > > That was the behavior for fsync before we switched to errseq_t, and we
> > > > had to change errseq_sample for applications that relied on that. syncfs
> > > > reporting these errors is pretty new however. I don't think we
> > > > necessarily need to make the same guarantees there.
> > > > 
> > > > The solution to all of these problems is to ensure that you open the
> > > > files early you're issuing syncfs on and keep them open. Then you'll
> > > > always see any subsequent errors.
> > > 
> > > Ok. I guess we will have to set SEEN bit during error_sample otherwise,
> > > we miss errors. I had missed this point.
> > > 
> > > So mounting a volatile overlay instance will become somewhat
> > > equivalent of as if somebody did a syncfs on upper, consumed
> > > error and did not do anything about it.
> > > 
> > > If a user cares about not losing such errors, they need to keep an
> > > fd open on upper. 
> > > 
> > > /me hopes that this does not become an issue for somebody. Even
> > > if it does, one workaround can be don't do volatile overlay or
> > > don't share overlay upper with other conflicting workload.
> > > 
> > 
> > Yeah, there are limits to what we can do with 32 bits.
> > 
> > It's not pretty, but I guess you could pr_warn at mount time if you find
> > an unseen error. That would at least not completely drop it on the
> > floor.
> > 
> > -- 
> > Jeff Layton <jlayton@redhat.com>
> > 
> 
> If I may enumerate our choices to help my own understanding, and
> come up with a decent decision on how to proceed:
> 
> 1. If the filesystem has an unseen error, pr_warn.
> 2. If the filesystem has an unseen error, refuse to mount it until
>    the user clears the error (via syncfs?).
> 3. Ignore the beginning state of the upperdir
> 4. Increment the errseq_t.
> 5. A combination of #1, and #2 and require the user to mount
>    -o reallyvolatile or smoe such.
> 
> Now the downsides of each of these options:
> 
> 1. The user probably won't look at these errors. Especially,
>    if the application is a container runtime, and these are
>    happening on behalf of the application in an automated fashion.
> 2. Forcing a syncfs on most filesystems is a massively costly
>    operation that we want to avoid with the volatile operation.
>    Also, go back to #1. Until we implement the new FS API, we
>    can't easily give meaningful warnings to users that they
>    can programatically act on (unless we use some special errno).
> 3. This is a noop.
> 4. We can hide errors from other users of the upperdir if they
>    rely on syncfs semantics rather than per-fd fsync semantics
>    to check if the filesystem is "clean".

Not really. You'd only hide it from an application that didn't already
have the file open when the error occurred. While we did need to allow
someone see an error with fsync that occurred before the file was open,
I don't see a compelling need to do that with syncfs.

> 5. See the issues with #1 and #2.
> 
> I'm also curious as to how the patchset that allows for partial
> sync is going to deal with this problem [1].
> 
> There is one other proposal I have, which is we define errseq_t
> as two structures:
> -errseq_t errseq_set(errseq_t *eseq, int err);
> +/* For use on the publishing-side of errseq */
> +struct errseq_publisher {
> +        atomic_t        errors;
> +        errseq_t        errseq_t
> +};
> +
> +errseq_t errseq_set(struct errseq_publisher *eseq, int err);
> 
> And errseq_publisher is on the superblock, and errors is always incremented no 
> matter what. We risk wrapping, but I think this falls into Jeff's "sane" test -- 
> if there are 2**32+ errors without someone doing an fsync, or noticing, you 
> might have other problems.
> 

You'd have to have 2**20 errors, and call syncfs on a different fd 2**20
times, and just happen to call syncfs on original fd at exactly the time
that you hit the 2**20'th iteration such that the counter was exactly
the same as the one that you sampled before.

Collisions are possible with this scheme, but they really should be
exceedingly rare.

> This has two (and a half) downsides:
> 1. It is a potential performance concern to introduce an atomic here.
> 2. It takes more space on the superblock.
> 
> 1 can be mitigated by using a percpu variable, but that makes #2 far worse.
> 
> Opinions?
> 
> [1]: https://lwn.net/Articles/837133/
> 
> 

-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-03 10:42                 ` Sargun Dhillon
  2020-12-03 12:06                   ` Jeff Layton
@ 2020-12-03 14:27                   ` Vivek Goyal
  2020-12-03 15:20                     ` Jeff Layton
  1 sibling, 1 reply; 28+ messages in thread
From: Vivek Goyal @ 2020-12-03 14:27 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Jeff Layton, Amir Goldstein, linux-fsdevel, linux-unionfs,
	Miklos Szeredi

On Thu, Dec 03, 2020 at 10:42:26AM +0000, Sargun Dhillon wrote:
> On Wed, Dec 02, 2020 at 04:52:33PM -0500, Jeff Layton wrote:
> > On Wed, 2020-12-02 at 16:34 -0500, Vivek Goyal wrote:
> > > On Wed, Dec 02, 2020 at 02:26:23PM -0500, Jeff Layton wrote:
> > > [..]
> > > > > > > > > > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > > > > > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > > > > > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > > > > > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > > > > > 
> > > > > > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > > > > > call comes for volatile mount, it will return an error. But that's
> > > > > > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > > > > > is a new error after the volatile mount, right?
> > > > > > > > > 
> > > > > > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > > > > > just returns existing value of errseq. And then we will have to
> > > > > > > > > do something about errseq_check() to not return an error if "since"
> > > > > > > > > and "eseq" differ only by "seen" bit.
> > > > > > > > > 
> > > > > > > > > Otherwise in current form, volatile mount will always return error
> > > > > > > > > if upperdir has error and it has not been seen by anybody.
> > > > > > > > > 
> > > > > > > > > How did you finally end up testing the error case. Want to simualate
> > > > > > > > > error aritificially and test it.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > If you don't want to see errors that occurred before you did the mount,
> > > > > > > > then you probably can just resurrect and rename the original version of
> > > > > > > > errseq_sample. Something like this, but with a different name:
> > > > > > > > 
> > > > > > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > > > > > +{
> > > > > > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > > > > > +       errseq_t new = old;
> > > > > > > > +
> > > > > > > > +       /*
> > > > > > > > +        * For the common case of no errors ever having been set, we can skip
> > > > > > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > > > > > +        * never go back to zero.
> > > > > > > > +        */
> > > > > > > > +       if (old != 0) {
> > > > > > > > +               new |= ERRSEQ_SEEN;
> > > > > > > > +               if (old != new)
> > > > > > > > +                       cmpxchg(eseq, old, new);
> > > > > > > > +       }
> > > > > > > > +       return new;
> > > > > > > > +}
> > > > > > > 
> > > > > > > Yes, a helper like this should solve the issue at hand. We are not
> > > > > > > interested in previous errors. This also sets the ERRSEQ_SEEN on 
> > > > > > > sample and it will also solve the other issue when after sampling
> > > > > > > if error gets seen, we don't want errseq_check() to return error.
> > > > > > > 
> > > > > > > Thinking of some possible names for new function.
> > > > > > > 
> > > > > > > errseq_sample_seen()
> > > > > > > errseq_sample_set_seen()
> > > > > > > errseq_sample_consume_unseen()
> > > > > > > errseq_sample_current()
> > > > > > > 
> > > > > > 
> > > > > > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > > > > > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > > > > > 
> > > > > > If you don't want to add a new helper, I think you'd probably also be
> > > > > > able to do something like this in fill_super:
> > > > > > 
> > > > > >     errseq_sample()
> > > > > >     errseq_check_and_advance()
> > > > > > 
> > > > > > 
> > > > > > ...and just ignore the error returned by the check and advance. At that
> > > > > > point, the cursor should be caught up and any subsequent syncfs call
> > > > > > should return 0 until you record another error. It's a little less
> > > > > > efficient, but only slightly so.
> > > > > 
> > > > > This seems even better.
> > > > > 
> > > > > Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> > > > > sample. In our case, that would mean that we consumed an unseen error but
> > > > > never reported it back to user space. And then somebody might complain.
> > > > > 
> > > > > This kind of reminds me posgresql's fsync issues where they did
> > > > > writes using one fd and another thread opened another fd and
> > > > > did sync and they expected any errors to be reported.
> > > > > 
> > > > 
> > > > > Similary what if an unseen error is present on superblock on upper
> > > > > and if we mount volatile overlay and mark the error SEEN, then
> > > > > if another process opens a file on upper and did syncfs(), it will
> > > > > complain that exisiting error was not reported to it.
> > > > > 
> > > > > Overlay use case seems to be that we just want to check if an error
> > > > > has happened on upper superblock since we sampled it and don't
> > > > > want to consume that error as such. Will it make sense to introduce
> > > > > two helpers for error sampling and error checking which mask the
> > > > > SEEN bit and don't do anything with it. For example, following compile
> > > > > tested only patch.
> > > > > 
> > > > > Now we will not touch SEEN bit at all. And even if SEEN gets set
> > > > > since we sampled, errseq_check_mask_seen() will not flag it as
> > > > > error.
> > > > > 
> > > > > Thanks
> > > > > Vivek
> > > > > 
> > > > 
> > > > Again, you're not really hiding this from anyone doing something _sane_.
> > > > You're only hiding an error from someone who opens the file after an
> > > > error occurs and expects to see an error.
> > > > 
> > > > That was the behavior for fsync before we switched to errseq_t, and we
> > > > had to change errseq_sample for applications that relied on that. syncfs
> > > > reporting these errors is pretty new however. I don't think we
> > > > necessarily need to make the same guarantees there.
> > > > 
> > > > The solution to all of these problems is to ensure that you open the
> > > > files early you're issuing syncfs on and keep them open. Then you'll
> > > > always see any subsequent errors.
> > > 
> > > Ok. I guess we will have to set SEEN bit during error_sample otherwise,
> > > we miss errors. I had missed this point.
> > > 
> > > So mounting a volatile overlay instance will become somewhat
> > > equivalent of as if somebody did a syncfs on upper, consumed
> > > error and did not do anything about it.
> > > 
> > > If a user cares about not losing such errors, they need to keep an
> > > fd open on upper. 
> > > 
> > > /me hopes that this does not become an issue for somebody. Even
> > > if it does, one workaround can be don't do volatile overlay or
> > > don't share overlay upper with other conflicting workload.
> > > 
> > 
> > Yeah, there are limits to what we can do with 32 bits.
> > 
> > It's not pretty, but I guess you could pr_warn at mount time if you find
> > an unseen error. That would at least not completely drop it on the
> > floor.
> > 
> > -- 
> > Jeff Layton <jlayton@redhat.com>
> > 
> 
> If I may enumerate our choices to help my own understanding, and
> come up with a decent decision on how to proceed:
> 
> 1. If the filesystem has an unseen error, pr_warn.
> 2. If the filesystem has an unseen error, refuse to mount it until
>    the user clears the error (via syncfs?).
> 3. Ignore the beginning state of the upperdir
> 4. Increment the errseq_t.
> 5. A combination of #1, and #2 and require the user to mount
>    -o reallyvolatile or smoe such.
> 
> Now the downsides of each of these options:
> 
> 1. The user probably won't look at these errors. Especially,
>    if the application is a container runtime, and these are
>    happening on behalf of the application in an automated fashion.
> 2. Forcing a syncfs on most filesystems is a massively costly
>    operation that we want to avoid with the volatile operation.
>    Also, go back to #1. Until we implement the new FS API, we
>    can't easily give meaningful warnings to users that they
>    can programatically act on (unless we use some special errno).
> 3. This is a noop.
> 4. We can hide errors from other users of the upperdir if they
>    rely on syncfs semantics rather than per-fd fsync semantics
>    to check if the filesystem is "clean".
> 5. See the issues with #1 and #2.
> 
> I'm also curious as to how the patchset that allows for partial
> sync is going to deal with this problem [1].
> 
> There is one other proposal I have, which is we define errseq_t
> as two structures:
> -errseq_t errseq_set(errseq_t *eseq, int err);
> +/* For use on the publishing-side of errseq */
> +struct errseq_publisher {
> +        atomic_t        errors;
> +        errseq_t        errseq_t
> +};
> +
> +errseq_t errseq_set(struct errseq_publisher *eseq, int err);
> 
> And errseq_publisher is on the superblock, and errors is always incremented no 
> matter what. We risk wrapping, but I think this falls into Jeff's "sane" test -- 
> if there are 2**32+ errors without someone doing an fsync, or noticing, you 
> might have other problems.
> 
> This has two (and a half) downsides:
> 1. It is a potential performance concern to introduce an atomic here.

Is updation of errseq_t performance sensitive path. Updation happens in
error path and I would think that does not happen often. If that's the
case, it should not be very performance sensitive path.

I agree that warning on unseen error is probably not enough. Applications
can't do much with that. To me it boils down to two options.

A. Live with the idea of swallowing the unseen error on syncfs (if caller
   did not keep an fd open).

B. Extend errseq infrastcture in such a way so that we can detect new
   error without marking error SEEN.

It feels as if B is a safter choice but will be more work. With A, problem
is that behavior will be different in difference scenarios and it will then
become difficult to justify.

- fsync and syncfs behavior will be different w.r.t UNSEEN error.
- syncfs behavior will be different depending on if volatile overlay
  mounts are being used on this filesystem or not.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-03 14:27                   ` Vivek Goyal
@ 2020-12-03 15:20                     ` Jeff Layton
  2020-12-03 17:08                       ` Sargun Dhillon
  2020-12-03 20:31                       ` Vivek Goyal
  0 siblings, 2 replies; 28+ messages in thread
From: Jeff Layton @ 2020-12-03 15:20 UTC (permalink / raw)
  To: Vivek Goyal, Sargun Dhillon
  Cc: Amir Goldstein, linux-fsdevel, linux-unionfs, Miklos Szeredi,
	Matthew Wilcox

On Thu, 2020-12-03 at 09:27 -0500, Vivek Goyal wrote:
> On Thu, Dec 03, 2020 at 10:42:26AM +0000, Sargun Dhillon wrote:
> > On Wed, Dec 02, 2020 at 04:52:33PM -0500, Jeff Layton wrote:
> > > On Wed, 2020-12-02 at 16:34 -0500, Vivek Goyal wrote:
> > > > On Wed, Dec 02, 2020 at 02:26:23PM -0500, Jeff Layton wrote:
> > > > [..]
> > > > > > > > > > > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > > > > > > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > > > > > > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > > > > > > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > > > > > > 
> > > > > > > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > > > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > > > > > > call comes for volatile mount, it will return an error. But that's
> > > > > > > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > > > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > > > > > > is a new error after the volatile mount, right?
> > > > > > > > > > 
> > > > > > > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > > > > > > just returns existing value of errseq. And then we will have to
> > > > > > > > > > do something about errseq_check() to not return an error if "since"
> > > > > > > > > > and "eseq" differ only by "seen" bit.
> > > > > > > > > > 
> > > > > > > > > > Otherwise in current form, volatile mount will always return error
> > > > > > > > > > if upperdir has error and it has not been seen by anybody.
> > > > > > > > > > 
> > > > > > > > > > How did you finally end up testing the error case. Want to simualate
> > > > > > > > > > error aritificially and test it.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > If you don't want to see errors that occurred before you did the mount,
> > > > > > > > > then you probably can just resurrect and rename the original version of
> > > > > > > > > errseq_sample. Something like this, but with a different name:
> > > > > > > > > 
> > > > > > > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > > > > > > +{
> > > > > > > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > > > > > > +       errseq_t new = old;
> > > > > > > > > +
> > > > > > > > > +       /*
> > > > > > > > > +        * For the common case of no errors ever having been set, we can skip
> > > > > > > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > > > > > > +        * never go back to zero.
> > > > > > > > > +        */
> > > > > > > > > +       if (old != 0) {
> > > > > > > > > +               new |= ERRSEQ_SEEN;
> > > > > > > > > +               if (old != new)
> > > > > > > > > +                       cmpxchg(eseq, old, new);
> > > > > > > > > +       }
> > > > > > > > > +       return new;
> > > > > > > > > +}
> > > > > > > > 
> > > > > > > > Yes, a helper like this should solve the issue at hand. We are not
> > > > > > > > interested in previous errors. This also sets the ERRSEQ_SEEN on 
> > > > > > > > sample and it will also solve the other issue when after sampling
> > > > > > > > if error gets seen, we don't want errseq_check() to return error.
> > > > > > > > 
> > > > > > > > Thinking of some possible names for new function.
> > > > > > > > 
> > > > > > > > errseq_sample_seen()
> > > > > > > > errseq_sample_set_seen()
> > > > > > > > errseq_sample_consume_unseen()
> > > > > > > > errseq_sample_current()
> > > > > > > > 
> > > > > > > 
> > > > > > > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > > > > > > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > > > > > > 
> > > > > > > If you don't want to add a new helper, I think you'd probably also be
> > > > > > > able to do something like this in fill_super:
> > > > > > > 
> > > > > > >     errseq_sample()
> > > > > > >     errseq_check_and_advance()
> > > > > > > 
> > > > > > > 
> > > > > > > ...and just ignore the error returned by the check and advance. At that
> > > > > > > point, the cursor should be caught up and any subsequent syncfs call
> > > > > > > should return 0 until you record another error. It's a little less
> > > > > > > efficient, but only slightly so.
> > > > > > 
> > > > > > This seems even better.
> > > > > > 
> > > > > > Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> > > > > > sample. In our case, that would mean that we consumed an unseen error but
> > > > > > never reported it back to user space. And then somebody might complain.
> > > > > > 
> > > > > > This kind of reminds me posgresql's fsync issues where they did
> > > > > > writes using one fd and another thread opened another fd and
> > > > > > did sync and they expected any errors to be reported.
> > > > > > 
> > > > > 
> > > > > > Similary what if an unseen error is present on superblock on upper
> > > > > > and if we mount volatile overlay and mark the error SEEN, then
> > > > > > if another process opens a file on upper and did syncfs(), it will
> > > > > > complain that exisiting error was not reported to it.
> > > > > > 
> > > > > > Overlay use case seems to be that we just want to check if an error
> > > > > > has happened on upper superblock since we sampled it and don't
> > > > > > want to consume that error as such. Will it make sense to introduce
> > > > > > two helpers for error sampling and error checking which mask the
> > > > > > SEEN bit and don't do anything with it. For example, following compile
> > > > > > tested only patch.
> > > > > > 
> > > > > > Now we will not touch SEEN bit at all. And even if SEEN gets set
> > > > > > since we sampled, errseq_check_mask_seen() will not flag it as
> > > > > > error.
> > > > > > 
> > > > > > Thanks
> > > > > > Vivek
> > > > > > 
> > > > > 
> > > > > Again, you're not really hiding this from anyone doing something _sane_.
> > > > > You're only hiding an error from someone who opens the file after an
> > > > > error occurs and expects to see an error.
> > > > > 
> > > > > That was the behavior for fsync before we switched to errseq_t, and we
> > > > > had to change errseq_sample for applications that relied on that. syncfs
> > > > > reporting these errors is pretty new however. I don't think we
> > > > > necessarily need to make the same guarantees there.
> > > > > 
> > > > > The solution to all of these problems is to ensure that you open the
> > > > > files early you're issuing syncfs on and keep them open. Then you'll
> > > > > always see any subsequent errors.
> > > > 
> > > > Ok. I guess we will have to set SEEN bit during error_sample otherwise,
> > > > we miss errors. I had missed this point.
> > > > 
> > > > So mounting a volatile overlay instance will become somewhat
> > > > equivalent of as if somebody did a syncfs on upper, consumed
> > > > error and did not do anything about it.
> > > > 
> > > > If a user cares about not losing such errors, they need to keep an
> > > > fd open on upper. 
> > > > 
> > > > /me hopes that this does not become an issue for somebody. Even
> > > > if it does, one workaround can be don't do volatile overlay or
> > > > don't share overlay upper with other conflicting workload.
> > > > 
> > > 
> > > Yeah, there are limits to what we can do with 32 bits.
> > > 
> > > It's not pretty, but I guess you could pr_warn at mount time if you find
> > > an unseen error. That would at least not completely drop it on the
> > > floor.
> > > 
> > > -- 
> > > Jeff Layton <jlayton@redhat.com>
> > > 
> > 
> > If I may enumerate our choices to help my own understanding, and
> > come up with a decent decision on how to proceed:
> > 
> > 1. If the filesystem has an unseen error, pr_warn.
> > 2. If the filesystem has an unseen error, refuse to mount it until
> >    the user clears the error (via syncfs?).
> > 3. Ignore the beginning state of the upperdir
> > 4. Increment the errseq_t.
> > 5. A combination of #1, and #2 and require the user to mount
> >    -o reallyvolatile or smoe such.
> > 
> > Now the downsides of each of these options:
> > 
> > 1. The user probably won't look at these errors. Especially,
> >    if the application is a container runtime, and these are
> >    happening on behalf of the application in an automated fashion.
> > 2. Forcing a syncfs on most filesystems is a massively costly
> >    operation that we want to avoid with the volatile operation.
> >    Also, go back to #1. Until we implement the new FS API, we
> >    can't easily give meaningful warnings to users that they
> >    can programatically act on (unless we use some special errno).
> > 3. This is a noop.
> > 4. We can hide errors from other users of the upperdir if they
> >    rely on syncfs semantics rather than per-fd fsync semantics
> >    to check if the filesystem is "clean".
> > 5. See the issues with #1 and #2.
> > 
> > I'm also curious as to how the patchset that allows for partial
> > sync is going to deal with this problem [1].
> > 
> > There is one other proposal I have, which is we define errseq_t
> > as two structures:
> > -errseq_t errseq_set(errseq_t *eseq, int err);
> > +/* For use on the publishing-side of errseq */
> > +struct errseq_publisher {
> > +        atomic_t        errors;
> > +        errseq_t        errseq_t
> > +};
> > +
> > +errseq_t errseq_set(struct errseq_publisher *eseq, int err);
> > 
> > And errseq_publisher is on the superblock, and errors is always incremented no 
> > matter what. We risk wrapping, but I think this falls into Jeff's "sane" test -- 
> > if there are 2**32+ errors without someone doing an fsync, or noticing, you 
> > might have other problems.
> > 
> > This has two (and a half) downsides:
> > 1. It is a potential performance concern to introduce an atomic here.
> 
> Is updation of errseq_t performance sensitive path. Updation happens in
> error path and I would think that does not happen often. If that's the
> case, it should not be very performance sensitive path.
> 
> I agree that warning on unseen error is probably not enough. Applications
> can't do much with that. To me it boils down to two options.
> 
> A. Live with the idea of swallowing the unseen error on syncfs (if caller
>    did not keep an fd open).
> 
> B. Extend errseq infrastcture in such a way so that we can detect new
>    error without marking error SEEN.
> 
> It feels as if B is a safter choice but will be more work. With A, problem
> is that behavior will be different in difference scenarios and it will then
> become difficult to justify.
> 
> - fsync and syncfs behavior will be different w.r.t UNSEEN error.
> - syncfs behavior will be different depending on if volatile overlay
>   mounts are being used on this filesystem or not.
> 

(cc'ing Willy since he helped a lot with this work)

The design for fsync is a bit odd, in that we had to preserve historical
behavior. Note that we didn't get it right at first. Our original
assumption was that applications wouldn't expect to see any writeback
errors that occurred before they opened the file. That turned out to be
wrong, and Willy fixed that by ensuring that unseen errors would be
reported once to the next task to do an fsync [1].

I'm not sure how you could change the behavior to accomodate the desire
for B. One idea: you could do away with the optimization that doesn't
bump the counter and record a new error when no one has seen the
previous one yet, and it's the same error. That would increase the
chances of the counter wrapping around however.

It may also be possible to "steal" another bit or two from the counter
if you see a way to use it.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b4678df184b314a2bd47d2329feca2c2534aa12b


-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-03 15:20                     ` Jeff Layton
@ 2020-12-03 17:08                       ` Sargun Dhillon
  2020-12-03 17:50                         ` Jeff Layton
  2020-12-03 20:31                       ` Vivek Goyal
  1 sibling, 1 reply; 28+ messages in thread
From: Sargun Dhillon @ 2020-12-03 17:08 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Vivek Goyal, Amir Goldstein, Linux FS-devel Mailing List,
	overlayfs, Miklos Szeredi, Matthew Wilcox

On Thu, Dec 3, 2020 at 7:20 AM Jeff Layton <jlayton@redhat.com> wrote:
>
> On Thu, 2020-12-03 at 09:27 -0500, Vivek Goyal wrote:
> > On Thu, Dec 03, 2020 at 10:42:26AM +0000, Sargun Dhillon wrote:
> > > On Wed, Dec 02, 2020 at 04:52:33PM -0500, Jeff Layton wrote:
> > > > On Wed, 2020-12-02 at 16:34 -0500, Vivek Goyal wrote:
> > > > > On Wed, Dec 02, 2020 at 02:26:23PM -0500, Jeff Layton wrote:
> > > > > [..]
> > > > > > > > > > > > +         upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > > > > > > > +         sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > > > > > > > +         sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > > > > > > > +         ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > > > > > > >
> > > > > > > > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > > > > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > > > > > > > call comes for volatile mount, it will return an error. But that's
> > > > > > > > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > > > > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > > > > > > > is a new error after the volatile mount, right?
> > > > > > > > > > >
> > > > > > > > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > > > > > > > just returns existing value of errseq. And then we will have to
> > > > > > > > > > > do something about errseq_check() to not return an error if "since"
> > > > > > > > > > > and "eseq" differ only by "seen" bit.
> > > > > > > > > > >
> > > > > > > > > > > Otherwise in current form, volatile mount will always return error
> > > > > > > > > > > if upperdir has error and it has not been seen by anybody.
> > > > > > > > > > >
> > > > > > > > > > > How did you finally end up testing the error case. Want to simualate
> > > > > > > > > > > error aritificially and test it.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > If you don't want to see errors that occurred before you did the mount,
> > > > > > > > > > then you probably can just resurrect and rename the original version of
> > > > > > > > > > errseq_sample. Something like this, but with a different name:
> > > > > > > > > >
> > > > > > > > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > > > > > > > +{
> > > > > > > > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > > > > > > > +       errseq_t new = old;
> > > > > > > > > > +
> > > > > > > > > > +       /*
> > > > > > > > > > +        * For the common case of no errors ever having been set, we can skip
> > > > > > > > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > > > > > > > +        * never go back to zero.
> > > > > > > > > > +        */
> > > > > > > > > > +       if (old != 0) {
> > > > > > > > > > +               new |= ERRSEQ_SEEN;
> > > > > > > > > > +               if (old != new)
> > > > > > > > > > +                       cmpxchg(eseq, old, new);
> > > > > > > > > > +       }
> > > > > > > > > > +       return new;
> > > > > > > > > > +}
> > > > > > > > >
> > > > > > > > > Yes, a helper like this should solve the issue at hand. We are not
> > > > > > > > > interested in previous errors. This also sets the ERRSEQ_SEEN on
> > > > > > > > > sample and it will also solve the other issue when after sampling
> > > > > > > > > if error gets seen, we don't want errseq_check() to return error.
> > > > > > > > >
> > > > > > > > > Thinking of some possible names for new function.
> > > > > > > > >
> > > > > > > > > errseq_sample_seen()
> > > > > > > > > errseq_sample_set_seen()
> > > > > > > > > errseq_sample_consume_unseen()
> > > > > > > > > errseq_sample_current()
> > > > > > > > >
> > > > > > > >
> > > > > > > > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > > > > > > > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > > > > > > >
> > > > > > > > If you don't want to add a new helper, I think you'd probably also be
> > > > > > > > able to do something like this in fill_super:
> > > > > > > >
> > > > > > > >     errseq_sample()
> > > > > > > >     errseq_check_and_advance()
> > > > > > > >
> > > > > > > >
> > > > > > > > ...and just ignore the error returned by the check and advance. At that
> > > > > > > > point, the cursor should be caught up and any subsequent syncfs call
> > > > > > > > should return 0 until you record another error. It's a little less
> > > > > > > > efficient, but only slightly so.
> > > > > > >
> > > > > > > This seems even better.
> > > > > > >
> > > > > > > Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> > > > > > > sample. In our case, that would mean that we consumed an unseen error but
> > > > > > > never reported it back to user space. And then somebody might complain.
> > > > > > >
> > > > > > > This kind of reminds me posgresql's fsync issues where they did
> > > > > > > writes using one fd and another thread opened another fd and
> > > > > > > did sync and they expected any errors to be reported.
> > > > > > >
> > > > > >
> > > > > > > Similary what if an unseen error is present on superblock on upper
> > > > > > > and if we mount volatile overlay and mark the error SEEN, then
> > > > > > > if another process opens a file on upper and did syncfs(), it will
> > > > > > > complain that exisiting error was not reported to it.
> > > > > > >
> > > > > > > Overlay use case seems to be that we just want to check if an error
> > > > > > > has happened on upper superblock since we sampled it and don't
> > > > > > > want to consume that error as such. Will it make sense to introduce
> > > > > > > two helpers for error sampling and error checking which mask the
> > > > > > > SEEN bit and don't do anything with it. For example, following compile
> > > > > > > tested only patch.
> > > > > > >
> > > > > > > Now we will not touch SEEN bit at all. And even if SEEN gets set
> > > > > > > since we sampled, errseq_check_mask_seen() will not flag it as
> > > > > > > error.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vivek
> > > > > > >
> > > > > >
> > > > > > Again, you're not really hiding this from anyone doing something _sane_.
> > > > > > You're only hiding an error from someone who opens the file after an
> > > > > > error occurs and expects to see an error.
> > > > > >
> > > > > > That was the behavior for fsync before we switched to errseq_t, and we
> > > > > > had to change errseq_sample for applications that relied on that. syncfs
> > > > > > reporting these errors is pretty new however. I don't think we
> > > > > > necessarily need to make the same guarantees there.
> > > > > >
> > > > > > The solution to all of these problems is to ensure that you open the
> > > > > > files early you're issuing syncfs on and keep them open. Then you'll
> > > > > > always see any subsequent errors.
> > > > >
> > > > > Ok. I guess we will have to set SEEN bit during error_sample otherwise,
> > > > > we miss errors. I had missed this point.
> > > > >
> > > > > So mounting a volatile overlay instance will become somewhat
> > > > > equivalent of as if somebody did a syncfs on upper, consumed
> > > > > error and did not do anything about it.
> > > > >
> > > > > If a user cares about not losing such errors, they need to keep an
> > > > > fd open on upper.
> > > > >
> > > > > /me hopes that this does not become an issue for somebody. Even
> > > > > if it does, one workaround can be don't do volatile overlay or
> > > > > don't share overlay upper with other conflicting workload.
> > > > >
> > > >
> > > > Yeah, there are limits to what we can do with 32 bits.
> > > >
> > > > It's not pretty, but I guess you could pr_warn at mount time if you find
> > > > an unseen error. That would at least not completely drop it on the
> > > > floor.
> > > >
> > > > --
> > > > Jeff Layton <jlayton@redhat.com>
> > > >
> > >
> > > If I may enumerate our choices to help my own understanding, and
> > > come up with a decent decision on how to proceed:
> > >
> > > 1. If the filesystem has an unseen error, pr_warn.
> > > 2. If the filesystem has an unseen error, refuse to mount it until
> > >    the user clears the error (via syncfs?).
> > > 3. Ignore the beginning state of the upperdir
> > > 4. Increment the errseq_t.
> > > 5. A combination of #1, and #2 and require the user to mount
> > >    -o reallyvolatile or smoe such.
> > >
> > > Now the downsides of each of these options:
> > >
> > > 1. The user probably won't look at these errors. Especially,
> > >    if the application is a container runtime, and these are
> > >    happening on behalf of the application in an automated fashion.
> > > 2. Forcing a syncfs on most filesystems is a massively costly
> > >    operation that we want to avoid with the volatile operation.
> > >    Also, go back to #1. Until we implement the new FS API, we
> > >    can't easily give meaningful warnings to users that they
> > >    can programatically act on (unless we use some special errno).
> > > 3. This is a noop.
> > > 4. We can hide errors from other users of the upperdir if they
> > >    rely on syncfs semantics rather than per-fd fsync semantics
> > >    to check if the filesystem is "clean".
> > > 5. See the issues with #1 and #2.
> > >
> > > I'm also curious as to how the patchset that allows for partial
> > > sync is going to deal with this problem [1].
> > >
> > > There is one other proposal I have, which is we define errseq_t
> > > as two structures:
> > > -errseq_t errseq_set(errseq_t *eseq, int err);
> > > +/* For use on the publishing-side of errseq */
> > > +struct errseq_publisher {
> > > +        atomic_t        errors;
> > > +        errseq_t        errseq_t
> > > +};
> > > +
> > > +errseq_t errseq_set(struct errseq_publisher *eseq, int err);
> > >
> > > And errseq_publisher is on the superblock, and errors is always incremented no
> > > matter what. We risk wrapping, but I think this falls into Jeff's "sane" test --
> > > if there are 2**32+ errors without someone doing an fsync, or noticing, you
> > > might have other problems.
> > >
> > > This has two (and a half) downsides:
> > > 1. It is a potential performance concern to introduce an atomic here.
> >
> > Is updation of errseq_t performance sensitive path. Updation happens in
> > error path and I would think that does not happen often. If that's the
> > case, it should not be very performance sensitive path.
> >
> > I agree that warning on unseen error is probably not enough. Applications
> > can't do much with that. To me it boils down to two options.
> >
> > A. Live with the idea of swallowing the unseen error on syncfs (if caller
> >    did not keep an fd open).
> >
> > B. Extend errseq infrastcture in such a way so that we can detect new
> >    error without marking error SEEN.
> >
> > It feels as if B is a safter choice but will be more work. With A, problem
> > is that behavior will be different in difference scenarios and it will then
> > become difficult to justify.
> >
> > - fsync and syncfs behavior will be different w.r.t UNSEEN error.
> > - syncfs behavior will be different depending on if volatile overlay
> >   mounts are being used on this filesystem or not.
> >
>
> (cc'ing Willy since he helped a lot with this work)
>
> The design for fsync is a bit odd, in that we had to preserve historical
> behavior. Note that we didn't get it right at first. Our original
> assumption was that applications wouldn't expect to see any writeback
> errors that occurred before they opened the file. That turned out to be
> wrong, and Willy fixed that by ensuring that unseen errors would be
> reported once to the next task to do an fsync [1].
>
> I'm not sure how you could change the behavior to accomodate the desire
> for B. One idea: you could do away with the optimization that doesn't
> bump the counter and record a new error when no one has seen the
> previous one yet, and it's the same error. That would increase the
> chances of the counter wrapping around however.
>

I personally like this approach. If the concerns of counter-wrap or performance
are well-founded, what do you think of adding the atomic which is
incremented only when the seen bit is unset, effectively giving publishers
52 (32 + 20) bits of space to prevent wraparound? And subscribers can
optionally opt to have 52-bit values to check against?

> It may also be possible to "steal" another bit or two from the counter
> if you see a way to use it.
>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b4678df184b314a2bd47d2329feca2c2534aa12b
>
>
> --
> Jeff Layton <jlayton@redhat.com>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-03 17:08                       ` Sargun Dhillon
@ 2020-12-03 17:50                         ` Jeff Layton
  2020-12-03 20:43                           ` Vivek Goyal
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Layton @ 2020-12-03 17:50 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Vivek Goyal, Amir Goldstein, Linux FS-devel Mailing List,
	overlayfs, Miklos Szeredi, Matthew Wilcox

On Thu, 2020-12-03 at 09:08 -0800, Sargun Dhillon wrote:
> On Thu, Dec 3, 2020 at 7:20 AM Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > On Thu, 2020-12-03 at 09:27 -0500, Vivek Goyal wrote:
> > > On Thu, Dec 03, 2020 at 10:42:26AM +0000, Sargun Dhillon wrote:
> > > > On Wed, Dec 02, 2020 at 04:52:33PM -0500, Jeff Layton wrote:
> > > > > On Wed, 2020-12-02 at 16:34 -0500, Vivek Goyal wrote:
> > > > > > On Wed, Dec 02, 2020 at 02:26:23PM -0500, Jeff Layton wrote:
> > > > > > [..]
> > > > > > > > > > > > > +         upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > > > > > > > > +         sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > > > > > > > > +         sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > > > > > > > > +         ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > > > > > > > > 
> > > > > > > > > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > > > > > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > > > > > > > > call comes for volatile mount, it will return an error. But that's
> > > > > > > > > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > > > > > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > > > > > > > > is a new error after the volatile mount, right?
> > > > > > > > > > > > 
> > > > > > > > > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > > > > > > > > just returns existing value of errseq. And then we will have to
> > > > > > > > > > > > do something about errseq_check() to not return an error if "since"
> > > > > > > > > > > > and "eseq" differ only by "seen" bit.
> > > > > > > > > > > > 
> > > > > > > > > > > > Otherwise in current form, volatile mount will always return error
> > > > > > > > > > > > if upperdir has error and it has not been seen by anybody.
> > > > > > > > > > > > 
> > > > > > > > > > > > How did you finally end up testing the error case. Want to simualate
> > > > > > > > > > > > error aritificially and test it.
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > If you don't want to see errors that occurred before you did the mount,
> > > > > > > > > > > then you probably can just resurrect and rename the original version of
> > > > > > > > > > > errseq_sample. Something like this, but with a different name:
> > > > > > > > > > > 
> > > > > > > > > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > > > > > > > > +{
> > > > > > > > > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > > > > > > > > +       errseq_t new = old;
> > > > > > > > > > > +
> > > > > > > > > > > +       /*
> > > > > > > > > > > +        * For the common case of no errors ever having been set, we can skip
> > > > > > > > > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > > > > > > > > +        * never go back to zero.
> > > > > > > > > > > +        */
> > > > > > > > > > > +       if (old != 0) {
> > > > > > > > > > > +               new |= ERRSEQ_SEEN;
> > > > > > > > > > > +               if (old != new)
> > > > > > > > > > > +                       cmpxchg(eseq, old, new);
> > > > > > > > > > > +       }
> > > > > > > > > > > +       return new;
> > > > > > > > > > > +}
> > > > > > > > > > 
> > > > > > > > > > Yes, a helper like this should solve the issue at hand. We are not
> > > > > > > > > > interested in previous errors. This also sets the ERRSEQ_SEEN on
> > > > > > > > > > sample and it will also solve the other issue when after sampling
> > > > > > > > > > if error gets seen, we don't want errseq_check() to return error.
> > > > > > > > > > 
> > > > > > > > > > Thinking of some possible names for new function.
> > > > > > > > > > 
> > > > > > > > > > errseq_sample_seen()
> > > > > > > > > > errseq_sample_set_seen()
> > > > > > > > > > errseq_sample_consume_unseen()
> > > > > > > > > > errseq_sample_current()
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > > > > > > > > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > > > > > > > > 
> > > > > > > > > If you don't want to add a new helper, I think you'd probably also be
> > > > > > > > > able to do something like this in fill_super:
> > > > > > > > > 
> > > > > > > > >     errseq_sample()
> > > > > > > > >     errseq_check_and_advance()
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > ...and just ignore the error returned by the check and advance. At that
> > > > > > > > > point, the cursor should be caught up and any subsequent syncfs call
> > > > > > > > > should return 0 until you record another error. It's a little less
> > > > > > > > > efficient, but only slightly so.
> > > > > > > > 
> > > > > > > > This seems even better.
> > > > > > > > 
> > > > > > > > Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> > > > > > > > sample. In our case, that would mean that we consumed an unseen error but
> > > > > > > > never reported it back to user space. And then somebody might complain.
> > > > > > > > 
> > > > > > > > This kind of reminds me posgresql's fsync issues where they did
> > > > > > > > writes using one fd and another thread opened another fd and
> > > > > > > > did sync and they expected any errors to be reported.
> > > > > > > > 
> > > > > > > 
> > > > > > > > Similary what if an unseen error is present on superblock on upper
> > > > > > > > and if we mount volatile overlay and mark the error SEEN, then
> > > > > > > > if another process opens a file on upper and did syncfs(), it will
> > > > > > > > complain that exisiting error was not reported to it.
> > > > > > > > 
> > > > > > > > Overlay use case seems to be that we just want to check if an error
> > > > > > > > has happened on upper superblock since we sampled it and don't
> > > > > > > > want to consume that error as such. Will it make sense to introduce
> > > > > > > > two helpers for error sampling and error checking which mask the
> > > > > > > > SEEN bit and don't do anything with it. For example, following compile
> > > > > > > > tested only patch.
> > > > > > > > 
> > > > > > > > Now we will not touch SEEN bit at all. And even if SEEN gets set
> > > > > > > > since we sampled, errseq_check_mask_seen() will not flag it as
> > > > > > > > error.
> > > > > > > > 
> > > > > > > > Thanks
> > > > > > > > Vivek
> > > > > > > > 
> > > > > > > 
> > > > > > > Again, you're not really hiding this from anyone doing something _sane_.
> > > > > > > You're only hiding an error from someone who opens the file after an
> > > > > > > error occurs and expects to see an error.
> > > > > > > 
> > > > > > > That was the behavior for fsync before we switched to errseq_t, and we
> > > > > > > had to change errseq_sample for applications that relied on that. syncfs
> > > > > > > reporting these errors is pretty new however. I don't think we
> > > > > > > necessarily need to make the same guarantees there.
> > > > > > > 
> > > > > > > The solution to all of these problems is to ensure that you open the
> > > > > > > files early you're issuing syncfs on and keep them open. Then you'll
> > > > > > > always see any subsequent errors.
> > > > > > 
> > > > > > Ok. I guess we will have to set SEEN bit during error_sample otherwise,
> > > > > > we miss errors. I had missed this point.
> > > > > > 
> > > > > > So mounting a volatile overlay instance will become somewhat
> > > > > > equivalent of as if somebody did a syncfs on upper, consumed
> > > > > > error and did not do anything about it.
> > > > > > 
> > > > > > If a user cares about not losing such errors, they need to keep an
> > > > > > fd open on upper.
> > > > > > 
> > > > > > /me hopes that this does not become an issue for somebody. Even
> > > > > > if it does, one workaround can be don't do volatile overlay or
> > > > > > don't share overlay upper with other conflicting workload.
> > > > > > 
> > > > > 
> > > > > Yeah, there are limits to what we can do with 32 bits.
> > > > > 
> > > > > It's not pretty, but I guess you could pr_warn at mount time if you find
> > > > > an unseen error. That would at least not completely drop it on the
> > > > > floor.
> > > > > 
> > > > > --
> > > > > Jeff Layton <jlayton@redhat.com>
> > > > > 
> > > > 
> > > > If I may enumerate our choices to help my own understanding, and
> > > > come up with a decent decision on how to proceed:
> > > > 
> > > > 1. If the filesystem has an unseen error, pr_warn.
> > > > 2. If the filesystem has an unseen error, refuse to mount it until
> > > >    the user clears the error (via syncfs?).
> > > > 3. Ignore the beginning state of the upperdir
> > > > 4. Increment the errseq_t.
> > > > 5. A combination of #1, and #2 and require the user to mount
> > > >    -o reallyvolatile or smoe such.
> > > > 
> > > > Now the downsides of each of these options:
> > > > 
> > > > 1. The user probably won't look at these errors. Especially,
> > > >    if the application is a container runtime, and these are
> > > >    happening on behalf of the application in an automated fashion.
> > > > 2. Forcing a syncfs on most filesystems is a massively costly
> > > >    operation that we want to avoid with the volatile operation.
> > > >    Also, go back to #1. Until we implement the new FS API, we
> > > >    can't easily give meaningful warnings to users that they
> > > >    can programatically act on (unless we use some special errno).
> > > > 3. This is a noop.
> > > > 4. We can hide errors from other users of the upperdir if they
> > > >    rely on syncfs semantics rather than per-fd fsync semantics
> > > >    to check if the filesystem is "clean".
> > > > 5. See the issues with #1 and #2.
> > > > 
> > > > I'm also curious as to how the patchset that allows for partial
> > > > sync is going to deal with this problem [1].
> > > > 
> > > > There is one other proposal I have, which is we define errseq_t
> > > > as two structures:
> > > > -errseq_t errseq_set(errseq_t *eseq, int err);
> > > > +/* For use on the publishing-side of errseq */
> > > > +struct errseq_publisher {
> > > > +        atomic_t        errors;
> > > > +        errseq_t        errseq_t
> > > > +};
> > > > +
> > > > +errseq_t errseq_set(struct errseq_publisher *eseq, int err);
> > > > 
> > > > And errseq_publisher is on the superblock, and errors is always incremented no
> > > > matter what. We risk wrapping, but I think this falls into Jeff's "sane" test --
> > > > if there are 2**32+ errors without someone doing an fsync, or noticing, you
> > > > might have other problems.
> > > > 
> > > > This has two (and a half) downsides:
> > > > 1. It is a potential performance concern to introduce an atomic here.
> > > 
> > > Is updation of errseq_t performance sensitive path. Updation happens in
> > > error path and I would think that does not happen often. If that's the
> > > case, it should not be very performance sensitive path.
> > > 
> > > I agree that warning on unseen error is probably not enough. Applications
> > > can't do much with that. To me it boils down to two options.
> > > 
> > > A. Live with the idea of swallowing the unseen error on syncfs (if caller
> > >    did not keep an fd open).
> > > 
> > > B. Extend errseq infrastcture in such a way so that we can detect new
> > >    error without marking error SEEN.
> > > 
> > > It feels as if B is a safter choice but will be more work. With A, problem
> > > is that behavior will be different in difference scenarios and it will then
> > > become difficult to justify.
> > > 
> > > - fsync and syncfs behavior will be different w.r.t UNSEEN error.
> > > - syncfs behavior will be different depending on if volatile overlay
> > >   mounts are being used on this filesystem or not.
> > > 
> > 
> > (cc'ing Willy since he helped a lot with this work)
> > 
> > The design for fsync is a bit odd, in that we had to preserve historical
> > behavior. Note that we didn't get it right at first. Our original
> > assumption was that applications wouldn't expect to see any writeback
> > errors that occurred before they opened the file. That turned out to be
> > wrong, and Willy fixed that by ensuring that unseen errors would be
> > reported once to the next task to do an fsync [1].
> > 
> > I'm not sure how you could change the behavior to accomodate the desire
> > for B. One idea: you could do away with the optimization that doesn't
> > bump the counter and record a new error when no one has seen the
> > previous one yet, and it's the same error. That would increase the
> > chances of the counter wrapping around however.
> > 
> 
> I personally like this approach. If the concerns of counter-wrap or performance
> are well-founded, what do you think of adding the atomic which is
> incremented only when the seen bit is unset, effectively giving publishers
> 52 (32 + 20) bits of space to prevent wraparound? And subscribers can
> optionally opt to have 52-bit values to check against?
> 

That seems a bit hacky. I think if you're going to do that, then you
might as well just build 64-bit errseq_t infrastructure and move to
that. You could still just sample and test against the bottom 32 bits.

The problem there though is that growing struct address_space or struct
file for this is somewhat unpalatable. It would be nice to keep all of
this within 32 bits.

I think the best option is probably to just work to restructure the
syncfs code so that overlayfs can override the logic in the syncfs
syscall wrapper and return an error of its choosing to a syncfs()
syscall.

It's a more radical change, but maybe we could add a new
sb->s_op->syncfs operation, and turn the guts of the old syncfs syscall
wrapper into a generic helper.

The naming is a little confusing with an existing sync_fs op, however,
but maybe we could transition all of the old sync_fs ops to the new one
and get rid of ->sync_fs. That's a more invasive set though, and you'd
need to preserve the existing behavior everywhere.

Once you have that, you could just use the "realfile" in
private_data for your syncfs op and everything should "just work".

-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-03 15:20                     ` Jeff Layton
  2020-12-03 17:08                       ` Sargun Dhillon
@ 2020-12-03 20:31                       ` Vivek Goyal
  1 sibling, 0 replies; 28+ messages in thread
From: Vivek Goyal @ 2020-12-03 20:31 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Sargun Dhillon, Amir Goldstein, linux-fsdevel, linux-unionfs,
	Miklos Szeredi, Matthew Wilcox

On Thu, Dec 03, 2020 at 10:20:23AM -0500, Jeff Layton wrote:
> On Thu, 2020-12-03 at 09:27 -0500, Vivek Goyal wrote:
> > On Thu, Dec 03, 2020 at 10:42:26AM +0000, Sargun Dhillon wrote:
> > > On Wed, Dec 02, 2020 at 04:52:33PM -0500, Jeff Layton wrote:
> > > > On Wed, 2020-12-02 at 16:34 -0500, Vivek Goyal wrote:
> > > > > On Wed, Dec 02, 2020 at 02:26:23PM -0500, Jeff Layton wrote:
> > > > > [..]
> > > > > > > > > > > > +		upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > > > > > > > +		sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > > > > > > > +		sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > > > > > > > +		ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > > > > > > > 
> > > > > > > > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > > > > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > > > > > > > call comes for volatile mount, it will return an error. But that's
> > > > > > > > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > > > > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > > > > > > > is a new error after the volatile mount, right?
> > > > > > > > > > > 
> > > > > > > > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > > > > > > > just returns existing value of errseq. And then we will have to
> > > > > > > > > > > do something about errseq_check() to not return an error if "since"
> > > > > > > > > > > and "eseq" differ only by "seen" bit.
> > > > > > > > > > > 
> > > > > > > > > > > Otherwise in current form, volatile mount will always return error
> > > > > > > > > > > if upperdir has error and it has not been seen by anybody.
> > > > > > > > > > > 
> > > > > > > > > > > How did you finally end up testing the error case. Want to simualate
> > > > > > > > > > > error aritificially and test it.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > If you don't want to see errors that occurred before you did the mount,
> > > > > > > > > > then you probably can just resurrect and rename the original version of
> > > > > > > > > > errseq_sample. Something like this, but with a different name:
> > > > > > > > > > 
> > > > > > > > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > > > > > > > +{
> > > > > > > > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > > > > > > > +       errseq_t new = old;
> > > > > > > > > > +
> > > > > > > > > > +       /*
> > > > > > > > > > +        * For the common case of no errors ever having been set, we can skip
> > > > > > > > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > > > > > > > +        * never go back to zero.
> > > > > > > > > > +        */
> > > > > > > > > > +       if (old != 0) {
> > > > > > > > > > +               new |= ERRSEQ_SEEN;
> > > > > > > > > > +               if (old != new)
> > > > > > > > > > +                       cmpxchg(eseq, old, new);
> > > > > > > > > > +       }
> > > > > > > > > > +       return new;
> > > > > > > > > > +}
> > > > > > > > > 
> > > > > > > > > Yes, a helper like this should solve the issue at hand. We are not
> > > > > > > > > interested in previous errors. This also sets the ERRSEQ_SEEN on 
> > > > > > > > > sample and it will also solve the other issue when after sampling
> > > > > > > > > if error gets seen, we don't want errseq_check() to return error.
> > > > > > > > > 
> > > > > > > > > Thinking of some possible names for new function.
> > > > > > > > > 
> > > > > > > > > errseq_sample_seen()
> > > > > > > > > errseq_sample_set_seen()
> > > > > > > > > errseq_sample_consume_unseen()
> > > > > > > > > errseq_sample_current()
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > > > > > > > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > > > > > > > 
> > > > > > > > If you don't want to add a new helper, I think you'd probably also be
> > > > > > > > able to do something like this in fill_super:
> > > > > > > > 
> > > > > > > >     errseq_sample()
> > > > > > > >     errseq_check_and_advance()
> > > > > > > > 
> > > > > > > > 
> > > > > > > > ...and just ignore the error returned by the check and advance. At that
> > > > > > > > point, the cursor should be caught up and any subsequent syncfs call
> > > > > > > > should return 0 until you record another error. It's a little less
> > > > > > > > efficient, but only slightly so.
> > > > > > > 
> > > > > > > This seems even better.
> > > > > > > 
> > > > > > > Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> > > > > > > sample. In our case, that would mean that we consumed an unseen error but
> > > > > > > never reported it back to user space. And then somebody might complain.
> > > > > > > 
> > > > > > > This kind of reminds me posgresql's fsync issues where they did
> > > > > > > writes using one fd and another thread opened another fd and
> > > > > > > did sync and they expected any errors to be reported.
> > > > > > > 
> > > > > > 
> > > > > > > Similary what if an unseen error is present on superblock on upper
> > > > > > > and if we mount volatile overlay and mark the error SEEN, then
> > > > > > > if another process opens a file on upper and did syncfs(), it will
> > > > > > > complain that exisiting error was not reported to it.
> > > > > > > 
> > > > > > > Overlay use case seems to be that we just want to check if an error
> > > > > > > has happened on upper superblock since we sampled it and don't
> > > > > > > want to consume that error as such. Will it make sense to introduce
> > > > > > > two helpers for error sampling and error checking which mask the
> > > > > > > SEEN bit and don't do anything with it. For example, following compile
> > > > > > > tested only patch.
> > > > > > > 
> > > > > > > Now we will not touch SEEN bit at all. And even if SEEN gets set
> > > > > > > since we sampled, errseq_check_mask_seen() will not flag it as
> > > > > > > error.
> > > > > > > 
> > > > > > > Thanks
> > > > > > > Vivek
> > > > > > > 
> > > > > > 
> > > > > > Again, you're not really hiding this from anyone doing something _sane_.
> > > > > > You're only hiding an error from someone who opens the file after an
> > > > > > error occurs and expects to see an error.
> > > > > > 
> > > > > > That was the behavior for fsync before we switched to errseq_t, and we
> > > > > > had to change errseq_sample for applications that relied on that. syncfs
> > > > > > reporting these errors is pretty new however. I don't think we
> > > > > > necessarily need to make the same guarantees there.
> > > > > > 
> > > > > > The solution to all of these problems is to ensure that you open the
> > > > > > files early you're issuing syncfs on and keep them open. Then you'll
> > > > > > always see any subsequent errors.
> > > > > 
> > > > > Ok. I guess we will have to set SEEN bit during error_sample otherwise,
> > > > > we miss errors. I had missed this point.
> > > > > 
> > > > > So mounting a volatile overlay instance will become somewhat
> > > > > equivalent of as if somebody did a syncfs on upper, consumed
> > > > > error and did not do anything about it.
> > > > > 
> > > > > If a user cares about not losing such errors, they need to keep an
> > > > > fd open on upper. 
> > > > > 
> > > > > /me hopes that this does not become an issue for somebody. Even
> > > > > if it does, one workaround can be don't do volatile overlay or
> > > > > don't share overlay upper with other conflicting workload.
> > > > > 
> > > > 
> > > > Yeah, there are limits to what we can do with 32 bits.
> > > > 
> > > > It's not pretty, but I guess you could pr_warn at mount time if you find
> > > > an unseen error. That would at least not completely drop it on the
> > > > floor.
> > > > 
> > > > -- 
> > > > Jeff Layton <jlayton@redhat.com>
> > > > 
> > > 
> > > If I may enumerate our choices to help my own understanding, and
> > > come up with a decent decision on how to proceed:
> > > 
> > > 1. If the filesystem has an unseen error, pr_warn.
> > > 2. If the filesystem has an unseen error, refuse to mount it until
> > >    the user clears the error (via syncfs?).
> > > 3. Ignore the beginning state of the upperdir
> > > 4. Increment the errseq_t.
> > > 5. A combination of #1, and #2 and require the user to mount
> > >    -o reallyvolatile or smoe such.
> > > 
> > > Now the downsides of each of these options:
> > > 
> > > 1. The user probably won't look at these errors. Especially,
> > >    if the application is a container runtime, and these are
> > >    happening on behalf of the application in an automated fashion.
> > > 2. Forcing a syncfs on most filesystems is a massively costly
> > >    operation that we want to avoid with the volatile operation.
> > >    Also, go back to #1. Until we implement the new FS API, we
> > >    can't easily give meaningful warnings to users that they
> > >    can programatically act on (unless we use some special errno).
> > > 3. This is a noop.
> > > 4. We can hide errors from other users of the upperdir if they
> > >    rely on syncfs semantics rather than per-fd fsync semantics
> > >    to check if the filesystem is "clean".
> > > 5. See the issues with #1 and #2.
> > > 
> > > I'm also curious as to how the patchset that allows for partial
> > > sync is going to deal with this problem [1].
> > > 
> > > There is one other proposal I have, which is we define errseq_t
> > > as two structures:
> > > -errseq_t errseq_set(errseq_t *eseq, int err);
> > > +/* For use on the publishing-side of errseq */
> > > +struct errseq_publisher {
> > > +        atomic_t        errors;
> > > +        errseq_t        errseq_t
> > > +};
> > > +
> > > +errseq_t errseq_set(struct errseq_publisher *eseq, int err);
> > > 
> > > And errseq_publisher is on the superblock, and errors is always incremented no 
> > > matter what. We risk wrapping, but I think this falls into Jeff's "sane" test -- 
> > > if there are 2**32+ errors without someone doing an fsync, or noticing, you 
> > > might have other problems.
> > > 
> > > This has two (and a half) downsides:
> > > 1. It is a potential performance concern to introduce an atomic here.
> > 
> > Is updation of errseq_t performance sensitive path. Updation happens in
> > error path and I would think that does not happen often. If that's the
> > case, it should not be very performance sensitive path.
> > 
> > I agree that warning on unseen error is probably not enough. Applications
> > can't do much with that. To me it boils down to two options.
> > 
> > A. Live with the idea of swallowing the unseen error on syncfs (if caller
> >    did not keep an fd open).
> > 
> > B. Extend errseq infrastcture in such a way so that we can detect new
> >    error without marking error SEEN.
> > 
> > It feels as if B is a safter choice but will be more work. With A, problem
> > is that behavior will be different in difference scenarios and it will then
> > become difficult to justify.
> > 
> > - fsync and syncfs behavior will be different w.r.t UNSEEN error.
> > - syncfs behavior will be different depending on if volatile overlay
> >   mounts are being used on this filesystem or not.
> > 
> 
> (cc'ing Willy since he helped a lot with this work)
> 
> The design for fsync is a bit odd, in that we had to preserve historical
> behavior. Note that we didn't get it right at first. Our original
> assumption was that applications wouldn't expect to see any writeback
> errors that occurred before they opened the file. That turned out to be
> wrong, and Willy fixed that by ensuring that unseen errors would be
> reported once to the next task to do an fsync [1].
> 
> I'm not sure how you could change the behavior to accomodate the desire
> for B. One idea: you could do away with the optimization that doesn't
> bump the counter and record a new error when no one has seen the
> previous one yet, and it's the same error. That would increase the
> chances of the counter wrapping around however.

Bumping up the counter if new error is same as previous one (seen flag
is not set) will solve the issue at hand without changing syncfs
behavior.

But as you said it increases chances of counter wrapping.
And to solve that we probably will have to look at 64 bit errseq_t.

Thanks
Vivek

> 
> It may also be possible to "steal" another bit or two from the counter
> if you see a way to use it.
> 
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b4678df184b314a2bd47d2329feca2c2534aa12b
> 
> 
> -- 
> Jeff Layton <jlayton@redhat.com>
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-03 17:50                         ` Jeff Layton
@ 2020-12-03 20:43                           ` Vivek Goyal
  2020-12-03 21:36                             ` Jeff Layton
  0 siblings, 1 reply; 28+ messages in thread
From: Vivek Goyal @ 2020-12-03 20:43 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Sargun Dhillon, Amir Goldstein, Linux FS-devel Mailing List,
	overlayfs, Miklos Szeredi, Matthew Wilcox

On Thu, Dec 03, 2020 at 12:50:58PM -0500, Jeff Layton wrote:
> On Thu, 2020-12-03 at 09:08 -0800, Sargun Dhillon wrote:
> > On Thu, Dec 3, 2020 at 7:20 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > 
> > > On Thu, 2020-12-03 at 09:27 -0500, Vivek Goyal wrote:
> > > > On Thu, Dec 03, 2020 at 10:42:26AM +0000, Sargun Dhillon wrote:
> > > > > On Wed, Dec 02, 2020 at 04:52:33PM -0500, Jeff Layton wrote:
> > > > > > On Wed, 2020-12-02 at 16:34 -0500, Vivek Goyal wrote:
> > > > > > > On Wed, Dec 02, 2020 at 02:26:23PM -0500, Jeff Layton wrote:
> > > > > > > [..]
> > > > > > > > > > > > > > +         upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > > > > > > > > > +         sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > > > > > > > > > +         sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > > > > > > > > > +         ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > > > > > > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > > > > > > > > > call comes for volatile mount, it will return an error. But that's
> > > > > > > > > > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > > > > > > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > > > > > > > > > is a new error after the volatile mount, right?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > > > > > > > > > just returns existing value of errseq. And then we will have to
> > > > > > > > > > > > > do something about errseq_check() to not return an error if "since"
> > > > > > > > > > > > > and "eseq" differ only by "seen" bit.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Otherwise in current form, volatile mount will always return error
> > > > > > > > > > > > > if upperdir has error and it has not been seen by anybody.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > How did you finally end up testing the error case. Want to simualate
> > > > > > > > > > > > > error aritificially and test it.
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > If you don't want to see errors that occurred before you did the mount,
> > > > > > > > > > > > then you probably can just resurrect and rename the original version of
> > > > > > > > > > > > errseq_sample. Something like this, but with a different name:
> > > > > > > > > > > > 
> > > > > > > > > > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > > > > > > > > > +       errseq_t new = old;
> > > > > > > > > > > > +
> > > > > > > > > > > > +       /*
> > > > > > > > > > > > +        * For the common case of no errors ever having been set, we can skip
> > > > > > > > > > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > > > > > > > > > +        * never go back to zero.
> > > > > > > > > > > > +        */
> > > > > > > > > > > > +       if (old != 0) {
> > > > > > > > > > > > +               new |= ERRSEQ_SEEN;
> > > > > > > > > > > > +               if (old != new)
> > > > > > > > > > > > +                       cmpxchg(eseq, old, new);
> > > > > > > > > > > > +       }
> > > > > > > > > > > > +       return new;
> > > > > > > > > > > > +}
> > > > > > > > > > > 
> > > > > > > > > > > Yes, a helper like this should solve the issue at hand. We are not
> > > > > > > > > > > interested in previous errors. This also sets the ERRSEQ_SEEN on
> > > > > > > > > > > sample and it will also solve the other issue when after sampling
> > > > > > > > > > > if error gets seen, we don't want errseq_check() to return error.
> > > > > > > > > > > 
> > > > > > > > > > > Thinking of some possible names for new function.
> > > > > > > > > > > 
> > > > > > > > > > > errseq_sample_seen()
> > > > > > > > > > > errseq_sample_set_seen()
> > > > > > > > > > > errseq_sample_consume_unseen()
> > > > > > > > > > > errseq_sample_current()
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > > > > > > > > > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > > > > > > > > > 
> > > > > > > > > > If you don't want to add a new helper, I think you'd probably also be
> > > > > > > > > > able to do something like this in fill_super:
> > > > > > > > > > 
> > > > > > > > > >     errseq_sample()
> > > > > > > > > >     errseq_check_and_advance()
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > ...and just ignore the error returned by the check and advance. At that
> > > > > > > > > > point, the cursor should be caught up and any subsequent syncfs call
> > > > > > > > > > should return 0 until you record another error. It's a little less
> > > > > > > > > > efficient, but only slightly so.
> > > > > > > > > 
> > > > > > > > > This seems even better.
> > > > > > > > > 
> > > > > > > > > Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> > > > > > > > > sample. In our case, that would mean that we consumed an unseen error but
> > > > > > > > > never reported it back to user space. And then somebody might complain.
> > > > > > > > > 
> > > > > > > > > This kind of reminds me posgresql's fsync issues where they did
> > > > > > > > > writes using one fd and another thread opened another fd and
> > > > > > > > > did sync and they expected any errors to be reported.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > > Similary what if an unseen error is present on superblock on upper
> > > > > > > > > and if we mount volatile overlay and mark the error SEEN, then
> > > > > > > > > if another process opens a file on upper and did syncfs(), it will
> > > > > > > > > complain that exisiting error was not reported to it.
> > > > > > > > > 
> > > > > > > > > Overlay use case seems to be that we just want to check if an error
> > > > > > > > > has happened on upper superblock since we sampled it and don't
> > > > > > > > > want to consume that error as such. Will it make sense to introduce
> > > > > > > > > two helpers for error sampling and error checking which mask the
> > > > > > > > > SEEN bit and don't do anything with it. For example, following compile
> > > > > > > > > tested only patch.
> > > > > > > > > 
> > > > > > > > > Now we will not touch SEEN bit at all. And even if SEEN gets set
> > > > > > > > > since we sampled, errseq_check_mask_seen() will not flag it as
> > > > > > > > > error.
> > > > > > > > > 
> > > > > > > > > Thanks
> > > > > > > > > Vivek
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Again, you're not really hiding this from anyone doing something _sane_.
> > > > > > > > You're only hiding an error from someone who opens the file after an
> > > > > > > > error occurs and expects to see an error.
> > > > > > > > 
> > > > > > > > That was the behavior for fsync before we switched to errseq_t, and we
> > > > > > > > had to change errseq_sample for applications that relied on that. syncfs
> > > > > > > > reporting these errors is pretty new however. I don't think we
> > > > > > > > necessarily need to make the same guarantees there.
> > > > > > > > 
> > > > > > > > The solution to all of these problems is to ensure that you open the
> > > > > > > > files early you're issuing syncfs on and keep them open. Then you'll
> > > > > > > > always see any subsequent errors.
> > > > > > > 
> > > > > > > Ok. I guess we will have to set SEEN bit during error_sample otherwise,
> > > > > > > we miss errors. I had missed this point.
> > > > > > > 
> > > > > > > So mounting a volatile overlay instance will become somewhat
> > > > > > > equivalent of as if somebody did a syncfs on upper, consumed
> > > > > > > error and did not do anything about it.
> > > > > > > 
> > > > > > > If a user cares about not losing such errors, they need to keep an
> > > > > > > fd open on upper.
> > > > > > > 
> > > > > > > /me hopes that this does not become an issue for somebody. Even
> > > > > > > if it does, one workaround can be don't do volatile overlay or
> > > > > > > don't share overlay upper with other conflicting workload.
> > > > > > > 
> > > > > > 
> > > > > > Yeah, there are limits to what we can do with 32 bits.
> > > > > > 
> > > > > > It's not pretty, but I guess you could pr_warn at mount time if you find
> > > > > > an unseen error. That would at least not completely drop it on the
> > > > > > floor.
> > > > > > 
> > > > > > --
> > > > > > Jeff Layton <jlayton@redhat.com>
> > > > > > 
> > > > > 
> > > > > If I may enumerate our choices to help my own understanding, and
> > > > > come up with a decent decision on how to proceed:
> > > > > 
> > > > > 1. If the filesystem has an unseen error, pr_warn.
> > > > > 2. If the filesystem has an unseen error, refuse to mount it until
> > > > >    the user clears the error (via syncfs?).
> > > > > 3. Ignore the beginning state of the upperdir
> > > > > 4. Increment the errseq_t.
> > > > > 5. A combination of #1, and #2 and require the user to mount
> > > > >    -o reallyvolatile or smoe such.
> > > > > 
> > > > > Now the downsides of each of these options:
> > > > > 
> > > > > 1. The user probably won't look at these errors. Especially,
> > > > >    if the application is a container runtime, and these are
> > > > >    happening on behalf of the application in an automated fashion.
> > > > > 2. Forcing a syncfs on most filesystems is a massively costly
> > > > >    operation that we want to avoid with the volatile operation.
> > > > >    Also, go back to #1. Until we implement the new FS API, we
> > > > >    can't easily give meaningful warnings to users that they
> > > > >    can programatically act on (unless we use some special errno).
> > > > > 3. This is a noop.
> > > > > 4. We can hide errors from other users of the upperdir if they
> > > > >    rely on syncfs semantics rather than per-fd fsync semantics
> > > > >    to check if the filesystem is "clean".
> > > > > 5. See the issues with #1 and #2.
> > > > > 
> > > > > I'm also curious as to how the patchset that allows for partial
> > > > > sync is going to deal with this problem [1].
> > > > > 
> > > > > There is one other proposal I have, which is we define errseq_t
> > > > > as two structures:
> > > > > -errseq_t errseq_set(errseq_t *eseq, int err);
> > > > > +/* For use on the publishing-side of errseq */
> > > > > +struct errseq_publisher {
> > > > > +        atomic_t        errors;
> > > > > +        errseq_t        errseq_t
> > > > > +};
> > > > > +
> > > > > +errseq_t errseq_set(struct errseq_publisher *eseq, int err);
> > > > > 
> > > > > And errseq_publisher is on the superblock, and errors is always incremented no
> > > > > matter what. We risk wrapping, but I think this falls into Jeff's "sane" test --
> > > > > if there are 2**32+ errors without someone doing an fsync, or noticing, you
> > > > > might have other problems.
> > > > > 
> > > > > This has two (and a half) downsides:
> > > > > 1. It is a potential performance concern to introduce an atomic here.
> > > > 
> > > > Is updation of errseq_t performance sensitive path. Updation happens in
> > > > error path and I would think that does not happen often. If that's the
> > > > case, it should not be very performance sensitive path.
> > > > 
> > > > I agree that warning on unseen error is probably not enough. Applications
> > > > can't do much with that. To me it boils down to two options.
> > > > 
> > > > A. Live with the idea of swallowing the unseen error on syncfs (if caller
> > > >    did not keep an fd open).
> > > > 
> > > > B. Extend errseq infrastcture in such a way so that we can detect new
> > > >    error without marking error SEEN.
> > > > 
> > > > It feels as if B is a safter choice but will be more work. With A, problem
> > > > is that behavior will be different in difference scenarios and it will then
> > > > become difficult to justify.
> > > > 
> > > > - fsync and syncfs behavior will be different w.r.t UNSEEN error.
> > > > - syncfs behavior will be different depending on if volatile overlay
> > > >   mounts are being used on this filesystem or not.
> > > > 
> > > 
> > > (cc'ing Willy since he helped a lot with this work)
> > > 
> > > The design for fsync is a bit odd, in that we had to preserve historical
> > > behavior. Note that we didn't get it right at first. Our original
> > > assumption was that applications wouldn't expect to see any writeback
> > > errors that occurred before they opened the file. That turned out to be
> > > wrong, and Willy fixed that by ensuring that unseen errors would be
> > > reported once to the next task to do an fsync [1].
> > > 
> > > I'm not sure how you could change the behavior to accomodate the desire
> > > for B. One idea: you could do away with the optimization that doesn't
> > > bump the counter and record a new error when no one has seen the
> > > previous one yet, and it's the same error. That would increase the
> > > chances of the counter wrapping around however.
> > > 
> > 
> > I personally like this approach. If the concerns of counter-wrap or performance
> > are well-founded, what do you think of adding the atomic which is
> > incremented only when the seen bit is unset, effectively giving publishers
> > 52 (32 + 20) bits of space to prevent wraparound? And subscribers can
> > optionally opt to have 52-bit values to check against?
> > 
> 
> That seems a bit hacky. I think if you're going to do that, then you
> might as well just build 64-bit errseq_t infrastructure and move to
> that. You could still just sample and test against the bottom 32 bits.
> 
> The problem there though is that growing struct address_space or struct
> file for this is somewhat unpalatable. It would be nice to keep all of
> this within 32 bits.
> 
> I think the best option is probably to just work to restructure the
> syncfs code so that overlayfs can override the logic in the syncfs
> syscall wrapper and return an error of its choosing to a syncfs()
> syscall.
> 
> It's a more radical change, but maybe we could add a new
> sb->s_op->syncfs operation, and turn the guts of the old syncfs syscall
> wrapper into a generic helper.
> 
> The naming is a little confusing with an existing sync_fs op, however,
> but maybe we could transition all of the old sync_fs ops to the new one
> and get rid of ->sync_fs. That's a more invasive set though, and you'd
> need to preserve the existing behavior everywhere.
> 
> Once you have that, you could just use the "realfile" in
> private_data for your syncfs op and everything should "just work".

This should work for regular overlayfs where calling syncfs() on
overlay will result in calling syncfs() on underlying filesystem 
and help propagating error back, IIUC.

But for this specific use case, of volatile mount, we want to find
out if underlying filesystem superblock had a writeback error since
we mounted overlay instance, without calling syncfs() on underlying
filesystem. And we don't want to consume existing error on underlying
super block fearing applications will complain.

So key requirement here seems to be being able to detect error
on underlying superblock without consuming the unseen error.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-03 20:43                           ` Vivek Goyal
@ 2020-12-03 21:36                             ` Jeff Layton
  2020-12-03 22:24                               ` Vivek Goyal
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Layton @ 2020-12-03 21:36 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Sargun Dhillon, Amir Goldstein, Linux FS-devel Mailing List,
	overlayfs, Miklos Szeredi, Matthew Wilcox

On Thu, 2020-12-03 at 15:43 -0500, Vivek Goyal wrote:
> On Thu, Dec 03, 2020 at 12:50:58PM -0500, Jeff Layton wrote:
> > On Thu, 2020-12-03 at 09:08 -0800, Sargun Dhillon wrote:
> > > On Thu, Dec 3, 2020 at 7:20 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > > 
> > > > On Thu, 2020-12-03 at 09:27 -0500, Vivek Goyal wrote:
> > > > > On Thu, Dec 03, 2020 at 10:42:26AM +0000, Sargun Dhillon wrote:
> > > > > > On Wed, Dec 02, 2020 at 04:52:33PM -0500, Jeff Layton wrote:
> > > > > > > On Wed, 2020-12-02 at 16:34 -0500, Vivek Goyal wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 02:26:23PM -0500, Jeff Layton wrote:
> > > > > > > > [..]
> > > > > > > > > > > > > > > +         upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > > > > > > > > > > +         sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > > > > > > > > > > +         sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > > > > > > > > > > +         ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > > > > > > > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > > > > > > > > > > call comes for volatile mount, it will return an error. But that's
> > > > > > > > > > > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > > > > > > > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > > > > > > > > > > is a new error after the volatile mount, right?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > > > > > > > > > > just returns existing value of errseq. And then we will have to
> > > > > > > > > > > > > > do something about errseq_check() to not return an error if "since"
> > > > > > > > > > > > > > and "eseq" differ only by "seen" bit.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Otherwise in current form, volatile mount will always return error
> > > > > > > > > > > > > > if upperdir has error and it has not been seen by anybody.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > How did you finally end up testing the error case. Want to simualate
> > > > > > > > > > > > > > error aritificially and test it.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > If you don't want to see errors that occurred before you did the mount,
> > > > > > > > > > > > > then you probably can just resurrect and rename the original version of
> > > > > > > > > > > > > errseq_sample. Something like this, but with a different name:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > > > > > > > > > > +{
> > > > > > > > > > > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > > > > > > > > > > +       errseq_t new = old;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +       /*
> > > > > > > > > > > > > +        * For the common case of no errors ever having been set, we can skip
> > > > > > > > > > > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > > > > > > > > > > +        * never go back to zero.
> > > > > > > > > > > > > +        */
> > > > > > > > > > > > > +       if (old != 0) {
> > > > > > > > > > > > > +               new |= ERRSEQ_SEEN;
> > > > > > > > > > > > > +               if (old != new)
> > > > > > > > > > > > > +                       cmpxchg(eseq, old, new);
> > > > > > > > > > > > > +       }
> > > > > > > > > > > > > +       return new;
> > > > > > > > > > > > > +}
> > > > > > > > > > > > 
> > > > > > > > > > > > Yes, a helper like this should solve the issue at hand. We are not
> > > > > > > > > > > > interested in previous errors. This also sets the ERRSEQ_SEEN on
> > > > > > > > > > > > sample and it will also solve the other issue when after sampling
> > > > > > > > > > > > if error gets seen, we don't want errseq_check() to return error.
> > > > > > > > > > > > 
> > > > > > > > > > > > Thinking of some possible names for new function.
> > > > > > > > > > > > 
> > > > > > > > > > > > errseq_sample_seen()
> > > > > > > > > > > > errseq_sample_set_seen()
> > > > > > > > > > > > errseq_sample_consume_unseen()
> > > > > > > > > > > > errseq_sample_current()
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > > > > > > > > > > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > > > > > > > > > > 
> > > > > > > > > > > If you don't want to add a new helper, I think you'd probably also be
> > > > > > > > > > > able to do something like this in fill_super:
> > > > > > > > > > > 
> > > > > > > > > > >     errseq_sample()
> > > > > > > > > > >     errseq_check_and_advance()
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > ...and just ignore the error returned by the check and advance. At that
> > > > > > > > > > > point, the cursor should be caught up and any subsequent syncfs call
> > > > > > > > > > > should return 0 until you record another error. It's a little less
> > > > > > > > > > > efficient, but only slightly so.
> > > > > > > > > > 
> > > > > > > > > > This seems even better.
> > > > > > > > > > 
> > > > > > > > > > Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> > > > > > > > > > sample. In our case, that would mean that we consumed an unseen error but
> > > > > > > > > > never reported it back to user space. And then somebody might complain.
> > > > > > > > > > 
> > > > > > > > > > This kind of reminds me posgresql's fsync issues where they did
> > > > > > > > > > writes using one fd and another thread opened another fd and
> > > > > > > > > > did sync and they expected any errors to be reported.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Similary what if an unseen error is present on superblock on upper
> > > > > > > > > > and if we mount volatile overlay and mark the error SEEN, then
> > > > > > > > > > if another process opens a file on upper and did syncfs(), it will
> > > > > > > > > > complain that exisiting error was not reported to it.
> > > > > > > > > > 
> > > > > > > > > > Overlay use case seems to be that we just want to check if an error
> > > > > > > > > > has happened on upper superblock since we sampled it and don't
> > > > > > > > > > want to consume that error as such. Will it make sense to introduce
> > > > > > > > > > two helpers for error sampling and error checking which mask the
> > > > > > > > > > SEEN bit and don't do anything with it. For example, following compile
> > > > > > > > > > tested only patch.
> > > > > > > > > > 
> > > > > > > > > > Now we will not touch SEEN bit at all. And even if SEEN gets set
> > > > > > > > > > since we sampled, errseq_check_mask_seen() will not flag it as
> > > > > > > > > > error.
> > > > > > > > > > 
> > > > > > > > > > Thanks
> > > > > > > > > > Vivek
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Again, you're not really hiding this from anyone doing something _sane_.
> > > > > > > > > You're only hiding an error from someone who opens the file after an
> > > > > > > > > error occurs and expects to see an error.
> > > > > > > > > 
> > > > > > > > > That was the behavior for fsync before we switched to errseq_t, and we
> > > > > > > > > had to change errseq_sample for applications that relied on that. syncfs
> > > > > > > > > reporting these errors is pretty new however. I don't think we
> > > > > > > > > necessarily need to make the same guarantees there.
> > > > > > > > > 
> > > > > > > > > The solution to all of these problems is to ensure that you open the
> > > > > > > > > files early you're issuing syncfs on and keep them open. Then you'll
> > > > > > > > > always see any subsequent errors.
> > > > > > > > 
> > > > > > > > Ok. I guess we will have to set SEEN bit during error_sample otherwise,
> > > > > > > > we miss errors. I had missed this point.
> > > > > > > > 
> > > > > > > > So mounting a volatile overlay instance will become somewhat
> > > > > > > > equivalent of as if somebody did a syncfs on upper, consumed
> > > > > > > > error and did not do anything about it.
> > > > > > > > 
> > > > > > > > If a user cares about not losing such errors, they need to keep an
> > > > > > > > fd open on upper.
> > > > > > > > 
> > > > > > > > /me hopes that this does not become an issue for somebody. Even
> > > > > > > > if it does, one workaround can be don't do volatile overlay or
> > > > > > > > don't share overlay upper with other conflicting workload.
> > > > > > > > 
> > > > > > > 
> > > > > > > Yeah, there are limits to what we can do with 32 bits.
> > > > > > > 
> > > > > > > It's not pretty, but I guess you could pr_warn at mount time if you find
> > > > > > > an unseen error. That would at least not completely drop it on the
> > > > > > > floor.
> > > > > > > 
> > > > > > > --
> > > > > > > Jeff Layton <jlayton@redhat.com>
> > > > > > > 
> > > > > > 
> > > > > > If I may enumerate our choices to help my own understanding, and
> > > > > > come up with a decent decision on how to proceed:
> > > > > > 
> > > > > > 1. If the filesystem has an unseen error, pr_warn.
> > > > > > 2. If the filesystem has an unseen error, refuse to mount it until
> > > > > >    the user clears the error (via syncfs?).
> > > > > > 3. Ignore the beginning state of the upperdir
> > > > > > 4. Increment the errseq_t.
> > > > > > 5. A combination of #1, and #2 and require the user to mount
> > > > > >    -o reallyvolatile or smoe such.
> > > > > > 
> > > > > > Now the downsides of each of these options:
> > > > > > 
> > > > > > 1. The user probably won't look at these errors. Especially,
> > > > > >    if the application is a container runtime, and these are
> > > > > >    happening on behalf of the application in an automated fashion.
> > > > > > 2. Forcing a syncfs on most filesystems is a massively costly
> > > > > >    operation that we want to avoid with the volatile operation.
> > > > > >    Also, go back to #1. Until we implement the new FS API, we
> > > > > >    can't easily give meaningful warnings to users that they
> > > > > >    can programatically act on (unless we use some special errno).
> > > > > > 3. This is a noop.
> > > > > > 4. We can hide errors from other users of the upperdir if they
> > > > > >    rely on syncfs semantics rather than per-fd fsync semantics
> > > > > >    to check if the filesystem is "clean".
> > > > > > 5. See the issues with #1 and #2.
> > > > > > 
> > > > > > I'm also curious as to how the patchset that allows for partial
> > > > > > sync is going to deal with this problem [1].
> > > > > > 
> > > > > > There is one other proposal I have, which is we define errseq_t
> > > > > > as two structures:
> > > > > > -errseq_t errseq_set(errseq_t *eseq, int err);
> > > > > > +/* For use on the publishing-side of errseq */
> > > > > > +struct errseq_publisher {
> > > > > > +        atomic_t        errors;
> > > > > > +        errseq_t        errseq_t
> > > > > > +};
> > > > > > +
> > > > > > +errseq_t errseq_set(struct errseq_publisher *eseq, int err);
> > > > > > 
> > > > > > And errseq_publisher is on the superblock, and errors is always incremented no
> > > > > > matter what. We risk wrapping, but I think this falls into Jeff's "sane" test --
> > > > > > if there are 2**32+ errors without someone doing an fsync, or noticing, you
> > > > > > might have other problems.
> > > > > > 
> > > > > > This has two (and a half) downsides:
> > > > > > 1. It is a potential performance concern to introduce an atomic here.
> > > > > 
> > > > > Is updation of errseq_t performance sensitive path. Updation happens in
> > > > > error path and I would think that does not happen often. If that's the
> > > > > case, it should not be very performance sensitive path.
> > > > > 
> > > > > I agree that warning on unseen error is probably not enough. Applications
> > > > > can't do much with that. To me it boils down to two options.
> > > > > 
> > > > > A. Live with the idea of swallowing the unseen error on syncfs (if caller
> > > > >    did not keep an fd open).
> > > > > 
> > > > > B. Extend errseq infrastcture in such a way so that we can detect new
> > > > >    error without marking error SEEN.
> > > > > 
> > > > > It feels as if B is a safter choice but will be more work. With A, problem
> > > > > is that behavior will be different in difference scenarios and it will then
> > > > > become difficult to justify.
> > > > > 
> > > > > - fsync and syncfs behavior will be different w.r.t UNSEEN error.
> > > > > - syncfs behavior will be different depending on if volatile overlay
> > > > >   mounts are being used on this filesystem or not.
> > > > > 
> > > > 
> > > > (cc'ing Willy since he helped a lot with this work)
> > > > 
> > > > The design for fsync is a bit odd, in that we had to preserve historical
> > > > behavior. Note that we didn't get it right at first. Our original
> > > > assumption was that applications wouldn't expect to see any writeback
> > > > errors that occurred before they opened the file. That turned out to be
> > > > wrong, and Willy fixed that by ensuring that unseen errors would be
> > > > reported once to the next task to do an fsync [1].
> > > > 
> > > > I'm not sure how you could change the behavior to accomodate the desire
> > > > for B. One idea: you could do away with the optimization that doesn't
> > > > bump the counter and record a new error when no one has seen the
> > > > previous one yet, and it's the same error. That would increase the
> > > > chances of the counter wrapping around however.
> > > > 
> > > 
> > > I personally like this approach. If the concerns of counter-wrap or performance
> > > are well-founded, what do you think of adding the atomic which is
> > > incremented only when the seen bit is unset, effectively giving publishers
> > > 52 (32 + 20) bits of space to prevent wraparound? And subscribers can
> > > optionally opt to have 52-bit values to check against?
> > > 
> > 
> > That seems a bit hacky. I think if you're going to do that, then you
> > might as well just build 64-bit errseq_t infrastructure and move to
> > that. You could still just sample and test against the bottom 32 bits.
> > 
> > The problem there though is that growing struct address_space or struct
> > file for this is somewhat unpalatable. It would be nice to keep all of
> > this within 32 bits.
> > 
> > I think the best option is probably to just work to restructure the
> > syncfs code so that overlayfs can override the logic in the syncfs
> > syscall wrapper and return an error of its choosing to a syncfs()
> > syscall.
> > 
> > It's a more radical change, but maybe we could add a new
> > sb->s_op->syncfs operation, and turn the guts of the old syncfs syscall
> > wrapper into a generic helper.
> > 
> > The naming is a little confusing with an existing sync_fs op, however,
> > but maybe we could transition all of the old sync_fs ops to the new one
> > and get rid of ->sync_fs. That's a more invasive set though, and you'd
> > need to preserve the existing behavior everywhere.
> > 
> > Once you have that, you could just use the "realfile" in
> > private_data for your syncfs op and everything should "just work".
> 
> This should work for regular overlayfs where calling syncfs() on
> overlay will result in calling syncfs() on underlying filesystem 
> and help propagating error back, IIUC.
> 
> But for this specific use case, of volatile mount, we want to find
> out if underlying filesystem superblock had a writeback error since
> we mounted overlay instance, without calling syncfs() on underlying
> filesystem. And we don't want to consume existing error on underlying
> super block fearing applications will complain.
> 

Why? If you're not planning to do anything with the error then why check
for it in the first place? Maybe I don't really understand what you're
trying to do with this.

If it turns out that you just want to see if it ever had an error, you
can always use errseq_check with 0 as the "since" value and that will
tell you without marking or advancing anything. It's not clear to me
what the value of that is here though.

> So key requirement here seems to be being able to detect error
> on underlying superblock without consuming the unseen error.
> 

I think for overlayfs what you really want to do is basically "proxy"
fsync and syncfs calls to the upper layer. Then you should just be able
to use the upper layer's "realfile" when doing fsync/syncfs. You won't
need to sample anything at mount time that way as it should just happen
naturally when you open files on overlayfs.

That does mean you may need to rework how the syncfs syscall dispatches
to the filesystem, but that's not too difficult in principle.

-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-03 21:36                             ` Jeff Layton
@ 2020-12-03 22:24                               ` Vivek Goyal
  2020-12-03 23:36                                 ` Jeff Layton
  0 siblings, 1 reply; 28+ messages in thread
From: Vivek Goyal @ 2020-12-03 22:24 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Sargun Dhillon, Amir Goldstein, Linux FS-devel Mailing List,
	overlayfs, Miklos Szeredi, Matthew Wilcox

On Thu, Dec 03, 2020 at 04:36:45PM -0500, Jeff Layton wrote:
> On Thu, 2020-12-03 at 15:43 -0500, Vivek Goyal wrote:
> > On Thu, Dec 03, 2020 at 12:50:58PM -0500, Jeff Layton wrote:
> > > On Thu, 2020-12-03 at 09:08 -0800, Sargun Dhillon wrote:
> > > > On Thu, Dec 3, 2020 at 7:20 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > > > 
> > > > > On Thu, 2020-12-03 at 09:27 -0500, Vivek Goyal wrote:
> > > > > > On Thu, Dec 03, 2020 at 10:42:26AM +0000, Sargun Dhillon wrote:
> > > > > > > On Wed, Dec 02, 2020 at 04:52:33PM -0500, Jeff Layton wrote:
> > > > > > > > On Wed, 2020-12-02 at 16:34 -0500, Vivek Goyal wrote:
> > > > > > > > > On Wed, Dec 02, 2020 at 02:26:23PM -0500, Jeff Layton wrote:
> > > > > > > > > [..]
> > > > > > > > > > > > > > > > +         upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > > > > > > > > > > > +         sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > > > > > > > > > > > +         sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > > > > > > > > > > > +         ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > > > > > > > > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > > > > > > > > > > > call comes for volatile mount, it will return an error. But that's
> > > > > > > > > > > > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > > > > > > > > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > > > > > > > > > > > is a new error after the volatile mount, right?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > > > > > > > > > > > just returns existing value of errseq. And then we will have to
> > > > > > > > > > > > > > > do something about errseq_check() to not return an error if "since"
> > > > > > > > > > > > > > > and "eseq" differ only by "seen" bit.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Otherwise in current form, volatile mount will always return error
> > > > > > > > > > > > > > > if upperdir has error and it has not been seen by anybody.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > How did you finally end up testing the error case. Want to simualate
> > > > > > > > > > > > > > > error aritificially and test it.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > If you don't want to see errors that occurred before you did the mount,
> > > > > > > > > > > > > > then you probably can just resurrect and rename the original version of
> > > > > > > > > > > > > > errseq_sample. Something like this, but with a different name:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > > > > > > > > > > > +{
> > > > > > > > > > > > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > > > > > > > > > > > +       errseq_t new = old;
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +       /*
> > > > > > > > > > > > > > +        * For the common case of no errors ever having been set, we can skip
> > > > > > > > > > > > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > > > > > > > > > > > +        * never go back to zero.
> > > > > > > > > > > > > > +        */
> > > > > > > > > > > > > > +       if (old != 0) {
> > > > > > > > > > > > > > +               new |= ERRSEQ_SEEN;
> > > > > > > > > > > > > > +               if (old != new)
> > > > > > > > > > > > > > +                       cmpxchg(eseq, old, new);
> > > > > > > > > > > > > > +       }
> > > > > > > > > > > > > > +       return new;
> > > > > > > > > > > > > > +}
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Yes, a helper like this should solve the issue at hand. We are not
> > > > > > > > > > > > > interested in previous errors. This also sets the ERRSEQ_SEEN on
> > > > > > > > > > > > > sample and it will also solve the other issue when after sampling
> > > > > > > > > > > > > if error gets seen, we don't want errseq_check() to return error.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Thinking of some possible names for new function.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > errseq_sample_seen()
> > > > > > > > > > > > > errseq_sample_set_seen()
> > > > > > > > > > > > > errseq_sample_consume_unseen()
> > > > > > > > > > > > > errseq_sample_current()
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > > > > > > > > > > > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > > > > > > > > > > > 
> > > > > > > > > > > > If you don't want to add a new helper, I think you'd probably also be
> > > > > > > > > > > > able to do something like this in fill_super:
> > > > > > > > > > > > 
> > > > > > > > > > > >     errseq_sample()
> > > > > > > > > > > >     errseq_check_and_advance()
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > ...and just ignore the error returned by the check and advance. At that
> > > > > > > > > > > > point, the cursor should be caught up and any subsequent syncfs call
> > > > > > > > > > > > should return 0 until you record another error. It's a little less
> > > > > > > > > > > > efficient, but only slightly so.
> > > > > > > > > > > 
> > > > > > > > > > > This seems even better.
> > > > > > > > > > > 
> > > > > > > > > > > Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> > > > > > > > > > > sample. In our case, that would mean that we consumed an unseen error but
> > > > > > > > > > > never reported it back to user space. And then somebody might complain.
> > > > > > > > > > > 
> > > > > > > > > > > This kind of reminds me posgresql's fsync issues where they did
> > > > > > > > > > > writes using one fd and another thread opened another fd and
> > > > > > > > > > > did sync and they expected any errors to be reported.
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > Similary what if an unseen error is present on superblock on upper
> > > > > > > > > > > and if we mount volatile overlay and mark the error SEEN, then
> > > > > > > > > > > if another process opens a file on upper and did syncfs(), it will
> > > > > > > > > > > complain that exisiting error was not reported to it.
> > > > > > > > > > > 
> > > > > > > > > > > Overlay use case seems to be that we just want to check if an error
> > > > > > > > > > > has happened on upper superblock since we sampled it and don't
> > > > > > > > > > > want to consume that error as such. Will it make sense to introduce
> > > > > > > > > > > two helpers for error sampling and error checking which mask the
> > > > > > > > > > > SEEN bit and don't do anything with it. For example, following compile
> > > > > > > > > > > tested only patch.
> > > > > > > > > > > 
> > > > > > > > > > > Now we will not touch SEEN bit at all. And even if SEEN gets set
> > > > > > > > > > > since we sampled, errseq_check_mask_seen() will not flag it as
> > > > > > > > > > > error.
> > > > > > > > > > > 
> > > > > > > > > > > Thanks
> > > > > > > > > > > Vivek
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Again, you're not really hiding this from anyone doing something _sane_.
> > > > > > > > > > You're only hiding an error from someone who opens the file after an
> > > > > > > > > > error occurs and expects to see an error.
> > > > > > > > > > 
> > > > > > > > > > That was the behavior for fsync before we switched to errseq_t, and we
> > > > > > > > > > had to change errseq_sample for applications that relied on that. syncfs
> > > > > > > > > > reporting these errors is pretty new however. I don't think we
> > > > > > > > > > necessarily need to make the same guarantees there.
> > > > > > > > > > 
> > > > > > > > > > The solution to all of these problems is to ensure that you open the
> > > > > > > > > > files early you're issuing syncfs on and keep them open. Then you'll
> > > > > > > > > > always see any subsequent errors.
> > > > > > > > > 
> > > > > > > > > Ok. I guess we will have to set SEEN bit during error_sample otherwise,
> > > > > > > > > we miss errors. I had missed this point.
> > > > > > > > > 
> > > > > > > > > So mounting a volatile overlay instance will become somewhat
> > > > > > > > > equivalent of as if somebody did a syncfs on upper, consumed
> > > > > > > > > error and did not do anything about it.
> > > > > > > > > 
> > > > > > > > > If a user cares about not losing such errors, they need to keep an
> > > > > > > > > fd open on upper.
> > > > > > > > > 
> > > > > > > > > /me hopes that this does not become an issue for somebody. Even
> > > > > > > > > if it does, one workaround can be don't do volatile overlay or
> > > > > > > > > don't share overlay upper with other conflicting workload.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Yeah, there are limits to what we can do with 32 bits.
> > > > > > > > 
> > > > > > > > It's not pretty, but I guess you could pr_warn at mount time if you find
> > > > > > > > an unseen error. That would at least not completely drop it on the
> > > > > > > > floor.
> > > > > > > > 
> > > > > > > > --
> > > > > > > > Jeff Layton <jlayton@redhat.com>
> > > > > > > > 
> > > > > > > 
> > > > > > > If I may enumerate our choices to help my own understanding, and
> > > > > > > come up with a decent decision on how to proceed:
> > > > > > > 
> > > > > > > 1. If the filesystem has an unseen error, pr_warn.
> > > > > > > 2. If the filesystem has an unseen error, refuse to mount it until
> > > > > > >    the user clears the error (via syncfs?).
> > > > > > > 3. Ignore the beginning state of the upperdir
> > > > > > > 4. Increment the errseq_t.
> > > > > > > 5. A combination of #1, and #2 and require the user to mount
> > > > > > >    -o reallyvolatile or smoe such.
> > > > > > > 
> > > > > > > Now the downsides of each of these options:
> > > > > > > 
> > > > > > > 1. The user probably won't look at these errors. Especially,
> > > > > > >    if the application is a container runtime, and these are
> > > > > > >    happening on behalf of the application in an automated fashion.
> > > > > > > 2. Forcing a syncfs on most filesystems is a massively costly
> > > > > > >    operation that we want to avoid with the volatile operation.
> > > > > > >    Also, go back to #1. Until we implement the new FS API, we
> > > > > > >    can't easily give meaningful warnings to users that they
> > > > > > >    can programatically act on (unless we use some special errno).
> > > > > > > 3. This is a noop.
> > > > > > > 4. We can hide errors from other users of the upperdir if they
> > > > > > >    rely on syncfs semantics rather than per-fd fsync semantics
> > > > > > >    to check if the filesystem is "clean".
> > > > > > > 5. See the issues with #1 and #2.
> > > > > > > 
> > > > > > > I'm also curious as to how the patchset that allows for partial
> > > > > > > sync is going to deal with this problem [1].
> > > > > > > 
> > > > > > > There is one other proposal I have, which is we define errseq_t
> > > > > > > as two structures:
> > > > > > > -errseq_t errseq_set(errseq_t *eseq, int err);
> > > > > > > +/* For use on the publishing-side of errseq */
> > > > > > > +struct errseq_publisher {
> > > > > > > +        atomic_t        errors;
> > > > > > > +        errseq_t        errseq_t
> > > > > > > +};
> > > > > > > +
> > > > > > > +errseq_t errseq_set(struct errseq_publisher *eseq, int err);
> > > > > > > 
> > > > > > > And errseq_publisher is on the superblock, and errors is always incremented no
> > > > > > > matter what. We risk wrapping, but I think this falls into Jeff's "sane" test --
> > > > > > > if there are 2**32+ errors without someone doing an fsync, or noticing, you
> > > > > > > might have other problems.
> > > > > > > 
> > > > > > > This has two (and a half) downsides:
> > > > > > > 1. It is a potential performance concern to introduce an atomic here.
> > > > > > 
> > > > > > Is updation of errseq_t performance sensitive path. Updation happens in
> > > > > > error path and I would think that does not happen often. If that's the
> > > > > > case, it should not be very performance sensitive path.
> > > > > > 
> > > > > > I agree that warning on unseen error is probably not enough. Applications
> > > > > > can't do much with that. To me it boils down to two options.
> > > > > > 
> > > > > > A. Live with the idea of swallowing the unseen error on syncfs (if caller
> > > > > >    did not keep an fd open).
> > > > > > 
> > > > > > B. Extend errseq infrastcture in such a way so that we can detect new
> > > > > >    error without marking error SEEN.
> > > > > > 
> > > > > > It feels as if B is a safter choice but will be more work. With A, problem
> > > > > > is that behavior will be different in difference scenarios and it will then
> > > > > > become difficult to justify.
> > > > > > 
> > > > > > - fsync and syncfs behavior will be different w.r.t UNSEEN error.
> > > > > > - syncfs behavior will be different depending on if volatile overlay
> > > > > >   mounts are being used on this filesystem or not.
> > > > > > 
> > > > > 
> > > > > (cc'ing Willy since he helped a lot with this work)
> > > > > 
> > > > > The design for fsync is a bit odd, in that we had to preserve historical
> > > > > behavior. Note that we didn't get it right at first. Our original
> > > > > assumption was that applications wouldn't expect to see any writeback
> > > > > errors that occurred before they opened the file. That turned out to be
> > > > > wrong, and Willy fixed that by ensuring that unseen errors would be
> > > > > reported once to the next task to do an fsync [1].
> > > > > 
> > > > > I'm not sure how you could change the behavior to accomodate the desire
> > > > > for B. One idea: you could do away with the optimization that doesn't
> > > > > bump the counter and record a new error when no one has seen the
> > > > > previous one yet, and it's the same error. That would increase the
> > > > > chances of the counter wrapping around however.
> > > > > 
> > > > 
> > > > I personally like this approach. If the concerns of counter-wrap or performance
> > > > are well-founded, what do you think of adding the atomic which is
> > > > incremented only when the seen bit is unset, effectively giving publishers
> > > > 52 (32 + 20) bits of space to prevent wraparound? And subscribers can
> > > > optionally opt to have 52-bit values to check against?
> > > > 
> > > 
> > > That seems a bit hacky. I think if you're going to do that, then you
> > > might as well just build 64-bit errseq_t infrastructure and move to
> > > that. You could still just sample and test against the bottom 32 bits.
> > > 
> > > The problem there though is that growing struct address_space or struct
> > > file for this is somewhat unpalatable. It would be nice to keep all of
> > > this within 32 bits.
> > > 
> > > I think the best option is probably to just work to restructure the
> > > syncfs code so that overlayfs can override the logic in the syncfs
> > > syscall wrapper and return an error of its choosing to a syncfs()
> > > syscall.
> > > 
> > > It's a more radical change, but maybe we could add a new
> > > sb->s_op->syncfs operation, and turn the guts of the old syncfs syscall
> > > wrapper into a generic helper.
> > > 
> > > The naming is a little confusing with an existing sync_fs op, however,
> > > but maybe we could transition all of the old sync_fs ops to the new one
> > > and get rid of ->sync_fs. That's a more invasive set though, and you'd
> > > need to preserve the existing behavior everywhere.
> > > 
> > > Once you have that, you could just use the "realfile" in
> > > private_data for your syncfs op and everything should "just work".
> > 
> > This should work for regular overlayfs where calling syncfs() on
> > overlay will result in calling syncfs() on underlying filesystem 
> > and help propagating error back, IIUC.
> > 
> > But for this specific use case, of volatile mount, we want to find
> > out if underlying filesystem superblock had a writeback error since
> > we mounted overlay instance, without calling syncfs() on underlying
> > filesystem. And we don't want to consume existing error on underlying
> > super block fearing applications will complain.
> > 
> 
> Why? If you're not planning to do anything with the error then why check
> for it in the first place? Maybe I don't really understand what you're
> trying to do with this.
> 

Here is the background.

We introduced a new option "-o volatile" for overlayfs. What this option
does is that it disables all calls to sync/syncfs/fsync and returns
success.

https://www.kernel.org/doc/html/latest/filesystems/overlayfs.html?highlight=overlayfs#volatile-mount

Now one problem with this we realized is that what happens if there is
a writeback error on upper filesystem. Previously fsync will catch
that error and return to user space. Now we are not doing any actual
sync for volatile mount, so we don't have a way to detect if any
writeback error happened on upper filesystem.

So it is possible that an application writes something to overlay
volatile mount and gets back corrupted/old data.

- App writes something.
- Writeback of that page fails
- app does fsync, which succeds without doing any sync.
- app reads back page and can get old data if page has been evicted out
  of cache.

So we lost capability to return writeback errors to user space with
volatile mounts. So Amir/Sargun proposed that lets take snapshot
of upper ->s_wb_err when volatile overlay is being mounted. And
on every sync/fsync call check if any error has happened on upper
since volatile overlay has been mounted. If yes, return error to 
user space. 

In fact, Idea is that once an error has been detected, volatile
overlay should effectively return -EIO for all the operations. IOW,
one should now unmount it, throw away upper and restart again.


> If it turns out that you just want to see if it ever had an error, you
> can always use errseq_check with 0 as the "since" value and that will
> tell you without marking or advancing anything. It's not clear to me
> what the value of that is here though.

I think "since == 0" will not work. Say upper already has an error
(seen/unseen), then errseq_check() will always return error. We
don't want that. We don't care if upper has an seen/unseen error
at the time when we sample it. What we care about is that if
there is an error after we sampled, we can detect that and make
whole volatile mount bad.

> 
> > So key requirement here seems to be being able to detect error
> > on underlying superblock without consuming the unseen error.
> > 
> 
> I think for overlayfs what you really want to do is basically "proxy"
> fsync and syncfs calls to the upper layer. Then you should just be able
> to use the upper layer's "realfile" when doing fsync/syncfs. You won't
> need to sample anything at mount time that way as it should just happen
> naturally when you open files on overlayfs.

Which we already do, right? ovl_fsync()/ovl_sync() result in a
call on upper. This probably can be improve futher.

> 
> That does mean you may need to rework how the syncfs syscall dispatches
> to the filesystem, but that's not too difficult in principle.

I think we are looking at two overlay cases here. One is regular
overlayfs where syncfs() needs to be reworked to propagate errors
from upper/ to all the way to application. Right now VFS ignores
error returned from ->sync_fs. 

The other case we are trying to solve right now is volatile mount.
Where we will not actually call fsync/sync_filesystem() on upper
but still want to detect if any error happened since we mounted
this volatile mount. 

And that's why all this discussion of being able to detect an
error on super block without actually consuming the error. Once
we detect that some error has happened on upper since we mounted,
we can start returning errors for all I/O operations to user and
user is supposed to unmount and throw away upper dir and restart.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-03 22:24                               ` Vivek Goyal
@ 2020-12-03 23:36                                 ` Jeff Layton
  2020-12-04  6:45                                   ` Amir Goldstein
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Layton @ 2020-12-03 23:36 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Sargun Dhillon, Amir Goldstein, Linux FS-devel Mailing List,
	overlayfs, Miklos Szeredi, Matthew Wilcox

On Thu, 2020-12-03 at 17:24 -0500, Vivek Goyal wrote:
> On Thu, Dec 03, 2020 at 04:36:45PM -0500, Jeff Layton wrote:
> > On Thu, 2020-12-03 at 15:43 -0500, Vivek Goyal wrote:
> > > On Thu, Dec 03, 2020 at 12:50:58PM -0500, Jeff Layton wrote:
> > > > On Thu, 2020-12-03 at 09:08 -0800, Sargun Dhillon wrote:
> > > > > On Thu, Dec 3, 2020 at 7:20 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > > > > 
> > > > > > On Thu, 2020-12-03 at 09:27 -0500, Vivek Goyal wrote:
> > > > > > > On Thu, Dec 03, 2020 at 10:42:26AM +0000, Sargun Dhillon wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 04:52:33PM -0500, Jeff Layton wrote:
> > > > > > > > > On Wed, 2020-12-02 at 16:34 -0500, Vivek Goyal wrote:
> > > > > > > > > > On Wed, Dec 02, 2020 at 02:26:23PM -0500, Jeff Layton wrote:
> > > > > > > > > > [..]
> > > > > > > > > > > > > > > > > +         upper_mnt_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > > > > > > > > > > > > > +         sb->s_stack_depth = upper_mnt_sb->s_stack_depth;
> > > > > > > > > > > > > > > > > +         sb->s_time_gran = upper_mnt_sb->s_time_gran;
> > > > > > > > > > > > > > > > > +         ofs->upper_errseq = errseq_sample(&upper_mnt_sb->s_wb_err);
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > I asked this question in last email as well. errseq_sample() will return
> > > > > > > > > > > > > > > > 0 if current error has not been seen yet. That means next time a sync
> > > > > > > > > > > > > > > > call comes for volatile mount, it will return an error. But that's
> > > > > > > > > > > > > > > > not what we want. When we mounted a volatile overlay, if there is an
> > > > > > > > > > > > > > > > existing error (seen/unseen), we don't care. We only care if there
> > > > > > > > > > > > > > > > is a new error after the volatile mount, right?
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > I guess we will need another helper similar to errseq_smaple() which
> > > > > > > > > > > > > > > > just returns existing value of errseq. And then we will have to
> > > > > > > > > > > > > > > > do something about errseq_check() to not return an error if "since"
> > > > > > > > > > > > > > > > and "eseq" differ only by "seen" bit.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Otherwise in current form, volatile mount will always return error
> > > > > > > > > > > > > > > > if upperdir has error and it has not been seen by anybody.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > How did you finally end up testing the error case. Want to simualate
> > > > > > > > > > > > > > > > error aritificially and test it.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > If you don't want to see errors that occurred before you did the mount,
> > > > > > > > > > > > > > > then you probably can just resurrect and rename the original version of
> > > > > > > > > > > > > > > errseq_sample. Something like this, but with a different name:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +errseq_t errseq_sample(errseq_t *eseq)
> > > > > > > > > > > > > > > +{
> > > > > > > > > > > > > > > +       errseq_t old = READ_ONCE(*eseq);
> > > > > > > > > > > > > > > +       errseq_t new = old;
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +       /*
> > > > > > > > > > > > > > > +        * For the common case of no errors ever having been set, we can skip
> > > > > > > > > > > > > > > +        * marking the SEEN bit. Once an error has been set, the value will
> > > > > > > > > > > > > > > +        * never go back to zero.
> > > > > > > > > > > > > > > +        */
> > > > > > > > > > > > > > > +       if (old != 0) {
> > > > > > > > > > > > > > > +               new |= ERRSEQ_SEEN;
> > > > > > > > > > > > > > > +               if (old != new)
> > > > > > > > > > > > > > > +                       cmpxchg(eseq, old, new);
> > > > > > > > > > > > > > > +       }
> > > > > > > > > > > > > > > +       return new;
> > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Yes, a helper like this should solve the issue at hand. We are not
> > > > > > > > > > > > > > interested in previous errors. This also sets the ERRSEQ_SEEN on
> > > > > > > > > > > > > > sample and it will also solve the other issue when after sampling
> > > > > > > > > > > > > > if error gets seen, we don't want errseq_check() to return error.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Thinking of some possible names for new function.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > errseq_sample_seen()
> > > > > > > > > > > > > > errseq_sample_set_seen()
> > > > > > > > > > > > > > errseq_sample_consume_unseen()
> > > > > > > > > > > > > > errseq_sample_current()
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > errseq_sample_consume_unseen() sounds good, though maybe it should be
> > > > > > > > > > > > > "ignore_unseen"? IDK, naming this stuff is the hardest part.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > If you don't want to add a new helper, I think you'd probably also be
> > > > > > > > > > > > > able to do something like this in fill_super:
> > > > > > > > > > > > > 
> > > > > > > > > > > > >     errseq_sample()
> > > > > > > > > > > > >     errseq_check_and_advance()
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > ...and just ignore the error returned by the check and advance. At that
> > > > > > > > > > > > > point, the cursor should be caught up and any subsequent syncfs call
> > > > > > > > > > > > > should return 0 until you record another error. It's a little less
> > > > > > > > > > > > > efficient, but only slightly so.
> > > > > > > > > > > > 
> > > > > > > > > > > > This seems even better.
> > > > > > > > > > > > 
> > > > > > > > > > > > Thinking little bit more. I am now concerned about setting ERRSEQ_SEEN on
> > > > > > > > > > > > sample. In our case, that would mean that we consumed an unseen error but
> > > > > > > > > > > > never reported it back to user space. And then somebody might complain.
> > > > > > > > > > > > 
> > > > > > > > > > > > This kind of reminds me posgresql's fsync issues where they did
> > > > > > > > > > > > writes using one fd and another thread opened another fd and
> > > > > > > > > > > > did sync and they expected any errors to be reported.
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > Similary what if an unseen error is present on superblock on upper
> > > > > > > > > > > > and if we mount volatile overlay and mark the error SEEN, then
> > > > > > > > > > > > if another process opens a file on upper and did syncfs(), it will
> > > > > > > > > > > > complain that exisiting error was not reported to it.
> > > > > > > > > > > > 
> > > > > > > > > > > > Overlay use case seems to be that we just want to check if an error
> > > > > > > > > > > > has happened on upper superblock since we sampled it and don't
> > > > > > > > > > > > want to consume that error as such. Will it make sense to introduce
> > > > > > > > > > > > two helpers for error sampling and error checking which mask the
> > > > > > > > > > > > SEEN bit and don't do anything with it. For example, following compile
> > > > > > > > > > > > tested only patch.
> > > > > > > > > > > > 
> > > > > > > > > > > > Now we will not touch SEEN bit at all. And even if SEEN gets set
> > > > > > > > > > > > since we sampled, errseq_check_mask_seen() will not flag it as
> > > > > > > > > > > > error.
> > > > > > > > > > > > 
> > > > > > > > > > > > Thanks
> > > > > > > > > > > > Vivek
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Again, you're not really hiding this from anyone doing something _sane_.
> > > > > > > > > > > You're only hiding an error from someone who opens the file after an
> > > > > > > > > > > error occurs and expects to see an error.
> > > > > > > > > > > 
> > > > > > > > > > > That was the behavior for fsync before we switched to errseq_t, and we
> > > > > > > > > > > had to change errseq_sample for applications that relied on that. syncfs
> > > > > > > > > > > reporting these errors is pretty new however. I don't think we
> > > > > > > > > > > necessarily need to make the same guarantees there.
> > > > > > > > > > > 
> > > > > > > > > > > The solution to all of these problems is to ensure that you open the
> > > > > > > > > > > files early you're issuing syncfs on and keep them open. Then you'll
> > > > > > > > > > > always see any subsequent errors.
> > > > > > > > > > 
> > > > > > > > > > Ok. I guess we will have to set SEEN bit during error_sample otherwise,
> > > > > > > > > > we miss errors. I had missed this point.
> > > > > > > > > > 
> > > > > > > > > > So mounting a volatile overlay instance will become somewhat
> > > > > > > > > > equivalent of as if somebody did a syncfs on upper, consumed
> > > > > > > > > > error and did not do anything about it.
> > > > > > > > > > 
> > > > > > > > > > If a user cares about not losing such errors, they need to keep an
> > > > > > > > > > fd open on upper.
> > > > > > > > > > 
> > > > > > > > > > /me hopes that this does not become an issue for somebody. Even
> > > > > > > > > > if it does, one workaround can be don't do volatile overlay or
> > > > > > > > > > don't share overlay upper with other conflicting workload.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Yeah, there are limits to what we can do with 32 bits.
> > > > > > > > > 
> > > > > > > > > It's not pretty, but I guess you could pr_warn at mount time if you find
> > > > > > > > > an unseen error. That would at least not completely drop it on the
> > > > > > > > > floor.
> > > > > > > > > 
> > > > > > > > > --
> > > > > > > > > Jeff Layton <jlayton@redhat.com>
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > If I may enumerate our choices to help my own understanding, and
> > > > > > > > come up with a decent decision on how to proceed:
> > > > > > > > 
> > > > > > > > 1. If the filesystem has an unseen error, pr_warn.
> > > > > > > > 2. If the filesystem has an unseen error, refuse to mount it until
> > > > > > > >    the user clears the error (via syncfs?).
> > > > > > > > 3. Ignore the beginning state of the upperdir
> > > > > > > > 4. Increment the errseq_t.
> > > > > > > > 5. A combination of #1, and #2 and require the user to mount
> > > > > > > >    -o reallyvolatile or smoe such.
> > > > > > > > 
> > > > > > > > Now the downsides of each of these options:
> > > > > > > > 
> > > > > > > > 1. The user probably won't look at these errors. Especially,
> > > > > > > >    if the application is a container runtime, and these are
> > > > > > > >    happening on behalf of the application in an automated fashion.
> > > > > > > > 2. Forcing a syncfs on most filesystems is a massively costly
> > > > > > > >    operation that we want to avoid with the volatile operation.
> > > > > > > >    Also, go back to #1. Until we implement the new FS API, we
> > > > > > > >    can't easily give meaningful warnings to users that they
> > > > > > > >    can programatically act on (unless we use some special errno).
> > > > > > > > 3. This is a noop.
> > > > > > > > 4. We can hide errors from other users of the upperdir if they
> > > > > > > >    rely on syncfs semantics rather than per-fd fsync semantics
> > > > > > > >    to check if the filesystem is "clean".
> > > > > > > > 5. See the issues with #1 and #2.
> > > > > > > > 
> > > > > > > > I'm also curious as to how the patchset that allows for partial
> > > > > > > > sync is going to deal with this problem [1].
> > > > > > > > 
> > > > > > > > There is one other proposal I have, which is we define errseq_t
> > > > > > > > as two structures:
> > > > > > > > -errseq_t errseq_set(errseq_t *eseq, int err);
> > > > > > > > +/* For use on the publishing-side of errseq */
> > > > > > > > +struct errseq_publisher {
> > > > > > > > +        atomic_t        errors;
> > > > > > > > +        errseq_t        errseq_t
> > > > > > > > +};
> > > > > > > > +
> > > > > > > > +errseq_t errseq_set(struct errseq_publisher *eseq, int err);
> > > > > > > > 
> > > > > > > > And errseq_publisher is on the superblock, and errors is always incremented no
> > > > > > > > matter what. We risk wrapping, but I think this falls into Jeff's "sane" test --
> > > > > > > > if there are 2**32+ errors without someone doing an fsync, or noticing, you
> > > > > > > > might have other problems.
> > > > > > > > 
> > > > > > > > This has two (and a half) downsides:
> > > > > > > > 1. It is a potential performance concern to introduce an atomic here.
> > > > > > > 
> > > > > > > Is updation of errseq_t performance sensitive path. Updation happens in
> > > > > > > error path and I would think that does not happen often. If that's the
> > > > > > > case, it should not be very performance sensitive path.
> > > > > > > 
> > > > > > > I agree that warning on unseen error is probably not enough. Applications
> > > > > > > can't do much with that. To me it boils down to two options.
> > > > > > > 
> > > > > > > A. Live with the idea of swallowing the unseen error on syncfs (if caller
> > > > > > >    did not keep an fd open).
> > > > > > > 
> > > > > > > B. Extend errseq infrastcture in such a way so that we can detect new
> > > > > > >    error without marking error SEEN.
> > > > > > > 
> > > > > > > It feels as if B is a safter choice but will be more work. With A, problem
> > > > > > > is that behavior will be different in difference scenarios and it will then
> > > > > > > become difficult to justify.
> > > > > > > 
> > > > > > > - fsync and syncfs behavior will be different w.r.t UNSEEN error.
> > > > > > > - syncfs behavior will be different depending on if volatile overlay
> > > > > > >   mounts are being used on this filesystem or not.
> > > > > > > 
> > > > > > 
> > > > > > (cc'ing Willy since he helped a lot with this work)
> > > > > > 
> > > > > > The design for fsync is a bit odd, in that we had to preserve historical
> > > > > > behavior. Note that we didn't get it right at first. Our original
> > > > > > assumption was that applications wouldn't expect to see any writeback
> > > > > > errors that occurred before they opened the file. That turned out to be
> > > > > > wrong, and Willy fixed that by ensuring that unseen errors would be
> > > > > > reported once to the next task to do an fsync [1].
> > > > > > 
> > > > > > I'm not sure how you could change the behavior to accomodate the desire
> > > > > > for B. One idea: you could do away with the optimization that doesn't
> > > > > > bump the counter and record a new error when no one has seen the
> > > > > > previous one yet, and it's the same error. That would increase the
> > > > > > chances of the counter wrapping around however.
> > > > > > 
> > > > > 
> > > > > I personally like this approach. If the concerns of counter-wrap or performance
> > > > > are well-founded, what do you think of adding the atomic which is
> > > > > incremented only when the seen bit is unset, effectively giving publishers
> > > > > 52 (32 + 20) bits of space to prevent wraparound? And subscribers can
> > > > > optionally opt to have 52-bit values to check against?
> > > > > 
> > > > 
> > > > That seems a bit hacky. I think if you're going to do that, then you
> > > > might as well just build 64-bit errseq_t infrastructure and move to
> > > > that. You could still just sample and test against the bottom 32 bits.
> > > > 
> > > > The problem there though is that growing struct address_space or struct
> > > > file for this is somewhat unpalatable. It would be nice to keep all of
> > > > this within 32 bits.
> > > > 
> > > > I think the best option is probably to just work to restructure the
> > > > syncfs code so that overlayfs can override the logic in the syncfs
> > > > syscall wrapper and return an error of its choosing to a syncfs()
> > > > syscall.
> > > > 
> > > > It's a more radical change, but maybe we could add a new
> > > > sb->s_op->syncfs operation, and turn the guts of the old syncfs syscall
> > > > wrapper into a generic helper.
> > > > 
> > > > The naming is a little confusing with an existing sync_fs op, however,
> > > > but maybe we could transition all of the old sync_fs ops to the new one
> > > > and get rid of ->sync_fs. That's a more invasive set though, and you'd
> > > > need to preserve the existing behavior everywhere.
> > > > 
> > > > Once you have that, you could just use the "realfile" in
> > > > private_data for your syncfs op and everything should "just work".
> > > 
> > > This should work for regular overlayfs where calling syncfs() on
> > > overlay will result in calling syncfs() on underlying filesystem 
> > > and help propagating error back, IIUC.
> > > 
> > > But for this specific use case, of volatile mount, we want to find
> > > out if underlying filesystem superblock had a writeback error since
> > > we mounted overlay instance, without calling syncfs() on underlying
> > > filesystem. And we don't want to consume existing error on underlying
> > > super block fearing applications will complain.
> > > 
> > 
> > Why? If you're not planning to do anything with the error then why check
> > for it in the first place? Maybe I don't really understand what you're
> > trying to do with this.
> > 
> 
> Here is the background.
> 
> We introduced a new option "-o volatile" for overlayfs. What this option
> does is that it disables all calls to sync/syncfs/fsync and returns
> success.
> 
> https://www.kernel.org/doc/html/latest/filesystems/overlayfs.html?highlight=overlayfs#volatile-mount
> 
> Now one problem with this we realized is that what happens if there is
> a writeback error on upper filesystem. Previously fsync will catch
> that error and return to user space. Now we are not doing any actual
> sync for volatile mount, so we don't have a way to detect if any
> writeback error happened on upper filesystem.
> 
> So it is possible that an application writes something to overlay
> volatile mount and gets back corrupted/old data.
> 
> - App writes something.
> - Writeback of that page fails
> - app does fsync, which succeds without doing any sync.
> - app reads back page and can get old data if page has been evicted out
>   of cache.
> 
> So we lost capability to return writeback errors to user space with
> volatile mounts. So Amir/Sargun proposed that lets take snapshot
> of upper ->s_wb_err when volatile overlay is being mounted. And
> on every sync/fsync call check if any error has happened on upper
> since volatile overlay has been mounted. If yes, return error to 
> user space. 
> 
> In fact, Idea is that once an error has been detected, volatile
> overlay should effectively return -EIO for all the operations. IOW,
> one should now unmount it, throw away upper and restart again.
> 
> 
> > If it turns out that you just want to see if it ever had an error, you
> > can always use errseq_check with 0 as the "since" value and that will
> > tell you without marking or advancing anything. It's not clear to me
> > what the value of that is here though.
> 
> I think "since == 0" will not work. Say upper already has an error
> (seen/unseen), then errseq_check() will always return error. We
> don't want that. We don't care if upper has an seen/unseen error
> at the time when we sample it. What we care about is that if
> there is an error after we sampled, we can detect that and make
> whole volatile mount bad.
> 
> > 
> > > So key requirement here seems to be being able to detect error
> > > on underlying superblock without consuming the unseen error.
> > > 
> > 
> > I think for overlayfs what you really want to do is basically "proxy"
> > fsync and syncfs calls to the upper layer. Then you should just be able
> > to use the upper layer's "realfile" when doing fsync/syncfs. You won't
> > need to sample anything at mount time that way as it should just happen
> > naturally when you open files on overlayfs.
> 
> Which we already do, right? ovl_fsync()/ovl_sync() result in a
> call on upper. This probably can be improve futher.
> 
> > 
> > That does mean you may need to rework how the syncfs syscall dispatches
> > to the filesystem, but that's not too difficult in principle.
> 
> I think we are looking at two overlay cases here. One is regular
> overlayfs where syncfs() needs to be reworked to propagate errors
> from upper/ to all the way to application. Right now VFS ignores
> error returned from ->sync_fs. 
> 
> The other case we are trying to solve right now is volatile mount.
> Where we will not actually call fsync/sync_filesystem() on upper
> but still want to detect if any error happened since we mounted
> this volatile mount. 
> 
> And that's why all this discussion of being able to detect an
> error on super block without actually consuming the error. Once
> we detect that some error has happened on upper since we mounted,
> we can start returning errors for all I/O operations to user and
> user is supposed to unmount and throw away upper dir and restart.
> 


The problem here is that you want to be able to sample the thing in two
different ways such that you potentially get two different results
afterward:

1) the current syncfs/fsync case where we don't expect later openers to
be able to see the error after you take it.

2) the situation you want where you want to sample the errseq_t but
don't want to cloak an fsync on a subsequent open from seeing it

That's fundamentally not going to work with the single SEEN flag we're
using now. I wonder if you could get you the semantics you want with 2
flags instead of 1. Basically, split the SEEN bit into two:

1) a bit to indicate that the counter doesn't need to be incremented the
next time an error is recorded (SKIP_INC)

2) a bit to indicate that the error has been reported in a way that was
returned to userland, such that later openers won't see it (SEEN)

Then you could just add two different sorts of sampling functions. One
would set both bits when sampling (or advancing) and the other would
just set one of them.

It's a bit more complicated than what we're doing now though and you'd
need to work through the logic of how the API would interact with both
flags.

-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-03 23:36                                 ` Jeff Layton
@ 2020-12-04  6:45                                   ` Amir Goldstein
  2020-12-04 15:00                                     ` Vivek Goyal
  0 siblings, 1 reply; 28+ messages in thread
From: Amir Goldstein @ 2020-12-04  6:45 UTC (permalink / raw)
  To: Jeff Layton, Vivek Goyal, Sargun Dhillon
  Cc: Linux FS-devel Mailing List, overlayfs, Miklos Szeredi, Matthew Wilcox

> > Here is the background.
> >
> > We introduced a new option "-o volatile" for overlayfs. What this option
> > does is that it disables all calls to sync/syncfs/fsync and returns
> > success.
> >
> > https://www.kernel.org/doc/html/latest/filesystems/overlayfs.html?highlight=overlayfs#volatile-mount
> >
> > Now one problem with this we realized is that what happens if there is
> > a writeback error on upper filesystem. Previously fsync will catch
> > that error and return to user space. Now we are not doing any actual
> > sync for volatile mount, so we don't have a way to detect if any
> > writeback error happened on upper filesystem.
> >
> > So it is possible that an application writes something to overlay
> > volatile mount and gets back corrupted/old data.
> >
> > - App writes something.
> > - Writeback of that page fails
> > - app does fsync, which succeds without doing any sync.
> > - app reads back page and can get old data if page has been evicted out
> >   of cache.
> >
> > So we lost capability to return writeback errors to user space with
> > volatile mounts. So Amir/Sargun proposed that lets take snapshot
> > of upper ->s_wb_err when volatile overlay is being mounted. And
> > on every sync/fsync call check if any error has happened on upper
> > since volatile overlay has been mounted. If yes, return error to
> > user space.
> >
> > In fact, Idea is that once an error has been detected, volatile
> > overlay should effectively return -EIO for all the operations. IOW,
> > one should now unmount it, throw away upper and restart again.
> >
> >
> > > If it turns out that you just want to see if it ever had an error, you
> > > can always use errseq_check with 0 as the "since" value and that will
> > > tell you without marking or advancing anything. It's not clear to me
> > > what the value of that is here though.
> >
> > I think "since == 0" will not work. Say upper already has an error
> > (seen/unseen), then errseq_check() will always return error. We
> > don't want that. We don't care if upper has an seen/unseen error
> > at the time when we sample it. What we care about is that if
> > there is an error after we sampled, we can detect that and make
> > whole volatile mount bad.
> >
> > >
> > > > So key requirement here seems to be being able to detect error
> > > > on underlying superblock without consuming the unseen error.
> > > >
> > >
> > > I think for overlayfs what you really want to do is basically "proxy"
> > > fsync and syncfs calls to the upper layer. Then you should just be able
> > > to use the upper layer's "realfile" when doing fsync/syncfs. You won't
> > > need to sample anything at mount time that way as it should just happen
> > > naturally when you open files on overlayfs.
> >
> > Which we already do, right? ovl_fsync()/ovl_sync() result in a
> > call on upper. This probably can be improve futher.
> >
> > >
> > > That does mean you may need to rework how the syncfs syscall dispatches
> > > to the filesystem, but that's not too difficult in principle.
> >
> > I think we are looking at two overlay cases here. One is regular
> > overlayfs where syncfs() needs to be reworked to propagate errors
> > from upper/ to all the way to application. Right now VFS ignores
> > error returned from ->sync_fs.
> >
> > The other case we are trying to solve right now is volatile mount.
> > Where we will not actually call fsync/sync_filesystem() on upper
> > but still want to detect if any error happened since we mounted
> > this volatile mount.
> >
> > And that's why all this discussion of being able to detect an
> > error on super block without actually consuming the error. Once
> > we detect that some error has happened on upper since we mounted,
> > we can start returning errors for all I/O operations to user and
> > user is supposed to unmount and throw away upper dir and restart.
> >
>
>
> The problem here is that you want to be able to sample the thing in two
> different ways such that you potentially get two different results
> afterward:
>
> 1) the current syncfs/fsync case where we don't expect later openers to
> be able to see the error after you take it.
>
> 2) the situation you want where you want to sample the errseq_t but
> don't want to cloak an fsync on a subsequent open from seeing it
>
> That's fundamentally not going to work with the single SEEN flag we're
> using now. I wonder if you could get you the semantics you want with 2
> flags instead of 1. Basically, split the SEEN bit into two:
>
> 1) a bit to indicate that the counter doesn't need to be incremented the
> next time an error is recorded (SKIP_INC)
>
> 2) a bit to indicate that the error has been reported in a way that was
> returned to userland, such that later openers won't see it (SEEN)
>
> Then you could just add two different sorts of sampling functions. One
> would set both bits when sampling (or advancing) and the other would
> just set one of them.
>
> It's a bit more complicated than what we're doing now though and you'd
> need to work through the logic of how the API would interact with both
> flags.
>

This discussion is a very good exercise for my brain ;-)
but I think we are really over complicating the requirements of volatile.

My suggestion to sample sb error on mount was over-interpreted that
we MUST disregard writeback errors that happened before the mount.
I don't think this is a requirement. If anything, this is a non-requirement.
Why? because what happens if someone unpacks the layers onto
underlying fs (as docker most surely does) and then mounts the volatile
overlay. The files data could have been lost in the time that passed between
unpack of layer and overlay mount.

Of course overlayfs can not be held responsible for the integrity of the
layers it was handed, but why work so hard to deprive users of something
that can benefit the integrity of their system?

So I think we may be prudent and say that if there is an unseen error we
should fail the volatile mount (say ESTALE).

This way userland has the fast path of mounting without syncfs in the
common case and the fallback to slow path:
- syncfs (consume the error)
- unpack layers
- volatile mount

Doesn't this make sense *and* make life simpler?

1. On volatile mount sample sb_err and make sure no unseen error
2. On fsync/syncfs verify no sb_err since mount

Am I missing something?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] overlay: Implement volatile-specific fsync error behaviour
  2020-12-04  6:45                                   ` Amir Goldstein
@ 2020-12-04 15:00                                     ` Vivek Goyal
  0 siblings, 0 replies; 28+ messages in thread
From: Vivek Goyal @ 2020-12-04 15:00 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jeff Layton, Sargun Dhillon, Linux FS-devel Mailing List,
	overlayfs, Miklos Szeredi, Matthew Wilcox

On Fri, Dec 04, 2020 at 08:45:16AM +0200, Amir Goldstein wrote:
> > > Here is the background.
> > >
> > > We introduced a new option "-o volatile" for overlayfs. What this option
> > > does is that it disables all calls to sync/syncfs/fsync and returns
> > > success.
> > >
> > > https://www.kernel.org/doc/html/latest/filesystems/overlayfs.html?highlight=overlayfs#volatile-mount
> > >
> > > Now one problem with this we realized is that what happens if there is
> > > a writeback error on upper filesystem. Previously fsync will catch
> > > that error and return to user space. Now we are not doing any actual
> > > sync for volatile mount, so we don't have a way to detect if any
> > > writeback error happened on upper filesystem.
> > >
> > > So it is possible that an application writes something to overlay
> > > volatile mount and gets back corrupted/old data.
> > >
> > > - App writes something.
> > > - Writeback of that page fails
> > > - app does fsync, which succeds without doing any sync.
> > > - app reads back page and can get old data if page has been evicted out
> > >   of cache.
> > >
> > > So we lost capability to return writeback errors to user space with
> > > volatile mounts. So Amir/Sargun proposed that lets take snapshot
> > > of upper ->s_wb_err when volatile overlay is being mounted. And
> > > on every sync/fsync call check if any error has happened on upper
> > > since volatile overlay has been mounted. If yes, return error to
> > > user space.
> > >
> > > In fact, Idea is that once an error has been detected, volatile
> > > overlay should effectively return -EIO for all the operations. IOW,
> > > one should now unmount it, throw away upper and restart again.
> > >
> > >
> > > > If it turns out that you just want to see if it ever had an error, you
> > > > can always use errseq_check with 0 as the "since" value and that will
> > > > tell you without marking or advancing anything. It's not clear to me
> > > > what the value of that is here though.
> > >
> > > I think "since == 0" will not work. Say upper already has an error
> > > (seen/unseen), then errseq_check() will always return error. We
> > > don't want that. We don't care if upper has an seen/unseen error
> > > at the time when we sample it. What we care about is that if
> > > there is an error after we sampled, we can detect that and make
> > > whole volatile mount bad.
> > >
> > > >
> > > > > So key requirement here seems to be being able to detect error
> > > > > on underlying superblock without consuming the unseen error.
> > > > >
> > > >
> > > > I think for overlayfs what you really want to do is basically "proxy"
> > > > fsync and syncfs calls to the upper layer. Then you should just be able
> > > > to use the upper layer's "realfile" when doing fsync/syncfs. You won't
> > > > need to sample anything at mount time that way as it should just happen
> > > > naturally when you open files on overlayfs.
> > >
> > > Which we already do, right? ovl_fsync()/ovl_sync() result in a
> > > call on upper. This probably can be improve futher.
> > >
> > > >
> > > > That does mean you may need to rework how the syncfs syscall dispatches
> > > > to the filesystem, but that's not too difficult in principle.
> > >
> > > I think we are looking at two overlay cases here. One is regular
> > > overlayfs where syncfs() needs to be reworked to propagate errors
> > > from upper/ to all the way to application. Right now VFS ignores
> > > error returned from ->sync_fs.
> > >
> > > The other case we are trying to solve right now is volatile mount.
> > > Where we will not actually call fsync/sync_filesystem() on upper
> > > but still want to detect if any error happened since we mounted
> > > this volatile mount.
> > >
> > > And that's why all this discussion of being able to detect an
> > > error on super block without actually consuming the error. Once
> > > we detect that some error has happened on upper since we mounted,
> > > we can start returning errors for all I/O operations to user and
> > > user is supposed to unmount and throw away upper dir and restart.
> > >
> >
> >
> > The problem here is that you want to be able to sample the thing in two
> > different ways such that you potentially get two different results
> > afterward:
> >
> > 1) the current syncfs/fsync case where we don't expect later openers to
> > be able to see the error after you take it.
> >
> > 2) the situation you want where you want to sample the errseq_t but
> > don't want to cloak an fsync on a subsequent open from seeing it
> >
> > That's fundamentally not going to work with the single SEEN flag we're
> > using now. I wonder if you could get you the semantics you want with 2
> > flags instead of 1. Basically, split the SEEN bit into two:
> >
> > 1) a bit to indicate that the counter doesn't need to be incremented the
> > next time an error is recorded (SKIP_INC)
> >
> > 2) a bit to indicate that the error has been reported in a way that was
> > returned to userland, such that later openers won't see it (SEEN)
> >
> > Then you could just add two different sorts of sampling functions. One
> > would set both bits when sampling (or advancing) and the other would
> > just set one of them.
> >
> > It's a bit more complicated than what we're doing now though and you'd
> > need to work through the logic of how the API would interact with both
> > flags.
> >
> 
> This discussion is a very good exercise for my brain ;-)
> but I think we are really over complicating the requirements of volatile.
> 
> My suggestion to sample sb error on mount was over-interpreted that
> we MUST disregard writeback errors that happened before the mount.
> I don't think this is a requirement. If anything, this is a non-requirement.
> Why? because what happens if someone unpacks the layers onto
> underlying fs (as docker most surely does) and then mounts the volatile
> overlay. The files data could have been lost in the time that passed between
> unpack of layer and overlay mount.

I think for unpacking layers one should not use volatile mounts. overlay
assumes lower layers are stable and do not have writeback errors. If
we use volatile mounts for unpacking layers, then unpacked layer (lower)
might have writeback errors and overlay will not detect it.

> 
> Of course overlayfs can not be held responsible for the integrity of the
> layers it was handed, but why work so hard to deprive users of something
> that can benefit the integrity of their system?
> 
> So I think we may be prudent and say that if there is an unseen error we
> should fail the volatile mount (say ESTALE).
> 
> This way userland has the fast path of mounting without syncfs in the
> common case and the fallback to slow path:
> - syncfs (consume the error)
> - unpack layers
> - volatile mount

That sounds reasonable too.
> 
> Doesn't this make sense *and* make life simpler?
> 
> 1. On volatile mount sample sb_err and make sure no unseen error
> 2. On fsync/syncfs verify no sb_err since mount

Thinking little bit more about why we are trying to determine sb_err
since mount. Why not detect error since realfile->f_sb_err instead
(As Jeff Layton suggested). I think this will allow more fine grained
error handling for volatile mounts. If fsync() returns error, then
one can get rid of that file and create new file without getting rid
of whole upper layer?

I think fsync() path probably is easy case where overlay could skip
actual fsync call but determine if there has been a writeback error on
file since realfile->f_sb_err and return error.

Given we actually don't do fsync(), we probably will have to do
error checks on other overlay paths like ovl_read_iter() to make
sure there have not been writeback errors on file otherwise
return -EIO instead.

->sync_fs is little tricky because first of all we don't pass
"struct file" and then we ignore return code from ->sync_fs. So
if we somehow fix all that than syncfs path could also be made
to check error on realfile->f_sb_err instead.

But problem with this is that it is complicated. And its not clear
how much are the gains due to this added complexity.

So probably for now it is better to stick to simpler idea of just fail
volatile mount on unseen errors and determine errors w.r.t ofs->s_wb_err
saved at mount time.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-12-04 15:02 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-02  9:27 [PATCH] overlay: Implement volatile-specific fsync error behaviour Sargun Dhillon
2020-12-02 10:25 ` Amir Goldstein
2020-12-02 15:07 ` Vivek Goyal
2020-12-02 17:02   ` Jeff Layton
2020-12-02 17:29     ` Vivek Goyal
2020-12-02 18:22       ` Jeff Layton
2020-12-02 18:56         ` Vivek Goyal
2020-12-02 19:03           ` Sargun Dhillon
2020-12-02 19:26           ` Jeff Layton
2020-12-02 21:34             ` Vivek Goyal
2020-12-02 21:52               ` Jeff Layton
2020-12-03 10:42                 ` Sargun Dhillon
2020-12-03 12:06                   ` Jeff Layton
2020-12-03 14:27                   ` Vivek Goyal
2020-12-03 15:20                     ` Jeff Layton
2020-12-03 17:08                       ` Sargun Dhillon
2020-12-03 17:50                         ` Jeff Layton
2020-12-03 20:43                           ` Vivek Goyal
2020-12-03 21:36                             ` Jeff Layton
2020-12-03 22:24                               ` Vivek Goyal
2020-12-03 23:36                                 ` Jeff Layton
2020-12-04  6:45                                   ` Amir Goldstein
2020-12-04 15:00                                     ` Vivek Goyal
2020-12-03 20:31                       ` Vivek Goyal
2020-12-02 18:49       ` Sargun Dhillon
2020-12-02 19:10         ` Jeff Layton
2020-12-03 10:36         ` Amir Goldstein
2020-12-02 17:41   ` Amir Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).