linux-unionfs.vger.kernel.org archive mirror
* [PATCH v2 0/4] Make overlayfs volatile mounts reusable
@ 2020-11-27  9:20 Sargun Dhillon
  2020-11-27  9:20 ` [PATCH v2 1/4] fs: Add s_instance_id field to superblock for unique identification Sargun Dhillon
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Sargun Dhillon @ 2020-11-27  9:20 UTC (permalink / raw)
  To: linux-unionfs, miklos, Alexander Viro, Amir Goldstein
  Cc: Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh,
	linux-fsdevel, David Howells

This adds some documentation on how the current behaviour of overlayfs
works, along with some non-obvious caveats. We may want to pick those
documentation patches up independently.

The volatile option is great for "ephemeral" containers. Unfortunately,
it doesn't capture all uses. There are two ways to use it safely right now:

1. Throw away the entire upperdir between mounts
2. Manually syncfs between mounts

For certain use-cases like serverless, or short-lived containers, it is
advantageous to be able to stop the container (runtime) and start it up on
demand / invocation of the function. Usually, there is some bootstrap
process which involves downloading some artifacts, or putting secrets on
disk, and then upon invocation of the function, you want to (re)start the
container.

If you have to syncfs every time you do this, it can add excess filesystem
overhead for all of the other containers on the machine and stall out every
container whose upperdir is on the same underlying filesystem, unless your
filesystem offers something like subvolumes and sync can be restricted to a
subvolume.

The kernel has information that it can use to determine whether or not this
is safe -- primarily if the underlying FS has had writeback errors or not.
Overlayfs doesn't delay writes, so the consistency of the upperdir is not
contingent on the mount of overlayfs, but rather the mount of the
underlying filesystem. It can also make sure the underlying filesystem
wasn't remounted. Although it was suggested that we derive this
information from the upperdir's inode[1], we can instead checkpoint this
data on disk in an xattr.

Specifically we checkpoint:
  * Superblock "id": This is a new concept introduced in one of the patches
    which keeps track of (re)mounts of filesystems by having a per-boot,
    monotonically increasing integer identifying the superblock. This is
    safer than trying to obfuscate the pointer and putting it into an
    xattr (due to leak risk and address reuse), and during the course
    of a boot, the u64 should not wrap.
  * Overlay "boot id": This is a new UUID that is overlayfs specific,
    as overlayfs is a module that's independent from the rest of the
    system and can be (re)loaded independently -- thus it generates
    a UUID at load time which can be used to uniquely identify it.
  * upperdir / workdir errseq: A sample of the errseq_t on the workdir /
    upperdir's superblock. Since the errseq_t is implemented as a u32
    with errno + error counter, we can safely store it in a checkpoint.

This has to be done in kernelspace because userspace does not have access
to information such as superblock (re)mounts, the writeback errseq value,
and whether the module has been (re)loaded. Although this mechanism only
determines a subset of the error scenarios, it lays out the groundwork
for adding more.

In the future, we may want to make overlayfs shut down on errors, or adopt
some other behaviour when errors are detected. If we want per-file error
tracking, a piece of metadata can be added to each file on the upperdir
recording the errseq_t for just that file; we could then check it on each
open / close and allow the user to make an intelligent decision about how
they want overlayfs to handle the error. This would allow errors to be
captured during normal operation as opposed to just between remounts.

This also allows for:
volatile -> volatile
non-volatile -> volatile
non-volatile -> non-volatile

But, for now, it prevents volatile -> non-volatile. Perhaps future patches
around selective inode sync and tracking errors on every interaction will
make this possible.


Changes since:
V1 [2]:
 * Add documentation commit about shortcomings of the current volatile
   implementation.
 * Do not allow volatile mounts to be mounted as non-volatile
RFC [3]:
 * Changed data structure names
 * No longer delete the volatile dir / dirty file
 * Moved xattr to volatile directory
 * Split errseq check

[1]: https://lore.kernel.org/linux-unionfs/CAOQ4uxhadzC3-kh-igfxv3pAmC3ocDtAQTxByu4hrn8KtZuieQ@mail.gmail.com/
[2]: https://lore.kernel.org/linux-unionfs/20201125104621.18838-1-sargun@sargun.me/T/#t
[3]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#t

Sargun Dhillon (4):
  fs: Add s_instance_id field to superblock for unique identification
  overlay: Document current outstanding shortcoming of volatile
  overlay: Add the ability to remount volatile directories when safe
  overlay: Add rudimentary checking of writeback errseq on volatile
    remount

 Documentation/filesystems/overlayfs.rst |  24 ++++--
 fs/overlayfs/overlayfs.h                |  38 ++++++++-
 fs/overlayfs/readdir.c                  | 104 +++++++++++++++++++++---
 fs/overlayfs/super.c                    |  74 +++++++++++++----
 fs/overlayfs/util.c                     |   2 +
 fs/super.c                              |   3 +
 include/linux/fs.h                      |   7 ++
 7 files changed, 213 insertions(+), 39 deletions(-)


base-commit: be4df0cea08a8b59eb38d73de988b7ba8022df41
-- 
2.25.1



* [PATCH v2 1/4] fs: Add s_instance_id field to superblock for unique identification
  2020-11-27  9:20 [PATCH v2 0/4] Make overlayfs volatile mounts reusable Sargun Dhillon
@ 2020-11-27  9:20 ` Sargun Dhillon
  2020-11-27  9:20 ` [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile Sargun Dhillon
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 31+ messages in thread
From: Sargun Dhillon @ 2020-11-27  9:20 UTC (permalink / raw)
  To: linux-unionfs, miklos, Alexander Viro, Amir Goldstein
  Cc: Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh,
	linux-fsdevel, David Howells

This assigns a per-boot unique number to each superblock. This allows
other components to know whether a filesystem has been remounted
since they last interacted with it.

At every boot it is reset to 0. There is no specific reason it starts at 0
other than repeatability, versus using some random starting number. Because
of this, you must store it alongside some other piece of data which is
initialized at boot time.

This doesn't have any of the overhead of an idr, and a u64 won't wrap any
time soon. There is no forward-lookup requirement, so an idr is not needed.

In the future, we may want to expose this to userspace. Userspace programs
can benefit from this if they have large chunks of dirty or mmapped memory
that they're interacting with and want to see if that volume has been
unmounted and remounted. Along with this, and a mechanism to inspect the
superblock's errseq, a user can determine whether they need to throw away
their cache or similar. This is another benefit in comparison to just
using a pointer to the superblock to uniquely identify it.

Although this doesn't expose an ioctl or similar yet, in the future we
could add an ioctl that allows for fetching the s_instance_id for a given
cache, and inspection of the errseq associated with that.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-unionfs@vger.kernel.org
---
 fs/super.c         | 3 +++
 include/linux/fs.h | 7 +++++++
 2 files changed, 10 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 904459b35119..e47ace7f8c3d 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -42,6 +42,7 @@
 
 static int thaw_super_locked(struct super_block *sb);
 
+static u64 s_instance_id_counter;
 static LIST_HEAD(super_blocks);
 static DEFINE_SPINLOCK(sb_lock);
 
@@ -546,6 +547,7 @@ struct super_block *sget_fc(struct fs_context *fc,
 	s->s_iflags |= fc->s_iflags;
 	strlcpy(s->s_id, s->s_type->name, sizeof(s->s_id));
 	list_add_tail(&s->s_list, &super_blocks);
+	s->s_instance_id = s_instance_id_counter++;
 	hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
 	spin_unlock(&sb_lock);
 	get_filesystem(s->s_type);
@@ -625,6 +627,7 @@ struct super_block *sget(struct file_system_type *type,
 	s->s_type = type;
 	strlcpy(s->s_id, type->name, sizeof(s->s_id));
 	list_add_tail(&s->s_list, &super_blocks);
+	s->s_instance_id = s_instance_id_counter++;
 	hlist_add_head(&s->s_instances, &type->fs_supers);
 	spin_unlock(&sb_lock);
 	get_filesystem(type);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7519ae003a08..09bf54382a54 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1472,6 +1472,13 @@ struct super_block {
 	char			s_id[32];	/* Informational name */
 	uuid_t			s_uuid;		/* UUID */
 
+	/*
+	 * ID identifying this particular instance of the superblock. It can
+	 * be used to determine if a particular filesystem has been remounted.
+	 * It may be exposed to userspace.
+	 */
+	u64			s_instance_id;
+
 	unsigned int		s_max_links;
 	fmode_t			s_mode;
 
-- 
2.25.1



* [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-27  9:20 [PATCH v2 0/4] Make overlayfs volatile mounts reusable Sargun Dhillon
  2020-11-27  9:20 ` [PATCH v2 1/4] fs: Add s_instance_id field to superblock for unique identification Sargun Dhillon
@ 2020-11-27  9:20 ` Sargun Dhillon
  2020-11-27 12:52   ` Amir Goldstein
  2020-11-27  9:20 ` [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon
  2020-11-27  9:20 ` [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount Sargun Dhillon
  3 siblings, 1 reply; 31+ messages in thread
From: Sargun Dhillon @ 2020-11-27  9:20 UTC (permalink / raw)
  To: linux-unionfs, miklos, Alexander Viro, Amir Goldstein
  Cc: Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh,
	linux-fsdevel, David Howells

This documents behaviour that was discussed in a thread about the volatile
feature. Specifically, how failures can go unreported for asynchronous
writes (such as from mmap, or writes that are not immediately flushed to
the filesystem). Although we pass through calls like msync, fallocate, and
write, and will still return errors on those, this doesn't guarantee that
all kinds of errors will surface at those times, and thus errors may stay
hidden.

In the future, we can add error checking to all interactions with the
upperdir, and pass through errseq_t from the upperdir on mappings,
and other interactions with the filesystem[1].

[1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-unionfs@vger.kernel.org
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 Documentation/filesystems/overlayfs.rst | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 580ab9a0fe31..c6e30c1bc2f2 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -570,7 +570,11 @@ Volatile mount
 This is enabled with the "volatile" mount option.  Volatile mounts are not
 guaranteed to survive a crash.  It is strongly recommended that volatile
 mounts are only used if data written to the overlay can be recreated
-without significant effort.
+without significant effort.  In addition to this, the sync family of syscalls
+are not sufficient to determine whether a write failed as sync calls are
+omitted.  For this reason, it is important that the filesystem used by the
+upperdir handles failure in a fashion that's suitable for the user.  For
+example, upon detecting a fault, ext4 can be configured to panic.
 
 The advantage of mounting with the "volatile" option is that all forms of
 sync calls to the upper filesystem are omitted.
-- 
2.25.1



* [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe
  2020-11-27  9:20 [PATCH v2 0/4] Make overlayfs volatile mounts reusable Sargun Dhillon
  2020-11-27  9:20 ` [PATCH v2 1/4] fs: Add s_instance_id field to superblock for unique identification Sargun Dhillon
  2020-11-27  9:20 ` [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile Sargun Dhillon
@ 2020-11-27  9:20 ` Sargun Dhillon
  2020-11-27 11:09   ` kernel test robot
                     ` (2 more replies)
  2020-11-27  9:20 ` [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount Sargun Dhillon
  3 siblings, 3 replies; 31+ messages in thread
From: Sargun Dhillon @ 2020-11-27  9:20 UTC (permalink / raw)
  To: linux-unionfs, miklos, Alexander Viro, Amir Goldstein
  Cc: Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh,
	linux-fsdevel, David Howells

Overlayfs added the ability to set up mounts where all syncs can be
short-circuited (2a99ddacee43: ovl: provide a mount option "volatile").

A user might want to remount this fs, but we do not let the user because
of the "incompat" detection feature. In the case of volatile, it is safe
to do something like[1]:

$ sync -f /root/upperdir
$ rm -rf /root/workdir/incompat/volatile

There are two ways to go about this. You can call sync on the underlying
filesystem, check the error code, and delete the dirty file if everything
is clean. If you're running lots of containers on the same filesystem, or
you want to avoid all unnecessary I/O, this may be suboptimal.

Alternatively, you can blindly delete the dirty file, and "hope for the
best".
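The "sync, check, then delete" approach amounts to gating removal of the marker on sync's exit status. A sketch, with illustrative default paths standing in for the real upperdir and workdir:

```shell
#!/bin/sh
# Only clear the volatile marker when flushing the upperdir's filesystem
# reported no errors.  Pass the real upperdir and workdir as arguments;
# the defaults here are placeholders.
upper=${1:-.}
work=${2:-./workdir}

if sync -f "$upper"; then
    rm -rf "$work/incompat/volatile"
    echo "volatile marker cleared; upperdir may be reused"
else
    echo "sync reported errors; discard the upperdir" >&2
    exit 1
fi
```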

This patch introduces transparent functionality to check whether it is
(relatively) safe to reuse the upperdir. It ensures that the filesystem
hasn't been remounted, the system hasn't been rebooted, nor has the
overlayfs code changed. Since the structure is explicitly not meant to be
used between different versions of the code, its stability does not matter
so much.

[1]: https://lore.kernel.org/linux-unionfs/CAOQ4uxhKr+j5jFyEC2gJX8E8M19mQ3CqdTYaPZOvDQ9c0qLEzw@mail.gmail.com/T/#m6abe713e4318202ad57f301bf28a414e1d824f9c

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-unionfs@vger.kernel.org
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 Documentation/filesystems/overlayfs.rst | 18 +++--
 fs/overlayfs/overlayfs.h                | 37 +++++++++-
 fs/overlayfs/readdir.c                  | 98 ++++++++++++++++++++++---
 fs/overlayfs/super.c                    | 73 +++++++++++++-----
 fs/overlayfs/util.c                     |  2 +
 5 files changed, 190 insertions(+), 38 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index c6e30c1bc2f2..b485fdb65b85 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -579,13 +579,17 @@ example, upon detecting a fault, ext4 can be configured to panic.
 The advantage of mounting with the "volatile" option is that all forms of
 sync calls to the upper filesystem are omitted.
 
-When overlay is mounted with "volatile" option, the directory
-"$workdir/work/incompat/volatile" is created.  During next mount, overlay
-checks for this directory and refuses to mount if present. This is a strong
-indicator that user should throw away upper and work directories and create
-fresh one. In very limited cases where the user knows that the system has
-not crashed and contents of upperdir are intact, The "volatile" directory
-can be removed.
+When overlay is mounted with the "volatile" option, the directory
+"$workdir/work/incompat/volatile" is created.  This acts as an indicator
+that the user should throw away upper and work directories and create fresh
+ones.  In some cases, the overlayfs can detect if the upperdir can be
+reused safely in subsequent volatile mounts, and mounting will proceed as
+normal.  If the filesystem is unable to determine if this is safe (due to a
+reboot, upgraded kernel code, or loss of checkpoint, etc...), the user may
+bypass these safety checks and remove the "volatile" directory if they know
+the system did not encounter a fault and the contents of the upperdir are
+intact. Then, the user can remount the filesystem as normal.
+
 
 Testsuite
 ---------
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index f8880aa2ba0e..de694ee99d7c 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -32,8 +32,13 @@ enum ovl_xattr {
 	OVL_XATTR_NLINK,
 	OVL_XATTR_UPPER,
 	OVL_XATTR_METACOPY,
+	OVL_XATTR_VOLATILE,
 };
 
+#define OVL_INCOMPATDIR_NAME "incompat"
+#define OVL_VOLATILEDIR_NAME "volatile"
+#define OVL_VOLATILE_DIRTY_NAME "dirty"
+
 enum ovl_inode_flag {
 	/* Pure upper dir that may contain non pure upper entries */
 	OVL_IMPURE,
@@ -57,6 +62,31 @@ enum {
 	OVL_XINO_ON,
 };
 
+/*
+ * This is copied into the volatile xattr, and the user does not interact with
+ * it. There is no stability requirement, as a reboot explicitly invalidates
+ * a volatile workdir. It is explicitly meant not to be a stable api.
+ *
+ * Although this structure isn't meant to be stable it is exposed to potentially
+ * unprivileged users. We don't do any kind of cryptographic operations with
+ * the structure, so it could be tampered with, or inspected. Don't put
+ * kernel memory pointers in it, or anything else that could cause problems,
+ * or information disclosure.
+ */
+struct ovl_volatile_info {
+	/*
+	 * This uniquely identifies a boot, and is reset if overlayfs itself
+	 * is reloaded. Therefore we check our current / known boot_id
+	 * against this before looking at any other fields to validate:
+ *    1. Is this data structure laid out in the way we expect? (Overlayfs
+	 *    module, reboot, etc...)
+	 * 2. Could something have changed (like the s_instance_id counter
+	 *    resetting)
+	 */
+	uuid_t		ovl_boot_id;	/* Must stay first member */
+	u64		s_instance_id;
+} __packed;
+
 /*
  * The tuple (fh,uuid) is a universal unique identifier for a copy up origin,
  * where:
@@ -422,8 +452,8 @@ void ovl_cleanup_whiteouts(struct dentry *upper, struct list_head *list);
 void ovl_cache_free(struct list_head *list);
 void ovl_dir_cache_free(struct inode *inode);
 int ovl_check_d_type_supported(struct path *realpath);
-int ovl_workdir_cleanup(struct inode *dir, struct vfsmount *mnt,
-			struct dentry *dentry, int level);
+int ovl_workdir_cleanup(struct ovl_fs *ofs, struct inode *dir,
+			struct vfsmount *mnt, struct dentry *dentry, int level);
 int ovl_indexdir_cleanup(struct ovl_fs *ofs);
 
 /* inode.c */
@@ -520,3 +550,6 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
 
 /* export.c */
 extern const struct export_operations ovl_export_operations;
+
+/* super.c */
+extern uuid_t ovl_boot_id;
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 01620ebae1bd..7b66fbb20261 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -1080,10 +1080,78 @@ int ovl_check_d_type_supported(struct path *realpath)
 
 	return rdd.d_type_supported;
 }
+static int ovl_verify_volatile_info(struct ovl_fs *ofs,
+				    struct dentry *volatiledir)
+{
+	int err;
+	struct ovl_volatile_info info;
+
+	if (!volatiledir->d_inode)
+		return 0;
+
+	if (!ofs->config.ovl_volatile) {
+		pr_debug("Mount is not volatile; upperdir is marked volatile\n");
+		return -EINVAL;
+	}
+
+	err = ovl_do_getxattr(ofs, volatiledir, OVL_XATTR_VOLATILE, &info,
+			      sizeof(info));
+	if (err < 0) {
+		pr_debug("Unable to read volatile xattr: %d\n", err);
+		return -EINVAL;
+	}
+
+	if (err != sizeof(info)) {
+		pr_debug("%s xattr on-disk size is %d expected to read %zd\n",
+			 ovl_xattr(ofs, OVL_XATTR_VOLATILE), err, sizeof(info));
+		return -EINVAL;
+	}
+
+	if (!uuid_equal(&ovl_boot_id, &info.ovl_boot_id)) {
+		pr_debug("boot id has changed (reboot or module reloaded)\n");
+		return -EINVAL;
+	}
+
+	if (volatiledir->d_sb->s_instance_id != info.s_instance_id) {
+		pr_debug("workdir has been unmounted and remounted\n");
+		return -EINVAL;
+	}
+
+	return 1;
+}
 
-#define OVL_INCOMPATDIR_NAME "incompat"
+/*
+ * ovl_check_incompat checks this specific incompat entry for incompatibility.
+ * If it is found to be incompatible -EINVAL will be returned.
+ *
+ * If the directory should be preserved, then this function returns 1.
+ */
+static int ovl_check_incompat(struct ovl_fs *ofs, struct ovl_cache_entry *p,
+			      struct path *path)
+{
+	int err = -EINVAL;
+	struct dentry *d;
+
+	if (!strcmp(p->name, OVL_VOLATILEDIR_NAME)) {
+		d = lookup_one_len(p->name, path->dentry, p->len);
+		if (IS_ERR(d))
+			return PTR_ERR(d);
+
+		err = ovl_verify_volatile_info(ofs, d);
+		dput(d);
+	}
+
+	if (err == -EINVAL)
+		pr_err("incompat feature '%s' cannot be mounted\n", p->name);
+	else
+		pr_debug("incompat '%s' handled: %d\n", p->name, err);
+
+	dput(d);
+	return err;
+}
 
-static int ovl_workdir_cleanup_recurse(struct path *path, int level)
+static int ovl_workdir_cleanup_recurse(struct ovl_fs *ofs, struct path *path,
+				       int level)
 {
 	int err;
 	struct inode *dir = path->dentry->d_inode;
@@ -1125,16 +1193,19 @@ static int ovl_workdir_cleanup_recurse(struct path *path, int level)
 			if (p->len == 2 && p->name[1] == '.')
 				continue;
 		} else if (incompat) {
-			pr_err("overlay with incompat feature '%s' cannot be mounted\n",
-				p->name);
-			err = -EINVAL;
-			break;
+			err = ovl_check_incompat(ofs, p, path);
+			if (err < 0)
+				break;
+			/* Skip cleaning this */
+			if (err == 1)
+				continue;
 		}
 		dentry = lookup_one_len(p->name, path->dentry, p->len);
 		if (IS_ERR(dentry))
 			continue;
 		if (dentry->d_inode)
-			err = ovl_workdir_cleanup(dir, path->mnt, dentry, level);
+			err = ovl_workdir_cleanup(ofs, dir, path->mnt, dentry,
+						  level);
 		dput(dentry);
 		if (err)
 			break;
@@ -1142,11 +1213,13 @@ static int ovl_workdir_cleanup_recurse(struct path *path, int level)
 	inode_unlock(dir);
 out:
 	ovl_cache_free(&list);
+	if (incompat && err >= 0)
+		return 1;
 	return err;
 }
 
-int ovl_workdir_cleanup(struct inode *dir, struct vfsmount *mnt,
-			 struct dentry *dentry, int level)
+int ovl_workdir_cleanup(struct ovl_fs *ofs, struct inode *dir,
+			struct vfsmount *mnt, struct dentry *dentry, int level)
 {
 	int err;
 
@@ -1159,7 +1232,7 @@ int ovl_workdir_cleanup(struct inode *dir, struct vfsmount *mnt,
 		struct path path = { .mnt = mnt, .dentry = dentry };
 
 		inode_unlock(dir);
-		err = ovl_workdir_cleanup_recurse(&path, level + 1);
+		err = ovl_workdir_cleanup_recurse(ofs, &path, level + 1);
 		inode_lock_nested(dir, I_MUTEX_PARENT);
 		if (!err)
 			err = ovl_cleanup(dir, dentry);
@@ -1206,9 +1279,10 @@ int ovl_indexdir_cleanup(struct ovl_fs *ofs)
 		}
 		/* Cleanup leftover from index create/cleanup attempt */
 		if (index->d_name.name[0] == '#') {
-			err = ovl_workdir_cleanup(dir, path.mnt, index, 1);
-			if (err)
+			err = ovl_workdir_cleanup(ofs, dir, path.mnt, index, 1);
+			if (err < 0)
 				break;
+			err = 0;
 			goto next;
 		}
 		err = ovl_verify_index(ofs, index);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..a8ee3ba4ebbd 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -15,6 +15,7 @@
 #include <linux/seq_file.h>
 #include <linux/posix_acl_xattr.h>
 #include <linux/exportfs.h>
+#include <linux/uuid.h>
 #include "overlayfs.h"
 
 MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
@@ -23,6 +24,7 @@ MODULE_LICENSE("GPL");
 
 
 struct ovl_dir_cache;
+uuid_t ovl_boot_id;
 
 #define OVL_MAX_STACK 500
 
@@ -722,20 +724,24 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
 				goto out_unlock;
 
 			retried = true;
-			err = ovl_workdir_cleanup(dir, mnt, work, 0);
-			dput(work);
-			if (err == -EINVAL) {
-				work = ERR_PTR(err);
-				goto out_unlock;
+			err = ovl_workdir_cleanup(ofs, dir, mnt, work, 0);
+			/* check if we should reuse the workdir */
+			if (err != 1) {
+				dput(work);
+				if (err == -EINVAL) {
+					work = ERR_PTR(err);
+					goto out_unlock;
+				}
+				goto retry;
 			}
-			goto retry;
+		} else {
+			work = ovl_create_real(dir, work,
+					       OVL_CATTR(attr.ia_mode));
+			err = PTR_ERR(work);
+			if (IS_ERR(work))
+				goto out_err;
 		}
 
-		work = ovl_create_real(dir, work, OVL_CATTR(attr.ia_mode));
-		err = PTR_ERR(work);
-		if (IS_ERR(work))
-			goto out_err;
-
 		/*
 		 * Try to remove POSIX ACL xattrs from workdir.  We are good if:
 		 *
@@ -1237,26 +1243,58 @@ static struct dentry *ovl_lookup_or_create(struct dentry *parent,
 	return child;
 }
 
+static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir)
+{
+	int err;
+	struct ovl_volatile_info info = {
+		.s_instance_id = volatiledir->d_sb->s_instance_id,
+	};
+
+	uuid_copy(&info.ovl_boot_id, &ovl_boot_id);
+	err = ovl_do_setxattr(ofs, volatiledir, OVL_XATTR_VOLATILE, &info,
+			      sizeof(info));
+
+	if (err == -EOPNOTSUPP)
+		return 0;
+
+	return err;
+}
+
 /*
  * Creates $workdir/work/incompat/volatile/dirty file if it is not already
  * present.
  */
 static int ovl_create_volatile_dirty(struct ovl_fs *ofs)
 {
+	int err;
 	unsigned int ctr;
-	struct dentry *d = dget(ofs->workbasedir);
+	struct dentry *volatiledir, *d = dget(ofs->workbasedir);
 	static const char *const volatile_path[] = {
-		OVL_WORKDIR_NAME, "incompat", "volatile", "dirty"
+		OVL_WORKDIR_NAME,
+		OVL_INCOMPATDIR_NAME,
+		OVL_VOLATILEDIR_NAME,
 	};
 	const char *const *name = volatile_path;
 
-	for (ctr = ARRAY_SIZE(volatile_path); ctr; ctr--, name++) {
-		d = ovl_lookup_or_create(d, *name, ctr > 1 ? S_IFDIR : S_IFREG);
+	/* Create the volatile subdirectory that we put the xattr on */
+	for (ctr = 0; ctr < ARRAY_SIZE(volatile_path); ctr++, name++) {
+		d = ovl_lookup_or_create(d, *name, S_IFDIR);
 		if (IS_ERR(d))
 			return PTR_ERR(d);
 	}
-	dput(d);
-	return 0;
+	volatiledir = dget(d);
+
+	/* Ensure the dirty file exists before we set the xattr */
+	d = ovl_lookup_or_create(d, OVL_VOLATILE_DIRTY_NAME, S_IFREG);
+	if (!IS_ERR(d)) {
+		dput(d);
+		err = ovl_set_volatile_info(ofs, volatiledir);
+	} else {
+		err = PTR_ERR(d);
+	}
+
+	dput(volatiledir);
+	return err;
 }
 
 static int ovl_make_workdir(struct super_block *sb, struct ovl_fs *ofs,
@@ -2044,6 +2082,7 @@ static int __init ovl_init(void)
 {
 	int err;
 
+	uuid_gen(&ovl_boot_id);
 	ovl_inode_cachep = kmem_cache_create("ovl_inode",
 					     sizeof(struct ovl_inode), 0,
 					     (SLAB_RECLAIM_ACCOUNT|
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 23f475627d07..87c9f5a063ed 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -580,6 +580,7 @@ bool ovl_check_dir_xattr(struct super_block *sb, struct dentry *dentry,
 #define OVL_XATTR_NLINK_POSTFIX		"nlink"
 #define OVL_XATTR_UPPER_POSTFIX		"upper"
 #define OVL_XATTR_METACOPY_POSTFIX	"metacopy"
+#define OVL_XATTR_VOLATILE_POSTFIX	"volatile"
 
 #define OVL_XATTR_TAB_ENTRY(x) \
 	[x] = OVL_XATTR_PREFIX x ## _POSTFIX
@@ -592,6 +593,7 @@ const char *ovl_xattr_table[] = {
 	OVL_XATTR_TAB_ENTRY(OVL_XATTR_NLINK),
 	OVL_XATTR_TAB_ENTRY(OVL_XATTR_UPPER),
 	OVL_XATTR_TAB_ENTRY(OVL_XATTR_METACOPY),
+	OVL_XATTR_TAB_ENTRY(OVL_XATTR_VOLATILE),
 };
 
 int ovl_check_setxattr(struct dentry *dentry, struct dentry *upperdentry,
-- 
2.25.1



* [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount
  2020-11-27  9:20 [PATCH v2 0/4] Make overlayfs volatile mounts reusable Sargun Dhillon
                   ` (2 preceding siblings ...)
  2020-11-27  9:20 ` [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon
@ 2020-11-27  9:20 ` Sargun Dhillon
  2020-11-30 18:43   ` Vivek Goyal
                     ` (2 more replies)
  3 siblings, 3 replies; 31+ messages in thread
From: Sargun Dhillon @ 2020-11-27  9:20 UTC (permalink / raw)
  To: linux-unionfs, miklos, Alexander Viro, Amir Goldstein
  Cc: Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh,
	linux-fsdevel, David Howells

Volatile remounts validate the following at the moment:
 * Has the module been reloaded / the system rebooted
 * Has the workdir been remounted

This adds a new check for errors detected via the superblock's
errseq_t. At mount time, the errseq_t is snapshotted to disk, and upon
remount it is re-verified. This allows for kernel-level detection of
errors without forcing userspace to perform a sync, and catches writeback
errors that would otherwise stay hidden.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-unionfs@vger.kernel.org
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 fs/overlayfs/overlayfs.h | 1 +
 fs/overlayfs/readdir.c   | 6 ++++++
 fs/overlayfs/super.c     | 1 +
 3 files changed, 8 insertions(+)

diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index de694ee99d7c..e8a711953b64 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -85,6 +85,7 @@ struct ovl_volatile_info {
 	 */
 	uuid_t		ovl_boot_id;	/* Must stay first member */
 	u64		s_instance_id;
+	errseq_t	errseq;	/* Implemented as a u32 */
 } __packed;
 
 /*
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 7b66fbb20261..5795b28bb4cf 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
 		return -EINVAL;
 	}
 
+	err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
+	if (err) {
+		pr_debug("Workdir filesystem reports errors: %d\n", err);
+		return -EINVAL;
+	}
+
 	return 1;
 }
 
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index a8ee3ba4ebbd..2e473f8c75dd 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -1248,6 +1248,7 @@ static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir)
 	int err;
 	struct ovl_volatile_info info = {
 		.s_instance_id = volatiledir->d_sb->s_instance_id,
+		.errseq = errseq_sample(&volatiledir->d_sb->s_wb_err),
 	};
 
 	uuid_copy(&info.ovl_boot_id, &ovl_boot_id);
-- 
2.25.1



* Re: [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe
  2020-11-27  9:20 ` [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon
@ 2020-11-27 11:09   ` kernel test robot
  2020-11-27 13:04   ` Amir Goldstein
  2020-12-07 11:39   ` Dan Carpenter
  2 siblings, 0 replies; 31+ messages in thread
From: kernel test robot @ 2020-11-27 11:09 UTC (permalink / raw)
  To: Sargun Dhillon, linux-unionfs, miklos, Alexander Viro, Amir Goldstein
  Cc: kbuild-all, clang-built-linux, Sargun Dhillon, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells


Hi Sargun,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on be4df0cea08a8b59eb38d73de988b7ba8022df41]

url:    https://github.com/0day-ci/linux/commits/Sargun-Dhillon/Make-overlayfs-volatile-mounts-reusable/20201127-172416
base:    be4df0cea08a8b59eb38d73de988b7ba8022df41
config: powerpc-randconfig-r034-20201127 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project f095ac11a9550530a4a54298debb8b04b36422be)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install powerpc cross compiling tool for clang build
        # apt-get install binutils-powerpc-linux-gnu
        # https://github.com/0day-ci/linux/commit/a9871ab1de6c660e6a4c49fcffdb666851003b35
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Sargun-Dhillon/Make-overlayfs-volatile-mounts-reusable/20201127-172416
        git checkout a9871ab1de6c660e6a4c49fcffdb666851003b35
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/overlayfs/readdir.c:1135:6: warning: variable 'd' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
           if (!strcmp(p->name, OVL_VOLATILEDIR_NAME)) {
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/overlayfs/readdir.c:1149:7: note: uninitialized use occurs here
           dput(d);
                ^
   fs/overlayfs/readdir.c:1135:2: note: remove the 'if' if its condition is always true
           if (!strcmp(p->name, OVL_VOLATILEDIR_NAME)) {
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/overlayfs/readdir.c:1133:18: note: initialize the variable 'd' to silence this warning
           struct dentry *d;
                           ^
                            = NULL
   1 warning generated.

vim +1135 fs/overlayfs/readdir.c

  1122	
  1123	/*
  1124	 * ovl_check_incompat checks this specific incompat entry for incompatibility.
  1125	 * If it is found to be incompatible -EINVAL will be returned.
  1126	 *
  1127	 * If the directory should be preserved, then this function returns 1.
  1128	 */
  1129	static int ovl_check_incompat(struct ovl_fs *ofs, struct ovl_cache_entry *p,
  1130				      struct path *path)
  1131	{
  1132		int err = -EINVAL;
  1133		struct dentry *d;
  1134	
> 1135		if (!strcmp(p->name, OVL_VOLATILEDIR_NAME)) {
  1136			d = lookup_one_len(p->name, path->dentry, p->len);
  1137			if (IS_ERR(d))
  1138				return PTR_ERR(d);
  1139	
  1140			err = ovl_verify_volatile_info(ofs, d);
  1141			dput(d);
  1142		}
  1143	
  1144		if (err == -EINVAL)
  1145			pr_err("incompat feature '%s' cannot be mounted\n", p->name);
  1146		else
  1147			pr_debug("incompat '%s' handled: %d\n", p->name, err);
  1148	
  1149		dput(d);
  1150		return err;
  1151	}
  1152	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 24138 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-27  9:20 ` [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile Sargun Dhillon
@ 2020-11-27 12:52   ` Amir Goldstein
  2020-11-27 22:11     ` Sargun Dhillon
  0 siblings, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2020-11-27 12:52 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells

On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
>
> This documents behaviour that was discussed in a thread about the volatile
> feature. Specifically, how failures can go hidden from asynchronous writes
> (such as from mmap, or writes that are not immediately flushed to the
> filesystem). Although we pass through calls like msync, fallocate, and
> write, and will still return errors on those, it doesn't guarantee all
> kinds of errors will happen at those times, and thus may hide errors.
>
> In the future, we can add error checking to all interactions with the
> upperdir, and pass through errseq_t from the upperdir on mappings,
> and other interactions with the filesystem[1].
>
> [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
>
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> ---
>  Documentation/filesystems/overlayfs.rst | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> index 580ab9a0fe31..c6e30c1bc2f2 100644
> --- a/Documentation/filesystems/overlayfs.rst
> +++ b/Documentation/filesystems/overlayfs.rst
> @@ -570,7 +570,11 @@ Volatile mount
>  This is enabled with the "volatile" mount option.  Volatile mounts are not
>  guaranteed to survive a crash.  It is strongly recommended that volatile
>  mounts are only used if data written to the overlay can be recreated
> -without significant effort.
> +without significant effort.  In addition to this, the sync family of syscalls
> +are not sufficient to determine whether a write failed as sync calls are
> +omitted.  For this reason, it is important that the filesystem used by the
> +upperdir handles failure in a fashion that's suitable for the user.  For
> +example, upon detecting a fault, ext4 can be configured to panic.
>

Reading this now, I think I may have wrongly analysed the issue.
Specifically, when I wrote that the very minimum is to document the
issue, it was under the assumption that a proper fix is hard.
I think I was wrong, and that the very minimum is to check the errseq
recorded since mount on the fsync and syncfs calls.

Why? First of all, because it is very simple and goes a long way toward
fixing the broken contract with applications -- not the contract about
durability, obviously, but the contract that write+fsync+read is expected
to find the written data (during the same mount era).

Second, because the sentence you added above is hard for users to understand
and out of context. If we properly handle the writeback error in fsync/syncfs,
then this sentence becomes irrelevant.
The fact that ext4 can lose data if the application did not fsync after
write is true for volatile as well as non-volatile mounts, and it is
therefore not relevant in the context of overlayfs documentation at all.

Am I wrong saying that it is very very simple to fix?
Would you mind making that fix at the bottom of the patch series, so it can
be easily applied to stable kernels?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe
  2020-11-27  9:20 ` [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon
  2020-11-27 11:09   ` kernel test robot
@ 2020-11-27 13:04   ` Amir Goldstein
  2020-12-07 11:39   ` Dan Carpenter
  2 siblings, 0 replies; 31+ messages in thread
From: Amir Goldstein @ 2020-11-27 13:04 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells

On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
>
> Overlayfs added the ability to set up mounts where all syncs can be
> short-circuited (2a99ddacee43: ovl: provide a mount option "volatile").
>
> A user might want to remount this fs, but we do not allow it because of
> the "incompat" detection feature. In the case of volatile, it is safe
> to do something like[1]:
>
> $ sync -f /root/upperdir
> $ rm -rf /root/workdir/incompat/volatile
>
> There are two ways to go about this. You can call sync on the underlying
> filesystem, check the error code, and delete the dirty file if everything
> is clean. If you're running lots of containers on the same filesystem, or
> you want to avoid all unnecessary I/O, this may be suboptimal.
>
> Alternatively, you can blindly delete the dirty file, and "hope for the
> best".
>
> This patch introduces transparent functionality to check if it is
> (relatively) safe to reuse the upperdir. It ensures that the filesystem
> hasn't been remounted, the system hasn't been rebooted, nor has the
> overlayfs code changed. Since the structure is explicitly not meant to be
> used between different versions of the code, its stability does not matter
> so much.
>
> [1]: https://lore.kernel.org/linux-unionfs/CAOQ4uxhKr+j5jFyEC2gJX8E8M19mQ3CqdTYaPZOvDQ9c0qLEzw@mail.gmail.com/T/#m6abe713e4318202ad57f301bf28a414e1d824f9c
>
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>

You may add:
Reviewed-by: Amir Goldstein <amir73il@gmail.com>

> ---
>  Documentation/filesystems/overlayfs.rst | 18 +++--
>  fs/overlayfs/overlayfs.h                | 37 +++++++++-
>  fs/overlayfs/readdir.c                  | 98 ++++++++++++++++++++++---
>  fs/overlayfs/super.c                    | 73 +++++++++++++-----
>  fs/overlayfs/util.c                     |  2 +
>  5 files changed, 190 insertions(+), 38 deletions(-)
>
> diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> index c6e30c1bc2f2..b485fdb65b85 100644
> --- a/Documentation/filesystems/overlayfs.rst
> +++ b/Documentation/filesystems/overlayfs.rst
> @@ -579,13 +579,17 @@ example, upon detecting a fault, ext4 can be configured to panic.
>  The advantage of mounting with the "volatile" option is that all forms of
>  sync calls to the upper filesystem are omitted.
>
> -When overlay is mounted with "volatile" option, the directory
> -"$workdir/work/incompat/volatile" is created.  During next mount, overlay
> -checks for this directory and refuses to mount if present. This is a strong
> -indicator that user should throw away upper and work directories and create
> -fresh one. In very limited cases where the user knows that the system has
> -not crashed and contents of upperdir are intact, The "volatile" directory
> -can be removed.
> +When overlay is mounted with the "volatile" option, the directory
> +"$workdir/work/incompat/volatile" is created.  This acts as an indicator
> +that the user should throw away upper and work directories and create fresh
> +ones.  In some cases, overlayfs can detect whether the upperdir can be
> +reused safely in a subsequent volatile mount, and mounting will proceed as
> +normal.  If the filesystem is unable to determine if this is safe (due to a
> +reboot, upgraded kernel code, or loss of checkpoint, etc...), the user may
> +bypass these safety checks and remove the "volatile" directory if they know
> +the system did not encounter a fault and the contents of the upperdir are
> +intact. Then, the user can remount the filesystem as normal.
> +
>
>  Testsuite
>  ---------
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index f8880aa2ba0e..de694ee99d7c 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -32,8 +32,13 @@ enum ovl_xattr {
>         OVL_XATTR_NLINK,
>         OVL_XATTR_UPPER,
>         OVL_XATTR_METACOPY,
> +       OVL_XATTR_VOLATILE,
>  };
>
> +#define OVL_INCOMPATDIR_NAME "incompat"
> +#define OVL_VOLATILEDIR_NAME "volatile"
> +#define OVL_VOLATILE_DIRTY_NAME "dirty"
> +
>  enum ovl_inode_flag {
>         /* Pure upper dir that may contain non pure upper entries */
>         OVL_IMPURE,
> @@ -57,6 +62,31 @@ enum {
>         OVL_XINO_ON,
>  };
>
> +/*
> + * This is copied into the volatile xattr, and the user does not interact with
> + * it. There is no stability requirement, as a reboot explicitly invalidates
> + * a volatile workdir. It is explicitly meant not to be a stable api.
> + *
> + * Although this structure isn't meant to be stable it is exposed to potentially
> + * unprivileged users. We don't do any kind of cryptographic operations with
> + * the structure, so it could be tampered with, or inspected. Don't put
> + * kernel memory pointers in it, or anything else that could cause problems,
> + * or information disclosure.
> + */
> +struct ovl_volatile_info {
> +       /*
> +        * This uniquely identifies a boot, and is reset if overlayfs itself
> +        * is reloaded. Therefore we check our current / known boot_id
> +        * against this before looking at any other fields to validate:
> +        * 1. Is this datastructure laid out in the way we expect? (Overlayfs
> +        *    module, reboot, etc...)
> +        * 2. Could something have changed (like the s_instance_id counter
> +        *    resetting)
> +        */
> +       uuid_t          ovl_boot_id;    /* Must stay first member */
> +       u64             s_instance_id;
> +} __packed;
> +
>  /*
>   * The tuple (fh,uuid) is a universal unique identifier for a copy up origin,
>   * where:
> @@ -422,8 +452,8 @@ void ovl_cleanup_whiteouts(struct dentry *upper, struct list_head *list);
>  void ovl_cache_free(struct list_head *list);
>  void ovl_dir_cache_free(struct inode *inode);
>  int ovl_check_d_type_supported(struct path *realpath);
> -int ovl_workdir_cleanup(struct inode *dir, struct vfsmount *mnt,
> -                       struct dentry *dentry, int level);
> +int ovl_workdir_cleanup(struct ovl_fs *ofs, struct inode *dir,
> +                       struct vfsmount *mnt, struct dentry *dentry, int level);
>  int ovl_indexdir_cleanup(struct ovl_fs *ofs);
>
>  /* inode.c */
> @@ -520,3 +550,6 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
>
>  /* export.c */
>  extern const struct export_operations ovl_export_operations;
> +
> +/* super.c */
> +extern uuid_t ovl_boot_id;
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 01620ebae1bd..7b66fbb20261 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -1080,10 +1080,78 @@ int ovl_check_d_type_supported(struct path *realpath)
>
>         return rdd.d_type_supported;
>  }
> +static int ovl_verify_volatile_info(struct ovl_fs *ofs,
> +                                   struct dentry *volatiledir)
> +{
> +       int err;
> +       struct ovl_volatile_info info;
> +
> +       if (!volatiledir->d_inode)
> +               return 0;
> +
> +       if (!ofs->config.ovl_volatile) {
> +               pr_debug("Mount is not volatile; upperdir is marked volatile\n");
> +               return -EINVAL;
> +       }
> +
> +       err = ovl_do_getxattr(ofs, volatiledir, OVL_XATTR_VOLATILE, &info,
> +                             sizeof(info));
> +       if (err < 0) {
> +               pr_debug("Unable to read volatile xattr: %d\n", err);
> +               return -EINVAL;
> +       }
> +
> +       if (err != sizeof(info)) {
> +               pr_debug("%s xattr on-disk size is %d expected to read %zd\n",
> +                        ovl_xattr(ofs, OVL_XATTR_VOLATILE), err, sizeof(info));
> +               return -EINVAL;
> +       }
> +
> +       if (!uuid_equal(&ovl_boot_id, &info.ovl_boot_id)) {
> +               pr_debug("boot id has changed (reboot or module reloaded)\n");
> +               return -EINVAL;
> +       }
> +
> +       if (volatiledir->d_sb->s_instance_id != info.s_instance_id) {
> +               pr_debug("workdir has been unmounted and remounted\n");
> +               return -EINVAL;
> +       }
> +
> +       return 1;
> +}
>
> -#define OVL_INCOMPATDIR_NAME "incompat"
> +/*
> + * ovl_check_incompat checks this specific incompat entry for incompatibility.
> + * If it is found to be incompatible -EINVAL will be returned.
> + *
> + * If the directory should be preserved, then this function returns 1.
> + */
> +static int ovl_check_incompat(struct ovl_fs *ofs, struct ovl_cache_entry *p,
> +                             struct path *path)
> +{
> +       int err = -EINVAL;
> +       struct dentry *d;
> +
> +       if (!strcmp(p->name, OVL_VOLATILEDIR_NAME)) {
> +               d = lookup_one_len(p->name, path->dentry, p->len);
> +               if (IS_ERR(d))
> +                       return PTR_ERR(d);
> +
> +               err = ovl_verify_volatile_info(ofs, d);
> +               dput(d);
> +       }
> +
> +       if (err == -EINVAL)
> +               pr_err("incompat feature '%s' cannot be mounted\n", p->name);
> +       else
> +               pr_debug("incompat '%s' handled: %d\n", p->name, err);
> +
> +       dput(d);

Leftover.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-27 12:52   ` Amir Goldstein
@ 2020-11-27 22:11     ` Sargun Dhillon
  2020-11-28  2:01       ` Jeff Layton
  2020-11-28  8:56       ` Amir Goldstein
  0 siblings, 2 replies; 31+ messages in thread
From: Sargun Dhillon @ 2020-11-27 22:11 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells,
	Jeff Layton

On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote:
> On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
> >
> > This documents behaviour that was discussed in a thread about the volatile
> > feature. Specifically, how failures can go hidden from asynchronous writes
> > (such as from mmap, or writes that are not immediately flushed to the
> > filesystem). Although we pass through calls like msync, fallocate, and
> > write, and will still return errors on those, it doesn't guarantee all
> > kinds of errors will happen at those times, and thus may hide errors.
> >
> > In the future, we can add error checking to all interactions with the
> > upperdir, and pass through errseq_t from the upperdir on mappings,
> > and other interactions with the filesystem[1].
> >
> > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
> >
> > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-unionfs@vger.kernel.org
> > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > Cc: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  Documentation/filesystems/overlayfs.rst | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > index 580ab9a0fe31..c6e30c1bc2f2 100644
> > --- a/Documentation/filesystems/overlayfs.rst
> > +++ b/Documentation/filesystems/overlayfs.rst
> > @@ -570,7 +570,11 @@ Volatile mount
> >  This is enabled with the "volatile" mount option.  Volatile mounts are not
> >  guaranteed to survive a crash.  It is strongly recommended that volatile
> >  mounts are only used if data written to the overlay can be recreated
> > -without significant effort.
> > +without significant effort.  In addition to this, the sync family of syscalls
> > +are not sufficient to determine whether a write failed as sync calls are
> > +omitted.  For this reason, it is important that the filesystem used by the
> > +upperdir handles failure in a fashion that's suitable for the user.  For
> > +example, upon detecting a fault, ext4 can be configured to panic.
> >
> 
> Reading this now, I think I may have wrongly analysed the issue.
> Specifically, when I wrote that the very minimum is to document the
> issue, it was under the assumption that a proper fix is hard.
> I think I was wrong and that the very minimum is to check for errseq
> since mount on the fsync and syncfs calls.
> 
> Why? first of all because it is very very simple and goes a long way to
> fix the broken contract with applications, not the contract about durability
> obviously, but the contract about write+fsync+read expects to find the written
> data (during the same mount era).
> 
> Second, because the sentence you added above is hard for users to understand
> and out of context. If we properly handle the writeback error in fsync/syncfs,
> then this sentence becomes irrelevant.
> The fact that ext4 can lose data if application did not fsync after
> write is true
> for volatile as well as non-volatile and it is therefore not relevant
> in the context
> of overlayfs documentation at all.
> 
> Am I wrong saying that it is very very simple to fix?
> Would you mind making that fix at the bottom of the patch series, so it can
> be easily applied to stable kernels?
> 
> Thanks,
> Amir.

I'm not sure it's so easy. In VFS, there are places where the superblock's
errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the
errseq of the corresponding real file's superblock. One solution might be,
as part of all these callbacks, to set our errseq to the errseq of the
filesystem backing the upperdir, and then rely on VFS's checking.

I'm having a hard time figuring out how to deal with the per-mapping
error tracking. It seems like this infrastructure was only partially completed
by Jeff Layton[2]. I don't know how it's actually supposed to work right now,
as not all of his patches landed.

How about I split this into two patchsets? One, where I add the logic to copy
the errseq on callbacks to fsync from the upperdir to the ovl fs superblock,
and thus allowing VFS to bubble up errors, and the documentation. We can CC
stable on those because I think it has an effect that's universal across
all filesystems.

P.S. 
I notice you maintain overlay tests outside of the kernel. Unfortunately, I 
think this kind of test requires in-kernel code to artificially bump the 
writeback error count on the upperdir, or it requires failure injection 
infrastructure. 

Simulating this behaviour is non-trivial without in-kernel support:

P1: Open(f) -> p1.fd
P2: Open(f) -> p2.fd
P1: syncfs(p1.fd) -> errno
P2: syncfs(p2.fd) -> 0 # This should return an error


[1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175
[2]: https://lwn.net/Articles/722250/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-27 22:11     ` Sargun Dhillon
@ 2020-11-28  2:01       ` Jeff Layton
  2020-11-28  4:45         ` Sargun Dhillon
  2020-11-28  8:56       ` Amir Goldstein
  1 sibling, 1 reply; 31+ messages in thread
From: Jeff Layton @ 2020-11-28  2:01 UTC (permalink / raw)
  To: Sargun Dhillon, Amir Goldstein
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells

On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote:
> On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote:
> > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > 
> > > This documents behaviour that was discussed in a thread about the volatile
> > > feature. Specifically, how failures can go hidden from asynchronous writes
> > > (such as from mmap, or writes that are not immediately flushed to the
> > > filesystem). Although we pass through calls like msync, fallocate, and
> > > write, and will still return errors on those, it doesn't guarantee all
> > > kinds of errors will happen at those times, and thus may hide errors.
> > > 
> > > In the future, we can add error checking to all interactions with the
> > > upperdir, and pass through errseq_t from the upperdir on mappings,
> > > and other interactions with the filesystem[1].
> > > 
> > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
> > > 
> > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > Cc: linux-fsdevel@vger.kernel.org
> > > Cc: linux-unionfs@vger.kernel.org
> > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > ---
> > >  Documentation/filesystems/overlayfs.rst | 6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > > index 580ab9a0fe31..c6e30c1bc2f2 100644
> > > --- a/Documentation/filesystems/overlayfs.rst
> > > +++ b/Documentation/filesystems/overlayfs.rst
> > > @@ -570,7 +570,11 @@ Volatile mount
> > >  This is enabled with the "volatile" mount option.  Volatile mounts are not
> > >  guaranteed to survive a crash.  It is strongly recommended that volatile
> > >  mounts are only used if data written to the overlay can be recreated
> > > -without significant effort.
> > > +without significant effort.  In addition to this, the sync family of syscalls
> > > +are not sufficient to determine whether a write failed as sync calls are
> > > +omitted.  For this reason, it is important that the filesystem used by the
> > > +upperdir handles failure in a fashion that's suitable for the user.  For
> > > +example, upon detecting a fault, ext4 can be configured to panic.
> > > 
> > 
> > Reading this now, I think I may have wrongly analysed the issue.
> > Specifically, when I wrote that the very minimum is to document the
> > issue, it was under the assumption that a proper fix is hard.
> > I think I was wrong and that the very minimum is to check for errseq
> > since mount on the fsync and syncfs calls.
> > 
> > Why? first of all because it is very very simple and goes a long way to
> > fix the broken contract with applications, not the contract about durability
> > obviously, but the contract about write+fsync+read expects to find the written
> > data (during the same mount era).
> > 
> > Second, because the sentence you added above is hard for users to understand
> > and out of context. If we properly handle the writeback error in fsync/syncfs,
> > then this sentence becomes irrelevant.
> > The fact that ext4 can lose data if application did not fsync after
> > write is true
> > for volatile as well as non-volatile and it is therefore not relevant
> > in the context
> > of overlayfs documentation at all.
> > 
> > Am I wrong saying that it is very very simple to fix?
> > Would you mind making that fix at the bottom of the patch series, so it can
> > be easily applied to stable kernels?
> > 
> > Thanks,
> > Amir.
> 
> I'm not sure it's so easy. In VFS, there are places where the superblock's 
> errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the 
> errseq of the real corresponding real file's superblock. One solution might be 
> as part of all these callbacks we set our errseq to the errseq of the filesystem 
> that the upperdir, and then rely on VFS's checking.
> 
> I'm having a hard time figuring out how to deal with the per-mapping based
> error tracking. It seems like this infrastructure was only partially completed
> by Jeff Layton[2]. I don't know how it's actually supposed to work right now,
> as not all of his patches landed.
> 

The patches in the series were all merged, but we ended up going with a
simpler solution [1] than the first series I posted. Instead of plumbing
the errseq_t handling down into sync_fs, we did it in the syscall
wrapper.

I think the tricky part here is that there is no struct file plumbed
into ->sync_fs, so you don't have an errseq_t cursor to work with by the
time that gets called.

What may be easiest is to just propagate the s_wb_err value from the
upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get
called before the errseq_check_and_advance in the syncfs syscall wrapper
and should ensure that syncfs() calls against the overlayfs sb see any
errors that occurred on the upper_sb.

Something like this maybe? Totally untested of course. May also need to
think about how to ensure ordering between racing syncfs calls too
(don't really want the s_wb_err going "backward"):

----------------------------8<---------------------------------
$ git diff
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..d725705abdac 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
        ret = sync_filesystem(upper_sb);
        up_read(&upper_sb->s_umount);
 
+       /* Propagate s_wb_err from upper_sb to overlayfs sb */
+       WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err));
+
        return ret;
 }
----------------------------8<---------------------------------

[1]: https://www.spinics.net/lists/linux-api/msg41276.html

> How about I split this into two patchsets? One, where I add the logic to copy
> the errseq on callbacks to fsync from the upperdir to the ovl fs superblock,
> and thus allowing VFS to bubble up errors, and the documentation. We can CC
> stable on those because I think it has an effect that's universal across
> all filesystems.
> 
> P.S. 
> I notice you maintain overlay tests outside of the kernel. Unfortunately, I 
> think for this kind of test, it requires in-kernel code to artificially bump the 
> writeback error count on the upperdir, or it requires failure injection 
> infrastructure. 
> 
> Simulating this behaviour is non-trivial without in-kernel support:
> 
> P1: Open(f) -> p1.fd
> P2: Open(f) -> p2.fd
> P1: syncfs(p1.fd) -> errno
> P2: syncfs(p2.fd) -> 0 # This should return an error
> 
> 
> [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175
> [2]: https://lwn.net/Articles/722250/


-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-28  2:01       ` Jeff Layton
@ 2020-11-28  4:45         ` Sargun Dhillon
  2020-11-28  7:12           ` Amir Goldstein
  2020-11-28 12:04           ` Jeff Layton
  0 siblings, 2 replies; 31+ messages in thread
From: Sargun Dhillon @ 2020-11-28  4:45 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel,
	David Howells

On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote:
> On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote:
> > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote:
> > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > 
> > > > This documents behaviour that was discussed in a thread about the volatile
> > > > feature. Specifically, how failures can go hidden from asynchronous writes
> > > > (such as from mmap, or writes that are not immediately flushed to the
> > > > filesystem). Although we pass through calls like msync, fallocate, and
> > > > write, and will still return errors on those, it doesn't guarantee all
> > > > kinds of errors will happen at those times, and thus may hide errors.
> > > > 
> > > > In the future, we can add error checking to all interactions with the
> > > > upperdir, and pass through errseq_t from the upperdir on mappings,
> > > > and other interactions with the filesystem[1].
> > > > 
> > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
> > > > 
> > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > > Cc: linux-fsdevel@vger.kernel.org
> > > > Cc: linux-unionfs@vger.kernel.org
> > > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > > ---
> > > >  Documentation/filesystems/overlayfs.rst | 6 +++++-
> > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > > > index 580ab9a0fe31..c6e30c1bc2f2 100644
> > > > --- a/Documentation/filesystems/overlayfs.rst
> > > > +++ b/Documentation/filesystems/overlayfs.rst
> > > > @@ -570,7 +570,11 @@ Volatile mount
> > > >  This is enabled with the "volatile" mount option.  Volatile mounts are not
> > > >  guaranteed to survive a crash.  It is strongly recommended that volatile
> > > >  mounts are only used if data written to the overlay can be recreated
> > > > -without significant effort.
> > > > +without significant effort.  In addition to this, the sync family of syscalls
> > > > +are not sufficient to determine whether a write failed as sync calls are
> > > > +omitted.  For this reason, it is important that the filesystem used by the
> > > > +upperdir handles failure in a fashion that's suitable for the user.  For
> > > > +example, upon detecting a fault, ext4 can be configured to panic.
> > > > 
> > > 
> > > Reading this now, I think I may have wrongly analysed the issue.
> > > Specifically, when I wrote that the very minimum is to document the
> > > issue, it was under the assumption that a proper fix is hard.
> > > I think I was wrong and that the very minimum is to check for errseq
> > > since mount on the fsync and syncfs calls.
> > > 
> > > Why? first of all because it is very very simple and goes a long way to
> > > fix the broken contract with applications, not the contract about durability
> > > obviously, but the contract about write+fsync+read expects to find the written
> > > data (during the same mount era).
> > > 
> > > Second, because the sentence you added above is hard for users to understand
> > > and out of context. If we properly handle the writeback error in fsync/syncfs,
> > > then this sentence becomes irrelevant.
> > > The fact that ext4 can lose data if application did not fsync after
> > > write is true
> > > for volatile as well as non-volatile and it is therefore not relevant
> > > in the context
> > > of overlayfs documentation at all.
> > > 
> > > Am I wrong saying that it is very very simple to fix?
> > > Would you mind making that fix at the bottom of the patch series, so it can
> > > be easily applied to stable kernels?
> > > 
> > > Thanks,
> > > Amir.
> > 
> > I'm not sure it's so easy. In VFS, there are places where the superblock's 
> > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the 
> > errseq of the corresponding real file's superblock. One solution might be 
> > for all these callbacks to set our errseq to the errseq of the filesystem 
> > that the upperdir is on, and then rely on VFS's checking.
> > 
> > I'm having a hard time figuring out how to deal with the per-mapping based
> > error tracking. It seems like this infrastructure was only partially completed
> > by Jeff Layton[2]. I don't know how it's actually supposed to work right now,
> > as not all of his patches landed.
> > 
> 
> The patches in the series were all merged, but we ended up going with a
> simpler solution [1] than the first series I posted. Instead of plumbing
> the errseq_t handling down into sync_fs, we did it in the syscall
> wrapper.
Jeff,

Thanks for replying. I'm still a little confused as to what the
per-address_space wb_err is for. It seems like we should implement the
flush method, like so:
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index efccb7c1f9bc..32e5bc0aacd6 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in,
                            remap_flags, op);
 }
 
+static int ovl_flush(struct file *file, fl_owner_t id)
+{
+       struct file *realfile = file->private_data;
+
+       if (realfile->f_op->flush)
+               return realfile->f_op->flush(realfile, id);
+
+       return 0;
+}
+
 const struct file_operations ovl_file_operations = {
        .open           = ovl_open,
        .release        = ovl_release,
@@ -798,6 +808,7 @@ const struct file_operations ovl_file_operations = {
        .fallocate      = ovl_fallocate,
        .fadvise        = ovl_fadvise,
        .unlocked_ioctl = ovl_ioctl,
+       .flush          = ovl_flush,
 #ifdef CONFIG_COMPAT
        .compat_ioctl   = ovl_compat_ioctl,
 #endif


> 
> I think the tricky part here is that there is no struct file plumbed
> into ->sync_fs, so you don't have an errseq_t cursor to work with by the
> time that gets called.
> 
> What may be easiest is to just propagate the s_wb_err value from the
> upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get
> called before the errseq_check_and_advance in the syncfs syscall wrapper
> and should ensure that syncfs() calls against the overlayfs sb see any
> errors that occurred on the upper_sb.
> 
> Something like this maybe? Totally untested of course. May also need to
> think about how to ensure ordering between racing syncfs calls too
> (don't really want the s_wb_err going "backward"):
> 
> ----------------------------8<---------------------------------
> $ git diff
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 290983bcfbb3..d725705abdac 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
>         ret = sync_filesystem(upper_sb);
>         up_read(&upper_sb->s_umount);
>  
> +       /* Propagate s_wb_err from upper_sb to overlayfs sb */
> +       WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err));
> +
>         return ret;
>  }
> ----------------------------8<---------------------------------
> 
> [1]: https://www.spinics.net/lists/linux-api/msg41276.html

So,
I think this will have bad behaviour, because calling syncfs() twice in a row
will return the error twice. The reason is that syncfs normally calls
errseq_check_and_advance, marking the error on the super block as seen. If we
copy up the error value each time, it will break this semantic, as we do not
set the seen flag on the upperdir.

Either we need to set the seen flag on the upperdir's errseq_t, or have a sync 
method, like:
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..4931d1797c03 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
 {
        struct ovl_fs *ofs = sb->s_fs_info;
        struct super_block *upper_sb;
+       errseq_t src;
        int ret;
 
        if (!ovl_upper_mnt(ofs))
@@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
        ret = sync_filesystem(upper_sb);
        up_read(&upper_sb->s_umount);
 
+       /* Propagate the error up from the upper_sb once */
+       src = READ_ONCE(upper_sb->s_wb_err);
+       if (errseq_counter(src) != errseq_counter(sb->s_wb_err))
+               WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN);
+
        return ret;
 }
 
@@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 
                sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
                sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
+               sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err);
 
diff --git a/lib/errseq.c b/lib/errseq.c
index 81f9e33aa7e7..53275c168265 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
        return err;
 }
 EXPORT_SYMBOL(errseq_check_and_advance);
+
+/**
+ * errseq_counter() - Get the value of the errseq counter
+ * @eseq: Value being checked
+ *
+ * This returns the errseq_t counter value. Reminder, it can wrap because
+ * there are only a limited number of counter bits.
+ *
+ * Return: The counter portion of the errseq_t.
+ */
+int errseq_counter(errseq_t eseq)
+{
+       return eseq >> (ERRSEQ_SHIFT + 1);
+}

This also needs some locking sprinkled on it, because it can race with
sync_fs. This would be much easier if errseq_t were 64 bits and didn't go
backwards, because you could just use a simple cmpxchg64. I know that would
make a bunch of structures larger, and if it's just overlayfs that has to pay
the tax, we can just sprinkle a mutex on it.

We can always "copy up" the errseq_t, because if the upperdir's errseq_t is
read before we have made a copy, we will still pick it up. Since this is
strictly serialized, a simple equivalence check will work instead of having to
deal with happens-before behaviour. There is still a correctness flaw here,
though: if exactly enough errors occur to wrap the counter back to the same
value, it will break.

By the way, how do y'all test this error handling behaviour? I didn't
find any automated testing for what currently exists.
> 
> How about I split this into two patchsets? One, where I add the logic to copy
> the errseq on callbacks to fsync from the upperdir to the ovl fs superblock,
> and thus allowing VFS to bubble up errors, and the documentation. We can CC
> stable on those because I think it has an effect that's universal across
> all filesystems.
> 
> P.S. 
> I notice you maintain overlay tests outside of the kernel. Unfortunately, I 
> think for this kind of test, it requires in kernel code to artificially bump the 
> writeback error count on the upperdir, or it requires the failure injection 
> infrastructure. 
> 
> Simulating this behaviour is non-trivial without in-kernel support:
> 
> P1: Open(f) -> p1.fd
> P2: Open(f) -> p2.fd
> P1: syncfs(p1.fd) -> errno
> P2: syncfs(p2.fd) -> 0 # This should return an error
> 
> 
> [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175
> [2]: https://lwn.net/Articles/722250/
> 
> 
> -- 
> Jeff Layton <jlayton@redhat.com>
> 

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-28  4:45         ` Sargun Dhillon
@ 2020-11-28  7:12           ` Amir Goldstein
  2020-11-28  8:52             ` Sargun Dhillon
  2020-11-28 12:04           ` Jeff Layton
  1 sibling, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2020-11-28  7:12 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Jeff Layton, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel,
	David Howells

On Sat, Nov 28, 2020 at 6:45 AM Sargun Dhillon <sargun@sargun.me> wrote:
>
> On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote:
> > On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote:
> > > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote:
> > > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > >
> > > > > This documents behaviour that was discussed in a thread about the volatile
> > > > > feature. Specifically, how failures can go hidden from asynchronous writes
> > > > > (such as from mmap, or writes that are not immediately flushed to the
> > > > > filesystem). Although we pass through calls like msync, fallocate, and
> > > > > write, and will still return errors on those, it doesn't guarantee all
> > > > > kinds of errors will happen at those times, and thus may hide errors.
> > > > >
> > > > > In the future, we can add error checking to all interactions with the
> > > > > upperdir, and pass through errseq_t from the upperdir on mappings,
> > > > > and other interactions with the filesystem[1].
> > > > >
> > > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
> > > > >
> > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > > > Cc: linux-fsdevel@vger.kernel.org
> > > > > Cc: linux-unionfs@vger.kernel.org
> > > > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > > > ---
> > > > >  Documentation/filesystems/overlayfs.rst | 6 +++++-
> > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > > > > index 580ab9a0fe31..c6e30c1bc2f2 100644
> > > > > --- a/Documentation/filesystems/overlayfs.rst
> > > > > +++ b/Documentation/filesystems/overlayfs.rst
> > > > > @@ -570,7 +570,11 @@ Volatile mount
> > > > >  This is enabled with the "volatile" mount option.  Volatile mounts are not
> > > > >  guaranteed to survive a crash.  It is strongly recommended that volatile
> > > > >  mounts are only used if data written to the overlay can be recreated
> > > > > -without significant effort.
> > > > > +without significant effort.  In addition to this, the sync family of syscalls
> > > > > +are not sufficient to determine whether a write failed as sync calls are
> > > > > +omitted.  For this reason, it is important that the filesystem used by the
> > > > > +upperdir handles failure in a fashion that's suitable for the user.  For
> > > > > +example, upon detecting a fault, ext4 can be configured to panic.
> > > > >
> > > >
> > > > Reading this now, I think I may have wrongly analysed the issue.
> > > > Specifically, when I wrote that the very minimum is to document the
> > > > issue, it was under the assumption that a proper fix is hard.
> > > > I think I was wrong and that the very minimum is to check for errseq
> > > > since mount on the fsync and syncfs calls.
> > > >
> > > > Why? first of all because it is very very simple and goes a long way to
> > > > fix the broken contract with applications, not the contract about durability
> > > > obviously, but the contract about write+fsync+read expects to find the written
> > > > data (during the same mount era).
> > > >
> > > > Second, because the sentence you added above is hard for users to understand
> > > > and out of context. If we properly handle the writeback error in fsync/syncfs,
> > > > then this sentence becomes irrelevant.
> > > > The fact that ext4 can lose data if application did not fsync after
> > > > write is true
> > > > for volatile as well as non-volatile and it is therefore not relevant
> > > > in the context
> > > > of overlayfs documentation at all.
> > > >
> > > > Am I wrong saying that it is very very simple to fix?
> > > > Would you mind making that fix at the bottom of the patch series, so it can
> > > > be easily applied to stable kernels?
> > > >
> > > > Thanks,
> > > > Amir.
> > >
> > > I'm not sure it's so easy. In VFS, there are places where the superblock's
> > > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the
> > > errseq of the corresponding real file's superblock. One solution might be
> > > for all these callbacks to set our errseq to the errseq of the filesystem
> > > that the upperdir is on, and then rely on VFS's checking.
> > >
> > > I'm having a hard time figuring out how to deal with the per-mapping based
> > > error tracking. It seems like this infrastructure was only partially completed
> > > by Jeff Layton[2]. I don't know how it's actually supposed to work right now,
> > > as not all of his patches landed.
> > >
> >
> > The patches in the series were all merged, but we ended up going with a
> > simpler solution [1] than the first series I posted. Instead of plumbing
> > the errseq_t handling down into sync_fs, we did it in the syscall
> > wrapper.
> Jeff,
>
> Thanks for replying. I'm still a little confused as to what the
> per-address_space wb_err is for. It seems like we should implement the
> flush method, like so:
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index efccb7c1f9bc..32e5bc0aacd6 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in,
>                             remap_flags, op);
>  }
>
> +static int ovl_flush(struct file *file, fl_owner_t id)
> +{
> +       struct file *realfile = file->private_data;
> +
> +       if (realfile->f_op->flush)
> +               return realfile->f_op->flush(realfile, id);
> +
> +       return 0;
> +}
> +
>  const struct file_operations ovl_file_operations = {
>         .open           = ovl_open,
>         .release        = ovl_release,
> @@ -798,6 +808,7 @@ const struct file_operations ovl_file_operations = {
>         .fallocate      = ovl_fallocate,
>         .fadvise        = ovl_fadvise,
>         .unlocked_ioctl = ovl_ioctl,
> +       .flush          = ovl_flush,
>  #ifdef CONFIG_COMPAT
>         .compat_ioctl   = ovl_compat_ioctl,
>  #endif
>
>
> >
> > I think the tricky part here is that there is no struct file plumbed
> > into ->sync_fs, so you don't have an errseq_t cursor to work with by the
> > time that gets called.
> >
> > What may be easiest is to just propagate the s_wb_err value from the
> > upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get
> > called before the errseq_check_and_advance in the syncfs syscall wrapper
> > and should ensure that syncfs() calls against the overlayfs sb see any
> > errors that occurred on the upper_sb.
> >
> > Something like this maybe? Totally untested of course. May also need to
> > think about how to ensure ordering between racing syncfs calls too
> > (don't really want the s_wb_err going "backward"):
> >
> > ----------------------------8<---------------------------------
> > $ git diff
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index 290983bcfbb3..d725705abdac 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> >         ret = sync_filesystem(upper_sb);
> >         up_read(&upper_sb->s_umount);
> >
> > +       /* Propagate s_wb_err from upper_sb to overlayfs sb */
> > +       WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err));
> > +
> >         return ret;
> >  }
> > ----------------------------8<---------------------------------
> >
> > [1]: https://www.spinics.net/lists/linux-api/msg41276.html
>
> So,
> I think this will have bad behaviour because syncfs() twice in a row will
> return the error twice. The reason is that normally in syncfs it calls
> errseq_check_and_advance marking the error on the super block as seen. If we
> copy-up the error value each time, it will break this semantic, as we do not set
> seen on the upperdir.
>
> Either we need to set the seen flag on the upperdir's errseq_t, or have a sync
> method, like:
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 290983bcfbb3..4931d1797c03 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
>  {
>         struct ovl_fs *ofs = sb->s_fs_info;
>         struct super_block *upper_sb;
> +       errseq_t src;
>         int ret;
>
>         if (!ovl_upper_mnt(ofs))
> @@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
>         ret = sync_filesystem(upper_sb);
>         up_read(&upper_sb->s_umount);
>
> +       /* Propagate the error up from the upper_sb once */
> +       src = READ_ONCE(upper_sb->s_wb_err);
> +       if (errseq_counter(src) != errseq_counter(sb->s_wb_err))
> +               WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN);
> +
>         return ret;
>  }
>
> @@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>
>                 sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
>                 sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> +               sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err);
>
> diff --git a/lib/errseq.c b/lib/errseq.c
> index 81f9e33aa7e7..53275c168265 100644
> --- a/lib/errseq.c
> +++ b/lib/errseq.c
> @@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
>         return err;
>  }
>  EXPORT_SYMBOL(errseq_check_and_advance);
> +
> +/**
> + * errseq_counter() - Get the value of the errseq counter
> + * @eseq: Value being checked
> + *
> + * This returns the errseq_t counter value. Reminder, it can wrap because
> + * there are only a limited number of counter bits.
> + *
> + * Return: The counter portion of the errseq_t.
> + */
> +int errseq_counter(errseq_t eseq)
> +{
> +       return eseq >> (ERRSEQ_SHIFT + 1);
> +}
>
> This also needs some locking sprinkled on it because racing can occur with
> sync_fs. This would be much easier if the errseq_t was 64-bits, and didn't go
> backwards, because you could just use a simple cmpxchg64. I know that would make
> a bunch of structures larger, and if it's just overlayfs that has to pay the tax
> we can just sprinkle a mutex on it.
>
> We can always "copy-up" the errseq_t because if the upperdir's errseq_t is read,
> and we haven't done a copy yet, we'll get it. Since this will be strictly
> serialized a simple equivalence check will work versus actually having to deal
> with happens before behaviour. There is still a correctness flaw here though,
> which is exactly enough errors occur to result in it wrapping back to the same
> value, it will break.
>
> By the way, how do y'all test this error handling behaviour? I didn't
> find any automated testing for what currently exists.
> >
> > How about I split this into two patchsets? One, where I add the logic to copy
> > the errseq on callbacks to fsync from the upperdir to the ovl fs superblock,
> > and thus allowing VFS to bubble up errors, and the documentation. We can CC
> > stable on those because I think it has an effect that's universal across
> > all filesystems.
> >
> > P.S.
> > I notice you maintain overlay tests outside of the kernel. Unfortunately, I
> > think for this kind of test, it requires in kernel code to artificially bump the
> > writeback error count on the upperdir, or it requires the failure injection
> > infrastructure.
> >
> > Simulating this behaviour is non-trivial without in-kernel support:
> >
> > P1: Open(f) -> p1.fd
> > P2: Open(f) -> p2.fd
> > P1: syncfs(p1.fd) -> errno
> > P2: syncfs(p2.fd) -> 0 # This should return an error
> >
> >
> > [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175
> > [2]: https://lwn.net/Articles/722250/
> >
> >

Sargun (and Jeff),

Thank you for this discussion. It would be very nice to work on testing
and fixing errseq propagation on overlayfs.
Alas, this is not what I suggested.

What I suggested was a solution only for the volatile overlay issue
where data can vaporise without applications noticing:
"...the very minimum is to check for errseq since mount on the fsync
 and syncfs calls."

Do you get it? There is no per-file state in the game, not for fsync and
not for syncfs.

Any single error, no matter how temporary it is and what damage it may
or may not have caused to upper layer consistency, permanently
invalidates the reliability of the volatile overlay, resulting in:
Effective immediately: every fsync/syncfs returns EIO.
Going forward: maybe implement overlay shutdown, so every access
returns EIO.

So now that I hopefully explained myself better, I'll ask again:
Am I wrong saying that it is very very simple to fix?
Would you mind making that fix at the bottom of the patch series, so it can
be easily applied to stable kernels?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-28  7:12           ` Amir Goldstein
@ 2020-11-28  8:52             ` Sargun Dhillon
  2020-11-28  9:04               ` Amir Goldstein
  2020-12-01 11:09               ` Sargun Dhillon
  0 siblings, 2 replies; 31+ messages in thread
From: Sargun Dhillon @ 2020-11-28  8:52 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jeff Layton, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel,
	David Howells

On Sat, Nov 28, 2020 at 09:12:03AM +0200, Amir Goldstein wrote:
> On Sat, Nov 28, 2020 at 6:45 AM Sargun Dhillon <sargun@sargun.me> wrote:
> >
> > On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote:
> > > On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote:
> > > > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote:
> > > > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > > >
> > > > > > This documents behaviour that was discussed in a thread about the volatile
> > > > > > feature. Specifically, how failures can go hidden from asynchronous writes
> > > > > > (such as from mmap, or writes that are not immediately flushed to the
> > > > > > filesystem). Although we pass through calls like msync, fallocate, and
> > > > > > write, and will still return errors on those, it doesn't guarantee all
> > > > > > kinds of errors will happen at those times, and thus may hide errors.
> > > > > >
> > > > > > In the future, we can add error checking to all interactions with the
> > > > > > upperdir, and pass through errseq_t from the upperdir on mappings,
> > > > > > and other interactions with the filesystem[1].
> > > > > >
> > > > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
> > > > > >
> > > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > > > > Cc: linux-fsdevel@vger.kernel.org
> > > > > > Cc: linux-unionfs@vger.kernel.org
> > > > > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > > > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > > > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > > > > ---
> > > > > >  Documentation/filesystems/overlayfs.rst | 6 +++++-
> > > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > > > > > index 580ab9a0fe31..c6e30c1bc2f2 100644
> > > > > > --- a/Documentation/filesystems/overlayfs.rst
> > > > > > +++ b/Documentation/filesystems/overlayfs.rst
> > > > > > @@ -570,7 +570,11 @@ Volatile mount
> > > > > >  This is enabled with the "volatile" mount option.  Volatile mounts are not
> > > > > >  guaranteed to survive a crash.  It is strongly recommended that volatile
> > > > > >  mounts are only used if data written to the overlay can be recreated
> > > > > > -without significant effort.
> > > > > > +without significant effort.  In addition to this, the sync family of syscalls
> > > > > > +are not sufficient to determine whether a write failed as sync calls are
> > > > > > +omitted.  For this reason, it is important that the filesystem used by the
> > > > > > +upperdir handles failure in a fashion that's suitable for the user.  For
> > > > > > +example, upon detecting a fault, ext4 can be configured to panic.
> > > > > >
> > > > >
> > > > > Reading this now, I think I may have wrongly analysed the issue.
> > > > > Specifically, when I wrote that the very minimum is to document the
> > > > > issue, it was under the assumption that a proper fix is hard.
> > > > > I think I was wrong and that the very minimum is to check for errseq
> > > > > since mount on the fsync and syncfs calls.
> > > > >
> > > > > Why? first of all because it is very very simple and goes a long way to
> > > > > fix the broken contract with applications, not the contract about durability
> > > > > obviously, but the contract about write+fsync+read expects to find the written
> > > > > data (during the same mount era).
> > > > >
> > > > > Second, because the sentence you added above is hard for users to understand
> > > > > and out of context. If we properly handle the writeback error in fsync/syncfs,
> > > > > then this sentence becomes irrelevant.
> > > > > The fact that ext4 can lose data if application did not fsync after
> > > > > write is true
> > > > > for volatile as well as non-volatile and it is therefore not relevant
> > > > > in the context
> > > > > of overlayfs documentation at all.
> > > > >
> > > > > Am I wrong saying that it is very very simple to fix?
> > > > > Would you mind making that fix at the bottom of the patch series, so it can
> > > > > be easily applied to stable kernels?
> > > > >
> > > > > Thanks,
> > > > > Amir.
> > > >
> > > > I'm not sure it's so easy. In VFS, there are places where the superblock's
> > > > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the
> > > > errseq of the corresponding real file's superblock. One solution might be
> > > > for all these callbacks to set our errseq to the errseq of the filesystem
> > > > that the upperdir is on, and then rely on VFS's checking.
> > > >
> > > > I'm having a hard time figuring out how to deal with the per-mapping based
> > > > error tracking. It seems like this infrastructure was only partially completed
> > > > by Jeff Layton[2]. I don't know how it's actually supposed to work right now,
> > > > as not all of his patches landed.
> > > >
> > >
> > > The patches in the series were all merged, but we ended up going with a
> > > simpler solution [1] than the first series I posted. Instead of plumbing
> > > the errseq_t handling down into sync_fs, we did it in the syscall
> > > wrapper.
> > Jeff,
> >
> > Thanks for replying. I'm still a little confused as to what the
> > per-address_space wb_err is for. It seems like we should implement the
> > flush method, like so:
> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > index efccb7c1f9bc..32e5bc0aacd6 100644
> > --- a/fs/overlayfs/file.c
> > +++ b/fs/overlayfs/file.c
> > @@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in,
> >                             remap_flags, op);
> >  }
> >
> > +static int ovl_flush(struct file *file, fl_owner_t id)
> > +{
> > +       struct file *realfile = file->private_data;
> > +
> > +       if (realfile->f_op->flush)
> > +               return realfile->f_op->flush(realfile, id);
> > +
> > +       return 0;
> > +}
> > +
> >  const struct file_operations ovl_file_operations = {
> >         .open           = ovl_open,
> >         .release        = ovl_release,
> > @@ -798,6 +808,7 @@ const struct file_operations ovl_file_operations = {
> >         .fallocate      = ovl_fallocate,
> >         .fadvise        = ovl_fadvise,
> >         .unlocked_ioctl = ovl_ioctl,
> > +       .flush          = ovl_flush,
> >  #ifdef CONFIG_COMPAT
> >         .compat_ioctl   = ovl_compat_ioctl,
> >  #endif
> >
> >
> > >
> > > I think the tricky part here is that there is no struct file plumbed
> > > into ->sync_fs, so you don't have an errseq_t cursor to work with by the
> > > time that gets called.
> > >
> > > What may be easiest is to just propagate the s_wb_err value from the
> > > upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get
> > > called before the errseq_check_and_advance in the syncfs syscall wrapper
> > > and should ensure that syncfs() calls against the overlayfs sb see any
> > > errors that occurred on the upper_sb.
> > >
> > > Something like this maybe? Totally untested of course. May also need to
> > > think about how to ensure ordering between racing syncfs calls too
> > > (don't really want the s_wb_err going "backward"):
> > >
> > > ----------------------------8<---------------------------------
> > > $ git diff
> > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > index 290983bcfbb3..d725705abdac 100644
> > > --- a/fs/overlayfs/super.c
> > > +++ b/fs/overlayfs/super.c
> > > @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > >         ret = sync_filesystem(upper_sb);
> > >         up_read(&upper_sb->s_umount);
> > >
> > > +       /* Propagate s_wb_err from upper_sb to overlayfs sb */
> > > +       WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err));
> > > +
> > >         return ret;
> > >  }
> > > ----------------------------8<---------------------------------
> > >
> > > [1]: https://www.spinics.net/lists/linux-api/msg41276.html
> >
> > So,
> > I think this will have bad behaviour because syncfs() twice in a row will
> > return the error twice. The reason is that normally in syncfs it calls
> > errseq_check_and_advance marking the error on the super block as seen. If we
> > copy-up the error value each time, it will break this semantic, as we do not set
> > seen on the upperdir.
> >
> > Either we need to set the seen flag on the upperdir's errseq_t, or have a sync
> > method, like:
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index 290983bcfbb3..4931d1797c03 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> >  {
> >         struct ovl_fs *ofs = sb->s_fs_info;
> >         struct super_block *upper_sb;
> > +       errseq_t src;
> >         int ret;
> >
> >         if (!ovl_upper_mnt(ofs))
> > @@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> >         ret = sync_filesystem(upper_sb);
> >         up_read(&upper_sb->s_umount);
> >
> > +       /* Propagate the error up from the upper_sb once */
> > +       src = READ_ONCE(upper_sb->s_wb_err);
> > +       if (errseq_counter(src) != errseq_counter(sb->s_wb_err))
> > +               WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN);
> > +
> >         return ret;
> >  }
> >
> > @@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> >
> >                 sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> >                 sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> > +               sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err);
> >
> > diff --git a/lib/errseq.c b/lib/errseq.c
> > index 81f9e33aa7e7..53275c168265 100644
> > --- a/lib/errseq.c
> > +++ b/lib/errseq.c
> > @@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> >         return err;
> >  }
> >  EXPORT_SYMBOL(errseq_check_and_advance);
> > +
> > +/**
> > + * errseq_counter() - Get the value of the errseq counter
> > + * @eseq: Value being checked
> > + *
> > + * This returns the errseq_t counter value. Reminder, it can wrap because
> > + * there are only a limited number of counter bits.
> > + *
> > + * Return: The counter portion of the errseq_t.
> > + */
> > +int errseq_counter(errseq_t eseq)
> > +{
> > +       return eseq >> (ERRSEQ_SHIFT + 1);
> > +}
> >
> > This also needs some locking sprinkled on it because racing can occur with
> > sync_fs. This would be much easier if the errseq_t was 64-bits, and didn't go
> > backwards, because you could just use a simple cmpxchg64. I know that would make
> > a bunch of structures larger, and if it's just overlayfs that has to pay the tax
> > we can just sprinkle a mutex on it.
> >
> > We can always "copy-up" the errseq_t because if the upperdir's errseq_t is read,
> > and we haven't done a copy yet, we'll get it. Since this will be strictly
> > serialized a simple equivalence check will work versus actually having to deal
> > with happens-before behaviour. There is still a correctness flaw here though:
> > if exactly enough errors occur to wrap the counter back to the same value, it
> > will break.
> >
> > By the way, how do y'all test this error handling behaviour? I didn't
> > find any automated testing for what currently exists.
> > >
> > > How about I split this into two patchsets? One where I add the logic to copy
> > > the errseq on callbacks to fsync from the upperdir to the ovl fs superblock,
> > > thus allowing VFS to bubble up errors, and one with the documentation. We can
> > > CC stable on those because I think it has an effect that's universal across
> > > all filesystems.
> > >
> > > P.S.
> > > I notice you maintain overlay tests outside of the kernel. Unfortunately, I
> > > think for this kind of test, it requires in kernel code to artificially bump the
> > > writeback error count on the upperdir, or it requires the failure injection
> > > infrastructure.
> > >
> > > Simulating this behaviour is non-trivial without in-kernel support:
> > >
> > > P1: Open(f) -> p1.fd
> > > P2: Open(f) -> p2.fd
> > > P1: syncfs(p1.fd) -> errno
> > > P2: syncfs(p2.fd) -> 0 # This should return an error
> > >
> > >
> > > [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175
> > > [2]: https://lwn.net/Articles/722250/
> > >
> > >
> 
> Sargun (and Jeff),
> 
> Thank you for this discussion. It would be very nice to work on testing
> and fixing errseq propagation so that it is correct on overlayfs.
> Alas, this is not what I suggested.
> 
> What I suggested was a solution only for the volatile overlay issue
> where data can vaporise without applications noticing:
> "...the very minimum is to check for errseq since mount on the fsync
>  and syncfs calls."
> 
Yeah, I was confusing the checking that VFS does on our behalf with the checking
that we can do ourselves in the sync callback. If we return an error prior to
the VFS checking, it short-circuits that entirely.

> Do you get it? There is no per-file state in the game, not for fsync and not
> for syncfs.
> 
> Any single error, no matter how temporary it is and what damage it may
> or may not have caused to upper layer consistency, permanently
> invalidates the reliability of the volatile overlay, resulting in:
> Effective immediately: every fsync/syncfs returns EIO.
> Going forward: maybe implement overlay shutdown, so every access
> returns EIO.
> 
> So now that I hopefully explained myself better, I'll ask again:
> Am I wrong saying that it is very very simple to fix?
> Would you mind making that fix at the bottom of the patch series, so it can
> be easily applied to stable kernels?
> 
> Thanks,
> Amir.

I think that this should be easy enough if the semantic is such that volatile
overlayfs mounts return EIO on every syncfs call once the upperdir's
superblock has experienced errors since the initial mount. I imagine we do not
want to make it such that if the upperdir has ever experienced errors, we
return EIO on syncfs.

The one caveat that I see is that if the errseq wraps, we can silently begin
swallowing errors again. Thus, on the first failed syncfs we should just
store a flag indicating that the volatile fs is bad, and continue to return
EIO rather than go through the process of checking the errseq_t, but that's
easy enough to write.
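To make the idea concrete, here is a userspace sketch of that latch against a
deliberately simplified model of errseq_t (the constants, layout, and helper
names below are illustrative, not the kernel's actual implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of errseq_t: low 12 bits hold the (positive) errno,
 * bit 12 is the SEEN flag, and the remaining bits form a counter. */
#define ERR_MASK   0xfffu
#define SEEN_FLAG  0x1000u
#define CTR_UNIT   0x2000u

typedef uint32_t errseq_t;

static void errseq_set(errseq_t *eseq, int err)
{
	/* A new error bumps the counter and clears SEEN. */
	*eseq = ((*eseq & ~(ERR_MASK | SEEN_FLAG)) + CTR_UNIT) | (uint32_t)(-err);
}

static int errseq_check(const errseq_t *eseq, errseq_t since)
{
	if (*eseq == since)
		return 0;
	return -(int)(*eseq & ERR_MASK);
}

/* The latch proposed above: once one syncfs fails, keep returning EIO so a
 * wrapped counter can never silently re-arm the volatile mount. */
struct volatile_state {
	errseq_t at_mount;	/* sampled when the overlay was mounted */
	bool	 dead;		/* set on the first observed error */
};

static int volatile_syncfs(struct volatile_state *vs, errseq_t *sb_wb_err)
{
	if (vs->dead || errseq_check(sb_wb_err, vs->at_mount)) {
		vs->dead = true;
		return -5; /* -EIO */
	}
	return 0;
}
```

Even if the counter later wraps back to the value sampled at mount time, the
`dead` flag keeps the mount returning EIO.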

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-27 22:11     ` Sargun Dhillon
  2020-11-28  2:01       ` Jeff Layton
@ 2020-11-28  8:56       ` Amir Goldstein
  2020-11-28  9:06         ` Amir Goldstein
  1 sibling, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2020-11-28  8:56 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells,
	Jeff Layton

> I notice you maintain overlay tests outside of the kernel. Unfortunately, I
> think for this kind of test, it requires in kernel code to artificially bump the
> writeback error count on the upperdir, or it requires the failure injection
> infrastructure.
>
> Simulating this behaviour is non-trivial without in-kernel support:
>
> P1: Open(f) -> p1.fd
> P2: Open(f) -> p2.fd
> P1: syncfs(p1.fd) -> errno
> P2: syncfs(p2.fd) -> 0 # This should return an error
>

Failure injection is an option; xfstest generic/019 is an example.
generic/108 uses a different method, and generic/361 uses a plain
over-committed loop image, without any fault injection, to trigger
writeback errors.

With current xfstests, in a 'check -overlay' run (which runs all generic
tests with overlayfs over the configured base fs), all 3 tests mentioned
above will be skipped because of _require_block_device, but the requirement
is there for a different reason in each of them.

At first look, the loop device approach is the most generic one and could
easily work also with overlayfs, so you could create an overlay specific
test (non generic) based on generic/361, but it is not easy to use the
helper _scratch_mkfs_sized, because check -overlay runs do not mkfs
the base scratch fs.

My recommendation would be to fix generic/019 in a similar manner
to the way that tests that _require_scratch_shutdown were fixed to run
with check -overlay:

* Instead of _require_block_device, add a specific helper
   _require_scratch_fail_make_request, which like _require_scratch_shutdown
   knows how to deal with overlay FSTYP correctly

* Instead of `_sysfs_dev $SCRATCH_DEV` use a helper _scratch_sysfs_dev
   that knows how to deal with overlay FSTYP correctly

This will add test coverage to overlayfs fsync/syncfs and then you can
do one or both of:
1. Run 'check -overlay generic/019' with  OVERLAY_MOUNT_OPTIONS="volatile"
2. Fork an overlay specific test from the generic test that will test the
    volatile error handling on every 'check -overlay -g quick' run

#2 will provide better coverage against regressions in volatile writeback error
handling and will be a good start for a test to reuse a volatile mount after
writeback errors.
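For illustration, a hypothetical xfstests local.config for such a run might
look like this (the devices, paths, and base fs here are placeholders, not a
recommendation):

```shell
# Hypothetical xfstests local.config for a 'check -overlay' run.
# TEST_DEV/SCRATCH_DEV must point at real scratch devices on your system.
export FSTYP=ext4
export TEST_DEV=/dev/vdb
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/vdc
export SCRATCH_MNT=/mnt/scratch
# Mount the overlay upper layer with the option under test:
export OVERLAY_MOUNT_OPTIONS="volatile"
```

With a config along those lines, option 1 above is just
`./check -overlay generic/019`.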

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-28  8:52             ` Sargun Dhillon
@ 2020-11-28  9:04               ` Amir Goldstein
  2020-12-01 11:09               ` Sargun Dhillon
  1 sibling, 0 replies; 31+ messages in thread
From: Amir Goldstein @ 2020-11-28  9:04 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Jeff Layton, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel,
	David Howells

> > What I suggested was a solution only for the volatile overlay issue
> > where data can vaporise without applications noticing:
> > "...the very minimum is to check for errseq since mount on the fsync
> >  and syncfs calls."
> >
> Yeah, I was confusing the checking that VFS does on our behalf with the checking
> that we can do ourselves in the sync callback. If we return an error prior to
> the VFS checking, it short-circuits that entirely.
>
> > Do you get it? There is no per-file state in the game, not for fsync and not
> > for syncfs.
> >
> > Any single error, no matter how temporary it is and what damage it may
> > or may not have caused to upper layer consistency, permanently
> > invalidates the reliability of the volatile overlay, resulting in:
> > Effective immediately: every fsync/syncfs returns EIO.
> > Going forward: maybe implement overlay shutdown, so every access
> > returns EIO.
> >
> > So now that I hopefully explained myself better, I'll ask again:
> > Am I wrong saying that it is very very simple to fix?
> > Would you mind making that fix at the bottom of the patch series, so it can
> > be easily applied to stable kernels?
> >
> > Thanks,
> > Amir.
>
> I think that this should be easy enough if the semantic is such that volatile
> overlayfs mounts return EIO on every syncfs call once the upperdir's
> superblock has experienced errors since the initial mount. I imagine we do not
> want to make it such that if the upperdir has ever experienced errors, we
> return EIO on syncfs.
>
> The one caveat that I see is that if the errseq wraps, we can silently begin
> swallowing errors again. Thus, on the first failed syncfs we should just
> store a flag indicating that the volatile fs is bad, and continue to return
> EIO rather than go through the process of checking the errseq_t, but that's
> easy enough to write.

I agree. I sent another reply to your question about testing.
The test I suggested, generic/019, only tests that the first fsync after a
writeback error fails and that umount succeeds, so its logic is good for
volatile overlay.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-28  8:56       ` Amir Goldstein
@ 2020-11-28  9:06         ` Amir Goldstein
  0 siblings, 0 replies; 31+ messages in thread
From: Amir Goldstein @ 2020-11-28  9:06 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells,
	Jeff Layton

On Sat, Nov 28, 2020 at 10:56 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > I notice you maintain overlay tests outside of the kernel. Unfortunately, I
> > think for this kind of test, it requires in kernel code to artificially bump the
> > writeback error count on the upperdir, or it requires the failure injection
> > infrastructure.
> >
> > Simulating this behaviour is non-trivial without in-kernel support:
> >
> > P1: Open(f) -> p1.fd
> > P2: Open(f) -> p2.fd
> > P1: syncfs(p1.fd) -> errno
> > P2: syncfs(p2.fd) -> 0 # This should return an error
> >
>
> Failure injection is an option; xfstest generic/019 is an example.
> generic/108 uses a different method, and generic/361 uses a plain
> over-committed loop image, without any fault injection, to trigger
> writeback errors.
>
> With current xfstests, in a 'check -overlay' run (which runs all generic
> tests with overlayfs over the configured base fs), all 3 tests mentioned
> above will be skipped because of _require_block_device, but the requirement
> is there for a different reason in each of them.
>
> At first look, the loop device approach is the most generic one and could
> easily work also with overlayfs, so you could create an overlay specific
> test (non generic) based on generic/361, but it is not easy to use the
> helper _scratch_mkfs_sized, because check -overlay runs do not mkfs
> the base scratch fs.
>
> My recommendation would be to fix generic/019 in a similar manner
> to the way that tests that _require_scratch_shutdown were fixed to run
> with check -overlay:
>
> * Instead of _require_block_device, add a specific helper
>    _require_scratch_fail_make_request, which like _require_scratch_shutdown
>    knows how to deal with overlay FSTYP correctly
>
> * Instead of `_sysfs_dev $SCRATCH_DEV` use a helper _scratch_sysfs_dev
>    that knows how to deal with overlay FSTYP correctly
>

I missed:

* Instead of `blockdev --getsz $SCRATCH_DEV` use helper _scratch_blockdev_getsz

> This will add test coverage to overlayfs fsync/syncfs and then you can
> do one or both of:
> 1. Run 'check -overlay generic/019' with  OVERLAY_MOUNT_OPTIONS="volatile"
> 2. Fork an overlay specific test from the generic test that will test the
>     volatile error handling on every 'check -overlay -g quick' run
>
> #2 will provide better coverage against regressions in volatile writeback error
> handling and will be a good start for a test to reuse a volatile mount after
> writeback errors.
>
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-28  4:45         ` Sargun Dhillon
  2020-11-28  7:12           ` Amir Goldstein
@ 2020-11-28 12:04           ` Jeff Layton
  1 sibling, 0 replies; 31+ messages in thread
From: Jeff Layton @ 2020-11-28 12:04 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Amir Goldstein, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel,
	David Howells

On Sat, 2020-11-28 at 04:45 +0000, Sargun Dhillon wrote:
> On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote:
> > On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote:
> > > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote:
> > > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > > 
> > > > > This documents behaviour that was discussed in a thread about the volatile
> > > > > feature. Specifically, how failures can go hidden from asynchronous writes
> > > > > (such as from mmap, or writes that are not immediately flushed to the
> > > > > filesystem). Although we pass through calls like msync, fallocate, and
> > > > > write, and will still return errors on those, it doesn't guarantee all
> > > > > kinds of errors will happen at those times, and thus may hide errors.
> > > > > 
> > > > > In the future, we can add error checking to all interactions with the
> > > > > upperdir, and pass through errseq_t from the upperdir on mappings,
> > > > > and other interactions with the filesystem[1].
> > > > > 
> > > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
> > > > > 
> > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > > > Cc: linux-fsdevel@vger.kernel.org
> > > > > Cc: linux-unionfs@vger.kernel.org
> > > > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > > > ---
> > > > >  Documentation/filesystems/overlayfs.rst | 6 +++++-
> > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > > > > index 580ab9a0fe31..c6e30c1bc2f2 100644
> > > > > --- a/Documentation/filesystems/overlayfs.rst
> > > > > +++ b/Documentation/filesystems/overlayfs.rst
> > > > > @@ -570,7 +570,11 @@ Volatile mount
> > > > >  This is enabled with the "volatile" mount option.  Volatile mounts are not
> > > > >  guaranteed to survive a crash.  It is strongly recommended that volatile
> > > > >  mounts are only used if data written to the overlay can be recreated
> > > > > -without significant effort.
> > > > > +without significant effort.  In addition to this, the sync family of syscalls
> > > > > +are not sufficient to determine whether a write failed as sync calls are
> > > > > +omitted.  For this reason, it is important that the filesystem used by the
> > > > > +upperdir handles failure in a fashion that's suitable for the user.  For
> > > > > +example, upon detecting a fault, ext4 can be configured to panic.
> > > > > 
> > > > 
> > > > Reading this now, I think I may have wrongly analysed the issue.
> > > > Specifically, when I wrote that the very minimum is to document the
> > > > issue, it was under the assumption that a proper fix is hard.
> > > > I think I was wrong and that the very minimum is to check for errseq
> > > > since mount on the fsync and syncfs calls.
> > > > 
> > > > Why? first of all because it is very very simple and goes a long way to
> > > > fix the broken contract with applications, not the contract about durability
> > > > obviously, but the contract about write+fsync+read expects to find the written
> > > > data (during the same mount era).
> > > > 
> > > > Second, because the sentence you added above is hard for users to understand
> > > > and out of context. If we properly handle the writeback error in fsync/syncfs,
> > > > then this sentence becomes irrelevant.
> > > > The fact that ext4 can lose data if application did not fsync after
> > > > write is true
> > > > for volatile as well as non-volatile and it is therefore not relevant
> > > > in the context
> > > > of overlayfs documentation at all.
> > > > 
> > > > Am I wrong saying that it is very very simple to fix?
> > > > Would you mind making that fix at the bottom of the patch series, so it can
> > > > be easily applied to stable kernels?
> > > > 
> > > > Thanks,
> > > > Amir.
> > > 
> > > I'm not sure it's so easy. In VFS, there are places where the superblock's
> > > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the
> > > errseq of the corresponding real file's superblock. One solution might be,
> > > as part of all these callbacks, to set our errseq to the errseq of the
> > > filesystem that the upperdir is on, and then rely on VFS's checking.
> > > 
> > > I'm having a hard time figuring out how to deal with the per-mapping based
> > > error tracking. It seems like this infrastructure was only partially completed
> > > by Jeff Layton[2]. I don't know how it's actually supposed to work right now,
> > > as not all of his patches landed.
> > > 
> > 
> > The patches in the series were all merged, but we ended up going with a
> > simpler solution [1] than the first series I posted. Instead of plumbing
> > the errseq_t handling down into sync_fs, we did it in the syscall
> > wrapper.
> Jeff,
> 
> Thanks for replying. I'm still a little confused as to what the
> per-address_space wb_err is for. It seems like we should implement the
> flush method, like so:
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index efccb7c1f9bc..32e5bc0aacd6 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in,
>                             remap_flags, op);
>  }
>  
> +static int ovl_flush(struct file *file, fl_owner_t id)
> +{
> +       struct file *realfile = file->private_data;
> +
> +       if (realfile->f_op->flush)
> +               return realfile->f_op->flush(realfile, id);
> +
> +       return 0;
> +}
> +
>  const struct file_operations ovl_file_operations = {
>         .open           = ovl_open,
>         .release        = ovl_release,
> @@ -798,6 +808,7 @@ const struct file_operations ovl_file_operations = {
>         .fallocate      = ovl_fallocate,
>         .fadvise        = ovl_fadvise,
>         .unlocked_ioctl = ovl_ioctl,
> +       .flush          = ovl_flush,
>  #ifdef CONFIG_COMPAT
>         .compat_ioctl   = ovl_compat_ioctl,
>  #endif
> 
> 
> > 
> > I think the tricky part here is that there is no struct file plumbed
> > into ->sync_fs, so you don't have an errseq_t cursor to work with by the
> > time that gets called.
> > 
> > What may be easiest is to just propagate the s_wb_err value from the
> > upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get
> > called before the errseq_check_and_advance in the syncfs syscall wrapper
> > and should ensure that syncfs() calls against the overlayfs sb see any
> > errors that occurred on the upper_sb.
> > 
> > Something like this maybe? Totally untested of course. May also need to
> > think about how to ensure ordering between racing syncfs calls too
> > (don't really want the s_wb_err going "backward"):
> > 
> > ----------------------------8<---------------------------------
> > $ git diff
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index 290983bcfbb3..d725705abdac 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> >         ret = sync_filesystem(upper_sb);
> >         up_read(&upper_sb->s_umount);
> >  
> > +       /* Propagate s_wb_err from upper_sb to overlayfs sb */
> > +       WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err));
> > +
> >         return ret;
> >  }
> > ----------------------------8<---------------------------------
> > 
> > [1]: https://www.spinics.net/lists/linux-api/msg41276.html
> 
> So, 
> I think this will have bad behaviour because syncfs() twice in a row will 
> return the error twice. The reason is that normally in syncfs it calls 
> errseq_check_and_advance marking the error on the super block as seen. If we 
> copy-up the error value each time, it will break this semantic, as we do not set 
> seen on the upperdir.
> 

You're absolutely right. I thought about this after I had sent the mail
last night.

> Either we need to set the seen flag on the upperdir's errseq_t, or have a sync 
> method, like:
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 290983bcfbb3..4931d1797c03 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
>  {
>         struct ovl_fs *ofs = sb->s_fs_info;
>         struct super_block *upper_sb;
> +       errseq_t src;
>         int ret;
>  
>         if (!ovl_upper_mnt(ofs))
> @@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
>         ret = sync_filesystem(upper_sb);
>         up_read(&upper_sb->s_umount);
>  
> +       /* Propagate the error up from the upper_sb once */
> +       src = READ_ONCE(upper_sb->s_wb_err);
> +       if (errseq_counter(src) != errseq_counter(sb->s_wb_err))
> +               WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN);
> +
>         return ret;
>  }
>  
> @@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>  
>                 sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
>                 sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> +               sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err);
>  
> diff --git a/lib/errseq.c b/lib/errseq.c
> index 81f9e33aa7e7..53275c168265 100644
> --- a/lib/errseq.c
> +++ b/lib/errseq.c
> @@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
>         return err;
>  }
>  EXPORT_SYMBOL(errseq_check_and_advance);
> +
> +/**
> + * errseq_counter() - Get the value of the errseq counter
> + * @eseq: Value being checked
> + *
> + * This returns the errseq_t counter value. Reminder, it can wrap because
> + * there are only a limited number of counter bits.
> + *
> + * Return: The counter portion of the errseq_t.
> + */
> +int errseq_counter(errseq_t eseq)
> +{
> +       return eseq >> (ERRSEQ_SHIFT + 1);
> +}
> 
> This also needs some locking sprinkled on it because racing can occur with 
> sync_fs. This would be much easier if the errseq_t was 64-bits, and didn't go 
> backwards, because you could just use a simple cmpxchg64. I know that would make 
> a bunch of structures larger, and if it's just overlayfs that has to pay the tax
> we can just sprinkle a mutex on it.
> 

A spinlock would work here too. None of these are blocking operations.

> We can always "copy-up" the errseq_t because if the upperdir's errseq_t is read, 
> and we haven't done a copy yet, we'll get it. Since this will be strictly 
> serialized a simple equivalence check will work versus actually having to deal 
> with happens-before behaviour. There is still a correctness flaw here though:
> if exactly enough errors occur to wrap the counter back to the same value, it
> will break.
> 
> By the way, how do y'all test this error handling behaviour? I didn't
> find any automated testing for what currently exists.

There are a couple of xfstests that you may be able to use as a starting
point. See:

generic/441
generic/442


> > 
> > How about I split this into two patchsets? One where I add the logic to copy
> > the errseq on callbacks to fsync from the upperdir to the ovl fs superblock,
> > thus allowing VFS to bubble up errors, and one with the documentation. We can
> > CC stable on those because I think it has an effect that's universal across
> > all filesystems.
> > 
> > P.S. 
> > I notice you maintain overlay tests outside of the kernel. Unfortunately, I 
> > think for this kind of test, it requires in kernel code to artificially bump the 
> > writeback error count on the upperdir, or it requires the failure injection 
> > infrastructure. 
> > 
> > Simulating this behaviour is non-trivial without in-kernel support:
> > 
> > P1: Open(f) -> p1.fd
> > P2: Open(f) -> p2.fd
> > P1: syncfs(p1.fd) -> errno
> > P2: syncfs(p2.fd) -> 0 # This should return an error
> > 
> > 
> > [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175
> > [2]: https://lwn.net/Articles/722250/
> > 
> > 
> > -- 
> > Jeff Layton <jlayton@redhat.com>
> > 
> 

-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount
  2020-11-27  9:20 ` [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount Sargun Dhillon
@ 2020-11-30 18:43   ` Vivek Goyal
  2020-11-30 19:15   ` Vivek Goyal
  2020-11-30 19:33   ` Vivek Goyal
  2 siblings, 0 replies; 31+ messages in thread
From: Vivek Goyal @ 2020-11-30 18:43 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: linux-unionfs, miklos, Alexander Viro, Amir Goldstein,
	Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells

On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> Volatile remounts validate the following at the moment:
>  * Has the module been reloaded / the system rebooted
>  * Has the workdir been remounted
> 
> This adds a new check for errors detected via the superblock's
> errseq_t. At mount time, the errseq_t is snapshotted to disk,
> and upon remount it's re-verified. This allows for kernel-level
> detection of errors without forcing userspace to perform a
> sync and allows for the hidden detection of writeback errors.
> 
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> ---
>  fs/overlayfs/overlayfs.h | 1 +
>  fs/overlayfs/readdir.c   | 6 ++++++
>  fs/overlayfs/super.c     | 1 +
>  3 files changed, 8 insertions(+)
> 
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index de694ee99d7c..e8a711953b64 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -85,6 +85,7 @@ struct ovl_volatile_info {
>  	 */
>  	uuid_t		ovl_boot_id;	/* Must stay first member */
>  	u64		s_instance_id;
> +	errseq_t	errseq;	/* Implemented as a u32 */
>  } __packed;
>  
>  /*
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 7b66fbb20261..5795b28bb4cf 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
>  		return -EINVAL;
>  	}
>  
> +	err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> +	if (err) {
> +		pr_debug("Workdir filesystem reports errors: %d\n", err);

It will probably make sense to make it pr_err() instead? It would be good
to know if the kernel logic detected errors.


Thanks
Vivek


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount
  2020-11-27  9:20 ` [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount Sargun Dhillon
  2020-11-30 18:43   ` Vivek Goyal
@ 2020-11-30 19:15   ` Vivek Goyal
  2020-12-05  9:13     ` Amir Goldstein
  2020-11-30 19:33   ` Vivek Goyal
  2 siblings, 1 reply; 31+ messages in thread
From: Vivek Goyal @ 2020-11-30 19:15 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: linux-unionfs, miklos, Alexander Viro, Amir Goldstein,
	Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells

On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> Volatile remounts validate the following at the moment:
>  * Has the module been reloaded / the system rebooted
>  * Has the workdir been remounted
> 
> This adds a new check for errors detected via the superblock's
> errseq_t. At mount time, the errseq_t is snapshotted to disk,
> and upon remount it's re-verified. This allows for kernel-level
> detection of errors without forcing userspace to perform a
> sync and allows for the hidden detection of writeback errors.
> 
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> ---
>  fs/overlayfs/overlayfs.h | 1 +
>  fs/overlayfs/readdir.c   | 6 ++++++
>  fs/overlayfs/super.c     | 1 +
>  3 files changed, 8 insertions(+)
> 
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index de694ee99d7c..e8a711953b64 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -85,6 +85,7 @@ struct ovl_volatile_info {
>  	 */
>  	uuid_t		ovl_boot_id;	/* Must stay first member */
>  	u64		s_instance_id;
> +	errseq_t	errseq;	/* Implemented as a u32 */
>  } __packed;
>  
>  /*
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 7b66fbb20261..5795b28bb4cf 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
>  		return -EINVAL;
>  	}
>  
> +	err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> +	if (err) {
> +		pr_debug("Workdir filesystem reports errors: %d\n", err);
> +		return -EINVAL;
> +	}
> +
>  	return 1;
>  }
>  
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index a8ee3ba4ebbd..2e473f8c75dd 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -1248,6 +1248,7 @@ static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir)
>  	int err;
>  	struct ovl_volatile_info info = {
>  		.s_instance_id = volatiledir->d_sb->s_instance_id,
> +		.errseq = errseq_sample(&volatiledir->d_sb->s_wb_err),

errseq_sample() seems to return 0 if nobody has seen the error yet. That
means the remount will fail. It is a false failure from our perspective,
and we are not interested in knowing whether somebody else has seen the
failure or not.

Maybe we need a flag in errseq_sample() to get us the current value
irrespective of whether anybody has seen the error or not?

If we end up making this change, then we probably will have to somehow
mask the ERRSEQ_SEEN bit in the errseq_check() comparison. Because if we
sampled ->s_wb_err when nobody had seen the error, and by remount time
ERRSEQ_SEEN is set, we don't want the remount to fail.

Thanks
Vivek
>  	};
>  
>  	uuid_copy(&info.ovl_boot_id, &ovl_boot_id);
> -- 
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount
  2020-11-27  9:20 ` [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount Sargun Dhillon
  2020-11-30 18:43   ` Vivek Goyal
  2020-11-30 19:15   ` Vivek Goyal
@ 2020-11-30 19:33   ` Vivek Goyal
  2020-12-01 11:56     ` Sargun Dhillon
  2 siblings, 1 reply; 31+ messages in thread
From: Vivek Goyal @ 2020-11-30 19:33 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: linux-unionfs, miklos, Alexander Viro, Amir Goldstein,
	Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells

On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> Volatile remounts validate the following at the moment:
>  * Has the module been reloaded / the system rebooted
>  * Has the workdir been remounted
> 
> This adds a new check for errors detected via the superblock's
> errseq_t. At mount time, the errseq_t is snapshotted to disk,
> and upon remount it's re-verified. This allows for kernel-level
> detection of errors without forcing userspace to perform a
> sync and allows for the hidden detection of writeback errors.
> 
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> ---
>  fs/overlayfs/overlayfs.h | 1 +
>  fs/overlayfs/readdir.c   | 6 ++++++
>  fs/overlayfs/super.c     | 1 +
>  3 files changed, 8 insertions(+)
> 
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index de694ee99d7c..e8a711953b64 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -85,6 +85,7 @@ struct ovl_volatile_info {
>  	 */
>  	uuid_t		ovl_boot_id;	/* Must stay first member */
>  	u64		s_instance_id;
> +	errseq_t	errseq;	/* Implemented as a u32 */
>  } __packed;
>  
>  /*
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 7b66fbb20261..5795b28bb4cf 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
>  		return -EINVAL;
>  	}
>  
> +	err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);

Might be a stupid question, but I will ask anyway.

What protects against wrapping of the counter? IOW, say we stored the
info.errseq value as A. It is possible that a bunch of errors occurred and
by remount time ->s_wb_err is back to A, so we pass the check (despite the
fact that lots of errors have occurred since we sampled).

Thanks
Vivek


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-28  8:52             ` Sargun Dhillon
  2020-11-28  9:04               ` Amir Goldstein
@ 2020-12-01 11:09               ` Sargun Dhillon
  2020-12-01 11:29                 ` Amir Goldstein
  2020-12-01 13:01                 ` Jeff Layton
  1 sibling, 2 replies; 31+ messages in thread
From: Sargun Dhillon @ 2020-12-01 11:09 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jeff Layton, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel,
	David Howells

On Sat, Nov 28, 2020 at 08:52:27AM +0000, Sargun Dhillon wrote:
> On Sat, Nov 28, 2020 at 09:12:03AM +0200, Amir Goldstein wrote:
> > On Sat, Nov 28, 2020 at 6:45 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > >
> > > On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote:
> > > > On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote:
> > > > > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote:
> > > > > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > > > >
> > > > > > > This documents behaviour that was discussed in a thread about the volatile
> > > > > > > feature. Specifically, how failures can go hidden from asynchronous writes
> > > > > > > (such as from mmap, or writes that are not immediately flushed to the
> > > > > > > filesystem). Although we pass through calls like msync, fallocate, and
> > > > > > > write, and will still return errors on those, it doesn't guarantee all
> > > > > > > kinds of errors will happen at those times, and thus may hide errors.
> > > > > > >
> > > > > > > In the future, we can add error checking to all interactions with the
> > > > > > > upperdir, and pass through errseq_t from the upperdir on mappings,
> > > > > > > and other interactions with the filesystem[1].
> > > > > > >
> > > > > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
> > > > > > >
> > > > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > > > > > Cc: linux-fsdevel@vger.kernel.org
> > > > > > > Cc: linux-unionfs@vger.kernel.org
> > > > > > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > > > > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > > > > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > > > > > ---
> > > > > > >  Documentation/filesystems/overlayfs.rst | 6 +++++-
> > > > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > > > > > > index 580ab9a0fe31..c6e30c1bc2f2 100644
> > > > > > > --- a/Documentation/filesystems/overlayfs.rst
> > > > > > > +++ b/Documentation/filesystems/overlayfs.rst
> > > > > > > @@ -570,7 +570,11 @@ Volatile mount
> > > > > > >  This is enabled with the "volatile" mount option.  Volatile mounts are not
> > > > > > >  guaranteed to survive a crash.  It is strongly recommended that volatile
> > > > > > >  mounts are only used if data written to the overlay can be recreated
> > > > > > > -without significant effort.
> > > > > > > +without significant effort.  In addition to this, the sync family of syscalls
> > > > > > > +are not sufficient to determine whether a write failed as sync calls are
> > > > > > > +omitted.  For this reason, it is important that the filesystem used by the
> > > > > > > +upperdir handles failure in a fashion that's suitable for the user.  For
> > > > > > > +example, upon detecting a fault, ext4 can be configured to panic.
> > > > > > >
> > > > > >
> > > > > > Reading this now, I think I may have wrongly analysed the issue.
> > > > > > Specifically, when I wrote that the very minimum is to document the
> > > > > > issue, it was under the assumption that a proper fix is hard.
> > > > > > I think I was wrong and that the very minimum is to check for errseq
> > > > > > since mount on the fsync and syncfs calls.
> > > > > >
> > > > > > Why? first of all because it is very very simple and goes a long way to
> > > > > > fix the broken contract with applications, not the contract about durability
> > > > > > obviously, but the contract about write+fsync+read expects to find the written
> > > > > > data (during the same mount era).
> > > > > >
> > > > > > Second, because the sentence you added above is hard for users to understand
> > > > > > and out of context. If we properly handle the writeback error in fsync/syncfs,
> > > > > > then this sentence becomes irrelevant.
> > > > > > The fact that ext4 can lose data if application did not fsync after
> > > > > > write is true
> > > > > > for volatile as well as non-volatile and it is therefore not relevant
> > > > > > in the context
> > > > > > of overlayfs documentation at all.
> > > > > >
> > > > > > Am I wrong saying that it is very very simple to fix?
> > > > > > Would you mind making that fix at the bottom of the patch series, so it can
> > > > > > be easily applied to stable kernels?
> > > > > >
> > > > > > Thanks,
> > > > > > Amir.
> > > > >
> > > > > I'm not sure it's so easy. In VFS, there are places where the superblock's
> > > > > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the
> > > > > errseq of the real corresponding real file's superblock. One solution might be
> > > > > as part of all these callbacks we set our errseq to the errseq of the filesystem
> > > > > that the upperdir, and then rely on VFS's checking.
> > > > >
> > > > > I'm having a hard time figuring out how to deal with the per-mapping based
> > > > > error tracking. It seems like this infrastructure was only partially completed
> > > > > by Jeff Layton[2]. I don't now how it's actually supposed to work right now,
> > > > > as not all of his patches landed.
> > > > >
> > > >
> > > > The patches in the series were all merged, but we ended up going with a
> > > > simpler solution [1] than the first series I posted. Instead of plumbing
> > > > the errseq_t handling down into sync_fs, we did it in the syscall
> > > > wrapper.
> > > Jeff,
> > >
> > > Thanks for replying. I'm still a little confused as to what the
> > > per-address_space wb_err. It seems like we should implement the
> > > flush method, like so:
> > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > > index efccb7c1f9bc..32e5bc0aacd6 100644
> > > --- a/fs/overlayfs/file.c
> > > +++ b/fs/overlayfs/file.c
> > > @@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in,
> > >                             remap_flags, op);
> > >  }
> > >
> > > +static int ovl_flush(struct file *file, fl_owner_t id)
> > > +{
> > > +       struct file *realfile = file->private_data;
> > > +
> > > +       if (realfile->f_op->flush)
> > > +               return realfile->f_op->flush(realfile, id);
> > > +
> > > +       return 0;
> > > +}
> > > +
> > >  const struct file_operations ovl_file_operations = {
> > >         .open           = ovl_open,
> > >         .release        = ovl_release,
> > > @@ -798,6 +808,7 @@ const struct file_operations ovl_file_operations = {
> > >         .fallocate      = ovl_fallocate,
> > >         .fadvise        = ovl_fadvise,
> > >         .unlocked_ioctl = ovl_ioctl,
> > > +       .flush          = ovl_flush,
> > >  #ifdef CONFIG_COMPAT
> > >         .compat_ioctl   = ovl_compat_ioctl,
> > >  #endif
> > >
> > >
> > > >
> > > > I think the tricky part here is that there is no struct file plumbed
> > > > into ->sync_fs, so you don't have an errseq_t cursor to work with by the
> > > > time that gets called.
> > > >
> > > > What may be easiest is to just propagate the s_wb_err value from the
> > > > upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get
> > > > called before the errseq_check_and_advance in the syncfs syscall wrapper
> > > > and should ensure that syncfs() calls against the overlayfs sb see any
> > > > errors that occurred on the upper_sb.
> > > >
> > > > Something like this maybe? Totally untested of course. May also need to
> > > > think about how to ensure ordering between racing syncfs calls too
> > > > (don't really want the s_wb_err going "backward"):
> > > >
> > > > ----------------------------8<---------------------------------
> > > > $ git diff
> > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > index 290983bcfbb3..d725705abdac 100644
> > > > --- a/fs/overlayfs/super.c
> > > > +++ b/fs/overlayfs/super.c
> > > > @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > >         ret = sync_filesystem(upper_sb);
> > > >         up_read(&upper_sb->s_umount);
> > > >
> > > > +       /* Propagate s_wb_err from upper_sb to overlayfs sb */
> > > > +       WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err));
> > > > +
> > > >         return ret;
> > > >  }
> > > > ----------------------------8<---------------------------------
> > > >
> > > > [1]: https://www.spinics.net/lists/linux-api/msg41276.html
> > >
> > > So,
> > > I think this will have bad behaviour because syncfs() twice in a row will
> > > return the error twice. The reason is that normally syncfs calls
> > > errseq_check_and_advance, marking the error on the super block as seen. If we
> > > copy up the error value each time, it will break this semantic, as we do not
> > > set the seen flag on the upperdir.
> > >
> > > Either we need to set the seen flag on the upperdir's errseq_t, or have a sync
> > > method, like:
> > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > index 290983bcfbb3..4931d1797c03 100644
> > > --- a/fs/overlayfs/super.c
> > > +++ b/fs/overlayfs/super.c
> > > @@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > >  {
> > >         struct ovl_fs *ofs = sb->s_fs_info;
> > >         struct super_block *upper_sb;
> > > +       errseq_t src;
> > >         int ret;
> > >
> > >         if (!ovl_upper_mnt(ofs))
> > > @@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > >         ret = sync_filesystem(upper_sb);
> > >         up_read(&upper_sb->s_umount);
> > >
> > > +       /* Propagate the error up from the upper_sb once */
> > > +       src = READ_ONCE(upper_sb->s_wb_err);
> > > +       if (errseq_counter(src) != errseq_counter(sb->s_wb_err))
> > > +               WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN);
> > > +
> > >         return ret;
> > >  }
> > >
> > > @@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > >
> > >                 sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> > >                 sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> > > +               sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err);
> > >
> > > diff --git a/lib/errseq.c b/lib/errseq.c
> > > index 81f9e33aa7e7..53275c168265 100644
> > > --- a/lib/errseq.c
> > > +++ b/lib/errseq.c
> > > @@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> > >         return err;
> > >  }
> > >  EXPORT_SYMBOL(errseq_check_and_advance);
> > > +
> > > +/**
> > > + * errseq_counter() - Get the value of the errseq counter
> > > + * @eseq: Value being checked
> > > + *
> > > + * This returns the errseq_t counter value. Reminder, it can wrap because
> > > + * there are only a limited number of counter bits.
> > > + *
> > > + * Return: The counter portion of the errseq_t.
> > > + */
> > > +int errseq_counter(errseq_t eseq)
> > > +{
> > > +       return eseq >> (ERRSEQ_SHIFT + 1);
> > > +}
> > >
> > > This also needs some locking sprinkled on it because racing can occur with
> > > sync_fs. This would be much easier if the errseq_t was 64-bits, and didn't go
> > > backwards, because you could just use a simple cmpxchg64. I know that would make
> > > a bunch of structures larger, and if it's just overlayfs that has to pay the tax
> > > we can just sprinkle a mutex on it.
> > >
> > > We can always "copy-up" the errseq_t because if the upperdir's errseq_t is read,
> > > and we haven't done a copy yet, we'll get it. Since this will be strictly
> > > serialized a simple equivalence check will work versus actually having to deal
> > > with happens-before behaviour. There is still a correctness flaw here though:
> > > if exactly enough errors occur to wrap it back to the same value, it will
> > > break.
> > >
> > > By the way, how do y'all test this error handling behaviour? I didn't
> > > find any automated testing for what currently exists.
> > > >
> > > > How about I split this into two patchsets? One, where I add the logic to copy
> > > > the errseq on callbacks to fsync from the upperdir to the ovl fs superblock,
> > > > and thus allowing VFS to bubble up errors, and the documentation. We can CC
> > > > stable on those because I think it has an effect that's universal across
> > > > all filesystems.
> > > >
> > > > P.S.
> > > > I notice you maintain overlay tests outside of the kernel. Unfortunately, I
> > > > think for this kind of test, it requires in kernel code to artificially bump the
> > > > writeback error count on the upperdir, or it requires the failure injection
> > > > infrastructure.
> > > >
> > > > Simulating this behaviour is non-trivial without in-kernel support:
> > > >
> > > > P1: Open(f) -> p1.fd
> > > > P2: Open(f) -> p2.fd
> > > > P1: syncfs(p1.fd) -> errrno
> > > > P2: syncfs(p1.fd) -> 0 # This should return an error
> > > >
> > > >
> > > > [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175
> > > > [2]: https://lwn.net/Articles/722250/
> > > >
> > > >
> > 
> > Sargun (and Jeff),
> > 
> > Thank you for this discussion. It would be very nice to work on testing
> > and fixing errseq propagation is correct on overlayfs.
> > Alas, this is not what I suggested.
> > 
> > What I suggested was a solution only for the volatile overlay issue
> > where data can vaporise without applications noticing:
> > "...the very minimum is to check for errseq since mount on the fsync
> >  and syncfs calls."
> > 
> Yeah, I was confusing the checking that VFS does on our behalf and the checking 
> that we can do ourselves in the sync callback. If we return an error prior to 
> the vfs checking it short-circuits that entirely.
> 
> > Do you get it? There is no per-file state in the game, not for fsync and not 
> > for syncfs.
> > 
> > Any single error, no matter how temporary it is and what damage it may
> > or may not have caused to upper layer consistency, permanently
> > invalidates the reliability of the volatile overlay, resulting in:
> > Effective immediately: every fsync/syncfs returns EIO.
> > Going forward: maybe implement overlay shutdown, so every access
> > returns EIO.
> > 
> > So now that I hopefully explained myself better, I'll ask again:
> > Am I wrong saying that it is very very simple to fix?
> > Would you mind making that fix at the bottom of the patch series, so it can
> > be easily applied to stable kernels?
> > 
> > Thanks,
> > Amir.
> 
> I think that this should be easy enough if the semantic is such that volatile
> overlayfs mounts will return EIO on syncfs on every syncfs call if the upperdir's
> super block has experienced errors since the initial mount. I imagine we do not
> want to make it such that if the upperdir has ever experienced errors, return
> EIO on syncfs.
> 
> The one caveat that I see is that if the errseq wraps, we can silently begin 
> swallowing errors again. Thus, on the first failed syncfs we should just
> store a flag indicating that the volatile fs is bad, and to continue to return
> EIO rather than go through the process of checking errseq_t, but that's easy
> enough to write.

It turns out this is much more complicated than I initially thought. I'm not 
entirely sure why VFS even defines sync_fs (on the superblock) as returning an 
int. The way that sync_fs works is:

sys_syncfs ->
  sync_filesystem(sb) ->
    __sync_filesystem(sb, N) ->
      sb->s_op->sync_fs
      /* __sync_blockdev has its own writeback error checking */
      __sync_blockdev
  errseq_check_and_advance(sb...)

__sync_filesystem calls the sync_fs callback from the superblock's
super_operations:

static int __sync_filesystem(struct super_block *sb, int wait)
{
	if (wait)
		sync_inodes_sb(sb);
	else
		writeback_inodes_sb(sb, WB_REASON_SYNC);

	if (sb->s_op->sync_fs)
		sb->s_op->sync_fs(sb, wait);
	return __sync_blockdev(sb->s_bdev, wait);
}

But it completely discards the result. On the other hand, fsync, and fdatasync
exhibit different behaviour:

sys_fdatasync ->
  do_fsync
   (see below)

sys_fsync ->
  do_fsync ->
    vfs_fsync ->
      vfs_fsync_range ->
        return file->f_op->fsync
----


syncfs seems to enforce the semantics laid out by VFS[1]. Specifically the 
statement:
  When there is an error during writeback, they expect that error to be reported 
  when a file sync request is made. After an error has been reported on one 
  request, subsequent requests on the same file descriptor should return 0, unless 
  further writeback errors have occurred since the previous file synchronization.

This is enforced by the errseq_check_and_advance logic. We can hack around this
logic by resetting the errseq (setting the error on it) every time we get the
sync_fs callback, but that to me seems wrong. FWIW, implementing this behaviour
for fdatasync, and fsync is easier, because the error is bubbled up from the
filesystem to the VFS. I don't actually think this is a good idea because
it seems like this sync_fs behaviour is a bit...not necessarily what all
filesystems expect. For example, btrfs_sync_fs returns an error if it
is unable to finish the current transaction. Nowhere in the btrfs code
does it set the errseq on the superblock if this fails.

I think we have a couple paths forward:
1. Change the semantic of sync_fs so the error is always bubbled up if
   it returns a non-zero value. If we do this, we have to decide whether
   or not we would _also_ call errseq_check_and_advance on the SB,
   or leave that to a subsequent call.
2. Have overlayfs forcefully set an error on the superblock on every
   callback to sync_fs. This seems ugly, but I wrote a little patch,
   and it seems to solve the problem for all the fsync / fdatasync /
   sync / syncfs variants without having to do plumbing in VFS.
3. Choose a different set of semantics for how we want to handle
   errors in volatile mounts.

[1]: https://www.kernel.org/doc/html/v5.9/filesystems/vfs.html?highlight=handling%20errors%20during%20writeback#handling-errors-during-writeback

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-12-01 11:09               ` Sargun Dhillon
@ 2020-12-01 11:29                 ` Amir Goldstein
  2020-12-01 13:01                 ` Jeff Layton
  1 sibling, 0 replies; 31+ messages in thread
From: Amir Goldstein @ 2020-12-01 11:29 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Jeff Layton, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel,
	David Howells

> syncfs seems to enforce the semantics laid out by VFS[1]. Specifically the
> statement:
>   When there is an error during writeback, they expect that error to be reported
>   when a file sync request is made. After an error has been reported on one
>   request, subsequent requests on the same file descriptor should return 0, unless
>   further writeback errors have occurred since the previous file synchronization.
>
> This is enforced by the errseq_check_and_advance logic. We can hack around this
> logic by resetting the errseq (setting the error on it) every time we get the
> sync_fs callback, but that to me seems wrong. FWIW, implementing this behaviour
> for fdatasync, and fsync is easier, because the error is bubbled up from the
> filesystem to the VFS. I don't actually think this is a good idea because
> it seems like this sync_fs behaviour is a bit...not necessarily what all
> filesystems expect. For example, btrfs_sync_fs returns an error if it
> is unable to finish the current transaction. Nowhere in the btrfs code
> does it set the errseq on the superblock if this fails.
>
> I think we have a couple paths forward:
> 1. Change the semantic of sync_fs so the error is always bubbled up if
>    it returns a non-zero value. If we do this, we have to decide whether
>    or not we would _also_ call errseq_check_and_advance on the SB,
>    or leave that to a subsequent call.
> 2. Have overlayfs forcefully set an error on the superblock on every
>    callback to sync_fs. This seems ugly, but I wrote a little patch,
>    and it seems to solve the problem for all the fsync / fdatasync /
>    sync / syncfs variants without having to do plumbing in VFS.
> 3. Choose a different set of semantics for how we want to handle
>    errors in volatile mounts.
>

IMO it is best if you post your patch to fix volatile overlayfs, because
it seems to me that the volatile overlayfs issue is worse than generic
sync_fs issues and it's good if we had a small backportable patch
even if a bit ugly.

Later you can pursue the sync_fs semantic fixes if you wish; they
do not contradict the fix in overlayfs, they are just a way to remove
a hack.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount
  2020-11-30 19:33   ` Vivek Goyal
@ 2020-12-01 11:56     ` Sargun Dhillon
  2020-12-01 12:45       ` Jeff Layton
  0 siblings, 1 reply; 31+ messages in thread
From: Sargun Dhillon @ 2020-12-01 11:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-unionfs, miklos, Alexander Viro, Amir Goldstein,
	Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells,
	Jeff Layton

On Mon, Nov 30, 2020 at 02:33:42PM -0500, Vivek Goyal wrote:
> On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> > Volatile remounts validate the following at the moment:
> >  * Has the module been reloaded / the system rebooted
> >  * Has the workdir been remounted
> > 
> > This adds a new check for errors detected via the superblock's
> > errseq_t. At mount time, the errseq_t is snapshotted to disk,
> > and upon remount it's re-verified. This allows for kernel-level
> > detection of errors without forcing userspace to perform a
> > sync and allows for the hidden detection of writeback errors.
> > 
> > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-unionfs@vger.kernel.org
> > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > Cc: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  fs/overlayfs/overlayfs.h | 1 +
> >  fs/overlayfs/readdir.c   | 6 ++++++
> >  fs/overlayfs/super.c     | 1 +
> >  3 files changed, 8 insertions(+)
> > 
> > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > index de694ee99d7c..e8a711953b64 100644
> > --- a/fs/overlayfs/overlayfs.h
> > +++ b/fs/overlayfs/overlayfs.h
> > @@ -85,6 +85,7 @@ struct ovl_volatile_info {
> >  	 */
> >  	uuid_t		ovl_boot_id;	/* Must stay first member */
> >  	u64		s_instance_id;
> > +	errseq_t	errseq;	/* Implemented as a u32 */
> >  } __packed;
> >  
> >  /*
> > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > index 7b66fbb20261..5795b28bb4cf 100644
> > --- a/fs/overlayfs/readdir.c
> > +++ b/fs/overlayfs/readdir.c
> > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
> >  		return -EINVAL;
> >  	}
> >  
> > +	err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> 
> Might be a stupid question. Will ask anyway.
> 
> But what protects against wrapping of counter. IOW, Say we stored info.errseq
> value as A. It is possible that bunch of errors occurred and at remount
> time ->s_wb_err is back to A and we pass the check. (Despite the fact lots
> of errors have occurred since we sampled).
> 
> Thanks
> Vivek
> 

+Jeff Layton <jlayton@redhat.com>

Nothing. The current errseq API works like this today: if you have 2^20
(1048576) errors interleaved with syncfs (or other calls that mark the errseq
as seen), and the error that occurred 1048575 errors ago was the same error as
the one you just last had, and the error on the upperdir has already been
marked as seen, the error will be swallowed up silently.

This exists throughout all of VFS. I think we're potentially making this more
likely by checkpointing to disk. The one aspect which is a little different about
the usecase in the patch is that it relies on this mechanism to determine if
an error has occurred after the entire FS was constructed, so it's somewhat
more consequential than the current issue in VFS which will just bubble up
errors in a few files.

On my system syncfs takes about 2 milliseconds, so you have a chance to
experience this every ~30 minutes if the syscalls align in the right way. If
we expanded the errseq_t to u64, we would potentially get a collision
every 4503599627370496 calls, or assuming the 2 millisecond invariant
holds, roughly every 285,000 years. Now, we probably don't want to make errseq_t into
a u64 because of performance reasons (not all systems have native u64
cmpxchg), and the extra memory it'd take up.

If we really want to avoid this case, I can think of one "simple" solution,
which is something like laying out errseq_t as something like an errseq_t_big
that's 64 bits, and all readers just look at the lower 32 bits. The longer
errseq_t would exist on super_blocks, but files would still get the shorter one.
To potentially avoid the performance penalty of atomic longs, we could also
do something like this:

typedef struct {
    atomic_t overflow;
    u32 errseq;
} errseq_t_big;

And in errseq_set, do:
/* Wraps */
if (new < old)
        atomic_inc(&eseq->overflow);

*shrug*
I don't think that the above scenario is likely though.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount
  2020-12-01 11:56     ` Sargun Dhillon
@ 2020-12-01 12:45       ` Jeff Layton
  0 siblings, 0 replies; 31+ messages in thread
From: Jeff Layton @ 2020-12-01 12:45 UTC (permalink / raw)
  To: Sargun Dhillon, Vivek Goyal
  Cc: linux-unionfs, miklos, Alexander Viro, Amir Goldstein,
	Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells

On Tue, 2020-12-01 at 11:56 +0000, Sargun Dhillon wrote:
> On Mon, Nov 30, 2020 at 02:33:42PM -0500, Vivek Goyal wrote:
> > On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> > > Volatile remounts validate the following at the moment:
> > >  * Has the module been reloaded / the system rebooted
> > >  * Has the workdir been remounted
> > > 
> > > This adds a new check for errors detected via the superblock's
> > > errseq_t. At mount time, the errseq_t is snapshotted to disk,
> > > and upon remount it's re-verified. This allows for kernel-level
> > > detection of errors without forcing userspace to perform a
> > > sync and allows for the hidden detection of writeback errors.
> > > 
> > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > Cc: linux-fsdevel@vger.kernel.org
> > > Cc: linux-unionfs@vger.kernel.org
> > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > ---
> > >  fs/overlayfs/overlayfs.h | 1 +
> > >  fs/overlayfs/readdir.c   | 6 ++++++
> > >  fs/overlayfs/super.c     | 1 +
> > >  3 files changed, 8 insertions(+)
> > > 
> > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > > index de694ee99d7c..e8a711953b64 100644
> > > --- a/fs/overlayfs/overlayfs.h
> > > +++ b/fs/overlayfs/overlayfs.h
> > > @@ -85,6 +85,7 @@ struct ovl_volatile_info {
> > >  	 */
> > >  	uuid_t		ovl_boot_id;	/* Must stay first member */
> > >  	u64		s_instance_id;
> > > +	errseq_t	errseq;	/* Implemented as a u32 */
> > >  } __packed;
> > >  
> > > 
> > > 
> > > 
> > >  /*
> > > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > > index 7b66fbb20261..5795b28bb4cf 100644
> > > --- a/fs/overlayfs/readdir.c
> > > +++ b/fs/overlayfs/readdir.c
> > > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
> > >  		return -EINVAL;
> > >  	}
> > >  
> > > 
> > > 
> > > 
> > > +	err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> > 
> > Might be a stupid question. Will ask anyway.
> > 
> > But what protects against wrapping of counter. IOW, Say we stored info.errseq
> > value as A. It is possible that bunch of errors occurred and at remount
> > time ->s_wb_err is back to A and we pass the check. (Despite the fact lots
> > of errors have occurred since we sampled).
> > 
> > Thanks
> > Vivek
> > 
> 
> +Jeff Layton <jlayton@redhat.com>
> 
> Nothing. The current errseq API works like this today: if you have 2^20
> (1048576) errors, each followed by a syncfs (or another call that marks the
> errseq as seen), and the error that occurred 1048575 errors ago was the same
> error you last had, and the error on the upperdir has already been marked as
> seen, the error will be swallowed up silently.
> 
> This exists throughout all of VFS. I think we're potentially making this more
> likely by checkpointing to disk. The one aspect which is a little different about
> the use case in the patch is that it relies on this mechanism to determine if
> an error has occurred after the entire FS was constructed, so it's somewhat
> more consequential than the current issue in VFS which will just bubble up
> errors in a few files.
> 
> On my system syncfs takes about 2 milliseconds, so you have a chance to
> experience this every ~30 minutes if the syscalls align in the right way. If
> we expanded the errseq_t to u64, we would potentially get a collision
> every 4503599627370496 calls, or assuming the 2 millisecond invariant
> holds, roughly every 285,000 years. Now, we probably don't want to make
> errseq_t into a u64 because of performance reasons (not all systems have
> native u64 cmpxchg), and the extra memory it'd take up.
> 
> If we really want to avoid this case, I can think of one "simple" solution,
> which is something like laying out errseq_t as something like a errseq_t_src
> that's 64-bits, and all readers just look at the lower 32-bits. The longer
> errseq_t would exist on super_blocks, but files would still get the shorter one.
> To potentially avoid the performance penalty of atomic longs, we could also
> do something like this:
> 
> typedef struct {
>     atomic_t overflow;
>     u32 errseq;
> } errseq_t_big;
> 
> And in errseq_set, do:
> /* Wraps */
> if (new < old)
>         atomic_inc(&eseq->overflow);
> 
> *shrug*
> I don't think that the above scenario is likely though.
> 


The counter is only bumped if the seen flag is set, so you'd need to
record 2^20 errors _and_ fetch them that many times, and just get
unlucky once and hit the exact counter value that you had previously
sampled.

If that's your situation, then you probably have much bigger problems
than the counter wrapping.
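
(As a rough user-space model of that behaviour — the counter only advances when
the previous value had been marked seen, so unobserved errors do not spin the
counter. Bit positions here are invented for the sketch; the real layout lives
in lib/errseq.c:)

```c
#include <stdint.h>

/*
 * Toy model of errseq-style "bump only if seen" behaviour.
 * Layout is invented: low bits hold a small errno, one flag bit marks
 * the error as seen, and the counter lives above that.
 */
#define MODEL_SEEN      (1u << 12)  /* "someone sampled this error" */
#define MODEL_CTR_SHIFT 13          /* counter sits above the flag  */
#define MODEL_ERR_MASK  0xfffu      /* low bits hold a small errno  */

/* Record a new error, bumping the counter only if the old one was seen. */
static uint32_t model_errseq_set(uint32_t cur, uint32_t err)
{
	uint32_t next = (cur & ~(MODEL_ERR_MASK | MODEL_SEEN)) |
			(err & MODEL_ERR_MASK);

	if (cur & MODEL_SEEN)
		next += 1u << MODEL_CTR_SHIFT;
	return next;
}

/* Sampling the error marks it seen, arming the next counter bump. */
static uint32_t model_errseq_sample(uint32_t *cur)
{
	*cur |= MODEL_SEEN;
	return *cur;
}
```

So wrapping the counter takes many record-then-sample cycles, which is the
"record the errors _and_ fetch them that many times" case described above.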

-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-12-01 11:09               ` Sargun Dhillon
  2020-12-01 11:29                 ` Amir Goldstein
@ 2020-12-01 13:01                 ` Jeff Layton
  2020-12-01 15:24                   ` Vivek Goyal
  1 sibling, 1 reply; 31+ messages in thread
From: Jeff Layton @ 2020-12-01 13:01 UTC (permalink / raw)
  To: Sargun Dhillon, Amir Goldstein
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells

On Tue, 2020-12-01 at 11:09 +0000, Sargun Dhillon wrote:
> On Sat, Nov 28, 2020 at 08:52:27AM +0000, Sargun Dhillon wrote:
> > On Sat, Nov 28, 2020 at 09:12:03AM +0200, Amir Goldstein wrote:
> > > On Sat, Nov 28, 2020 at 6:45 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > 
> > > > On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote:
> > > > > On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote:
> > > > > > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote:
> > > > > > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > > > > > 
> > > > > > > > This documents behaviour that was discussed in a thread about the volatile
> > > > > > > > feature. Specifically, how failures can go hidden from asynchronous writes
> > > > > > > > (such as from mmap, or writes that are not immediately flushed to the
> > > > > > > > filesystem). Although we pass through calls like msync, fallocate, and
> > > > > > > > write, and will still return errors on those, it doesn't guarantee all
> > > > > > > > kinds of errors will happen at those times, and thus may hide errors.
> > > > > > > > 
> > > > > > > > In the future, we can add error checking to all interactions with the
> > > > > > > > upperdir, and pass through errseq_t from the upperdir on mappings,
> > > > > > > > and other interactions with the filesystem[1].
> > > > > > > > 
> > > > > > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
> > > > > > > > 
> > > > > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > > > > > > Cc: linux-fsdevel@vger.kernel.org
> > > > > > > > Cc: linux-unionfs@vger.kernel.org
> > > > > > > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > > > > > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > > > > > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > > > > > > ---
> > > > > > > >  Documentation/filesystems/overlayfs.rst | 6 +++++-
> > > > > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > > > 
> > > > > > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > > > > > > > index 580ab9a0fe31..c6e30c1bc2f2 100644
> > > > > > > > --- a/Documentation/filesystems/overlayfs.rst
> > > > > > > > +++ b/Documentation/filesystems/overlayfs.rst
> > > > > > > > @@ -570,7 +570,11 @@ Volatile mount
> > > > > > > >  This is enabled with the "volatile" mount option.  Volatile mounts are not
> > > > > > > >  guaranteed to survive a crash.  It is strongly recommended that volatile
> > > > > > > >  mounts are only used if data written to the overlay can be recreated
> > > > > > > > -without significant effort.
> > > > > > > > +without significant effort.  In addition to this, the sync family of syscalls
> > > > > > > > +are not sufficient to determine whether a write failed as sync calls are
> > > > > > > > +omitted.  For this reason, it is important that the filesystem used by the
> > > > > > > > +upperdir handles failure in a fashion that's suitable for the user.  For
> > > > > > > > +example, upon detecting a fault, ext4 can be configured to panic.
> > > > > > > > 
> > > > > > > 
> > > > > > > Reading this now, I think I may have wrongly analysed the issue.
> > > > > > > Specifically, when I wrote that the very minimum is to document the
> > > > > > > issue, it was under the assumption that a proper fix is hard.
> > > > > > > I think I was wrong and that the very minimum is to check for errseq
> > > > > > > since mount on the fsync and syncfs calls.
> > > > > > > 
> > > > > > > Why? first of all because it is very very simple and goes a long way to
> > > > > > > fix the broken contract with applications, not the contract about durability
> > > > > > > obviously, but the contract about write+fsync+read expects to find the written
> > > > > > > data (during the same mount era).
> > > > > > > 
> > > > > > > Second, because the sentence you added above is hard for users to understand
> > > > > > > and out of context. If we properly handle the writeback error in fsync/syncfs,
> > > > > > > then this sentence becomes irrelevant.
> > > > > > > The fact that ext4 can lose data if application did not fsync after
> > > > > > > write is true
> > > > > > > for volatile as well as non-volatile and it is therefore not relevant
> > > > > > > in the context
> > > > > > > of overlayfs documentation at all.
> > > > > > > 
> > > > > > > Am I wrong saying that it is very very simple to fix?
> > > > > > > Would you mind making that fix at the bottom of the patch series, so it can
> > > > > > > be easily applied to stable kernels?
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Amir.
> > > > > > 
> > > > > > I'm not sure it's so easy. In VFS, there are places where the superblock's
> > > > > > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the
> > > > > > errseq of the corresponding real file's superblock. One solution might be
> > > > > > as part of all these callbacks we set our errseq to the errseq of the filesystem
> > > > > > that the upperdir is on, and then rely on VFS's checking.
> > > > > > 
> > > > > > I'm having a hard time figuring out how to deal with the per-mapping based
> > > > > > error tracking. It seems like this infrastructure was only partially completed
> > > > > > by Jeff Layton[2]. I don't know how it's actually supposed to work right now,
> > > > > > as not all of his patches landed.
> > > > > > 
> > > > > 
> > > > > The patches in the series were all merged, but we ended up going with a
> > > > > simpler solution [1] than the first series I posted. Instead of plumbing
> > > > > the errseq_t handling down into sync_fs, we did it in the syscall
> > > > > wrapper.
> > > > Jeff,
> > > > 
> > > > Thanks for replying. I'm still a little confused as to how the
> > > > per-address_space wb_err is meant to be used. It seems like we should
> > > > implement the flush method, like so:
> > > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > > > index efccb7c1f9bc..32e5bc0aacd6 100644
> > > > --- a/fs/overlayfs/file.c
> > > > +++ b/fs/overlayfs/file.c
> > > > @@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in,
> > > >                             remap_flags, op);
> > > >  }
> > > > 
> > > > +static int ovl_flush(struct file *file, fl_owner_t id)
> > > > +{
> > > > +       struct file *realfile = file->private_data;
> > > > +
> > > > +       if (realfile->f_op->flush)
> > > > +               return realfile->f_op->flush(file, id);
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
> > > >  const struct file_operations ovl_file_operations = {
> > > >         .open           = ovl_open,
> > > >         .release        = ovl_release,
> > > > @@ -798,6 +808,7 @@ const struct file_operations ovl_file_operations = {
> > > >         .fallocate      = ovl_fallocate,
> > > >         .fadvise        = ovl_fadvise,
> > > >         .unlocked_ioctl = ovl_ioctl,
> > > > +       .flush          = ovl_flush,
> > > >  #ifdef CONFIG_COMPAT
> > > >         .compat_ioctl   = ovl_compat_ioctl,
> > > >  #endif
> > > > 
> > > > 
> > > > > 
> > > > > I think the tricky part here is that there is no struct file plumbed
> > > > > into ->sync_fs, so you don't have an errseq_t cursor to work with by the
> > > > > time that gets called.
> > > > > 
> > > > > What may be easiest is to just propagate the s_wb_err value from the
> > > > > upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get
> > > > > called before the errseq_check_and_advance in the syncfs syscall wrapper
> > > > > and should ensure that syncfs() calls against the overlayfs sb see any
> > > > > errors that occurred on the upper_sb.
> > > > > 
> > > > > Something like this maybe? Totally untested of course. May also need to
> > > > > think about how to ensure ordering between racing syncfs calls too
> > > > > (don't really want the s_wb_err going "backward"):
> > > > > 
> > > > > ----------------------------8<---------------------------------
> > > > > $ git diff
> > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > > index 290983bcfbb3..d725705abdac 100644
> > > > > --- a/fs/overlayfs/super.c
> > > > > +++ b/fs/overlayfs/super.c
> > > > > @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > > >         ret = sync_filesystem(upper_sb);
> > > > >         up_read(&upper_sb->s_umount);
> > > > > 
> > > > > +       /* Propagate s_wb_err from upper_sb to overlayfs sb */
> > > > > +       WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err));
> > > > > +
> > > > >         return ret;
> > > > >  }
> > > > > ----------------------------8<---------------------------------
> > > > > 
> > > > > [1]: https://www.spinics.net/lists/linux-api/msg41276.html
> > > > 
> > > > So,
> > > > I think this will have bad behaviour because syncfs() twice in a row will
> > > > return the error twice. The reason is that normally in syncfs it calls
> > > > errseq_check_and_advance marking the error on the super block as seen. If we
> > > > copy-up the error value each time, it will break this semantic, as we do not set
> > > > seen on the upperdir.
> > > > 
> > > > Either we need to set the seen flag on the upperdir's errseq_t, or have a sync
> > > > method, like:
> > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > index 290983bcfbb3..4931d1797c03 100644
> > > > --- a/fs/overlayfs/super.c
> > > > +++ b/fs/overlayfs/super.c
> > > > @@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > >  {
> > > >         struct ovl_fs *ofs = sb->s_fs_info;
> > > >         struct super_block *upper_sb;
> > > > +       errseq_t src;
> > > >         int ret;
> > > > 
> > > >         if (!ovl_upper_mnt(ofs))
> > > > @@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > >         ret = sync_filesystem(upper_sb);
> > > >         up_read(&upper_sb->s_umount);
> > > > 
> > > > +       /* Propagate the error up from the upper_sb once */
> > > > +       src = READ_ONCE(upper_sb->s_wb_err);
> > > > +       if (errseq_counter(src) != errseq_counter(sb->s_wb_err))
> > > > +               WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN);
> > > > +
> > > >         return ret;
> > > >  }
> > > > 
> > > > @@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > > 
> > > >                 sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> > > >                 sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> > > > +               sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err);
> > > > 
> > > > diff --git a/lib/errseq.c b/lib/errseq.c
> > > > index 81f9e33aa7e7..53275c168265 100644
> > > > --- a/lib/errseq.c
> > > > +++ b/lib/errseq.c
> > > > @@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> > > >         return err;
> > > >  }
> > > >  EXPORT_SYMBOL(errseq_check_and_advance);
> > > > +
> > > > +/**
> > > > + * errseq_counter() - Get the value of the errseq counter
> > > > + * @eseq: Value being checked
> > > > + *
> > > > + * This returns the errseq_t counter value. Reminder, it can wrap because
> > > > + * there are only a limited number of counter bits.
> > > > + *
> > > > + * Return: The counter portion of the errseq_t.
> > > > + */
> > > > +int errseq_counter(errseq_t eseq)
> > > > +{
> > > > +       return eseq >> (ERRSEQ_SHIFT + 1);
> > > > +}
> > > > 
> > > > This also needs some locking sprinkled on it because racing can occur with
> > > > sync_fs. This would be much easier if the errseq_t was 64-bits, and didn't go
> > > > backwards, because you could just use a simple cmpxchg64. I know that would make
> > > > a bunch of structures larger, and if it's just overlayfs that has to pay the tax
> > > > we can just sprinkle a mutex on it.
> > > > 
> > > > We can always "copy-up" the errseq_t because if the upperdir's errseq_t is read,
> > > > and we haven't done a copy yet, we'll get it. Since this will be strictly
> > > > serialized a simple equivalence check will work versus actually having to deal
> > > > with happens-before behaviour. There is still a correctness flaw here though,
> > > > which is that if exactly enough errors occur to wrap back to the same
> > > > value, it will break.
> > > > 
> > > > By the way, how do y'all test this error handling behaviour? I didn't
> > > > find any automated testing for what currently exists.
> > > > > 
> > > > > How about I split this into two patchsets? One, where I add the logic to copy
> > > > > the errseq on callbacks to fsync from the upperdir to the ovl fs superblock,
> > > > > thus allowing VFS to bubble up errors, and the documentation. We can CC
> > > > > stable on those because I think it has an effect that's universal across
> > > > > all filesystems.
> > > > > 
> > > > > P.S.
> > > > > I notice you maintain overlay tests outside of the kernel. Unfortunately, I
> > > > > think for this kind of test, it requires in kernel code to artificially bump the
> > > > > writeback error count on the upperdir, or it requires the failure injection
> > > > > infrastructure.
> > > > > 
> > > > > Simulating this behaviour is non-trivial without in-kernel support:
> > > > > 
> > > > > P1: Open(f) -> p1.fd
> > > > > P2: Open(f) -> p2.fd
> > > > > P1: syncfs(p1.fd) -> errno
> > > > > P2: syncfs(p2.fd) -> 0 # This should return an error
> > > > > 
> > > > > 
> > > > > [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175
> > > > > [2]: https://lwn.net/Articles/722250/
> > > > > 
> > > > > 
> > > 
> > > Sargun (and Jeff),
> > > 
> > > Thank you for this discussion. It would be very nice to work on testing
> > > and fixing errseq propagation on overlayfs.
> > > Alas, this is not what I suggested.
> > > 
> > > What I suggested was a solution only for the volatile overlay issue
> > > where data can vaporise without applications noticing:
> > > "...the very minimum is to check for errseq since mount on the fsync
> > >  and syncfs calls."
> > > 
> > Yeah, I was confusing the checking that VFS does on our behalf and the checking 
> > that we can do ourselves in the sync callback. If we return an error prior to 
> > the VFS checking, it short-circuits that entirely.
> > 
> > > Do you get it? There is no per-file state in the game, not for fsync and not
> > > for syncfs.
> > > 
> > > Any single error, no matter how temporary it is and what damage it may
> > > or may not have caused to upper layer consistency, permanently
> > > invalidates the reliability of the volatile overlay, resulting in:
> > > Effective immediately: every fsync/syncfs returns EIO.
> > > Going forward: maybe implement overlay shutdown, so every access
> > > returns EIO.
> > > 
> > > So now that I hopefully explained myself better, I'll ask again:
> > > Am I wrong saying that it is very very simple to fix?
> > > Would you mind making that fix at the bottom of the patch series, so it can
> > > be easily applied to stable kernels?
> > > 
> > > Thanks,
> > > Amir.
> > 
> > I think that this should be easy enough if the semantic is such that volatile
> > overlayfs mounts will return EIO on every syncfs call if the upperdir's
> > super block has experienced errors since the initial mount. I imagine we do not
> > want to make it such that if the upperdir has ever experienced errors, return
> > EIO on syncfs.
> > 
> > The one caveat that I see is that if the errseq wraps, we can silently begin 
> > swallowing errors again. Thus, on the first failed syncfs we should just
> > store a flag indicating that the volatile fs is bad, and to continue to return
> > EIO rather than go through the process of checking errseq_t, but that's easy
> > enough to write.
> 
> It turns out this is much more complicated than I initially thought. I'm not 
> entirely sure why VFS even defines sync_fs (on the superblock) as returning an 
> int. The way that sync_fs works is:
> 
> sys_syncfs ->
>   sync_filesystem(sb) ->
>     __sync_filesystem(sb, N) ->
>       sb->s_op->sync_fs
>       /* __sync_blockdev has its own writeback error checking */
>       __sync_blockdev
>   errseq_check_and_advance(sb...)
> 
> __sync_filesystem calls the sync_fs callback on the superblock's superblock
> operations:
> 
> static int __sync_filesystem(struct super_block *sb, int wait)
> {
> 	if (wait)
> 		sync_inodes_sb(sb);
> 	else
> 		writeback_inodes_sb(sb, WB_REASON_SYNC);
> 
> 	if (sb->s_op->sync_fs)
> 		sb->s_op->sync_fs(sb, wait);
> 	return __sync_blockdev(sb->s_bdev, wait);
> }
> 
> But it completely discards the result. On the other hand, fsync, and fdatasync
> exhibit different behaviour:
> 
> sys_fdatasync ->
>   do_fsync
>    (see below)
> 
> sys_fsync ->
>   do_fsync ->
>     vfs_fsync ->
>       vfs_fsync_range ->
>         return file->f_op->fsync
> ----
> 

This is mainly a result of the evolution of the code. In the beginning
there was only sync() and it doesn't report errors. Eventually syncfs()
was added, but the only error it reliably reported was EBADF.

I added the machinery to record writeback errors of individual inodes at
the superblock level recently, but the sync_fs op was left alone (mainly
because changing this would be very invasive, and the value of doing so
wasn't completely clear).


> 
> syncfs seems to enforce the semantics laid out by VFS[1]. Specifically the 
> statement:
>   When there is an error during writeback, they expect that error to be reported 
>   when a file sync request is made. After an error has been reported on one 
>   request, subsequent requests on the same file descriptor should return 0, unless 
>   further writeback errors have occurred since the previous file synchronization.
> 
> This is enforced by the errseq_check_and_advance logic. We can hack around this
> logic by resetting the errseq (setting the error on it) every time we get the
> sync_fs callback, but that to me seems wrong. FWIW, implementing this behaviour
> for fdatasync, and fsync is easier, because the error is bubbled up from the
> filesystem to the VFS. I don't actually think this is a good idea because
> it seems like this sync_fs behaviour is a bit...not necessarily what all
> filesystems expect. For example, btrfs_sync_fs returns an error if it
> is unable to finish the current transaction. Nowhere in the btrfs code
> does it set the errseq on the superblock if this fails.
> 
> I think we have a couple paths forward:
> 1. Change the semantic of sync_fs so the error is always bubbled up if
>    it returns a non-zero value. If we do this, we have to decide whether
>    or not we would _also_ call errseq_check_and_advance on the SB,
>    or leave that to a subsequent call.

You would want to call errseq_check_and_advance in that situation.
Otherwise a subsequent syncfs call would return an error even when an
error had not occurred since the last syncfs.
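
A toy user-space model of that once-per-error semantic (invented layout; the
real cursor handling is what lib/errseq.c's errseq_check_and_advance provides):

```c
#include <stdint.h>

/*
 * Toy model of check-and-advance: report an error only if the current
 * word differs from the caller's cursor, then advance the cursor so the
 * same error is not reported twice to the same caller.  Layout is
 * invented for this sketch; the real code also marks the shared word
 * as seen.
 */
#define TOY_ERR_MASK 0xfffu /* low bits hold a small errno */

static int toy_check_and_advance(uint32_t cur, uint32_t *since)
{
	int err = 0;

	if (cur != *since) {
		err = -(int)(cur & TOY_ERR_MASK);
		*since = cur; /* this caller has now consumed the error */
	}
	return err;
}
```

With this run after sync_fs returns, a second syncfs with no intervening error
reports 0, which is the behaviour described above.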

> 2. Have overlayfs forcefully set an error on the superblock on every
>    callback to sync_fs. This seems ugly, but I wrote a little patch,
>    and it seems to solve the problem for all the fsync / fdatasync /
>    sync / syncfs variants without having to do plumbing in VFS.

It is ugly, but that's probably the easiest solution.
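
A user-space sketch of option 2 combined with the sticky-flag idea raised
earlier in the thread: the overlay snapshots the upper counter at mount time,
and once any divergence is observed, every subsequent sync reports -EIO. All
struct and function names here are invented for the model, not the kernel API:

```c
#include <errno.h>
#include <stdint.h>

/*
 * Model of a volatile-overlay error policy: snapshot the upper sb's
 * error counter at mount, and latch a sticky "bad" flag on the first
 * observed divergence.  The sticky flag means a later counter wrap
 * back to the snapshotted value cannot hide the earlier error.
 */
struct model_sb {
	uint32_t wb_err_ctr; /* stands in for the upper errseq counter */
};

struct model_ovl {
	uint32_t ctr_at_mount; /* snapshot of upper counter at mount    */
	int      bad;          /* sticky: volatile overlay is unusable  */
};

static int ovl_sync_fs_model(struct model_ovl *ovl,
			     const struct model_sb *upper)
{
	if (ovl->bad || upper->wb_err_ctr != ovl->ctr_at_mount)
		ovl->bad = 1; /* latch, so wrap-around can't clear it */
	return ovl->bad ? -EIO : 0;
}
```

The sticky flag sidesteps the wrap problem: once the overlay has seen any
divergence from the mount-time snapshot, it stays bad even if the upper counter
later wraps back to the snapshotted value.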

> 3. Choose a different set of semantics for how we want to handle
>    errors in volatile mounts.
> 
> [1]: https://www.kernel.org/doc/html/v5.9/filesystems/vfs.html?highlight=handling%20errors%20during%20writeback#handling-errors-during-writeback
> 

-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-12-01 13:01                 ` Jeff Layton
@ 2020-12-01 15:24                   ` Vivek Goyal
  2020-12-01 16:10                     ` Jeff Layton
  0 siblings, 1 reply; 31+ messages in thread
From: Vivek Goyal @ 2020-12-01 15:24 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Sargun Dhillon, Amir Goldstein, overlayfs, Miklos Szeredi,
	Alexander Viro, Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel,
	David Howells

On Tue, Dec 01, 2020 at 08:01:11AM -0500, Jeff Layton wrote:
> On Tue, 2020-12-01 at 11:09 +0000, Sargun Dhillon wrote:
> > On Sat, Nov 28, 2020 at 08:52:27AM +0000, Sargun Dhillon wrote:
> > > On Sat, Nov 28, 2020 at 09:12:03AM +0200, Amir Goldstein wrote:
> > > > On Sat, Nov 28, 2020 at 6:45 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > > 
> > > > > On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote:
> > > > > > On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote:
> > > > > > > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote:
> > > > > > > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > > > > > > 
> > > > > > > > > This documents behaviour that was discussed in a thread about the volatile
> > > > > > > > > feature. Specifically, how failures can go hidden from asynchronous writes
> > > > > > > > > (such as from mmap, or writes that are not immediately flushed to the
> > > > > > > > > filesystem). Although we pass through calls like msync, fallocate, and
> > > > > > > > > write, and will still return errors on those, it doesn't guarantee all
> > > > > > > > > kinds of errors will happen at those times, and thus may hide errors.
> > > > > > > > > 
> > > > > > > > > In the future, we can add error checking to all interactions with the
> > > > > > > > > upperdir, and pass through errseq_t from the upperdir on mappings,
> > > > > > > > > and other interactions with the filesystem[1].
> > > > > > > > > 
> > > > > > > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > > > > > > > Cc: linux-fsdevel@vger.kernel.org
> > > > > > > > > Cc: linux-unionfs@vger.kernel.org
> > > > > > > > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > > > > > > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > > > > > > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > > > > > > > ---
> > > > > > > > >  Documentation/filesystems/overlayfs.rst | 6 +++++-
> > > > > > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > > > > > > > > index 580ab9a0fe31..c6e30c1bc2f2 100644
> > > > > > > > > --- a/Documentation/filesystems/overlayfs.rst
> > > > > > > > > +++ b/Documentation/filesystems/overlayfs.rst
> > > > > > > > > @@ -570,7 +570,11 @@ Volatile mount
> > > > > > > > >  This is enabled with the "volatile" mount option.  Volatile mounts are not
> > > > > > > > >  guaranteed to survive a crash.  It is strongly recommended that volatile
> > > > > > > > >  mounts are only used if data written to the overlay can be recreated
> > > > > > > > > -without significant effort.
> > > > > > > > > +without significant effort.  In addition to this, the sync family of syscalls
> > > > > > > > > +are not sufficient to determine whether a write failed as sync calls are
> > > > > > > > > +omitted.  For this reason, it is important that the filesystem used by the
> > > > > > > > > +upperdir handles failure in a fashion that's suitable for the user.  For
> > > > > > > > > +example, upon detecting a fault, ext4 can be configured to panic.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Reading this now, I think I may have wrongly analysed the issue.
> > > > > > > > Specifically, when I wrote that the very minimum is to document the
> > > > > > > > issue, it was under the assumption that a proper fix is hard.
> > > > > > > > I think I was wrong and that the very minimum is to check for errseq
> > > > > > > > since mount on the fsync and syncfs calls.
> > > > > > > > 
> > > > > > > > Why? first of all because it is very very simple and goes a long way to
> > > > > > > > fix the broken contract with applications, not the contract about durability
> > > > > > > > obviously, but the contract about write+fsync+read expects to find the written
> > > > > > > > data (during the same mount era).
> > > > > > > > 
> > > > > > > > Second, because the sentence you added above is hard for users to understand
> > > > > > > > and out of context. If we properly handle the writeback error in fsync/syncfs,
> > > > > > > > then this sentence becomes irrelevant.
> > > > > > > > The fact that ext4 can lose data if application did not fsync after
> > > > > > > > write is true
> > > > > > > > for volatile as well as non-volatile and it is therefore not relevant
> > > > > > > > in the context
> > > > > > > > of overlayfs documentation at all.
> > > > > > > > 
> > > > > > > > Am I wrong saying that it is very very simple to fix?
> > > > > > > > Would you mind making that fix at the bottom of the patch series, so it can
> > > > > > > > be easily applied to stable kernels?
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Amir.
> > > > > > > 
> > > > > > > I'm not sure it's so easy. In VFS, there are places where the superblock's
> > > > > > > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the
> > > > > > > errseq of the corresponding real file's superblock. One solution might be
> > > > > > > as part of all these callbacks we set our errseq to the errseq of the filesystem
> > > > > > > that the upperdir is on, and then rely on VFS's checking.
> > > > > > > 
> > > > > > > I'm having a hard time figuring out how to deal with the per-mapping based
> > > > > > > error tracking. It seems like this infrastructure was only partially completed
> > > > > > > by Jeff Layton[2]. I don't know how it's actually supposed to work right now,
> > > > > > > as not all of his patches landed.
> > > > > > > 
> > > > > > 
> > > > > > The patches in the series were all merged, but we ended up going with a
> > > > > > simpler solution [1] than the first series I posted. Instead of plumbing
> > > > > > the errseq_t handling down into sync_fs, we did it in the syscall
> > > > > > wrapper.
> > > > > Jeff,
> > > > > 
> > > > > Thanks for replying. I'm still a little confused as to how the
> > > > > per-address_space wb_err is meant to be used. It seems like we should
> > > > > implement the flush method, like so:
> > > > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > > > > index efccb7c1f9bc..32e5bc0aacd6 100644
> > > > > --- a/fs/overlayfs/file.c
> > > > > +++ b/fs/overlayfs/file.c
> > > > > @@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in,
> > > > >                             remap_flags, op);
> > > > >  }
> > > > > 
> > > > > +static int ovl_flush(struct file *file, fl_owner_t id)
> > > > > +{
> > > > > +       struct file *realfile = file->private_data;
> > > > > +
> > > > > +       if (realfile->f_op->flush)
> > > > > +               return realfile->f_op->flush(file, id);
> > > > > +
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > >  const struct file_operations ovl_file_operations = {
> > > > >         .open           = ovl_open,
> > > > >         .release        = ovl_release,
> > > > > @@ -798,6 +808,7 @@ const struct file_operations ovl_file_operations = {
> > > > >         .fallocate      = ovl_fallocate,
> > > > >         .fadvise        = ovl_fadvise,
> > > > >         .unlocked_ioctl = ovl_ioctl,
> > > > > +       .flush          = ovl_flush,
> > > > >  #ifdef CONFIG_COMPAT
> > > > >         .compat_ioctl   = ovl_compat_ioctl,
> > > > >  #endif
> > > > > 
> > > > > 
> > > > > > 
> > > > > > I think the tricky part here is that there is no struct file plumbed
> > > > > > into ->sync_fs, so you don't have an errseq_t cursor to work with by the
> > > > > > time that gets called.
> > > > > > 
> > > > > > What may be easiest is to just propagate the s_wb_err value from the
> > > > > > upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get
> > > > > > called before the errseq_check_and_advance in the syncfs syscall wrapper
> > > > > > and should ensure that syncfs() calls against the overlayfs sb see any
> > > > > > errors that occurred on the upper_sb.
> > > > > > 
> > > > > > Something like this maybe? Totally untested of course. May also need to
> > > > > > think about how to ensure ordering between racing syncfs calls too
> > > > > > (don't really want the s_wb_err going "backward"):
> > > > > > 
> > > > > > ----------------------------8<---------------------------------
> > > > > > $ git diff
> > > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > > > index 290983bcfbb3..d725705abdac 100644
> > > > > > --- a/fs/overlayfs/super.c
> > > > > > +++ b/fs/overlayfs/super.c
> > > > > > @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > > > >         ret = sync_filesystem(upper_sb);
> > > > > >         up_read(&upper_sb->s_umount);
> > > > > > 
> > > > > > +       /* Propagate s_wb_err from upper_sb to overlayfs sb */
> > > > > > +       WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err));
> > > > > > +
> > > > > >         return ret;
> > > > > >  }
> > > > > > ----------------------------8<---------------------------------
> > > > > > 
> > > > > > [1]: https://www.spinics.net/lists/linux-api/msg41276.html
> > > > > 
> > > > > So,
> > > > > I think this will have bad behaviour because syncfs() twice in a row will
> > > > > return the error twice. The reason is that normally in syncfs it calls
> > > > > errseq_check_and_advance marking the error on the super block as seen. If we
> > > > > copy-up the error value each time, it will break this semantic, as we do not set
> > > > > seen on the upperdir.
> > > > > 
> > > > > Either we need to set the seen flag on the upperdir's errseq_t, or have a sync
> > > > > method, like:
> > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > > index 290983bcfbb3..4931d1797c03 100644
> > > > > --- a/fs/overlayfs/super.c
> > > > > +++ b/fs/overlayfs/super.c
> > > > > @@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > > >  {
> > > > >         struct ovl_fs *ofs = sb->s_fs_info;
> > > > >         struct super_block *upper_sb;
> > > > > +       errseq_t src;
> > > > >         int ret;
> > > > > 
> > > > >         if (!ovl_upper_mnt(ofs))
> > > > > @@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > > >         ret = sync_filesystem(upper_sb);
> > > > >         up_read(&upper_sb->s_umount);
> > > > > 
> > > > > +       /* Propagate the error up from the upper_sb once */
> > > > > +       src = READ_ONCE(upper_sb->s_wb_err);
> > > > > +       if (errseq_counter(src) != errseq_counter(sb->s_wb_err))
> > > > > +               WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN);
> > > > > +
> > > > >         return ret;
> > > > >  }
> > > > > 
> > > > > @@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > > > 
> > > > >                 sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> > > > >                 sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> > > > > +               sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err);
> > > > > 
> > > > > diff --git a/lib/errseq.c b/lib/errseq.c
> > > > > index 81f9e33aa7e7..53275c168265 100644
> > > > > --- a/lib/errseq.c
> > > > > +++ b/lib/errseq.c
> > > > > @@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> > > > >         return err;
> > > > >  }
> > > > >  EXPORT_SYMBOL(errseq_check_and_advance);
> > > > > +
> > > > > +/**
> > > > > + * errseq_counter() - Get the value of the errseq counter
> > > > > + * @eseq: Value being checked
> > > > > + *
> > > > > + * This returns the errseq_t counter value. Reminder, it can wrap because
> > > > > + * there are only a limited number of counter bits.
> > > > > + *
> > > > > + * Return: The counter portion of the errseq_t.
> > > > > + */
> > > > > +int errseq_counter(errseq_t eseq)
> > > > > +{
> > > > > +       return eseq >> (ERRSEQ_SHIFT + 1);
> > > > > +}
> > > > > 
> > > > > This also needs some locking sprinkled on it because racing can occur with
> > > > > sync_fs. This would be much easier if the errseq_t was 64-bits, and didn't go
> > > > > backwards, because you could just use a simple cmpxchg64. I know that would make
> > > > > a bunch of structures larger, and if it's just overlayfs that has to pay the tax
> > > > > we can just sprinkle a mutex on it.
> > > > > 
> > > > > We can always "copy-up" the errseq_t because if the upperdir's errseq_t is read,
> > > > > and we haven't done a copy yet, we'll get it. Since this will be strictly
> > > > > serialized, a simple equivalence check will work versus actually having to deal
> > > > > with happens-before behaviour. There is still a correctness flaw here though:
> > > > > if exactly enough errors occur to wrap the counter back to the same
> > > > > value, it will break.
> > > > > 
> > > > > By the way, how do y'all test this error handling behaviour? I didn't
> > > > > find any automated testing for what currently exists.
> > > > > > 
> > > > > > How about I split this into two patchsets? One, where I add the logic to copy
> > > > > > the errseq on callbacks to fsync from the upperdir to the ovl fs superblock,
> > > > > > and thus allowing VFS to bubble up errors, and the documentation. We can CC
> > > > > > stable on those because I think it has an effect that's universal across
> > > > > > all filesystems.
> > > > > > 
> > > > > > P.S.
> > > > > > I notice you maintain overlay tests outside of the kernel. Unfortunately, I
> > > > > > think for this kind of test, it requires in kernel code to artificially bump the
> > > > > > writeback error count on the upperdir, or it requires the failure injection
> > > > > > infrastructure.
> > > > > > 
> > > > > > Simulating this behaviour is non-trivial without in-kernel support:
> > > > > > 
> > > > > > P1: Open(f) -> p1.fd
> > > > > > P2: Open(f) -> p2.fd
> > > > > > P1: syncfs(p1.fd) -> errno
> > > > > > P2: syncfs(p2.fd) -> 0 # This should return an error
> > > > > > 
> > > > > > 
> > > > > > [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175
> > > > > > [2]: https://lwn.net/Articles/722250/
> > > > > > 
> > > > > > 
> > > > 
> > > > Sargun (and Jeff),
> > > > 
> > > > Thank you for this discussion. It would be very nice to work on testing
> > > > and fixing errseq propagation so that it is correct on overlayfs.
> > > > Alas, this is not what I suggested.
> > > > 
> > > > What I suggested was a solution only for the volatile overlay issue
> > > > where data can vaporise without applications noticing:
> > > > "...the very minimum is to check for errseq since mount on the fsync
> > > >  and syncfs calls."
> > > > 
> > > Yeah, I was confusing the checking that VFS does on our behalf and the checking 
> > > that we can do ourselves in the sync callback. If we return an error prior to
> > > the VFS checking, it short-circuits that entirely.
> > > 
> > > > Do you get it? There is no per-file state in the game, not for fsync and not
> > > > for syncfs.
> > > > 
> > > > Any single error, no matter how temporary it is and what damage it may
> > > > or may not have caused to upper layer consistency, permanently
> > > > invalidates the reliability of the volatile overlay, resulting in:
> > > > Effective immediately: every fsync/syncfs returns EIO.
> > > > Going forward: maybe implement overlay shutdown, so every access
> > > > returns EIO.
> > > > 
> > > > So now that I hopefully explained myself better, I'll ask again:
> > > > Am I wrong saying that it is very very simple to fix?
> > > > Would you mind making that fix at the bottom of the patch series, so it can
> > > > be easily applied to stable kernels?
> > > > 
> > > > Thanks,
> > > > Amir.
> > > 
> > > I think that this should be easy enough if the semantic is such that volatile
> > > overlayfs mounts will return EIO on every syncfs call if the upperdir's
> > > super block has experienced errors since the initial mount. I imagine we do not
> > > want to make it such that if the upperdir has ever experienced errors, return
> > > EIO on syncfs.
> > > 
> > > The one caveat that I see is that if the errseq wraps, we can silently begin 
> > > swallowing errors again. Thus, on the first failed syncfs we should just
> > > store a flag indicating that the volatile fs is bad, and to continue to return
> > > EIO rather than go through the process of checking errseq_t, but that's easy
> > > enough to write.
> > 
> > It turns out this is much more complicated than I initially thought. I'm not 
> > entirely sure why VFS even defines sync_fs (on the superblock) as returning an 
> > int. The way that sync_fs works is:
> > 
> > sys_syncfs ->
> >   sync_filesystem(sb) ->
> >     __sync_filesystem(sb, N) ->
> >       sb->s_op->sync_fs
> >       /* __sync_blockdev has its own writeback error checking */
> >       __sync_blockdev
> >   errseq_check_and_advance(sb...)
> > 
> > __sync_filesystem calls the sync_fs callback on the superblock's superblock
> > operations:
> > 
> > static int __sync_filesystem(struct super_block *sb, int wait)
> > {
> > 	if (wait)
> > 		sync_inodes_sb(sb);
> > 	else
> > 		writeback_inodes_sb(sb, WB_REASON_SYNC);
> > 
> > 	if (sb->s_op->sync_fs)
> > 		sb->s_op->sync_fs(sb, wait);
> > 	return __sync_blockdev(sb->s_bdev, wait);
> > }
> > 
> > But it completely discards the result. On the other hand, fsync, and fdatasync
> > exhibit different behaviour:
> > 
> > sys_fdatasync ->
> >   do_fsync
> >    (see below)
> > 
> > sys_fsync ->
> >   do_fsync ->
> >     vfs_fsync ->
> >       vfs_fsync_range ->
> >         return file->f_op->fsync
> > ----
> > 
> 
> This is mainly a result of the evolution of the code. In the beginning
> there was only sync() and it doesn't report errors. Eventually syncfs()
> was added, but the only error it reliably reported was EBADF.
> 
> I added the machinery to record writeback errors of individual inodes at
> the superblock level recently, but the sync_fs op was left alone (mainly
> because changing this would be very invasive, and the value of doing so
> wasn't completely clear).

Is the following patch enough to capture the ->sync_fs return value and
return it to user space on syncfs? Or is there more to it that I am missing.


---
 fs/sync.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: redhat-linux/fs/sync.c
===================================================================
--- redhat-linux.orig/fs/sync.c	2020-08-10 15:40:22.535349278 -0400
+++ redhat-linux/fs/sync.c	2020-12-01 10:19:03.624330578 -0500
@@ -30,14 +30,17 @@
  */
 static int __sync_filesystem(struct super_block *sb, int wait)
 {
+	int ret = 0, ret2;
+
 	if (wait)
 		sync_inodes_sb(sb);
 	else
 		writeback_inodes_sb(sb, WB_REASON_SYNC);
 
 	if (sb->s_op->sync_fs)
-		sb->s_op->sync_fs(sb, wait);
-	return __sync_blockdev(sb->s_bdev, wait);
+		ret = sb->s_op->sync_fs(sb, wait);
+	ret2 = __sync_blockdev(sb->s_bdev, wait);
+	return ret ? ret : ret2;
 }
 
 /*


^ permalink raw reply	[flat|nested] 31+ messages in thread
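[Editor's note] The report-once semantics at the heart of this discussion can be illustrated with a small userspace sketch of errseq_t, modeled loosely on lib/errseq.c. This is a simplified assumption-laden model, not the kernel code: the real implementation is lock-free (cmpxchg), special-cases a zero cursor, and avoids bumping the counter while an error is still unseen; all of that is omitted here.

```c
#include <stdint.h>

typedef uint32_t errseq_t;

#define ERRSEQ_SHIFT    12u                        /* low bits hold the errno magnitude */
#define ERRSEQ_SEEN     (1u << ERRSEQ_SHIFT)       /* "someone has observed this error" */
#define ERRSEQ_CTR_INC  (1u << (ERRSEQ_SHIFT + 1)) /* lowest bit of the counter */
#define ERRSEQ_ERR_MASK (ERRSEQ_SEEN - 1)

/* Record a new error: stamp the errno bits, advance the counter, clear SEEN.
 * (Simplification: the kernel only advances the counter when SEEN is set.) */
static void errseq_set(errseq_t *eseq, int err)
{
    errseq_t old = *eseq;

    *eseq = ((errseq_t)-err & ERRSEQ_ERR_MASK) |
            ((old & ~(ERRSEQ_CTR_INC - 1)) + ERRSEQ_CTR_INC);
}

/*
 * Compare a cursor against the current value. If they differ, report the
 * error once, mark it SEEN, and advance the cursor so the same error is
 * not reported again through this cursor.
 */
static int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
{
    errseq_t cur = *eseq;

    if (*since == cur)
        return 0;                   /* nothing new for this observer */

    *eseq = cur | ERRSEQ_SEEN;      /* error has now been observed */
    *since = cur | ERRSEQ_SEEN;     /* advance this observer's cursor */
    return -(int)(cur & ERRSEQ_ERR_MASK);
}
```

With this model, two independent cursors (e.g. two open fds) each observe a given error exactly once. That is precisely the semantic that naively copying upper_sb->s_wb_err into the overlay superblock on every sync_fs, without regard for the SEEN state, would break.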

* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-12-01 15:24                   ` Vivek Goyal
@ 2020-12-01 16:10                     ` Jeff Layton
  0 siblings, 0 replies; 31+ messages in thread
From: Jeff Layton @ 2020-12-01 16:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Sargun Dhillon, Amir Goldstein, overlayfs, Miklos Szeredi,
	Alexander Viro, Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel,
	David Howells

On Tue, 2020-12-01 at 10:24 -0500, Vivek Goyal wrote:
> On Tue, Dec 01, 2020 at 08:01:11AM -0500, Jeff Layton wrote:
> > On Tue, 2020-12-01 at 11:09 +0000, Sargun Dhillon wrote:
> > > On Sat, Nov 28, 2020 at 08:52:27AM +0000, Sargun Dhillon wrote:
> > > > On Sat, Nov 28, 2020 at 09:12:03AM +0200, Amir Goldstein wrote:
> > > > > On Sat, Nov 28, 2020 at 6:45 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > > > 
> > > > > > On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote:
> > > > > > > On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote:
> > > > > > > > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote:
> > > > > > > > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > > > > > > > 
> > > > > > > > > > This documents behaviour that was discussed in a thread about the volatile
> > > > > > > > > > feature. Specifically, how failures can go hidden from asynchronous writes
> > > > > > > > > > (such as from mmap, or writes that are not immediately flushed to the
> > > > > > > > > > filesystem). Although we pass through calls like msync, fallocate, and
> > > > > > > > > > write, and will still return errors on those, it doesn't guarantee all
> > > > > > > > > > kinds of errors will happen at those times, and thus may hide errors.
> > > > > > > > > > 
> > > > > > > > > > In the future, we can add error checking to all interactions with the
> > > > > > > > > > upperdir, and pass through errseq_t from the upperdir on mappings,
> > > > > > > > > > and other interactions with the filesystem[1].
> > > > > > > > > > 
> > > > > > > > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > > > > > > > > Cc: linux-fsdevel@vger.kernel.org
> > > > > > > > > > Cc: linux-unionfs@vger.kernel.org
> > > > > > > > > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > > > > > > > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > > > > > > > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > > > > > > > > ---
> > > > > > > > > >  Documentation/filesystems/overlayfs.rst | 6 +++++-
> > > > > > > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > > > > > > > > > index 580ab9a0fe31..c6e30c1bc2f2 100644
> > > > > > > > > > --- a/Documentation/filesystems/overlayfs.rst
> > > > > > > > > > +++ b/Documentation/filesystems/overlayfs.rst
> > > > > > > > > > @@ -570,7 +570,11 @@ Volatile mount
> > > > > > > > > >  This is enabled with the "volatile" mount option.  Volatile mounts are not
> > > > > > > > > >  guaranteed to survive a crash.  It is strongly recommended that volatile
> > > > > > > > > >  mounts are only used if data written to the overlay can be recreated
> > > > > > > > > > -without significant effort.
> > > > > > > > > > +without significant effort.  In addition to this, the sync family of syscalls
> > > > > > > > > > +are not sufficient to determine whether a write failed as sync calls are
> > > > > > > > > > +omitted.  For this reason, it is important that the filesystem used by the
> > > > > > > > > > +upperdir handles failure in a fashion that's suitable for the user.  For
> > > > > > > > > > +example, upon detecting a fault, ext4 can be configured to panic.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Reading this now, I think I may have wrongly analysed the issue.
> > > > > > > > > Specifically, when I wrote that the very minimum is to document the
> > > > > > > > > issue, it was under the assumption that a proper fix is hard.
> > > > > > > > > I think I was wrong and that the very minimum is to check for errseq
> > > > > > > > > since mount on the fsync and syncfs calls.
> > > > > > > > > 
> > > > > > > > > Why? first of all because it is very very simple and goes a long way to
> > > > > > > > > fix the broken contract with applications, not the contract about durability
> > > > > > > > > obviously, but the contract about write+fsync+read expects to find the written
> > > > > > > > > data (during the same mount era).
> > > > > > > > > 
> > > > > > > > > Second, because the sentence you added above is hard for users to understand
> > > > > > > > > and out of context. If we properly handle the writeback error in fsync/syncfs,
> > > > > > > > > then this sentence becomes irrelevant.
> > > > > > > > > The fact that ext4 can lose data if application did not fsync after
> > > > > > > > > write is true
> > > > > > > > > for volatile as well as non-volatile and it is therefore not relevant
> > > > > > > > > in the context
> > > > > > > > > of overlayfs documentation at all.
> > > > > > > > > 
> > > > > > > > > Am I wrong saying that it is very very simple to fix?
> > > > > > > > > Would you mind making that fix at the bottom of the patch series, so it can
> > > > > > > > > be easily applied to stable kernels?
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Amir.
> > > > > > > > 
> > > > > > > > I'm not sure it's so easy. In VFS, there are places where the superblock's
> > > > > > > > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the
> > > > > > > > errseq of the real corresponding real file's superblock. One solution might be
> > > > > > > > as part of all these callbacks we set our errseq to the errseq of the filesystem
> > > > > > > > that the upperdir, and then rely on VFS's checking.
> > > > > > > > 
> > > > > > > > I'm having a hard time figuring out how to deal with the per-mapping based
> > > > > > > > error tracking. It seems like this infrastructure was only partially completed
> > > > > > > > by Jeff Layton[2]. I don't know how it's actually supposed to work right now,
> > > > > > > > as not all of his patches landed.
> > > > > > > > 
> > > > > > > 
> > > > > > > The patches in the series were all merged, but we ended up going with a
> > > > > > > simpler solution [1] than the first series I posted. Instead of plumbing
> > > > > > > the errseq_t handling down into sync_fs, we did it in the syscall
> > > > > > > wrapper.
> > > > > > Jeff,
> > > > > > 
> > > > > > Thanks for replying. I'm still a little confused as to what the
> > > > > > per-address_space wb_err is for. It seems like we should implement the
> > > > > > flush method, like so:
> > > > > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > > > > > index efccb7c1f9bc..32e5bc0aacd6 100644
> > > > > > --- a/fs/overlayfs/file.c
> > > > > > +++ b/fs/overlayfs/file.c
> > > > > > @@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in,
> > > > > >                             remap_flags, op);
> > > > > >  }
> > > > > > 
> > > > > > +static int ovl_flush(struct file *file, fl_owner_t id)
> > > > > > +{
> > > > > > +       struct file *realfile = file->private_data;
> > > > > > +
> > > > > > +       if (realfile->f_op->flush)
> > > > > > +               return realfile->f_op->flush(realfile, id);
> > > > > > +
> > > > > > +       return 0;
> > > > > > +}
> > > > > > +
> > > > > >  const struct file_operations ovl_file_operations = {
> > > > > >         .open           = ovl_open,
> > > > > >         .release        = ovl_release,
> > > > > > @@ -798,6 +808,7 @@ const struct file_operations ovl_file_operations = {
> > > > > >         .fallocate      = ovl_fallocate,
> > > > > >         .fadvise        = ovl_fadvise,
> > > > > >         .unlocked_ioctl = ovl_ioctl,
> > > > > > +       .flush          = ovl_flush,
> > > > > >  #ifdef CONFIG_COMPAT
> > > > > >         .compat_ioctl   = ovl_compat_ioctl,
> > > > > >  #endif
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > I think the tricky part here is that there is no struct file plumbed
> > > > > > > into ->sync_fs, so you don't have an errseq_t cursor to work with by the
> > > > > > > time that gets called.
> > > > > > > 
> > > > > > > What may be easiest is to just propagate the s_wb_err value from the
> > > > > > > upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get
> > > > > > > called before the errseq_check_and_advance in the syncfs syscall wrapper
> > > > > > > and should ensure that syncfs() calls against the overlayfs sb see any
> > > > > > > errors that occurred on the upper_sb.
> > > > > > > 
> > > > > > > Something like this maybe? Totally untested of course. May also need to
> > > > > > > think about how to ensure ordering between racing syncfs calls too
> > > > > > > (don't really want the s_wb_err going "backward"):
> > > > > > > 
> > > > > > > ----------------------------8<---------------------------------
> > > > > > > $ git diff
> > > > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > > > > index 290983bcfbb3..d725705abdac 100644
> > > > > > > --- a/fs/overlayfs/super.c
> > > > > > > +++ b/fs/overlayfs/super.c
> > > > > > > @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > > > > >         ret = sync_filesystem(upper_sb);
> > > > > > >         up_read(&upper_sb->s_umount);
> > > > > > > 
> > > > > > > +       /* Propagate s_wb_err from upper_sb to overlayfs sb */
> > > > > > > +       WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err));
> > > > > > > +
> > > > > > >         return ret;
> > > > > > >  }
> > > > > > > ----------------------------8<---------------------------------
> > > > > > > 
> > > > > > > [1]: https://www.spinics.net/lists/linux-api/msg41276.html
> > > > > > 
> > > > > > So,
> > > > > > I think this will have bad behaviour because syncfs() twice in a row will
> > > > > > return the error twice. The reason is that normally in syncfs it calls
> > > > > > errseq_check_and_advance marking the error on the super block as seen. If we
> > > > > > copy-up the error value each time, it will break this semantic, as we do not set
> > > > > > seen on the upperdir.
> > > > > > 
> > > > > > Either we need to set the seen flag on the upperdir's errseq_t, or have a sync
> > > > > > method, like:
> > > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > > > index 290983bcfbb3..4931d1797c03 100644
> > > > > > --- a/fs/overlayfs/super.c
> > > > > > +++ b/fs/overlayfs/super.c
> > > > > > @@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > > > >  {
> > > > > >         struct ovl_fs *ofs = sb->s_fs_info;
> > > > > >         struct super_block *upper_sb;
> > > > > > +       errseq_t src;
> > > > > >         int ret;
> > > > > > 
> > > > > >         if (!ovl_upper_mnt(ofs))
> > > > > > @@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > > > > >         ret = sync_filesystem(upper_sb);
> > > > > >         up_read(&upper_sb->s_umount);
> > > > > > 
> > > > > > +       /* Propagate the error up from the upper_sb once */
> > > > > > +       src = READ_ONCE(upper_sb->s_wb_err);
> > > > > > +       if (errseq_counter(src) != errseq_counter(sb->s_wb_err))
> > > > > > +               WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN);
> > > > > > +
> > > > > >         return ret;
> > > > > >  }
> > > > > > 
> > > > > > @@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
> > > > > > 
> > > > > >                 sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
> > > > > >                 sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
> > > > > > +               sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err);
> > > > > > 
> > > > > > diff --git a/lib/errseq.c b/lib/errseq.c
> > > > > > index 81f9e33aa7e7..53275c168265 100644
> > > > > > --- a/lib/errseq.c
> > > > > > +++ b/lib/errseq.c
> > > > > > @@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> > > > > >         return err;
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(errseq_check_and_advance);
> > > > > > +
> > > > > > +/**
> > > > > > + * errseq_counter() - Get the value of the errseq counter
> > > > > > + * @eseq: Value being checked
> > > > > > + *
> > > > > > + * This returns the errseq_t counter value. Reminder, it can wrap because
> > > > > > + * there are only a limited number of counter bits.
> > > > > > + *
> > > > > > + * Return: The counter portion of the errseq_t.
> > > > > > + */
> > > > > > +int errseq_counter(errseq_t eseq)
> > > > > > +{
> > > > > > +       return eseq >> (ERRSEQ_SHIFT + 1);
> > > > > > +}
> > > > > > 
> > > > > > This also needs some locking sprinkled on it because racing can occur with
> > > > > > sync_fs. This would be much easier if the errseq_t was 64-bits, and didn't go
> > > > > > backwards, because you could just use a simple cmpxchg64. I know that would make
> > > > > > a bunch of structures larger, and if it's just overlayfs that has to pay the tax
> > > > > > we can just sprinkle a mutex on it.
> > > > > > 
> > > > > > We can always "copy-up" the errseq_t because if the upperdir's errseq_t is read,
> > > > > > and we haven't done a copy yet, we'll get it. Since this will be strictly
> > > > > > serialized, a simple equivalence check will work versus actually having to deal
> > > > > > with happens-before behaviour. There is still a correctness flaw here though:
> > > > > > if exactly enough errors occur to wrap the counter back to the same
> > > > > > value, it will break.
> > > > > > 
> > > > > > By the way, how do y'all test this error handling behaviour? I didn't
> > > > > > find any automated testing for what currently exists.
> > > > > > > 
> > > > > > > How about I split this into two patchsets? One, where I add the logic to copy
> > > > > > > the errseq on callbacks to fsync from the upperdir to the ovl fs superblock,
> > > > > > > and thus allowing VFS to bubble up errors, and the documentation. We can CC
> > > > > > > stable on those because I think it has an effect that's universal across
> > > > > > > all filesystems.
> > > > > > > 
> > > > > > > P.S.
> > > > > > > I notice you maintain overlay tests outside of the kernel. Unfortunately, I
> > > > > > > think for this kind of test, it requires in kernel code to artificially bump the
> > > > > > > writeback error count on the upperdir, or it requires the failure injection
> > > > > > > infrastructure.
> > > > > > > 
> > > > > > > Simulating this behaviour is non-trivial without in-kernel support:
> > > > > > > 
> > > > > > > P1: Open(f) -> p1.fd
> > > > > > > P2: Open(f) -> p2.fd
> > > > > > > P1: syncfs(p1.fd) -> errno
> > > > > > > P2: syncfs(p2.fd) -> 0 # This should return an error
> > > > > > > 
> > > > > > > 
> > > > > > > [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175
> > > > > > > [2]: https://lwn.net/Articles/722250/
> > > > > > > 
> > > > > > > 
> > > > > 
> > > > > Sargun (and Jeff),
> > > > > 
> > > > > Thank you for this discussion. It would be very nice to work on testing
> > > > > and fixing errseq propagation so that it is correct on overlayfs.
> > > > > Alas, this is not what I suggested.
> > > > > 
> > > > > What I suggested was a solution only for the volatile overlay issue
> > > > > where data can vaporise without applications noticing:
> > > > > "...the very minimum is to check for errseq since mount on the fsync
> > > > >  and syncfs calls."
> > > > > 
> > > > Yeah, I was confusing the checking that VFS does on our behalf and the checking 
> > > > that we can do ourselves in the sync callback. If we return an error prior to
> > > > the VFS checking, it short-circuits that entirely.
> > > > 
> > > > > Do you get it? There is no per-file state in the game, not for fsync and not
> > > > > for syncfs.
> > > > > 
> > > > > Any single error, no matter how temporary it is and what damage it may
> > > > > or may not have caused to upper layer consistency, permanently
> > > > > invalidates the reliability of the volatile overlay, resulting in:
> > > > > Effective immediately: every fsync/syncfs returns EIO.
> > > > > Going forward: maybe implement overlay shutdown, so every access
> > > > > returns EIO.
> > > > > 
> > > > > So now that I hopefully explained myself better, I'll ask again:
> > > > > Am I wrong saying that it is very very simple to fix?
> > > > > Would you mind making that fix at the bottom of the patch series, so it can
> > > > > be easily applied to stable kernels?
> > > > > 
> > > > > Thanks,
> > > > > Amir.
> > > > 
> > > > I think that this should be easy enough if the semantic is such that volatile
> > > > overlayfs mounts will return EIO on every syncfs call if the upperdir's
> > > > super block has experienced errors since the initial mount. I imagine we do not
> > > > want to make it such that if the upperdir has ever experienced errors, return
> > > > EIO on syncfs.
> > > > 
> > > > The one caveat that I see is that if the errseq wraps, we can silently begin 
> > > > swallowing errors again. Thus, on the first failed syncfs we should just
> > > > store a flag indicating that the volatile fs is bad, and to continue to return
> > > > EIO rather than go through the process of checking errseq_t, but that's easy
> > > > enough to write.
> > > 
> > > It turns out this is much more complicated than I initially thought. I'm not 
> > > entirely sure why VFS even defines sync_fs (on the superblock) as returning an 
> > > int. The way that sync_fs works is:
> > > 
> > > sys_syncfs ->
> > >   sync_filesystem(sb) ->
> > >     __sync_filesystem(sb, N) ->
> > >       sb->s_op->sync_fs
> > >       /* __sync_blockdev has its own writeback error checking */
> > >       __sync_blockdev
> > >   errseq_check_and_advance(sb...)
> > > 
> > > __sync_filesystem calls the sync_fs callback on the superblock's superblock
> > > operations:
> > > 
> > > static int __sync_filesystem(struct super_block *sb, int wait)
> > > {
> > > 	if (wait)
> > > 		sync_inodes_sb(sb);
> > > 	else
> > > 		writeback_inodes_sb(sb, WB_REASON_SYNC);
> > > 
> > > 	if (sb->s_op->sync_fs)
> > > 		sb->s_op->sync_fs(sb, wait);
> > > 	return __sync_blockdev(sb->s_bdev, wait);
> > > }
> > > 
> > > But it completely discards the result. On the other hand, fsync, and fdatasync
> > > exhibit different behaviour:
> > > 
> > > sys_fdatasync ->
> > >   do_fsync
> > >    (see below)
> > > 
> > > sys_fsync ->
> > >   do_fsync ->
> > >     vfs_fsync ->
> > >       vfs_fsync_range ->
> > >         return file->f_op->fsync
> > > ----
> > > 
> > 
> > This is mainly a result of the evolution of the code. In the beginning
> > there was only sync() and it doesn't report errors. Eventually syncfs()
> > was added, but the only error it reliably reported was EBADF.
> > 
> > I added the machinery to record writeback errors of individual inodes at
> > the superblock level recently, but the sync_fs op was left alone (mainly
> > because changing this would be very invasive, and the value of doing so
> > wasn't completely clear).
> 
> Is the following patch enough to capture the ->sync_fs return value and
> return it to user space on syncfs? Or is there more to it that I am missing.
> 
> 
> ---
>  fs/sync.c |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> Index: redhat-linux/fs/sync.c
> ===================================================================
> --- redhat-linux.orig/fs/sync.c	2020-08-10 15:40:22.535349278 -0400
> +++ redhat-linux/fs/sync.c	2020-12-01 10:19:03.624330578 -0500
> @@ -30,14 +30,17 @@
>   */
>  static int __sync_filesystem(struct super_block *sb, int wait)
>  {
> +	int ret = 0, ret2;
> +
>  	if (wait)
>  		sync_inodes_sb(sb);
>  	else
>  		writeback_inodes_sb(sb, WB_REASON_SYNC);
>  
>  	if (sb->s_op->sync_fs)
> -		sb->s_op->sync_fs(sb, wait);
> -	return __sync_blockdev(sb->s_bdev, wait);
> +		ret = sb->s_op->sync_fs(sb, wait);
> +	ret2 = __sync_blockdev(sb->s_bdev, wait);
> +	return ret ? ret : ret2;
>  }
>  
>  /*
> 

Maybe. The question of course is what will regress once we make that
change. Are there sync_fs routines in place today that regularly return
errors that are just ignored? It's difficult to know without testing all
the affected filesystems.

If we did decide to do that, then there are some other callers of
sync_fs -- dquot_quota_sync, for instance, and we probably need to vet
them carefully.

Also, a sync_fs error could be transient. Suppose you have something
call sync() and it gets an error back from sync_fs (which is ignored),
and then something else calls syncfs() and doesn't see any error because
its sync_fs call succeeded. Arguably, we should return an error in that
situation.

What may work best is to just do an errseq_set on sb->s_wb_err when
either sync_fs or __sync_blockdev returns an error. Then, the
errseq_check_and_advance in syncfs can eventually report it.
-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount
  2020-11-30 19:15   ` Vivek Goyal
@ 2020-12-05  9:13     ` Amir Goldstein
  2020-12-05 13:51       ` Jeff Layton
  0 siblings, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2020-12-05  9:13 UTC (permalink / raw)
  To: Vivek Goyal, Sargun Dhillon
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Daniel J Walsh, linux-fsdevel, David Howells, Jeff Layton

On Mon, Nov 30, 2020 at 9:15 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> > Volatile remounts validate the following at the moment:
> >  * Has the module been reloaded / the system rebooted
> >  * Has the workdir been remounted
> >
> > This adds a new check for errors detected via the superblock's
> > errseq_t. At mount time, the errseq_t is snapshotted to disk,
> > and upon remount it's re-verified. This allows for kernel-level
> > detection of errors without forcing userspace to perform a
> > sync and allows for the hidden detection of writeback errors.
> >
> > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-unionfs@vger.kernel.org
> > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > Cc: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  fs/overlayfs/overlayfs.h | 1 +
> >  fs/overlayfs/readdir.c   | 6 ++++++
> >  fs/overlayfs/super.c     | 1 +
> >  3 files changed, 8 insertions(+)
> >
> > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > index de694ee99d7c..e8a711953b64 100644
> > --- a/fs/overlayfs/overlayfs.h
> > +++ b/fs/overlayfs/overlayfs.h
> > @@ -85,6 +85,7 @@ struct ovl_volatile_info {
> >        */
> >       uuid_t          ovl_boot_id;    /* Must stay first member */
> >       u64             s_instance_id;
> > +     errseq_t        errseq; /* Implemented as a u32 */
> >  } __packed;
> >
> >  /*
> > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > index 7b66fbb20261..5795b28bb4cf 100644
> > --- a/fs/overlayfs/readdir.c
> > +++ b/fs/overlayfs/readdir.c
> > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
> >               return -EINVAL;
> >       }
> >
> > +     err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> > +     if (err) {
> > +             pr_debug("Workdir filesystem reports errors: %d\n", err);
> > +             return -EINVAL;
> > +     }
> > +
> >       return 1;
> >  }
> >
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index a8ee3ba4ebbd..2e473f8c75dd 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -1248,6 +1248,7 @@ static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir)
> >       int err;
> >       struct ovl_volatile_info info = {
> >               .s_instance_id = volatiledir->d_sb->s_instance_id,
> > +             .errseq = errseq_sample(&volatiledir->d_sb->s_wb_err),
>
> errseq_sample() seems to return 0 if nobody has seen the error yet. That
> means on remount we will fail. It is a false failure from our perspective
> and we are not interested in knowing if somebody else has seen the
> failure or not.
>
> Maybe we need a flag in errseq_sample() to get us the current value
> irrespective of whether anybody has seen the error or not?
>
> If we end up making this change, then we probably will have to somehow
> mask ERRSEQ_SEEN bit in errseq_check() comparison. Because if we
> sampled ->s_wb_err when nobody saw it and later by the remount time
> say ERRSEQ_SEEN is set, we don't want remount to fail.
>

Hopping back to this review, looks like for volatile mount we need
something like (in this order):
1. check if re-use and get sampled errseq from volatiledir xattr
2. otherwise errseq_sample() upper_sb and store in volatiledir xattr
3. errseq_check() since stored or sampled errseq (0 for fresh mount
with unseen error)
4. fail volatile mount if errseq_check() failed
5. errseq_check() since stored errseq on fsync()/syncfs()

For fresh volatile mount, syncfs can fix the temporary mount error.
For re-used volatile mount, the mount error is permanent.

Did I miss anything?
Is the mount safe for both seen and unseen error cases? no error case?
Are we safe if a syncfs on upper_sb sneaks in between 2 and 3?

Thanks,
Amir.


* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount
  2020-12-05  9:13     ` Amir Goldstein
@ 2020-12-05 13:51       ` Jeff Layton
  2020-12-05 14:51         ` Amir Goldstein
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff Layton @ 2020-12-05 13:51 UTC (permalink / raw)
  To: Amir Goldstein, Vivek Goyal, Sargun Dhillon
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Daniel J Walsh, linux-fsdevel, David Howells

On Sat, 2020-12-05 at 11:13 +0200, Amir Goldstein wrote:
> On Mon, Nov 30, 2020 at 9:15 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > 
> > On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> > > Volatile remounts validate the following at the moment:
> > >  * Has the module been reloaded / the system rebooted
> > >  * Has the workdir been remounted
> > > 
> > > This adds a new check for errors detected via the superblock's
> > > errseq_t. At mount time, the errseq_t is snapshotted to disk,
> > > and upon remount it's re-verified. This allows for kernel-level
> > > detection of errors without forcing userspace to perform a
> > > sync and allows for the hidden detection of writeback errors.
> > > 
> > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > Cc: linux-fsdevel@vger.kernel.org
> > > Cc: linux-unionfs@vger.kernel.org
> > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > ---
> > >  fs/overlayfs/overlayfs.h | 1 +
> > >  fs/overlayfs/readdir.c   | 6 ++++++
> > >  fs/overlayfs/super.c     | 1 +
> > >  3 files changed, 8 insertions(+)
> > > 
> > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > > index de694ee99d7c..e8a711953b64 100644
> > > --- a/fs/overlayfs/overlayfs.h
> > > +++ b/fs/overlayfs/overlayfs.h
> > > @@ -85,6 +85,7 @@ struct ovl_volatile_info {
> > >        */
> > >       uuid_t          ovl_boot_id;    /* Must stay first member */
> > >       u64             s_instance_id;
> > > +     errseq_t        errseq; /* Implemented as a u32 */
> > >  } __packed;
> > > 
> > >  /*
> > > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > > index 7b66fbb20261..5795b28bb4cf 100644
> > > --- a/fs/overlayfs/readdir.c
> > > +++ b/fs/overlayfs/readdir.c
> > > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
> > >               return -EINVAL;
> > >       }
> > > 
> > > +     err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> > > +     if (err) {
> > > +             pr_debug("Workdir filesystem reports errors: %d\n", err);
> > > +             return -EINVAL;
> > > +     }
> > > +
> > >       return 1;
> > >  }
> > > 
> > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > index a8ee3ba4ebbd..2e473f8c75dd 100644
> > > --- a/fs/overlayfs/super.c
> > > +++ b/fs/overlayfs/super.c
> > > @@ -1248,6 +1248,7 @@ static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir)
> > >       int err;
> > >       struct ovl_volatile_info info = {
> > >               .s_instance_id = volatiledir->d_sb->s_instance_id,
> > > +             .errseq = errseq_sample(&volatiledir->d_sb->s_wb_err),
> > 
> > errseq_sample() seems to return 0 if nobody has seen the error yet. That
> > means on remount we will fail. It is a false failure from our perspective
> > and we are not interested in knowing if somebody else has seen the
> > failure or not.
> > 
> > Maybe we need a flag in errseq_sample() to get us the current value
> > irrespective of whether anybody has seen the error or not?
> > 
> > If we end up making this change, then we probably will have to somehow
> > mask ERRSEQ_SEEN bit in errseq_check() comparison. Because if we
> > sampled ->s_wb_err when nobody saw it and later by the remount time
> > say ERRSEQ_SEEN is set, we don't want remount to fail.
> > 
> 
> Hopping back to this review, looks like for volatile mount we need
> something like (in this order):
> 1. check if re-use and get sampled errseq from volatiledir xattr
> 2. otherwise errseq_sample() upper_sb and store in volatiledir xattr

I'm not sure I follow. Why does this need to go into an xattr?

errseq_t is never persisted on stable storage. It's an entirely
in-memory thing.


> 3. errseq_check() since stored or sampled errseq (0 for fresh mount
> with unseen error)
> 4. fail volatile mount if errseq_check() failed
> 5. errseq_check() since stored errseq on fsync()/syncfs()
> 

I think this is simpler than that. You just need a new errseq_t helper
that only conditionally samples if the thing is 0 or if the error has
already been seen. Something like this (hopefully with a better name):

bool errseq_sample_no_unseen(errseq_t *eseq, errseq_t *sample)
{
        errseq_t old = READ_ONCE(*eseq);

        if (old && !(old & ERRSEQ_SEEN))
                return false;
        *sample = old;
        return true;
}

If that returns false, fail the mount. If it's true, then save off the
sample and proceed.

> For fresh volatile mount, syncfs can fix the temporary mount error.
> For re-used volatile mount, the mount error is permanent.
> 
> Did I miss anything?
> Is the mount safe for both seen and unseen error cases? no error case?
> Are we safe if a syncfs on upper_sb sneaks in between 2 and 3?
> 
> Thanks,
> Amir.
> 

-- 
Jeff Layton <jlayton@redhat.com>



* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount
  2020-12-05 13:51       ` Jeff Layton
@ 2020-12-05 14:51         ` Amir Goldstein
  0 siblings, 0 replies; 31+ messages in thread
From: Amir Goldstein @ 2020-12-05 14:51 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Vivek Goyal, Sargun Dhillon, overlayfs, Miklos Szeredi,
	Alexander Viro, Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel,
	David Howells

On Sat, Dec 5, 2020 at 3:51 PM Jeff Layton <jlayton@redhat.com> wrote:
>
> On Sat, 2020-12-05 at 11:13 +0200, Amir Goldstein wrote:
> > On Mon, Nov 30, 2020 at 9:15 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> > > > Volatile remounts validate the following at the moment:
> > > >  * Has the module been reloaded / the system rebooted
> > > >  * Has the workdir been remounted
> > > >
> > > > This adds a new check for errors detected via the superblock's
> > > > errseq_t. At mount time, the errseq_t is snapshotted to disk,
> > > > and upon remount it's re-verified. This allows for kernel-level
> > > > detection of errors without forcing userspace to perform a
> > > > sync and allows for the hidden detection of writeback errors.
> > > >
> > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > > Cc: linux-fsdevel@vger.kernel.org
> > > > Cc: linux-unionfs@vger.kernel.org
> > > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > > ---
> > > >  fs/overlayfs/overlayfs.h | 1 +
> > > >  fs/overlayfs/readdir.c   | 6 ++++++
> > > >  fs/overlayfs/super.c     | 1 +
> > > >  3 files changed, 8 insertions(+)
> > > >
> > > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > > > index de694ee99d7c..e8a711953b64 100644
> > > > --- a/fs/overlayfs/overlayfs.h
> > > > +++ b/fs/overlayfs/overlayfs.h
> > > > @@ -85,6 +85,7 @@ struct ovl_volatile_info {
> > > >        */
> > > >       uuid_t          ovl_boot_id;    /* Must stay first member */
> > > >       u64             s_instance_id;
> > > > +     errseq_t        errseq; /* Implemented as a u32 */
> > > >  } __packed;
> > > >
> > > >  /*
> > > > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > > > index 7b66fbb20261..5795b28bb4cf 100644
> > > > --- a/fs/overlayfs/readdir.c
> > > > +++ b/fs/overlayfs/readdir.c
> > > > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
> > > >               return -EINVAL;
> > > >       }
> > > >
> > > > +     err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> > > > +     if (err) {
> > > > +             pr_debug("Workdir filesystem reports errors: %d\n", err);
> > > > +             return -EINVAL;
> > > > +     }
> > > > +
> > > >       return 1;
> > > >  }
> > > >
> > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > index a8ee3ba4ebbd..2e473f8c75dd 100644
> > > > --- a/fs/overlayfs/super.c
> > > > +++ b/fs/overlayfs/super.c
> > > > @@ -1248,6 +1248,7 @@ static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir)
> > > >       int err;
> > > >       struct ovl_volatile_info info = {
> > > >               .s_instance_id = volatiledir->d_sb->s_instance_id,
> > > > +             .errseq = errseq_sample(&volatiledir->d_sb->s_wb_err),
> > >
> > > errseq_sample() seems to return 0 if nobody has seen the error yet. That
> > > means on remount we will fail. It is a false failure from our perspective
> > > and we are not interested in knowing if somebody else has seen the
> > > failure or not.
> > >
> > > Maybe we need a flag in errseq_sample() to get us the current value
> > > irrespective of whether anybody has seen the error or not?
> > >
> > > If we end up making this change, then we probably will have to somehow
> > > mask ERRSEQ_SEEN bit in errseq_check() comparison. Because if we
> > > sampled ->s_wb_err when nobody saw it and later by the remount time
> > > say ERRSEQ_SEEN is set, we don't want remount to fail.
> > >
> >
> > Hopping back to this review, looks like for volatile mount we need
> > something like (in this order):
> > 1. check if re-use and get sampled errseq from volatiledir xattr
> > 2. otherwise errseq_sample() upper_sb and store in volatiledir xattr
>
> I'm not sure I follow. Why does this need to go into an xattr?
>
> errseq_t is never persisted on stable storage. It's an entirely
> in-memory thing.
>

We know, but that was the purpose of this patch series [1].
Reusing volatile overlay layers is not allowed in v5.9.
Sargun is trying to allow that by verifying that since the first volatile
mount there was:
* no reboot
* no overlay module reload
* no underlying fs re-mount
* [and with this patch] no writeback error on upper fs

[1] https://lore.kernel.org/linux-unionfs/20201127092058.15117-1-sargun@sargun.me/T/#mb2f1c770a47898d8781e62a46fcc7526535e5dde

>
> > 3. errseq_check() since stored or sampled errseq (0 for fresh mount
> > with unseen error)
> > 4. fail volatile mount if errseq_check() failed
> > 5. errseq_check() since stored errseq on fsync()/syncfs()
> >
>
> I think this is simpler than that. You just need a new errseq_t helper
> that only conditionally samples if the thing is 0 or if the error has
> already been seen. Something like this (hopefully with a better name):
>
> bool errseq_sample_no_unseen(errseq_t *eseq, errseq_t *sample)
> {
>         errseq_t old = READ_ONCE(*eseq);
>
>         if (old && !(old & ERRSEQ_SEEN))
>                 return false;
>         *sample = old;
>         return true;
> }
>
> If that returns false, fail the mount. If it's true, then save off the
> sample and proceed.
>

Yes, but that wasn't the purpose of the original patch set; it was just
something else that needed fixing, which we found during review of this
patch and for which Sargun posted a separate patch.

Thanks,
Amir.


* Re: [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe
  2020-11-27  9:20 ` [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon
  2020-11-27 11:09   ` kernel test robot
  2020-11-27 13:04   ` Amir Goldstein
@ 2020-12-07 11:39   ` Dan Carpenter
  2 siblings, 0 replies; 31+ messages in thread
From: Dan Carpenter @ 2020-12-07 11:39 UTC (permalink / raw)
  To: kbuild, Sargun Dhillon, linux-unionfs, miklos, Alexander Viro,
	Amir Goldstein
  Cc: lkp, kbuild-all, Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal,
	Daniel J Walsh, linux-fsdevel, David Howells


Hi Sargun,

url:    https://github.com/0day-ci/linux/commits/Sargun-Dhillon/Make-overlayfs-volatile-mounts-reusable/20201127-172416
base:    be4df0cea08a8b59eb38d73de988b7ba8022df41
config: x86_64-randconfig-m001-20201127 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

smatch warnings:
fs/overlayfs/readdir.c:1149 ovl_check_incompat() error: uninitialized symbol 'd'.

vim +/d +1149 fs/overlayfs/readdir.c

a9871ab1de6c660 Sargun Dhillon 2020-11-27  1129  static int ovl_check_incompat(struct ovl_fs *ofs, struct ovl_cache_entry *p,
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1130  			      struct path *path)
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1131  {
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1132  	int err = -EINVAL;
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1133  	struct dentry *d;
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1134  
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1135  	if (!strcmp(p->name, OVL_VOLATILEDIR_NAME)) {
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1136  		d = lookup_one_len(p->name, path->dentry, p->len);
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1137  		if (IS_ERR(d))
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1138  			return PTR_ERR(d);
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1139  
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1140  		err = ovl_verify_volatile_info(ofs, d);
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1141  		dput(d);
                                                                ^^^^^^^
dput() d here.

a9871ab1de6c660 Sargun Dhillon 2020-11-27  1142  	}
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1143  
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1144  	if (err == -EINVAL)
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1145  		pr_err("incompat feature '%s' cannot be mounted\n", p->name);
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1146  	else
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1147  		pr_debug("incompat '%s' handled: %d\n", p->name, err);
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1148  
a9871ab1de6c660 Sargun Dhillon 2020-11-27 @1149  	dput(d);
                                                        ^^^^^^
"d" is uninitialized and this is probably a double free.

a9871ab1de6c660 Sargun Dhillon 2020-11-27  1150  	return err;
a9871ab1de6c660 Sargun Dhillon 2020-11-27  1151  }

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org



end of thread, other threads:[~2020-12-07 11:41 UTC | newest]

Thread overview: 31+ messages
2020-11-27  9:20 [PATCH v2 0/4] Make overlayfs volatile mounts reusable Sargun Dhillon
2020-11-27  9:20 ` [PATCH v2 1/4] fs: Add s_instance_id field to superblock for unique identification Sargun Dhillon
2020-11-27  9:20 ` [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile Sargun Dhillon
2020-11-27 12:52   ` Amir Goldstein
2020-11-27 22:11     ` Sargun Dhillon
2020-11-28  2:01       ` Jeff Layton
2020-11-28  4:45         ` Sargun Dhillon
2020-11-28  7:12           ` Amir Goldstein
2020-11-28  8:52             ` Sargun Dhillon
2020-11-28  9:04               ` Amir Goldstein
2020-12-01 11:09               ` Sargun Dhillon
2020-12-01 11:29                 ` Amir Goldstein
2020-12-01 13:01                 ` Jeff Layton
2020-12-01 15:24                   ` Vivek Goyal
2020-12-01 16:10                     ` Jeff Layton
2020-11-28 12:04           ` Jeff Layton
2020-11-28  8:56       ` Amir Goldstein
2020-11-28  9:06         ` Amir Goldstein
2020-11-27  9:20 ` [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon
2020-11-27 11:09   ` kernel test robot
2020-11-27 13:04   ` Amir Goldstein
2020-12-07 11:39   ` Dan Carpenter
2020-11-27  9:20 ` [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount Sargun Dhillon
2020-11-30 18:43   ` Vivek Goyal
2020-11-30 19:15   ` Vivek Goyal
2020-12-05  9:13     ` Amir Goldstein
2020-12-05 13:51       ` Jeff Layton
2020-12-05 14:51         ` Amir Goldstein
2020-11-30 19:33   ` Vivek Goyal
2020-12-01 11:56     ` Sargun Dhillon
2020-12-01 12:45       ` Jeff Layton
