linux-unionfs.vger.kernel.org archive mirror
* [RFC PATCH 0/3] Make overlayfs volatile mounts reusable
@ 2020-11-16  4:57 Sargun Dhillon
  2020-11-16  4:57 ` [RFC PATCH 1/3] fs: Add s_instance_id field to superblock for unique identification Sargun Dhillon
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Sargun Dhillon @ 2020-11-16  4:57 UTC (permalink / raw)
  To: linux-unionfs, miklos, Alexander Viro
  Cc: Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh,
	David Howells

The volatile option is great for "ephemeral" containers. Unfortunately,
it doesn't capture all uses. There are two ways to use it safely right now:

1. Throw away the entire upperdir between mounts
2. Manually syncfs between mounts

For certain use-cases like serverless, or short-lived containers, it is
advantageous to be able to stop the container (runtime) and start it up on
demand / invocation of the function. Usually, there is some bootstrap
process which involves downloading some artifacts, or putting secrets on
disk, and then upon invocation of the function, you want to (re)start the
container.

If you have to syncfs every time you do this, it can lead to excess
filesystem overhead for all of the other containers on the machine, and
stall out every container whose upperdir is on the same underlying
filesystem, unless your filesystem offers something like subvolumes
and sync can be restricted to a subvolume.

The kernel has information that it can use to determine whether or not this
is safe -- primarily if the underlying FS has had writeback errors or not.
Overlayfs doesn't delay writes, so the consistency of the upperdir is not
contingent on the mount of overlayfs, but rather the mount of the
underlying filesystem. It can also make sure the underlying filesystem
wasn't remounted. Although it was suggested that we derive this
information from the upperdir's inode[1], we can instead checkpoint this
data on disk in an xattr.

Specifically we checkpoint:
  * Superblock "id": This is a new concept introduced in one of the patches
    which keeps track of (re)mounts of filesystems, by having a per boot
    monotonically increasing integer identifying the superblock. This is
    safer than trying to obfuscate the pointer and putting it into an
    xattr (due to leak risk, and address reuse), and during the course
    of a boot, the u64 should not wrap.
  * Overlay "boot id": This is a new UUID that is overlayfs specific,
    as overlayfs is a module that's independent from the rest of the
    system and can be (re)loaded independently -- thus it generates
    a UUID at load time which can be used to uniquely identify it.
  * upperdir / workdir errseq: A sample of the errseq_t on the workdir /
    upperdir's superblock. Since the errseq_t is implemented as a u32
    with errno + error counter, we can safely store it in a checkpoint.
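
As an illustration only, the three checkpointed fields and the comparison
done at remount time can be modeled in userspace C roughly as follows. The
names volatile_checkpoint and checkpoint_still_valid are hypothetical and
are not the ones used in the patches. This is a sketch of the idea, not the
kernel implementation, and it simplifies the errseq comparison to plain
equality.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace mirror of the checkpointed fields. */
struct volatile_checkpoint {
	uint8_t  boot_id[16];    /* overlayfs "boot id" UUID, made at module load */
	uint64_t sb_instance_id; /* per-boot counter value of the upper superblock */
	uint32_t errseq;         /* sampled writeback error sequence */
};

/* Returns true only if nothing changed since the checkpoint was written. */
bool checkpoint_still_valid(const struct volatile_checkpoint *saved,
			    const struct volatile_checkpoint *now)
{
	assert(saved && now);

	/*
	 * A different boot id means a reboot or a module reload, so the
	 * remaining fields cannot be compared meaningfully.
	 */
	if (memcmp(saved->boot_id, now->boot_id, sizeof(saved->boot_id)) != 0)
		return false;

	/* A different instance id means the filesystem was remounted. */
	if (saved->sb_instance_id != now->sb_instance_id)
		return false;

	/* Any change in the error sequence means a writeback error occurred. */
	return saved->errseq == now->errseq;
}
```

The boot id is compared first because the superblock counter restarts at
every boot, so an equal counter value from a different boot proves nothing.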
    

[1]: https://lore.kernel.org/linux-unionfs/CAOQ4uxhadzC3-kh-igfxv3pAmC3ocDtAQTxByu4hrn8KtZuieQ@mail.gmail.com/

Sargun Dhillon (3):
  fs: Add s_instance_id field to superblock for unique identification
  overlay: Add ovl_do_getxattr helper
  overlay: Add the ability to remount volatile directories when safe

 Documentation/filesystems/overlayfs.rst |  5 +-
 fs/overlayfs/overlayfs.h                | 43 +++++++++++++
 fs/overlayfs/readdir.c                  | 86 +++++++++++++++++++++++--
 fs/overlayfs/super.c                    | 22 ++++++-
 fs/super.c                              |  3 +
 include/linux/fs.h                      |  7 ++
 include/uapi/linux/fs.h                 |  2 +
 7 files changed, 160 insertions(+), 8 deletions(-)

-- 
2.25.1



* [RFC PATCH 1/3] fs: Add s_instance_id field to superblock for unique identification
  2020-11-16  4:57 [RFC PATCH 0/3] Make overlayfs volatile mounts reusable Sargun Dhillon
@ 2020-11-16  4:57 ` Sargun Dhillon
  2020-11-16  5:07   ` Sargun Dhillon
  2020-11-16  4:57 ` [RFC PATCH 2/3] overlay: Add ovl_do_getxattr helper Sargun Dhillon
  2020-11-16  4:57 ` [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon
  2 siblings, 1 reply; 34+ messages in thread
From: Sargun Dhillon @ 2020-11-16  4:57 UTC (permalink / raw)
  To: linux-unionfs, miklos, Alexander Viro
  Cc: Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh,
	David Howells, linux-fsdevel

This assigns a per-boot unique number to each superblock. This allows
other components to know whether a filesystem has been remounted
since they last interacted with it.

At every boot it is reset to 0. There is no specific reason it is set to 0,
other than repeatability versus using some random starting number. Because
of this, you must store it alongside some other piece of data which is
initialized at boot time.

This doesn't have any of the overhead of idr, and a u64 won't wrap any time
soon. There is no forward lookup requirement, so an idr is not needed.
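
For a sense of scale: even at a billion superblock allocations per second,
a 64-bit counter takes roughly 584 years to wrap. A minimal userspace
sketch of the scheme follows. The names fake_super and fake_super_init are
hypothetical, and a C11 atomic stands in for the sb_lock spinlock that
protects the counter in the patch below.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct super_block. */
struct fake_super {
	uint64_t instance_id;
};

/* Starts at 0 on every "boot" (here: every run of the program). */
static _Atomic uint64_t instance_id_counter;

/* Give each newly created superblock the next value of the counter. */
void fake_super_init(struct fake_super *s)
{
	assert(s != NULL);
	s->instance_id = atomic_fetch_add(&instance_id_counter, 1);
}
```

Because ids are only handed out and compared, never looked up, no idr-like
index from id back to superblock is required.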

In the future, we may want to expose this to userspace. Userspace programs
can benefit from this if they have large chunks of dirty or mmaped memory
that they're interacting with, and they want to see if that volume has been
unmounted and remounted. With this and a mechanism to inspect the
superblock's errseq, a user can determine whether they need to throw away
their cache or similar. This is another benefit in comparison to just
using a pointer to the superblock to uniquely identify it.

Although this doesn't expose an ioctl or similar yet, in the future we
could add an ioctl that allows for fetching the s_instance_id for a given
cache, and inspection of the errseq associated with that.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-unionfs@vger.kernel.org
---
 fs/super.c              | 3 +++
 include/linux/fs.h      | 7 +++++++
 include/uapi/linux/fs.h | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 904459b35119..e47ace7f8c3d 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -42,6 +42,7 @@
 
 static int thaw_super_locked(struct super_block *sb);
 
+static u64 s_instance_id_counter;
 static LIST_HEAD(super_blocks);
 static DEFINE_SPINLOCK(sb_lock);
 
@@ -546,6 +547,7 @@ struct super_block *sget_fc(struct fs_context *fc,
 	s->s_iflags |= fc->s_iflags;
 	strlcpy(s->s_id, s->s_type->name, sizeof(s->s_id));
 	list_add_tail(&s->s_list, &super_blocks);
+	s->s_instance_id = s_instance_id_counter++;
 	hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
 	spin_unlock(&sb_lock);
 	get_filesystem(s->s_type);
@@ -625,6 +627,7 @@ struct super_block *sget(struct file_system_type *type,
 	s->s_type = type;
 	strlcpy(s->s_id, type->name, sizeof(s->s_id));
 	list_add_tail(&s->s_list, &super_blocks);
+	s->s_instance_id = s_instance_id_counter++;
 	hlist_add_head(&s->s_instances, &type->fs_supers);
 	spin_unlock(&sb_lock);
 	get_filesystem(type);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dbbeb52ce5f3..642847c3673f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1472,6 +1472,13 @@ struct super_block {
 	char			s_id[32];	/* Informational name */
 	uuid_t			s_uuid;		/* UUID */
 
+	/*
+	 * ID identifying this particular instance of the superblock. It can
+	 * be used to determine if a particular filesystem has been remounted.
+	 * It may be exposed to userspace.
+	 */
+	u64			s_instance_id;
+
 	unsigned int		s_max_links;
 	fmode_t			s_mode;
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f44eb0a04afd..f2b126656c22 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -13,6 +13,7 @@
 #include <linux/limits.h>
 #include <linux/ioctl.h>
 #include <linux/types.h>
+#include <linux/uuid.h>
 #ifndef __KERNEL__
 #include <linux/fscrypt.h>
 #endif
@@ -203,6 +204,7 @@ struct fsxattr {
 
 #define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
 #define	FS_IOC_SETFLAGS			_IOW('f', 2, long)
+#define FS_IOC_GET_SB_INSTANCE		_IOR('f', 3, uuid_t)
 #define	FS_IOC_GETVERSION		_IOR('v', 1, long)
 #define	FS_IOC_SETVERSION		_IOW('v', 2, long)
 #define FS_IOC_FIEMAP			_IOWR('f', 11, struct fiemap)
-- 
2.25.1



* [RFC PATCH 2/3] overlay: Add ovl_do_getxattr helper
  2020-11-16  4:57 [RFC PATCH 0/3] Make overlayfs volatile mounts reusable Sargun Dhillon
  2020-11-16  4:57 ` [RFC PATCH 1/3] fs: Add s_instance_id field to superblock for unique identification Sargun Dhillon
@ 2020-11-16  4:57 ` Sargun Dhillon
  2020-11-16 11:00   ` Amir Goldstein
  2020-11-16  4:57 ` [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon
  2 siblings, 1 reply; 34+ messages in thread
From: Sargun Dhillon @ 2020-11-16  4:57 UTC (permalink / raw)
  To: linux-unionfs, miklos, Alexander Viro
  Cc: Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh,
	David Howells, linux-fsdevel, Amir Goldstein

We already have a helper for getting xattrs from inodes, namely
ovl_getxattr, but it doesn't allow for copying xattrs onto the current
stack. In addition, it is not instrumented like the rest of the helpers.
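
The calling pattern this enables, reading a fixed-size xattr into a buffer
owned by the caller (for example one on the stack) and then comparing the
returned length, has a close userspace analogue in getxattr(2). The sketch
below is only that analogue. The function name read_xattr_into is
hypothetical and the code is not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/xattr.h>

/*
 * Read the xattr "name" of "path" into a caller-provided buffer.
 * Returns the number of bytes read, or -errno on failure.
 */
ssize_t read_xattr_into(const char *path, const char *name,
			void *buf, size_t size)
{
	ssize_t n;

	assert(path && name && buf);
	n = getxattr(path, name, buf, size);
	return n < 0 ? -errno : n;
}
```

A caller expecting a fixed-size structure would treat any return value
other than the size of that structure as "no usable checkpoint".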

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-unionfs@vger.kernel.org
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/overlayfs.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 29bc1ec699e7..9eb911f243e1 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -179,6 +179,15 @@ static inline int ovl_do_setxattr(struct dentry *dentry, const char *name,
 	return err;
 }
 
+static inline int ovl_do_getxattr(struct dentry *dentry, const char *name,
+				  void *value, size_t size)
+{
+	int err = vfs_getxattr(dentry, name, value, size);
+	pr_debug("getxattr(%pd2, \"%s\", \"%*pE\", %zu) = %i\n",
+		 dentry, name, min((int)size, 48), value, size, err);
+	return err;
+}
+
 static inline int ovl_do_removexattr(struct dentry *dentry, const char *name)
 {
 	int err = vfs_removexattr(dentry, name);
-- 
2.25.1



* [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16  4:57 [RFC PATCH 0/3] Make overlayfs volatile mounts reusable Sargun Dhillon
  2020-11-16  4:57 ` [RFC PATCH 1/3] fs: Add s_instance_id field to superblock for unique identification Sargun Dhillon
  2020-11-16  4:57 ` [RFC PATCH 2/3] overlay: Add ovl_do_getxattr helper Sargun Dhillon
@ 2020-11-16  4:57 ` Sargun Dhillon
  2020-11-16  9:31   ` Amir Goldstein
  2020-11-16 14:42   ` Vivek Goyal
  2 siblings, 2 replies; 34+ messages in thread
From: Sargun Dhillon @ 2020-11-16  4:57 UTC (permalink / raw)
  To: linux-unionfs, miklos, Alexander Viro
  Cc: Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh,
	David Howells, linux-fsdevel, Amir Goldstein

Overlayfs added the ability to set up mounts where all syncs could be
short-circuited in (2a99ddacee43: ovl: provide a mount option "volatile").

A user might want to remount this fs, but we do not let the user because
of the "incompat" detection feature. In the case of volatile, it is safe
to do something like[1]:

$ sync -f /root/upperdir
$ rm -rf /root/workdir/incompat/volatile

There are two ways to go about this. You can call sync on the underlying
filesystem, check the error code, and delete the dirty file if everything
is clean. If you're running lots of containers on the same filesystem, or
you want to avoid all unnecessary I/O, this may be suboptimal.

Alternatively, you can blindly delete the dirty file, and "hope for the
best".
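
The first approach can be sketched in userspace C as follows. This is a
minimal illustration, assuming Linux with glibc's syncfs(2) wrapper. The
function name sync_then_clear_volatile is hypothetical, and both paths are
supplied by the caller rather than assumed here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syncfs() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Flush the filesystem holding upperdir and, only if that succeeded,
 * remove the "dirty" file and the volatile directory containing it.
 * Returns 0 on success or -errno on failure.
 */
int sync_then_clear_volatile(const char *upperdir, const char *volatile_dir)
{
	char dirty[4096];
	int fd, n, err = 0;

	assert(upperdir && volatile_dir);

	n = snprintf(dirty, sizeof(dirty), "%s/dirty", volatile_dir);
	if (n < 0 || (size_t)n >= sizeof(dirty))
		return -ENAMETOOLONG;

	fd = open(upperdir, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -errno;

	/* syncfs() flushes the whole filesystem that contains upperdir. */
	if (syncfs(fd) < 0)
		err = -errno;
	close(fd);
	if (err)
		return err; /* writeback failed: leave the dirty marker alone */

	if (unlink(dirty) < 0 && errno != ENOENT)
		return -errno;
	if (rmdir(volatile_dir) < 0 && errno != ENOENT)
		return -errno;
	return 0;
}
```

The cost is visible in the syncfs() call: it writes back every dirty page
on that filesystem, including data that belongs to unrelated containers.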

This patch introduces transparent functionality to check if it is
(relatively) safe to reuse the upperdir. It ensures that the filesystem
hasn't been remounted, the system hasn't been rebooted, nor has the
overlayfs code changed. It also checks the errseq on the superblock
indicating if there have been any writeback errors since the previous
mount. Currently, this information is not directly exposed to userspace, so
the user cannot make decisions based on this. Instead we checkpoint
this information to disk, and upon remount we see if any of it has
changed. Since the structure is explicitly not meant to be used
between different versions of the code, its stability does not
matter so much.
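
To make the errseq part concrete, here is a simplified userspace model of
the sample-then-check pattern. It is a sketch under stated assumptions: the
model_* names are hypothetical, the bit layout is illustrative, and the
real errseq_t additionally tracks a "seen" flag and updates the value
atomically, which this model leaves out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint32_t model_errseq_t;

#define MODEL_ERRNO_BITS 12u
#define MODEL_ERRNO_MASK ((1u << MODEL_ERRNO_BITS) - 1u)

/* Record a new error (a positive errno value) and bump the counter. */
void model_errseq_set(model_errseq_t *eseq, int err)
{
	uint32_t counter;

	assert(err > 0 && (uint32_t)err <= MODEL_ERRNO_MASK);
	counter = (*eseq >> MODEL_ERRNO_BITS) + 1u;
	*eseq = (counter << MODEL_ERRNO_BITS) | (uint32_t)err;
}

/* Take a snapshot that can be stored in a checkpoint. */
model_errseq_t model_errseq_sample(const model_errseq_t *eseq)
{
	return *eseq;
}

/* Return 0 if nothing changed since the snapshot, else the negated errno. */
int model_errseq_check(const model_errseq_t *eseq, model_errseq_t since)
{
	if (*eseq == since)
		return 0;
	return -(int)(*eseq & MODEL_ERRNO_MASK);
}
```

Because the counter changes on every recorded error, two occurrences of the
same errno still produce different values, so a stored sample detects a
repeat of an error it has already seen.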

[1]: https://lore.kernel.org/linux-unionfs/CAOQ4uxhKr+j5jFyEC2gJX8E8M19mQ3CqdTYaPZOvDQ9c0qLEzw@mail.gmail.com/T/#m6abe713e4318202ad57f301bf28a414e1d824f9c

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-unionfs@vger.kernel.org
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Amir Goldstein <amir73il@gmail.com>
---
 Documentation/filesystems/overlayfs.rst |  5 +-
 fs/overlayfs/overlayfs.h                | 34 ++++++++++
 fs/overlayfs/readdir.c                  | 86 +++++++++++++++++++++++--
 fs/overlayfs/super.c                    | 22 ++++++-
 4 files changed, 139 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 580ab9a0fe31..fa3faeeab727 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -581,7 +581,10 @@ checks for this directory and refuses to mount if present. This is a strong
 indicator that user should throw away upper and work directories and create
 fresh one. In very limited cases where the user knows that the system has
 not crashed and contents of upperdir are intact, The "volatile" directory
-can be removed.
+can be removed.  In certain cases it the filesystem can detect that the
+upperdir can be reused safely, and it will not require the user to
+manually delete the volatile directory.
+
 
 Testsuite
 ---------
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 9eb911f243e1..980d2c930f7a 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -30,6 +30,11 @@ enum ovl_path_type {
 #define OVL_XATTR_NLINK OVL_XATTR_PREFIX "nlink"
 #define OVL_XATTR_UPPER OVL_XATTR_PREFIX "upper"
 #define OVL_XATTR_METACOPY OVL_XATTR_PREFIX "metacopy"
+#define OVL_XATTR_VOLATILE OVL_XATTR_PREFIX "volatile"
+
+#define OVL_INCOMPATDIR_NAME "incompat"
+#define OVL_VOLATILEDIR_NAME "volatile"
+#define OVL_VOLATILE_DIRTY_NAME "dirty"
 
 enum ovl_inode_flag {
 	/* Pure upper dir that may contain non pure upper entries */
@@ -54,6 +59,32 @@ enum {
 	OVL_XINO_ON,
 };
 
+/*
+ * This is copied into the volatile xattr, and the user does not interact with
+ * it. There is no stability requirement, as a reboot explicitly invalidates
+ * a volatile workdir. It is explicitly meant not to be a stable api.
+ *
+ * Although this structure isn't meant to be stable it is exposed to potentially
+ * unprivileged users. We don't do any kind of cryptographic operations with
+ * the structure, so it could be tampered with, or inspected. Don't put
+ * kernel memory pointers in it, or anything else that could cause problems,
+ * or information disclosure.
+ */
+struct overlay_volatile_info {
+	/*
+	 * This uniquely identifies a boot, and is reset if overlayfs itself
+	 * is reloaded. Therefore we check our current / known boot_id
+	 * against this before looking at any other fields to validate:
+	 * 1. Is this datastructure laid out in the way we expect? (Overlayfs
+	 *    module, reboot, etc...)
+	 * 2. Could something have changed (like the s_instance_id counter
+	 *    resetting)
+	 */
+	uuid_t		overlay_boot_id;
+	u64		s_instance_id;
+	errseq_t	errseq; /* Just a u32 */
+} __packed;
+
 /*
  * The tuple (fh,uuid) is a universal unique identifier for a copy up origin,
  * where:
@@ -501,3 +532,6 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
 
 /* export.c */
 extern const struct export_operations ovl_export_operations;
+
+/* super.c */
+extern uuid_t overlay_boot_id;
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index f8cc15533afa..ee0d2b88a19c 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -1054,7 +1054,84 @@ int ovl_check_d_type_supported(struct path *realpath)
 	return rdd.d_type_supported;
 }
 
-#define OVL_INCOMPATDIR_NAME "incompat"
+static int ovl_check_incompat_volatile(struct ovl_cache_entry *p,
+				       struct path *path)
+{
+	int err, ret = -EINVAL;
+	struct overlay_volatile_info info;
+	struct dentry *d_volatile, *d_dirty;
+
+	d_volatile = lookup_one_len(p->name, path->dentry, p->len);
+	if (IS_ERR(d_volatile))
+		return PTR_ERR(d_volatile);
+
+	inode_lock_nested(d_volatile->d_inode, I_MUTEX_PARENT);
+	d_dirty = lookup_one_len(OVL_VOLATILE_DIRTY_NAME, d_volatile,
+				 strlen(OVL_VOLATILE_DIRTY_NAME));
+	if (IS_ERR(d_dirty)) {
+		err = PTR_ERR(d_dirty);
+		if (err != -ENOENT)
+			ret = err;
+		goto out_putvolatile;
+	}
+
+	if (!d_dirty->d_inode)
+		goto out_putdirty;
+
+	inode_lock_nested(d_dirty->d_inode, I_MUTEX_XATTR);
+	err = ovl_do_getxattr(d_dirty, OVL_XATTR_VOLATILE, &info, sizeof(info));
+	inode_unlock(d_dirty->d_inode);
+	if (err != sizeof(info))
+		goto out_putdirty;
+
+	if (!uuid_equal(&overlay_boot_id, &info.overlay_boot_id)) {
+		pr_debug("boot id has changed (reboot or module reloaded)\n");
+		goto out_putdirty;
+	}
+
+	if (d_dirty->d_inode->i_sb->s_instance_id != info.s_instance_id) {
+		pr_debug("workdir has been unmounted and remounted\n");
+		goto out_putdirty;
+	}
+
+	err = errseq_check(&d_dirty->d_inode->i_sb->s_wb_err, info.errseq);
+	if (err) {
+		pr_debug("workdir dir has experienced errors: %d\n", err);
+		goto out_putdirty;
+	}
+
+	/* Dirty file is okay, delete it. */
+	ret = ovl_do_unlink(d_volatile->d_inode, d_dirty);
+
+out_putdirty:
+	dput(d_dirty);
+out_putvolatile:
+	inode_unlock(d_volatile->d_inode);
+	dput(d_volatile);
+	return ret;
+}
+
+/*
+ * check_incompat checks this specific incompat entry for incompatibility.
+ * If it is found to be incompatible -EINVAL will be returned.
+ *
+ * Any other -errno indicates an unknown error, and filesystem mounting
+ * should be aborted.
+ */
+static int ovl_check_incompat(struct ovl_cache_entry *p, struct path *path)
+{
+	int err = -EINVAL;
+
+	if (!strcmp(p->name, OVL_VOLATILEDIR_NAME))
+		err = ovl_check_incompat_volatile(p, path);
+
+	if (err == -EINVAL)
+		pr_err("incompat feature '%s' cannot be mounted\n", p->name);
+	else
+		pr_debug("incompat '%s' handled: %d\n", p->name, err);
+
+	return err;
+}
 
 static int ovl_workdir_cleanup_recurse(struct path *path, int level)
 {
@@ -1098,10 +1175,9 @@ static int ovl_workdir_cleanup_recurse(struct path *path, int level)
 			if (p->len == 2 && p->name[1] == '.')
 				continue;
 		} else if (incompat) {
-			pr_err("overlay with incompat feature '%s' cannot be mounted\n",
-				p->name);
-			err = -EINVAL;
-			break;
+			err = ovl_check_incompat(p, path);
+			if (err)
+				break;
 		}
 		dentry = lookup_one_len(p->name, path->dentry, p->len);
 		if (IS_ERR(dentry))
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 2ee0ba16cc7b..94980898009f 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -15,6 +15,7 @@
 #include <linux/seq_file.h>
 #include <linux/posix_acl_xattr.h>
 #include <linux/exportfs.h>
+#include <linux/uuid.h>
 #include "overlayfs.h"
 
 MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
@@ -23,6 +24,7 @@ MODULE_LICENSE("GPL");
 
 
 struct ovl_dir_cache;
+uuid_t overlay_boot_id;
 
 #define OVL_MAX_STACK 500
 
@@ -1246,20 +1248,35 @@ static struct dentry *ovl_lookup_or_create(struct dentry *parent,
  */
 static int ovl_create_volatile_dirty(struct ovl_fs *ofs)
 {
+	int err;
 	unsigned int ctr;
 	struct dentry *d = dget(ofs->workbasedir);
 	static const char *const volatile_path[] = {
-		OVL_WORKDIR_NAME, "incompat", "volatile", "dirty"
+		OVL_WORKDIR_NAME,
+		OVL_INCOMPATDIR_NAME,
+		OVL_VOLATILEDIR_NAME,
+		OVL_VOLATILE_DIRTY_NAME,
 	};
 	const char *const *name = volatile_path;
+	struct overlay_volatile_info info = {};
 
 	for (ctr = ARRAY_SIZE(volatile_path); ctr; ctr--, name++) {
 		d = ovl_lookup_or_create(d, *name, ctr > 1 ? S_IFDIR : S_IFREG);
 		if (IS_ERR(d))
 			return PTR_ERR(d);
 	}
+
+	uuid_copy(&info.overlay_boot_id, &overlay_boot_id);
+	info.s_instance_id = d->d_inode->i_sb->s_instance_id;
+	info.errseq = errseq_sample(&d->d_inode->i_sb->s_wb_err);
+
+
+	err = ovl_do_setxattr(d, OVL_XATTR_VOLATILE, &info, sizeof(info), 0);
+	if (err == -EOPNOTSUPP)
+		err = 0;
+
 	dput(d);
-	return 0;
+	return err;
 }
 
 static int ovl_make_workdir(struct super_block *sb, struct ovl_fs *ofs,
@@ -2045,6 +2062,7 @@ static int __init ovl_init(void)
 {
 	int err;
 
+	uuid_gen(&overlay_boot_id);
 	ovl_inode_cachep = kmem_cache_create("ovl_inode",
 					     sizeof(struct ovl_inode), 0,
 					     (SLAB_RECLAIM_ACCOUNT|
-- 
2.25.1



* Re: [RFC PATCH 1/3] fs: Add s_instance_id field to superblock for unique identification
  2020-11-16  4:57 ` [RFC PATCH 1/3] fs: Add s_instance_id field to superblock for unique identification Sargun Dhillon
@ 2020-11-16  5:07   ` Sargun Dhillon
  0 siblings, 0 replies; 34+ messages in thread
From: Sargun Dhillon @ 2020-11-16  5:07 UTC (permalink / raw)
  To: overlayfs, Miklos Szeredi, Alexander Viro
  Cc: Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, David Howells,
	Linux FS-devel Mailing List

On Sun, Nov 15, 2020 at 8:58 PM Sargun Dhillon <sargun@sargun.me> wrote:
>
> This assigns a per-boot unique number to each superblock. This allows
> other components to know whether a filesystem has been remounted
> since they last interacted with it.
>
> At every boot it is reset to 0. There is no specific reason it is set to 0,
> other than repeatability versus using some random starting number. Because
> of this, you must store it alongside some other piece of data which is
> initialized at boot time.
>
> This doesn't have any of the overhead of idr, and a u64 won't wrap any time
> soon. There is no forward lookup requirement, so an idr is not needed.
>
> In the future, we may want to expose this to userspace. Userspace programs
> can benefit from this if they have large chunks of dirty or mmaped memory
> that they're interacting with, and they want to see if that volume has been
> unmounted and remounted. With this and a mechanism to inspect the
> superblock's errseq, a user can determine whether they need to throw away
> their cache or similar. This is another benefit in comparison to just
> using a pointer to the superblock to uniquely identify it.
>
> Although this doesn't expose an ioctl or similar yet, in the future we
> could add an ioctl that allows for fetching the s_instance_id for a given
> cache, and inspection of the errseq associated with that.
>
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> ---
>  fs/super.c              | 3 +++
>  include/linux/fs.h      | 7 +++++++
>  include/uapi/linux/fs.h | 2 ++
>  3 files changed, 12 insertions(+)
>
> diff --git a/fs/super.c b/fs/super.c
> index 904459b35119..e47ace7f8c3d 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -42,6 +42,7 @@
>
>  static int thaw_super_locked(struct super_block *sb);
>
> +static u64 s_instance_id_counter;
>  static LIST_HEAD(super_blocks);
>  static DEFINE_SPINLOCK(sb_lock);
>
> @@ -546,6 +547,7 @@ struct super_block *sget_fc(struct fs_context *fc,
>         s->s_iflags |= fc->s_iflags;
>         strlcpy(s->s_id, s->s_type->name, sizeof(s->s_id));
>         list_add_tail(&s->s_list, &super_blocks);
> +       s->s_instance_id = s_instance_id_counter++;
>         hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
>         spin_unlock(&sb_lock);
>         get_filesystem(s->s_type);
> @@ -625,6 +627,7 @@ struct super_block *sget(struct file_system_type *type,
>         s->s_type = type;
>         strlcpy(s->s_id, type->name, sizeof(s->s_id));
>         list_add_tail(&s->s_list, &super_blocks);
> +       s->s_instance_id = s_instance_id_counter++;
>         hlist_add_head(&s->s_instances, &type->fs_supers);
>         spin_unlock(&sb_lock);
>         get_filesystem(type);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index dbbeb52ce5f3..642847c3673f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1472,6 +1472,13 @@ struct super_block {
>         char                    s_id[32];       /* Informational name */
>         uuid_t                  s_uuid;         /* UUID */
>
> +       /*
> +        * ID identifying this particular instance of the superblock. It can
> +        * be used to determine if a particular filesystem has been remounted.
> +        * It may be exposed to userspace.
> +        */
> +       u64                     s_instance_id;
> +
>         unsigned int            s_max_links;
>         fmode_t                 s_mode;
>

Hit send a little too quickly. Please ignore the hunk below (the
include/uapi/linux/fs.h change) as part of the RFC.
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index f44eb0a04afd..f2b126656c22 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -13,6 +13,7 @@
>  #include <linux/limits.h>
>  #include <linux/ioctl.h>
>  #include <linux/types.h>
> +#include <linux/uuid.h>
>  #ifndef __KERNEL__
>  #include <linux/fscrypt.h>
>  #endif
> @@ -203,6 +204,7 @@ struct fsxattr {
>
>  #define        FS_IOC_GETFLAGS                 _IOR('f', 1, long)
>  #define        FS_IOC_SETFLAGS                 _IOW('f', 2, long)
> +#define FS_IOC_GET_SB_INSTANCE         _IOR('f', 3, uuid_t)
>  #define        FS_IOC_GETVERSION               _IOR('v', 1, long)
>  #define        FS_IOC_SETVERSION               _IOW('v', 2, long)
>  #define FS_IOC_FIEMAP                  _IOWR('f', 11, struct fiemap)
> --
> 2.25.1
>


* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16  4:57 ` [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon
@ 2020-11-16  9:31   ` Amir Goldstein
  2020-11-16 10:30     ` Sargun Dhillon
  2020-11-16 14:42   ` Vivek Goyal
  1 sibling, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2020-11-16  9:31 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, David Howells, linux-fsdevel

On Mon, Nov 16, 2020 at 6:58 AM Sargun Dhillon <sargun@sargun.me> wrote:
>
> Overlayfs added the ability to set up mounts where all syncs could be
> short-circuited in (2a99ddacee43: ovl: provide a mount option "volatile").
>
> A user might want to remount this fs, but we do not let the user because
> of the "incompat" detection feature. In the case of volatile, it is safe
> to do something like[1]:
>
> $ sync -f /root/upperdir
> $ rm -rf /root/workdir/incompat/volatile
>
> There are two ways to go about this. You can call sync on the underlying
> filesystem, check the error code, and delete the dirty file if everything
> is clean. If you're running lots of containers on the same filesystem, or
> you want to avoid all unnecessary I/O, this may be suboptimal.
>
> Alternatively, you can blindly delete the dirty file, and "hope for the
> best".
>
> This patch introduces transparent functionality to check if it is
> (relatively) safe to reuse the upperdir. It ensures that the filesystem
> hasn't been remounted, the system hasn't been rebooted, nor has the
> overlayfs code changed. It also checks the errseq on the superblock
> indicating if there have been any writeback errors since the previous
> mount. Currently, this information is not directly exposed to userspace, so
> the user cannot make decisions based on this.

This is the main reason, IMO, that this patch is needed, but it's buried
inside this paragraph. It wasn't obvious to me at first why a userspace
solution was not possible. Maybe try to give it more focus.


> Instead we checkpoint
> this information to disk, and upon remount we see if any of it has
> changed. Since the structure is explicitly not meant to be used
> between different versions of the code, its stability does not
> matter so much.
>
> [1]: https://lore.kernel.org/linux-unionfs/CAOQ4uxhKr+j5jFyEC2gJX8E8M19mQ3CqdTYaPZOvDQ9c0qLEzw@mail.gmail.com/T/#m6abe713e4318202ad57f301bf28a414e1d824f9c
>
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Amir Goldstein <amir73il@gmail.com>
> ---
>  Documentation/filesystems/overlayfs.rst |  5 +-
>  fs/overlayfs/overlayfs.h                | 34 ++++++++++
>  fs/overlayfs/readdir.c                  | 86 +++++++++++++++++++++++--
>  fs/overlayfs/super.c                    | 22 ++++++-
>  4 files changed, 139 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> index 580ab9a0fe31..fa3faeeab727 100644
> --- a/Documentation/filesystems/overlayfs.rst
> +++ b/Documentation/filesystems/overlayfs.rst
> @@ -581,7 +581,10 @@ checks for this directory and refuses to mount if present. This is a strong
>  indicator that user should throw away upper and work directories and create
>  fresh one. In very limited cases where the user knows that the system has
>  not crashed and contents of upperdir are intact, The "volatile" directory
> -can be removed.
> +can be removed.  In certain cases it the filesystem can detect that the

typo: it the filesystem

> +upperdir can be reused safely, and it will not require the user to
> +manually delete the volatile directory.
> +
>
>  Testsuite
>  ---------
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index 9eb911f243e1..980d2c930f7a 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -30,6 +30,11 @@ enum ovl_path_type {
>  #define OVL_XATTR_NLINK OVL_XATTR_PREFIX "nlink"
>  #define OVL_XATTR_UPPER OVL_XATTR_PREFIX "upper"
>  #define OVL_XATTR_METACOPY OVL_XATTR_PREFIX "metacopy"
> +#define OVL_XATTR_VOLATILE OVL_XATTR_PREFIX "volatile"
> +
> +#define OVL_INCOMPATDIR_NAME "incompat"
> +#define OVL_VOLATILEDIR_NAME "volatile"
> +#define OVL_VOLATILE_DIRTY_NAME "dirty"
>
>  enum ovl_inode_flag {
>         /* Pure upper dir that may contain non pure upper entries */
> @@ -54,6 +59,32 @@ enum {
>         OVL_XINO_ON,
>  };
>
> +/*
> + * This is copied into the volatile xattr, and the user does not interact with
> + * it. There is no stability requirement, as a reboot explicitly invalidates
> + * a volatile workdir. It is explicitly meant not to be a stable api.
> + *
> + * Although this structure isn't meant to be stable it is exposed to potentially
> + * unprivileged users. We don't do any kind of cryptographic operations with
> + * the structure, so it could be tampered with, or inspected. Don't put
> + * kernel memory pointers in it, or anything else that could cause problems,
> + * or information disclosure.
> + */
> +struct overlay_volatile_info {

ovl_volatile_info please

> +       /*
> +        * This uniquely identifies a boot, and is reset if overlayfs itself
> +        * is reloaded. Therefore we check our current / known boot_id
> +        * against this before looking at any other fields to validate:
> +        * 1. Is this datastructure laid out in the way we expect? (Overlayfs
> +        *    module, reboot, etc...)
> +        * 2. Could something have changed (like the s_instance_id counter
> +        *    resetting)
> +        */
> +       uuid_t          overlay_boot_id;

ovl_boot_id

> +       u64             s_instance_id;
> +       errseq_t        errseq; /* Just a u32 */
> +} __packed;
> +
>  /*
>   * The tuple (fh,uuid) is a universal unique identifier for a copy up origin,
>   * where:
> @@ -501,3 +532,6 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
>
>  /* export.c */
>  extern const struct export_operations ovl_export_operations;
> +
> +/* super.c */
> +extern uuid_t overlay_boot_id;
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index f8cc15533afa..ee0d2b88a19c 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -1054,7 +1054,84 @@ int ovl_check_d_type_supported(struct path *realpath)
>         return rdd.d_type_supported;
>  }
>
> -#define OVL_INCOMPATDIR_NAME "incompat"
> +static int ovl_check_incompat_volatile(struct ovl_cache_entry *p,
> +                                      struct path *path)
> +{
> +       int err, ret = -EINVAL;
> +       struct overlay_volatile_info info;
> +       struct dentry *d_volatile, *d_dirty;
> +
> +       d_volatile = lookup_one_len(p->name, path->dentry, p->len);
> +       if (IS_ERR(d_volatile))
> +               return PTR_ERR(d_volatile);
> +
> +       inode_lock_nested(d_volatile->d_inode, I_MUTEX_PARENT);

You can't do this. The I_MUTEX_PARENT level is already taken on the parent,
and you also don't need to perform the lookup in this helper. I will explain below.

> +       d_dirty = lookup_one_len(OVL_VOLATILE_DIRTY_NAME, d_volatile,
> +                                strlen(OVL_VOLATILE_DIRTY_NAME));
> +       if (IS_ERR(d_dirty)) {
> +               err = PTR_ERR(d_dirty);
> +               if (err != -ENOENT)
> +                       ret = err;
> +               goto out_putvolatile;
> +       }
> +
> +       if (!d_dirty->d_inode)
> +               goto out_putdirty;
> +
> +       inode_lock_nested(d_dirty->d_inode, I_MUTEX_XATTR);

What's this lock for?

> +       err = ovl_do_getxattr(d_dirty, OVL_XATTR_VOLATILE, &info, sizeof(info));
> +       inode_unlock(d_dirty->d_inode);
> +       if (err != sizeof(info))
> +               goto out_putdirty;
> +
> +       if (!uuid_equal(&overlay_boot_id, &info.overlay_boot_id)) {
> +               pr_debug("boot id has changed (reboot or module reloaded)\n");
> +               goto out_putdirty;
> +       }
> +
> +       if (d_dirty->d_inode->i_sb->s_instance_id != info.s_instance_id) {
> +               pr_debug("workdir has been unmounted and remounted\n");
> +               goto out_putdirty;
> +       }
> +
> +       err = errseq_check(&d_dirty->d_inode->i_sb->s_wb_err, info.errseq);
> +       if (err) {
> +               pr_debug("workdir dir has experienced errors: %d\n", err);
> +               goto out_putdirty;
> +       }

Please put all the above including getxattr in helper ovl_verify_volatile_info()
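
As a rough illustration of the checks such a helper would group together,
here is a userspace model (not kernel code). struct live_state and
verify_volatile_info() are hypothetical names; the byte array stands in for
uuid_t, the uint32_t for errseq_t, and the plain equality test is a
simplification of errseq_check().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the payload stored in the volatile xattr. */
struct ovl_volatile_info {
	uint8_t  ovl_boot_id[16];	/* stands in for uuid_t */
	uint64_t s_instance_id;
	uint32_t errseq;		/* stands in for errseq_t */
} __attribute__((packed));

/* Hypothetical snapshot of the state the kernel would read at mount time. */
struct live_state {
	uint8_t  boot_id[16];
	uint64_t s_instance_id;
	uint32_t wb_err;
};

/*
 * Returns 0 if the recorded info still matches the live state and -EINVAL
 * otherwise. xattr_len models the return value of the getxattr call.
 */
int verify_volatile_info(const struct ovl_volatile_info *info, long xattr_len,
			 const struct live_state *live)
{
	assert(info && live);

	/* A short or missing xattr can never be trusted. */
	if (xattr_len != (long)sizeof(*info))
		return -EINVAL;
	/* Reboot or module reload: the other fields are meaningless. */
	if (memcmp(info->ovl_boot_id, live->boot_id, sizeof(live->boot_id)) != 0)
		return -EINVAL;
	/* The underlying filesystem was unmounted and mounted again. */
	if (info->s_instance_id != live->s_instance_id)
		return -EINVAL;
	/* A writeback error was recorded since the sample was taken. */
	if (info->errseq != live->wb_err)
		return -EINVAL;
	return 0;
}
```

The boot id is compared first so that a layout change across a module
reload is caught before any other field is interpreted.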

> +
> +       /* Dirty file is okay, delete it. */
> +       ret = ovl_do_unlink(d_volatile->d_inode, d_dirty);

That's a problem. By doing this, you have now approved a regular overlay
re-mount, but you should only approve a volatile overlay re-mount.
You need to pass ofs to ovl_workdir_cleanup{,_recurse}.

> +
> +out_putdirty:
> +       dput(d_dirty);
> +out_putvolatile:
> +       inode_unlock(d_volatile->d_inode);
> +       dput(d_volatile);
> +       return ret;
> +}
> +
> +/*
> + * check_incompat checks this specific incompat entry for incompatibility.
> + * If it is found to be incompatible -EINVAL will be returned.
> + *
> + * Any other -errno indicates an unknown error, and filesystem mounting
> + * should be aborted.
> + */
> +static int ovl_check_incompat(struct ovl_cache_entry *p, struct path *path)
> +{
> +       int err = -EINVAL;
> +
> +       if (!strcmp(p->name, OVL_VOLATILEDIR_NAME))
> +               err = ovl_check_incompat_volatile(p, path);
> +
> +       if (err == -EINVAL)
> +               pr_err("incompat feature '%s' cannot be mounted\n", p->name);
> +       else
> +               pr_debug("incompat '%s' handled: %d\n", p->name, err);
> +
> +       return err;
> +}
>
>  static int ovl_workdir_cleanup_recurse(struct path *path, int level)
>  {
> @@ -1098,10 +1175,9 @@ static int ovl_workdir_cleanup_recurse(struct path *path, int level)
>                         if (p->len == 2 && p->name[1] == '.')
>                                 continue;
>                 } else if (incompat) {
> -                       pr_err("overlay with incompat feature '%s' cannot be mounted\n",
> -                               p->name);
> -                       err = -EINVAL;
> -                       break;
> +                       err = ovl_check_incompat(p, path);
> +                       if (err)
> +                               break;

The call to ovl_check_incompat here is too soon, and it forces you to
look up both the volatile dir and the dirty file.
What you need to do is let cleanup recurse into the next level
while letting it know that we are now traversing the "incompat"
subtree.

You can see a more generic implementation I once made here:
https://github.com/amir73il/linux/blob/ovl-features/fs/overlayfs/readdir.c#L1051
but it should be simpler with just incompat/volatile.
Perhaps something like this:

                dentry = lookup_one_len(p->name, path->dentry, p->len);
                if (IS_ERR(dentry))
                        continue;
-               if (dentry->d_inode)
+               if (dentry->d_inode && d_is_dir(dentry) && incompat)
+                       err = ovl_incompatdir_cleanup(dir, path->mnt, dentry);
+               else if (dentry->d_inode)
                        err = ovl_workdir_cleanup(dir, path->mnt,
                                                  dentry, level);
                dput(dentry);

Then inside ovl_incompatdir_cleanup() you can call ovl_check_incompat()
with dentry argument.

Now you have a few options. A simple option would be to put the volatile
xattr on the volatile dir instead of on the dirty file.
If you do that, you can call ovl_verify_volatile_info() on the volatile dentry
without any lookups (only on a volatile re-mount) and if the volatile dir is
approved for reuse, you don't even need to remove the dirty file, because
it's just going to be re-created anyway.


>                 }
>                 dentry = lookup_one_len(p->name, path->dentry, p->len);
>                 if (IS_ERR(dentry))
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 2ee0ba16cc7b..94980898009f 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -15,6 +15,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/posix_acl_xattr.h>
>  #include <linux/exportfs.h>
> +#include <linux/uuid.h>
>  #include "overlayfs.h"
>
>  MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
> @@ -23,6 +24,7 @@ MODULE_LICENSE("GPL");
>
>
>  struct ovl_dir_cache;
> +uuid_t overlay_boot_id;

ovl_boot_id please.

>
>  #define OVL_MAX_STACK 500
>
> @@ -1246,20 +1248,35 @@ static struct dentry *ovl_lookup_or_create(struct dentry *parent,
>   */
>  static int ovl_create_volatile_dirty(struct ovl_fs *ofs)
>  {
> +       int err;
>         unsigned int ctr;
>         struct dentry *d = dget(ofs->workbasedir);
>         static const char *const volatile_path[] = {
> -               OVL_WORKDIR_NAME, "incompat", "volatile", "dirty"
> +               OVL_WORKDIR_NAME,
> +               OVL_INCOMPATDIR_NAME,
> +               OVL_VOLATILEDIR_NAME,
> +               OVL_VOLATILE_DIRTY_NAME,
>         };
>         const char *const *name = volatile_path;
> +       struct overlay_volatile_info info = {};
>
>         for (ctr = ARRAY_SIZE(volatile_path); ctr; ctr--, name++) {
>                 d = ovl_lookup_or_create(d, *name, ctr > 1 ? S_IFDIR : S_IFREG);
>                 if (IS_ERR(d))
>                         return PTR_ERR(d);
>         }
> +
> +       uuid_copy(&info.overlay_boot_id, &overlay_boot_id);
> +       info.s_instance_id = d->d_inode->i_sb->s_instance_id;
> +       info.errseq = errseq_sample(&d->d_inode->i_sb->s_wb_err);
> +
> +
> +       err = ovl_do_setxattr(d, OVL_XATTR_VOLATILE, &info, sizeof(info), 0);
> +       if (err == -EOPNOTSUPP)
> +               err = 0;
> +

Please put all the above in helper ovl_set_volatile_info()
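
A userspace sketch of what such a helper could look like, for illustration
only. The store_fn hook, store_in_memory() and store_unsupported() are
hypothetical stand-ins for ovl_do_setxattr() and its possible outcomes; the
field types are userspace substitutes for uuid_t and errseq_t.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the payload stored in the volatile xattr. */
struct ovl_volatile_info {
	uint8_t  ovl_boot_id[16];	/* stands in for uuid_t */
	uint64_t s_instance_id;
	uint32_t errseq;		/* stands in for errseq_t */
} __attribute__((packed));

/* Hypothetical hook standing in for the setxattr call. */
typedef int (*store_fn)(const void *value, size_t size);

/* Last payload written by store_in_memory(), kept for inspection. */
struct ovl_volatile_info stored_info;

int store_in_memory(const void *value, size_t size)
{
	if (size != sizeof(stored_info))
		return -EINVAL;
	memcpy(&stored_info, value, size);
	return 0;
}

/* Models an underlying filesystem without xattr support. */
int store_unsupported(const void *value, size_t size)
{
	(void)value;
	(void)size;
	return -EOPNOTSUPP;
}

/* Fill the info structure from the current state and hand it to the hook. */
int set_volatile_info(const uint8_t boot_id[16], uint64_t instance_id,
		      uint32_t wb_err, store_fn store)
{
	struct ovl_volatile_info info;
	int err;

	assert(boot_id && store);

	memset(&info, 0, sizeof(info));
	memcpy(info.ovl_boot_id, boot_id, sizeof(info.ovl_boot_id));
	info.s_instance_id = instance_id;
	info.errseq = wb_err;

	err = store(&info, sizeof(info));
	/* As in the patch: a missing xattr only disables reuse later on. */
	if (err == -EOPNOTSUPP)
		err = 0;
	return err;
}
```

Swallowing -EOPNOTSUPP mirrors the patch: the volatile mount still succeeds,
and a later re-mount simply finds no xattr to verify and is refused.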

Thanks,
Amir.

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16  9:31   ` Amir Goldstein
@ 2020-11-16 10:30     ` Sargun Dhillon
  2020-11-16 11:17       ` Amir Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Sargun Dhillon @ 2020-11-16 10:30 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, David Howells, linux-fsdevel

On Mon, Nov 16, 2020 at 11:31:20AM +0200, Amir Goldstein wrote:
> On Mon, Nov 16, 2020 at 6:58 AM Sargun Dhillon <sargun@sargun.me> wrote:
> >
> > Overlayfs added the ability to setup mounts where all syncs could be
> > short-circuted in (2a99ddacee43: ovl: provide a mount option "volatile").
> >
> > A user might want to remount this fs, but we do not let the user because
> > of the "incompat" detection feature. In the case of volatile, it is safe
> > to do something like[1]:
> >
> > $ sync -f /root/upperdir
> > $ rm -rf /root/workdir/incompat/volatile
> >
> > There are two ways to go about this. You can call sync on the underlying
> > filesystem, check the error code, and delete the dirty file if everything
> > is clean. If you're running lots of containers on the same filesystem, or
> > you want to avoid all unnecessary I/O, this may be suboptimal.
> >
> > Alternatively, you can blindly delete the dirty file, and "hope for the
> > best".
> >
> > This patch introduces transparent functionality to check if it is
> > (relatively) safe to reuse the upperdir. It ensures that the filesystem
> > hasn't been remounted, the system hasn't been rebooted, nor has the
> > overlayfs code changed. It also checks the errseq on the superblock
> > indicating if there have been any writeback errors since the previous
> > mount. Currently, this information is not directly exposed to userspace, so
> > the user cannot make decisions based on this.
> 
> This is the main reason IMO that this patch is needed, but it's buried inside
> this paragraph. It wasn't obvious to me at first why userspace solution
> was not possible. Maybe try to give it more focus.
> 
> 
> > Instead we checkpoint
> > this information to disk, and upon remount we see if any of it has
> > changed. Since the structure is explicitly not meant to be used
> > between different versions of the code, its stability does not
> > matter so much.
> >
> > [1]: https://lore.kernel.org/linux-unionfs/CAOQ4uxhKr+j5jFyEC2gJX8E8M19mQ3CqdTYaPZOvDQ9c0qLEzw@mail.gmail.com/T/#m6abe713e4318202ad57f301bf28a414e1d824f9c
> >
> > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-unionfs@vger.kernel.org
> > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > ---
> >  Documentation/filesystems/overlayfs.rst |  5 +-
> >  fs/overlayfs/overlayfs.h                | 34 ++++++++++
> >  fs/overlayfs/readdir.c                  | 86 +++++++++++++++++++++++--
> >  fs/overlayfs/super.c                    | 22 ++++++-
> >  4 files changed, 139 insertions(+), 8 deletions(-)
> >
> > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > index 580ab9a0fe31..fa3faeeab727 100644
> > --- a/Documentation/filesystems/overlayfs.rst
> > +++ b/Documentation/filesystems/overlayfs.rst
> > @@ -581,7 +581,10 @@ checks for this directory and refuses to mount if present. This is a strong
> >  indicator that user should throw away upper and work directories and create
> >  fresh one. In very limited cases where the user knows that the system has
> >  not crashed and contents of upperdir are intact, The "volatile" directory
> > -can be removed.
> > +can be removed.  In certain cases it the filesystem can detect that the
> 
> typo: it the filesystem
> 
> > +upperdir can be reused safely, and it will not require the user to
> > +manually delete the volatile directory.
> > +
> >
> >  Testsuite
> >  ---------
> > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > index 9eb911f243e1..980d2c930f7a 100644
> > --- a/fs/overlayfs/overlayfs.h
> > +++ b/fs/overlayfs/overlayfs.h
> > @@ -30,6 +30,11 @@ enum ovl_path_type {
> >  #define OVL_XATTR_NLINK OVL_XATTR_PREFIX "nlink"
> >  #define OVL_XATTR_UPPER OVL_XATTR_PREFIX "upper"
> >  #define OVL_XATTR_METACOPY OVL_XATTR_PREFIX "metacopy"
> > +#define OVL_XATTR_VOLATILE OVL_XATTR_PREFIX "volatile"
> > +
> > +#define OVL_INCOMPATDIR_NAME "incompat"
> > +#define OVL_VOLATILEDIR_NAME "volatile"
> > +#define OVL_VOLATILE_DIRTY_NAME "dirty"
> >
> >  enum ovl_inode_flag {
> >         /* Pure upper dir that may contain non pure upper entries */
> > @@ -54,6 +59,32 @@ enum {
> >         OVL_XINO_ON,
> >  };
> >
> > +/*
> > + * This is copied into the volatile xattr, and the user does not interact with
> > + * it. There is no stability requirement, as a reboot explicitly invalidates
> > + * a volatile workdir. It is explicitly meant not to be a stable api.
> > + *
> > + * Although this structure isn't meant to be stable it is exposed to potentially
> > + * unprivileged users. We don't do any kind of cryptographic operations with
> > + * the structure, so it could be tampered with, or inspected. Don't put
> > + * kernel memory pointers in it, or anything else that could cause problems,
> > + * or information disclosure.
> > + */
> > +struct overlay_volatile_info {
> 
> ovl_volatile_info please
> 
> > +       /*
> > +        * This uniquely identifies a boot, and is reset if overlayfs itself
> > +        * is reloaded. Therefore we check our current / known boot_id
> > +        * against this before looking at any other fields to validate:
> > +        * 1. Is this datastructure laid out in the way we expect? (Overlayfs
> > +        *    module, reboot, etc...)
> > +        * 2. Could something have changed (like the s_instance_id counter
> > +        *    resetting)
> > +        */
> > +       uuid_t          overlay_boot_id;
> 
> ovl_boot_id
> 
> > +       u64             s_instance_id;
> > +       errseq_t        errseq; /* Just a u32 */
> > +} __packed;
> > +
> >  /*
> >   * The tuple (fh,uuid) is a universal unique identifier for a copy up origin,
> >   * where:
> > @@ -501,3 +532,6 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
> >
> >  /* export.c */
> >  extern const struct export_operations ovl_export_operations;
> > +
> > +/* super.c */
> > +extern uuid_t overlay_boot_id;
> > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > index f8cc15533afa..ee0d2b88a19c 100644
> > --- a/fs/overlayfs/readdir.c
> > +++ b/fs/overlayfs/readdir.c
> > @@ -1054,7 +1054,84 @@ int ovl_check_d_type_supported(struct path *realpath)
> >         return rdd.d_type_supported;
> >  }
> >
> > -#define OVL_INCOMPATDIR_NAME "incompat"
> > +static int ovl_check_incompat_volatile(struct ovl_cache_entry *p,
> > +                                      struct path *path)
> > +{
> > +       int err, ret = -EINVAL;
> > +       struct overlay_volatile_info info;
> > +       struct dentry *d_volatile, *d_dirty;
> > +
> > +       d_volatile = lookup_one_len(p->name, path->dentry, p->len);
> > +       if (IS_ERR(d_volatile))
> > +               return PTR_ERR(d_volatile);
> > +
> > +       inode_lock_nested(d_volatile->d_inode, I_MUTEX_PARENT);
> 
> You can't do this. I_MUTEX_PARENT level is already taken on parent
> and you also don't need to perform lookup in this helper. I will explain below.
> 
> > +       d_dirty = lookup_one_len(OVL_VOLATILE_DIRTY_NAME, d_volatile,
> > +                                strlen(OVL_VOLATILE_DIRTY_NAME));
> > +       if (IS_ERR(d_dirty)) {
> > +               err = PTR_ERR(d_dirty);
> > +               if (err != -ENOENT)
> > +                       ret = err;
> > +               goto out_putvolatile;
> > +       }
> > +
> > +       if (!d_dirty->d_inode)
> > +               goto out_putdirty;
> > +
> > +       inode_lock_nested(d_dirty->d_inode, I_MUTEX_XATTR);
> 
> What's this lock for?
> 
Don't I need to take a lock on this inode to prevent modifications to it, or
is getting the xattr safe without one?

> > +       err = ovl_do_getxattr(d_dirty, OVL_XATTR_VOLATILE, &info, sizeof(info));
> > +       inode_unlock(d_dirty->d_inode);
> > +       if (err != sizeof(info))
> > +               goto out_putdirty;
> > +
> > +       if (!uuid_equal(&overlay_boot_id, &info.overlay_boot_id)) {
> > +               pr_debug("boot id has changed (reboot or module reloaded)\n");
> > +               goto out_putdirty;
> > +       }
> > +
> > +       if (d_dirty->d_inode->i_sb->s_instance_id != info.s_instance_id) {
> > +               pr_debug("workdir has been unmounted and remounted\n");
> > +               goto out_putdirty;
> > +       }
> > +
> > +       err = errseq_check(&d_dirty->d_inode->i_sb->s_wb_err, info.errseq);
> > +       if (err) {
> > +               pr_debug("workdir dir has experienced errors: %d\n", err);
> > +               goto out_putdirty;
> > +       }
> 
> Please put all the above including getxattr in helper ovl_verify_volatile_info()
> 
Is it okay if the helper stays in super.c?


> > +
> > +       /* Dirty file is okay, delete it. */
> > +       ret = ovl_do_unlink(d_volatile->d_inode, d_dirty);
> 
> That's a problem. By doing this, you have now approved a regular overlay
> re-mount, but you need only approve a volatile overlay re-mount.
> Need to pass ofs to ovl_workdir_cleanup{,_recurse}.
> 
I can add a check to make sure this behaviour is only allowed on remounts back
into volatile. There's technically a race condition here: if a writeback
error occurs between this check and the mount being finished, the FS
could be dirty, but that race already exists with the impl today.

> > +
> > +out_putdirty:
> > +       dput(d_dirty);
> > +out_putvolatile:
> > +       inode_unlock(d_volatile->d_inode);
> > +       dput(d_volatile);
> > +       return ret;
> > +}
> > +
> > +/*
> > + * check_incompat checks this specific incompat entry for incompatibility.
> > + * If it is found to be incompatible -EINVAL will be returned.
> > + *
> > + * Any other -errno indicates an unknown error, and filesystem mounting
> > + * should be aborted.
> > + */
> > +static int ovl_check_incompat(struct ovl_cache_entry *p, struct path *path)
> > +{
> > +       int err = -EINVAL;
> > +
> > +       if (!strcmp(p->name, OVL_VOLATILEDIR_NAME))
> > +               err = ovl_check_incompat_volatile(p, path);
> > +
> > +       if (err == -EINVAL)
> > +               pr_err("incompat feature '%s' cannot be mounted\n", p->name);
> > +       else
> > +               pr_debug("incompat '%s' handled: %d\n", p->name, err);
> > +
> > +       return err;
> > +}
> >
> >  static int ovl_workdir_cleanup_recurse(struct path *path, int level)
> >  {
> > @@ -1098,10 +1175,9 @@ static int ovl_workdir_cleanup_recurse(struct path *path, int level)
> >                         if (p->len == 2 && p->name[1] == '.')
> >                                 continue;
> >                 } else if (incompat) {
> > -                       pr_err("overlay with incompat feature '%s' cannot be mounted\n",
> > -                               p->name);
> > -                       err = -EINVAL;
> > -                       break;
> > +                       err = ovl_check_incompat(p, path);
> > +                       if (err)
> > +                               break;
> 
> The call to ovl_check_incompat here is too soon and it makes
> you need to lookup both the volatile dir and dirty file.
> What you need to do and let cleanup recurse into the next level
> while letting it know that we are now traversing the "incompat"
> subtree.
> 
Maybe a dumb question but why is it incompat/volatile/dirty, rather than just 
incompat/volatile, where volatile is a file? Are there any caveats with putting
the xattr on the directory, or alternatively are there any reasons not to make
the structure incompat/volatile/dirty?

> You can see a more generic implementation I once made here:
> https://github.com/amir73il/linux/blob/ovl-features/fs/overlayfs/readdir.c#L1051
> but it should be simpler with just incompat/volatile.
> Perhaps something like this:
> 
>                 dentry = lookup_one_len(p->name, path->dentry, p->len);
>                 if (IS_ERR(dentry))
>                         continue;
> -               if (dentry->d_inode)
> +               if (dentry->d_inode && d_is_dir(dentry) && incompat)
> +                       err = ovl_incompatdir_cleanup(dir, path->mnt, dentry);
> +               else if (dentry->d_inode)
>                         err = ovl_workdir_cleanup(dir, path->mnt,
> dentry, level);
>                 dput(dentry);
> 
> Then inside ovl_incompatdir_cleanup() you can call ovl_check_incompat()
> with dentry argument.
> 
> Now you have a few options. A simple option would be to put the volatile
> xattr on the volatile dir instead of on the dirty file.
> If you do that, you can call ovl_verify_volatile_info() on the volatile dentry
> without any lookups (only on a volatile re-mount) and if the volatile dir is
> approved for reuse, you don't even need to remove the dirty file, because
> it's just going to be re-created anyway.
> 
> 
> >                 }
> >                 dentry = lookup_one_len(p->name, path->dentry, p->len);
> >                 if (IS_ERR(dentry))
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index 2ee0ba16cc7b..94980898009f 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -15,6 +15,7 @@
> >  #include <linux/seq_file.h>
> >  #include <linux/posix_acl_xattr.h>
> >  #include <linux/exportfs.h>
> > +#include <linux/uuid.h>
> >  #include "overlayfs.h"
> >
> >  MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
> > @@ -23,6 +24,7 @@ MODULE_LICENSE("GPL");
> >
> >
> >  struct ovl_dir_cache;
> > +uuid_t overlay_boot_id;
> 
> ovl_boot_id please.
> 
> >
> >  #define OVL_MAX_STACK 500
> >
> > @@ -1246,20 +1248,35 @@ static struct dentry *ovl_lookup_or_create(struct dentry *parent,
> >   */
> >  static int ovl_create_volatile_dirty(struct ovl_fs *ofs)
> >  {
> > +       int err;
> >         unsigned int ctr;
> >         struct dentry *d = dget(ofs->workbasedir);
> >         static const char *const volatile_path[] = {
> > -               OVL_WORKDIR_NAME, "incompat", "volatile", "dirty"
> > +               OVL_WORKDIR_NAME,
> > +               OVL_INCOMPATDIR_NAME,
> > +               OVL_VOLATILEDIR_NAME,
> > +               OVL_VOLATILE_DIRTY_NAME,
> >         };
> >         const char *const *name = volatile_path;
> > +       struct overlay_volatile_info info = {};
> >
> >         for (ctr = ARRAY_SIZE(volatile_path); ctr; ctr--, name++) {
> >                 d = ovl_lookup_or_create(d, *name, ctr > 1 ? S_IFDIR : S_IFREG);
> >                 if (IS_ERR(d))
> >                         return PTR_ERR(d);
> >         }
> > +
> > +       uuid_copy(&info.overlay_boot_id, &overlay_boot_id);
> > +       info.s_instance_id = d->d_inode->i_sb->s_instance_id;
> > +       info.errseq = errseq_sample(&d->d_inode->i_sb->s_wb_err);
> > +
> > +
> > +       err = ovl_do_setxattr(d, OVL_XATTR_VOLATILE, &info, sizeof(info), 0);
> > +       if (err == -EOPNOTSUPP)
> > +               err = 0;
> > +
> 
> Please put all the above in helper ovl_set_volatile_info()
> 
> Thanks,
> Amir.
Thank you for the fast review.


* Re: [RFC PATCH 2/3] overlay: Add ovl_do_getxattr helper
  2020-11-16  4:57 ` [RFC PATCH 2/3] overlay: Add ovl_do_getxattr helper Sargun Dhillon
@ 2020-11-16 11:00   ` Amir Goldstein
  0 siblings, 0 replies; 34+ messages in thread
From: Amir Goldstein @ 2020-11-16 11:00 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, David Howells, linux-fsdevel

On Mon, Nov 16, 2020 at 6:58 AM Sargun Dhillon <sargun@sargun.me> wrote:
>
> We already have a helper for getting xattrs from inodes, namely
> ovl_getxattr, but it doesn't allow for copying xattrs onto the current
> stack. In addition, it is not instrumented like the rest of the helpers.
>
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/overlayfs/overlayfs.h | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index 29bc1ec699e7..9eb911f243e1 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -179,6 +179,15 @@ static inline int ovl_do_setxattr(struct dentry *dentry, const char *name,
>         return err;
>  }
>
> +static inline int ovl_do_getxattr(struct dentry *dentry, const char *name,
> +                                 void *value, size_t size)
> +{
> +       int err = vfs_getxattr(dentry, name, value, size);
> +       pr_debug("getxattr(%pd2, \"%s\", \"%*pE\", %zu) = %i\n",
> +                dentry, name, min((int)size, 48), value, size, err);
> +       return err;
> +}
> +

Upstream already has this helper.

>  static inline int ovl_do_removexattr(struct dentry *dentry, const char *name)
>  {
>         int err = vfs_removexattr(dentry, name);
> --
> 2.25.1
>


* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 10:30     ` Sargun Dhillon
@ 2020-11-16 11:17       ` Amir Goldstein
  2020-11-16 12:52         ` Amir Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2020-11-16 11:17 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, David Howells, linux-fsdevel

> > > +       inode_lock_nested(d_dirty->d_inode, I_MUTEX_XATTR);
> >
> > What's this lock for?
> >
> I need to take a lock on this inode to prevent modifications to it, right, or is
> getting the xattr safe?

No. See Documentation/filesystems/locking.rst.

>
> > > +       err = ovl_do_getxattr(d_dirty, OVL_XATTR_VOLATILE, &info, sizeof(info));
> > > +       inode_unlock(d_dirty->d_inode);
> > > +       if (err != sizeof(info))
> > > +               goto out_putdirty;
> > > +
> > > +       if (!uuid_equal(&overlay_boot_id, &info.overlay_boot_id)) {
> > > +               pr_debug("boot id has changed (reboot or module reloaded)\n");
> > > +               goto out_putdirty;
> > > +       }
> > > +
> > > +       if (d_dirty->d_inode->i_sb->s_instance_id != info.s_instance_id) {
> > > +               pr_debug("workdir has been unmounted and remounted\n");
> > > +               goto out_putdirty;
> > > +       }
> > > +
> > > +       err = errseq_check(&d_dirty->d_inode->i_sb->s_wb_err, info.errseq);
> > > +       if (err) {
> > > +               pr_debug("workdir dir has experienced errors: %d\n", err);
> > > +               goto out_putdirty;
> > > +       }
> >
> > Please put all the above including getxattr in helper ovl_verify_volatile_info()
> >
> Is it okay if the helper stays in super.c?
>

Yes.

>
> > > +
> > > +       /* Dirty file is okay, delete it. */
> > > +       ret = ovl_do_unlink(d_volatile->d_inode, d_dirty);
> >
> > That's a problem. By doing this, you have now approved a regular overlay
> > re-mount, but you need only approve a volatile overlay re-mount.
> > Need to pass ofs to ovl_workdir_cleanup{,_recurse}.
> >
> I can add a check to make sure this behaviour is only allowed on remounts back
> into volatile. There's technically a race condition here, where if there
> is an error between this check, and the mounting being finished, the FS
> could be dirty, but that already exists with the impl today.
>

If you follow my suggestion below and never unlink the dirty file,
the workdir always stays marked dirty, so it is safer.

> > > +
> > > +out_putdirty:
> > > +       dput(d_dirty);
> > > +out_putvolatile:
> > > +       inode_unlock(d_volatile->d_inode);
> > > +       dput(d_volatile);
> > > +       return ret;
> > > +}
> > > +
> > > +/*
> > > + * check_incompat checks this specific incompat entry for incompatibility.
> > > + * If it is found to be incompatible -EINVAL will be returned.
> > > + *
> > > + * Any other -errno indicates an unknown error, and filesystem mounting
> > > + * should be aborted.
> > > + */
> > > +static int ovl_check_incompat(struct ovl_cache_entry *p, struct path *path)
> > > +{
> > > +       int err = -EINVAL;
> > > +
> > > +       if (!strcmp(p->name, OVL_VOLATILEDIR_NAME))
> > > +               err = ovl_check_incompat_volatile(p, path);
> > > +
> > > +       if (err == -EINVAL)
> > > +               pr_err("incompat feature '%s' cannot be mounted\n", p->name);
> > > +       else
> > > +               pr_debug("incompat '%s' handled: %d\n", p->name, err);
> > > +
> > > +       return err;
> > > +}
> > >
> > >  static int ovl_workdir_cleanup_recurse(struct path *path, int level)
> > >  {
> > > @@ -1098,10 +1175,9 @@ static int ovl_workdir_cleanup_recurse(struct path *path, int level)
> > >                         if (p->len == 2 && p->name[1] == '.')
> > >                                 continue;
> > >                 } else if (incompat) {
> > > -                       pr_err("overlay with incompat feature '%s' cannot be mounted\n",
> > > -                               p->name);
> > > -                       err = -EINVAL;
> > > -                       break;
> > > +                       err = ovl_check_incompat(p, path);
> > > +                       if (err)
> > > +                               break;
> >
> > The call to ovl_check_incompat here is too soon and it makes
> > you need to lookup both the volatile dir and dirty file.
> > What you need to do and let cleanup recurse into the next level
> > while letting it know that we are now traversing the "incompat"
> > subtree.
> >
> Maybe a dumb question but why is it incompat/volatile/dirty, rather than just
> incompat/volatile, where volatile is a file?

Not dumb. It's so that old kernels cannot mount with this workdir,
because their recursive cleanup never recursed more than 2 levels.
If it were just incompat/volatile, old kernels would have cleaned it up
and allowed the mount.

> Are there any caveats with putting the xattr on the directory

Not that I can think of.

> , or alternatively are there any reasons not to make
> the structure incompat/volatile/dirty?
>

Old kernels as I wrote above.
The sole purpose of the dirty file is to cause rmdir on volatile to fail
in old kernels.
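
The effect the dirty file relies on can be reproduced from userspace, since
rmdir() refuses to remove a non-empty directory. A minimal sketch follows;
the function name and the /tmp path are made up for the demo, and it only
shows the rmdir() behaviour, not the old kernels' cleanup code.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Builds <tmpdir>/volatile/dirty, then attempts rmdir() on "volatile".
 * Returns the errno reported by rmdir() (0 if it unexpectedly succeeded),
 * or -1 if the temporary tree could not be set up.
 */
int rmdir_errno_with_dirty_file(void)
{
	char base[] = "/tmp/ovl-demo-XXXXXX";
	char dir[64], file[64];
	int fd, result;

	/* Sanity check that the fixed-size path buffers are large enough. */
	assert(sizeof(base) + sizeof("/volatile/dirty") <= sizeof(file));

	if (!mkdtemp(base))
		return -1;
	snprintf(dir, sizeof(dir), "%s/volatile", base);
	snprintf(file, sizeof(file), "%s/volatile/dirty", base);

	if (mkdir(dir, 0700) != 0) {
		rmdir(base);
		return -1;
	}
	fd = open(file, O_CREAT | O_WRONLY, 0600);
	if (fd < 0) {
		rmdir(dir);
		rmdir(base);
		return -1;
	}
	close(fd);

	/* The directory still contains "dirty", so this is expected to fail. */
	result = (rmdir(dir) == 0) ? 0 : errno;

	/* Remove the temporary tree again. */
	unlink(file);
	rmdir(dir);
	rmdir(base);
	return result;
}
```

On Linux the returned value is normally ENOTEMPTY; POSIX also permits EEXIST.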

Thanks,
Amir.


* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 11:17       ` Amir Goldstein
@ 2020-11-16 12:52         ` Amir Goldstein
  0 siblings, 0 replies; 34+ messages in thread
From: Amir Goldstein @ 2020-11-16 12:52 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, David Howells, linux-fsdevel

On Mon, Nov 16, 2020 at 1:17 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > > +       inode_lock_nested(d_dirty->d_inode, I_MUTEX_XATTR);
> > >
> > > What's this lock for?
> > >
> > I need to take a lock on this inode to prevent modifications to it, right, or is
> > getting the xattr safe?
>
> No. see Documentation/filesystems/locking.rst.
>
> >
> > > > +       err = ovl_do_getxattr(d_dirty, OVL_XATTR_VOLATILE, &info, sizeof(info));
> > > > +       inode_unlock(d_dirty->d_inode);
> > > > +       if (err != sizeof(info))
> > > > +               goto out_putdirty;
> > > > +
> > > > +       if (!uuid_equal(&overlay_boot_id, &info.overlay_boot_id)) {
> > > > +               pr_debug("boot id has changed (reboot or module reloaded)\n");
> > > > +               goto out_putdirty;
> > > > +       }
> > > > +
> > > > +       if (d_dirty->d_inode->i_sb->s_instance_id != info.s_instance_id) {
> > > > +               pr_debug("workdir has been unmounted and remounted\n");
> > > > +               goto out_putdirty;
> > > > +       }
> > > > +
> > > > +       err = errseq_check(&d_dirty->d_inode->i_sb->s_wb_err, info.errseq);
> > > > +       if (err) {
> > > > +               pr_debug("workdir dir has experienced errors: %d\n", err);
> > > > +               goto out_putdirty;
> > > > +       }
> > >
> > > Please put all the above including getxattr in helper ovl_verify_volatile_info()
> > >
> > Is it okay if the helper stays in super.c?
> >
>
> Yes.
>
> >
> > > > +
> > > > +       /* Dirty file is okay, delete it. */
> > > > +       ret = ovl_do_unlink(d_volatile->d_inode, d_dirty);
> > >
> > > That's a problem. By doing this, you have now approved a regular overlay
> > > re-mount, but you need only approve a volatile overlay re-mount.
> > > Need to pass ofs to ovl_workdir_cleanup{,_recurse}.
> > >
> > I can add a check to make sure this behaviour is only allowed on remounts back
> > into volatile. There's technically a race condition here, where if there
> > is an error between this check, and the mounting being finished, the FS
> > could be dirty, but that already exists with the impl today.
> >
>
> If you follow my suggestion below and never unlink dirty file,
> the filesystem will never be not-dirty so it is safer.
>

To clarify, as I wrote, there are two options.
The first option, as your patch did, removes the dirty file in
ovl_workdir_cleanup() and re-creates it in ovl_make_workdir().

The second option, which I prefer, is to keep the dirty file, because until
syncfs is run the workdir/upperdir are dirty, and if the dirty file were
removed and a crash followed, they could be wrongly reused.

But the second option means that ovl_workdir_cleanup() returns with
"work" directory not removed and ovl_workdir_create() is not prepared
for that.

My suggestion is to return 1 from ovl_workdir_cleanup{,_recurse}()
for the case of successful cleanup with a remaining and verified
work/incompat dir.

In case ovl_workdir_cleanup_recurse() returned 1, ovl_workdir_cleanup()
should not call ovl_cleanup(), which prints an error.
ovl_workdir_create() should goto out_unlock in case ovl_workdir_cleanup()
returned 1.
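
The return convention above could be modelled roughly as follows. This is a
standalone userspace sketch with hypothetical names (struct model_workdir,
model_workdir_cleanup(), model_workdir_create()), not the actual overlayfs
functions; it only shows how a return value of 1 lets the create path skip
re-creating a verified work/incompat dir, and that only a volatile re-mount
is approved:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the on-disk state of the "work" directory. */
struct model_workdir {
	int exists;              /* "work" dir is present */
	int has_volatile_marker; /* work/incompat/volatile/dirty is present */
	int marker_verified;     /* stored volatile info matched current state */
};

/*
 * Returns <0 on error, 0 if "work" was removed, and 1 if cleanup succeeded
 * but a verified work/incompat dir was intentionally left in place.
 */
static int model_workdir_cleanup(struct model_workdir *w, int volatile_mount)
{
	if (!w->has_volatile_marker) {
		w->exists = 0;
		return 0;
	}
	/* Only a volatile re-mount may reuse a dirty workdir. */
	if (!volatile_mount || !w->marker_verified)
		return -EINVAL;
	return 1; /* dirty file is kept, so the workdir stays marked dirty */
}

static int model_workdir_create(struct model_workdir *w, int volatile_mount)
{
	if (w->exists) {
		int err = model_workdir_cleanup(w, volatile_mount);

		if (err < 0)
			return err;
		if (err == 1)
			return 0; /* reuse "work" as is, skip mkdir */
	}
	w->exists = 1; /* fresh, empty "work" dir */
	return 0;
}
```

In this model the dirty marker is never removed on the reuse path, so a
later non-volatile mount of the same workdir is still refused.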

Hope I am not forgetting anything.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16  4:57 ` [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon
  2020-11-16  9:31   ` Amir Goldstein
@ 2020-11-16 14:42   ` Vivek Goyal
  2020-11-16 14:45     ` Vivek Goyal
                       ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: Vivek Goyal @ 2020-11-16 14:42 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: linux-unionfs, miklos, Alexander Viro, Giuseppe Scrivano,
	Daniel J Walsh, David Howells, linux-fsdevel, Amir Goldstein

On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> Overlayfs added the ability to setup mounts where all syncs could be
> short-circuted in (2a99ddacee43: ovl: provide a mount option "volatile").
> 
> A user might want to remount this fs, but we do not let the user because
> of the "incompat" detection feature. In the case of volatile, it is safe
> to do something like[1]:
> 
> $ sync -f /root/upperdir
> $ rm -rf /root/workdir/incompat/volatile
> 
> There are two ways to go about this. You can call sync on the underlying
> filesystem, check the error code, and delete the dirty file if everything
> is clean. If you're running lots of containers on the same filesystem, or
> you want to avoid all unnecessary I/O, this may be suboptimal.
> 

Hi Sargun,

I had asked a bunch of questions in the previous mail thread to be clearer
about your requirements, but never got any response. Answers would
have helped me understand your requirements better.

How about the following patch set, which seems to sync only the dirty inodes
of upper belonging to a particular overlayfs instance?

https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/

So if we could implement a mount option which ignores fsync but, upon
syncfs, only syncs dirty inodes of that overlayfs instance, it will
make sure we are not syncing the whole of the upper fs. And we could
do this syncing on unmount of overlayfs and remove the dirty file upon
successful sync.

Looks like this will be a much simpler method and should be able to
meet your requirements (as long as you are fine with syncing dirty
upper inodes of this overlay instance on unmount).

Thanks
Vivek

> Alternatively, you can blindly delete the dirty file, and "hope for the
> best".
> 
> This patch introduces transparent functionality to check if it is
> (relatively) safe to reuse the upperdir. It ensures that the filesystem
> hasn't been remounted, the system hasn't been rebooted, nor has the
> overlayfs code changed. It also checks the errseq on the superblock
> indicating if there have been any writeback errors since the previous
> mount. Currently, this information is not directly exposed to userspace, so
> the user cannot make decisions based on this. Instead we checkpoint
> this information to disk, and upon remount we see if any of it has
> changed. Since the structure is explicitly not meant to be used
> between different versions of the code, its stability does not
> matter so much.

> [1]: https://lore.kernel.org/linux-unionfs/CAOQ4uxhKr+j5jFyEC2gJX8E8M19mQ3CqdTYaPZOvDQ9c0qLEzw@mail.gmail.com/T/#m6abe713e4318202ad57f301bf28a414e1d824f9c
> 
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Amir Goldstein <amir73il@gmail.com>
> ---
>  Documentation/filesystems/overlayfs.rst |  5 +-
>  fs/overlayfs/overlayfs.h                | 34 ++++++++++
>  fs/overlayfs/readdir.c                  | 86 +++++++++++++++++++++++--
>  fs/overlayfs/super.c                    | 22 ++++++-
>  4 files changed, 139 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> index 580ab9a0fe31..fa3faeeab727 100644
> --- a/Documentation/filesystems/overlayfs.rst
> +++ b/Documentation/filesystems/overlayfs.rst
> @@ -581,7 +581,10 @@ checks for this directory and refuses to mount if present. This is a strong
>  indicator that user should throw away upper and work directories and create
>  fresh one. In very limited cases where the user knows that the system has
>  not crashed and contents of upperdir are intact, The "volatile" directory
> -can be removed.
> +can be removed.  In certain cases it the filesystem can detect that the
> +upperdir can be reused safely, and it will not require the user to
> +manually delete the volatile directory.
> +
>  
>  Testsuite
>  ---------
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index 9eb911f243e1..980d2c930f7a 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -30,6 +30,11 @@ enum ovl_path_type {
>  #define OVL_XATTR_NLINK OVL_XATTR_PREFIX "nlink"
>  #define OVL_XATTR_UPPER OVL_XATTR_PREFIX "upper"
>  #define OVL_XATTR_METACOPY OVL_XATTR_PREFIX "metacopy"
> +#define OVL_XATTR_VOLATILE OVL_XATTR_PREFIX "volatile"
> +
> +#define OVL_INCOMPATDIR_NAME "incompat"
> +#define OVL_VOLATILEDIR_NAME "volatile"
> +#define OVL_VOLATILE_DIRTY_NAME "dirty"
>  
>  enum ovl_inode_flag {
>  	/* Pure upper dir that may contain non pure upper entries */
> @@ -54,6 +59,32 @@ enum {
>  	OVL_XINO_ON,
>  };
>  
> +/*
> + * This is copied into the volatile xattr, and the user does not interact with
> + * it. There is no stability requirement, as a reboot explicitly invalidates
> + * a volatile workdir. It is explicitly meant not to be a stable api.
> + *
> + * Although this structure isn't meant to be stable it is exposed to potentially
> + * unprivileged users. We don't do any kind of cryptographic operations with
> + * the structure, so it could be tampered with, or inspected. Don't put
> + * kernel memory pointers in it, or anything else that could cause problems,
> + * or information disclosure.
> + */
> +struct overlay_volatile_info {
> +	/*
> +	 * This uniquely identifies a boot, and is reset if overlayfs itself
> +	 * is reloaded. Therefore we check our current / known boot_id
> +	 * against this before looking at any other fields to validate:
> +	 * 1. Is this datastructure laid out in the way we expect? (Overlayfs
> +	 *    module, reboot, etc...)
> +	 * 2. Could something have changed (like the s_instance_id counter
> +	 *    resetting)
> +	 */
> +	uuid_t		overlay_boot_id;
> +	u64		s_instance_id;
> +	errseq_t	errseq; /* Just a u32 */
> +} __packed;
> +
>  /*
>   * The tuple (fh,uuid) is a universal unique identifier for a copy up origin,
>   * where:
> @@ -501,3 +532,6 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
>  
>  /* export.c */
>  extern const struct export_operations ovl_export_operations;
> +
> +/* super.c */
> +extern uuid_t overlay_boot_id;
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index f8cc15533afa..ee0d2b88a19c 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -1054,7 +1054,84 @@ int ovl_check_d_type_supported(struct path *realpath)
>  	return rdd.d_type_supported;
>  }
>  
> -#define OVL_INCOMPATDIR_NAME "incompat"
> +static int ovl_check_incompat_volatile(struct ovl_cache_entry *p,
> +				       struct path *path)
> +{
> +	int err, ret = -EINVAL;
> +	struct overlay_volatile_info info;
> +	struct dentry *d_volatile, *d_dirty;
> +
> +	d_volatile = lookup_one_len(p->name, path->dentry, p->len);
> +	if (IS_ERR(d_volatile))
> +		return PTR_ERR(d_volatile);
> +
> +	inode_lock_nested(d_volatile->d_inode, I_MUTEX_PARENT);
> +	d_dirty = lookup_one_len(OVL_VOLATILE_DIRTY_NAME, d_volatile,
> +				 strlen(OVL_VOLATILE_DIRTY_NAME));
> +	if (IS_ERR(d_dirty)) {
> +		err = PTR_ERR(d_dirty);
> +		if (err != -ENOENT)
> +			ret = err;
> +		goto out_putvolatile;
> +	}
> +
> +	if (!d_dirty->d_inode)
> +		goto out_putdirty;
> +
> +	inode_lock_nested(d_dirty->d_inode, I_MUTEX_XATTR);
> +	err = ovl_do_getxattr(d_dirty, OVL_XATTR_VOLATILE, &info, sizeof(info));
> +	inode_unlock(d_dirty->d_inode);
> +	if (err != sizeof(info))
> +		goto out_putdirty;
> +
> +	if (!uuid_equal(&overlay_boot_id, &info.overlay_boot_id)) {
> +		pr_debug("boot id has changed (reboot or module reloaded)\n");
> +		goto out_putdirty;
> +	}
> +
> +	if (d_dirty->d_inode->i_sb->s_instance_id != info.s_instance_id) {
> +		pr_debug("workdir has been unmounted and remounted\n");
> +		goto out_putdirty;
> +	}
> +
> +	err = errseq_check(&d_dirty->d_inode->i_sb->s_wb_err, info.errseq);
> +	if (err) {
> +		pr_debug("workdir dir has experienced errors: %d\n", err);
> +		goto out_putdirty;
> +	}
> +
> +	/* Dirty file is okay, delete it. */
> +	ret = ovl_do_unlink(d_volatile->d_inode, d_dirty);
> +
> +out_putdirty:
> +	dput(d_dirty);
> +out_putvolatile:
> +	inode_unlock(d_volatile->d_inode);
> +	dput(d_volatile);
> +	return ret;
> +}
> +
> +/*
> + * check_incompat checks this specific incompat entry for incompatibility.
> + * If it is found to be incompatible -EINVAL will be returned.
> + *
> + * Any other -errno indicates an unknown error, and filesystem mounting
> + * should be aborted.
> + */
> +static int ovl_check_incompat(struct ovl_cache_entry *p, struct path *path)
> +{
> +	int err = -EINVAL;
> +
> +	if (!strcmp(p->name, OVL_VOLATILEDIR_NAME))
> +		err = ovl_check_incompat_volatile(p, path);
> +
> +	if (err == -EINVAL)
> +		pr_err("incompat feature '%s' cannot be mounted\n", p->name);
> +	else
> +		pr_debug("incompat '%s' handled: %d\n", p->name, err);
> +
> +	return err;
> +}
>  
>  static int ovl_workdir_cleanup_recurse(struct path *path, int level)
>  {
> @@ -1098,10 +1175,9 @@ static int ovl_workdir_cleanup_recurse(struct path *path, int level)
>  			if (p->len == 2 && p->name[1] == '.')
>  				continue;
>  		} else if (incompat) {
> -			pr_err("overlay with incompat feature '%s' cannot be mounted\n",
> -				p->name);
> -			err = -EINVAL;
> -			break;
> +			err = ovl_check_incompat(p, path);
> +			if (err)
> +				break;
>  		}
>  		dentry = lookup_one_len(p->name, path->dentry, p->len);
>  		if (IS_ERR(dentry))
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 2ee0ba16cc7b..94980898009f 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -15,6 +15,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/posix_acl_xattr.h>
>  #include <linux/exportfs.h>
> +#include <linux/uuid.h>
>  #include "overlayfs.h"
>  
>  MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
> @@ -23,6 +24,7 @@ MODULE_LICENSE("GPL");
>  
>  
>  struct ovl_dir_cache;
> +uuid_t overlay_boot_id;
>  
>  #define OVL_MAX_STACK 500
>  
> @@ -1246,20 +1248,35 @@ static struct dentry *ovl_lookup_or_create(struct dentry *parent,
>   */
>  static int ovl_create_volatile_dirty(struct ovl_fs *ofs)
>  {
> +	int err;
>  	unsigned int ctr;
>  	struct dentry *d = dget(ofs->workbasedir);
>  	static const char *const volatile_path[] = {
> -		OVL_WORKDIR_NAME, "incompat", "volatile", "dirty"
> +		OVL_WORKDIR_NAME,
> +		OVL_INCOMPATDIR_NAME,
> +		OVL_VOLATILEDIR_NAME,
> +		OVL_VOLATILE_DIRTY_NAME,
>  	};
>  	const char *const *name = volatile_path;
> +	struct overlay_volatile_info info = {};
>  
>  	for (ctr = ARRAY_SIZE(volatile_path); ctr; ctr--, name++) {
>  		d = ovl_lookup_or_create(d, *name, ctr > 1 ? S_IFDIR : S_IFREG);
>  		if (IS_ERR(d))
>  			return PTR_ERR(d);
>  	}
> +
> +	uuid_copy(&info.overlay_boot_id, &overlay_boot_id);
> +	info.s_instance_id = d->d_inode->i_sb->s_instance_id;
> +	info.errseq = errseq_sample(&d->d_inode->i_sb->s_wb_err);
> +
> +
> +	err = ovl_do_setxattr(d, OVL_XATTR_VOLATILE, &info, sizeof(info), 0);
> +	if (err == -EOPNOTSUPP)
> +		err = 0;
> +
>  	dput(d);
> -	return 0;
> +	return err;
>  }
>  
>  static int ovl_make_workdir(struct super_block *sb, struct ovl_fs *ofs,
> @@ -2045,6 +2062,7 @@ static int __init ovl_init(void)
>  {
>  	int err;
>  
> +	uuid_gen(&overlay_boot_id);
>  	ovl_inode_cachep = kmem_cache_create("ovl_inode",
>  					     sizeof(struct ovl_inode), 0,
>  					     (SLAB_RECLAIM_ACCOUNT|
> -- 
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 14:42   ` Vivek Goyal
@ 2020-11-16 14:45     ` Vivek Goyal
  2020-11-16 15:20     ` Amir Goldstein
  2020-11-16 17:38     ` Sargun Dhillon
  2 siblings, 0 replies; 34+ messages in thread
From: Vivek Goyal @ 2020-11-16 14:45 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: linux-unionfs, miklos, Alexander Viro, Giuseppe Scrivano,
	Daniel J Walsh, David Howells, linux-fsdevel, Amir Goldstein

On Mon, Nov 16, 2020 at 09:42:40AM -0500, Vivek Goyal wrote:
> On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> > Overlayfs added the ability to setup mounts where all syncs could be
> > short-circuted in (2a99ddacee43: ovl: provide a mount option "volatile").
> > 
> > A user might want to remount this fs, but we do not let the user because
> > of the "incompat" detection feature. In the case of volatile, it is safe
> > to do something like[1]:
> > 
> > $ sync -f /root/upperdir
> > $ rm -rf /root/workdir/incompat/volatile
> > 
> > There are two ways to go about this. You can call sync on the underlying
> > filesystem, check the error code, and delete the dirty file if everything
> > is clean. If you're running lots of containers on the same filesystem, or
> > you want to avoid all unnecessary I/O, this may be suboptimal.
> > 
> 
> Hi Sargun,
> 
> I had asked bunch of questions in previous mail thread to be more
> clear on your requirements but never got any response. It would
> have helped understanding your requirements better.
> 
> How about following patch set which seems to sync only dirty inodes of
> upper belonging to a particular overlayfs instance.
> 
> https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/
> 
> So if could implement a mount option which ignores fsync but upon
> syncfs, only syncs dirty inodes of that overlayfs instance, it will
> make sure we are not syncing whole of the upper fs. And we could
> do this syncing on unmount of overlayfs and remove dirty file upon
> successful sync.
> 
> Looks like this will be much simpler method and should be able to
> meet your requirements (As long as you are fine with syncing dirty
> upper inodes of this overlay instance on unmount).

This approach also has the advantage that error detection is much more
granular, and you don't have to throw away container A if there was a
writeback issue in any other unrelated container N sharing the same upper.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 14:42   ` Vivek Goyal
  2020-11-16 14:45     ` Vivek Goyal
@ 2020-11-16 15:20     ` Amir Goldstein
  2020-11-16 16:36       ` Vivek Goyal
  2020-11-16 17:38     ` Sargun Dhillon
  2 siblings, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2020-11-16 15:20 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Mon, Nov 16, 2020 at 4:42 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> > Overlayfs added the ability to setup mounts where all syncs could be
> > short-circuted in (2a99ddacee43: ovl: provide a mount option "volatile").
> >
> > A user might want to remount this fs, but we do not let the user because
> > of the "incompat" detection feature. In the case of volatile, it is safe
> > to do something like[1]:
> >
> > $ sync -f /root/upperdir
> > $ rm -rf /root/workdir/incompat/volatile
> >
> > There are two ways to go about this. You can call sync on the underlying
> > filesystem, check the error code, and delete the dirty file if everything
> > is clean. If you're running lots of containers on the same filesystem, or
> > you want to avoid all unnecessary I/O, this may be suboptimal.
> >
>
> Hi Sargun,
>
> I had asked bunch of questions in previous mail thread to be more
> clear on your requirements but never got any response. It would
> have helped understanding your requirements better.
>
> How about following patch set which seems to sync only dirty inodes of
> upper belonging to a particular overlayfs instance.
>
> https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/
>
> So if could implement a mount option which ignores fsync but upon
> syncfs, only syncs dirty inodes of that overlayfs instance, it will
> make sure we are not syncing whole of the upper fs. And we could
> do this syncing on unmount of overlayfs and remove dirty file upon
> successful sync.
>
> Looks like this will be much simpler method and should be able to
> meet your requirements (As long as you are fine with syncing dirty
> upper inodes of this overlay instance on unmount).
>

Do note that the latest patch set by Chengguang not only syncs dirty
inodes of this overlay instance, but also waits for in-flight writeback on
all the upper fs inodes and I think that with !ovl_should_sync(ofs)
we will not re-dirty the ovl inodes and lose track of the list of dirty
inodes - maybe that can be fixed.

Also, I am not sure anymore that we can safely remove the dirty file after
syncing dirty inodes, sync_fs and umount. If someone did sync_fs before us
and consumed the error, we may have a copied up file in upper whose
data is not on disk, but when we sync_fs on unmount we won't get an
error? Not sure.

I am less concerned about ways to allow re-mount of volatile
overlayfs than I am about turning volatile overlayfs into non-volatile.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 15:20     ` Amir Goldstein
@ 2020-11-16 16:36       ` Vivek Goyal
  2020-11-16 18:25         ` Sargun Dhillon
  2020-11-16 20:18         ` Amir Goldstein
  0 siblings, 2 replies; 34+ messages in thread
From: Vivek Goyal @ 2020-11-16 16:36 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Mon, Nov 16, 2020 at 05:20:04PM +0200, Amir Goldstein wrote:
> On Mon, Nov 16, 2020 at 4:42 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> > > Overlayfs added the ability to setup mounts where all syncs could be
> > > short-circuted in (2a99ddacee43: ovl: provide a mount option "volatile").
> > >
> > > A user might want to remount this fs, but we do not let the user because
> > > of the "incompat" detection feature. In the case of volatile, it is safe
> > > to do something like[1]:
> > >
> > > $ sync -f /root/upperdir
> > > $ rm -rf /root/workdir/incompat/volatile
> > >
> > > There are two ways to go about this. You can call sync on the underlying
> > > filesystem, check the error code, and delete the dirty file if everything
> > > is clean. If you're running lots of containers on the same filesystem, or
> > > you want to avoid all unnecessary I/O, this may be suboptimal.
> > >
> >
> > Hi Sargun,
> >
> > I had asked bunch of questions in previous mail thread to be more
> > clear on your requirements but never got any response. It would
> > have helped understanding your requirements better.
> >
> > How about following patch set which seems to sync only dirty inodes of
> > upper belonging to a particular overlayfs instance.
> >
> > https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/
> >
> > So if could implement a mount option which ignores fsync but upon
> > syncfs, only syncs dirty inodes of that overlayfs instance, it will
> > make sure we are not syncing whole of the upper fs. And we could
> > do this syncing on unmount of overlayfs and remove dirty file upon
> > successful sync.
> >
> > Looks like this will be much simpler method and should be able to
> > meet your requirements (As long as you are fine with syncing dirty
> > upper inodes of this overlay instance on unmount).
> >
> 
> Do note that the latest patch set by Chengguang not only syncs dirty
> inodes of this overlay instance, but also waits for in-flight writeback on
> all the upper fs inodes and I think that with !ovl_should_sync(ofs)
> we will not re-dirty the ovl inodes and lose track of the list of dirty
> inodes - maybe that can be fixed.
> 
> Also, I am not sure anymore that we can safely remove the dirty file after
> sync dirty inodes sync_fs and umount. If someone did sync_fs before us
> and consumed the error, we may have a copied up file in upper whose
> data is not on disk, but when we sync_fs on unmount we won't get an
> error? not sure.

Maybe we can save errseq_t when mounting overlay and compare it with the
errseq_t stored in the upper sb after unmount. That will tell us whether
an error has happened since we mounted overlay. (Similar to what Sargun
is doing).
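
A simplified, single-threaded model of that sample-at-mount / check-later
idea is sketched below. The model_errseq_* names and the bit layout are
assumptions for illustration only; the kernel's errseq_t additionally
tracks whether an error has already been seen and uses atomic updates,
both of which this sketch leaves out:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Model only: low 12 bits hold the errno, the remaining bits a counter. */
typedef uint32_t model_errseq_t;

#define MODEL_ERRNO_MASK 0xfffu
#define MODEL_CTR_SHIFT  12

/* Record a writeback error; err is a negative errno such as -EIO. */
static void model_errseq_set(model_errseq_t *eseq, int err)
{
	uint32_t ctr = (*eseq >> MODEL_CTR_SHIFT) + 1;

	/* Bumping the counter makes a repeat of the same errno visible. */
	*eseq = (ctr << MODEL_CTR_SHIFT) |
		((uint32_t)(-err) & MODEL_ERRNO_MASK);
}

/* Take a snapshot, e.g. when the overlay is mounted. */
static model_errseq_t model_errseq_sample(const model_errseq_t *eseq)
{
	return *eseq;
}

/* Returns 0 if nothing changed since the snapshot, else the latest error. */
static int model_errseq_check(const model_errseq_t *eseq, model_errseq_t since)
{
	if (*eseq == since)
		return 0;
	return -(int)(*eseq & MODEL_ERRNO_MASK);
}
```

The snapshot taken at mount would be what gets stored (for example in the
xattr from the patch), and the check would run against the upper sb value
on the next mount or after unmount.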

In fact, if this is a concern, don't we have this issue with user space
"sync <upper>" too? Another sync might fail and this one succeeds,
and we will think upper is just fine. Maybe container tools can
keep a file/dir open at the time of mount and call syncfs using
that fd instead. (And that should catch errors since that fd
was opened, I am assuming).
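
A rough userspace sketch of that idea: hold a directory fd on upper from
mount time and call syncfs(2) on it later. The helper names are
hypothetical, and the sketch relies on the kernel reporting per-superblock
writeback errors through syncfs(), which older kernels may not do, so
treat the error-catching behaviour as an assumption rather than a
guarantee:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Open and keep a directory fd on the upper filesystem (call at mount time). */
static int hold_upper_fd(const char *upperdir)
{
	int fd = open(upperdir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);

	return fd < 0 ? -errno : fd;
}

/*
 * Sync the filesystem behind the held fd. Returns 0 on success or a
 * negative errno, e.g. -EIO if the kernel reports a writeback error.
 */
static int syncfs_held_fd(int fd)
{
	return syncfs(fd) ? -errno : 0;
}
```

A container tool would call hold_upper_fd() before mounting the overlay,
keep the fd for the lifetime of the container, and only remove the dirty
file if syncfs_held_fd() returns 0.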

> 
> I am less concerned about ways to allow re-mount of volatile
> overlayfs than I am about turning volatile overlayfs into non-volatile.

If we are not interested in converting volatile containers into
non-volatile, then the whole point of this patch series is to detect
whether any writeback error has happened. If a writeback error has
happened, then we detect that at remount and possibly throw away the
container.

What happens today if a writeback error has happened? Is that page thrown
away from the page cache and read back from disk? IOW, will the user lose
the data it had written to the page cache because writeback failed? I am
assuming we can't keep the dirty page around for very long, otherwise
it has the potential to fill up all the available RAM with dirty pages which
can't be written back.

Why is it important to detect writeback errors only during remount? What
happens if the container overlay instance is already mounted and a writeback
error happens? We will not detect that, right?

IOW, if capturing writeback errors is important for volatile containers,
then capturing them only at remount time is not enough. Normally
fsync/syncfs should catch them, and now we have skipped those, so in
the process we lost the mechanism to detect writeback errors for
volatile containers?

Thanks
Vivek


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 14:42   ` Vivek Goyal
  2020-11-16 14:45     ` Vivek Goyal
  2020-11-16 15:20     ` Amir Goldstein
@ 2020-11-16 17:38     ` Sargun Dhillon
  2 siblings, 0 replies; 34+ messages in thread
From: Sargun Dhillon @ 2020-11-16 17:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-unionfs, miklos, Alexander Viro, Giuseppe Scrivano,
	Daniel J Walsh, David Howells, linux-fsdevel, Amir Goldstein

On Mon, Nov 16, 2020 at 09:42:40AM -0500, Vivek Goyal wrote:
> On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> > Overlayfs added the ability to setup mounts where all syncs could be
> > short-circuted in (2a99ddacee43: ovl: provide a mount option "volatile").
> > 
> > A user might want to remount this fs, but we do not let the user because
> > of the "incompat" detection feature. In the case of volatile, it is safe
> > to do something like[1]:
> > 
> > $ sync -f /root/upperdir
> > $ rm -rf /root/workdir/incompat/volatile
> > 
> > There are two ways to go about this. You can call sync on the underlying
> > filesystem, check the error code, and delete the dirty file if everything
> > is clean. If you're running lots of containers on the same filesystem, or
> > you want to avoid all unnecessary I/O, this may be suboptimal.
> > 
> 
> Hi Sargun,
> 
> I had asked bunch of questions in previous mail thread to be more
> clear on your requirements but never got any response. It would
> have helped understanding your requirements better.
> 
Sorry, I didn't see your questions.


> How about following patch set which seems to sync only dirty inodes of
> upper belonging to a particular overlayfs instance.
> 
> https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/
> 
> So if could implement a mount option which ignores fsync but upon
> syncfs, only syncs dirty inodes of that overlayfs instance, it will
> make sure we are not syncing whole of the upper fs. And we could
> do this syncing on unmount of overlayfs and remove dirty file upon
> successful sync.
> 
Doing any kind of sync that involves a head-of-line blocking metadata flush,
or even a data flush, causes significant problems on our systems. At some
point we did some analysis on systems in our fleet that did sync, and noticed
that it has a very suboptimal effect on system-wide performance because
the sudden rush of IOPs caused our drives to stall.

I didn't dig too much into it, but on XFS, even letting users do sync
directly on their inodes could lead to trouble in terms of the spurious
burst of IOPs generated. Even though our drives can sustain a large amount
of I/O over time, a sudden burst causes our cloud provider to throttle them,
which in turn can lead to slow I/O across the system, and depending on what's
going on, this can turn into WQ stalls.

> Looks like this will be much simpler method and should be able to
> meet your requirements (As long as you are fine with syncing dirty
> upper inodes of this overlay instance on unmount).
> 
> Thanks
> Vivek
> 
> > Alternatively, you can blindly delete the dirty file, and "hope for the
> > best".
> > 
> > This patch introduces transparent functionality to check if it is
> > (relatively) safe to reuse the upperdir. It ensures that the filesystem
> > hasn't been remounted, the system hasn't been rebooted, nor has the
> > overlayfs code changed. It also checks the errseq on the superblock
> > indicating if there have been any writeback errors since the previous
> > mount. Currently, this information is not directly exposed to userspace, so
> > the user cannot make decisions based on this. Instead we checkpoint
> > this information to disk, and upon remount we see if any of it has
> > changed. Since the structure is explicitly not meant to be used
> > between different versions of the code, its stability does not
> > matter so much.
> 
> > [1]: https://lore.kernel.org/linux-unionfs/CAOQ4uxhKr+j5jFyEC2gJX8E8M19mQ3CqdTYaPZOvDQ9c0qLEzw@mail.gmail.com/T/#m6abe713e4318202ad57f301bf28a414e1d824f9c
> > 
> > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-unionfs@vger.kernel.org
> > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > ---
> >  Documentation/filesystems/overlayfs.rst |  5 +-
> >  fs/overlayfs/overlayfs.h                | 34 ++++++++++
> >  fs/overlayfs/readdir.c                  | 86 +++++++++++++++++++++++--
> >  fs/overlayfs/super.c                    | 22 ++++++-
> >  4 files changed, 139 insertions(+), 8 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > index 580ab9a0fe31..fa3faeeab727 100644
> > --- a/Documentation/filesystems/overlayfs.rst
> > +++ b/Documentation/filesystems/overlayfs.rst
> > @@ -581,7 +581,10 @@ checks for this directory and refuses to mount if present. This is a strong
> >  indicator that user should throw away upper and work directories and create
> >  fresh one. In very limited cases where the user knows that the system has
> >  not crashed and contents of upperdir are intact, The "volatile" directory
> > -can be removed.
> > +can be removed.  In certain cases it the filesystem can detect that the
> > +upperdir can be reused safely, and it will not require the user to
> > +manually delete the volatile directory.
> > +
> >  
> >  Testsuite
> >  ---------
> > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > index 9eb911f243e1..980d2c930f7a 100644
> > --- a/fs/overlayfs/overlayfs.h
> > +++ b/fs/overlayfs/overlayfs.h
> > @@ -30,6 +30,11 @@ enum ovl_path_type {
> >  #define OVL_XATTR_NLINK OVL_XATTR_PREFIX "nlink"
> >  #define OVL_XATTR_UPPER OVL_XATTR_PREFIX "upper"
> >  #define OVL_XATTR_METACOPY OVL_XATTR_PREFIX "metacopy"
> > +#define OVL_XATTR_VOLATILE OVL_XATTR_PREFIX "volatile"
> > +
> > +#define OVL_INCOMPATDIR_NAME "incompat"
> > +#define OVL_VOLATILEDIR_NAME "volatile"
> > +#define OVL_VOLATILE_DIRTY_NAME "dirty"
> >  
> >  enum ovl_inode_flag {
> >  	/* Pure upper dir that may contain non pure upper entries */
> > @@ -54,6 +59,32 @@ enum {
> >  	OVL_XINO_ON,
> >  };
> >  
> > +/*
> > + * This is copied into the volatile xattr, and the user does not interact with
> > + * it. There is no stability requirement, as a reboot explicitly invalidates
> > + * a volatile workdir. It is explicitly meant not to be a stable api.
> > + *
> > + * Although this structure isn't meant to be stable it is exposed to potentially
> > + * unprivileged users. We don't do any kind of cryptographic operations with
> > + * the structure, so it could be tampered with, or inspected. Don't put
> > + * kernel memory pointers in it, or anything else that could cause problems,
> > + * or information disclosure.
> > + */
> > +struct overlay_volatile_info {
> > +	/*
> > +	 * This uniquely identifies a boot, and is reset if overlayfs itself
> > +	 * is reloaded. Therefore we check our current / known boot_id
> > +	 * against this before looking at any other fields to validate:
> > +	 * 1. Is this datastructure laid out in the way we expect? (Overlayfs
> > +	 *    module, reboot, etc...)
> > +	 * 2. Could something have changed (like the s_instance_id counter
> > +	 *    resetting)
> > +	 */
> > +	uuid_t		overlay_boot_id;
> > +	u64		s_instance_id;
> > +	errseq_t	errseq; /* Just a u32 */
> > +} __packed;
> > +
> >  /*
> >   * The tuple (fh,uuid) is a universal unique identifier for a copy up origin,
> >   * where:
> > @@ -501,3 +532,6 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
> >  
> >  /* export.c */
> >  extern const struct export_operations ovl_export_operations;
> > +
> > +/* super.c */
> > +extern uuid_t overlay_boot_id;
> > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > index f8cc15533afa..ee0d2b88a19c 100644
> > --- a/fs/overlayfs/readdir.c
> > +++ b/fs/overlayfs/readdir.c
> > @@ -1054,7 +1054,84 @@ int ovl_check_d_type_supported(struct path *realpath)
> >  	return rdd.d_type_supported;
> >  }
> >  
> > -#define OVL_INCOMPATDIR_NAME "incompat"
> > +static int ovl_check_incompat_volatile(struct ovl_cache_entry *p,
> > +				       struct path *path)
> > +{
> > +	int err, ret = -EINVAL;
> > +	struct overlay_volatile_info info;
> > +	struct dentry *d_volatile, *d_dirty;
> > +
> > +	d_volatile = lookup_one_len(p->name, path->dentry, p->len);
> > +	if (IS_ERR(d_volatile))
> > +		return PTR_ERR(d_volatile);
> > +
> > +	inode_lock_nested(d_volatile->d_inode, I_MUTEX_PARENT);
> > +	d_dirty = lookup_one_len(OVL_VOLATILE_DIRTY_NAME, d_volatile,
> > +				 strlen(OVL_VOLATILE_DIRTY_NAME));
> > +	if (IS_ERR(d_dirty)) {
> > +		err = PTR_ERR(d_dirty);
> > +		if (err != -ENOENT)
> > +			ret = err;
> > +		goto out_putvolatile;
> > +	}
> > +
> > +	if (!d_dirty->d_inode)
> > +		goto out_putdirty;
> > +
> > +	inode_lock_nested(d_dirty->d_inode, I_MUTEX_XATTR);
> > +	err = ovl_do_getxattr(d_dirty, OVL_XATTR_VOLATILE, &info, sizeof(info));
> > +	inode_unlock(d_dirty->d_inode);
> > +	if (err != sizeof(info))
> > +		goto out_putdirty;
> > +
> > +	if (!uuid_equal(&overlay_boot_id, &info.overlay_boot_id)) {
> > +		pr_debug("boot id has changed (reboot or module reloaded)\n");
> > +		goto out_putdirty;
> > +	}
> > +
> > +	if (d_dirty->d_inode->i_sb->s_instance_id != info.s_instance_id) {
> > +		pr_debug("workdir has been unmounted and remounted\n");
> > +		goto out_putdirty;
> > +	}
> > +
> > +	err = errseq_check(&d_dirty->d_inode->i_sb->s_wb_err, info.errseq);
> > +	if (err) {
> > +		pr_debug("workdir dir has experienced errors: %d\n", err);
> > +		goto out_putdirty;
> > +	}
> > +
> > +	/* Dirty file is okay, delete it. */
> > +	ret = ovl_do_unlink(d_volatile->d_inode, d_dirty);
> > +
> > +out_putdirty:
> > +	dput(d_dirty);
> > +out_putvolatile:
> > +	inode_unlock(d_volatile->d_inode);
> > +	dput(d_volatile);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * check_incompat checks this specific incompat entry for incompatibility.
> > + * If it is found to be incompatible -EINVAL will be returned.
> > + *
> > + * Any other -errno indicates an unknown error, and filesystem mounting
> > + * should be aborted.
> > + */
> > +static int ovl_check_incompat(struct ovl_cache_entry *p, struct path *path)
> > +{
> > +	int err = -EINVAL;
> > +
> > +	if (!strcmp(p->name, OVL_VOLATILEDIR_NAME))
> > +		err = ovl_check_incompat_volatile(p, path);
> > +
> > +	if (err == -EINVAL)
> > +		pr_err("incompat feature '%s' cannot be mounted\n", p->name);
> > +	else
> > +		pr_debug("incompat '%s' handled: %d\n", p->name, err);
> > +
> > +	return err;
> > +}
> >  
> >  static int ovl_workdir_cleanup_recurse(struct path *path, int level)
> >  {
> > @@ -1098,10 +1175,9 @@ static int ovl_workdir_cleanup_recurse(struct path *path, int level)
> >  			if (p->len == 2 && p->name[1] == '.')
> >  				continue;
> >  		} else if (incompat) {
> > -			pr_err("overlay with incompat feature '%s' cannot be mounted\n",
> > -				p->name);
> > -			err = -EINVAL;
> > -			break;
> > +			err = ovl_check_incompat(p, path);
> > +			if (err)
> > +				break;
> >  		}
> >  		dentry = lookup_one_len(p->name, path->dentry, p->len);
> >  		if (IS_ERR(dentry))
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index 2ee0ba16cc7b..94980898009f 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -15,6 +15,7 @@
> >  #include <linux/seq_file.h>
> >  #include <linux/posix_acl_xattr.h>
> >  #include <linux/exportfs.h>
> > +#include <linux/uuid.h>
> >  #include "overlayfs.h"
> >  
> >  MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
> > @@ -23,6 +24,7 @@ MODULE_LICENSE("GPL");
> >  
> >  
> >  struct ovl_dir_cache;
> > +uuid_t overlay_boot_id;
> >  
> >  #define OVL_MAX_STACK 500
> >  
> > @@ -1246,20 +1248,35 @@ static struct dentry *ovl_lookup_or_create(struct dentry *parent,
> >   */
> >  static int ovl_create_volatile_dirty(struct ovl_fs *ofs)
> >  {
> > +	int err;
> >  	unsigned int ctr;
> >  	struct dentry *d = dget(ofs->workbasedir);
> >  	static const char *const volatile_path[] = {
> > -		OVL_WORKDIR_NAME, "incompat", "volatile", "dirty"
> > +		OVL_WORKDIR_NAME,
> > +		OVL_INCOMPATDIR_NAME,
> > +		OVL_VOLATILEDIR_NAME,
> > +		OVL_VOLATILE_DIRTY_NAME,
> >  	};
> >  	const char *const *name = volatile_path;
> > +	struct overlay_volatile_info info = {};
> >  
> >  	for (ctr = ARRAY_SIZE(volatile_path); ctr; ctr--, name++) {
> >  		d = ovl_lookup_or_create(d, *name, ctr > 1 ? S_IFDIR : S_IFREG);
> >  		if (IS_ERR(d))
> >  			return PTR_ERR(d);
> >  	}
> > +
> > +	uuid_copy(&info.overlay_boot_id, &overlay_boot_id);
> > +	info.s_instance_id = d->d_inode->i_sb->s_instance_id;
> > +	info.errseq = errseq_sample(&d->d_inode->i_sb->s_wb_err);
> > +
> > +
> > +	err = ovl_do_setxattr(d, OVL_XATTR_VOLATILE, &info, sizeof(info), 0);
> > +	if (err == -EOPNOTSUPP)
> > +		err = 0;
> > +
> >  	dput(d);
> > -	return 0;
> > +	return err;
> >  }
> >  
> >  static int ovl_make_workdir(struct super_block *sb, struct ovl_fs *ofs,
> > @@ -2045,6 +2062,7 @@ static int __init ovl_init(void)
> >  {
> >  	int err;
> >  
> > +	uuid_gen(&overlay_boot_id);
> >  	ovl_inode_cachep = kmem_cache_create("ovl_inode",
> >  					     sizeof(struct ovl_inode), 0,
> >  					     (SLAB_RECLAIM_ACCOUNT|
> > -- 
> > 2.25.1
> > 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 16:36       ` Vivek Goyal
@ 2020-11-16 18:25         ` Sargun Dhillon
  2020-11-16 19:27           ` Vivek Goyal
  2020-11-16 20:18         ` Amir Goldstein
  1 sibling, 1 reply; 34+ messages in thread
From: Sargun Dhillon @ 2020-11-16 18:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Amir Goldstein, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Mon, Nov 16, 2020 at 11:36:15AM -0500, Vivek Goyal wrote:
> On Mon, Nov 16, 2020 at 05:20:04PM +0200, Amir Goldstein wrote:
> > On Mon, Nov 16, 2020 at 4:42 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> > > > Overlayfs added the ability to setup mounts where all syncs could be
> > > > short-circuited in (2a99ddacee43: ovl: provide a mount option "volatile").
> > > >
> > > > A user might want to remount this fs, but we do not let the user because
> > > > of the "incompat" detection feature. In the case of volatile, it is safe
> > > > to do something like[1]:
> > > >
> > > > $ sync -f /root/upperdir
> > > > $ rm -rf /root/workdir/incompat/volatile
> > > >
> > > > There are two ways to go about this. You can call sync on the underlying
> > > > filesystem, check the error code, and delete the dirty file if everything
> > > > is clean. If you're running lots of containers on the same filesystem, or
> > > > you want to avoid all unnecessary I/O, this may be suboptimal.
> > > >
> > >
> > > Hi Sargun,
> > >
> > > I had asked bunch of questions in previous mail thread to be more
> > > clear on your requirements but never got any response. It would
> > > have helped understanding your requirements better.
> > >
> > > How about following patch set which seems to sync only dirty inodes of
> > > upper belonging to a particular overlayfs instance.
> > >
> > > https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/
> > >
> > > So if could implement a mount option which ignores fsync but upon
> > > syncfs, only syncs dirty inodes of that overlayfs instance, it will
> > > make sure we are not syncing whole of the upper fs. And we could
> > > do this syncing on unmount of overlayfs and remove dirty file upon
> > > successful sync.
> > >
> > > Looks like this will be much simpler method and should be able to
> > > meet your requirements (As long as you are fine with syncing dirty
> > > upper inodes of this overlay instance on unmount).
> > >
> > 
> > Do note that the latest patch set by Chengguang not only syncs dirty
> > inodes of this overlay instance, but also waits for in-flight writeback on
> > all the upper fs inodes and I think that with !ovl_should_sync(ofs)
> > we will not re-dirty the ovl inodes and lose track of the list of dirty
> > inodes - maybe that can be fixed.
> > 
> > Also, I am not sure anymore that we can safely remove the dirty file after
> > sync dirty inodes sync_fs and umount. If someone did sync_fs before us
> > and consumed the error, we may have a copied up file in upper whose
> > data is not on disk, but when we sync_fs on unmount we won't get an
> > error? not sure.
> 
> May be we can save errseq_t when mounting overlay and compare with
> errseq_t stored in upper sb after unmount. That will tell us whether
> error has happened since we mounted overlay. (Similar to what Sargun
> is doing).
> 
> In fact, if this is a concern, we have this issue with user space
> "sync <upper>" too? Other sync might fail and this one succeeds
> and we will think upper is just fine. May be container tools can
> keep a file/dir open at the time of mount and call syncfs using
> that fd instead. (And that should catch errors since that fd
> was opened, I am assuming).
> 
> > 
> > I am less concerned about ways to allow re-mount of volatile
> > overlayfs than I am about turning volatile overlayfs into non-volatile.
> 
> If we are not interested in converting volatile containers into
> non-volatile, then whole point of these patch series is to detect
> if any writeback error has happened or not. If writeback error has
> happened, then we detect that at remount and possibly throw away
> container.
> 
> What happens today if writeback error has happened. Is that page thrown
> away from page cache and read back from disk? IOW, will user lose
> the data it had written in page cache because writeback failed. I am
> assuming we can't keep the dirty page around for very long otherwise
> it has potential to fill up all the available ram with dirty pages which
> can't be written back.
> 
> Why is it important to detect writeback error only during remount. What
> happens if container overlay instance is already mounted and writeback
> error happens. We will not detect that, right?
> 
> IOW, if capturing writeback error is important for volatile containers,
> then capturing it only during remount time is not enough. Normally
> fsync/syncfs should catch it and now we have skipped those, so in
> the process we lost mechanism to detect writeback errors for
> volatile containers?
> 
> Thanks
> Vivek
> 

At least for my use case, any kind of syncing is generally bad unless
it can be controlled, and:
1. Generate a limited set of IOPs
2. Not block metadata operations
----

This is a challenge that is left up to the filesystem developers that hasn't 
really been addressed yet. The closest we've seen is individual block devices 
per upper dir using something like device mapper, and throttling at that level.

I liken this to "eatmydata". I think it makes sense to force the user to go from
volatile -> volatile. I do think that adding the safety feature which explicitly
warns users that their system is in a state where they may be experiencing data
loss (checking errseq_t) is useful. Although we emit the error via dmesg today,
if we move over to the new mount API, we could emit the error from the fsfd,
forcing the user either to set another flag, "reallyvolatile", or to delete the
dirty bit on disk. I'm partial to the flag approach because it involves less
API surface area.

That is partly because one of the overall use cases I want to be able to
implement is LXC-style seccomp-fd based mount syscall interception, and the
fewer things to juggle (and corner cases to handle), the better.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 18:25         ` Sargun Dhillon
@ 2020-11-16 19:27           ` Vivek Goyal
  0 siblings, 0 replies; 34+ messages in thread
From: Vivek Goyal @ 2020-11-16 19:27 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Amir Goldstein, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Mon, Nov 16, 2020 at 06:25:56PM +0000, Sargun Dhillon wrote:
> On Mon, Nov 16, 2020 at 11:36:15AM -0500, Vivek Goyal wrote:
> > On Mon, Nov 16, 2020 at 05:20:04PM +0200, Amir Goldstein wrote:
> > > On Mon, Nov 16, 2020 at 4:42 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> > > > > Overlayfs added the ability to setup mounts where all syncs could be
> > > > > short-circuited in (2a99ddacee43: ovl: provide a mount option "volatile").
> > > > >
> > > > > A user might want to remount this fs, but we do not let the user because
> > > > > of the "incompat" detection feature. In the case of volatile, it is safe
> > > > > to do something like[1]:
> > > > >
> > > > > $ sync -f /root/upperdir
> > > > > $ rm -rf /root/workdir/incompat/volatile
> > > > >
> > > > > There are two ways to go about this. You can call sync on the underlying
> > > > > filesystem, check the error code, and delete the dirty file if everything
> > > > > is clean. If you're running lots of containers on the same filesystem, or
> > > > > you want to avoid all unnecessary I/O, this may be suboptimal.
> > > > >
> > > >
> > > > Hi Sargun,
> > > >
> > > > I had asked bunch of questions in previous mail thread to be more
> > > > clear on your requirements but never got any response. It would
> > > > have helped understanding your requirements better.
> > > >
> > > > How about following patch set which seems to sync only dirty inodes of
> > > > upper belonging to a particular overlayfs instance.
> > > >
> > > > https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/
> > > >
> > > > So if could implement a mount option which ignores fsync but upon
> > > > syncfs, only syncs dirty inodes of that overlayfs instance, it will
> > > > make sure we are not syncing whole of the upper fs. And we could
> > > > do this syncing on unmount of overlayfs and remove dirty file upon
> > > > successful sync.
> > > >
> > > > Looks like this will be much simpler method and should be able to
> > > > meet your requirements (As long as you are fine with syncing dirty
> > > > upper inodes of this overlay instance on unmount).
> > > >
> > > 
> > > Do note that the latest patch set by Chengguang not only syncs dirty
> > > inodes of this overlay instance, but also waits for in-flight writeback on
> > > all the upper fs inodes and I think that with !ovl_should_sync(ofs)
> > > we will not re-dirty the ovl inodes and lose track of the list of dirty
> > > inodes - maybe that can be fixed.
> > > 
> > > Also, I am not sure anymore that we can safely remove the dirty file after
> > > sync dirty inodes sync_fs and umount. If someone did sync_fs before us
> > > and consumed the error, we may have a copied up file in upper whose
> > > data is not on disk, but when we sync_fs on unmount we won't get an
> > > error? not sure.
> > 
> > May be we can save errseq_t when mounting overlay and compare with
> > errseq_t stored in upper sb after unmount. That will tell us whether
> > error has happened since we mounted overlay. (Similar to what Sargun
> > is doing).
> > 
> > In fact, if this is a concern, we have this issue with user space
> > "sync <upper>" too? Other sync might fail and this one succeeds
> > and we will think upper is just fine. May be container tools can
> > keep a file/dir open at the time of mount and call syncfs using
> > that fd instead. (And that should catch errors since that fd
> > was opened, I am assuming).
> > 
> > > 
> > > I am less concerned about ways to allow re-mount of volatile
> > > overlayfs than I am about turning volatile overlayfs into non-volatile.
> > 
> > If we are not interested in converting volatile containers into
> > non-volatile, then whole point of these patch series is to detect
> > if any writeback error has happened or not. If writeback error has
> > happened, then we detect that at remount and possibly throw away
> > container.
> > 
> > What happens today if writeback error has happened. Is that page thrown
> > away from page cache and read back from disk? IOW, will user lose
> > the data it had written in page cache because writeback failed. I am
> > assuming we can't keep the dirty page around for very long otherwise
> > it has potential to fill up all the available ram with dirty pages which
> > can't be written back.
> > 
> > Why is it important to detect writeback error only during remount. What
> > happens if container overlay instance is already mounted and writeback
> > error happens. We will not detect that, right?
> > 
> > IOW, if capturing writeback error is important for volatile containers,
> > then capturing it only during remount time is not enough. Normally
> > fsync/syncfs should catch it and now we have skipped those, so in
> > the process we lost mechanism to detect writeback errors for
> > volatile containers?
> > 
> > Thanks
> > Vivek
> > 
> 
> At least for my use case, any kind of syncing is generally bad unless
> it can be controlled, and:
> 1. Generate a limited set of IOPs
> 2. Not block metadata operations
> ----
> 
> This is a challenge that is left up to the filesystem developers that hasn't 
> really been addressed yet. The closest we've seen is individual block devices 
> per upper dir using something like device mapper, and throttling at that level.

I think I have heard complaints about cloud providers' throttling kicking
in as well, and that can result in long stalls. But throttling can also kick
in due to writeback under memory pressure, so disabling sync/fsync is no
guarantee that this will not happen?

Help me understand one thing. Say a volatile container is running (hence
overlay is mounted). Now a writeback error happens (say on a page
which was written by the container app). How are you detecting that
writeback error? fsync has now been disabled, so overlay will return
success.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 16:36       ` Vivek Goyal
  2020-11-16 18:25         ` Sargun Dhillon
@ 2020-11-16 20:18         ` Amir Goldstein
  2020-11-16 21:09           ` Vivek Goyal
  2020-11-16 21:26           ` Vivek Goyal
  1 sibling, 2 replies; 34+ messages in thread
From: Amir Goldstein @ 2020-11-16 20:18 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Mon, Nov 16, 2020 at 6:36 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Mon, Nov 16, 2020 at 05:20:04PM +0200, Amir Goldstein wrote:
> > On Mon, Nov 16, 2020 at 4:42 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> > > > Overlayfs added the ability to setup mounts where all syncs could be
> > > > short-circuited in (2a99ddacee43: ovl: provide a mount option "volatile").
> > > >
> > > > A user might want to remount this fs, but we do not let the user because
> > > > of the "incompat" detection feature. In the case of volatile, it is safe
> > > > to do something like[1]:
> > > >
> > > > $ sync -f /root/upperdir
> > > > $ rm -rf /root/workdir/incompat/volatile
> > > >
> > > > There are two ways to go about this. You can call sync on the underlying
> > > > filesystem, check the error code, and delete the dirty file if everything
> > > > is clean. If you're running lots of containers on the same filesystem, or
> > > > you want to avoid all unnecessary I/O, this may be suboptimal.
> > > >
> > >
> > > Hi Sargun,
> > >
> > > I had asked bunch of questions in previous mail thread to be more
> > > clear on your requirements but never got any response. It would
> > > have helped understanding your requirements better.
> > >
> > > How about following patch set which seems to sync only dirty inodes of
> > > upper belonging to a particular overlayfs instance.
> > >
> > > https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/
> > >
> > > So if could implement a mount option which ignores fsync but upon
> > > syncfs, only syncs dirty inodes of that overlayfs instance, it will
> > > make sure we are not syncing whole of the upper fs. And we could
> > > do this syncing on unmount of overlayfs and remove dirty file upon
> > > successful sync.
> > >
> > > Looks like this will be much simpler method and should be able to
> > > meet your requirements (As long as you are fine with syncing dirty
> > > upper inodes of this overlay instance on unmount).
> > >
> >
> > Do note that the latest patch set by Chengguang not only syncs dirty
> > inodes of this overlay instance, but also waits for in-flight writeback on
> > all the upper fs inodes and I think that with !ovl_should_sync(ofs)
> > we will not re-dirty the ovl inodes and lose track of the list of dirty
> > inodes - maybe that can be fixed.
> >
> > Also, I am not sure anymore that we can safely remove the dirty file after
> > sync dirty inodes sync_fs and umount. If someone did sync_fs before us
> > and consumed the error, we may have a copied up file in upper whose
> > data is not on disk, but when we sync_fs on unmount we won't get an
> > error? not sure.
>
> May be we can save errseq_t when mounting overlay and compare with
> errseq_t stored in upper sb after unmount. That will tell us whether
> error has happened since we mounted overlay. (Similar to what Sargun
> is doing).
>

I suppose so.

> In fact, if this is a concern, we have this issue with user space
> "sync <upper>" too? Other sync might fail and this one succeeds
> and we will think upper is just fine. May be container tools can
> keep a file/dir open at the time of mount and call syncfs using
> that fd instead. (And that should catch errors since that fd
> was opened, I am assuming).
>

Did not understand the problem with userspace sync.

> >
> > I am less concerned about ways to allow re-mount of volatile
> > overlayfs than I am about turning volatile overlayfs into non-volatile.
>
> If we are not interested in converting volatile containers into
> non-volatile, then whole point of these patch series is to detect
> if any writeback error has happened or not. If writeback error has
> happened, then we detect that at remount and possibly throw away
> container.
>
> What happens today if writeback error has happened. Is that page thrown
> away from page cache and read back from disk? IOW, will user lose
> the data it had written in page cache because writeback failed. I am
> assuming we can't keep the dirty page around for very long otherwise
> it has potential to fill up all the available ram with dirty pages which
> can't be written back.
>

Right. The resulting data is undefined after an error.

> Why is it important to detect writeback error only during remount. What
> happens if container overlay instance is already mounted and writeback
> error happens. We will not detect that, right?
>
> IOW, if capturing writeback error is important for volatile containers,
> then capturing it only during remount time is not enough. Normally
> fsync/syncfs should catch it and now we have skipped those, so in
> the process we lost mechanism to detect writeback errors for
> volatile containers?
>

Yes, you are right.
It's an issue with volatile that we should probably document.

I think upper files data can "evaporate" even as the overlay is still mounted.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 20:18         ` Amir Goldstein
@ 2020-11-16 21:09           ` Vivek Goyal
  2020-11-17  5:33             ` Amir Goldstein
  2020-11-16 21:26           ` Vivek Goyal
  1 sibling, 1 reply; 34+ messages in thread
From: Vivek Goyal @ 2020-11-16 21:09 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Mon, Nov 16, 2020 at 10:18:03PM +0200, Amir Goldstein wrote:
> On Mon, Nov 16, 2020 at 6:36 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Mon, Nov 16, 2020 at 05:20:04PM +0200, Amir Goldstein wrote:
> > > On Mon, Nov 16, 2020 at 4:42 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> > > > > Overlayfs added the ability to setup mounts where all syncs could be
> > > > > short-circuited in (2a99ddacee43: ovl: provide a mount option "volatile").
> > > > >
> > > > > A user might want to remount this fs, but we do not let the user because
> > > > > of the "incompat" detection feature. In the case of volatile, it is safe
> > > > > to do something like[1]:
> > > > >
> > > > > $ sync -f /root/upperdir
> > > > > $ rm -rf /root/workdir/incompat/volatile
> > > > >
> > > > > There are two ways to go about this. You can call sync on the underlying
> > > > > filesystem, check the error code, and delete the dirty file if everything
> > > > > is clean. If you're running lots of containers on the same filesystem, or
> > > > > you want to avoid all unnecessary I/O, this may be suboptimal.
> > > > >
> > > >
> > > > Hi Sargun,
> > > >
> > > > I had asked bunch of questions in previous mail thread to be more
> > > > clear on your requirements but never got any response. It would
> > > > have helped understanding your requirements better.
> > > >
> > > > How about following patch set which seems to sync only dirty inodes of
> > > > upper belonging to a particular overlayfs instance.
> > > >
> > > > https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/
> > > >
> > > > So if could implement a mount option which ignores fsync but upon
> > > > syncfs, only syncs dirty inodes of that overlayfs instance, it will
> > > > make sure we are not syncing whole of the upper fs. And we could
> > > > do this syncing on unmount of overlayfs and remove dirty file upon
> > > > successful sync.
> > > >
> > > > Looks like this will be much simpler method and should be able to
> > > > meet your requirements (As long as you are fine with syncing dirty
> > > > upper inodes of this overlay instance on unmount).
> > > >
> > >
> > > Do note that the latest patch set by Chengguang not only syncs dirty
> > > inodes of this overlay instance, but also waits for in-flight writeback on
> > > all the upper fs inodes and I think that with !ovl_should_sync(ofs)
> > > we will not re-dirty the ovl inodes and lose track of the list of dirty
> > > inodes - maybe that can be fixed.
> > >
> > > Also, I am not sure anymore that we can safely remove the dirty file after
> > > sync dirty inodes sync_fs and umount. If someone did sync_fs before us
> > > and consumed the error, we may have a copied up file in upper whose
> > > data is not on disk, but when we sync_fs on unmount we won't get an
> > > error? not sure.
> >
> > May be we can save errseq_t when mounting overlay and compare with
> > errseq_t stored in upper sb after unmount. That will tell us whether
> > error has happened since we mounted overlay. (Similar to what Sargun
> > is doing).
> >
> 
> I suppose so.
> 
> > In fact, if this is a concern, we have this issue with user space
> > "sync <upper>" too? Other sync might fail and this one succeeds
> > and we will think upper is just fine. May be container tools can
> > keep a file/dir open at the time of mount and call syncfs using
> > that fd instead. (And that should catch errors since that fd
> > was opened, I am assuming).
> >
> 
> Did not understand the problem with userspace sync.

Say volatile container A is using upper/, which is on xfs. Assume container A
does the following.

1. Container A writes some data/copies up some files.
2. sync -f upper/
3. Remove incompat dir.
4. Remount overlay and restart container A.

Now, normally, if some error happened in writeback on upper/, then "sync -f"
should catch that and return an error. In that case the container manager can
throw away the container.

What if another container B was doing the same thing and issues
"sync -f upper/", and that sync reports errors? Now container A issues
sync and, IIUC, we will not see the error on the super block because it has
already been seen by container B.

And container A will assume that all data written by it safely made
it to disk and that it is safe to remove the incompat/volatile/ dir.

If the container manager keeps a file descriptor open to one of the files
in upper/ and uses that for sync, then it will still catch the
error, because file->f_sb_err was sampled before the error happened
and we will get any error recorded since then.
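
The difference between a freshly opened fd and a long-lived one can be
sketched with a toy userspace model of an errseq_t-style cursor. This is
a deliberately simplified illustration and not the kernel's errseq_t
implementation; the toy_* and demo_* names and the struct layout are made
up for this example. It only tries to mimic two properties: an error that
nobody has seen yet is still reported to a fresh sampler, and an error
that has already been seen is not reported again to someone who samples
afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the per-superblock error state (hypothetical layout). */
struct toy_errseq {
	uint32_t counter;	/* bumped when a new error follows a seen one */
	int err;		/* last recorded error as -errno, 0 if none */
	bool seen;		/* has some observer already consumed it? */
};

/* What an observer (e.g. an open file) remembers from sample time. */
struct toy_cursor {
	uint32_t counter;
	int err;
};

/* Record a writeback failure. */
static void toy_errseq_set(struct toy_errseq *e, int err)
{
	assert(err < 0);
	if (e->seen)
		e->counter++;
	e->err = err;
	e->seen = false;
}

/* Take a starting point; an unseen error stays visible to the sampler. */
static struct toy_cursor toy_errseq_sample(const struct toy_errseq *e)
{
	struct toy_cursor c = { 0, 0 };

	if (e->seen) {
		c.counter = e->counter;
		c.err = e->err;
	}
	return c;
}

/* Report an error recorded since the cursor was taken and mark it seen. */
static int toy_errseq_check_and_advance(struct toy_errseq *e,
					struct toy_cursor *since)
{
	if (e->counter == since->counter && e->err == since->err)
		return 0;
	e->seen = true;
	since->counter = e->counter;
	since->err = e->err;
	return e->err;
}

/* Container A opens its fd only when it syncs, after B consumed the error. */
static int demo_fresh_fd(void)
{
	struct toy_errseq sb = { 0, 0, false };
	struct toy_cursor a, b;

	toy_errseq_set(&sb, -EIO);			/* writeback fails */
	b = toy_errseq_sample(&sb);
	(void)toy_errseq_check_and_advance(&sb, &b);	/* B sees -EIO */
	a = toy_errseq_sample(&sb);			/* A samples too late */
	return toy_errseq_check_and_advance(&sb, &a);
}

/* Container A kept an fd open since before the error happened. */
static int demo_long_lived_fd(void)
{
	struct toy_errseq sb = { 0, 0, false };
	struct toy_cursor a = toy_errseq_sample(&sb);	/* sampled at mount */
	struct toy_cursor b;

	toy_errseq_set(&sb, -EIO);
	b = toy_errseq_sample(&sb);
	(void)toy_errseq_check_and_advance(&sb, &b);	/* B sees -EIO */
	return toy_errseq_check_and_advance(&sb, &a);
}
```

In this model demo_fresh_fd() returns 0 (container A misses the error that
B already consumed), while demo_long_lived_fd() returns -EIO because A's
cursor predates the failure.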

> 
> > >
> > > I am less concerned about ways to allow re-mount of volatile
> > > overlayfs than I am about turning volatile overlayfs into non-volatile.
> >
> > If we are not interested in converting volatile containers into
> > non-volatile, then whole point of these patch series is to detect
> > if any writeback error has happened or not. If writeback error has
> > happened, then we detect that at remount and possibly throw away
> > container.
> >
> > What happens today if writeback error has happened. Is that page thrown
> > away from page cache and read back from disk? IOW, will user lose
> > the data it had written in page cache because writeback failed. I am
> > assuming we can't keep the dirty page around for very long otherwise
> > it has potential to fill up all the available ram with dirty pages which
> > can't be written back.
> >
> 
> Right. The resulting data is undefined after an error.

So the application will not come to know of the error until and unless it
does an fsync()? IOW, if I write to a file and read back the same pages after
a while, I might not get back what I had written. So the application
should first write the data, fsync it and, upon successful fsync, consume
back the data written?

If yes, this is a problem for volatile containers. If somebody is
using these to build images, there is a possibility that the image
is corrupted (because a writeback error led to data loss). If so,
then the safe way to generate an image with volatile containers
will be to first sync upper (or sync on umount somehow) and, if
no errors are reported, then it is safe to read back that data
and pack it into an image.
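
A minimal userspace sketch of that sequence, assuming a kernel where
syncfs() reports writeback errors recorded since the file descriptor was
opened (the file->f_sb_err behaviour referred to earlier in this mail).
The helper names upper_watch_open() and upper_watch_sync() are
hypothetical; a container manager would call the first before starting
the workload and the second before reading upper/ back into an image.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open a directory on the upper filesystem before the workload starts,
 * so that the fd's error cursor predates any later writeback failure.
 * Returns the fd, or -errno on failure.
 */
static int upper_watch_open(const char *upperdir)
{
	int fd = open(upperdir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);

	return fd < 0 ? -errno : fd;
}

/*
 * Flush the filesystem containing the watched directory and close the fd.
 * Returns 0 if syncfs() reported no error, -errno otherwise.
 */
static int upper_watch_sync(int fd)
{
	int ret = 0;

	assert(fd >= 0);
	if (syncfs(fd) < 0)
		ret = -errno;
	close(fd);
	return ret;
}
```

Only a zero return from upper_watch_sync() would justify packing the
contents of upper/ into an image; on any error the upper should be
treated as suspect and thrown away. Note that this still syncs the whole
upper filesystem, which is the cost the volatile option was meant to
avoid.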

> 
> > Why is it important to detect writeback error only during remount. What
> > happens if container overlay instance is already mounted and writeback
> > error happens. We will not detect that, right?
> >
> > IOW, if capturing writeback error is important for volatile containers,
> > then capturing it only during remount time is not enough. Normally
> > fsync/syncfs should catch it and now we have skipped those, so in
> > the process we lost mechanism to detect writeback errrors for
> > volatile containers?
> >
> 
> Yes, you are right.
> It's an issue with volatile that we should probably document.
> 
> I think upper files data can "evaporate" even as the overlay is still mounted.

How do we reliably consume that data back if it can evaporate? That
means syncing the whole fs (syncfs) is a requirement for volatile
containers before written data is read back. Otherwise we don't know
whether we are reading back correct or corrupted data.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 20:18         ` Amir Goldstein
  2020-11-16 21:09           ` Vivek Goyal
@ 2020-11-16 21:26           ` Vivek Goyal
  2020-11-16 22:14             ` Sargun Dhillon
  1 sibling, 1 reply; 34+ messages in thread
From: Vivek Goyal @ 2020-11-16 21:26 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Mon, Nov 16, 2020 at 10:18:03PM +0200, Amir Goldstein wrote:
> On Mon, Nov 16, 2020 at 6:36 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Mon, Nov 16, 2020 at 05:20:04PM +0200, Amir Goldstein wrote:
> > > On Mon, Nov 16, 2020 at 4:42 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> > > > > Overlayfs added the ability to setup mounts where all syncs could be
> > > > > short-circuted in (2a99ddacee43: ovl: provide a mount option "volatile").
> > > > >
> > > > > A user might want to remount this fs, but we do not let the user because
> > > > > of the "incompat" detection feature. In the case of volatile, it is safe
> > > > > to do something like[1]:
> > > > >
> > > > > $ sync -f /root/upperdir
> > > > > $ rm -rf /root/workdir/incompat/volatile
> > > > >
> > > > > There are two ways to go about this. You can call sync on the underlying
> > > > > filesystem, check the error code, and delete the dirty file if everything
> > > > > is clean. If you're running lots of containers on the same filesystem, or
> > > > > you want to avoid all unnecessary I/O, this may be suboptimal.
> > > > >
> > > >
> > > > Hi Sargun,
> > > >
> > > > I had asked bunch of questions in previous mail thread to be more
> > > > clear on your requirements but never got any response. It would
> > > > have helped understanding your requirements better.
> > > >
> > > > How about following patch set which seems to sync only dirty inodes of
> > > > upper belonging to a particular overlayfs instance.
> > > >
> > > > https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/
> > > >
> > > > So if could implement a mount option which ignores fsync but upon
> > > > syncfs, only syncs dirty inodes of that overlayfs instance, it will
> > > > make sure we are not syncing whole of the upper fs. And we could
> > > > do this syncing on unmount of overlayfs and remove dirty file upon
> > > > successful sync.
> > > >
> > > > Looks like this will be much simpler method and should be able to
> > > > meet your requirements (As long as you are fine with syncing dirty
> > > > upper inodes of this overlay instance on unmount).
> > > >
> > >
> > > Do note that the latest patch set by Chengguang not only syncs dirty
> > > inodes of this overlay instance, but also waits for in-flight writeback on
> > > all the upper fs inodes and I think that with !ovl_should_sync(ofs)
> > > we will not re-dirty the ovl inodes and lose track of the list of dirty
> > > inodes - maybe that can be fixed.
> > >
> > > Also, I am not sure anymore that we can safely remove the dirty file after
> > > sync dirty inodes sync_fs and umount. If someone did sync_fs before us
> > > and consumed the error, we may have a copied up file in upper whose
> > > data is not on disk, but when we sync_fs on unmount we won't get an
> > > error? not sure.
> >
> > May be we can save errseq_t when mounting overlay and compare with
> > errseq_t stored in upper sb after unmount. That will tell us whether
> > error has happened since we mounted overlay. (Similar to what Sargun
> > is doing).
> >
> 
> I suppose so.
> 
> > In fact, if this is a concern, we have this issue with user space
> > "sync <upper>" too? Other sync might fail and this one succeeds
> > and we will think upper is just fine. May be container tools can
> > keep a file/dir open at the time of mount and call syncfs using
> > that fd instead. (And that should catch errors since that fd
> > was opened, I am assuming).
> >
> 
> Did not understand the problem with userspace sync.
> 
> > >
> > > I am less concerned about ways to allow re-mount of volatile
> > > overlayfs than I am about turning volatile overlayfs into non-volatile.
> >
> > If we are not interested in converting volatile containers into
> > non-volatile, then whole point of these patch series is to detect
> > if any writeback error has happened or not. If writeback error has
> > happened, then we detect that at remount and possibly throw away
> > container.
> >
> > What happens today if writeback error has happened. Is that page thrown
> > away from page cache and read back from disk? IOW, will user lose
> > the data it had written in page cache because writeback failed. I am
> > assuming we can't keep the dirty page around for very long otherwise
> > it has potential to fill up all the available ram with dirty pages which
> > can't be written back.
> >
> 
> Right. the resulting data is undefined after error.
> 
> > Why is it important to detect writeback error only during remount. What
> > happens if container overlay instance is already mounted and writeback
> > error happens. We will not detct that, right?
> >
> > IOW, if capturing writeback error is important for volatile containers,
> > then capturing it only during remount time is not enough. Normally
> > fsync/syncfs should catch it and now we have skipped those, so in
> > the process we lost mechanism to detect writeback errrors for
> > volatile containers?
> >
> 
> Yes, you are right.
> It's an issue with volatile that we should probably document.
> 
> I think upper files data can "evaporate" even as the overlay is still mounted.

I think the assumption with volatile containers was that data will
remain valid as long as the machine does not crash/shutdown. We missed
the possibility of writeback errors during those discussions.

And if data can evaporate without any way to know that something
has gone wrong, I don't know how that's useful for applications.

Also, we first need to fix writeback error handling for volatile
containers while they are mounted, before one tries to fix writeback
error detection during remount, IMHO.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 21:26           ` Vivek Goyal
@ 2020-11-16 22:14             ` Sargun Dhillon
  2020-11-17  5:41               ` Amir Goldstein
  2020-11-17 17:05               ` Vivek Goyal
  0 siblings, 2 replies; 34+ messages in thread
From: Sargun Dhillon @ 2020-11-16 22:14 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Amir Goldstein, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Mon, Nov 16, 2020 at 04:26:44PM -0500, Vivek Goyal wrote:
> On Mon, Nov 16, 2020 at 10:18:03PM +0200, Amir Goldstein wrote:
> > On Mon, Nov 16, 2020 at 6:36 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Mon, Nov 16, 2020 at 05:20:04PM +0200, Amir Goldstein wrote:
> > > > On Mon, Nov 16, 2020 at 4:42 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > >
> > > > > On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> > > > > > Overlayfs added the ability to setup mounts where all syncs could be
> > > > > > short-circuted in (2a99ddacee43: ovl: provide a mount option "volatile").
> > > > > >
> > > > > > A user might want to remount this fs, but we do not let the user because
> > > > > > of the "incompat" detection feature. In the case of volatile, it is safe
> > > > > > to do something like[1]:
> > > > > >
> > > > > > $ sync -f /root/upperdir
> > > > > > $ rm -rf /root/workdir/incompat/volatile
> > > > > >
> > > > > > There are two ways to go about this. You can call sync on the underlying
> > > > > > filesystem, check the error code, and delete the dirty file if everything
> > > > > > is clean. If you're running lots of containers on the same filesystem, or
> > > > > > you want to avoid all unnecessary I/O, this may be suboptimal.
> > > > > >
> > > > >
> > > > > Hi Sargun,
> > > > >
> > > > > I had asked bunch of questions in previous mail thread to be more
> > > > > clear on your requirements but never got any response. It would
> > > > > have helped understanding your requirements better.
> > > > >
> > > > > How about following patch set which seems to sync only dirty inodes of
> > > > > upper belonging to a particular overlayfs instance.
> > > > >
> > > > > https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/
> > > > >
> > > > > So if could implement a mount option which ignores fsync but upon
> > > > > syncfs, only syncs dirty inodes of that overlayfs instance, it will
> > > > > make sure we are not syncing whole of the upper fs. And we could
> > > > > do this syncing on unmount of overlayfs and remove dirty file upon
> > > > > successful sync.
> > > > >
> > > > > Looks like this will be much simpler method and should be able to
> > > > > meet your requirements (As long as you are fine with syncing dirty
> > > > > upper inodes of this overlay instance on unmount).
> > > > >
> > > >
> > > > Do note that the latest patch set by Chengguang not only syncs dirty
> > > > inodes of this overlay instance, but also waits for in-flight writeback on
> > > > all the upper fs inodes and I think that with !ovl_should_sync(ofs)
> > > > we will not re-dirty the ovl inodes and lose track of the list of dirty
> > > > inodes - maybe that can be fixed.
> > > >
> > > > Also, I am not sure anymore that we can safely remove the dirty file after
> > > > sync dirty inodes sync_fs and umount. If someone did sync_fs before us
> > > > and consumed the error, we may have a copied up file in upper whose
> > > > data is not on disk, but when we sync_fs on unmount we won't get an
> > > > error? not sure.
> > >
> > > May be we can save errseq_t when mounting overlay and compare with
> > > errseq_t stored in upper sb after unmount. That will tell us whether
> > > error has happened since we mounted overlay. (Similar to what Sargun
> > > is doing).
> > >
> > 
> > I suppose so.
> > 
> > > In fact, if this is a concern, we have this issue with user space
> > > "sync <upper>" too? Other sync might fail and this one succeeds
> > > and we will think upper is just fine. May be container tools can
> > > keep a file/dir open at the time of mount and call syncfs using
> > > that fd instead. (And that should catch errors since that fd
> > > was opened, I am assuming).
> > >
> > 
> > Did not understand the problem with userspace sync.
> > 
> > > >
> > > > I am less concerned about ways to allow re-mount of volatile
> > > > overlayfs than I am about turning volatile overlayfs into non-volatile.
> > >
> > > If we are not interested in converting volatile containers into
> > > non-volatile, then whole point of these patch series is to detect
> > > if any writeback error has happened or not. If writeback error has
> > > happened, then we detect that at remount and possibly throw away
> > > container.
> > >
> > > What happens today if writeback error has happened. Is that page thrown
> > > away from page cache and read back from disk? IOW, will user lose
> > > the data it had written in page cache because writeback failed. I am
> > > assuming we can't keep the dirty page around for very long otherwise
> > > it has potential to fill up all the available ram with dirty pages which
> > > can't be written back.
> > >
> > 
> > Right. the resulting data is undefined after error.
> > 
> > > Why is it important to detect writeback error only during remount. What
> > > happens if container overlay instance is already mounted and writeback
> > > error happens. We will not detct that, right?
> > >
> > > IOW, if capturing writeback error is important for volatile containers,
> > > then capturing it only during remount time is not enough. Normally
> > > fsync/syncfs should catch it and now we have skipped those, so in
> > > the process we lost mechanism to detect writeback errrors for
> > > volatile containers?
> > >
> > 
> > Yes, you are right.
> > It's an issue with volatile that we should probably document.
> > 
> > I think upper files data can "evaporate" even as the overlay is still mounted.
> 
> I think assumption of volatile containers was that data will remain
> valid as long as machine does not crash/shutdown. We missed the case
> of possibility of writeback errors during those discussions. 
> 
> And if data can evaporate without anyway to know that somehthing
> is gone wrong, I don't know how that's useful for applications.
> 
> Also, first we need to fix the case of writeback error handling
> for volatile containers while it is mounted before one tries to fix it
> for writeback error detection during remount, IMHO.
> 
> Thanks
> Vivek
> 

I feel like this is an infamous Linux problem. Lots[1][2][3][4] has been
said on the topic, and there's not really a general-purpose solution to
it. I think that most filesystems offer a choice of "continue" or
"fail-stop" (readonly), and if the upperdir lives on that filesystem, we
will get the feedback from it.

I can respin my patch with just the "boot id" and superblock ID check if folks
are fine with that, and we can figure out how to resolve the writeback issues
later.

[1]: https://lwn.net/Articles/752063/
[2]: https://lwn.net/Articles/724307/
[3]: https://www.usenix.org/system/files/atc20-rebello.pdf
[4]: https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 21:09           ` Vivek Goyal
@ 2020-11-17  5:33             ` Amir Goldstein
  2020-11-17 14:48               ` Vivek Goyal
  0 siblings, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2020-11-17  5:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Mon, Nov 16, 2020 at 11:09 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Mon, Nov 16, 2020 at 10:18:03PM +0200, Amir Goldstein wrote:
> > On Mon, Nov 16, 2020 at 6:36 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Mon, Nov 16, 2020 at 05:20:04PM +0200, Amir Goldstein wrote:
> > > > On Mon, Nov 16, 2020 at 4:42 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > >
> > > > > On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> > > > > > Overlayfs added the ability to setup mounts where all syncs could be
> > > > > > short-circuted in (2a99ddacee43: ovl: provide a mount option "volatile").
> > > > > >
> > > > > > A user might want to remount this fs, but we do not let the user because
> > > > > > of the "incompat" detection feature. In the case of volatile, it is safe
> > > > > > to do something like[1]:
> > > > > >
> > > > > > $ sync -f /root/upperdir
> > > > > > $ rm -rf /root/workdir/incompat/volatile
> > > > > >
> > > > > > There are two ways to go about this. You can call sync on the underlying
> > > > > > filesystem, check the error code, and delete the dirty file if everything
> > > > > > is clean. If you're running lots of containers on the same filesystem, or
> > > > > > you want to avoid all unnecessary I/O, this may be suboptimal.
> > > > > >
> > > > >
> > > > > Hi Sargun,
> > > > >
> > > > > I had asked bunch of questions in previous mail thread to be more
> > > > > clear on your requirements but never got any response. It would
> > > > > have helped understanding your requirements better.
> > > > >
> > > > > How about following patch set which seems to sync only dirty inodes of
> > > > > upper belonging to a particular overlayfs instance.
> > > > >
> > > > > https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/
> > > > >
> > > > > So if could implement a mount option which ignores fsync but upon
> > > > > syncfs, only syncs dirty inodes of that overlayfs instance, it will
> > > > > make sure we are not syncing whole of the upper fs. And we could
> > > > > do this syncing on unmount of overlayfs and remove dirty file upon
> > > > > successful sync.
> > > > >
> > > > > Looks like this will be much simpler method and should be able to
> > > > > meet your requirements (As long as you are fine with syncing dirty
> > > > > upper inodes of this overlay instance on unmount).
> > > > >
> > > >
> > > > Do note that the latest patch set by Chengguang not only syncs dirty
> > > > inodes of this overlay instance, but also waits for in-flight writeback on
> > > > all the upper fs inodes and I think that with !ovl_should_sync(ofs)
> > > > we will not re-dirty the ovl inodes and lose track of the list of dirty
> > > > inodes - maybe that can be fixed.
> > > >
> > > > Also, I am not sure anymore that we can safely remove the dirty file after
> > > > sync dirty inodes sync_fs and umount. If someone did sync_fs before us
> > > > and consumed the error, we may have a copied up file in upper whose
> > > > data is not on disk, but when we sync_fs on unmount we won't get an
> > > > error? not sure.
> > >
> > > May be we can save errseq_t when mounting overlay and compare with
> > > errseq_t stored in upper sb after unmount. That will tell us whether
> > > error has happened since we mounted overlay. (Similar to what Sargun
> > > is doing).
> > >
> >
> > I suppose so.
> >
> > > In fact, if this is a concern, we have this issue with user space
> > > "sync <upper>" too? Other sync might fail and this one succeeds
> > > and we will think upper is just fine. May be container tools can
> > > keep a file/dir open at the time of mount and call syncfs using
> > > that fd instead. (And that should catch errors since that fd
> > > was opened, I am assuming).
> > >
> >
> > Did not understand the problem with userspace sync.
>
> Say volatile container A is using upper/ which is on xfs. Assume, container A
> does following.
>
> 1. Container A writes some data/copies up some files.
> 2. sync -f upper/
> 3. Remove incompat dir.
> 4. Remount overlay and restart container A.
>
> Now normally if some error happend in writeback on upper/, then "sync -f"
> should catch that and return an error. In that case container manager can
> throw away the container.
>
> What if another container B was doing same thing and issues ssues
> "sync -f upper/" and that sync reports errors. Now container A issues
> sync and IIUC, we will not see error on super block because it has
> already been seen by container B.
>
> And container A will assume that all data written by it safely made
> it to disk and it is safe to remove incompat/volatile/ dir.
>
> If container manager keeps a file descriptor open to one of the files
> in upper/, and uses that for sync, then it will still catch the
> error because file->f_sb_err should be previous to error happened
> and we will get any error since then.
>

Yeah, we should probably record the upper sb_err on mount either way.
On fsync in a volatile mount, instead of a noop we can check whether
the upper fs had writeback errors since the volatile mount and return
an error instead of 0.
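
To make the idea concrete, here is a toy userspace model, not kernel
code. struct toy_errseq is a deliberately simplified stand-in for
errseq_t (the real one packs the errno, a "seen" flag and a counter
into one word), and all toy_* names are hypothetical:

```c
#include <errno.h>
#include <stdint.h>

/* Toy stand-in for errseq_t: a counter plus the last error. */
struct toy_errseq {
	uint32_t seq;	/* bumped on every recorded error */
	int err;	/* last error, as a negative errno */
};

/* Record a writeback error (what the upper fs would do). */
void toy_errseq_set(struct toy_errseq *e, int err)
{
	e->seq++;
	e->err = err;
}

/* Last error if anything was recorded after 'since', else 0. */
int toy_errseq_check(const struct toy_errseq *e, uint32_t since)
{
	return e->seq != since ? e->err : 0;
}

/* Per-overlay state: where the upper sb's error cursor was at mount. */
struct toy_volatile_mount {
	const struct toy_errseq *upper_wb_err;
	uint32_t mount_sample;
};

/* Sample the upper sb error state at volatile mount time. */
struct toy_volatile_mount toy_mount(const struct toy_errseq *upper_wb_err)
{
	struct toy_volatile_mount m = { upper_wb_err, upper_wb_err->seq };

	return m;
}

/*
 * Volatile fsync: still skips the real sync, but reports any upper
 * writeback error recorded since this overlay was mounted. The check
 * does not consume the error, so other mounts still see it.
 */
int toy_volatile_fsync(const struct toy_volatile_mount *m)
{
	return toy_errseq_check(m->upper_wb_err, m->mount_sample);
}
```

Because each mount compares against its own sample, container B
seeing the error first does not hide it from container A.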



> >
> > > >
> > > > I am less concerned about ways to allow re-mount of volatile
> > > > overlayfs than I am about turning volatile overlayfs into non-volatile.
> > >
> > > If we are not interested in converting volatile containers into
> > > non-volatile, then whole point of these patch series is to detect
> > > if any writeback error has happened or not. If writeback error has
> > > happened, then we detect that at remount and possibly throw away
> > > container.
> > >
> > > What happens today if writeback error has happened. Is that page thrown
> > > away from page cache and read back from disk? IOW, will user lose
> > > the data it had written in page cache because writeback failed. I am
> > > assuming we can't keep the dirty page around for very long otherwise
> > > it has potential to fill up all the available ram with dirty pages which
> > > can't be written back.
> > >
> >
> > Right. the resulting data is undefined after error.
>
> So application will not come to know of error until and unless it does
> an fsync()? IOW, if I write to a file and read back same pages after
> a while, I might not get back what I had written. So application
> should first write data, fsync it and upon successful fsync, consume
> back the data written?

I think so. Think of ENOSPC, delayed disk space allocation, and
COW blocks with btrfs clones.
Filesystems will do their best to reserve space in such cases
before the actual block allocation, but it doesn't always work.

>
> If yes, this is a problem for volatile containers. If somebody is
> using these to build images, there is a possibility that image
> is corrupted (because writeback error led to data loss). If yes,
> then safe way to generate image with volatile containers
> will be to first sync upper (or sync on umount somehow) and if
> no errors are reported, then it is safe to read back that data
> and pack into image.
>

I guess if we change fsync and syncfs to do nothing but return an
error if any writeback error has happened since mount, we will be ok?

> >
> > > Why is it important to detect writeback error only during remount. What
> > > happens if container overlay instance is already mounted and writeback
> > > error happens. We will not detct that, right?
> > >
> > > IOW, if capturing writeback error is important for volatile containers,
> > > then capturing it only during remount time is not enough. Normally
> > > fsync/syncfs should catch it and now we have skipped those, so in
> > > the process we lost mechanism to detect writeback errrors for
> > > volatile containers?
> > >
> >
> > Yes, you are right.
> > It's an issue with volatile that we should probably document.
> >
> > I think upper files data can "evaporate" even as the overlay is still mounted.
>
> How do we reliably consume that data back (if it can evaporate). That
> means, syncing whole fs (syncfs) is a requirement for volatile containers
> before data written is read back. Otherwise we don't know if we are
> reading back correct data or corrupted data.
>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 22:14             ` Sargun Dhillon
@ 2020-11-17  5:41               ` Amir Goldstein
  2020-11-17 17:05               ` Vivek Goyal
  1 sibling, 0 replies; 34+ messages in thread
From: Amir Goldstein @ 2020-11-17  5:41 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Vivek Goyal, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

> > > I think upper files data can "evaporate" even as the overlay is still mounted.
> >
> > I think assumption of volatile containers was that data will remain
> > valid as long as machine does not crash/shutdown. We missed the case
> > of possibility of writeback errors during those discussions.
> >
> > And if data can evaporate without anyway to know that somehthing
> > is gone wrong, I don't know how that's useful for applications.
> >
> > Also, first we need to fix the case of writeback error handling
> > for volatile containers while it is mounted before one tries to fix it
> > for writeback error detection during remount, IMHO.
> >
> > Thanks
> > Vivek
> >
>
> I feel like this is an infamous Linux problem, and lots[1][2][3][4] has been said
> on the topic, and there's not really a general purpose solution to it. I think that
> most filesystems offer a choice of "continue" or "fail-stop" (readonly), and if
> the upperdir lives on that filesystem, we will get the feedback from it.
>
> I can respin my patch with just the "boot id" and superblock ID check if folks
> are fine with that, and we can figure out how to resolve the writeback issues
> later.
>

On the contrary. Your error-checking code is very valuable and more
important than the remount feature.

If you change ovl_should_sync() to check for errors since mount and
return an error in that case, which all callers will check, then I
think you fix the evaporating files issue, and that needs to come
first, with a stable kernel backport IMO.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-17  5:33             ` Amir Goldstein
@ 2020-11-17 14:48               ` Vivek Goyal
  2020-11-17 15:24                 ` Amir Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Vivek Goyal @ 2020-11-17 14:48 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Tue, Nov 17, 2020 at 07:33:26AM +0200, Amir Goldstein wrote:
> On Mon, Nov 16, 2020 at 11:09 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Mon, Nov 16, 2020 at 10:18:03PM +0200, Amir Goldstein wrote:
> > > On Mon, Nov 16, 2020 at 6:36 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > On Mon, Nov 16, 2020 at 05:20:04PM +0200, Amir Goldstein wrote:
> > > > > On Mon, Nov 16, 2020 at 4:42 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > >
> > > > > > On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> > > > > > > Overlayfs added the ability to setup mounts where all syncs could be
> > > > > > > short-circuted in (2a99ddacee43: ovl: provide a mount option "volatile").
> > > > > > >
> > > > > > > A user might want to remount this fs, but we do not let the user because
> > > > > > > of the "incompat" detection feature. In the case of volatile, it is safe
> > > > > > > to do something like[1]:
> > > > > > >
> > > > > > > $ sync -f /root/upperdir
> > > > > > > $ rm -rf /root/workdir/incompat/volatile
> > > > > > >
> > > > > > > There are two ways to go about this. You can call sync on the underlying
> > > > > > > filesystem, check the error code, and delete the dirty file if everything
> > > > > > > is clean. If you're running lots of containers on the same filesystem, or
> > > > > > > you want to avoid all unnecessary I/O, this may be suboptimal.
> > > > > > >
> > > > > >
> > > > > > Hi Sargun,
> > > > > >
> > > > > > I had asked bunch of questions in previous mail thread to be more
> > > > > > clear on your requirements but never got any response. It would
> > > > > > have helped understanding your requirements better.
> > > > > >
> > > > > > How about following patch set which seems to sync only dirty inodes of
> > > > > > upper belonging to a particular overlayfs instance.
> > > > > >
> > > > > > https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/
> > > > > >
> > > > > > So if could implement a mount option which ignores fsync but upon
> > > > > > syncfs, only syncs dirty inodes of that overlayfs instance, it will
> > > > > > make sure we are not syncing whole of the upper fs. And we could
> > > > > > do this syncing on unmount of overlayfs and remove dirty file upon
> > > > > > successful sync.
> > > > > >
> > > > > > Looks like this will be much simpler method and should be able to
> > > > > > meet your requirements (As long as you are fine with syncing dirty
> > > > > > upper inodes of this overlay instance on unmount).
> > > > > >
> > > > >
> > > > > Do note that the latest patch set by Chengguang not only syncs dirty
> > > > > inodes of this overlay instance, but also waits for in-flight writeback on
> > > > > all the upper fs inodes and I think that with !ovl_should_sync(ofs)
> > > > > we will not re-dirty the ovl inodes and lose track of the list of dirty
> > > > > inodes - maybe that can be fixed.
> > > > >
> > > > > Also, I am not sure anymore that we can safely remove the dirty file after
> > > > > sync dirty inodes sync_fs and umount. If someone did sync_fs before us
> > > > > and consumed the error, we may have a copied up file in upper whose
> > > > > data is not on disk, but when we sync_fs on unmount we won't get an
> > > > > error? not sure.
> > > >
> > > > May be we can save errseq_t when mounting overlay and compare with
> > > > errseq_t stored in upper sb after unmount. That will tell us whether
> > > > error has happened since we mounted overlay. (Similar to what Sargun
> > > > is doing).
> > > >
> > >
> > > I suppose so.
> > >
> > > > In fact, if this is a concern, we have this issue with user space
> > > > "sync <upper>" too? Other sync might fail and this one succeeds
> > > > and we will think upper is just fine. May be container tools can
> > > > keep a file/dir open at the time of mount and call syncfs using
> > > > that fd instead. (And that should catch errors since that fd
> > > > was opened, I am assuming).
> > > >
> > >
> > > Did not understand the problem with userspace sync.
> >
> > Say volatile container A is using upper/ which is on xfs. Assume, container A
> > does following.
> >
> > 1. Container A writes some data/copies up some files.
> > 2. sync -f upper/
> > 3. Remove incompat dir.
> > 4. Remount overlay and restart container A.
> >
> > Now normally if some error happend in writeback on upper/, then "sync -f"
> > should catch that and return an error. In that case container manager can
> > throw away the container.
> >
> > What if another container B was doing same thing and issues ssues
> > "sync -f upper/" and that sync reports errors. Now container A issues
> > sync and IIUC, we will not see error on super block because it has
> > already been seen by container B.
> >
> > And container A will assume that all data written by it safely made
> > it to disk and it is safe to remove incompat/volatile/ dir.
> >
> > If container manager keeps a file descriptor open to one of the files
> > in upper/, and uses that for sync, then it will still catch the
> > error because file->f_sb_err should be previous to error happened
> > and we will get any error since then.
> >
> 
> Yeh, we should probably record upper sb_err on mount either way,
> On fsync in volatile, instead of noop we can check if upper fs had
> writeback errors since volatile mount and return error instead of 0.
> 
> 
> 
> > >
> > > > >
> > > > > I am less concerned about ways to allow re-mount of volatile
> > > > > overlayfs than I am about turning volatile overlayfs into non-volatile.
> > > >
> > > > If we are not interested in converting volatile containers into
> > > > non-volatile, then whole point of these patch series is to detect
> > > > if any writeback error has happened or not. If writeback error has
> > > > happened, then we detect that at remount and possibly throw away
> > > > container.
> > > >
> > > > What happens today if writeback error has happened. Is that page thrown
> > > > away from page cache and read back from disk? IOW, will user lose
> > > > the data it had written in page cache because writeback failed. I am
> > > > assuming we can't keep the dirty page around for very long otherwise
> > > > it has potential to fill up all the available ram with dirty pages which
> > > > can't be written back.
> > > >
> > >
> > > Right. the resulting data is undefined after error.
> >
> > So application will not come to know of error until and unless it does
> > an fsync()? IOW, if I write to a file and read back same pages after
> > a while, I might not get back what I had written. So application
> > should first write data, fsync it and upon successful fsync, consume
> > back the data written?
> 
> I think so. Think of ENOSPC and delayed disk space allocation
> and COW blocks with btrfs clones.
> Filesystems will do their best to reserve space in such cases
> before actual blocks allocation, but it doesn't always work.
> 
> >
> > If yes, this is a problem for volatile containers. If somebody is
> > using these to build images, there is a possibility that image
> > is corrupted (because writeback error led to data loss). If yes,
> > then safe way to generate image with volatile containers
> > will be to first sync upper (or sync on umount somehow) and if
> > no errors are reported, then it is safe to read back that data
> > and pack into image.
> >
> 
> I guess if we change fsync and syncfs to do nothing but return
> error if any writeback error happened since mount we will be ok?

I guess that will not be sufficient. Because overlay fsync/syncfs can
only return any error which has happened so far. It is still possible
that error happens right after this fsync call and application still
reads back old/corrupted data.

So this proposal reduces the race window but does not completely
eliminate it.

We probably will have to sync upper/ and if there are no errors reported,
then it should be ok to consume data back.

This leads back to same issue of doing fsync/sync which we are trying
to avoid with volatile containers. So we have two options.

A. Builds using volatile containers should sync upper and then pack upper/
  into an image. If the final sync returns an error, throw away the
  container and rebuild the image. This will avoid intermediate fsync calls
  but does not eliminate the final syncfs requirement on upper. Now one can
  either choose to do syncfs on upper/ or implement a more optimized syncfs
  through overlay so that only selected dirty inodes are synced instead.

B. Alternatively, live dangerously and know that it is possible that
  writeback error happens and you read back corrupted data. 

I personally will be willing to pay the cost of syncfs at the end and
use option A instead of always wondering if image I generated is corrupted
or not.
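
For illustration only, option A could be sketched from user space roughly
as below. This is not code from the patch series. It assumes libc exposes
syncfs(2) (Python has no os.syncfs(), so ctypes is used), and that the
kernel reports superblock writeback errors on the fd passed to syncfs, as
the file->f_sb_err discussion earlier in the thread suggests. The names
build_then_sync() and build are hypothetical.

```python
import ctypes
import os
import tempfile

# Assumption: the C library exposes syncfs(2); Python has no os.syncfs().
_libc = ctypes.CDLL(None, use_errno=True)


def syncfs(fd):
    """Call syncfs(2) on fd and raise OSError if the kernel reports an error."""
    if _libc.syncfs(fd) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def build_then_sync(upper_dir, build):
    """Run build() with no intermediate syncs, then do one final syncfs.

    The directory fd is opened before the build starts, so an error that
    happens during the build should still be reported on this fd even if
    some other process has already observed it.
    Returns True if upper/ may be packed, False if it must be thrown away.
    """
    fd = os.open(upper_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        build()
        try:
            syncfs(fd)
        except OSError:
            return False  # writeback failed: discard and rebuild the image
        return True
    finally:
        os.close(fd)
```

A container manager would call build_then_sync() once per image build and
only pack upper/ when it returns True.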

Thanks
Vivek



* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-17 14:48               ` Vivek Goyal
@ 2020-11-17 15:24                 ` Amir Goldstein
  2020-11-17 15:40                   ` Vivek Goyal
  2020-11-17 16:46                   ` Vivek Goyal
  0 siblings, 2 replies; 34+ messages in thread
From: Amir Goldstein @ 2020-11-17 15:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

> > I guess if we change fsync and syncfs to do nothing but return
> > error if any writeback error happened since mount we will be ok?
>
> I guess that will not be sufficient. Because overlay fsync/syncfs can
> only return any error which has happened so far. It is still possible
> that error happens right after this fsync call and application still
> reads back old/corrupted data.
>
> So this proposal reduces the race window but does not completely
> eliminate it.
>

That's true.

> We probably will have to sync upper/ and if there are no errors reported,
> then it should be ok to consume data back.
>
> This leads back to same issue of doing fsync/sync which we are trying
> to avoid with volatile containers. So we have two options.
>
> A. Build volatile containers should sync upper and then pack upper/ into
>   an image. if final sync returns error, throw away the container and
>   rebuild image. This will avoid intermediate fsync calls but does not
>   eliminate final syncfs requirement on upper. Now one can either choose
>   to do syncfs on upper/ or implement a more optimized syncfs through
>   overlay so that only selected dirty inodes are synced instead.
>
> B. Alternatively, live dangerously and know that it is possible that
>   writeback error happens and you read back corrupted data.
>

C. "shutdown" the filesystem if writeback errors happened and return
     EIO from any read, like some blockdev filesystems will do in face
     of metadata write errors

I happen to have a branch ready for that ;-)
https://github.com/amir73il/linux/commits/ovl-shutdown

Thanks,
Amir.


* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-17 15:24                 ` Amir Goldstein
@ 2020-11-17 15:40                   ` Vivek Goyal
  2020-11-17 16:46                   ` Vivek Goyal
  1 sibling, 0 replies; 34+ messages in thread
From: Vivek Goyal @ 2020-11-17 15:40 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Tue, Nov 17, 2020 at 05:24:33PM +0200, Amir Goldstein wrote:
> > > I guess if we change fsync and syncfs to do nothing but return
> > > error if any writeback error happened since mount we will be ok?
> >
> > I guess that will not be sufficient. Because overlay fsync/syncfs can
> > only return any error which has happened so far. It is still possible
> > that error happens right after this fsync call and application still
> > reads back old/corrupted data.
> >
> > So this proposal reduces the race window but does not completely
> > eliminate it.
> >
> 
> That's true.
> 
> > We probably will have to sync upper/ and if there are no errors reported,
> > then it should be ok to consume data back.
> >
> > This leads back to same issue of doing fsync/sync which we are trying
> > to avoid with volatile containers. So we have two options.
> >
> > A. Build volatile containers should sync upper and then pack upper/ into
> >   an image. if final sync returns error, throw away the container and
> >   rebuild image. This will avoid intermediate fsync calls but does not
> >   eliminate final syncfs requirement on upper. Now one can either choose
> >   to do syncfs on upper/ or implement a more optimized syncfs through
> >   overlay so that only selected dirty inodes are synced instead.
> >
> > B. Alternatively, live dangerously and know that it is possible that
> >   writeback error happens and you read back corrupted data.
> >
> 
> C. "shutdown" the filesystem if writeback errors happened and return
>      EIO from any read, like some blockdev filesystems will do in face
>      of metadata write errors
> 

Option C sounds interesting. If data writeback fails, shutdown overlay
filesystem and that way image build should fail, container manager
can throw away container and rebuild. And we avoid all the fsync/syncfs
as we wanted to.

> I happen to have a branch ready for that ;-)
> https://github.com/amir73il/linux/commits/ovl-shutdown

I will check it out.

Thanks
Vivek



* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-17 15:24                 ` Amir Goldstein
  2020-11-17 15:40                   ` Vivek Goyal
@ 2020-11-17 16:46                   ` Vivek Goyal
  2020-11-17 18:03                     ` Amir Goldstein
  1 sibling, 1 reply; 34+ messages in thread
From: Vivek Goyal @ 2020-11-17 16:46 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Tue, Nov 17, 2020 at 05:24:33PM +0200, Amir Goldstein wrote:
> > > I guess if we change fsync and syncfs to do nothing but return
> > > error if any writeback error happened since mount we will be ok?
> >
> > I guess that will not be sufficient. Because overlay fsync/syncfs can
> > only return any error which has happened so far. It is still possible
> > that error happens right after this fsync call and application still
> > reads back old/corrupted data.
> >
> > So this proposal reduces the race window but does not completely
> > eliminate it.
> >
> 
> That's true.
> 
> > We probably will have to sync upper/ and if there are no errors reported,
> > then it should be ok to consume data back.
> >
> > This leads back to same issue of doing fsync/sync which we are trying
> > to avoid with volatile containers. So we have two options.
> >
> > A. Build volatile containers should sync upper and then pack upper/ into
> >   an image. if final sync returns error, throw away the container and
> >   rebuild image. This will avoid intermediate fsync calls but does not
> >   eliminate final syncfs requirement on upper. Now one can either choose
> >   to do syncfs on upper/ or implement a more optimized syncfs through
> >   overlay so that only selected dirty inodes are synced instead.
> >
> > B. Alternatively, live dangerously and know that it is possible that
> >   writeback error happens and you read back corrupted data.
> >
> 
> C. "shutdown" the filesystem if writeback errors happened and return
>      EIO from any read, like some blockdev filesystems will do in face
>      of metadata write errors
> 
> I happen to have a branch ready for that ;-)
> https://github.com/amir73il/linux/commits/ovl-shutdown


This branch seems to implement shutdown ioctl. So it will still need
glue code to detect writeback failure in upper/ and trigger shutdown
internally?

And if that works, then Sargun's patches can fit in nicely on top which 
detect writeback failures on remount and will shutdown fs.

Thanks
Vivek



* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-16 22:14             ` Sargun Dhillon
  2020-11-17  5:41               ` Amir Goldstein
@ 2020-11-17 17:05               ` Vivek Goyal
  1 sibling, 0 replies; 34+ messages in thread
From: Vivek Goyal @ 2020-11-17 17:05 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Amir Goldstein, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Mon, Nov 16, 2020 at 10:14:02PM +0000, Sargun Dhillon wrote:
> On Mon, Nov 16, 2020 at 04:26:44PM -0500, Vivek Goyal wrote:
> > On Mon, Nov 16, 2020 at 10:18:03PM +0200, Amir Goldstein wrote:
> > > On Mon, Nov 16, 2020 at 6:36 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > On Mon, Nov 16, 2020 at 05:20:04PM +0200, Amir Goldstein wrote:
> > > > > On Mon, Nov 16, 2020 at 4:42 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > >
> > > > > > On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> > > > > > > Overlayfs added the ability to setup mounts where all syncs could be
> > > > > > > short-circuited in (2a99ddacee43: ovl: provide a mount option "volatile").
> > > > > > >
> > > > > > > A user might want to remount this fs, but we do not let the user because
> > > > > > > of the "incompat" detection feature. In the case of volatile, it is safe
> > > > > > > to do something like[1]:
> > > > > > >
> > > > > > > $ sync -f /root/upperdir
> > > > > > > $ rm -rf /root/workdir/incompat/volatile
> > > > > > >
> > > > > > > There are two ways to go about this. You can call sync on the underlying
> > > > > > > filesystem, check the error code, and delete the dirty file if everything
> > > > > > > is clean. If you're running lots of containers on the same filesystem, or
> > > > > > > you want to avoid all unnecessary I/O, this may be suboptimal.
> > > > > > >
> > > > > >
> > > > > > Hi Sargun,
> > > > > >
> > > > > > I had asked bunch of questions in previous mail thread to be more
> > > > > > clear on your requirements but never got any response. It would
> > > > > > have helped understanding your requirements better.
> > > > > >
> > > > > > How about following patch set which seems to sync only dirty inodes of
> > > > > > upper belonging to a particular overlayfs instance.
> > > > > >
> > > > > > https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/
> > > > > >
> > > > > > So if we could implement a mount option which ignores fsync but upon
> > > > > > syncfs, only syncs dirty inodes of that overlayfs instance, it will
> > > > > > make sure we are not syncing whole of the upper fs. And we could
> > > > > > do this syncing on unmount of overlayfs and remove dirty file upon
> > > > > > successful sync.
> > > > > >
> > > > > > Looks like this will be much simpler method and should be able to
> > > > > > meet your requirements (As long as you are fine with syncing dirty
> > > > > > upper inodes of this overlay instance on unmount).
> > > > > >
> > > > >
> > > > > Do note that the latest patch set by Chengguang not only syncs dirty
> > > > > inodes of this overlay instance, but also waits for in-flight writeback on
> > > > > all the upper fs inodes and I think that with !ovl_should_sync(ofs)
> > > > > we will not re-dirty the ovl inodes and lose track of the list of dirty
> > > > > inodes - maybe that can be fixed.
> > > > >
> > > > > Also, I am not sure anymore that we can safely remove the dirty file after
> > > > > sync dirty inodes sync_fs and umount. If someone did sync_fs before us
> > > > > and consumed the error, we may have a copied up file in upper whose
> > > > > data is not on disk, but when we sync_fs on unmount we won't get an
> > > > > error? not sure.
> > > >
> > > > May be we can save errseq_t when mounting overlay and compare with
> > > > errseq_t stored in upper sb after unmount. That will tell us whether
> > > > error has happened since we mounted overlay. (Similar to what Sargun
> > > > is doing).
> > > >
> > > 
> > > I suppose so.
> > > 
> > > > In fact, if this is a concern, we have this issue with user space
> > > > "sync <upper>" too? Other sync might fail and this one succeeds
> > > > and we will think upper is just fine. May be container tools can
> > > > keep a file/dir open at the time of mount and call syncfs using
> > > > that fd instead. (And that should catch errors since that fd
> > > > was opened, I am assuming).
> > > >
> > > 
> > > Did not understand the problem with userspace sync.
> > > 
> > > > >
> > > > > I am less concerned about ways to allow re-mount of volatile
> > > > > overlayfs than I am about turning volatile overlayfs into non-volatile.
> > > >
> > > > If we are not interested in converting volatile containers into
> > > > non-volatile, then whole point of these patch series is to detect
> > > > if any writeback error has happened or not. If writeback error has
> > > > happened, then we detect that at remount and possibly throw away
> > > > container.
> > > >
> > > > What happens today if writeback error has happened. Is that page thrown
> > > > away from page cache and read back from disk? IOW, will user lose
> > > > the data it had written in page cache because writeback failed. I am
> > > > assuming we can't keep the dirty page around for very long otherwise
> > > > it has potential to fill up all the available ram with dirty pages which
> > > > can't be written back.
> > > >
> > > 
> > > Right. the resulting data is undefined after error.
> > > 
> > > > Why is it important to detect writeback error only during remount. What
> > > > happens if container overlay instance is already mounted and writeback
> > > > error happens. We will not detect that, right?
> > > >
> > > > IOW, if capturing writeback error is important for volatile containers,
> > > > then capturing it only during remount time is not enough. Normally
> > > > fsync/syncfs should catch it and now we have skipped those, so in
> > > > the process we lost mechanism to detect writeback errors for
> > > > volatile containers?
> > > >
> > > 
> > > Yes, you are right.
> > > It's an issue with volatile that we should probably document.
> > > 
> > > I think upper files data can "evaporate" even as the overlay is still mounted.
> > 
> > I think assumption of volatile containers was that data will remain
> > valid as long as machine does not crash/shutdown. We missed the case
> > of possibility of writeback errors during those discussions. 
> > 
> > And if data can evaporate without any way to know that something
> > has gone wrong, I don't know how that's useful for applications.
> > 
> > Also, first we need to fix the case of writeback error handling
> > for volatile containers while it is mounted before one tries to fix it
> > for writeback error detection during remount, IMHO.
> > 
> > Thanks
> > Vivek
> > 
> 
> I feel like this is an infamous Linux problem, and lots[1][2][3][4] has been said
> on the topic, and there's not really a general purpose solution to it. I think that
> most filesystems offer a choice of "continue" or "fail-stop" (readonly), and if
> the upperdir lives on that filesystem, we will get the feedback from it.

In case of fsync/writeback data failures, we will not hear anything back.
Only mechanism to know about failure seems to be fsync()/syncfs() and
we disable both in overlayfs. So that alone is not enough. For overlay
volatile mode, we need another way to deal with writeback failures
in upper/, IIUC.

> 
> I can respin my patch with just the "boot id" and superblock ID check if folks
> are fine with that, and we can figure out how to resolve the writeback issues
> later.

Keeping track of "boot id" and removing incompat/volatile automatically
if boot id is same, just moves processing from user space to kernel
space. But user space tools can do the same thing as well. So I am
not sure why not teach user space tools to manage incompat/volatile
directory.

Having said that, I am not opposed to the idea of keeping track of "boot id"
in kernel and removing incompat/volatile automatically on next mount if
boot id is same.
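
As a rough illustration of the user space variant mentioned above, a
container tool could record the boot id at mount time and compare it before
clearing incompat/volatile. This is only a sketch: the marker file and the
function names are hypothetical, /proc/sys/kernel/random/boot_id is the
usual source of the boot id on Linux, and the check says nothing about
writeback errors, which is the separate problem discussed in this thread.

```python
import os
import shutil
import tempfile

BOOT_ID_PATH = "/proc/sys/kernel/random/boot_id"


def read_boot_id(path=BOOT_ID_PATH):
    """Return the kernel's random boot id for the current boot."""
    with open(path) as f:
        return f.read().strip()


def record_boot_id(marker, boot_id):
    """Store the boot id in a marker file when the overlay is mounted."""
    with open(marker, "w") as f:
        f.write(boot_id + "\n")


def same_boot(marker, boot_id):
    """True only if the marker exists and matches the given boot id."""
    try:
        with open(marker) as f:
            return f.read().strip() == boot_id
    except FileNotFoundError:
        return False


def may_clear_volatile(marker, boot_id, volatile_dir):
    """Remove incompat/volatile only if the machine has not rebooted.

    Note: this does NOT detect writeback errors on the upper filesystem.
    """
    if not same_boot(marker, boot_id):
        return False
    if os.path.isdir(volatile_dir):
        shutil.rmtree(volatile_dir)
    return True
```

On a real system the tool would pass read_boot_id() as boot_id both when
recording and when checking.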

Thanks
Vivek


> 
> [1]: https://lwn.net/Articles/752063/
> [2]: https://lwn.net/Articles/724307/
> [3]: https://www.usenix.org/system/files/atc20-rebello.pdf
> [4]: https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
> 



* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-17 16:46                   ` Vivek Goyal
@ 2020-11-17 18:03                     ` Amir Goldstein
  2020-11-17 18:29                       ` Vivek Goyal
  0 siblings, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2020-11-17 18:03 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

> > C. "shutdown" the filesystem if writeback errors happened and return
> >      EIO from any read, like some blockdev filesystems will do in face
> >      of metadata write errors
> >
> > I happen to have a branch ready for that ;-)
> > https://github.com/amir73il/linux/commits/ovl-shutdown
>
>
> This branch seems to implement shutdown ioctl. So it will still need
> glue code to detect writeback failure in upper/ and trigger shutdown
> internally?
>

Yes.
ovl_get_access() can check both the administrative ofs->goingdown
command and the upper writeback error condition for volatile ovl
or something like that.

> And if that works, then Sargun's patches can fit in nicely on top which
> detect writeback failures on remount and will shutdown fs.
>

Not sure why remount needs to shutdown. It needs to fail mount,
but yeh, all those things should fit nicely together.

Thanks,
Amir.


* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-17 18:03                     ` Amir Goldstein
@ 2020-11-17 18:29                       ` Vivek Goyal
  2020-11-18  7:24                         ` Amir Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Vivek Goyal @ 2020-11-17 18:29 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Tue, Nov 17, 2020 at 08:03:16PM +0200, Amir Goldstein wrote:
> > > C. "shutdown" the filesystem if writeback errors happened and return
> > >      EIO from any read, like some blockdev filesystems will do in face
> > >      of metadata write errors
> > >
> > > I happen to have a branch ready for that ;-)
> > > https://github.com/amir73il/linux/commits/ovl-shutdown
> >
> >
> > This branch seems to implement shutdown ioctl. So it will still need
> > glue code to detect writeback failure in upper/ and trigger shutdown
> > internally?
> >
> 
> Yes.
> ovl_get_access() can check both the administrative ofs->goingdown
> command and the upper writeback error condition for volatile ovl
> or something like that.

This approach will not help mmap()ed pages though, if I do:

- Store to addr
- msync
- Load from addr

There is a chance that I can still read back old data.

> 
> > And if that works, then Sargun's patches can fit in nicely on top which
> > detect writeback failures on remount and will shutdown fs.
> >
> 
> Not sure why remount needs to shutdown. It needs to fail mount,
> but yeh, all those things should fit nicely together.

Agreed. mount/remount can just fail in that case.

Thanks
Vivek



* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-17 18:29                       ` Vivek Goyal
@ 2020-11-18  7:24                         ` Amir Goldstein
  2020-11-18  8:27                           ` Sargun Dhillon
  2020-11-18 14:55                           ` Vivek Goyal
  0 siblings, 2 replies; 34+ messages in thread
From: Amir Goldstein @ 2020-11-18  7:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Tue, Nov 17, 2020 at 8:29 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Nov 17, 2020 at 08:03:16PM +0200, Amir Goldstein wrote:
> > > > C. "shutdown" the filesystem if writeback errors happened and return
> > > >      EIO from any read, like some blockdev filesystems will do in face
> > > >      of metadata write errors
> > > >
> > > > I happen to have a branch ready for that ;-)
> > > > https://github.com/amir73il/linux/commits/ovl-shutdown
> > >
> > >
> > > This branch seems to implement shutdown ioctl. So it will still need
> > > glue code to detect writeback failure in upper/ and trigger shutdown
> > > internally?
> > >
> >
> > Yes.
> > ovl_get_access() can check both the administrative ofs->goingdown
> > command and the upper writeback error condition for volatile ovl
> > or something like that.
>
> This approach will not help mmap()ed pages though, if I do:
>
> - Store to addr
> - msync
> - Load from addr
>
> There is a chance that I can still read back old data.
>

msync does not go through overlay. It goes directly to upper fs,
so it will sync pages and return error on volatile overlay as well.
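
The store / msync / load sequence in question can be written out from user
space as below. This is only an illustrative sketch with a hypothetical
function name. It relies on Python's mmap.flush(), which calls msync(2) and
raises OSError on failure, so an error returned by the upper fs would show
up at the flush step before the data is read back.

```python
import mmap
import os
import tempfile


def store_msync_load(path, offset, data):
    """Store bytes through a shared mapping, msync, then load them back.

    mmap.flush() calls msync(2) and raises OSError if it fails, so a
    writeback error reported by the underlying filesystem surfaces there.
    """
    with open(path, "r+b") as f:
        # On Unix the default mapping is shared and read/write.
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[offset:offset + len(data)] = data          # store to addr
            mm.flush()                                    # msync
            return bytes(mm[offset:offset + len(data)])   # load from addr
```

The file must already be at least offset + len(data) bytes long, since a
mapping cannot extend the file.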

Maybe there will still be weird corner cases, but the shutdown approach
should cover most or all of the interesting cases.

Thanks,
Amir.


* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-18  7:24                         ` Amir Goldstein
@ 2020-11-18  8:27                           ` Sargun Dhillon
  2020-11-18 10:46                             ` Amir Goldstein
  2020-11-18 14:55                           ` Vivek Goyal
  1 sibling, 1 reply; 34+ messages in thread
From: Sargun Dhillon @ 2020-11-18  8:27 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Vivek Goyal, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Wed, Nov 18, 2020 at 09:24:04AM +0200, Amir Goldstein wrote:
> On Tue, Nov 17, 2020 at 8:29 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Tue, Nov 17, 2020 at 08:03:16PM +0200, Amir Goldstein wrote:
> > > > > C. "shutdown" the filesystem if writeback errors happened and return
> > > > >      EIO from any read, like some blockdev filesystems will do in face
> > > > >      of metadata write errors
> > > > >
> > > > > I happen to have a branch ready for that ;-)
> > > > > https://github.com/amir73il/linux/commits/ovl-shutdown
> > > >
> > > >
> > > > This branch seems to implement shutdown ioctl. So it will still need
> > > > glue code to detect writeback failure in upper/ and trigger shutdown
> > > > internally?
> > > >
> > >
> > > Yes.
> > > ovl_get_access() can check both the administrative ofs->goingdown
> > > command and the upper writeback error condition for volatile ovl
> > > or something like that.
> >
> > This approach will not help mmap()ed pages though, if I do:
> >
> > - Store to addr
> > - msync
> > - Load from addr
> >
> > There is a chance that I can still read back old data.
> >
> 
> msync does not go through overlay. It goes directly to upper fs,
> so it will sync pages and return error on volatile overlay as well.
> 
> Maybe there will still be weird corner cases, but the shutdown approach
> should cover most or all of the interesting cases.

When would we check the errseq_t of the upperdir? Only when the user
calls fsync, or upon close? Periodically?

> 
> Thanks,
> Amir.

We can tackle this later, but I suggest the following semantics, which
follow how ext4 works:

https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
errors=remount-ro	Remount the filesystem read-only on an error.
errors=continue		Keep going on a filesystem error.
[Sargun: We probably don't want this one]
errors=panic		Panic and halt the machine if an error occurs.
                        (These mount options override the errors behavior
                        specified in the superblock, which can be configured
                        using tune2fs)

----
We can potentially add a fourth option, which is shutdown -- that would
return something like EIO or ESHUTDOWN for all calls.

In addition to that, we should pass through the right errseqs to make
the errseq helpers work:
int filemap_check_wb_err(struct address_space *mapping, errseq_t since) [1]
errseq_t filemap_sample_wb_err(struct address_space *mapping) [2]
errseq_t file_sample_sb_err(struct file *file) [3]

etc...

These are used by the VFS layer to check for errors after syncfs or for 
interactions with mapped files. 

[1]: https://elixir.bootlin.com/linux/v5.9.7/source/include/linux/fs.h#L2665
[2]: https://elixir.bootlin.com/linux/v5.9.7/source/include/linux/fs.h#L2688
[3]: https://elixir.bootlin.com/linux/v5.9.7/source/include/linux/fs.h#L2700


* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-18  8:27                           ` Sargun Dhillon
@ 2020-11-18 10:46                             ` Amir Goldstein
  0 siblings, 0 replies; 34+ messages in thread
From: Amir Goldstein @ 2020-11-18 10:46 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Vivek Goyal, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Wed, Nov 18, 2020 at 10:27 AM Sargun Dhillon <sargun@sargun.me> wrote:
>
> On Wed, Nov 18, 2020 at 09:24:04AM +0200, Amir Goldstein wrote:
> > On Tue, Nov 17, 2020 at 8:29 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Tue, Nov 17, 2020 at 08:03:16PM +0200, Amir Goldstein wrote:
> > > > > > C. "shutdown" the filesystem if writeback errors happened and return
> > > > > >      EIO from any read, like some blockdev filesystems will do in face
> > > > > >      of metadata write errors
> > > > > >
> > > > > > I happen to have a branch ready for that ;-)
> > > > > > https://github.com/amir73il/linux/commits/ovl-shutdown
> > > > >
> > > > >
> > > > > This branch seems to implement shutdown ioctl. So it will still need
> > > > > glue code to detect writeback failure in upper/ and trigger shutdown
> > > > > internally?
> > > > >
> > > >
> > > > Yes.
> > > > ovl_get_access() can check both the administrative ofs->goingdown
> > > > command and the upper writeback error condition for volatile ovl
> > > > or something like that.
> > >
> > > This approach will not help mmap()ed pages though, if I do:
> > >
> > > - Store to addr
> > > - msync
> > > - Load from addr
> > >
> > > There is a chance that I can still read back old data.
> > >
> >
> > msync does not go through overlay. It goes directly to upper fs,
> > so it will sync pages and return error on volatile overlay as well.
> >
> > Maybe there will still be weird corner cases, but the shutdown approach
> > should cover most or all of the interesting cases.
> When would we check the errseq_t of the upperdir? Only when the user
> calls fsync, or upon close? Periodically?
>

Ideally, if it is not too costly, on every "access".

The ovl-shutdown branch adds a ovl_get_access() call before access to any
overlay object.
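
The "check on every access" idea can be modelled in a few lines. This is a
toy user space model, not the code in the ovl-shutdown branch: a plain
counter stands in for the upper fs errseq_t, and the class and method names
are made up for illustration.

```python
import errno


class UpperFs:
    """Toy stand-in for the upper filesystem's writeback error counter."""

    def __init__(self):
        self.wb_err_seq = 0

    def fail_writeback(self):
        self.wb_err_seq += 1


class VolatileOverlay:
    """Toy model: sample the upper error counter at mount, check on access."""

    def __init__(self, upper):
        self.upper = upper
        self.mount_seq = upper.wb_err_seq  # sampled once, at mount time
        self.goingdown = False             # administrative shutdown flag
        self.files = {}

    def get_access(self):
        # Hypothetical analogue of the ovl_get_access() check: refuse access
        # after an administrative shutdown or an upper writeback error.
        if self.goingdown or self.upper.wb_err_seq != self.mount_seq:
            raise OSError(errno.EIO, "overlay is shut down")

    def write(self, name, data):
        self.get_access()
        self.files[name] = data

    def read(self, name):
        self.get_access()
        return self.files[name]
```

In this model the comparison is a single integer check per access, which is
why doing it on every access may be cheap enough.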

> >
> > Thanks,
> > Amir.
>
> We can tackle this later, but I suggest the following semantics, which
> follow how ext4 works:
>
> https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
> errors=remount-ro       Remount the filesystem read-only on an error.
> errors=continue         Keep going on a filesystem error.
> [Sargun: We probably don't want this one]
> errors=panic            Panic and halt the machine if an error occurs.
>                         (These mount options override the errors behavior
>                         specified in the superblock, which can be configured
>                         using tune2fs)

None of these modes seem relevant to volatile overlay IMO.

>
> ----
> We can potentially add a fourth option, which is shutdown -- that would
> return something like EIO or ESHUTDOWN for all calls.
>

FWIW, that's the only mode XFS supports.

> In addition to that, we should pass through the right errseqs to make
> the errseq helpers work:
> int filemap_check_wb_err(struct address_space *mapping, errseq_t since) [1]
> errseq_t filemap_sample_wb_err(struct address_space *mapping) [2]
> errseq_t file_sample_sb_err(struct file *file)
>

Are you referring to volatile or non-volatile overlayfs?

For fsync, because every overlay file has a "shadow" real file,
I think errseq of overlayfs file should already reflect the correct state
of the errseq of the real file.

For syncfs, we should record the errseq of upper fs on mount, as your
patch did.

For volatile overlay, syncfs should fail permanently if there was a writeback
error since mount, not only once, so there is no reason to update the
errseq on the overlay sb? It is not like one syncfs can observe an error and
in the next it will be gone.

For non-volatile overlay, we probably need to report syncfs error once if
upper fs errseq is bigger than ovl sb errseq and advance ovl sb errseq.
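
The two syncfs behaviours described above can be modelled as follows. This
is a toy sketch under simplifying assumptions: a plain counter stands in
for errseq_t, "bigger than" is approximated by "different from", the actual
syncing of the upper fs is left out, and all names are hypothetical.

```python
import errno


class UpperSb:
    """Toy upper superblock with a monotonically increasing error counter."""

    def __init__(self):
        self.wb_err = 0

    def record_writeback_error(self):
        self.wb_err += 1


class OverlaySb:
    """Toy overlay superblock that records the upper counter at mount."""

    def __init__(self, upper, volatile):
        self.upper = upper
        self.volatile = volatile
        self.errseq = upper.wb_err  # recorded at mount

    def syncfs(self):
        """Return 0 on success or a negative errno."""
        if self.upper.wb_err == self.errseq:
            return 0
        if not self.volatile:
            # Non-volatile: report the error once, then advance.
            self.errseq = self.upper.wb_err
        # Volatile: never advance, so every later syncfs fails as well.
        return -errno.EIO
```

With this model a volatile overlay keeps failing syncfs after the first
writeback error, while a non-volatile overlay reports it once and then
succeeds again until the next error.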

Thanks,
Amir.


* Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
  2020-11-18  7:24                         ` Amir Goldstein
  2020-11-18  8:27                           ` Sargun Dhillon
@ 2020-11-18 14:55                           ` Vivek Goyal
  1 sibling, 0 replies; 34+ messages in thread
From: Vivek Goyal @ 2020-11-18 14:55 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Daniel J Walsh, David Howells, linux-fsdevel,
	Chengguang Xu

On Wed, Nov 18, 2020 at 09:24:04AM +0200, Amir Goldstein wrote:
> On Tue, Nov 17, 2020 at 8:29 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Tue, Nov 17, 2020 at 08:03:16PM +0200, Amir Goldstein wrote:
> > > > > C. "shutdown" the filesystem if writeback errors happened and return
> > > > >      EIO from any read, like some blockdev filesystems will do in face
> > > > >      of metadata write errors
> > > > >
> > > > > I happen to have a branch ready for that ;-)
> > > > > https://github.com/amir73il/linux/commits/ovl-shutdown
> > > >
> > > >
> > > > This branch seems to implement shutdown ioctl. So it will still need
> > > > glue code to detect writeback failure in upper/ and trigger shutdown
> > > > internally?
> > > >
> > >
> > > Yes.
> > > ovl_get_acess() can check both the administrative ofs->goingdown
> > > command and the upper writeback error condition for volatile ovl
> > > or something like that.
> >
> > This approach will not help mmaped() pages though, if I do.
> >
> > - Store to addr
> > - msync
> > - Load from addr
> >
> > There is a chance that I can still read back old data.
> >
> 
> msync does not go through overlay. It goes directly to upper fs,
> so it will sync pages and return error on volatile overlay as well.

Ok. It's because vma->vm_file points to realfile.

So even for volatile containers we only avoid fsync/syncfs and not msync.
msync will directly call into upper/. 
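
For reference, the store / msync / load sequence discussed above looks
roughly like this from userspace. This is a minimal sketch: the helper name
store_msync_load() and the path passed to it are hypothetical, and it only
shows the call sequence on a healthy filesystem. It does not reproduce a
writeback error.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one 4096-byte region of the file at `path`, store a byte, msync it,
 * and load the byte back. Returns 0 if msync succeeded and the loaded
 * value matches the stored one, -1 on any failure.
 */
static int store_msync_load(const char *path)
{
	const size_t len = 4096;
	unsigned char *addr;
	int fd, rc, ret;

	assert(path != NULL);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return -1;
	}

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);		/* the mapping keeps the file referenced */
	if (addr == MAP_FAILED)
		return -1;

	addr[0] = 0x5a;				/* store to addr */
	rc = msync(addr, len, MS_SYNC);		/* msync */
	ret = (rc == 0 && addr[0] == 0x5a) ? 0 : -1;	/* load from addr */

	munmap(addr, len);
	return ret;
}
```

On an overlay mount the file behind the mapping is the real file, so a
failing msync() here would be reported by the upper filesystem whether or
not the overlay is volatile.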

> 
> Maybe there will still be weird corner cases, but the shutdown approach
> should cover most or all of the interesting cases.

Agreed.

Vivek



Thread overview: 34+ messages
2020-11-16  4:57 [RFC PATCH 0/3] Make overlayfs volatile mounts reusable Sargun Dhillon
2020-11-16  4:57 ` [RFC PATCH 1/3] fs: Add s_instance_id field to superblock for unique identification Sargun Dhillon
2020-11-16  5:07   ` Sargun Dhillon
2020-11-16  4:57 ` [RFC PATCH 2/3] overlay: Add ovl_do_getxattr helper Sargun Dhillon
2020-11-16 11:00   ` Amir Goldstein
2020-11-16  4:57 ` [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon
2020-11-16  9:31   ` Amir Goldstein
2020-11-16 10:30     ` Sargun Dhillon
2020-11-16 11:17       ` Amir Goldstein
2020-11-16 12:52         ` Amir Goldstein
2020-11-16 14:42   ` Vivek Goyal
2020-11-16 14:45     ` Vivek Goyal
2020-11-16 15:20     ` Amir Goldstein
2020-11-16 16:36       ` Vivek Goyal
2020-11-16 18:25         ` Sargun Dhillon
2020-11-16 19:27           ` Vivek Goyal
2020-11-16 20:18         ` Amir Goldstein
2020-11-16 21:09           ` Vivek Goyal
2020-11-17  5:33             ` Amir Goldstein
2020-11-17 14:48               ` Vivek Goyal
2020-11-17 15:24                 ` Amir Goldstein
2020-11-17 15:40                   ` Vivek Goyal
2020-11-17 16:46                   ` Vivek Goyal
2020-11-17 18:03                     ` Amir Goldstein
2020-11-17 18:29                       ` Vivek Goyal
2020-11-18  7:24                         ` Amir Goldstein
2020-11-18  8:27                           ` Sargun Dhillon
2020-11-18 10:46                             ` Amir Goldstein
2020-11-18 14:55                           ` Vivek Goyal
2020-11-16 21:26           ` Vivek Goyal
2020-11-16 22:14             ` Sargun Dhillon
2020-11-17  5:41               ` Amir Goldstein
2020-11-17 17:05               ` Vivek Goyal
2020-11-16 17:38     ` Sargun Dhillon
