linux-unionfs.vger.kernel.org archive mirror
* [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
@ 2020-08-31 18:15 Vivek Goyal
  2020-09-01  8:22 ` Amir Goldstein
  2020-11-06 17:58 ` Sargun Dhillon
  0 siblings, 2 replies; 20+ messages in thread
From: Vivek Goyal @ 2020-08-31 18:15 UTC (permalink / raw)
  To: linux-unionfs, miklos; +Cc: Amir Goldstein, Giuseppe Scrivano, Daniel J Walsh

Container folks are complaining that dnf/yum issues too many syncs while
installing packages, and this slows down image builds. Their build
requirements are such that they don't care if a node goes down while a
build is still in progress; in that case, they simply throw away the
unfinished layer and start a new build. So they don't care about syncing
intermediate state to disk and hence don't want to pay the price
associated with sync.

So they are asking for a mount option that disables sync on the overlay
mount point.

They primarily seem to have two use cases.

- For building images, they will mount overlay with the "volatile" option,
  then sync the upper layer after unmounting overlay, and reuse the upper
  layer as the lower layer of the next layer.

- For running containers, they don't seem to care about syncing the upper
  layer, because if the node goes down they will simply throw away the
  upper layer and create a fresh one.

So this patch provides a mount option "volatile" which disables all forms
of sync. It is now the caller's responsibility to throw away the upper
layer and start fresh if the system crashes or shuts down.

With "volatile", I am seeing roughly a 20% speed-up in my VM where I am
just installing emacs in an image. Installation time drops from 31 seconds
to 25 seconds when the "volatile" option is used. This is for the case of
building on top of an image where all packages are already cached, which
takes network latency out of the measurement.

Giuseppe is also looking to cut down on the number of iops done on the
disk. He points out that in the cloud, VMs are often throttled if they
cross their iops limit. This option can help by reducing the number of
iops (by cutting down on frequent syncs and writebacks).

Changes from v6:
- Got rid of the logic to check for the volatile/dirty file. Now Amir's
  patch checks for the presence of the incompat/volatile directory and
  errors out if present. The user is now required to remove the volatile
  directory. (Amir)

Changes from v5:
- Added support to detect that previous overlay was mounted with
  "volatile" option and fail mount. (Miklos and Amir).

Changes from v4:
- Dropped support for sync=fs (Miklos)
- Renamed "sync=off" to "volatile". (Miklos)

Changes from v3:
- Used only enums and dropped bit flags (Amir Goldstein)
- Dropped error when conflicting sync options provided. (Amir Goldstein)

Changes from v2:
- Added helper functions (Amir Goldstein)
- Used enums to keep sync state (Amir Goldstein)

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 Documentation/filesystems/overlayfs.rst | 17 +++++
 fs/overlayfs/copy_up.c                  | 12 ++--
 fs/overlayfs/file.c                     | 10 ++-
 fs/overlayfs/ovl_entry.h                |  6 ++
 fs/overlayfs/readdir.c                  |  3 +
 fs/overlayfs/super.c                    | 88 ++++++++++++++++++++++++-
 6 files changed, 128 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 8ea83a51c266..b33465fdf260 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -563,6 +563,23 @@ This verification may cause significant overhead in some cases.
 Note: the mount options index=off,nfs_export=on are conflicting for a
 read-write mount and will result in an error.
 
+Disable sync
+------------
+By default, overlay skips sync on files residing on a lower layer.  It
+is possible to skip sync operations for files on the upper layer as well
+with the "volatile" mount option.
+
+The "volatile" mount option disables all forms of sync from overlay,
+including the one done at umount/remount time. If the system crashes or
+shuts down, the user should throw away the upper directory and start fresh.
+
+When overlay is mounted with the "volatile" option, overlay creates an
+internal directory "$workdir/work/incompat/volatile". On the next mount,
+overlay checks for this directory and refuses to mount if it is present.
+This is a strong indicator that the user should throw away the upper and
+work directories and create fresh ones. In the very limited cases where
+the user knows the system has not crashed and the contents of upperdir
+are intact, one can remove the "volatile" directory and retry the mount.
 
 Testsuite
 ---------
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index d07fb92b7253..9d17e42d184b 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -128,7 +128,8 @@ int ovl_copy_xattr(struct dentry *old, struct dentry *new)
 	return error;
 }
 
-static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
+static int ovl_copy_up_data(struct ovl_fs *ofs, struct path *old,
+			    struct path *new, loff_t len)
 {
 	struct file *old_file;
 	struct file *new_file;
@@ -218,7 +219,7 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
 		len -= bytes;
 	}
 out:
-	if (!error)
+	if (!error && ovl_should_sync(ofs))
 		error = vfs_fsync(new_file, 0);
 	fput(new_file);
 out_fput:
@@ -484,6 +485,7 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
 
 static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
 {
+	struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
 	int err;
 
 	/*
@@ -499,7 +501,8 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
 		upperpath.dentry = temp;
 
 		ovl_path_lowerdata(c->dentry, &datapath);
-		err = ovl_copy_up_data(&datapath, &upperpath, c->stat.size);
+		err = ovl_copy_up_data(ofs, &datapath, &upperpath,
+				       c->stat.size);
 		if (err)
 			return err;
 	}
@@ -784,6 +787,7 @@ static bool ovl_need_meta_copy_up(struct dentry *dentry, umode_t mode,
 /* Copy up data of an inode which was copied up metadata only in the past. */
 static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
 {
+	struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
 	struct path upperpath, datapath;
 	int err;
 	char *capability = NULL;
@@ -804,7 +808,7 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
 			goto out;
 	}
 
-	err = ovl_copy_up_data(&datapath, &upperpath, c->stat.size);
+	err = ovl_copy_up_data(ofs, &datapath, &upperpath, c->stat.size);
 	if (err)
 		goto out_free;
 
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 0d940e29d62b..3582c3ae819c 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -331,6 +331,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	struct fd real;
 	const struct cred *old_cred;
 	ssize_t ret;
+	int ifl = iocb->ki_flags;
 
 	if (!iov_iter_count(iter))
 		return 0;
@@ -346,11 +347,14 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	if (ret)
 		goto out_unlock;
 
+	if (!ovl_should_sync(OVL_FS(inode->i_sb)))
+		ifl &= ~(IOCB_DSYNC | IOCB_SYNC);
+
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	if (is_sync_kiocb(iocb)) {
 		file_start_write(real.file);
 		ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
-				     ovl_iocb_to_rwf(iocb->ki_flags));
+				     ovl_iocb_to_rwf(ifl));
 		file_end_write(real.file);
 		/* Update size */
 		ovl_copyattr(ovl_inode_real(inode), inode);
@@ -370,6 +374,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 		real.flags = 0;
 		aio_req->orig_iocb = iocb;
 		kiocb_clone(&aio_req->iocb, iocb, real.file);
+		aio_req->iocb.ki_flags = ifl;
 		aio_req->iocb.ki_complete = ovl_aio_rw_complete;
 		ret = vfs_iocb_iter_write(real.file, &aio_req->iocb, iter);
 		if (ret != -EIOCBQUEUED)
@@ -433,6 +438,9 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	const struct cred *old_cred;
 	int ret;
 
+	if (!ovl_should_sync(OVL_FS(file_inode(file)->i_sb)))
+		return 0;
+
 	ret = ovl_real_fdget_meta(file, &real, !datasync);
 	if (ret)
 		return ret;
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index b429c80879ee..1b5a2094df8e 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -17,6 +17,7 @@ struct ovl_config {
 	bool nfs_export;
 	int xino;
 	bool metacopy;
+	bool ovl_volatile;
 };
 
 struct ovl_sb {
@@ -90,6 +91,11 @@ static inline struct ovl_fs *OVL_FS(struct super_block *sb)
 	return (struct ovl_fs *)sb->s_fs_info;
 }
 
+static inline bool ovl_should_sync(struct ovl_fs *ofs)
+{
+	return !ofs->config.ovl_volatile;
+}
+
 /* private information held for every overlayfs dentry */
 struct ovl_entry {
 	union {
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 683c6f27ab77..f50a9f20e72d 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -863,6 +863,9 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
 	if (!OVL_TYPE_UPPER(ovl_path_type(dentry)))
 		return 0;
 
+	if (!ovl_should_sync(OVL_FS(dentry->d_sb)))
+		return 0;
+
 	/*
 	 * Need to check if we started out being a lower dir, but got copied up
 	 */
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 3cd47e4b2eae..f0f7ad8da4be 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -264,6 +264,8 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
 	if (!ovl_upper_mnt(ofs))
 		return 0;
 
+	if (!ovl_should_sync(ofs))
+		return 0;
 	/*
 	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
 	 * All the super blocks will be iterated, including upper_sb.
@@ -362,6 +364,8 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
 	if (ofs->config.metacopy != ovl_metacopy_def)
 		seq_printf(m, ",metacopy=%s",
 			   ofs->config.metacopy ? "on" : "off");
+	if (ofs->config.ovl_volatile)
+		seq_printf(m, ",volatile");
 	return 0;
 }
 
@@ -376,9 +380,11 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
 
 	if (*flags & SB_RDONLY && !sb_rdonly(sb)) {
 		upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
-		down_read(&upper_sb->s_umount);
-		ret = sync_filesystem(upper_sb);
-		up_read(&upper_sb->s_umount);
+		if (ovl_should_sync(ofs)) {
+			down_read(&upper_sb->s_umount);
+			ret = sync_filesystem(upper_sb);
+			up_read(&upper_sb->s_umount);
+		}
 	}
 
 	return ret;
@@ -411,6 +417,7 @@ enum {
 	OPT_XINO_AUTO,
 	OPT_METACOPY_ON,
 	OPT_METACOPY_OFF,
+	OPT_VOLATILE,
 	OPT_ERR,
 };
 
@@ -429,6 +436,7 @@ static const match_table_t ovl_tokens = {
 	{OPT_XINO_AUTO,			"xino=auto"},
 	{OPT_METACOPY_ON,		"metacopy=on"},
 	{OPT_METACOPY_OFF,		"metacopy=off"},
+	{OPT_VOLATILE,			"volatile"},
 	{OPT_ERR,			NULL}
 };
 
@@ -573,6 +581,10 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
 			metacopy_opt = true;
 			break;
 
+		case OPT_VOLATILE:
+			config->ovl_volatile = true;
+			break;
+
 		default:
 			pr_err("unrecognized mount option \"%s\" or missing value\n",
 					p);
@@ -595,6 +607,11 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
 		config->index = false;
 	}
 
+	if (!config->upperdir && config->ovl_volatile) {
+		pr_info("option \"volatile\" is meaningless in a non-upper mount, ignoring it.\n");
+		config->ovl_volatile = false;
+	}
+
 	err = ovl_parse_redirect_mode(config, config->redirect_mode);
 	if (err)
 		return err;
@@ -1203,6 +1220,59 @@ static int ovl_check_rename_whiteout(struct dentry *workdir)
 	return err;
 }
 
+/*
+ * Creates $workdir/work/incompat/volatile/dirty file if it is not
+ * already present.
+ */
+static int ovl_create_volatile_dirty(struct ovl_fs *ofs)
+{
+	struct dentry *parent, *child;
+	char *name;
+	int i, len, err;
+	char *dirty_path[] = {OVL_WORKDIR_NAME, "incompat", "volatile", "dirty"};
+	int nr_elems = ARRAY_SIZE(dirty_path);
+
+	err = 0;
+	parent = ofs->workbasedir;
+	dget(parent);
+
+	for (i = 0; i < nr_elems; i++) {
+		name = dirty_path[i];
+		len = strlen(name);
+		inode_lock_nested(parent->d_inode, I_MUTEX_PARENT);
+		child = lookup_one_len(name, parent, len);
+		if (IS_ERR(child)) {
+			err = PTR_ERR(child);
+			goto out_unlock;
+		}
+
+		if (!child->d_inode) {
+			unsigned short ftype;
+
+			ftype = (i == (nr_elems - 1)) ? S_IFREG : S_IFDIR;
+			child = ovl_create_real(parent->d_inode, child,
+						OVL_CATTR(ftype | 0));
+			if (IS_ERR(child)) {
+				err = PTR_ERR(child);
+				goto out_unlock;
+			}
+		}
+
+		inode_unlock(parent->d_inode);
+		dput(parent);
+		parent = child;
+		child = NULL;
+	}
+
+	dput(parent);
+	return err;
+
+out_unlock:
+	inode_unlock(parent->d_inode);
+	dput(parent);
+	return err;
+}
+
 static int ovl_make_workdir(struct super_block *sb, struct ovl_fs *ofs,
 			    struct path *workpath)
 {
@@ -1286,6 +1356,18 @@ static int ovl_make_workdir(struct super_block *sb, struct ovl_fs *ofs,
 		goto out;
 	}
 
+	/*
+	 * For a volatile mount, create an incompat/volatile/dirty file to keep
+	 * track of it.
+	 */
+	if (ofs->config.ovl_volatile) {
+		err = ovl_create_volatile_dirty(ofs);
+		if (err < 0) {
+			pr_err("Failed to create volatile/dirty file.\n");
+			goto out;
+		}
+	}
+
 	/* Check if upper/work fs supports file handles */
 	fh_type = ovl_can_decode_fh(ofs->workdir->d_sb);
 	if (ofs->config.index && !fh_type) {
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-08-31 18:15 [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync Vivek Goyal
@ 2020-09-01  8:22 ` Amir Goldstein
  2020-09-01 13:14   ` Vivek Goyal
  2020-11-06 17:58 ` Sargun Dhillon
  1 sibling, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2020-09-01  8:22 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: overlayfs, Miklos Szeredi, Giuseppe Scrivano, Daniel J Walsh

On Mon, Aug 31, 2020 at 9:15 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> [..]

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

See one suggestion below, but you may ignore it...


> +/*
> + * Creates $workdir/work/incompat/volatile/dirty file if it is not
> + * already present.
> + */
> +static int ovl_create_volatile_dirty(struct ovl_fs *ofs)
> +{
> +       struct dentry *parent, *child;
> +       char *name;
> +       int i, len, err;
> +       char *dirty_path[] = {OVL_WORKDIR_NAME, "incompat", "volatile", "dirty"};

Technically, you are calling this right after creating OVL_WORKDIR_NAME, so you
could start from ofs->workdir and drop the first level. But as you wrote it,
this function could also be called after the assignment
ovl->workdir = ovl->indexdir, so it is probably safer to start with
ofs->workbasedir as you did.

> +       int nr_elems = ARRAY_SIZE(dirty_path);
> +
> +       err = 0;
> +       parent = ofs->workbasedir;
> +       dget(parent);
> +
> +       for (i = 0; i < nr_elems; i++) {
> +               name = dirty_path[i];
> +               len = strlen(name);
> +               inode_lock_nested(parent->d_inode, I_MUTEX_PARENT);
> +               child = lookup_one_len(name, parent, len);
> +               if (IS_ERR(child)) {
> +                       err = PTR_ERR(child);
> +                       goto out_unlock;
> +               }
> +
> +               if (!child->d_inode) {
> +                       unsigned short ftype;
> +
> +                       ftype = (i == (nr_elems - 1)) ? S_IFREG : S_IFDIR;
> +                       child = ovl_create_real(parent->d_inode, child,
> +                                               OVL_CATTR(ftype | 0));
> +                       if (IS_ERR(child)) {
> +                               err = PTR_ERR(child);
> +                               goto out_unlock;
> +                       }
> +               }
> +
> +               inode_unlock(parent->d_inode);
> +               dput(parent);
> +               parent = child;
> +               child = NULL;
> +       }
> +
> +       dput(parent);
> +       return err;
> +
> +out_unlock:
> +       inode_unlock(parent->d_inode);
> +       dput(parent);
> +       return err;
> +}
> +

I think a helper ovl_test_create() along the lines of the helper found on
my ovl-features branch could make this code a lot easier to follow.
Note that the helper in that branch is not ready to be cherry-picked
as is - it needs changes, so take it or leave it.

Thanks,
Amir.


* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-09-01  8:22 ` Amir Goldstein
@ 2020-09-01 13:14   ` Vivek Goyal
  0 siblings, 0 replies; 20+ messages in thread
From: Vivek Goyal @ 2020-09-01 13:14 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: overlayfs, Miklos Szeredi, Giuseppe Scrivano, Daniel J Walsh

On Tue, Sep 01, 2020 at 11:22:26AM +0300, Amir Goldstein wrote:
[..]
> 
> > +       int nr_elems = ARRAY_SIZE(dirty_path);
> > +
> > +       err = 0;
> > +       parent = ofs->workbasedir;
> > +       dget(parent);
> > +
> > +       for (i = 0; i < nr_elems; i++) {
> > +               name = dirty_path[i];
> > +               len = strlen(name);
> > +               inode_lock_nested(parent->d_inode, I_MUTEX_PARENT);
> > +               child = lookup_one_len(name, parent, len);
> > +               if (IS_ERR(child)) {
> > +                       err = PTR_ERR(child);
> > +                       goto out_unlock;
> > +               }
> > +
> > +               if (!child->d_inode) {
> > +                       unsigned short ftype;
> > +
> > +                       ftype = (i == (nr_elems - 1)) ? S_IFREG : S_IFDIR;
> > +                       child = ovl_create_real(parent->d_inode, child,
> > +                                               OVL_CATTR(ftype | 0));
> > +                       if (IS_ERR(child)) {
> > +                               err = PTR_ERR(child);
> > +                               goto out_unlock;
> > +                       }
> > +               }
> > +
> > +               inode_unlock(parent->d_inode);
> > +               dput(parent);
> > +               parent = child;
> > +               child = NULL;
> > +       }
> > +
> > +       dput(parent);
> > +       return err;
> > +
> > +out_unlock:
> > +       inode_unlock(parent->d_inode);
> > +       dput(parent);
> > +       return err;
> > +}
> > +
> 
> I think a helper ovl_test_create() along the lines of the helper found on
> my ovl-features branch could make this code a lot easier to follow.
> Note that the helper in that branch in not ready to be cherry-picked
> as is - it needs changes, so take it or leave it.

Hi Amir,

For now, I would like to stick with it. You can change it down the line
once ovl_test_create() is ready to be merged.

Vivek



* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-08-31 18:15 [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync Vivek Goyal
  2020-09-01  8:22 ` Amir Goldstein
@ 2020-11-06 17:58 ` Sargun Dhillon
  2020-11-06 19:00   ` Amir Goldstein
  2020-11-06 19:03   ` Vivek Goyal
  1 sibling, 2 replies; 20+ messages in thread
From: Sargun Dhillon @ 2020-11-06 17:58 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: overlayfs, Miklos Szeredi, Amir Goldstein, Giuseppe Scrivano,
	Daniel J Walsh

On Mon, Aug 31, 2020 at 11:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> [..]
> ---
>  Documentation/filesystems/overlayfs.rst | 17 +++++
>  fs/overlayfs/copy_up.c                  | 12 ++--
>  fs/overlayfs/file.c                     | 10 ++-
>  fs/overlayfs/ovl_entry.h                |  6 ++
>  fs/overlayfs/readdir.c                  |  3 +
>  fs/overlayfs/super.c                    | 88 ++++++++++++++++++++++++-
>  6 files changed, 128 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> index 8ea83a51c266..b33465fdf260 100644
> --- a/Documentation/filesystems/overlayfs.rst
> +++ b/Documentation/filesystems/overlayfs.rst
> @@ -563,6 +563,23 @@ This verification may cause significant overhead in some cases.
>  Note: the mount options index=off,nfs_export=on are conflicting for a
>  read-write mount and will result in an error.
>
> +Disable sync
> +------------
> +By default, overlay skips sync on files residing on a lower layer.  It
> +is possible to skip sync operations for files on the upper layer as well
> +with the "volatile" mount option.
> +
> +"volatile" mount option disables all forms of sync from overlay, including
> +the one done at umount/remount. If system crashes or shuts down, user
> +should throw away upper directory and start fresh.
> +
> +When overlay is mounted with "volatile" option, overlay creates an internal
> +directory "$workdir/work/incompat/volatile". During next mount, overlay
> +checks for this directory and refuses to mount if present. This is a strong
> +indicator that user should throw away upper and work directories and
> +create fresh one. In very limited cases where user knows system has not
> +crashed and contents in upperdir are intact, one can remove the "volatile"
> +directory and retry mount.
>
>  Testsuite
>  ---------
> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> index d07fb92b7253..9d17e42d184b 100644
> --- a/fs/overlayfs/copy_up.c
> +++ b/fs/overlayfs/copy_up.c
> @@ -128,7 +128,8 @@ int ovl_copy_xattr(struct dentry *old, struct dentry *new)
>         return error;
>  }
>
> -static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
> +static int ovl_copy_up_data(struct ovl_fs *ofs, struct path *old,
> +                           struct path *new, loff_t len)
>  {
>         struct file *old_file;
>         struct file *new_file;
> @@ -218,7 +219,7 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
>                 len -= bytes;
>         }
>  out:
> -       if (!error)
> +       if (!error && ovl_should_sync(ofs))
>                 error = vfs_fsync(new_file, 0);
>         fput(new_file);
>  out_fput:
> @@ -484,6 +485,7 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
>
>  static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
>  {
> +       struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
>         int err;
>
>         /*
> @@ -499,7 +501,8 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
>                 upperpath.dentry = temp;
>
>                 ovl_path_lowerdata(c->dentry, &datapath);
> -               err = ovl_copy_up_data(&datapath, &upperpath, c->stat.size);
> +               err = ovl_copy_up_data(ofs, &datapath, &upperpath,
> +                                      c->stat.size);
>                 if (err)
>                         return err;
>         }
> @@ -784,6 +787,7 @@ static bool ovl_need_meta_copy_up(struct dentry *dentry, umode_t mode,
>  /* Copy up data of an inode which was copied up metadata only in the past. */
>  static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
>  {
> +       struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
>         struct path upperpath, datapath;
>         int err;
>         char *capability = NULL;
> @@ -804,7 +808,7 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
>                         goto out;
>         }
>
> -       err = ovl_copy_up_data(&datapath, &upperpath, c->stat.size);
> +       err = ovl_copy_up_data(ofs, &datapath, &upperpath, c->stat.size);
>         if (err)
>                 goto out_free;
>
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index 0d940e29d62b..3582c3ae819c 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -331,6 +331,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>         struct fd real;
>         const struct cred *old_cred;
>         ssize_t ret;
> +       int ifl = iocb->ki_flags;
>
>         if (!iov_iter_count(iter))
>                 return 0;
> @@ -346,11 +347,14 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>         if (ret)
>                 goto out_unlock;
>
> +       if (!ovl_should_sync(OVL_FS(inode->i_sb)))
> +               ifl &= ~(IOCB_DSYNC | IOCB_SYNC);
> +
>         old_cred = ovl_override_creds(file_inode(file)->i_sb);
>         if (is_sync_kiocb(iocb)) {
>                 file_start_write(real.file);
>                 ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
> -                                    ovl_iocb_to_rwf(iocb->ki_flags));
> +                                    ovl_iocb_to_rwf(ifl));
>                 file_end_write(real.file);
>                 /* Update size */
>                 ovl_copyattr(ovl_inode_real(inode), inode);
> @@ -370,6 +374,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>                 real.flags = 0;
>                 aio_req->orig_iocb = iocb;
>                 kiocb_clone(&aio_req->iocb, iocb, real.file);
> +               aio_req->iocb.ki_flags = ifl;
>                 aio_req->iocb.ki_complete = ovl_aio_rw_complete;
>                 ret = vfs_iocb_iter_write(real.file, &aio_req->iocb, iter);
>                 if (ret != -EIOCBQUEUED)
> @@ -433,6 +438,9 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>         const struct cred *old_cred;
>         int ret;
>
> +       if (!ovl_should_sync(OVL_FS(file_inode(file)->i_sb)))
> +               return 0;
> +
>         ret = ovl_real_fdget_meta(file, &real, !datasync);
>         if (ret)
>                 return ret;
> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> index b429c80879ee..1b5a2094df8e 100644
> --- a/fs/overlayfs/ovl_entry.h
> +++ b/fs/overlayfs/ovl_entry.h
> @@ -17,6 +17,7 @@ struct ovl_config {
>         bool nfs_export;
>         int xino;
>         bool metacopy;
> +       bool ovl_volatile;
>  };
>
>  struct ovl_sb {
> @@ -90,6 +91,11 @@ static inline struct ovl_fs *OVL_FS(struct super_block *sb)
>         return (struct ovl_fs *)sb->s_fs_info;
>  }
>
> +static inline bool ovl_should_sync(struct ovl_fs *ofs)
> +{
> +       return !ofs->config.ovl_volatile;
> +}
> +
>  /* private information held for every overlayfs dentry */
>  struct ovl_entry {
>         union {
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 683c6f27ab77..f50a9f20e72d 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -863,6 +863,9 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
>         if (!OVL_TYPE_UPPER(ovl_path_type(dentry)))
>                 return 0;
>
> +       if (!ovl_should_sync(OVL_FS(dentry->d_sb)))
> +               return 0;
> +
>         /*
>          * Need to check if we started out being a lower dir, but got copied up
>          */
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 3cd47e4b2eae..f0f7ad8da4be 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -264,6 +264,8 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
>         if (!ovl_upper_mnt(ofs))
>                 return 0;
>
> +       if (!ovl_should_sync(ofs))
> +               return 0;
>         /*
>          * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
>          * All the super blocks will be iterated, including upper_sb.
> @@ -362,6 +364,8 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
>         if (ofs->config.metacopy != ovl_metacopy_def)
>                 seq_printf(m, ",metacopy=%s",
>                            ofs->config.metacopy ? "on" : "off");
> +       if (ofs->config.ovl_volatile)
> +               seq_printf(m, ",volatile");
>         return 0;
>  }
>
> @@ -376,9 +380,11 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
>
>         if (*flags & SB_RDONLY && !sb_rdonly(sb)) {
>                 upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> -               down_read(&upper_sb->s_umount);
> -               ret = sync_filesystem(upper_sb);
> -               up_read(&upper_sb->s_umount);
> +               if (ovl_should_sync(ofs)) {
> +                       down_read(&upper_sb->s_umount);
> +                       ret = sync_filesystem(upper_sb);
> +                       up_read(&upper_sb->s_umount);
> +               }
>         }
>
>         return ret;
> @@ -411,6 +417,7 @@ enum {
>         OPT_XINO_AUTO,
>         OPT_METACOPY_ON,
>         OPT_METACOPY_OFF,
> +       OPT_VOLATILE,
>         OPT_ERR,
>  };
>
> @@ -429,6 +436,7 @@ static const match_table_t ovl_tokens = {
>         {OPT_XINO_AUTO,                 "xino=auto"},
>         {OPT_METACOPY_ON,               "metacopy=on"},
>         {OPT_METACOPY_OFF,              "metacopy=off"},
> +       {OPT_VOLATILE,                  "volatile"},
>         {OPT_ERR,                       NULL}
>  };
>
> @@ -573,6 +581,10 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
>                         metacopy_opt = true;
>                         break;
>
> +               case OPT_VOLATILE:
> +                       config->ovl_volatile = true;
> +                       break;
> +
>                 default:
>                         pr_err("unrecognized mount option \"%s\" or missing value\n",
>                                         p);
> @@ -595,6 +607,11 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
>                 config->index = false;
>         }
>
> +       if (!config->upperdir && config->ovl_volatile) {
> +               pr_info("option \"volatile\" is meaningless in a non-upper mount, ignoring it.\n");
> +               config->ovl_volatile = false;
> +       }
> +
>         err = ovl_parse_redirect_mode(config, config->redirect_mode);
>         if (err)
>                 return err;
> @@ -1203,6 +1220,59 @@ static int ovl_check_rename_whiteout(struct dentry *workdir)
>         return err;
>  }
>
> +/*
> + * Creates $workdir/work/incompat/volatile/dirty file if it is not
> + * already present.
> + */
> +static int ovl_create_volatile_dirty(struct ovl_fs *ofs)
> +{
> +       struct dentry *parent, *child;
> +       char *name;
> +       int i, len, err;
> +       char *dirty_path[] = {OVL_WORKDIR_NAME, "incompat", "volatile", "dirty"};
> +       int nr_elems = ARRAY_SIZE(dirty_path);
> +
> +       err = 0;
> +       parent = ofs->workbasedir;
> +       dget(parent);
> +
> +       for (i = 0; i < nr_elems; i++) {
> +               name = dirty_path[i];
> +               len = strlen(name);
> +               inode_lock_nested(parent->d_inode, I_MUTEX_PARENT);
> +               child = lookup_one_len(name, parent, len);
> +               if (IS_ERR(child)) {
> +                       err = PTR_ERR(child);
> +                       goto out_unlock;
> +               }
> +
> +               if (!child->d_inode) {
> +                       unsigned short ftype;
> +
> +                       ftype = (i == (nr_elems - 1)) ? S_IFREG : S_IFDIR;
> +                       child = ovl_create_real(parent->d_inode, child,
> +                                               OVL_CATTR(ftype | 0));
> +                       if (IS_ERR(child)) {
> +                               err = PTR_ERR(child);
> +                               goto out_unlock;
> +                       }
> +               }
> +
> +               inode_unlock(parent->d_inode);
> +               dput(parent);
> +               parent = child;
> +               child = NULL;
> +       }
> +
> +       dput(parent);
> +       return err;
> +
> +out_unlock:
> +       inode_unlock(parent->d_inode);
> +       dput(parent);
> +       return err;
> +}
> +
>  static int ovl_make_workdir(struct super_block *sb, struct ovl_fs *ofs,
>                             struct path *workpath)
>  {
> @@ -1286,6 +1356,18 @@ static int ovl_make_workdir(struct super_block *sb, struct ovl_fs *ofs,
>                 goto out;
>         }
>
> +       /*
> +        * For volatile mount, create a incompat/volatile/dirty file to keep
> +        * track of it.
> +        */
> +       if (ofs->config.ovl_volatile) {
> +               err = ovl_create_volatile_dirty(ofs);
> +               if (err < 0) {
> +                       pr_err("Failed to create volatile/dirty file.\n");
> +                       goto out;
> +               }
> +       }
> +
>         /* Check if upper/work fs supports file handles */
>         fh_type = ovl_can_decode_fh(ofs->workdir->d_sb);
>         if (ofs->config.index && !fh_type) {
> --
> 2.25.4
>
There is some slightly confusing behaviour here [I realize this
behaviour is as intended]:

(root) ~ # mount -t overlay -o
volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
none /mnt/foo
(root) ~ # umount /mnt/foo
(root) ~ # mount -t overlay -o
volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
none /mnt/foo
mount: /mnt/foo: wrong fs type, bad option, bad superblock on none,
missing codepage or helper program, or other error.

From my understanding, the dirty flag should only be a problem if the
existing overlayfs is unmounted uncleanly. Docker does
this (mounts, then re-mounts) at startup time because it writes some
files to the overlayfs. I think that we should harden
the volatile check slightly and make it so that, within the same boot,
it's not a problem; having to make the user clear
the workdir every time is a pain. In addition, the semantics of the
volatile patch itself do not appear to be such that they
would break mounts during the same boot / mount of upperdir --
overlayfs does not defer any writes itself; it is
only short-circuiting syncs of writes to the upperdir.
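For reference, the on-disk state behind that mount error can be simulated
without a real overlay mount. This is a sketch only: it assumes the
`work/incompat/volatile/dirty` path created by `ovl_create_volatile_dirty()`
in the patch above, and uses a scratch directory standing in for the real
workdir:

```shell
#!/bin/sh
set -e
workdir=$(mktemp -d)

# Simulate what ovl_create_volatile_dirty() leaves behind after a
# volatile mount: $workdir/work/incompat/volatile/dirty
mkdir -p "$workdir/work/incompat/volatile"
touch "$workdir/work/incompat/volatile/dirty"

# While this marker exists, a second mount of the same upperdir/workdir
# fails; today the user has to clear it by hand before re-mounting:
rm -rf "$workdir/work/incompat"

test ! -e "$workdir/work/incompat/volatile/dirty" && echo "marker cleared"
```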

Amir,
What do you think?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-06 17:58 ` Sargun Dhillon
@ 2020-11-06 19:00   ` Amir Goldstein
  2020-11-06 19:20     ` Vivek Goyal
  2020-11-09 17:22     ` Vivek Goyal
  2020-11-06 19:03   ` Vivek Goyal
  1 sibling, 2 replies; 20+ messages in thread
From: Amir Goldstein @ 2020-11-06 19:00 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Vivek Goyal, overlayfs, Miklos Szeredi, Giuseppe Scrivano,
	Daniel J Walsh

On Fri, Nov 6, 2020 at 7:59 PM Sargun Dhillon <sargun@sargun.me> wrote:
>
> On Mon, Aug 31, 2020 at 11:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:
[...]
> There is some slightly confusing behaviour here [I realize this
> behaviour is as intended]:
>
> (root) ~ # mount -t overlay -o
> volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> none /mnt/foo
> (root) ~ # umount /mnt/foo
> (root) ~ # mount -t overlay -o
> volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> none /mnt/foo
> mount: /mnt/foo: wrong fs type, bad option, bad superblock on none,
> missing codepage or helper program, or other error.
>
> From my understanding, the dirty flag should only be a problem if the
> existing overlayfs is unmounted uncleanly. Docker does
> this (mount, and re-mounts) during startup time because it writes some
> files to the overlayfs. I think that we should harden
> the volatile check slightly, and make it so that within the same boot,
> it's not a problem, and having to have the user clear
> the workdir every time is a pain. In addition, the semantics of the
> volatile patch itself do not appear to be such that they
> would break mounts during the same boot / mount of upperdir -- as
> overlayfs does not defer any writes in itself, and it's
> only that it's short-circuiting writes to the upperdir.
>
> Amir,
> What do you think?

How do you propose to check that upperdir was used during the same boot?

Maybe a simpler check is whether the upperdir inode is still in cache, as
an easy way around this.

Add an overlayfs-specific inode flag, similar to I_OVL_INUSE:
I_OVL_WAS_INUSE.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-06 17:58 ` Sargun Dhillon
  2020-11-06 19:00   ` Amir Goldstein
@ 2020-11-06 19:03   ` Vivek Goyal
  2020-11-06 19:42     ` Giuseppe Scrivano
  1 sibling, 1 reply; 20+ messages in thread
From: Vivek Goyal @ 2020-11-06 19:03 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: overlayfs, Miklos Szeredi, Amir Goldstein, Giuseppe Scrivano,
	Daniel J Walsh

On Fri, Nov 06, 2020 at 09:58:39AM -0800, Sargun Dhillon wrote:

[..]
> There is some slightly confusing behaviour here [I realize this
> behaviour is as intended]:
> 
> (root) ~ # mount -t overlay -o
> volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> none /mnt/foo
> (root) ~ # umount /mnt/foo
> (root) ~ # mount -t overlay -o
> volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> none /mnt/foo
> mount: /mnt/foo: wrong fs type, bad option, bad superblock on none,
> missing codepage or helper program, or other error.
> 
> From my understanding, the dirty flag should only be a problem if the
> existing overlayfs is unmounted uncleanly. Docker does
> this (mount, and re-mounts) during startup time because it writes some
> files to the overlayfs. I think that we should harden
> the volatile check slightly, and make it so that within the same boot,
> it's not a problem, and having to have the user clear
> the workdir every time is a pain. In addition, the semantics of the
> volatile patch itself do not appear to be such that they
> would break mounts during the same boot / mount of upperdir -- as
> overlayfs does not defer any writes in itself, and it's
> only that it's short-circuiting writes to the upperdir.

umount normally does a sync, and with "volatile" overlayfs skips that
sync. So a successful unmount does not mean that files got synced
to the backing store. It is possible that, after umount, the system
crashed, and after reboot the user tried to mount an upper which is
now corrupted, and overlay will not detect it.

You seem to be asking for an alternate option where we disable
fsync() but not syncfs. In that case the sync on umount will still
be done. And that means a successful umount should mean the upper
is fine, and overlayfs could automatically remove the incompat dir upon
umount.

The initial version of the patches had both volatile modes implemented.
Later we dropped one because it was not clear who wanted the
second mode. If this is something which is useful for you, it
can possibly be reintroduced.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-06 19:00   ` Amir Goldstein
@ 2020-11-06 19:20     ` Vivek Goyal
  2020-11-09 17:22     ` Vivek Goyal
  1 sibling, 0 replies; 20+ messages in thread
From: Vivek Goyal @ 2020-11-06 19:20 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Giuseppe Scrivano,
	Daniel J Walsh

On Fri, Nov 06, 2020 at 09:00:07PM +0200, Amir Goldstein wrote:
> On Fri, Nov 6, 2020 at 7:59 PM Sargun Dhillon <sargun@sargun.me> wrote:
> >
> > On Mon, Aug 31, 2020 at 11:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > [...]
> [...]
> > [...]
> >
> > Amir,
> > What do you think?
> 
> How do you propose to check that upperdir was used during the same boot?
> 
> Maybe a simpler check  is that upperdir inode is still in cache as an easy way
> around this.
> 
> Add an overlayfs specific inode flag, similar to I_OVL_INUSE
> I_OVL_WAS_INUSE.

So this works only if the inode has not been evicted. That means sometimes
it will work and other times it will error out. If that's the case,
the user has to write code to deal with the error anyway, so this does
not make life simpler.

Maybe sync=fs is a middle-ground option, where we ignore fsync() but still
do the filesystem sync. There we would sync the upper on umount and
could then remove this incompat directory.

https://lore.kernel.org/linux-unionfs/20200701215029.GF369085@redhat.com/

Thanks
Vivek


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-06 19:03   ` Vivek Goyal
@ 2020-11-06 19:42     ` Giuseppe Scrivano
  2020-11-07  9:35       ` Amir Goldstein
  0 siblings, 1 reply; 20+ messages in thread
From: Giuseppe Scrivano @ 2020-11-06 19:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Amir Goldstein,
	Daniel J Walsh

Vivek Goyal <vgoyal@redhat.com> writes:

> On Fri, Nov 06, 2020 at 09:58:39AM -0800, Sargun Dhillon wrote:
>
> [..]
>> [...]
>
> umount does a sync normally and with "volatile" overlayfs skips that
> sync. So a successful unmount does not mean that file got synced
> to backing store. It is possible, after umount, system crashed
> and after reboot, user tried to mount upper which is corrupted
> now and overlay will not detect it.
>
> You seem to be asking for an alternate option where we disable
> fsync() but not syncfs. In that case sync on umount will still
> be done. And that means a successful umount should mean upper
> is fine and it could automatically remove incomapt dir upon
> umount.

Could this be handled in user space? It should still be possible to do
the equivalent of:

# sync -f /root/upperdir
# rm -rf /root/workdir/incompat/volatile
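That sequence can be wrapped in a small helper. Below is a hedged sketch,
assuming the dirty marker lives at `$workdir/work/incompat/volatile` (the
"work" component being OVL_WORKDIR_NAME inside the given workdir); the demo
runs against scratch directories rather than a real upper/work pair:

```shell
#!/bin/sh
set -e

# Persist the upper layer, then drop the dirty marker so a later
# mount of this upperdir is allowed again.
cleanup_volatile() {
    upperdir=$1
    workdir=$2
    sync -f "$upperdir"                       # flush the fs backing upperdir
    rm -rf "$workdir/work/incompat/volatile"  # remove the dirty marker
}

# Demo with scratch directories standing in for a real upper/work pair
upper=$(mktemp -d)
work=$(mktemp -d)
mkdir -p "$work/work/incompat/volatile"
touch "$work/work/incompat/volatile/dirty"

cleanup_volatile "$upper" "$work"
echo "dirty marker removed"
```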

Regards,
Giuseppe


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-06 19:42     ` Giuseppe Scrivano
@ 2020-11-07  9:35       ` Amir Goldstein
  2020-11-07 11:52         ` Sargun Dhillon
                           ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Amir Goldstein @ 2020-11-07  9:35 UTC (permalink / raw)
  To: Giuseppe Scrivano
  Cc: Vivek Goyal, Sargun Dhillon, overlayfs, Miklos Szeredi, Daniel J Walsh

On Fri, Nov 6, 2020 at 9:43 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote:
>
> Vivek Goyal <vgoyal@redhat.com> writes:
>
> > On Fri, Nov 06, 2020 at 09:58:39AM -0800, Sargun Dhillon wrote:
> >
> > [..]
> >> [...]
> >
> > umount does a sync normally and with "volatile" overlayfs skips that
> > sync. So a successful unmount does not mean that file got synced
> > to backing store. It is possible, after umount, system crashed
> > and after reboot, user tried to mount upper which is corrupted
> > now and overlay will not detect it.
> >
> > You seem to be asking for an alternate option where we disable
> > fsync() but not syncfs. In that case sync on umount will still
> > be done. And that means a successful umount should mean upper
> > is fine and it could automatically remove incompat dir upon
> > umount.
>
> could this be handled in user space?  It should still be possible to do
> the equivalent of:
>
> # sync -f /root/upperdir
> # rm -rf /root/workdir/incompat/volatile
>

FWIW, the sync -f command above is
1. Not needed when re-mounting overlayfs as volatile
2. Not enough when re-mounting overlayfs as non-volatile

In the latter case, a full sync (no -f) is required.
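For concreteness, the distinction is between coreutils' filesystem-targeted
`sync -f` (syncfs(2) on one filesystem) and a bare `sync` (sync(2) on all
mounted filesystems). A runnable sketch, using a scratch directory as a
stand-in for upperdir:

```shell
#!/bin/sh
set -e
upper=$(mktemp -d)

# sync -f: flushes only the filesystem containing upperdir. Per the
# point above, skippable before a volatile re-mount, and not sufficient
# on its own before a non-volatile re-mount.
sync -f "$upper"

# sync (no -f): flushes all mounted filesystems -- the full sync
# required before re-mounting non-volatile.
sync

echo "both syncs returned"
```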

Handling this in userspace is the preferred option IMO,
but if there is an *appealing* reason to allow opportunistic
volatile overlayfs re-mount as long as the upperdir inode
is in cache (userspace can make sure of that), then
all I am saying is that it is possible and not terribly hard.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-07  9:35       ` Amir Goldstein
@ 2020-11-07 11:52         ` Sargun Dhillon
  2020-11-09 20:40           ` Vivek Goyal
  2020-11-09  8:53         ` Giuseppe Scrivano
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Sargun Dhillon @ 2020-11-07 11:52 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Giuseppe Scrivano, Vivek Goyal, overlayfs, Miklos Szeredi,
	Daniel J Walsh

On Sat, Nov 07, 2020 at 11:35:04AM +0200, Amir Goldstein wrote:
> On Fri, Nov 6, 2020 at 9:43 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote:
> >
> > Vivek Goyal <vgoyal@redhat.com> writes:
> >
> > > On Fri, Nov 06, 2020 at 09:58:39AM -0800, Sargun Dhillon wrote:
> > >
> > > [..]
> > >> [...]
> > >
> > > umount does a sync normally and with "volatile" overlayfs skips that
> > > sync. So a successful unmount does not mean that file got synced
> > > to backing store. It is possible, after umount, system crashed
> > > and after reboot, user tried to mount upper which is corrupted
> > > now and overlay will not detect it.
> > >
We explicitly disable this in our infrastructure via a small kernel patch that 
stubs out the sync behaviour. IIRC, it was added some time after 4.15, and when 
we picked up the related overlayfs patch it caused a lot of machines to crash.

This was due to high container churn -- and other containers having a lot of 
outstanding dirty pages at exit time. When we would teardown their mounts, 
syncfs would get called [on the entire underlying device / fs], and that would 
stall out all of the containers on the machine. We really do not want this 
behaviour.

> > > You seem to be asking for an alternate option where we disable
> > > fsync() but not syncfs. In that case sync on umount will still
> > > be done. And that means a successful umount should mean upper
> > > is fine and it could automatically remove incompat dir upon
> > > umount.
> >
> > could this be handled in user space?  It should still be possible to do
> > the equivalent of:
> >
> > # sync -f /root/upperdir
> > # rm -rf /root/workdir/incompat/volatile
> >
> 
> FWIW, the sync -f command above is
> 1. Not needed when re-mounting overlayfs as volatile
> 2. Not enough when re-mounting overlayfs as non-volatile
> 
> In the latter case, a full sync (no -f) is required.
> 
> Handling this in userspace is the preferred option IMO,
> but if there is an *appealing* reason to allow opportunistic
> volatile overlayfs re-mount as long as the upperdir inode
> is in cache (userspace can make sure of that), then
> all I am saying is that it is possible and not terribly hard.
> 
> Thanks,
> Amir.


I think I have two approaches in mind that are viable. Both approaches rely on 
adding a small amount of data (either via an xattr, or data in the file itself) 
that allows us to ascertain whether or not the upperdir is okay to reuse, even 
when it was mounted volatile:

1. We introduce a guid in the superblock structure itself. I think that this 
would actually be valuable independently of overlayfs, in order to answer 
questions like "my database restarted; should it do an integrity check, or is 
the same SB still mounted?" I started down the route of cooking up an ioctl for 
this, but I think that this is killing a mosquito with a cannon. Perhaps this 
is the right long-term approach, but I don't think it'll be easy to get 
through.

2. I've started cooking up this patch a little bit more, where we override 
kill_sb. Specifically, we assign kill_sb on the upperdir / workdir to our own 
kill_sb, and keep track of superblocks and the errseq on the super block. We 
then keep a list of tracked superblocks in memory, the last observed errseq, and 
a guid. Upon mount of the overlayfs, we write a key that uniquely 
identifies the sb + errseq. Upon remount, we check whether the errseq or the sb 
have changed. If so, we throw an error. Otherwise, we allow things to pass.

This approach has seen some usage in net[1].

[1]: https://elixir.bootlin.com/linux/v5.9.6/source/drivers/net/ipvlan/ipvlan_main.c#L80

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-07  9:35       ` Amir Goldstein
  2020-11-07 11:52         ` Sargun Dhillon
@ 2020-11-09  8:53         ` Giuseppe Scrivano
  2020-11-09 10:10           ` Amir Goldstein
  2020-11-09 16:36         ` Vivek Goyal
  2020-11-09 17:09         ` Vivek Goyal
  3 siblings, 1 reply; 20+ messages in thread
From: Giuseppe Scrivano @ 2020-11-09  8:53 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Vivek Goyal, Sargun Dhillon, overlayfs, Miklos Szeredi, Daniel J Walsh

> On Fri, Nov 6, 2020 at 9:43 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote:
>>
>> Vivek Goyal <vgoyal@redhat.com> writes:
>>
>> > On Fri, Nov 06, 2020 at 09:58:39AM -0800, Sargun Dhillon wrote:
>> >
>> > [..]
>> >> [...]
>> >
>> > umount does a sync normally and with "volatile" overlayfs skips that
>> > sync. So a successful unmount does not mean that file got synced
>> > to backing store. It is possible, after umount, system crashed
>> > and after reboot, user tried to mount upper which is corrupted
>> > now and overlay will not detect it.
>> >
>> > You seem to be asking for an alternate option where we disable
>> > fsync() but not syncfs. In that case sync on umount will still
>> > be done. And that means a successful umount should mean upper
>> > is fine and it could automatically remove incompat dir upon
>> > umount.
>>
>> could this be handled in user space?  It should still be possible to do
>> the equivalent of:
>>
>> # sync -f /root/upperdir
>> # rm -rf /root/workdir/incompat/volatile
>>
>
> FWIW, the sync -f command above is
> 1. Not needed when re-mounting overlayfs as volatile
> 2. Not enough when re-mounting overlayfs as non-volatile
>
> In the latter case, a full sync (no -f) is required.

Thanks for the clarification.  Why wouldn't a syncfs on the upper
directory be enough to ensure files are persisted and safe to reuse
after a crash?

Regards,
Giuseppe


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-09  8:53         ` Giuseppe Scrivano
@ 2020-11-09 10:10           ` Amir Goldstein
  0 siblings, 0 replies; 20+ messages in thread
From: Amir Goldstein @ 2020-11-09 10:10 UTC (permalink / raw)
  To: Giuseppe Scrivano
  Cc: Vivek Goyal, Sargun Dhillon, overlayfs, Miklos Szeredi, Daniel J Walsh

On Mon, Nov 9, 2020 at 10:53 AM Giuseppe Scrivano <gscrivan@redhat.com> wrote:
>
> > On Fri, Nov 6, 2020 at 9:43 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote:
> >>
> >> Vivek Goyal <vgoyal@redhat.com> writes:
> >>
> >> > On Fri, Nov 06, 2020 at 09:58:39AM -0800, Sargun Dhillon wrote:
> >> >
> >> > [..]
> >> >> There is some slightly confusing behaviour here [I realize this
> >> >> behaviour is as intended]:
> >> >>
> >> >> (root) ~ # mount -t overlay -o
> >> >> volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> >> >> none /mnt/foo
> >> >> (root) ~ # umount /mnt/foo
> >> >> (root) ~ # mount -t overlay -o
> >> >> volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> >> >> none /mnt/foo
> >> >> mount: /mnt/foo: wrong fs type, bad option, bad superblock on none,
> >> >> missing codepage or helper program, or other error.
> >> >>
> >> >> From my understanding, the dirty flag should only be a problem if the
> >> >> existing overlayfs is unmounted uncleanly. Docker does
> >> >> this (mount, and re-mounts) during startup time because it writes some
> >> >> files to the overlayfs. I think that we should harden
> >> >> the volatile check slightly, and make it so that within the same boot,
> >> >> it's not a problem, and having to have the user clear
> >> >> the workdir every time is a pain. In addition, the semantics of the
> >> >> volatile patch itself do not appear to be such that they
> >> >> would break mounts during the same boot / mount of upperdir -- as
> >> >> overlayfs does not defer any writes in itself, and it's
> >> >> only that it's short-circuiting writes to the upperdir.
> >> >
> >> > umount does a sync normally and with "volatile" overlayfs skips that
> >> > sync. So a successful unmount does not mean that file got synced
> >> > to backing store. It is possible, after umount, system crashed
> >> > and after reboot, user tried to mount upper which is corrupted
> >> > now and overlay will not detect it.
> >> >
> >> > You seem to be asking for an alternate option where we disable
> >> > fsync() but not syncfs. In that case sync on umount will still
> >> > be done. And that means a successful umount should mean upper
> >> > is fine and it could automatically remove incompat dir upon
> >> > umount.
> >>
> >> could this be handled in user space?  It should still be possible to do
> >> the equivalent of:
> >>
> >> # sync -f /root/upperdir
> >> # rm -rf /root/workdir/incompat/volatile
> >>
> >
> > FWIW, the sync -f command above is
> > 1. Not needed when re-mounting overlayfs as volatile
> > 2. Not enough when re-mounting overlayfs as non-volatile
> >
> > In the latter case, a full sync (no -f) is required.
>
> Thanks for the clarification.  Why wouldn't a syncfs on the upper
> directory be enough to ensure files are persisted and safe to reuse
> after a crash?
>

My bad. I always confuse "sync -f" with fsync().

Sorry for the noise,
Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-07  9:35       ` Amir Goldstein
  2020-11-07 11:52         ` Sargun Dhillon
  2020-11-09  8:53         ` Giuseppe Scrivano
@ 2020-11-09 16:36         ` Vivek Goyal
  2020-11-09 17:09         ` Vivek Goyal
  3 siblings, 0 replies; 20+ messages in thread
From: Vivek Goyal @ 2020-11-09 16:36 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Giuseppe Scrivano, Sargun Dhillon, overlayfs, Miklos Szeredi,
	Daniel J Walsh

On Sat, Nov 07, 2020 at 11:35:04AM +0200, Amir Goldstein wrote:
> On Fri, Nov 6, 2020 at 9:43 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote:
> >
> > Vivek Goyal <vgoyal@redhat.com> writes:
> >
> > > On Fri, Nov 06, 2020 at 09:58:39AM -0800, Sargun Dhillon wrote:
> > >
> > > [..]
> > >> There is some slightly confusing behaviour here [I realize this
> > >> behaviour is as intended]:
> > >>
> > >> (root) ~ # mount -t overlay -o
> > >> volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > >> none /mnt/foo
> > >> (root) ~ # umount /mnt/foo
> > >> (root) ~ # mount -t overlay -o
> > >> volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > >> none /mnt/foo
> > >> mount: /mnt/foo: wrong fs type, bad option, bad superblock on none,
> > >> missing codepage or helper program, or other error.
> > >>
> > >> From my understanding, the dirty flag should only be a problem if the
> > >> existing overlayfs is unmounted uncleanly. Docker does
> > >> this (mount, and re-mounts) during startup time because it writes some
> > >> files to the overlayfs. I think that we should harden
> > >> the volatile check slightly, and make it so that within the same boot,
> > >> it's not a problem, and having to have the user clear
> > >> the workdir every time is a pain. In addition, the semantics of the
> > >> volatile patch itself do not appear to be such that they
> > >> would break mounts during the same boot / mount of upperdir -- as
> > >> overlayfs does not defer any writes in itself, and it's
> > >> only that it's short-circuiting writes to the upperdir.
> > >
> > > umount does a sync normally and with "volatile" overlayfs skips that
> > > sync. So a successful unmount does not mean that file got synced
> > > to backing store. It is possible, after umount, system crashed
> > > and after reboot, user tried to mount upper which is corrupted
> > > now and overlay will not detect it.
> > >
> > > You seem to be asking for an alternate option where we disable
> > > fsync() but not syncfs. In that case sync on umount will still
> > > be done. And that means a successful umount should mean upper
> > > is fine and it could automatically remove incompat dir upon
> > > umount.
> >
> > could this be handled in user space?  It should still be possible to do
> > the equivalent of:
> >
> > # sync -f /root/upperdir
> > # rm -rf /root/workdir/incompat/volatile
> >
> 
> FWIW, the sync -f command above is
> 1. Not needed when re-mounting overlayfs as volatile
> 2. Not enough when re-mounting overlayfs as non-volatile
> 
> In the latter case, a full sync (no -f) is required.

Hi Amir,

I am wondering why "sync -f upper/" is not sufficient and why full sync
is required.

Vivek


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-07  9:35       ` Amir Goldstein
                           ` (2 preceding siblings ...)
  2020-11-09 16:36         ` Vivek Goyal
@ 2020-11-09 17:09         ` Vivek Goyal
  2020-11-09 17:20           ` Amir Goldstein
  3 siblings, 1 reply; 20+ messages in thread
From: Vivek Goyal @ 2020-11-09 17:09 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Giuseppe Scrivano, Sargun Dhillon, overlayfs, Miklos Szeredi,
	Daniel J Walsh

On Sat, Nov 07, 2020 at 11:35:04AM +0200, Amir Goldstein wrote:
> On Fri, Nov 6, 2020 at 9:43 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote:
> >
> > Vivek Goyal <vgoyal@redhat.com> writes:
> >
> > > On Fri, Nov 06, 2020 at 09:58:39AM -0800, Sargun Dhillon wrote:
> > >
> > > [..]
> > >> There is some slightly confusing behaviour here [I realize this
> > >> behaviour is as intended]:
> > >>
> > >> (root) ~ # mount -t overlay -o
> > >> volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > >> none /mnt/foo
> > >> (root) ~ # umount /mnt/foo
> > >> (root) ~ # mount -t overlay -o
> > >> volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > >> none /mnt/foo
> > >> mount: /mnt/foo: wrong fs type, bad option, bad superblock on none,
> > >> missing codepage or helper program, or other error.
> > >>
> > >> From my understanding, the dirty flag should only be a problem if the
> > >> existing overlayfs is unmounted uncleanly. Docker does
> > >> this (mount, and re-mounts) during startup time because it writes some
> > >> files to the overlayfs. I think that we should harden
> > >> the volatile check slightly, and make it so that within the same boot,
> > >> it's not a problem, and having to have the user clear
> > >> the workdir every time is a pain. In addition, the semantics of the
> > >> volatile patch itself do not appear to be such that they
> > >> would break mounts during the same boot / mount of upperdir -- as
> > >> overlayfs does not defer any writes in itself, and it's
> > >> only that it's short-circuiting writes to the upperdir.
> > >
> > > umount does a sync normally and with "volatile" overlayfs skips that
> > > sync. So a successful unmount does not mean that file got synced
> > > to backing store. It is possible, after umount, system crashed
> > > and after reboot, user tried to mount upper which is corrupted
> > > now and overlay will not detect it.
> > >
> > > You seem to be asking for an alternate option where we disable
> > > fsync() but not syncfs. In that case sync on umount will still
> > > be done. And that means a successful umount should mean upper
> > > > is fine and it could automatically remove incompat dir upon
> > > umount.
> >
> > could this be handled in user space?  It should still be possible to do
> > the equivalent of:
> >
> > # sync -f /root/upperdir
> > # rm -rf /root/workdir/incompat/volatile
> >
> 
> FWIW, the sync -f command above is
> 1. Not needed when re-mounting overlayfs as volatile
> 2. Not enough when re-mounting overlayfs as non-volatile
> 
> In the latter case, a full sync (no -f) is required.
> 
> Handling this in userspace is the preferred option IMO,
> but if there is an *appealing* reason to allow opportunistic
> volatile overlayfs re-mount as long as the upperdir inode
> is in cache (userspace can make sure of that), then
> all I am saying is that it is possible and not terribly hard.

Hi Amir,

Taking a step back, I am wondering what the problems are if we
remount a volatile mount after a system crash. I mean, how is it different
from a non-volatile mount after a crash.

One difference which comes to mind is that an application might have
done fsync, and after remount it will expect those changes to have made it
to persistent storage and be available. With a volatile mount such a
guarantee cannot be given.

Can we keep track of whether we skipped any sync/fsync or not? If we did
not, we can delete the incompat directory on umount, allowing the next
mount to succeed without any user intervention.

This probably means that there needs to be a variant of umount() which
does not request a sync, and container tools need to do a umount without
requesting a sync. Or maybe the very fact that container tools/apps mounted
overlay "volatile" means they already opted out of syncing on umount, so
they can't expect any guarantees of data hitting the disk after umount.

IOW, is it ok to remove the "incompat" directory if the application never
did an fsync? I don't know how common that is, though, because the problem
we faced was an excessive amount of fsync. So keeping count of fsyncs might
not help at all.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-09 17:09         ` Vivek Goyal
@ 2020-11-09 17:20           ` Amir Goldstein
  0 siblings, 0 replies; 20+ messages in thread
From: Amir Goldstein @ 2020-11-09 17:20 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Giuseppe Scrivano, Sargun Dhillon, overlayfs, Miklos Szeredi,
	Daniel J Walsh

On Mon, Nov 9, 2020 at 7:09 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Sat, Nov 07, 2020 at 11:35:04AM +0200, Amir Goldstein wrote:
> > On Fri, Nov 6, 2020 at 9:43 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote:
> > >
> > > Vivek Goyal <vgoyal@redhat.com> writes:
> > >
> > > > On Fri, Nov 06, 2020 at 09:58:39AM -0800, Sargun Dhillon wrote:
> > > >
> > > > [..]
> > > >> There is some slightly confusing behaviour here [I realize this
> > > >> behaviour is as intended]:
> > > >>
> > > >> (root) ~ # mount -t overlay -o
> > > >> volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > > >> none /mnt/foo
> > > >> (root) ~ # umount /mnt/foo
> > > >> (root) ~ # mount -t overlay -o
> > > >> volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > > >> none /mnt/foo
> > > >> mount: /mnt/foo: wrong fs type, bad option, bad superblock on none,
> > > >> missing codepage or helper program, or other error.
> > > >>
> > > >> From my understanding, the dirty flag should only be a problem if the
> > > >> existing overlayfs is unmounted uncleanly. Docker does
> > > >> this (mount, and re-mounts) during startup time because it writes some
> > > >> files to the overlayfs. I think that we should harden
> > > >> the volatile check slightly, and make it so that within the same boot,
> > > >> it's not a problem, and having to have the user clear
> > > >> the workdir every time is a pain. In addition, the semantics of the
> > > >> volatile patch itself do not appear to be such that they
> > > >> would break mounts during the same boot / mount of upperdir -- as
> > > >> overlayfs does not defer any writes in itself, and it's
> > > >> only that it's short-circuiting writes to the upperdir.
> > > >
> > > > umount does a sync normally and with "volatile" overlayfs skips that
> > > > sync. So a successful unmount does not mean that file got synced
> > > > to backing store. It is possible, after umount, system crashed
> > > > and after reboot, user tried to mount upper which is corrupted
> > > > now and overlay will not detect it.
> > > >
> > > > You seem to be asking for an alternate option where we disable
> > > > fsync() but not syncfs. In that case sync on umount will still
> > > > be done. And that means a successful umount should mean upper
> > > > is fine and it could automatically remove incompat dir upon
> > > > umount.
> > >
> > > could this be handled in user space?  It should still be possible to do
> > > the equivalent of:
> > >
> > > # sync -f /root/upperdir
> > > # rm -rf /root/workdir/incompat/volatile
> > >
> >
> > FWIW, the sync -f command above is
> > 1. Not needed when re-mounting overlayfs as volatile
> > 2. Not enough when re-mounting overlayfs as non-volatile
> >
> > In the latter case, a full sync (no -f) is required.
> >
> > Handling this in userspace is the preferred option IMO,
> > but if there is an *appealing* reason to allow opportunistic
> > volatile overlayfs re-mount as long as the upperdir inode
> > is in cache (userspace can make sure of that), then
> > all I am saying is that it is possible and not terribly hard.
>
> Hi Amir,
>
> Taking a step back, I am wondering what the problems are if we
> remount a volatile mount after a system crash. I mean, how is it different
> from a non-volatile mount after a crash.
>
> One difference which comes to mind is that an application might have
> done fsync, and after remount it will expect those changes to have made it
> to persistent storage and be available. With a volatile mount such a
> guarantee cannot be given.
>
> Can we keep track of whether we skipped any sync/fsync or not? If we did
> not, we can delete the incompat directory on umount, allowing the next
> mount to succeed without any user intervention.
>
> This probably means that there needs to be a variant of umount() which
> does not request a sync, and container tools need to do a umount without
> requesting a sync. Or maybe the very fact that container tools/apps mounted
> overlay "volatile" means they already opted out of syncing on umount, so
> they can't expect any guarantees of data hitting the disk after umount.
>
> IOW, is it ok to remove the "incompat" directory if the application never
> did an fsync? I don't know how common that is, though, because the problem
> we faced was an excessive amount of fsync. So keeping count of fsyncs might
> not help at all.
>

Lots of applications do fsync of course.
Also copy up does fsync before moving the upper file into place.
Without this fsync (in volatile mode), upper files could very well be
corrupted even if applications never wrote anything to them and never
did fsync themselves.

So is there a good reason to defer creation of incompat dir until
the first copy up or fsync? I don't think so.

Thanks,
Amir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-06 19:00   ` Amir Goldstein
  2020-11-06 19:20     ` Vivek Goyal
@ 2020-11-09 17:22     ` Vivek Goyal
  2020-11-09 17:25       ` Sargun Dhillon
  1 sibling, 1 reply; 20+ messages in thread
From: Vivek Goyal @ 2020-11-09 17:22 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Giuseppe Scrivano,
	Daniel J Walsh

On Fri, Nov 06, 2020 at 09:00:07PM +0200, Amir Goldstein wrote:
> On Fri, Nov 6, 2020 at 7:59 PM Sargun Dhillon <sargun@sargun.me> wrote:
> >
> > On Mon, Aug 31, 2020 at 11:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > Container folks are complaining that dnf/yum issues too many sync while
> > > installing packages and this slows down the image build. Build
> > > requirement is such that they don't care if a node goes down while
> > > build was still going on. In that case, they will simply throw away
> > > unfinished layer and start new build. So they don't care about syncing
> > > intermediate state to the disk and hence don't want to pay the price
> > > associated with sync.
> > >
> > > So they are asking for mount options where they can disable sync on overlay
> > > mount point.
> > >
> > > They primarily seem to have two use cases.
> > >
> > > - For building images, they will mount overlay with nosync and then sync
> > >   upper layer after unmounting overlay and reuse upper as lower for next
> > >   layer.
> > >
> > > - For running containers, they don't seem to care about syncing upper
> > >   layer because if node goes down, they will simply throw away upper
> > >   layer and create a fresh one.
> > >
> > > So this patch provides a mount option "volatile" which disables all forms
> > > of sync. Now it is caller's responsibility to throw away upper if
> > > system crashes or shuts down and start fresh.
> > >
> > > With "volatile", I am seeing roughly 20% speed up in my VM where I am just
> > > installing emacs in an image. Installation time drops from 31 seconds to
> > > 25 seconds when nosync option is used. This is for the case of building on top
> > > of an image where all packages are already cached. That way I take
> > > out the network operations latency out of the measurement.
> > >
> > > Giuseppe is also looking to cut down on number of iops done on the
> > > disk. He is complaining that often in cloud their VMs are throttled
> > > if they cross the limit. This option can help them where they reduce
> > > number of iops (by cutting down on frequent sync and writebacks).
> > >
> [...]
> > There is some slightly confusing behaviour here [I realize this
> > behaviour is as intended]:
> >
> > (root) ~ # mount -t overlay -o
> > volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > none /mnt/foo
> > (root) ~ # umount /mnt/foo
> > (root) ~ # mount -t overlay -o
> > volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > none /mnt/foo
> > mount: /mnt/foo: wrong fs type, bad option, bad superblock on none,
> > missing codepage or helper program, or other error.
> >
> > From my understanding, the dirty flag should only be a problem if the
> > existing overlayfs is unmounted uncleanly. Docker does
> > this (mount, and re-mounts) during startup time because it writes some
> > files to the overlayfs. I think that we should harden
> > the volatile check slightly, and make it so that within the same boot,
> > it's not a problem, and having to have the user clear
> > the workdir every time is a pain. In addition, the semantics of the
> > volatile patch itself do not appear to be such that they
> > would break mounts during the same boot / mount of upperdir -- as
> > overlayfs does not defer any writes in itself, and it's
> > only that it's short-circuiting writes to the upperdir.
> >
> > Amir,
> > What do you think?
> 
> How do you propose to check that upperdir was used during the same boot?

Can we read and store "/proc/sys/kernel/random/boot_id"? I am assuming
this will change if the system comes back up after a shutdown/reboot/crash.

If boot_id has not changed, we can allow the remount and delete the incompat
dir ourselves. Maybe we can drop a file in incompat to store the boot_id
at the time of overlay mount.
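
A rough userspace sketch of that check (the recorded-file path is
hypothetical, just to illustrate the idea; a kernel-side implementation
would do the equivalent comparison at mount time):

```shell
# Hypothetical helper: succeeds if the boot_id recorded at mount time
# (in a file dropped under the incompat dir) matches the current boot,
# i.e. the machine has not rebooted or crashed since.
same_boot() {
    recorded=$1   # e.g. /root/workdir/incompat/volatile/boot_id (hypothetical)
    [ -f "$recorded" ] &&
        [ "$(cat "$recorded")" = "$(cat /proc/sys/kernel/random/boot_id)" ]
}
```

If same_boot fails, the upperdir would have to be discarded; otherwise the
remount could be allowed.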

Thanks
Vivek


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-09 17:22     ` Vivek Goyal
@ 2020-11-09 17:25       ` Sargun Dhillon
  2020-11-09 19:39         ` Amir Goldstein
  0 siblings, 1 reply; 20+ messages in thread
From: Sargun Dhillon @ 2020-11-09 17:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Amir Goldstein, overlayfs, Miklos Szeredi, Giuseppe Scrivano,
	Daniel J Walsh

On Mon, Nov 9, 2020 at 9:22 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Nov 06, 2020 at 09:00:07PM +0200, Amir Goldstein wrote:
> > On Fri, Nov 6, 2020 at 7:59 PM Sargun Dhillon <sargun@sargun.me> wrote:
> > >
> > > On Mon, Aug 31, 2020 at 11:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > Container folks are complaining that dnf/yum issues too many sync while
> > > > installing packages and this slows down the image build. Build
> > > > requirement is such that they don't care if a node goes down while
> > > > build was still going on. In that case, they will simply throw away
> > > > unfinished layer and start new build. So they don't care about syncing
> > > > intermediate state to the disk and hence don't want to pay the price
> > > > associated with sync.
> > > >
> > > > So they are asking for mount options where they can disable sync on overlay
> > > > mount point.
> > > >
> > > > They primarily seem to have two use cases.
> > > >
> > > > - For building images, they will mount overlay with nosync and then sync
> > > >   upper layer after unmounting overlay and reuse upper as lower for next
> > > >   layer.
> > > >
> > > > - For running containers, they don't seem to care about syncing upper
> > > >   layer because if node goes down, they will simply throw away upper
> > > >   layer and create a fresh one.
> > > >
> > > > So this patch provides a mount option "volatile" which disables all forms
> > > > of sync. Now it is caller's responsibility to throw away upper if
> > > > system crashes or shuts down and start fresh.
> > > >
> > > > With "volatile", I am seeing roughly 20% speed up in my VM where I am just
> > > > installing emacs in an image. Installation time drops from 31 seconds to
> > > > 25 seconds when nosync option is used. This is for the case of building on top
> > > > of an image where all packages are already cached. That way I take
> > > > out the network operations latency out of the measurement.
> > > >
> > > > Giuseppe is also looking to cut down on number of iops done on the
> > > > disk. He is complaining that often in cloud their VMs are throttled
> > > > if they cross the limit. This option can help them where they reduce
> > > > number of iops (by cutting down on frequent sync and writebacks).
> > > >
> > [...]
> > > There is some slightly confusing behaviour here [I realize this
> > > behaviour is as intended]:
> > >
> > > (root) ~ # mount -t overlay -o
> > > volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > > none /mnt/foo
> > > (root) ~ # umount /mnt/foo
> > > (root) ~ # mount -t overlay -o
> > > volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > > none /mnt/foo
> > > mount: /mnt/foo: wrong fs type, bad option, bad superblock on none,
> > > missing codepage or helper program, or other error.
> > >
> > > From my understanding, the dirty flag should only be a problem if the
> > > existing overlayfs is unmounted uncleanly. Docker does
> > > this (mount, and re-mounts) during startup time because it writes some
> > > files to the overlayfs. I think that we should harden
> > > the volatile check slightly, and make it so that within the same boot,
> > > it's not a problem, and having to have the user clear
> > > the workdir every time is a pain. In addition, the semantics of the
> > > volatile patch itself do not appear to be such that they
> > > would break mounts during the same boot / mount of upperdir -- as
> > > overlayfs does not defer any writes in itself, and it's
> > > only that it's short-circuiting writes to the upperdir.
> > >
> > > Amir,
> > > What do you think?
> >
> > How do you propose to check that upperdir was used during the same boot?
>
> Can we read and store "/proc/sys/kernel/random/boot_id"? I am assuming
> this will change if the system comes back up after a shutdown/reboot/crash.
>
> If boot_id has not changed, we can allow the remount and delete the incompat
> dir ourselves. Maybe we can drop a file in incompat to store the boot_id
> at the time of overlay mount.
>
> Thanks
> Vivek
>

Storing the boot_id is not good enough. You need to store the identity of
the superblock, because remounts can occur. Also, if errors happened after
flushing pages through writeback, they may never have been reported to the
user, so we need to check whether those happened as well.
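
For illustration, the filesystem-ID part of that identity is already visible
to userspace via statfs(2), e.g. through stat(1); the path below is
illustrative only, and this alone would not catch unreported writeback
errors:

```shell
# Print the ID of the filesystem backing a directory; two paths on the
# same superblock report the same value, so comparing it across mounts
# is one (partial) way to check "same backing filesystem".
backing_fsid() {
    stat -f -c %i "$1"    # %i = filesystem ID in hex, from statfs(2)
}
```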

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-09 17:25       ` Sargun Dhillon
@ 2020-11-09 19:39         ` Amir Goldstein
  2020-11-09 20:24           ` Vivek Goyal
  0 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2020-11-09 19:39 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Vivek Goyal, overlayfs, Miklos Szeredi, Giuseppe Scrivano,
	Daniel J Walsh

On Mon, Nov 9, 2020 at 7:26 PM Sargun Dhillon <sargun@sargun.me> wrote:
>
> On Mon, Nov 9, 2020 at 9:22 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Fri, Nov 06, 2020 at 09:00:07PM +0200, Amir Goldstein wrote:
> > > On Fri, Nov 6, 2020 at 7:59 PM Sargun Dhillon <sargun@sargun.me> wrote:
> > > >
> > > > On Mon, Aug 31, 2020 at 11:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > >
> > > > > Container folks are complaining that dnf/yum issues too many sync while
> > > > > installing packages and this slows down the image build. Build
> > > > > requirement is such that they don't care if a node goes down while
> > > > > build was still going on. In that case, they will simply throw away
> > > > > unfinished layer and start new build. So they don't care about syncing
> > > > > intermediate state to the disk and hence don't want to pay the price
> > > > > associated with sync.
> > > > >
> > > > > So they are asking for mount options where they can disable sync on overlay
> > > > > mount point.
> > > > >
> > > > > They primarily seem to have two use cases.
> > > > >
> > > > > - For building images, they will mount overlay with nosync and then sync
> > > > >   upper layer after unmounting overlay and reuse upper as lower for next
> > > > >   layer.
> > > > >
> > > > > - For running containers, they don't seem to care about syncing upper
> > > > >   layer because if node goes down, they will simply throw away upper
> > > > >   layer and create a fresh one.
> > > > >
> > > > > So this patch provides a mount option "volatile" which disables all forms
> > > > > of sync. Now it is caller's responsibility to throw away upper if
> > > > > system crashes or shuts down and start fresh.
> > > > >
> > > > > With "volatile", I am seeing roughly 20% speed up in my VM where I am just
> > > > > installing emacs in an image. Installation time drops from 31 seconds to
> > > > > 25 seconds when nosync option is used. This is for the case of building on top
> > > > > of an image where all packages are already cached. That way I take
> > > > > out the network operations latency out of the measurement.
> > > > >
> > > > > Giuseppe is also looking to cut down on number of iops done on the
> > > > > disk. He is complaining that often in cloud their VMs are throttled
> > > > > if they cross the limit. This option can help them where they reduce
> > > > > number of iops (by cutting down on frequent sync and writebacks).
> > > > >
> > > [...]
> > > > There is some slightly confusing behaviour here [I realize this
> > > > behaviour is as intended]:
> > > >
> > > > (root) ~ # mount -t overlay -o
> > > > volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > > > none /mnt/foo
> > > > (root) ~ # umount /mnt/foo
> > > > (root) ~ # mount -t overlay -o
> > > > volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > > > none /mnt/foo
> > > > mount: /mnt/foo: wrong fs type, bad option, bad superblock on none,
> > > > missing codepage or helper program, or other error.
> > > >
> > > > From my understanding, the dirty flag should only be a problem if the
> > > > existing overlayfs is unmounted uncleanly. Docker does
> > > > this (mount, and re-mounts) during startup time because it writes some
> > > > files to the overlayfs. I think that we should harden
> > > > the volatile check slightly, and make it so that within the same boot,
> > > > it's not a problem, and having to have the user clear
> > > > the workdir every time is a pain. In addition, the semantics of the
> > > > volatile patch itself do not appear to be such that they
> > > > would break mounts during the same boot / mount of upperdir -- as
> > > > overlayfs does not defer any writes in itself, and it's
> > > > only that it's short-circuiting writes to the upperdir.
> > > >
> > > > Amir,
> > > > What do you think?
> > >
> > > How do you propose to check that upperdir was used during the same boot?
> >
> > Can we read and store "/proc/sys/kernel/random/boot_id"? I am assuming
> > this will change if the system comes back up after a shutdown/reboot/crash.
> >
> > If boot_id has not changed, we can allow the remount and delete the incompat
> > dir ourselves. Maybe we can drop a file in incompat to store the boot_id
> > at the time of overlay mount.
> >
> > Thanks
> > Vivek
> >
>
> Storing boot_id is not good enough. You need to store the identity of the
> superblock, because remounts can occur. Also, if errors happen
> after flushing pages through writeback, they may never have been reported
> to the user, so we need to see if those happened as well.

It is not clear to me what problem we are trying to solve.
What is wrong with the userspace option to remove the dirty file?

Docker has to be changed anyway to use the 'volatile' mount option,
right?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-09 19:39         ` Amir Goldstein
@ 2020-11-09 20:24           ` Vivek Goyal
  0 siblings, 0 replies; 20+ messages in thread
From: Vivek Goyal @ 2020-11-09 20:24 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Sargun Dhillon, overlayfs, Miklos Szeredi, Giuseppe Scrivano,
	Daniel J Walsh

On Mon, Nov 09, 2020 at 09:39:59PM +0200, Amir Goldstein wrote:
> On Mon, Nov 9, 2020 at 7:26 PM Sargun Dhillon <sargun@sargun.me> wrote:
> >
> > On Mon, Nov 9, 2020 at 9:22 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Fri, Nov 06, 2020 at 09:00:07PM +0200, Amir Goldstein wrote:
> > > > On Fri, Nov 6, 2020 at 7:59 PM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > >
> > > > > On Mon, Aug 31, 2020 at 11:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > >
> > > > > > Container folks are complaining that dnf/yum issues too many sync while
> > > > > > installing packages and this slows down the image build. Build
> > > > > > requirement is such that they don't care if a node goes down while
> > > > > > build was still going on. In that case, they will simply throw away
> > > > > > unfinished layer and start new build. So they don't care about syncing
> > > > > > intermediate state to the disk and hence don't want to pay the price
> > > > > > associated with sync.
> > > > > >
> > > > > > So they are asking for mount options where they can disable sync on overlay
> > > > > > mount point.
> > > > > >
> > > > > > They primarily seem to have two use cases.
> > > > > >
> > > > > > - For building images, they will mount overlay with nosync and then sync
> > > > > >   upper layer after unmounting overlay and reuse upper as lower for next
> > > > > >   layer.
> > > > > >
> > > > > > - For running containers, they don't seem to care about syncing upper
> > > > > >   layer because if node goes down, they will simply throw away upper
> > > > > >   layer and create a fresh one.
> > > > > >
> > > > > > So this patch provides a mount option "volatile" which disables all forms
> > > > > > of sync. Now it is caller's responsibility to throw away upper if
> > > > > > system crashes or shuts down and start fresh.
> > > > > >
> > > > > > With "volatile", I am seeing roughly 20% speed up in my VM where I am just
> > > > > > installing emacs in an image. Installation time drops from 31 seconds to
> > > > > > 25 seconds when nosync option is used. This is for the case of building on top
> > > > > > of an image where all packages are already cached. That way I take
> > > > > > the network operations latency out of the measurement.
> > > > > >
> > > > > > Giuseppe is also looking to cut down on number of iops done on the
> > > > > > disk. He is complaining that often in cloud their VMs are throttled
> > > > > > if they cross the limit. This option can help them where they reduce
> > > > > > number of iops (by cutting down on frequent sync and writebacks).
> > > > > >
> > > > [...]
> > > > > There is some slightly confusing behaviour here [I realize this
> > > > > behaviour is as intended]:
> > > > >
> > > > > (root) ~ # mount -t overlay -o
> > > > > volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > > > > none /mnt/foo
> > > > > (root) ~ # umount /mnt/foo
> > > > > (root) ~ # mount -t overlay -o
> > > > > volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > > > > none /mnt/foo
> > > > > mount: /mnt/foo: wrong fs type, bad option, bad superblock on none,
> > > > > missing codepage or helper program, or other error.
> > > > >
> > > > > From my understanding, the dirty flag should only be a problem if the
> > > > > existing overlayfs is unmounted uncleanly. Docker does
> > > > > this (mount, and re-mounts) during startup time because it writes some
> > > > > files to the overlayfs. I think that we should harden
> > > > > the volatile check slightly, and make it so that within the same boot,
> > > > > it's not a problem, and having to have the user clear
> > > > > the workdir every time is a pain. In addition, the semantics of the
> > > > > volatile patch itself do not appear to be such that they
> > > > > would break mounts during the same boot / mount of upperdir -- as
> > > > > overlayfs does not defer any writes in itself, and it's
> > > > > only that it's short-circuiting writes to the upperdir.
> > > > >
> > > > > Amir,
> > > > > What do you think?
> > > >
> > > > How do you propose to check that upperdir was used during the same boot?
> > >
> > > Can we read and store "/proc/sys/kernel/random/boot_id"? I am assuming
> > > this will change if the system reboots after a shutdown/reboot/crash.
> > >
> > > If boot_id has not changed, we can allow remount and delete the incompat
> > > dir ourselves. Maybe we can drop a file in incompat to store boot_id
> > > at the time of overlay mount.
> > >
> > > Thanks
> > > Vivek
> > >
> >
> > Storing boot_id is not good enough. You need to store the identity of the
> > superblock, because remounts can occur. Also, if errors happen
> > after flushing pages through writeback, they may never have been reported
> > to the user, so we need to see if those happened as well.
> 
> It is not clear to me what problem we are trying to solve.
> What is wrong with the userspace option to remove the dirty file?
> 
> Docker has to be changed anyway to use the 'volatile' mount option,
> right?

Is this about detecting, on remount, a writeback error which might
have happened after the umount of a volatile overlay?

But that should be doable in user space too. That is, when syncfs
is issued on upper/, it should return an error if something failed.

Having said that, I guess Sargun does not want to issue sync on
upper/ due to its effect on other containers' latencies. He probably
wants normal writeback, and if there is an error in that writeback,
to detect it upon the next mount of upper/. And all this would be
done by keeping track of upper's superblock id and errseq_t
somewhere in overlay.
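
A toy userspace model of that bookkeeping (names are hypothetical; in
the kernel this would presumably build on errseq_sample()/errseq_check()
against the upper superblock):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record overlayfs would persist in the workdir at mount
 * time: an identity for the upper superblock plus the writeback error
 * cursor sampled at that moment. */
struct ovl_volatile_info {
	uint64_t sb_id;   /* identity of the upper sb instance */
	uint32_t errseq;  /* error cursor, like errseq_sample() */
};

/* On the next mount: the upper layer is only reusable if it is the same
 * sb instance (no reboot/remount in between) and no writeback error has
 * been recorded since the sample was taken. */
static int ovl_upper_reusable(const struct ovl_volatile_info *saved,
			      uint64_t cur_sb_id, uint32_t cur_errseq)
{
	if (saved->sb_id != cur_sb_id)
		return 0;	/* different sb: upper state untrustworthy */
	if (saved->errseq != cur_errseq)
		return 0;	/* a writeback error happened in between */
	return 1;
}

int main(void)
{
	struct ovl_volatile_info info = { .sb_id = 42, .errseq = 7 };

	printf("%d\n", ovl_upper_reusable(&info, 42, 7)); /* 1: reusable */
	printf("%d\n", ovl_upper_reusable(&info, 42, 9)); /* 0: wb error */
	printf("%d\n", ovl_upper_reusable(&info, 43, 7)); /* 0: sb changed */
	return 0;
}
```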

I have not looked at this patch yet, just guessing...

Vivek



* Re: [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync
  2020-11-07 11:52         ` Sargun Dhillon
@ 2020-11-09 20:40           ` Vivek Goyal
  0 siblings, 0 replies; 20+ messages in thread
From: Vivek Goyal @ 2020-11-09 20:40 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Amir Goldstein, Giuseppe Scrivano, overlayfs, Miklos Szeredi,
	Daniel J Walsh

On Sat, Nov 07, 2020 at 11:52:27AM +0000, Sargun Dhillon wrote:
> On Sat, Nov 07, 2020 at 11:35:04AM +0200, Amir Goldstein wrote:
> > On Fri, Nov 6, 2020 at 9:43 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote:
> > >
> > > Vivek Goyal <vgoyal@redhat.com> writes:
> > >
> > > > On Fri, Nov 06, 2020 at 09:58:39AM -0800, Sargun Dhillon wrote:
> > > >
> > > > [..]
> > > >> There is some slightly confusing behaviour here [I realize this
> > > >> behaviour is as intended]:
> > > >>
> > > >> (root) ~ # mount -t overlay -o
> > > >> volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > > >> none /mnt/foo
> > > >> (root) ~ # umount /mnt/foo
> > > >> (root) ~ # mount -t overlay -o
> > > >> volatile,index=off,lowerdir=/root/lowerdir,upperdir=/root/upperdir,workdir=/root/workdir
> > > >> none /mnt/foo
> > > >> mount: /mnt/foo: wrong fs type, bad option, bad superblock on none,
> > > >> missing codepage or helper program, or other error.
> > > >>
> > > >> From my understanding, the dirty flag should only be a problem if the
> > > >> existing overlayfs is unmounted uncleanly. Docker does
> > > >> this (mount, and re-mounts) during startup time because it writes some
> > > >> files to the overlayfs. I think that we should harden
> > > >> the volatile check slightly, and make it so that within the same boot,
> > > >> it's not a problem, and having to have the user clear
> > > >> the workdir every time is a pain. In addition, the semantics of the
> > > >> volatile patch itself do not appear to be such that they
> > > >> would break mounts during the same boot / mount of upperdir -- as
> > > >> overlayfs does not defer any writes in itself, and it's
> > > >> only that it's short-circuiting writes to the upperdir.
> > > >
> > > > umount normally does a sync, and with "volatile" overlayfs skips that
> > > > sync. So a successful umount does not mean that files got synced
> > > > to the backing store. It is possible that after umount the system
> > > > crashed, and after reboot the user tries to mount an upper which is
> > > > now corrupted, and overlay will not detect it.
> > > >
> We explicitly disable this in our infrastructure via a small kernel patch that 
> stubs out the sync behaviour. IIRC, it was added some time after 4.15, and when 
> we picked up the related overlayfs patch it caused a lot of machines to crash.
> 
> This was due to high container churn -- and other containers having a lot of 
> outstanding dirty pages at exit time. When we would tear down their mounts, 
> syncfs would get called [on the entire underlying device / fs], and that would 
> stall out all of the containers on the machine. We really do not want this 
> behaviour.
> 
> > > > You seem to be asking for an alternate option where we disable
> > > > fsync() but not syncfs. In that case the sync on umount will still
> > > > be done, which means a successful umount implies the upper is
> > > > fine and overlayfs could automatically remove the incompat dir
> > > > upon umount.
> > >
> > > could this be handled in user space?  It should still be possible to do
> > > the equivalent of:
> > >
> > > # sync -f /root/upperdir
> > > # rm -rf /root/workdir/incompat/volatile
> > >
> > 
> > FWIW, the sync -f command above is
> > 1. Not needed when re-mounting overlayfs as volatile
> > 2. Not enough when re-mounting overlayfs as non-volatile
> > 
> > In the latter case, a full sync (no -f) is required.
> > 
> > Handling this is userspace is the preferred option IMO,
> > but if there is an *appealing* reason to allow opportunistic
> > volatile overlayfs re-mount as long as the upperdir inode
> > is in cache (userspace can make sure of that), then
> > all I am saying is that it is possible and not terribly hard.
> > 
> > Thanks,
> > Amir.
> 
> 
> I think I have two approaches in mind that are viable. Both approaches rely on 
> adding a small amount of data (either via an xattr, or data in the file itself) 
> that allows us to ascertain whether or not the upperdir is okay to reuse, even 
> when it was mounted volatile:
> 
> 1. We introduce a guid to the superblock structure itself. I think that this 
> would actually be valuable independently from overlayfs in order to do things 
> like "my database restarted, should it do an integrity check, or is the same SB 
> mounted?" I started down the route of cooking up an ioctl for this, but I think 
> that this is killing a mosquito with a cannon. Perhaps this approach is the 
> right long-term approach, but I don't think it'll be easy to get through.
> 

> 2. I've started cooking up this patch a little bit more where we override 
> kill_sb. Specifically, we assign kill_sb on the upperdir / workdir to our own 
> killsb, and keep track of superblocks, and the errseq on the super block. We 
> then keep a list of tracked superblocks in memory, the last observed errseq, and 
> a guid. Upon mount of the overlayfs, we write a key that uniquely 
> identifies the sb + errseq. Upon remount, we check whether the errseq or 
> the sb has changed. If so, we throw an error. Otherwise, we allow things to pass.
> 
> This approach has seen some usage in net[1].

So what happens if the system crashed and you are booting back up? Do
you throw away all the containers?

The mechanism you described above sounds like you want to detect writeback
errors during the next mount and fail that mount (and possibly throw away
the container)?

Say I start 5 containers with overlay mounted volatile, and these
containers exit. Later, 4 new containers are started and some
error happens in writeback. Now if I restart any of the first
5 containers, they will all see the error, right? And they will
all fail to start. Is that what you are trying to achieve, or
have I missed the point completely?

Thanks
Vivek



end of thread, other threads:[~2020-11-09 20:40 UTC | newest]

Thread overview: 20+ messages
2020-08-31 18:15 [PATCH v7] overlayfs: Provide a mount option "volatile" to skip sync Vivek Goyal
2020-09-01  8:22 ` Amir Goldstein
2020-09-01 13:14   ` Vivek Goyal
2020-11-06 17:58 ` Sargun Dhillon
2020-11-06 19:00   ` Amir Goldstein
2020-11-06 19:20     ` Vivek Goyal
2020-11-09 17:22     ` Vivek Goyal
2020-11-09 17:25       ` Sargun Dhillon
2020-11-09 19:39         ` Amir Goldstein
2020-11-09 20:24           ` Vivek Goyal
2020-11-06 19:03   ` Vivek Goyal
2020-11-06 19:42     ` Giuseppe Scrivano
2020-11-07  9:35       ` Amir Goldstein
2020-11-07 11:52         ` Sargun Dhillon
2020-11-09 20:40           ` Vivek Goyal
2020-11-09  8:53         ` Giuseppe Scrivano
2020-11-09 10:10           ` Amir Goldstein
2020-11-09 16:36         ` Vivek Goyal
2020-11-09 17:09         ` Vivek Goyal
2020-11-09 17:20           ` Amir Goldstein
