linux-unionfs.vger.kernel.org archive mirror
* [RFC PATCH] overlayfs: Provide a mount option "nosync" to skip sync
@ 2020-06-30 19:37 Vivek Goyal
  2020-07-01 10:31 ` Amir Goldstein
  2020-07-16 20:41 ` Vivek Goyal
  0 siblings, 2 replies; 5+ messages in thread
From: Vivek Goyal @ 2020-06-30 19:37 UTC
  To: linux-unionfs, miklos
  Cc: amir73il, gscrivan, pmatilai, dwalsh, swhiteho, sandeen

Container folks are complaining that dnf/yum issues too many syncs while
installing packages and that this slows down image builds. Their build
requirements are such that they don't care if a node goes down while a
build is still in progress. In that case, they simply throw away the
unfinished layer and start a new build. So they don't care about syncing
intermediate state to disk and hence don't want to pay the price
associated with sync.

So they are asking for an option where they can disable sync on overlay
mount point completely and user space will do sync management on upper
layer as needed.

They primarily seem to have two use cases.

- For building images, they will mount overlay with nosync and then sync
  upper layer after unmounting overlay and reuse upper as lower for next
  layer.

- For running containers, they don't seem to care about syncing upper
  layer because if node goes down, they will simply throw away upper
  layer and create a fresh one.

So this patch provides a mount option "nosync" which disables all forms
of sync. Now it is caller's responsibility to manage sync of upper layer
before it is reused again.

I am seeing roughly a 20% speedup in my VM, where I am just installing
emacs in an image. Installation time drops from 31 seconds to 25 seconds
when the nosync option is used. This is for the case of building on top
of an image where all packages are already cached. That way I take
network latency out of the measurement.

Giuseppe is also looking to cut down on the number of IOPS done on the
disk. He is complaining that in the cloud their VMs are often throttled
if they cross the limit. This option can help there by reducing the
number of IOPS (by cutting down on frequent syncs and writebacks).

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 Documentation/filesystems/overlayfs.rst | 20 ++++++++++++++++++++
 fs/overlayfs/copy_up.c                  | 12 ++++++++----
 fs/overlayfs/file.c                     | 11 ++++++++++-
 fs/overlayfs/ovl_entry.h                |  1 +
 fs/overlayfs/readdir.c                  |  3 +++
 fs/overlayfs/super.c                    | 23 ++++++++++++++++++++---
 6 files changed, 62 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 660dbaf0b9b8..0a42f26a3f0c 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -563,6 +563,26 @@ This verification may cause significant overhead in some cases.
 Note: the mount options index=off,nfs_export=on are conflicting and will
 result in an error.
 
+Disable sync
+------------
+By default, overlay skips sync on files residing on a lower layer.  It
+is possible to skip sync operations for files on the upper layer as well
+with the 'nosync' mount option. This option disables all forms of sync
+from overlay, including the one done at umount/remount and it is
+user's responsibility to sync upper layer on the file system it
+is residing.
+
+With this option, data loss will happen if overlayfs upper layer is
+not synced. So use this option very carefully. This is only for the
+use cases where users discard upper layer if they could not sync it
+successfully.
+
+Typically workflow will be.
+
+- mount overlay
+- Do bunch of operations
+- unmount overlay
+- sync filesystem containing upper layer
 
 Testsuite
 ---------
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 79dd052c7dbf..5431a89bbd8a 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -128,7 +128,8 @@ int ovl_copy_xattr(struct dentry *old, struct dentry *new)
 	return error;
 }
 
-static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
+static int ovl_copy_up_data(struct ovl_fs *ofs, struct path *old,
+			    struct path *new, loff_t len)
 {
 	struct file *old_file;
 	struct file *new_file;
@@ -218,7 +219,7 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
 		len -= bytes;
 	}
 out:
-	if (!error)
+	if (!error && !ofs->config.nosync)
 		error = vfs_fsync(new_file, 0);
 	fput(new_file);
 out_fput:
@@ -484,6 +485,7 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
 
 static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
 {
+	struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
 	int err;
 
 	/*
@@ -499,7 +501,8 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
 		upperpath.dentry = temp;
 
 		ovl_path_lowerdata(c->dentry, &datapath);
-		err = ovl_copy_up_data(&datapath, &upperpath, c->stat.size);
+		err = ovl_copy_up_data(ofs, &datapath, &upperpath,
+				       c->stat.size);
 		if (err)
 			return err;
 	}
@@ -784,6 +787,7 @@ static bool ovl_need_meta_copy_up(struct dentry *dentry, umode_t mode,
 /* Copy up data of an inode which was copied up metadata only in the past. */
 static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
 {
+	struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
 	struct path upperpath, datapath;
 	int err;
 	char *capability = NULL;
@@ -804,7 +808,7 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
 			goto out;
 	}
 
-	err = ovl_copy_up_data(&datapath, &upperpath, c->stat.size);
+	err = ovl_copy_up_data(ofs, &datapath, &upperpath, c->stat.size);
 	if (err)
 		goto out_free;
 
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 01820e654a21..a361890a8d05 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -329,6 +329,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	struct fd real;
 	const struct cred *old_cred;
 	ssize_t ret;
+	int ifl = iocb->ki_flags;
 
 	if (!iov_iter_count(iter))
 		return 0;
@@ -344,11 +345,14 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	if (ret)
 		goto out_unlock;
 
+	if (OVL_FS(inode->i_sb)->config.nosync)
+		ifl &= ~(IOCB_DSYNC | IOCB_SYNC);
+
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	if (is_sync_kiocb(iocb)) {
 		file_start_write(real.file);
 		ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
-				     ovl_iocb_to_rwf(iocb->ki_flags));
+				     ovl_iocb_to_rwf(ifl));
 		file_end_write(real.file);
 		/* Update size */
 		ovl_copyattr(ovl_inode_real(inode), inode);
@@ -368,6 +372,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 		real.flags = 0;
 		aio_req->orig_iocb = iocb;
 		kiocb_clone(&aio_req->iocb, iocb, real.file);
+		aio_req->iocb.ki_flags = ifl;
 		aio_req->iocb.ki_complete = ovl_aio_rw_complete;
 		ret = vfs_iocb_iter_write(real.file, &aio_req->iocb, iter);
 		if (ret != -EIOCBQUEUED)
@@ -430,6 +435,10 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	struct fd real;
 	const struct cred *old_cred;
 	int ret;
+	struct ovl_fs *ofs = OVL_FS(file_inode(file)->i_sb);
+
+	if (ofs->config.nosync)
+		return 0;
 
 	ret = ovl_real_fdget_meta(file, &real, !datasync);
 	if (ret)
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index b429c80879ee..034a8d9897e0 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -17,6 +17,7 @@ struct ovl_config {
 	bool nfs_export;
 	int xino;
 	bool metacopy;
+	bool nosync;
 };
 
 struct ovl_sb {
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 6918b98faeb6..9e93db028dbf 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -863,6 +863,9 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
 	if (!OVL_TYPE_UPPER(ovl_path_type(dentry)))
 		return 0;
 
+	if (OVL_FS(dentry->d_sb)->config.nosync)
+		return 0;
+
 	/*
 	 * Need to check if we started out being a lower dir, but got copied up
 	 */
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 91476bc422f9..c28ab39b5c70 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -264,6 +264,8 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
 	if (!ovl_upper_mnt(ofs))
 		return 0;
 
+	if (ofs->config.nosync)
+		return 0;
 	/*
 	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
 	 * All the super blocks will be iterated, including upper_sb.
@@ -362,6 +364,8 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
 	if (ofs->config.metacopy != ovl_metacopy_def)
 		seq_printf(m, ",metacopy=%s",
 			   ofs->config.metacopy ? "on" : "off");
+	if (ofs->config.nosync)
+		seq_puts(m, ",nosync");
 	return 0;
 }
 
@@ -376,9 +380,11 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
 
 	if (*flags & SB_RDONLY && !sb_rdonly(sb)) {
 		upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
-		down_read(&upper_sb->s_umount);
-		ret = sync_filesystem(upper_sb);
-		up_read(&upper_sb->s_umount);
+		if (!ofs->config.nosync) {
+			down_read(&upper_sb->s_umount);
+			ret = sync_filesystem(upper_sb);
+			up_read(&upper_sb->s_umount);
+		}
 	}
 
 	return ret;
@@ -411,6 +417,7 @@ enum {
 	OPT_XINO_AUTO,
 	OPT_METACOPY_ON,
 	OPT_METACOPY_OFF,
+	OPT_NOSYNC,
 	OPT_ERR,
 };
 
@@ -429,6 +436,7 @@ static const match_table_t ovl_tokens = {
 	{OPT_XINO_AUTO,			"xino=auto"},
 	{OPT_METACOPY_ON,		"metacopy=on"},
 	{OPT_METACOPY_OFF,		"metacopy=off"},
+	{OPT_NOSYNC,			"nosync"},
 	{OPT_ERR,			NULL}
 };
 
@@ -573,6 +581,10 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
 			metacopy_opt = true;
 			break;
 
+		case OPT_NOSYNC:
+			config->nosync = true;
+			break;
+
 		default:
 			pr_err("unrecognized mount option \"%s\" or missing value\n",
 					p);
@@ -588,6 +600,11 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
 		config->workdir = NULL;
 	}
 
+	if (!config->upperdir && config->nosync) {
+		pr_info("option nosync is meaningless in a non-upper mount, ignoring it.\n");
+		config->nosync = false;
+	}
+
 	err = ovl_parse_redirect_mode(config, config->redirect_mode);
 	if (err)
 		return err;
-- 
2.25.4



* Re: [RFC PATCH] overlayfs: Provide a mount option "nosync" to skip sync
  2020-06-30 19:37 [RFC PATCH] overlayfs: Provide a mount option "nosync" to skip sync Vivek Goyal
@ 2020-07-01 10:31 ` Amir Goldstein
  2020-07-01 16:25   ` Vivek Goyal
  2020-07-16 20:41 ` Vivek Goyal
  1 sibling, 1 reply; 5+ messages in thread
From: Amir Goldstein @ 2020-07-01 10:31 UTC
  To: Vivek Goyal
  Cc: overlayfs, Miklos Szeredi, Giuseppe Scrivano, pmatilai,
	Daniel J Walsh, Steven Whitehouse, Eric Sandeen

On Tue, Jun 30, 2020 at 10:37 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> Container folks are complaining that dnf/yum issues too many sync while
> installing packages and this slows down the image build. Build
> requirement is such that they don't care if a node goes down while
> build was still going on. In that case, they will simply throw away
> unfinished layer and start new build. So they don't care about syncing
> intermediate state to the disk and hence don't want to pay the price
> associated with sync.
>
> So they are asking for an option where they can disable sync on overlay
> mount point completely and user space will do sync management on upper
> layer as needed.
>

Sounds reasonable.
I have a lot of comments below, but the bottom line is if you change "nosync"
to "sync=off" and adapt documentation, the patch itself looks fine to me
for addressing the "volatile container" use case.

> They primarily seem to have two use cases.
>
> - For building images, they will mount overlay with nosync and then sync
>   upper layer after unmounting overlay and reuse upper as lower for next
>   layer.

This sentence reads to me as if "sync upper layer" is simple, which is not
entirely true. syncfs(2) will sync ALL the upper layers of all containers even
in the best case where the filesystem is dedicated to containers storage.
The fact is that for this specific use case, the most optimal handling would
have been "sync on unmount/remount/syncfs but skip fsync".

But of course, without improving the implementation of ovl_sync_fs(),
this is currently equivalent to what you are describing.
Still, I feel that we do need to make this distinction and provide a mount
option "sync=fs" instead of letting the container runtime take care of
"syncing upper layer". This way, when the day comes and "sync=fs" is
properly implemented, all container runtimes will win on kernel upgrade.
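
To make the syncfs(2) point concrete, here is a minimal userspace sketch (not
part of the patch) of what "sync upper layer" would look like in a container
runtime. The helper name sync_fs_containing() is hypothetical. syncfs(2) takes
any open file descriptor on the filesystem and flushes that whole filesystem,
so passing one container's upper directory also writes back dirty data that
belongs to every other upper layer stored there.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* syncfs() is a GNU extension in glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush the filesystem that contains `path`.
 * Returns 0 on success or a negative errno value on failure.
 * There is no way to restrict the flush to the subtree under `path`.
 */
static int sync_fs_containing(const char *path)
{
	int fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	int err = 0;

	if (fd < 0)
		return -errno;
	if (syncfs(fd) < 0)
		err = -errno;
	close(fd);
	return err;
}
```

A runtime would call this on the upper directory after unmounting the overlay
and would discard the layer if the call returns an error.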

>
> - For running containers, they don't seem to care about syncing upper
>   layer because if node goes down, they will simply throw away upper
>   layer and create a fresh one.
>
> So this patch provides a mount option "nosync" which disables all forms
> of sync. Now it is caller's responsibility to manage sync of upper layer
> before it is reused again.
>
> I am seeing roughly 20% speed up in my VM where I am just installing
> emacs in an image. Installation time drops from 31 seconds to 25 seconds
> when nosync option is used. This is for the case of building on top
> of an image where all packages are already cached. That way I take
> out the network operations latency out of the measurement.
>
> Giuseppe is also looking to cut down on number of iops done on the
> disk. He is complaining that often in cloud their VMs are throttled
> if they cross the limit. This option can help them where they reduce
> number of iops (by cutting down on frequent sync and writebacks).
>
> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  Documentation/filesystems/overlayfs.rst | 20 ++++++++++++++++++++
>  fs/overlayfs/copy_up.c                  | 12 ++++++++----
>  fs/overlayfs/file.c                     | 11 ++++++++++-
>  fs/overlayfs/ovl_entry.h                |  1 +
>  fs/overlayfs/readdir.c                  |  3 +++
>  fs/overlayfs/super.c                    | 23 ++++++++++++++++++++---
>  6 files changed, 62 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> index 660dbaf0b9b8..0a42f26a3f0c 100644
> --- a/Documentation/filesystems/overlayfs.rst
> +++ b/Documentation/filesystems/overlayfs.rst
> @@ -563,6 +563,26 @@ This verification may cause significant overhead in some cases.
>  Note: the mount options index=off,nfs_export=on are conflicting and will
>  result in an error.
>
> +Disable sync
> +------------
> +By default, overlay skips sync on files residing on a lower layer.  It
> +is possible to skip sync operations for files on the upper layer as well
> +with the 'nosync' mount option. This option disables all forms of sync
> +from overlay, including the one done at umount/remount and it is
> +user's responsibility to sync upper layer on the file system it
> +is residing.
> +
> +With this option, data loss will happen if overlayfs upper layer is
> +not synced. So use this option very carefully. This is only for the
> +use cases where users discard upper layer if they could not sync it
> +successfully.
> +
> +Typically workflow will be.
> +
> +- mount overlay
> +- Do bunch of operations
> +- unmount overlay
> +- sync filesystem container upper layer

I don't like to document this workflow, because I think it is wrong.
Please document only the "volatile container" use case for "sync=off".

>
>  Testsuite
>  ---------
> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> index 79dd052c7dbf..5431a89bbd8a 100644
> --- a/fs/overlayfs/copy_up.c
> +++ b/fs/overlayfs/copy_up.c
> @@ -128,7 +128,8 @@ int ovl_copy_xattr(struct dentry *old, struct dentry *new)
>         return error;
>  }
>
> -static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
> +static int ovl_copy_up_data(struct ovl_fs *ofs, struct path *old,
> +                           struct path *new, loff_t len)
>  {
>         struct file *old_file;
>         struct file *new_file;
> @@ -218,7 +219,7 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
>                 len -= bytes;
>         }
>  out:
> -       if (!error)
> +       if (!error && !ofs->config.nosync)
>                 error = vfs_fsync(new_file, 0);

Two points about this:

1. The purpose of this particular fsync is different from user requested fsync.
Example:
If a user chowns all files in a tree on xfs/ext4, ~1 minute later, changes
will likely be safely stored because of periodic journal commit.
This is not a filesystem guarantee, but that's the way it is and if that were
to change, surely some users will notice and complain as happened in the past
with ext3 -> ext4 transition [1].

If a user chowns all files in a tree on overlay (without metacopy) over
xfs/ext4, ~1 minute later, changes will not have been safely stored if it
wasn't for this fsync.
The reason is delayed allocation of blocks of the new upper file.
ext4 has some heuristics in place to start writeback after rename over a file
(see NO_AUTO_DA_ALLOC), but not for linking an O_TMPFILE.

2. What could be useful is a mount option (e.g. sync=writeback) to convert this
vfs_fsync() to either filemap_flush() or filemap_fdatawrite().
This will start writeback on the new file with/without blocking and without
issuing any FLUSH to block layer. Periodic journal commit will take care of
the rest.
Again, this is not a guarantee that filesystems make and my attempts to
formalize this as a user API in LSFMM did not go far, but it is a powerful
tool that container administrators who know which underlying filesystem they
use can make use of, and the performance benefit for setup of thousands of
containers should be very noticeable.

[1] https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
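
For readers more familiar with the syscall interface, a rough userspace
analogue of the "start writeback, do not wait, do not flush" behavior in
point 2 is sync_file_range(2) with only SYNC_FILE_RANGE_WRITE set. This is a
sketch for illustration, not what the proposed sync=writeback option would
call inside the kernel; start_writeback() is a hypothetical helper name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* sync_file_range() is Linux-specific */
#endif
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper: ask the kernel to start writing out the dirty
 * pages of `fd` without waiting for completion. Unlike fsync(2), this
 * does not write metadata and does not flush the device write cache,
 * so it gives no durability guarantee on its own.
 * Returns 0 on success or a negative errno value on failure.
 */
static int start_writeback(int fd)
{
	/* offset 0 with nbytes 0 means "from the start through end of file" */
	if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) < 0)
		return -errno;
	return 0;
}
```

As with the in-kernel variant, whether the data is durable a minute later
depends on the behavior of the underlying filesystem, not on any API contract.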


>         fput(new_file);
>  out_fput:
> @@ -484,6 +485,7 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
>
>  static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
>  {
> +       struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
>         int err;
>
>         /*
> @@ -499,7 +501,8 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
>                 upperpath.dentry = temp;
>
>                 ovl_path_lowerdata(c->dentry, &datapath);
> -               err = ovl_copy_up_data(&datapath, &upperpath, c->stat.size);
> +               err = ovl_copy_up_data(ofs, &datapath, &upperpath,
> +                                      c->stat.size);
>                 if (err)
>                         return err;
>         }
> @@ -784,6 +787,7 @@ static bool ovl_need_meta_copy_up(struct dentry *dentry, umode_t mode,
>  /* Copy up data of an inode which was copied up metadata only in the past. */
>  static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
>  {
> +       struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
>         struct path upperpath, datapath;
>         int err;
>         char *capability = NULL;
> @@ -804,7 +808,7 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
>                         goto out;
>         }
>
> -       err = ovl_copy_up_data(&datapath, &upperpath, c->stat.size);
> +       err = ovl_copy_up_data(ofs, &datapath, &upperpath, c->stat.size);
>         if (err)
>                 goto out_free;
>
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index 01820e654a21..a361890a8d05 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -329,6 +329,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>         struct fd real;
>         const struct cred *old_cred;
>         ssize_t ret;
> +       int ifl = iocb->ki_flags;
>
>         if (!iov_iter_count(iter))
>                 return 0;
> @@ -344,11 +345,14 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>         if (ret)
>                 goto out_unlock;
>
> +       if (OVL_FS(inode->i_sb)->config.nosync)
> +               ifl &= ~(IOCB_DSYNC | IOCB_SYNC);
> +
>         old_cred = ovl_override_creds(file_inode(file)->i_sb);
>         if (is_sync_kiocb(iocb)) {
>                 file_start_write(real.file);
>                 ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
> -                                    ovl_iocb_to_rwf(iocb->ki_flags));
> +                                    ovl_iocb_to_rwf(ifl));
>                 file_end_write(real.file);
>                 /* Update size */
>                 ovl_copyattr(ovl_inode_real(inode), inode);
> @@ -368,6 +372,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>                 real.flags = 0;
>                 aio_req->orig_iocb = iocb;
>                 kiocb_clone(&aio_req->iocb, iocb, real.file);
> +               aio_req->iocb.ki_flags = ifl;
>                 aio_req->iocb.ki_complete = ovl_aio_rw_complete;
>                 ret = vfs_iocb_iter_write(real.file, &aio_req->iocb, iter);
>                 if (ret != -EIOCBQUEUED)
> @@ -430,6 +435,10 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>         struct fd real;
>         const struct cred *old_cred;
>         int ret;
> +       struct ovl_fs *ofs = OVL_FS(file_inode(file)->i_sb);
> +
> +       if (ofs->config.nosync)
> +               return 0;
>

This could convert the vfs_fsync_range() call to filemap_fdatawrite_range()
with "sync=writeback".

>         ret = ovl_real_fdget_meta(file, &real, !datasync);
>         if (ret)
> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> index b429c80879ee..034a8d9897e0 100644
> --- a/fs/overlayfs/ovl_entry.h
> +++ b/fs/overlayfs/ovl_entry.h
> @@ -17,6 +17,7 @@ struct ovl_config {
>         bool nfs_export;
>         int xino;
>         bool metacopy;
> +       bool nosync;
>  };
>
>  struct ovl_sb {
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 6918b98faeb6..9e93db028dbf 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -863,6 +863,9 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
>         if (!OVL_TYPE_UPPER(ovl_path_type(dentry)))
>                 return 0;
>
> +       if (OVL_FS(dentry->d_sb)->config.nosync)
> +               return 0;
> +
>         /*
>          * Need to check if we started out being a lower dir, but got copied up
>          */
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 91476bc422f9..c28ab39b5c70 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -264,6 +264,8 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
>         if (!ovl_upper_mnt(ofs))
>                 return 0;
>
> +       if (ofs->config.nosync)
> +               return 0;

I'd be happier if we implement "sync=off/fs" from the start, or at least make
ofs->config.sync an enum or bit mask to represent these modes, even if
we only implement mount option "sync=off" to begin with.
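
A minimal sketch of the enum idea, written as standalone C so it can be
compiled outside the kernel. All names here (enum ovl_sync_mode,
ovl_parse_sync_mode() and the two predicates) are hypothetical and do not
appear in the patch, and the "on" value for the default mode is an assumption.
The point is that each sync site asks a mode predicate instead of testing a
single bool.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sync modes for a "sync=<mode>" mount option. */
enum ovl_sync_mode {
	OVL_SYNC_INVALID = -1,	/* unrecognized value */
	OVL_SYNC_ON,		/* default: honor fsync and syncfs */
	OVL_SYNC_FS,		/* skip fsync, still sync on umount/remount/syncfs */
	OVL_SYNC_OFF,		/* skip every form of sync */
};

/* Map the text after "sync=" to a mode. */
static enum ovl_sync_mode ovl_parse_sync_mode(const char *value)
{
	static const struct {
		const char *name;
		enum ovl_sync_mode mode;
	} modes[] = {
		{ "on",  OVL_SYNC_ON },
		{ "fs",  OVL_SYNC_FS },
		{ "off", OVL_SYNC_OFF },
	};
	size_t i;

	for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++)
		if (strcmp(value, modes[i].name) == 0)
			return modes[i].mode;
	return OVL_SYNC_INVALID;
}

/* Per-file sync sites (fsync, O_SYNC writes, copy-up) would check this. */
static bool ovl_sync_honors_fsync(enum ovl_sync_mode mode)
{
	return mode == OVL_SYNC_ON;
}

/* Whole-filesystem sync sites (sync_fs, remount read-only) would check this. */
static bool ovl_sync_honors_syncfs(enum ovl_sync_mode mode)
{
	return mode == OVL_SYNC_ON || mode == OVL_SYNC_FS;
}
```

With this shape, adding a later mode such as "writeback" means one new enum
value and one new table entry rather than another bool in struct ovl_config.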

>         /*
>          * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
>          * All the super blocks will be iterated, including upper_sb.
> @@ -362,6 +364,8 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
>         if (ofs->config.metacopy != ovl_metacopy_def)
>                 seq_printf(m, ",metacopy=%s",
>                            ofs->config.metacopy ? "on" : "off");
> +       if (ofs->config.nosync)
> +               seq_puts(m, ",nosync");
>         return 0;
>  }
>
> @@ -376,9 +380,11 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
>
>         if (*flags & SB_RDONLY && !sb_rdonly(sb)) {
>                 upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> -               down_read(&upper_sb->s_umount);
> -               ret = sync_filesystem(upper_sb);
> -               up_read(&upper_sb->s_umount);
> +               if (!ofs->config.nosync) {
> +                       down_read(&upper_sb->s_umount);
> +                       ret = sync_filesystem(upper_sb);
> +                       up_read(&upper_sb->s_umount);
> +               }
>         }
>
>         return ret;
> @@ -411,6 +417,7 @@ enum {
>         OPT_XINO_AUTO,
>         OPT_METACOPY_ON,
>         OPT_METACOPY_OFF,
> +       OPT_NOSYNC,
>         OPT_ERR,
>  };
>
> @@ -429,6 +436,7 @@ static const match_table_t ovl_tokens = {
>         {OPT_XINO_AUTO,                 "xino=auto"},
>         {OPT_METACOPY_ON,               "metacopy=on"},
>         {OPT_METACOPY_OFF,              "metacopy=off"},
> +       {OPT_NOSYNC,                    "nosync"},

As should be very clear by now, I prefer that we call the option "sync=off",
so that we can later (or now) also implement "sync=fs" and maybe also
"sync=writeback", but apart from that one semantic change, I am fine with
the patch as is.

Thanks,
Amir.


* Re: [RFC PATCH] overlayfs: Provide a mount option "nosync" to skip sync
  2020-07-01 10:31 ` Amir Goldstein
@ 2020-07-01 16:25   ` Vivek Goyal
  2020-07-01 17:42     ` Amir Goldstein
  0 siblings, 1 reply; 5+ messages in thread
From: Vivek Goyal @ 2020-07-01 16:25 UTC
  To: Amir Goldstein
  Cc: overlayfs, Miklos Szeredi, Giuseppe Scrivano, pmatilai,
	Daniel J Walsh, Steven Whitehouse, Eric Sandeen

On Wed, Jul 01, 2020 at 01:31:01PM +0300, Amir Goldstein wrote:
> On Tue, Jun 30, 2020 at 10:37 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > Container folks are complaining that dnf/yum issues too many sync while
> > installing packages and this slows down the image build. Build
> > requirement is such that they don't care if a node goes down while
> > build was still going on. In that case, they will simply throw away
> > unfinished layer and start new build. So they don't care about syncing
> > intermediate state to the disk and hence don't want to pay the price
> > associated with sync.
> >
> > So they are asking for an option where they can disable sync on overlay
> > mount point completely and user space will do sync management on upper
> > layer as needed.
> >
> 
> Sounds reasonable.
> I have a lot of comments below, but the bottom line is if you change "nosync"
> to "sync=off" and adapt documentation, the patch itself looks fine to me
> for addressing the "volatile container" use case.

"sync" already seems to be a vfs mount option. man mount says:

       sync   All I/O to the filesystem should be done synchronously.  In  the
              case  of  media with a limited number of write cycles (e.g. some
              flash drives), sync may cause life-cycle shortening.

It would be good to choose a name that avoids confusion with the
filesystem-independent option.

Either we can add separate options like "nosync", "fssync", "writebacksync",
etc., as use cases for different kinds of sync arise.

Or would it make sense to call it "ovlsync=<foo>"? That way it is
plenty clear that it is an overlay filesystem specific mount option.

> 
> > They primarily seem to have two use cases.
> >
> > - For building images, they will mount overlay with nosync and then sync
> >   upper layer after unmounting overlay and reuse upper as lower for next
> >   layer.
> 
> This sentence reads to me as if "sync upper layer" is simple, which is not
> entirely true. syncfs(2) will sync ALL the upper layers of all containers even
> in the best case where the filesystem is dedicated to containers storage.
> The fact is that for this specific use case, the most optimal handling would
> have been "sync on unmount/remount/syncfs but skip fsync".
> 
> But of course, without improving the implementation of ovl_sync_fs(),
> this is currently equivalent to what you are describing.
> Still I feel that we do need to make this distinction and provide mount option
> "sync=fs" instead of letting the container runtime take care of
> "syncing upper layer"
> this way when the day comes and "sync=fs" is properly implemented,
> all container runtimes will win on kernel upgrade.

I think this option should be useful for some cases where they want
to skip intermediate syncs but would like to sync on umount/remount/syncfs.
In fact, the image build case should probably benefit from it.

Giuseppe, is it correct that for image build you will need to sync the
upper layer? If that's the case, then providing a mount option and
using it now is better, so that applications don't have to change
later when we have a more efficient implementation of ovl_sync_fs().

IOW, for the case of a completely volatile container (kubernetes
restarts the container on a separate node if the node goes down), we could
use ovlsync=off (or nosync), and for the image build case, use ovlsync=syncfs.

Giuseppe, does this sound reasonable for your needs?

> 
> >
> > - For running containers, they don't seem to care about syncing upper
> >   layer because if node goes down, they will simply throw away upper
> >   layer and create a fresh one.
> >
> > So this patch provides a mount option "nosync" which disables all forms
> > of sync. Now it is caller's responsibility to manage sync of upper layer
> > before it is reused again.
> >
> > I am seeing roughly 20% speed up in my VM where I am just installing
> > emacs in an image. Installation time drops from 31 seconds to 25 seconds
> > when nosync option is used. This is for the case of building on top
> > of an image where all packages are already cached. That way I take
> > out the network operations latency out of the measurement.
> >
> > Giuseppe is also looking to cut down on number of iops done on the
> > disk. He is complaining that often in cloud their VMs are throttled
> > if they cross the limit. This option can help them where they reduce
> > number of iops (by cutting down on frequent sync and writebacks).
> >
> > Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
> > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  Documentation/filesystems/overlayfs.rst | 20 ++++++++++++++++++++
> >  fs/overlayfs/copy_up.c                  | 12 ++++++++----
> >  fs/overlayfs/file.c                     | 11 ++++++++++-
> >  fs/overlayfs/ovl_entry.h                |  1 +
> >  fs/overlayfs/readdir.c                  |  3 +++
> >  fs/overlayfs/super.c                    | 23 ++++++++++++++++++++---
> >  6 files changed, 62 insertions(+), 8 deletions(-)
> >
> > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > index 660dbaf0b9b8..0a42f26a3f0c 100644
> > --- a/Documentation/filesystems/overlayfs.rst
> > +++ b/Documentation/filesystems/overlayfs.rst
> > @@ -563,6 +563,26 @@ This verification may cause significant overhead in some cases.
> >  Note: the mount options index=off,nfs_export=on are conflicting and will
> >  result in an error.
> >
> > +Disable sync
> > +------------
> > +By default, overlay skips sync on files residing on a lower layer.  It
> > +is possible to skip sync operations for files on the upper layer as well
> > +with the 'nosync' mount option. This option disables all forms of sync
> > +from overlay, including the one done at umount/remount and it is
> > +user's responsibility to sync upper layer on the file system it
> > +is residing.
> > +
> > +With this option, data loss will happen if overlayfs upper layer is
> > +not synced. So use this option very carefully. This is only for the
> > +use cases where users discard upper layer if they could not sync it
> > +successfully.
> > +
> > +Typically workflow will be.
> > +
> > +- mount overlay
> > +- Do bunch of operations
> > +- unmount overlay
> > +- sync filesystem container upper layer
> 
> I don't like to document this workflow, because I think it is wrong.

Why is it wrong? A user can call syncfs() on the filesystem where upper
resides. If we decide to provide "ovlsync=syncfs", then we should document
that instead.
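
For reference, the "sync upper after unmount" step could look roughly like
this from user space. This is only a sketch: sync_upper_fs() is a made-up
helper name, and the path passed in is whatever directory the runtime uses
as upper. Note that syncfs(2) flushes the whole filesystem that contains
that directory, not just the upper layer.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for syncfs() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush the whole filesystem that contains upper_dir.
 * Returns 0 on success or a negative errno value.
 * (Hypothetical helper, not part of the patch.)
 */
static int sync_upper_fs(const char *upper_dir)
{
	int fd = open(upper_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	int ret = 0;

	if (fd < 0)
		return -errno;
	if (syncfs(fd) < 0)
		ret = -errno;
	close(fd);
	return ret;
}
```

A runtime would call this once after unmounting the overlay and discard
the layer if it returns an error.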

> Please document only the "volatile container" use case for "sync=off".

I guess we are now splitting it into two use cases: "volatile container"
will use sync=off, and image build will use sync=fs.

> 
> >
> >  Testsuite
> >  ---------
> > diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> > index 79dd052c7dbf..5431a89bbd8a 100644
> > --- a/fs/overlayfs/copy_up.c
> > +++ b/fs/overlayfs/copy_up.c
> > @@ -128,7 +128,8 @@ int ovl_copy_xattr(struct dentry *old, struct dentry *new)
> >         return error;
> >  }
> >
> > -static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
> > +static int ovl_copy_up_data(struct ovl_fs *ofs, struct path *old,
> > +                           struct path *new, loff_t len)
> >  {
> >         struct file *old_file;
> >         struct file *new_file;
> > @@ -218,7 +219,7 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
> >                 len -= bytes;
> >         }
> >  out:
> > -       if (!error)
> > +       if (!error && !ofs->config.nosync)
> >                 error = vfs_fsync(new_file, 0);
> 
> Two points about this:
> 
> 1. The purpose of this particular fsync is different from a user-requested
> fsync. Example:
> If a user chowns all files in a tree on xfs/ext4, then ~1 minute later the
> changes will likely be safely stored because of the periodic journal commit.
> This is not a filesystem guarantee, but that's the way it is, and if that
> were to change, surely some users would notice and complain, as happened in
> the past with the ext3 -> ext4 transition [1].
> 
> If a user chowns all files in a tree on overlay (without metacopy) over
> xfs/ext4, then ~1 minute later the changes would not have been safely stored
> if it wasn't for this fsync.
> The reason is delayed allocation of blocks of the new upper file.
> ext4 has some heuristics in place to start writeback after rename over a file
> (see NO_AUTO_DA_ALLOC), but not for linking an O_TMPFILE.
> 
> 2. What could be useful is a mount option (e.g. sync=writeback) to convert
> this vfs_fsync() to either filemap_flush() or filemap_fdatawrite().
> This will start writeback on the new file with/without blocking and without
> issuing any FLUSH to the block layer. Periodic journal commit will take care
> of the rest.
> Again, this is not a guarantee that filesystems make, and my attempts to
> formalize this as a user API in LSFMM did not go far, but it is a powerful
> tool that container administrators who know which underlying filesystem they
> use can make use of, and the performance benefit for setup of thousands of
> containers should be very noticeable.
> 
> [1] https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/

OK, so if somebody down the line needs a copied-up file to reach stable
storage after a minute or so, then we need to look into a sync=writeback
option.

> 
> 
> >         fput(new_file);
> >  out_fput:
> > @@ -484,6 +485,7 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
> >
> >  static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
> >  {
> > +       struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
> >         int err;
> >
> >         /*
> > @@ -499,7 +501,8 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
> >                 upperpath.dentry = temp;
> >
> >                 ovl_path_lowerdata(c->dentry, &datapath);
> > -               err = ovl_copy_up_data(&datapath, &upperpath, c->stat.size);
> > +               err = ovl_copy_up_data(ofs, &datapath, &upperpath,
> > +                                      c->stat.size);
> >                 if (err)
> >                         return err;
> >         }
> > @@ -784,6 +787,7 @@ static bool ovl_need_meta_copy_up(struct dentry *dentry, umode_t mode,
> >  /* Copy up data of an inode which was copied up metadata only in the past. */
> >  static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
> >  {
> > +       struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
> >         struct path upperpath, datapath;
> >         int err;
> >         char *capability = NULL;
> > @@ -804,7 +808,7 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
> >                         goto out;
> >         }
> >
> > -       err = ovl_copy_up_data(&datapath, &upperpath, c->stat.size);
> > +       err = ovl_copy_up_data(ofs, &datapath, &upperpath, c->stat.size);
> >         if (err)
> >                 goto out_free;
> >
> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > index 01820e654a21..a361890a8d05 100644
> > --- a/fs/overlayfs/file.c
> > +++ b/fs/overlayfs/file.c
> > @@ -329,6 +329,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> >         struct fd real;
> >         const struct cred *old_cred;
> >         ssize_t ret;
> > +       int ifl = iocb->ki_flags;
> >
> >         if (!iov_iter_count(iter))
> >                 return 0;
> > @@ -344,11 +345,14 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> >         if (ret)
> >                 goto out_unlock;
> >
> > +       if (OVL_FS(inode->i_sb)->config.nosync)
> > +               ifl &= ~(IOCB_DSYNC | IOCB_SYNC);
> > +
> >         old_cred = ovl_override_creds(file_inode(file)->i_sb);
> >         if (is_sync_kiocb(iocb)) {
> >                 file_start_write(real.file);
> >                 ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
> > -                                    ovl_iocb_to_rwf(iocb->ki_flags));
> > +                                    ovl_iocb_to_rwf(ifl));
> >                 file_end_write(real.file);
> >                 /* Update size */
> >                 ovl_copyattr(ovl_inode_real(inode), inode);
> > @@ -368,6 +372,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> >                 real.flags = 0;
> >                 aio_req->orig_iocb = iocb;
> >                 kiocb_clone(&aio_req->iocb, iocb, real.file);
> > +               aio_req->iocb.ki_flags = ifl;
> >                 aio_req->iocb.ki_complete = ovl_aio_rw_complete;
> >                 ret = vfs_iocb_iter_write(real.file, &aio_req->iocb, iter);
> >                 if (ret != -EIOCBQUEUED)
> > @@ -430,6 +435,10 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> >         struct fd real;
> >         const struct cred *old_cred;
> >         int ret;
> > +       struct ovl_fs *ofs = OVL_FS(file_inode(file)->i_sb);
> > +
> > +       if (ofs->config.nosync)
> > +               return 0;
> >
> 
> Can convert the vfs_sync_range() to filemap_fdatawrite_range() with
> "sync=writeback"

For now I will leave it as it is. This can be changed once we implement
sync=writeback.

> 
> >         ret = ovl_real_fdget_meta(file, &real, !datasync);
> >         if (ret)
> > diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> > index b429c80879ee..034a8d9897e0 100644
> > --- a/fs/overlayfs/ovl_entry.h
> > +++ b/fs/overlayfs/ovl_entry.h
> > @@ -17,6 +17,7 @@ struct ovl_config {
> >         bool nfs_export;
> >         int xino;
> >         bool metacopy;
> > +       bool nosync;
> >  };
> >
> >  struct ovl_sb {
> > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > index 6918b98faeb6..9e93db028dbf 100644
> > --- a/fs/overlayfs/readdir.c
> > +++ b/fs/overlayfs/readdir.c
> > @@ -863,6 +863,9 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
> >         if (!OVL_TYPE_UPPER(ovl_path_type(dentry)))
> >                 return 0;
> >
> > +       if (OVL_FS(dentry->d_sb)->config.nosync)
> > +               return 0;
> > +
> >         /*
> >          * Need to check if we started out being a lower dir, but got copied up
> >          */
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index 91476bc422f9..c28ab39b5c70 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -264,6 +264,8 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> >         if (!ovl_upper_mnt(ofs))
> >                 return 0;
> >
> > +       if (ofs->config.nosync)
> > +               return 0;
> 
> I'd be happier if we implement "sync=off/fs" from the start, or at least make
> ofs->config.sync an enum or bit mask to represent these modes, even if
> we only implement mount option "sync=off" to begin with.

I am also in favor of implementing ovlsync=fs from the beginning, if
that's what the image build use case is going to use. That way, builds
will use the correct option from the beginning and will automatically
benefit when overlayfs improves sync_fs to sync only the inodes of
upper/ (and not the whole filesystem).
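
To make the modes concrete, here is a small user-space model of how a sync
mode enum could gate the different sync paths. Everything in it is
hypothetical: the enum and helper names, the "on" value, and the assumption
that the copy-up fsync is treated like any other per-file fsync are
illustrations only, not what the patch implements.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical sync modes; the names are illustrative, not from the patch. */
enum ovl_sync_mode {
	OVL_SYNC_ON,	/* default: honour every sync request */
	OVL_SYNC_FS,	/* skip per-file fsync, keep syncfs/umount/remount sync */
	OVL_SYNC_OFF,	/* skip all syncs ("volatile container") */
};

/* Parse the value part of a "<name>=<value>" option; returns -1 if unknown. */
static int ovl_parse_sync_mode(const char *val, enum ovl_sync_mode *mode)
{
	if (!strcmp(val, "on"))
		*mode = OVL_SYNC_ON;
	else if (!strcmp(val, "fs"))
		*mode = OVL_SYNC_FS;
	else if (!strcmp(val, "off"))
		*mode = OVL_SYNC_OFF;
	else
		return -1;
	return 0;
}

/* Should a per-file sync (the ovl_fsync()/ovl_dir_fsync() paths) run? */
static bool ovl_should_fsync(enum ovl_sync_mode mode)
{
	return mode == OVL_SYNC_ON;
}

/* Should a filesystem-wide sync (ovl_sync_fs(), remount read-only) run? */
static bool ovl_should_sync_fs(enum ovl_sync_mode mode)
{
	return mode != OVL_SYNC_OFF;
}
```

With something like this, each "if (ofs->config.nosync)" check in the patch
would become a call to one of the two predicates, and adding a further mode
later would not touch the call sites.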

> 
> >         /*
> >          * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
> >          * All the super blocks will be iterated, including upper_sb.
> > @@ -362,6 +364,8 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
> >         if (ofs->config.metacopy != ovl_metacopy_def)
> >                 seq_printf(m, ",metacopy=%s",
> >                            ofs->config.metacopy ? "on" : "off");
> > +       if (ofs->config.nosync)
> > +               seq_puts(m, ",nosync");
> >         return 0;
> >  }
> >
> > @@ -376,9 +380,11 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
> >
> >         if (*flags & SB_RDONLY && !sb_rdonly(sb)) {
> >                 upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > -               down_read(&upper_sb->s_umount);
> > -               ret = sync_filesystem(upper_sb);
> > -               up_read(&upper_sb->s_umount);
> > +               if (!ofs->config.nosync) {
> > +                       down_read(&upper_sb->s_umount);
> > +                       ret = sync_filesystem(upper_sb);
> > +                       up_read(&upper_sb->s_umount);
> > +               }
> >         }
> >
> >         return ret;
> > @@ -411,6 +417,7 @@ enum {
> >         OPT_XINO_AUTO,
> >         OPT_METACOPY_ON,
> >         OPT_METACOPY_OFF,
> > +       OPT_NOSYNC,
> >         OPT_ERR,
> >  };
> >
> > @@ -429,6 +436,7 @@ static const match_table_t ovl_tokens = {
> >         {OPT_XINO_AUTO,                 "xino=auto"},
> >         {OPT_METACOPY_ON,               "metacopy=on"},
> >         {OPT_METACOPY_OFF,              "metacopy=off"},
> > +       {OPT_NOSYNC,                    "nosync"},
> 
> As should be very clear by now, I prefer that we call the option "sync=off",
> so that we can later (or now) also implement "sync=fs" and maybe also
> "sync=writeback", but apart from that one semantic change, I am fine with
> the patch as is.

I think I will defer sync=writeback until somebody needs it.
For now, I like the idea of implementing sync=off and sync=fs. Now
we just need to decide on the name of the option. Does ovlsync=off/fs
sound good?

Thanks
Vivek



* Re: [RFC PATCH] overlayfs: Provide a mount option "nosync" to skip sync
  2020-07-01 16:25   ` Vivek Goyal
@ 2020-07-01 17:42     ` Amir Goldstein
  0 siblings, 0 replies; 5+ messages in thread
From: Amir Goldstein @ 2020-07-01 17:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: overlayfs, Miklos Szeredi, Giuseppe Scrivano, pmatilai,
	Daniel J Walsh, Steven Whitehouse, Eric Sandeen

On Wed, Jul 1, 2020 at 7:25 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Wed, Jul 01, 2020 at 01:31:01PM +0300, Amir Goldstein wrote:
> > On Tue, Jun 30, 2020 at 10:37 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > Container folks are complaining that dnf/yum issues too many sync while
> > > installing packages and this slows down the image build. Build
> > > requirement is such that they don't care if a node goes down while
> > > build was still going on. In that case, they will simply throw away
> > > unfinished layer and start new build. So they don't care about syncing
> > > intermediate state to the disk and hence don't want to pay the price
> > > associated with sync.
> > >
> > > So they are asking for an option where they can disable sync on overlay
> > > mount point completely and user space will do sync management on upper
> > > layer as needed.
> > >
> >
> > Sounds reasonable.
> > I have a lot of comments below, but the bottom line is if you change "nosync"
> > to "sync=off" and adapt documentation, the patch itself looks fine to me
> > for addressing the "volatile container" use case.
>
> "sync" already seems to be a vfs mount option. man mount says.
>
>        sync   All I/O to the filesystem should be done synchronously.  In  the
>               case  of  media with a limited number of write cycles (e.g. some
>               flash drives), sync may cause life-cycle shortening.
>
> It will be good to choose a name which avoids confusion with filesystem
> independent option.
>
> Either we can have separate options like "nosync", "fssync", "writebacksync"
> etc. when use cases for different kinds of sync arise.
>
> Or would it make sense to call it "ovlsync=<foo>". That way it is
> plenty clear that it is overlay filesystem specific mount option.
>
> >
> > > They primarily seem to have two use cases.
> > >
> > > - For building images, they will mount overlay with nosync and then sync
> > >   upper layer after unmounting overlay and reuse upper as lower for next
> > >   layer.
> >
> > This sentence reads to me as if "sync upper layer" is simple, which is not
> > entirely true. syncfs(2) will sync ALL the upper layers of all containers even
> > in the best case where the filesystem is dedicated to containers storage.
> > The fact is that for this specific use case, the most optimal handling would
> > have been "sync on unmount/remount/syncfs but skip fsync".
> >
> > But of course, without improving the implementation of ovl_sync_fs(),
> > this is currently equivalent to what you are describing.
> > Still, I feel that we do need to make this distinction and provide a mount
> > option "sync=fs" instead of letting the container runtime take care of
> > "syncing upper layer". This way, when the day comes and "sync=fs" is
> > properly implemented, all container runtimes will win on kernel upgrade.
>
> I think this option should be useful for some cases where they want
> to skip intermediate sync but will like to sync on umount/remount/syncfs.
> In fact image build case probably should benefit from it.
>
> Giuseppe, is it correct that for image builds you will need to sync the
> upper layer? If that's the case, then providing a mount option and
> using that now is better so that applications don't have to change
> later when we have a more efficient implementation of ovl_sync_fs().
>
> IOW, for the case of a completely volatile container (kubernetes
> restarts the container on a separate node if the node goes down), we could
> use ovlsync=off (or nosync) and for the image build case, use ovlsync=syncfs.
>
> Giuseppe, does this sound reasonable for your needs?
>
> >
> > >
> > > - For running containers, they don't seem to care about syncing upper
> > >   layer because if node goes down, they will simply throw away upper
> > >   layer and create a fresh one.
> > >
> > > So this patch provides a mount option "nosync" which disables all forms
> > > of sync. Now it is caller's responsibility to manage sync of upper layer
> > > before it is reused again.
> > >
> > > I am seeing roughly 20% speed up in my VM where I am just installing
> > > emacs in an image. Installation time drops from 31 seconds to 25 seconds
> > > when nosync option is used. This is for the case of building on top
> > > of an image where all packages are already cached. That way I take
> > > out the network operations latency out of the measurement.
> > >
> > > Giuseppe is also looking to cut down on number of iops done on the
> > > disk. He is complaining that often in cloud their VMs are throttled
> > > if they cross the limit. This option can help them where they reduce
> > > number of iops (by cutting down on frequent sync and writebacks).
> > >
> > > Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
> > > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > > ---
> > >  Documentation/filesystems/overlayfs.rst | 20 ++++++++++++++++++++
> > >  fs/overlayfs/copy_up.c                  | 12 ++++++++----
> > >  fs/overlayfs/file.c                     | 11 ++++++++++-
> > >  fs/overlayfs/ovl_entry.h                |  1 +
> > >  fs/overlayfs/readdir.c                  |  3 +++
> > >  fs/overlayfs/super.c                    | 23 ++++++++++++++++++++---
> > >  6 files changed, 62 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > > index 660dbaf0b9b8..0a42f26a3f0c 100644
> > > --- a/Documentation/filesystems/overlayfs.rst
> > > +++ b/Documentation/filesystems/overlayfs.rst
> > > @@ -563,6 +563,26 @@ This verification may cause significant overhead in some cases.
> > >  Note: the mount options index=off,nfs_export=on are conflicting and will
> > >  result in an error.
> > >
> > > +Disable sync
> > > +------------
> > > +By default, overlay skips sync on files residing on a lower layer.  It
> > > +is possible to skip sync operations for files on the upper layer as well
> > > +with the 'nosync' mount option. This option disables all forms of sync
> > > +from overlay, including the one done at umount/remount and it is
> > > +user's responsibility to sync upper layer on the file system it
> > > +is residing.
> > > +
> > > +With this option, data loss will happen if overlayfs upper layer is
> > > +not synced. So use this option very carefully. This is only for the
> > > +use cases where users discard upper layer if they could not sync it
> > > +successfully.
> > > +
> > > +Typically workflow will be.
> > > +
> > > +- mount overlay
> > > +- Do bunch of operations
> > > +- unmount overlay
> > > +- sync filesystem container upper layer
> >
> > I don't like to document this workflow, because I think it is wrong.
>
> Why is it wrong? A user can call syncfs() on the filesystem where upper
> resides. If we decide to provide "ovlsync=syncfs", then we should document
> that instead.
>

Not wrong in the sense that it would not work.
Wrong in the sense that, IMO, it is the wrong way to go compared to sync=fs.

> > Please document only the "volatile container" use case for "sync=off".
>
> I guess we are now splitting it into two use cases: "volatile container"
> will use sync=off, and image build will use sync=fs.
>
> >
> > >
> > >  Testsuite
> > >  ---------
> > > diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> > > index 79dd052c7dbf..5431a89bbd8a 100644
> > > --- a/fs/overlayfs/copy_up.c
> > > +++ b/fs/overlayfs/copy_up.c
> > > @@ -128,7 +128,8 @@ int ovl_copy_xattr(struct dentry *old, struct dentry *new)
> > >         return error;
> > >  }
> > >
> > > -static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
> > > +static int ovl_copy_up_data(struct ovl_fs *ofs, struct path *old,
> > > +                           struct path *new, loff_t len)
> > >  {
> > >         struct file *old_file;
> > >         struct file *new_file;
> > > @@ -218,7 +219,7 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
> > >                 len -= bytes;
> > >         }
> > >  out:
> > > -       if (!error)
> > > +       if (!error && !ofs->config.nosync)
> > >                 error = vfs_fsync(new_file, 0);
> >
> > Two points about this:
> >
> > 1. The purpose of this particular fsync is different from a user-requested
> > fsync. Example:
> > If a user chowns all files in a tree on xfs/ext4, then ~1 minute later the
> > changes will likely be safely stored because of the periodic journal commit.
> > This is not a filesystem guarantee, but that's the way it is, and if that
> > were to change, surely some users would notice and complain, as happened in
> > the past with the ext3 -> ext4 transition [1].
> >
> > If a user chowns all files in a tree on overlay (without metacopy) over
> > xfs/ext4, then ~1 minute later the changes would not have been safely stored
> > if it wasn't for this fsync.
> > The reason is delayed allocation of blocks of the new upper file.
> > ext4 has some heuristics in place to start writeback after rename over a file
> > (see NO_AUTO_DA_ALLOC), but not for linking an O_TMPFILE.
> >
> > 2. What could be useful is a mount option (e.g. sync=writeback) to convert
> > this vfs_fsync() to either filemap_flush() or filemap_fdatawrite().
> > This will start writeback on the new file with/without blocking and without
> > issuing any FLUSH to the block layer. Periodic journal commit will take care
> > of the rest.
> > Again, this is not a guarantee that filesystems make, and my attempts to
> > formalize this as a user API in LSFMM did not go far, but it is a powerful
> > tool that container administrators who know which underlying filesystem they
> > use can make use of, and the performance benefit for setup of thousands of
> > containers should be very noticeable.
> >
> > [1] https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
>
> OK, so if somebody down the line needs a copied-up file to reach stable
> storage after a minute or so, then we need to look into a sync=writeback
> option.
>

Sure, I just told the long story to explain that other sync options could
develop in the future.
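
As a rough user-space analogue of that idea (not the proposed kernel change
itself), sync_file_range(2) with SYNC_FILE_RANGE_WRITE starts writeback of
dirty pages without waiting for completion and without flushing metadata or
the disk write cache. The helper names below are made up for illustration,
and this is a sketch, not a durability guarantee.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for sync_file_range() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Start writeback for the whole file without waiting for it to finish.
 * offset 0 and nbytes 0 mean "from the start through the end of the file".
 * Returns 0 on success or a negative errno value.
 */
static int start_writeback(int fd)
{
	if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) < 0)
		return -errno;
	return 0;
}

/* Hypothetical demo: write a small temporary file and kick off writeback. */
static int writeback_demo(void)
{
	char path[] = "/tmp/ovl-writeback-XXXXXX";
	static const char buf[] = "copied-up data\n";
	int fd = mkstemp(path);
	int ret;

	if (fd < 0)
		return -errno;
	if (write(fd, buf, sizeof(buf) - 1) != (ssize_t)(sizeof(buf) - 1))
		ret = -EIO;
	else
		ret = start_writeback(fd);
	close(fd);
	unlink(path);
	return ret;
}
```

Whether the data then reaches stable storage "a minute later" still depends
on the underlying filesystem's periodic commit, as described above.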

> >
> >
> > >         fput(new_file);
> > >  out_fput:
> > > @@ -484,6 +485,7 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
> > >
> > >  static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
> > >  {
> > > +       struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
> > >         int err;
> > >
> > >         /*
> > > @@ -499,7 +501,8 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
> > >                 upperpath.dentry = temp;
> > >
> > >                 ovl_path_lowerdata(c->dentry, &datapath);
> > > -               err = ovl_copy_up_data(&datapath, &upperpath, c->stat.size);
> > > +               err = ovl_copy_up_data(ofs, &datapath, &upperpath,
> > > +                                      c->stat.size);
> > >                 if (err)
> > >                         return err;
> > >         }
> > > @@ -784,6 +787,7 @@ static bool ovl_need_meta_copy_up(struct dentry *dentry, umode_t mode,
> > >  /* Copy up data of an inode which was copied up metadata only in the past. */
> > >  static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
> > >  {
> > > +       struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
> > >         struct path upperpath, datapath;
> > >         int err;
> > >         char *capability = NULL;
> > > @@ -804,7 +808,7 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
> > >                         goto out;
> > >         }
> > >
> > > -       err = ovl_copy_up_data(&datapath, &upperpath, c->stat.size);
> > > +       err = ovl_copy_up_data(ofs, &datapath, &upperpath, c->stat.size);
> > >         if (err)
> > >                 goto out_free;
> > >
> > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > > index 01820e654a21..a361890a8d05 100644
> > > --- a/fs/overlayfs/file.c
> > > +++ b/fs/overlayfs/file.c
> > > @@ -329,6 +329,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> > >         struct fd real;
> > >         const struct cred *old_cred;
> > >         ssize_t ret;
> > > +       int ifl = iocb->ki_flags;
> > >
> > >         if (!iov_iter_count(iter))
> > >                 return 0;
> > > @@ -344,11 +345,14 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> > >         if (ret)
> > >                 goto out_unlock;
> > >
> > > +       if (OVL_FS(inode->i_sb)->config.nosync)
> > > +               ifl &= ~(IOCB_DSYNC | IOCB_SYNC);
> > > +
> > >         old_cred = ovl_override_creds(file_inode(file)->i_sb);
> > >         if (is_sync_kiocb(iocb)) {
> > >                 file_start_write(real.file);
> > >                 ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
> > > -                                    ovl_iocb_to_rwf(iocb->ki_flags));
> > > +                                    ovl_iocb_to_rwf(ifl));
> > >                 file_end_write(real.file);
> > >                 /* Update size */
> > >                 ovl_copyattr(ovl_inode_real(inode), inode);
> > > @@ -368,6 +372,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> > >                 real.flags = 0;
> > >                 aio_req->orig_iocb = iocb;
> > >                 kiocb_clone(&aio_req->iocb, iocb, real.file);
> > > +               aio_req->iocb.ki_flags = ifl;
> > >                 aio_req->iocb.ki_complete = ovl_aio_rw_complete;
> > >                 ret = vfs_iocb_iter_write(real.file, &aio_req->iocb, iter);
> > >                 if (ret != -EIOCBQUEUED)
> > > @@ -430,6 +435,10 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> > >         struct fd real;
> > >         const struct cred *old_cred;
> > >         int ret;
> > > +       struct ovl_fs *ofs = OVL_FS(file_inode(file)->i_sb);
> > > +
> > > +       if (ofs->config.nosync)
> > > +               return 0;
> > >
> >
> > Can convert the vfs_sync_range() to filemap_fdatawrite_range() with
> > "sync=writeback"
>
> For now I will leave it as it is. This can be changed once we implement
> sync=writeback.
>
> >
> > >         ret = ovl_real_fdget_meta(file, &real, !datasync);
> > >         if (ret)
> > > diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> > > index b429c80879ee..034a8d9897e0 100644
> > > --- a/fs/overlayfs/ovl_entry.h
> > > +++ b/fs/overlayfs/ovl_entry.h
> > > @@ -17,6 +17,7 @@ struct ovl_config {
> > >         bool nfs_export;
> > >         int xino;
> > >         bool metacopy;
> > > +       bool nosync;
> > >  };
> > >
> > >  struct ovl_sb {
> > > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > > index 6918b98faeb6..9e93db028dbf 100644
> > > --- a/fs/overlayfs/readdir.c
> > > +++ b/fs/overlayfs/readdir.c
> > > @@ -863,6 +863,9 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
> > >         if (!OVL_TYPE_UPPER(ovl_path_type(dentry)))
> > >                 return 0;
> > >
> > > +       if (OVL_FS(dentry->d_sb)->config.nosync)
> > > +               return 0;
> > > +
> > >         /*
> > >          * Need to check if we started out being a lower dir, but got copied up
> > >          */
> > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > index 91476bc422f9..c28ab39b5c70 100644
> > > --- a/fs/overlayfs/super.c
> > > +++ b/fs/overlayfs/super.c
> > > @@ -264,6 +264,8 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
> > >         if (!ovl_upper_mnt(ofs))
> > >                 return 0;
> > >
> > > +       if (ofs->config.nosync)
> > > +               return 0;
> >
> > I'd be happier if we implement "sync=off/fs" from the start, or at least make
> > ofs->config.sync an enum or bit mask to represent these modes, even if
> > we only implement mount option "sync=off" to begin with.
>
> I am also in favor of implementing ovlsync=fs from the beginning, if
> that's what the image build use case is going to use. That way, builds
> will use the correct option from the beginning and will automatically
> benefit when overlayfs improves sync_fs to sync only the inodes of
> upper/ (and not the whole filesystem).
>
> >
> > >         /*
> > >          * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
> > >          * All the super blocks will be iterated, including upper_sb.
> > > @@ -362,6 +364,8 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
> > >         if (ofs->config.metacopy != ovl_metacopy_def)
> > >                 seq_printf(m, ",metacopy=%s",
> > >                            ofs->config.metacopy ? "on" : "off");
> > > +       if (ofs->config.nosync)
> > > +               seq_puts(m, ",nosync");
> > >         return 0;
> > >  }
> > >
> > > @@ -376,9 +380,11 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
> > >
> > >         if (*flags & SB_RDONLY && !sb_rdonly(sb)) {
> > >                 upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > -               down_read(&upper_sb->s_umount);
> > > -               ret = sync_filesystem(upper_sb);
> > > -               up_read(&upper_sb->s_umount);
> > > +               if (!ofs->config.nosync) {
> > > +                       down_read(&upper_sb->s_umount);
> > > +                       ret = sync_filesystem(upper_sb);
> > > +                       up_read(&upper_sb->s_umount);
> > > +               }
> > >         }
> > >
> > >         return ret;
> > > @@ -411,6 +417,7 @@ enum {
> > >         OPT_XINO_AUTO,
> > >         OPT_METACOPY_ON,
> > >         OPT_METACOPY_OFF,
> > > +       OPT_NOSYNC,
> > >         OPT_ERR,
> > >  };
> > >
> > > @@ -429,6 +436,7 @@ static const match_table_t ovl_tokens = {
> > >         {OPT_XINO_AUTO,                 "xino=auto"},
> > >         {OPT_METACOPY_ON,               "metacopy=on"},
> > >         {OPT_METACOPY_OFF,              "metacopy=off"},
> > > +       {OPT_NOSYNC,                    "nosync"},
> >
> > As should be very clear by now, I prefer that we call the option "sync=off",
> > so that we can later (or now) also implement "sync=fs" and maybe also
> > "sync=writeback", but apart from that one semantic change, I am fine with
> > the patch as is.
>
> I think I will defer sync=writeback until somebody needs it.
> For now, I like the idea of implementing sync=off and sync=fs. Now
> we just need to decide on the name of the option. Does ovlsync=off/fs
> sound good?
>

uppersync=off/fs?

Thanks,
Amir.


* Re: [RFC PATCH] overlayfs: Provide a mount option "nosync" to skip sync
  2020-06-30 19:37 [RFC PATCH] overlayfs: Provide a mount option "nosync" to skip sync Vivek Goyal
  2020-07-01 10:31 ` Amir Goldstein
@ 2020-07-16 20:41 ` Vivek Goyal
  1 sibling, 0 replies; 5+ messages in thread
From: Vivek Goyal @ 2020-07-16 20:41 UTC (permalink / raw)
  To: linux-unionfs, miklos
  Cc: amir73il, gscrivan, pmatilai, dwalsh, swhiteho, sandeen

On Tue, Jun 30, 2020 at 03:37:08PM -0400, Vivek Goyal wrote:
> Container folks are complaining that dnf/yum issues too many sync while
> installing packages and this slows down the image build. Build
> requirement is such that they don't care if a node goes down while
> build was still going on. In that case, they will simply throw away
> unfinished layer and start new build. So they don't care about syncing
> intermediate state to the disk and hence don't want to pay the price
> associated with sync.
> 

Hi Miklos,

Ping for this patch. What do you think about it? Can it be merged?

Thanks
Vivek

> So they are asking for an option where they can disable sync on overlay
> mount point completely and user space will do sync management on upper
> layer as needed.
> 
> They primarily seem to have two use cases.
> 
> - For building images, they will mount overlay with nosync and then sync
>   upper layer after unmounting overlay and reuse upper as lower for next
>   layer.
> 
> - For running containers, they don't seem to care about syncing upper
>   layer because if node goes down, they will simply throw away upper
>   layer and create a fresh one.
> 
> So this patch provides a mount option "nosync" which disables all forms
> of sync. Now it is caller's responsibility to manage sync of upper layer
> before it is reused again.
> 
> I am seeing roughly 20% speed up in my VM where I am just installing
> emacs in an image. Installation time drops from 31 seconds to 25 seconds
> when nosync option is used. This is for the case of building on top
> of an image where all packages are already cached. That way I take
> the network operations latency out of the measurement.
> 
> Giuseppe is also looking to cut down on number of iops done on the
> disk. He is complaining that often in cloud their VMs are throttled
> if they cross the limit. This option can help them where they reduce
> number of iops (by cutting down on frequent sync and writebacks).
> 
> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  Documentation/filesystems/overlayfs.rst | 20 ++++++++++++++++++++
>  fs/overlayfs/copy_up.c                  | 12 ++++++++----
>  fs/overlayfs/file.c                     | 11 ++++++++++-
>  fs/overlayfs/ovl_entry.h                |  1 +
>  fs/overlayfs/readdir.c                  |  3 +++
>  fs/overlayfs/super.c                    | 23 ++++++++++++++++++++---
>  6 files changed, 62 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> index 660dbaf0b9b8..0a42f26a3f0c 100644
> --- a/Documentation/filesystems/overlayfs.rst
> +++ b/Documentation/filesystems/overlayfs.rst
> @@ -563,6 +563,26 @@ This verification may cause significant overhead in some cases.
>  Note: the mount options index=off,nfs_export=on are conflicting and will
>  result in an error.
>  
> +Disable sync
> +------------
> +By default, overlay skips sync on files residing on a lower layer.  It
> +is possible to skip sync operations for files on the upper layer as well
> +with the 'nosync' mount option. This option disables all forms of sync
> +from overlay, including the one done at umount/remount, and it is the
> +user's responsibility to sync the file system on which the upper layer
> +resides.
> +
> +With this option, data loss will happen if overlayfs upper layer is
> +not synced. So use this option very carefully. This is only for the
> +use cases where users discard upper layer if they could not sync it
> +successfully.
> +
> +A typical workflow is:
> +
> +- mount overlay
> +- do a bunch of operations
> +- unmount overlay
> +- sync the filesystem containing the upper layer
>  
>  Testsuite
>  ---------
> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> index 79dd052c7dbf..5431a89bbd8a 100644
> --- a/fs/overlayfs/copy_up.c
> +++ b/fs/overlayfs/copy_up.c
> @@ -128,7 +128,8 @@ int ovl_copy_xattr(struct dentry *old, struct dentry *new)
>  	return error;
>  }
>  
> -static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
> +static int ovl_copy_up_data(struct ovl_fs *ofs, struct path *old,
> +			    struct path *new, loff_t len)
>  {
>  	struct file *old_file;
>  	struct file *new_file;
> @@ -218,7 +219,7 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
>  		len -= bytes;
>  	}
>  out:
> -	if (!error)
> +	if (!error && !ofs->config.nosync)
>  		error = vfs_fsync(new_file, 0);
>  	fput(new_file);
>  out_fput:
> @@ -484,6 +485,7 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
>  
>  static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
>  {
> +	struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
>  	int err;
>  
>  	/*
> @@ -499,7 +501,8 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
>  		upperpath.dentry = temp;
>  
>  		ovl_path_lowerdata(c->dentry, &datapath);
> -		err = ovl_copy_up_data(&datapath, &upperpath, c->stat.size);
> +		err = ovl_copy_up_data(ofs, &datapath, &upperpath,
> +				       c->stat.size);
>  		if (err)
>  			return err;
>  	}
> @@ -784,6 +787,7 @@ static bool ovl_need_meta_copy_up(struct dentry *dentry, umode_t mode,
>  /* Copy up data of an inode which was copied up metadata only in the past. */
>  static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
>  {
> +	struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
>  	struct path upperpath, datapath;
>  	int err;
>  	char *capability = NULL;
> @@ -804,7 +808,7 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
>  			goto out;
>  	}
>  
> -	err = ovl_copy_up_data(&datapath, &upperpath, c->stat.size);
> +	err = ovl_copy_up_data(ofs, &datapath, &upperpath, c->stat.size);
>  	if (err)
>  		goto out_free;
>  
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index 01820e654a21..a361890a8d05 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -329,6 +329,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>  	struct fd real;
>  	const struct cred *old_cred;
>  	ssize_t ret;
> +	int ifl = iocb->ki_flags;
>  
>  	if (!iov_iter_count(iter))
>  		return 0;
> @@ -344,11 +345,14 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>  	if (ret)
>  		goto out_unlock;
>  
> +	if (OVL_FS(inode->i_sb)->config.nosync)
> +		ifl &= ~(IOCB_DSYNC | IOCB_SYNC);
> +
>  	old_cred = ovl_override_creds(file_inode(file)->i_sb);
>  	if (is_sync_kiocb(iocb)) {
>  		file_start_write(real.file);
>  		ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
> -				     ovl_iocb_to_rwf(iocb->ki_flags));
> +				     ovl_iocb_to_rwf(ifl));
>  		file_end_write(real.file);
>  		/* Update size */
>  		ovl_copyattr(ovl_inode_real(inode), inode);
> @@ -368,6 +372,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>  		real.flags = 0;
>  		aio_req->orig_iocb = iocb;
>  		kiocb_clone(&aio_req->iocb, iocb, real.file);
> +		aio_req->iocb.ki_flags = ifl;
>  		aio_req->iocb.ki_complete = ovl_aio_rw_complete;
>  		ret = vfs_iocb_iter_write(real.file, &aio_req->iocb, iter);
>  		if (ret != -EIOCBQUEUED)
> @@ -430,6 +435,10 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>  	struct fd real;
>  	const struct cred *old_cred;
>  	int ret;
> +	struct ovl_fs *ofs = OVL_FS(file_inode(file)->i_sb);
> +
> +	if (ofs->config.nosync)
> +		return 0;
>  
>  	ret = ovl_real_fdget_meta(file, &real, !datasync);
>  	if (ret)
> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> index b429c80879ee..034a8d9897e0 100644
> --- a/fs/overlayfs/ovl_entry.h
> +++ b/fs/overlayfs/ovl_entry.h
> @@ -17,6 +17,7 @@ struct ovl_config {
>  	bool nfs_export;
>  	int xino;
>  	bool metacopy;
> +	bool nosync;
>  };
>  
>  struct ovl_sb {
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 6918b98faeb6..9e93db028dbf 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -863,6 +863,9 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
>  	if (!OVL_TYPE_UPPER(ovl_path_type(dentry)))
>  		return 0;
>  
> +	if (OVL_FS(dentry->d_sb)->config.nosync)
> +		return 0;
> +
>  	/*
>  	 * Need to check if we started out being a lower dir, but got copied up
>  	 */
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 91476bc422f9..c28ab39b5c70 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -264,6 +264,8 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
>  	if (!ovl_upper_mnt(ofs))
>  		return 0;
>  
> +	if (ofs->config.nosync)
> +		return 0;
>  	/*
>  	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
>  	 * All the super blocks will be iterated, including upper_sb.
> @@ -362,6 +364,8 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
>  	if (ofs->config.metacopy != ovl_metacopy_def)
>  		seq_printf(m, ",metacopy=%s",
>  			   ofs->config.metacopy ? "on" : "off");
> +	if (ofs->config.nosync)
> +		seq_puts(m, ",nosync");
>  	return 0;
>  }
>  
> @@ -376,9 +380,11 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
>  
>  	if (*flags & SB_RDONLY && !sb_rdonly(sb)) {
>  		upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> -		down_read(&upper_sb->s_umount);
> -		ret = sync_filesystem(upper_sb);
> -		up_read(&upper_sb->s_umount);
> +		if (!ofs->config.nosync) {
> +			down_read(&upper_sb->s_umount);
> +			ret = sync_filesystem(upper_sb);
> +			up_read(&upper_sb->s_umount);
> +		}
>  	}
>  
>  	return ret;
> @@ -411,6 +417,7 @@ enum {
>  	OPT_XINO_AUTO,
>  	OPT_METACOPY_ON,
>  	OPT_METACOPY_OFF,
> +	OPT_NOSYNC,
>  	OPT_ERR,
>  };
>  
> @@ -429,6 +436,7 @@ static const match_table_t ovl_tokens = {
>  	{OPT_XINO_AUTO,			"xino=auto"},
>  	{OPT_METACOPY_ON,		"metacopy=on"},
>  	{OPT_METACOPY_OFF,		"metacopy=off"},
> +	{OPT_NOSYNC,			"nosync"},
>  	{OPT_ERR,			NULL}
>  };
>  
> @@ -573,6 +581,10 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
>  			metacopy_opt = true;
>  			break;
>  
> +		case OPT_NOSYNC:
> +			config->nosync = true;
> +			break;
> +
>  		default:
>  			pr_err("unrecognized mount option \"%s\" or missing value\n",
>  					p);
> @@ -588,6 +600,11 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
>  		config->workdir = NULL;
>  	}
>  
> +	if (!config->upperdir && config->nosync) {
> +		pr_info("option nosync is meaningless in a non-upper mount, ignoring it.\n");
> +		config->nosync = false;
> +	}
> +
>  	err = ovl_parse_redirect_mode(config, config->redirect_mode);
>  	if (err)
>  		return err;
> -- 
> 2.25.4
> 
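
A minimal user-space sketch of the last step in the documented workflow,
i.e. syncing the filesystem that contains the upper layer once the overlay
has been unmounted. It assumes Linux with glibc's syncfs(2) wrapper; the
helper name sync_upper_fs() is hypothetical and not part of the patch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): flush the filesystem that
 * holds the overlay upper directory. Meant to be called after the overlay
 * has been unmounted. Returns 0 on success, -1 with errno set on failure.
 */
int sync_upper_fs(const char *upperdir)
{
	int fd, ret, saved_errno;

	/* Any fd on the target filesystem works; the directory itself is enough. */
	fd = open(upperdir, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;

	/* syncfs() writes back only the filesystem containing fd. */
	ret = syncfs(fd);

	/* Preserve the errno from syncfs() across close(). */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;

	return ret;
}
```

A build tool could call this after unmounting the overlay and before
reusing upper as a lower layer, and throw the layer away if it returns -1.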

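The ovl_write_iter() hunk above clears the sync bits from a local copy of
the iocb flags before passing the write down to the real file. A
stand-alone sketch of that masking follows; the EX_IOCB_* values are
illustrative only and do not match the kernel's IOCB_* constants, and
ex_effective_flags() is a made-up name.

```c
#include <stdbool.h>

/* Illustrative bit values only; the kernel's IOCB_* constants differ. */
#define EX_IOCB_APPEND	(1u << 0)
#define EX_IOCB_DSYNC	(1u << 1)
#define EX_IOCB_SYNC	(1u << 2)

/*
 * Return the flags to use for the write on the real (upper) file.
 * With nosync set, both sync bits are dropped; all other bits are kept.
 */
unsigned int ex_effective_flags(unsigned int ki_flags, bool nosync)
{
	if (nosync)
		ki_flags &= ~(EX_IOCB_DSYNC | EX_IOCB_SYNC);

	return ki_flags;
}
```

Working on a copy leaves the caller's iocb->ki_flags untouched, which is
why the patch introduces the local variable ifl.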

end of thread, other threads:[~2020-07-16 20:41 UTC | newest]

Thread overview: 5+ messages
2020-06-30 19:37 [RFC PATCH] overlayfs: Provide a mount option "nosync" to skip sync Vivek Goyal
2020-07-01 10:31 ` Amir Goldstein
2020-07-01 16:25   ` Vivek Goyal
2020-07-01 17:42     ` Amir Goldstein
2020-07-16 20:41 ` Vivek Goyal
