linux-unionfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Sargun Dhillon <sargun@sargun.me>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: linux-unionfs@vger.kernel.org, miklos@szeredi.hu,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Giuseppe Scrivano <gscrivan@redhat.com>,
	Daniel J Walsh <dwalsh@redhat.com>,
	David Howells <dhowells@redhat.com>,
	linux-fsdevel@vger.kernel.org,
	Amir Goldstein <amir73il@gmail.com>
Subject: Re: [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe
Date: Mon, 16 Nov 2020 17:38:34 +0000	[thread overview]
Message-ID: <20201116173833.GA18698@ircssh-2.c.rugged-nimbus-611.internal> (raw)
In-Reply-To: <20201116144240.GA9190@redhat.com>

On Mon, Nov 16, 2020 at 09:42:40AM -0500, Vivek Goyal wrote:
> On Sun, Nov 15, 2020 at 08:57:58PM -0800, Sargun Dhillon wrote:
> > Overlayfs added the ability to setup mounts where all syncs could be
> > short-circuted in (2a99ddacee43: ovl: provide a mount option "volatile").
> > 
> > A user might want to remount this fs, but we do not let the user because
> > of the "incompat" detection feature. In the case of volatile, it is safe
> > to do something like[1]:
> > 
> > $ sync -f /root/upperdir
> > $ rm -rf /root/workdir/incompat/volatile
> > 
> > There are two ways to go about this. You can call sync on the underlying
> > filesystem, check the error code, and delete the dirty file if everything
> > is clean. If you're running lots of containers on the same filesystem, or
> > you want to avoid all unnecessary I/O, this may be suboptimal.
> > 
> 
> Hi Sargun,
> 
> I had asked bunch of questions in previous mail thread to be more
> clear on your requirements but never got any response. It would
> have helped understanding your requirements better.
> 
Sorry, I didn't see your questions.


> How about following patch set which seems to sync only dirty inodes of
> upper belonging to a particular overlayfs instance.
> 
> https://lore.kernel.org/linux-unionfs/20201113065555.147276-1-cgxu519@mykernel.net/
> 
> So if could implement a mount option which ignores fsync but upon
> syncfs, only syncs dirty inodes of that overlayfs instance, it will
> make sure we are not syncing whole of the upper fs. And we could
> do this syncing on unmount of overlayfs and remove dirty file upon
> successful sync.
> 
Doing any kind of sync that involves a head-of-line blocking metadata flush,
or even data flush causes significan problems on our systems. We at some 
point did some analysis on systems in our fleet that did sync and noticed
that it has a very suboptimal effect on system-wide performance because
the sudden rush of IOPs caused our drives to stall.

I didn't dig too much into it, but on XFS, even letting users do sync
directly on their inodes could lead to trouble in terms of the spurious
burst of IOPs generated. Even though our drives can sustain a large amount
of I/O over time, a sudden burst causes our cloud provider to throttle them,
which in turn can lead to slow I/O across the system, and depending on what's
going on, this can turn into WQ stalls.

> Looks like this will be much simpler method and should be able to
> meet your requirements (As long as you are fine with syncing dirty
> upper inodes of this overlay instance on unmount).
> 
> Thanks
> Vivek
> 
> > Alternatively, you can blindly delete the dirty file, and "hope for the
> > best".
> > 
> > This patch introduces transparent functionality to check if it is
> > (relatively) safe to reuse the upperdir. It ensures that the filesystem
> > hasn't been remounted, the system hasn't been rebooted, nor has the
> > overlayfs code changed. It also checks the errseq on the superblock
> > indicating if there have been any writeback errors since the previous
> > mount. Currently, this information is not directly exposed to userspace, so
> > the user cannot make decisions based on this. Instead we checkpoint
> > this information to disk, and upon remount we see if any of it has
> > changed. Since the structure is explicitly not meant to be used
> > between different versions of the code, its stability does not
> > matter so much.
> 
> > [1]: https://lore.kernel.org/linux-unionfs/CAOQ4uxhKr+j5jFyEC2gJX8E8M19mQ3CqdTYaPZOvDQ9c0qLEzw@mail.gmail.com/T/#m6abe713e4318202ad57f301bf28a414e1d824f9c
> > 
> > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-unionfs@vger.kernel.org
> > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > ---
> >  Documentation/filesystems/overlayfs.rst |  5 +-
> >  fs/overlayfs/overlayfs.h                | 34 ++++++++++
> >  fs/overlayfs/readdir.c                  | 86 +++++++++++++++++++++++--
> >  fs/overlayfs/super.c                    | 22 ++++++-
> >  4 files changed, 139 insertions(+), 8 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > index 580ab9a0fe31..fa3faeeab727 100644
> > --- a/Documentation/filesystems/overlayfs.rst
> > +++ b/Documentation/filesystems/overlayfs.rst
> > @@ -581,7 +581,10 @@ checks for this directory and refuses to mount if present. This is a strong
> >  indicator that user should throw away upper and work directories and create
> >  fresh one. In very limited cases where the user knows that the system has
> >  not crashed and contents of upperdir are intact, The "volatile" directory
> > -can be removed.
> > +can be removed.  In certain cases it the filesystem can detect that the
> > +upperdir can be reused safely, and it will not require the user to
> > +manually delete the volatile directory.
> > +
> >  
> >  Testsuite
> >  ---------
> > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > index 9eb911f243e1..980d2c930f7a 100644
> > --- a/fs/overlayfs/overlayfs.h
> > +++ b/fs/overlayfs/overlayfs.h
> > @@ -30,6 +30,11 @@ enum ovl_path_type {
> >  #define OVL_XATTR_NLINK OVL_XATTR_PREFIX "nlink"
> >  #define OVL_XATTR_UPPER OVL_XATTR_PREFIX "upper"
> >  #define OVL_XATTR_METACOPY OVL_XATTR_PREFIX "metacopy"
> > +#define OVL_XATTR_VOLATILE OVL_XATTR_PREFIX "volatile"
> > +
> > +#define OVL_INCOMPATDIR_NAME "incompat"
> > +#define OVL_VOLATILEDIR_NAME "volatile"
> > +#define OVL_VOLATILE_DIRTY_NAME "dirty"
> >  
> >  enum ovl_inode_flag {
> >  	/* Pure upper dir that may contain non pure upper entries */
> > @@ -54,6 +59,32 @@ enum {
> >  	OVL_XINO_ON,
> >  };
> >  
> > +/*
> > + * This is copied into the volatile xattr, and the user does not interact with
> > + * it. There is no stability requirement, as a reboot explicitly invalidates
> > + * a volatile workdir. It is explicitly meant not to be a stable api.
> > + *
> > + * Although this structure isn't meant to be stable it is exposed to potentially
> > + * unprivileged users. We don't do any kind of cryptographic operations with
> > + * the structure, so it could be tampered with, or inspected. Don't put
> > + * kernel memory pointers in it, or anything else that could cause problems,
> > + * or information disclosure.
> > + */
> > +struct overlay_volatile_info {
> > +	/*
> > +	 * This uniquely identifies a boot, and is reset if overlayfs itself
> > +	 * is reloaded. Therefore we check our current / known boot_id
> > +	 * against this before looking at any other fields to validate:
> > +	 * 1. Is this datastructure laid out in the way we expect? (Overlayfs
> > +	 *    module, reboot, etc...)
> > +	 * 2. Could something have changed (like the s_instance_id counter
> > +	 *    resetting)
> > +	 */
> > +	uuid_t		overlay_boot_id;
> > +	u64		s_instance_id;
> > +	errseq_t	errseq; /* Just a u32 */
> > +} __packed;
> > +
> >  /*
> >   * The tuple (fh,uuid) is a universal unique identifier for a copy up origin,
> >   * where:
> > @@ -501,3 +532,6 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
> >  
> >  /* export.c */
> >  extern const struct export_operations ovl_export_operations;
> > +
> > +/* super.c */
> > +extern uuid_t overlay_boot_id;
> > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > index f8cc15533afa..ee0d2b88a19c 100644
> > --- a/fs/overlayfs/readdir.c
> > +++ b/fs/overlayfs/readdir.c
> > @@ -1054,7 +1054,84 @@ int ovl_check_d_type_supported(struct path *realpath)
> >  	return rdd.d_type_supported;
> >  }
> >  
> > -#define OVL_INCOMPATDIR_NAME "incompat"
> > +static int ovl_check_incompat_volatile(struct ovl_cache_entry *p,
> > +				       struct path *path)
> > +{
> > +	int err, ret = -EINVAL;
> > +	struct overlay_volatile_info info;
> > +	struct dentry *d_volatile, *d_dirty;
> > +
> > +	d_volatile = lookup_one_len(p->name, path->dentry, p->len);
> > +	if (IS_ERR(d_volatile))
> > +		return PTR_ERR(d_volatile);
> > +
> > +	inode_lock_nested(d_volatile->d_inode, I_MUTEX_PARENT);
> > +	d_dirty = lookup_one_len(OVL_VOLATILE_DIRTY_NAME, d_volatile,
> > +				 strlen(OVL_VOLATILE_DIRTY_NAME));
> > +	if (IS_ERR(d_dirty)) {
> > +		err = PTR_ERR(d_dirty);
> > +		if (err != -ENOENT)
> > +			ret = err;
> > +		goto out_putvolatile;
> > +	}
> > +
> > +	if (!d_dirty->d_inode)
> > +		goto out_putdirty;
> > +
> > +	inode_lock_nested(d_dirty->d_inode, I_MUTEX_XATTR);
> > +	err = ovl_do_getxattr(d_dirty, OVL_XATTR_VOLATILE, &info, sizeof(info));
> > +	inode_unlock(d_dirty->d_inode);
> > +	if (err != sizeof(info))
> > +		goto out_putdirty;
> > +
> > +	if (!uuid_equal(&overlay_boot_id, &info.overlay_boot_id)) {
> > +		pr_debug("boot id has changed (reboot or module reloaded)\n");
> > +		goto out_putdirty;
> > +	}
> > +
> > +	if (d_dirty->d_inode->i_sb->s_instance_id != info.s_instance_id) {
> > +		pr_debug("workdir has been unmounted and remounted\n");
> > +		goto out_putdirty;
> > +	}
> > +
> > +	err = errseq_check(&d_dirty->d_inode->i_sb->s_wb_err, info.errseq);
> > +	if (err) {
> > +		pr_debug("workdir dir has experienced errors: %d\n", err);
> > +		goto out_putdirty;
> > +	}
> > +
> > +	/* Dirty file is okay, delete it. */
> > +	ret = ovl_do_unlink(d_volatile->d_inode, d_dirty);
> > +
> > +out_putdirty:
> > +	dput(d_dirty);
> > +out_putvolatile:
> > +	inode_unlock(d_volatile->d_inode);
> > +	dput(d_volatile);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * check_incompat checks this specific incompat entry for incompatibility.
> > + * If it is found to be incompatible -EINVAL will be returned.
> > + *
> > + * Any other -errno indicates an unknown error, and filesystem mounting
> > + * should be aborted.
> > + */
> > +static int ovl_check_incompat(struct ovl_cache_entry *p, struct path *path)
> > +{
> > +	int err = -EINVAL;
> > +
> > +	if (!strcmp(p->name, OVL_VOLATILEDIR_NAME))
> > +		err = ovl_check_incompat_volatile(p, path);
> > +
> > +	if (err == -EINVAL)
> > +		pr_err("incompat feature '%s' cannot be mounted\n", p->name);
> > +	else
> > +		pr_debug("incompat '%s' handled: %d\n", p->name, err);
> > +
> > +	return err;
> > +}
> >  
> >  static int ovl_workdir_cleanup_recurse(struct path *path, int level)
> >  {
> > @@ -1098,10 +1175,9 @@ static int ovl_workdir_cleanup_recurse(struct path *path, int level)
> >  			if (p->len == 2 && p->name[1] == '.')
> >  				continue;
> >  		} else if (incompat) {
> > -			pr_err("overlay with incompat feature '%s' cannot be mounted\n",
> > -				p->name);
> > -			err = -EINVAL;
> > -			break;
> > +			err = ovl_check_incompat(p, path);
> > +			if (err)
> > +				break;
> >  		}
> >  		dentry = lookup_one_len(p->name, path->dentry, p->len);
> >  		if (IS_ERR(dentry))
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index 2ee0ba16cc7b..94980898009f 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -15,6 +15,7 @@
> >  #include <linux/seq_file.h>
> >  #include <linux/posix_acl_xattr.h>
> >  #include <linux/exportfs.h>
> > +#include <linux/uuid.h>
> >  #include "overlayfs.h"
> >  
> >  MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
> > @@ -23,6 +24,7 @@ MODULE_LICENSE("GPL");
> >  
> >  
> >  struct ovl_dir_cache;
> > +uuid_t overlay_boot_id;
> >  
> >  #define OVL_MAX_STACK 500
> >  
> > @@ -1246,20 +1248,35 @@ static struct dentry *ovl_lookup_or_create(struct dentry *parent,
> >   */
> >  static int ovl_create_volatile_dirty(struct ovl_fs *ofs)
> >  {
> > +	int err;
> >  	unsigned int ctr;
> >  	struct dentry *d = dget(ofs->workbasedir);
> >  	static const char *const volatile_path[] = {
> > -		OVL_WORKDIR_NAME, "incompat", "volatile", "dirty"
> > +		OVL_WORKDIR_NAME,
> > +		OVL_INCOMPATDIR_NAME,
> > +		OVL_VOLATILEDIR_NAME,
> > +		OVL_VOLATILE_DIRTY_NAME,
> >  	};
> >  	const char *const *name = volatile_path;
> > +	struct overlay_volatile_info info = {};
> >  
> >  	for (ctr = ARRAY_SIZE(volatile_path); ctr; ctr--, name++) {
> >  		d = ovl_lookup_or_create(d, *name, ctr > 1 ? S_IFDIR : S_IFREG);
> >  		if (IS_ERR(d))
> >  			return PTR_ERR(d);
> >  	}
> > +
> > +	uuid_copy(&info.overlay_boot_id, &overlay_boot_id);
> > +	info.s_instance_id = d->d_inode->i_sb->s_instance_id;
> > +	info.errseq = errseq_sample(&d->d_inode->i_sb->s_wb_err);
> > +
> > +
> > +	err = ovl_do_setxattr(d, OVL_XATTR_VOLATILE, &info, sizeof(info), 0);
> > +	if (err == -EOPNOTSUPP)
> > +		err = 0;
> > +
> >  	dput(d);
> > -	return 0;
> > +	return err;
> >  }
> >  
> >  static int ovl_make_workdir(struct super_block *sb, struct ovl_fs *ofs,
> > @@ -2045,6 +2062,7 @@ static int __init ovl_init(void)
> >  {
> >  	int err;
> >  
> > +	uuid_gen(&overlay_boot_id);
> >  	ovl_inode_cachep = kmem_cache_create("ovl_inode",
> >  					     sizeof(struct ovl_inode), 0,
> >  					     (SLAB_RECLAIM_ACCOUNT|
> > -- 
> > 2.25.1
> > 
> 

      parent reply	other threads:[~2020-11-16 17:39 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-16  4:57 [RFC PATCH 0/3] Make overlayfs volatile mounts reusable Sargun Dhillon
2020-11-16  4:57 ` [RFC PATCH 1/3] fs: Add s_instance_id field to superblock for unique identification Sargun Dhillon
2020-11-16  5:07   ` Sargun Dhillon
2020-11-16  4:57 ` [RFC PATCH 2/3] overlay: Add ovl_do_getxattr helper Sargun Dhillon
2020-11-16 11:00   ` Amir Goldstein
2020-11-16  4:57 ` [RFC PATCH 3/3] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon
2020-11-16  9:31   ` Amir Goldstein
2020-11-16 10:30     ` Sargun Dhillon
2020-11-16 11:17       ` Amir Goldstein
2020-11-16 12:52         ` Amir Goldstein
2020-11-16 14:42   ` Vivek Goyal
2020-11-16 14:45     ` Vivek Goyal
2020-11-16 15:20     ` Amir Goldstein
2020-11-16 16:36       ` Vivek Goyal
2020-11-16 18:25         ` Sargun Dhillon
2020-11-16 19:27           ` Vivek Goyal
2020-11-16 20:18         ` Amir Goldstein
2020-11-16 21:09           ` Vivek Goyal
2020-11-17  5:33             ` Amir Goldstein
2020-11-17 14:48               ` Vivek Goyal
2020-11-17 15:24                 ` Amir Goldstein
2020-11-17 15:40                   ` Vivek Goyal
2020-11-17 16:46                   ` Vivek Goyal
2020-11-17 18:03                     ` Amir Goldstein
2020-11-17 18:29                       ` Vivek Goyal
2020-11-18  7:24                         ` Amir Goldstein
2020-11-18  8:27                           ` Sargun Dhillon
2020-11-18 10:46                             ` Amir Goldstein
2020-11-18 14:55                           ` Vivek Goyal
2020-11-16 21:26           ` Vivek Goyal
2020-11-16 22:14             ` Sargun Dhillon
2020-11-17  5:41               ` Amir Goldstein
2020-11-17 17:05               ` Vivek Goyal
2020-11-16 17:38     ` Sargun Dhillon [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201116173833.GA18698@ircssh-2.c.rugged-nimbus-611.internal \
    --to=sargun@sargun.me \
    --cc=amir73il@gmail.com \
    --cc=dhowells@redhat.com \
    --cc=dwalsh@redhat.com \
    --cc=gscrivan@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-unionfs@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    --cc=vgoyal@redhat.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).