All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] ovl: consistent_fd feature
@ 2017-03-29 14:36 Amir Goldstein
  2017-03-29 14:36 ` [PATCH 1/6] ovl: store path type in dentry Amir Goldstein
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Amir Goldstein @ 2017-03-29 14:36 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

Mikos,

This patch set implement the 'simple' solution we discussed on LSF.
For the special case of all overlays on the same fs with clone support,
files are copied up on open for read (as O_TMPFILE) and linked to
upperdir on first open for write.

- Patches 1-2 are the refactoring I sent you earlier.  They are not
  strictly needed for the consistent_fd feature - up to you.
- Patches 3-4 test 'samefs' and 'cloneup' properties of underlying fs.
- Patch 5 copies up open for read (for the special case).
- Patch 6 defers linking the tmpfile to first open for write.

Some of the design choices for patch 6 are questionable:
the storing of tempfile in ovl_dentry_update(),
temp dentry is freed only on overlay dentry release.
awaiting your feedback about those choises.

xfstests run of ./check -overlay -g quick passed on
both ext4 (no clone) and xfs+reflink (yes clone).
test overlay/016 ("Test ro/rw fd data inconsistecies")
passes with xfs+reflink.
test overlay/013 ("Test truncate running executable...")
fails with xfs+reflink, because test expects the inconsistency.

I also modified unionmount-testsuite and added ./run --ov --samefs
to setup lower+upper on same base fs of your choise [1].
It passes with base fs tmpfs (no clone) and base fs xfs+reflink.

Amir.

[1] https://github.com/amir73il/unionmount-testsuite/commits/overlayfs-devel

Amir Goldstein (6):
  ovl: store path type in dentry
  ovl: cram opaque boolean into type flags
  ovl: check if all layers are on the same fs
  ovl: check if clone from lower to upper is supported
  ovl: copy on read with consistent_fd feature
  ovl: link upper tempfile on open for write

 Documentation/filesystems/vfs.txt | 13 +++---
 fs/overlayfs/Kconfig              | 18 ++++++++
 fs/overlayfs/copy_up.c            | 97 ++++++++++++++++++++++++++++++++-------
 fs/overlayfs/inode.c              | 41 ++++++++++++-----
 fs/overlayfs/namei.c              | 10 ++--
 fs/overlayfs/overlayfs.h          | 12 ++++-
 fs/overlayfs/ovl_entry.h          | 20 ++++++--
 fs/overlayfs/super.c              | 59 ++++++++++++++++++++++--
 fs/overlayfs/util.c               | 69 +++++++++++++++++++++++-----
 9 files changed, 277 insertions(+), 62 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/6] ovl: store path type in dentry
  2017-03-29 14:36 [PATCH 0/6] ovl: consistent_fd feature Amir Goldstein
@ 2017-03-29 14:36 ` Amir Goldstein
  2017-03-29 14:36 ` [PATCH 2/6] ovl: cram opaque boolean into type flags Amir Goldstein
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2017-03-29 14:36 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

ovl_path_type() is called quite often (e.g.: on every file open)
to calculate the path type flags of overlay dentry.  Those flags
are rarely changed after dentry instantiation and when changed,
flags can only be added. (e.g.: on copyup, rename and create).

Store the type value in ovl_entry and update the flags when
needed, so ovl_path_type() just returns the stored value.

The old ovl_path_type() took care of clearing OVL_TYPE_MERGE
for a non-dir entry.  It did so for a reason that seem obsolete.
In any case, the code never checks OVL_TYPE_MERGE on non-dir entry,
so merge flag for non-dir is harmless.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/namei.c     |  1 +
 fs/overlayfs/overlayfs.h |  1 +
 fs/overlayfs/ovl_entry.h |  3 +++
 fs/overlayfs/super.c     |  1 +
 fs/overlayfs/util.c      | 32 ++++++++++++++++++++++++--------
 5 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index b8b0778..9d7884d 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -338,6 +338,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 	kfree(stack);
 	kfree(d.redirect);
 	dentry->d_fsdata = oe;
+	ovl_update_type(dentry);
 	d_add(dentry, inode);
 
 	return NULL;
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 741dc0b..51419b5 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -155,6 +155,7 @@ struct ovl_entry *ovl_alloc_entry(unsigned int numlower);
 bool ovl_dentry_remote(struct dentry *dentry);
 bool ovl_dentry_weird(struct dentry *dentry);
 enum ovl_path_type ovl_path_type(struct dentry *dentry);
+enum ovl_path_type ovl_update_type(struct dentry *dentry);
 void ovl_path_upper(struct dentry *dentry, struct path *path);
 void ovl_path_lower(struct dentry *dentry, struct path *path);
 enum ovl_path_type ovl_path_real(struct dentry *dentry, struct path *path);
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 59614fa..293be5f 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -31,6 +31,8 @@ struct ovl_fs {
 	wait_queue_head_t copyup_wq;
 };
 
+enum ovl_path_type;
+
 /* private information held for every overlayfs dentry */
 struct ovl_entry {
 	struct dentry *__upperdentry;
@@ -44,6 +46,7 @@ struct ovl_entry {
 		};
 		struct rcu_head rcu;
 	};
+	enum ovl_path_type __type;
 	unsigned numlower;
 	struct path lowerstack[];
 };
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index c9e70d3..6d9766f 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -939,6 +939,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	kfree(stack);
 
 	root_dentry->d_fsdata = oe;
+	ovl_update_type(root_dentry);
 
 	realinode = d_inode(ovl_dentry_real(root_dentry));
 	ovl_inode_init(d_inode(root_dentry), realinode, !!upperpath.dentry);
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 1953986..aaef2e28 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -70,21 +70,36 @@ bool ovl_dentry_weird(struct dentry *dentry)
 enum ovl_path_type ovl_path_type(struct dentry *dentry)
 {
 	struct ovl_entry *oe = dentry->d_fsdata;
-	enum ovl_path_type type = 0;
+	enum ovl_path_type type = oe->__type;
 
-	if (oe->__upperdentry) {
-		type = __OVL_PATH_UPPER;
+	/* Matches smp_wmb() in ovl_update_type() */
+	smp_rmb();
+	return type;
+}
+
+enum ovl_path_type ovl_update_type(struct dentry *dentry)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+	enum ovl_path_type type = oe->__type;
 
-		/*
-		 * Non-dir dentry can hold lower dentry from previous
-		 * location.
-		 */
-		if (oe->numlower && d_is_dir(dentry))
+	/*
+	 * During the lifetime of an overlay dentry, those flags can
+	 * only be set, never cleared.
+	 */
+	if (oe->__upperdentry) {
+		type |= __OVL_PATH_UPPER;
+		if (oe->numlower)
 			type |= __OVL_PATH_MERGE;
 	} else {
 		if (oe->numlower > 1)
 			type |= __OVL_PATH_MERGE;
 	}
+	/*
+	 * Make sure type is consistent with __upperdentry before making it
+	 * visible to ovl_path_type().
+	 */
+	smp_wmb();
+	oe->__type = type;
 	return type;
 }
 
@@ -220,6 +235,7 @@ void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry)
 	 */
 	smp_wmb();
 	oe->__upperdentry = upperdentry;
+	ovl_update_type(dentry);
 }
 
 void ovl_inode_init(struct inode *inode, struct inode *realinode, bool is_upper)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/6] ovl: cram opaque boolean into type flags
  2017-03-29 14:36 [PATCH 0/6] ovl: consistent_fd feature Amir Goldstein
  2017-03-29 14:36 ` [PATCH 1/6] ovl: store path type in dentry Amir Goldstein
@ 2017-03-29 14:36 ` Amir Goldstein
  2017-03-29 14:36 ` [PATCH 3/6] ovl: check if all layers are on the same fs Amir Goldstein
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2017-03-29 14:36 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

Like the other type flags, 'opaque' is a state discovered in lookup,
set on certain ops (e.g.: create, rename) and never cleared.

We would like to add more state info to ovl_entry soon (for const ino)
and this state info would be added as type flags. It makes little sense
to have the 'opaque' state occupy a boolean, so drop the boolean member
of ovl_entry and use a type bit to represent opaqueness.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/namei.c     | 9 +++++----
 fs/overlayfs/overlayfs.h | 2 ++
 fs/overlayfs/ovl_entry.h | 1 -
 fs/overlayfs/util.c      | 5 +++--
 4 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index 9d7884d..cc9e041 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -224,7 +224,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 	struct dentry *upperdir, *upperdentry = NULL;
 	unsigned int ctr = 0;
 	struct inode *inode = NULL;
-	bool upperopaque = false;
+	enum ovl_path_type type = 0;
 	char *upperredirect = NULL;
 	struct dentry *this;
 	unsigned int i;
@@ -261,7 +261,8 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 			if (d.redirect[0] == '/')
 				poe = dentry->d_sb->s_root->d_fsdata;
 		}
-		upperopaque = d.opaque;
+		if (d.opaque)
+			type |= __OVL_PATH_OPAQUE;
 	}
 
 	if (!d.stop && poe->numlower) {
@@ -331,7 +332,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 	}
 
 	revert_creds(old_cred);
-	oe->opaque = upperopaque;
+	oe->__type = type;
 	oe->redirect = upperredirect;
 	oe->__upperdentry = upperdentry;
 	memcpy(oe->lowerstack, stack, sizeof(struct path) * ctr);
@@ -372,7 +373,7 @@ bool ovl_lower_positive(struct dentry *dentry)
 	 * whiteout.
 	 */
 	if (!dentry->d_inode)
-		return oe->opaque;
+		return OVL_TYPE_OPAQUE(oe->__type);
 
 	/* Negative upper -> positive lower */
 	if (!oe->__upperdentry)
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 51419b5..079051e 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -12,10 +12,12 @@
 enum ovl_path_type {
 	__OVL_PATH_UPPER	= (1 << 0),
 	__OVL_PATH_MERGE	= (1 << 1),
+	__OVL_PATH_OPAQUE	= (1 << 2),
 };
 
 #define OVL_TYPE_UPPER(type)	((type) & __OVL_PATH_UPPER)
 #define OVL_TYPE_MERGE(type)	((type) & __OVL_PATH_MERGE)
+#define OVL_TYPE_OPAQUE(type)	((type) & __OVL_PATH_OPAQUE)
 
 #define OVL_XATTR_PREFIX XATTR_TRUSTED_PREFIX "overlay."
 #define OVL_XATTR_OPAQUE OVL_XATTR_PREFIX "opaque"
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 293be5f..12c4922 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -41,7 +41,6 @@ struct ovl_entry {
 		struct {
 			u64 version;
 			const char *redirect;
-			bool opaque;
 			bool copying;
 		};
 		struct rcu_head rcu;
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index aaef2e28..159e851 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -179,7 +179,8 @@ void ovl_set_dir_cache(struct dentry *dentry, struct ovl_dir_cache *cache)
 bool ovl_dentry_is_opaque(struct dentry *dentry)
 {
 	struct ovl_entry *oe = dentry->d_fsdata;
-	return oe->opaque;
+
+	return OVL_TYPE_OPAQUE(oe->__type);
 }
 
 bool ovl_dentry_is_whiteout(struct dentry *dentry)
@@ -191,7 +192,7 @@ void ovl_dentry_set_opaque(struct dentry *dentry)
 {
 	struct ovl_entry *oe = dentry->d_fsdata;
 
-	oe->opaque = true;
+	oe->__type |= __OVL_PATH_OPAQUE;
 }
 
 bool ovl_redirect_dir(struct super_block *sb)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/6] ovl: check if all layers are on the same fs
  2017-03-29 14:36 [PATCH 0/6] ovl: consistent_fd feature Amir Goldstein
  2017-03-29 14:36 ` [PATCH 1/6] ovl: store path type in dentry Amir Goldstein
  2017-03-29 14:36 ` [PATCH 2/6] ovl: cram opaque boolean into type flags Amir Goldstein
@ 2017-03-29 14:36 ` Amir Goldstein
  2017-03-29 14:36 ` [PATCH 4/6] ovl: check if clone from lower to upper is supported Amir Goldstein
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2017-03-29 14:36 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

Some features can only work when all layers are on the same fs.
Check that during mount time, so features can test it later.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/ovl_entry.h | 3 ++-
 fs/overlayfs/super.c     | 7 +++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 12c4922..0e159c3 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -27,7 +27,8 @@ struct ovl_fs {
 	struct ovl_config config;
 	/* creds of process who forced instantiation of super block */
 	const struct cred *creator_cred;
-	bool tmpfile;
+	bool tmpfile;	/* upper supports O_TMPFILE */
+	bool samefs;	/* all layers on same fs */
 	wait_queue_head_t copyup_wq;
 };
 
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 6d9766f..b8a83e8 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -876,6 +876,8 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	ufs->lower_mnt = kcalloc(numlower, sizeof(struct vfsmount *), GFP_KERNEL);
 	if (ufs->lower_mnt == NULL)
 		goto out_put_workdir;
+
+	ufs->samefs = true;
 	for (i = 0; i < numlower; i++) {
 		struct vfsmount *mnt = clone_private_mount(&stack[i]);
 
@@ -892,6 +894,11 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 
 		ufs->lower_mnt[ufs->numlower] = mnt;
 		ufs->numlower++;
+
+		/* Check if all layers on same sb */
+		if ((ufs->upper_mnt && ufs->upper_mnt->mnt_sb != mnt->mnt_sb) ||
+		    (i > 0 && ufs->lower_mnt[0]->mnt_sb != mnt->mnt_sb))
+			ufs->samefs = false;
 	}
 
 	/* If the upper fs is nonexistent, we mark overlayfs r/o too */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/6] ovl: check if clone from lower to upper is supported
  2017-03-29 14:36 [PATCH 0/6] ovl: consistent_fd feature Amir Goldstein
                   ` (2 preceding siblings ...)
  2017-03-29 14:36 ` [PATCH 3/6] ovl: check if all layers are on the same fs Amir Goldstein
@ 2017-03-29 14:36 ` Amir Goldstein
  2017-03-29 14:36 ` [PATCH 5/6] ovl: copy on read with consistent_fd feature Amir Goldstein
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2017-03-29 14:36 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

Currently copy up will always try to clone and fall back
to copy if clone fails.

Check clone support during mount time and don't try to clone
up if clone is not supported.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/copy_up.c   | 23 ++++++++++++++---------
 fs/overlayfs/ovl_entry.h |  1 +
 fs/overlayfs/super.c     | 11 ++++++++---
 3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 906ea6c..3053e33 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -134,7 +134,8 @@ int ovl_copy_xattr(struct dentry *old, struct dentry *new)
 	return error;
 }
 
-static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
+static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len,
+			    bool cloneup)
 {
 	struct file *old_file;
 	struct file *new_file;
@@ -155,12 +156,15 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
 		goto out_fput;
 	}
 
-	/* Try to use clone_file_range to clone up within the same fs */
-	error = vfs_clone_file_range(old_file, 0, new_file, 0, len);
-	if (!error)
-		goto out;
-	/* Couldn't clone, so now we try to copy the data */
-	error = 0;
+	if (cloneup) {
+		/* Try to use clone_file_range to clone up within the same fs */
+		error = vfs_clone_file_range(old_file, 0, new_file, 0, len);
+		if (!error)
+			goto out;
+
+		/* Couldn't clone, so now we try to copy the data */
+		error = 0;
+	}
 
 	/* FIXME: copy up sparse files efficiently */
 	while (len) {
@@ -251,6 +255,7 @@ static int ovl_copy_up_locked(struct dentry *workdir, struct dentry *upperdir,
 		.rdev = stat->rdev,
 		.link = link
 	};
+	struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
 
 	upper = lookup_one_len(dentry->d_name.name, upperdir,
 			       dentry->d_name.len);
@@ -295,11 +300,11 @@ static int ovl_copy_up_locked(struct dentry *workdir, struct dentry *upperdir,
 		if (tmpfile) {
 			inode_unlock(udir);
 			err = ovl_copy_up_data(lowerpath, &upperpath,
-					       stat->size);
+					       stat->size, ofs->cloneup);
 			inode_lock_nested(udir, I_MUTEX_PARENT);
 		} else {
 			err = ovl_copy_up_data(lowerpath, &upperpath,
-					       stat->size);
+					       stat->size, ofs->cloneup);
 		}
 
 		if (err)
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 0e159c3..fb1210d 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -29,6 +29,7 @@ struct ovl_fs {
 	const struct cred *creator_cred;
 	bool tmpfile;	/* upper supports O_TMPFILE */
 	bool samefs;	/* all layers on same fs */
+	bool cloneup;	/* can clone from lower to upper */
 	wait_queue_head_t copyup_wq;
 };
 
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index b8a83e8..75b93d6 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -865,10 +865,15 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 			/* Check if upper/work fs supports O_TMPFILE */
 			temp = ovl_do_tmpfile(ufs->workdir, S_IFREG | 0);
 			ufs->tmpfile = !IS_ERR(temp);
-			if (ufs->tmpfile)
+			if (ufs->tmpfile) {
+				/* Check if upper/work supports clone */
+				if (temp->d_inode && temp->d_inode->i_fop &&
+				    temp->d_inode->i_fop->clone_file_range)
+					ufs->cloneup = true;
 				dput(temp);
-			else
+			} else {
 				pr_warn("overlayfs: upper fs does not support tmpfile.\n");
+			}
 		}
 	}
 
@@ -898,7 +903,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 		/* Check if all layers on same sb */
 		if ((ufs->upper_mnt && ufs->upper_mnt->mnt_sb != mnt->mnt_sb) ||
 		    (i > 0 && ufs->lower_mnt[0]->mnt_sb != mnt->mnt_sb))
-			ufs->samefs = false;
+			ufs->cloneup = ufs->samefs = false;
 	}
 
 	/* If the upper fs is nonexistent, we mark overlayfs r/o too */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 5/6] ovl: copy on read with consistent_fd feature
  2017-03-29 14:36 [PATCH 0/6] ovl: consistent_fd feature Amir Goldstein
                   ` (3 preceding siblings ...)
  2017-03-29 14:36 ` [PATCH 4/6] ovl: check if clone from lower to upper is supported Amir Goldstein
@ 2017-03-29 14:36 ` Amir Goldstein
  2017-03-30 11:28   ` Amir Goldstein
  2017-03-31 17:58   ` Vivek Goyal
  2017-03-29 14:36 ` [PATCH 6/6] ovl: link upper tempfile on open for write Amir Goldstein
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 25+ messages in thread
From: Amir Goldstein @ 2017-03-29 14:36 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

Overlayfs copies up a file when the file is opened for write.  Writes
to the returned file descriptor are not visible to file descriptors
that were opened for read prior to copy up.

If this config option is enabled then overlay filesystems will default
to copy up a file that is opened for read to avoid this inconsistency.
In this case, it is still possible to turn off consistent fd globally
with the "consistent_fd=off" module option or on a filesystem instance
basis with the "consistent_fd=off" mount option.

The feature will not be turned on for an overlay mount unless all
the layers are on the same underlying filesystem and this filesystem
supports clone.

This introduces a slight change in d_real() semantics. Now d_real()
may return an error with NULL inode argument also in the zero open
flags case. vfs API documentation has been updated.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 Documentation/filesystems/vfs.txt | 13 ++++++-------
 fs/overlayfs/Kconfig              | 18 ++++++++++++++++++
 fs/overlayfs/inode.c              | 27 ++++++++++++++++-----------
 fs/overlayfs/overlayfs.h          |  4 +++-
 fs/overlayfs/ovl_entry.h          |  2 ++
 fs/overlayfs/super.c              | 37 +++++++++++++++++++++++++++++++++----
 fs/overlayfs/util.c               |  7 +++++++
 7 files changed, 85 insertions(+), 23 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 5692117..0dd317b 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -1088,22 +1088,21 @@ struct dentry_operations {
 	dentry being transited from.
 
   d_real: overlay/union type filesystems implement this method to return one of
-	the underlying dentries hidden by the overlay.  It is used in three
+	the underlying dentries hidden by the overlay.  It is used in two
 	different modes:
 
 	Called from open it may need to copy-up the file depending on the
-	supplied open flags.  This mode is selected with a non-zero flags
-	argument.  In this mode the d_real method can return an error.
+	supplied open flags.  It returns the upper real dentry if file was
+	copied up or the topmost real underlying dentry otherwise.
+	This mode is selected with a NULL inode argument.
+	In this mode the d_real method can return an error.
 
 	Called from file_dentry() it returns the real dentry matching the inode
 	argument.  The real dentry may be from a lower layer already copied up,
 	but still referenced from the file.  This mode is selected with a
 	non-NULL inode argument.  This will always succeed.
 
-	With NULL inode and zero flags the topmost real underlying dentry is
-	returned.  This will always succeed.
-
-	This method is never called with both non-NULL inode and non-zero flags.
+	This method is never called with non-NULL inode and non-zero open flags.
 
 Each dentry has a pointer to its parent dentry, as well as a hash list
 of child dentries. Child dentries are basically like files in a
diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
index 0daac51..7425862 100644
--- a/fs/overlayfs/Kconfig
+++ b/fs/overlayfs/Kconfig
@@ -22,3 +22,21 @@ config OVERLAY_FS_REDIRECT_DIR
 	  Note, that redirects are not backward compatible.  That is, mounting
 	  an overlay which has redirects on a kernel that doesn't support this
 	  feature will have unexpected results.
+
+config OVERLAY_FS_CONSISTENT_FD
+	bool "Overlayfs: turn on consistent fd feature by default"
+	depends on OVERLAY_FS
+	help
+	  Overlayfs copies up a file when the file is opened for write.  Writes
+          to the returned file descriptor are not visible to file descriptors
+          that were opened for read prior to copy up.
+
+          If this config option is enabled then overlay filesystems will default
+          to copy up a file that is opened for read to avoid this inconsistency.
+          In this case, it is still possible to turn off consistent fd globally
+          with the "consistent_fd=off" module option or on a filesystem instance
+          basis with the "consistent_fd=off" mount option.
+
+          The feature will not be turned on for an overlay mount unless all
+          the layers are on the same underlying filesystem and this filesystem
+          supports clone.
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 17b8418..f2f55e1 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -231,7 +231,7 @@ struct posix_acl *ovl_get_acl(struct inode *inode, int type)
 }
 
 static bool ovl_open_need_copy_up(int flags, enum ovl_path_type type,
-				  struct dentry *realdentry)
+				  struct dentry *realdentry, bool rocopyup)
 {
 	if (OVL_TYPE_UPPER(type))
 		return false;
@@ -239,25 +239,30 @@ static bool ovl_open_need_copy_up(int flags, enum ovl_path_type type,
 	if (special_file(realdentry->d_inode->i_mode))
 		return false;
 
+	/* Copy up on open for read for consistent fd */
+	if (rocopyup)
+		return true;
+
 	if (!(OPEN_FMODE(flags) & FMODE_WRITE) && !(flags & O_TRUNC))
 		return false;
 
 	return true;
 }
 
-int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags)
+int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags,
+			   bool rocopyup)
 {
 	int err = 0;
 	struct path realpath;
-	enum ovl_path_type type;
-
-	type = ovl_path_real(dentry, &realpath);
-	if (ovl_open_need_copy_up(file_flags, type, realpath.dentry)) {
-		err = ovl_want_write(dentry);
-		if (!err) {
-			err = ovl_copy_up_flags(dentry, file_flags);
-			ovl_drop_write(dentry);
-		}
+	enum ovl_path_type type = ovl_path_real(dentry, &realpath);
+
+	if (!ovl_open_need_copy_up(file_flags, type, realpath.dentry, rocopyup))
+		return 0;
+
+	err = ovl_want_write(dentry);
+	if (!err) {
+		err = ovl_copy_up_flags(dentry, file_flags);
+		ovl_drop_write(dentry);
 	}
 
 	return err;
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 079051e..d13ad5f 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -169,6 +169,7 @@ void ovl_set_dir_cache(struct dentry *dentry, struct ovl_dir_cache *cache);
 bool ovl_dentry_is_opaque(struct dentry *dentry);
 bool ovl_dentry_is_whiteout(struct dentry *dentry);
 void ovl_dentry_set_opaque(struct dentry *dentry);
+bool ovl_consistent_fd(struct super_block *sb);
 bool ovl_redirect_dir(struct super_block *sb);
 void ovl_clear_redirect_dir(struct super_block *sb);
 const char *ovl_dentry_get_redirect(struct dentry *dentry);
@@ -207,7 +208,8 @@ int ovl_xattr_get(struct dentry *dentry, const char *name,
 		  void *value, size_t size);
 ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size);
 struct posix_acl *ovl_get_acl(struct inode *inode, int type);
-int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags);
+int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags,
+			   bool rocopyup);
 int ovl_update_time(struct inode *inode, struct timespec *ts, int flags);
 bool ovl_is_private_xattr(const char *name);
 
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index fb1210d..c11a72d 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -14,6 +14,7 @@ struct ovl_config {
 	char *workdir;
 	bool default_permissions;
 	bool redirect_dir;
+	bool consistent_fd;
 };
 
 /* private information held for overlayfs's superblock */
@@ -30,6 +31,7 @@ struct ovl_fs {
 	bool tmpfile;	/* upper supports O_TMPFILE */
 	bool samefs;	/* all layers on same fs */
 	bool cloneup;	/* can clone from lower to upper */
+	bool rocopyup;	/* copy up on open for read */
 	wait_queue_head_t copyup_wq;
 };
 
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 75b93d6..e5fd53a 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -34,6 +34,11 @@ module_param_named(redirect_dir, ovl_redirect_dir_def, bool, 0644);
 MODULE_PARM_DESC(ovl_redirect_dir_def,
 		 "Default to on or off for the redirect_dir feature");
 
+static bool ovl_consistent_fd_def = IS_ENABLED(CONFIG_OVERLAY_FS_CONSISTENT_FD);
+module_param_named(consistent_fd, ovl_consistent_fd_def, bool, 0644);
+MODULE_PARM_DESC(ovl_consistent_fd_def,
+		 "Default to on or off for the consistent_fd feature");
+
 static void ovl_dentry_release(struct dentry *dentry)
 {
 	struct ovl_entry *oe = dentry->d_fsdata;
@@ -54,6 +59,10 @@ static struct dentry *ovl_d_real(struct dentry *dentry,
 				 unsigned int open_flags)
 {
 	struct dentry *real;
+	bool rocopyup = !inode && ovl_consistent_fd(dentry->d_sb);
+
+	if (WARN_ON(open_flags && inode))
+		return dentry;
 
 	if (!d_is_reg(dentry)) {
 		if (!inode || inode == d_inode(dentry))
@@ -64,8 +73,8 @@ static struct dentry *ovl_d_real(struct dentry *dentry,
 	if (d_is_negative(dentry))
 		return dentry;
 
-	if (open_flags) {
-		int err = ovl_open_maybe_copy_up(dentry, open_flags);
+	if (open_flags || rocopyup) {
+		int err = ovl_open_maybe_copy_up(dentry, open_flags, rocopyup);
 
 		if (err)
 			return ERR_PTR(err);
@@ -227,6 +236,10 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
 	if (ufs->config.redirect_dir != ovl_redirect_dir_def)
 		seq_printf(m, ",redirect_dir=%s",
 			   ufs->config.redirect_dir ? "on" : "off");
+	if (!(sb->s_flags & MS_RDONLY) &&
+	    ufs->config.consistent_fd != ovl_consistent_fd_def)
+		seq_printf(m, ",consistent_fd=%s",
+			   ufs->config.consistent_fd ? "on" : "off");
 	return 0;
 }
 
@@ -256,6 +269,8 @@ enum {
 	OPT_DEFAULT_PERMISSIONS,
 	OPT_REDIRECT_DIR_ON,
 	OPT_REDIRECT_DIR_OFF,
+	OPT_CONSISTENT_FD_ON,
+	OPT_CONSISTENT_FD_OFF,
 	OPT_ERR,
 };
 
@@ -266,6 +281,8 @@ static const match_table_t ovl_tokens = {
 	{OPT_DEFAULT_PERMISSIONS,	"default_permissions"},
 	{OPT_REDIRECT_DIR_ON,		"redirect_dir=on"},
 	{OPT_REDIRECT_DIR_OFF,		"redirect_dir=off"},
+	{OPT_CONSISTENT_FD_ON,		"consistent_fd=on"},
+	{OPT_CONSISTENT_FD_OFF,		"consistent_fd=off"},
 	{OPT_ERR,			NULL}
 };
 
@@ -338,6 +355,14 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
 			config->redirect_dir = false;
 			break;
 
+		case OPT_CONSISTENT_FD_ON:
+			config->consistent_fd = true;
+			break;
+
+		case OPT_CONSISTENT_FD_OFF:
+			config->consistent_fd = false;
+			break;
+
 		default:
 			pr_err("overlayfs: unrecognized mount option \"%s\" or missing value\n", p);
 			return -EINVAL;
@@ -732,6 +757,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 
 	init_waitqueue_head(&ufs->copyup_wq);
 	ufs->config.redirect_dir = ovl_redirect_dir_def;
+	ufs->config.consistent_fd = ovl_consistent_fd_def;
 	err = ovl_parse_opt((char *) data, &ufs->config);
 	if (err)
 		goto out_free_config;
@@ -862,11 +888,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 			if (!err)
 				pr_warn("overlayfs: upper fs needs to support d_type.\n");
 
-			/* Check if upper/work fs supports O_TMPFILE */
+			/* Check if upper fs supports O_TMPFILE and clone */
 			temp = ovl_do_tmpfile(ufs->workdir, S_IFREG | 0);
 			ufs->tmpfile = !IS_ERR(temp);
 			if (ufs->tmpfile) {
-				/* Check if upper/work supports clone */
 				if (temp->d_inode && temp->d_inode->i_fop &&
 				    temp->d_inode->i_fop->clone_file_range)
 					ufs->cloneup = true;
@@ -910,6 +935,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	if (!ufs->upper_mnt)
 		sb->s_flags |= MS_RDONLY;
 
+	/* Copy on read for consistent fd depends on clone support */
+	if (!ufs->cloneup)
+		ufs->config.consistent_fd = false;
+
 	if (remote)
 		sb->s_d_op = &ovl_reval_dentry_operations;
 	else
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 159e851..be0a993 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -195,6 +195,13 @@ void ovl_dentry_set_opaque(struct dentry *dentry)
 	oe->__type |= __OVL_PATH_OPAQUE;
 }
 
+bool ovl_consistent_fd(struct super_block *sb)
+{
+	struct ovl_fs *ofs = sb->s_fs_info;
+
+	return ofs->config.consistent_fd;
+}
+
 bool ovl_redirect_dir(struct super_block *sb)
 {
 	struct ovl_fs *ofs = sb->s_fs_info;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 6/6] ovl: link upper tempfile on open for write
  2017-03-29 14:36 [PATCH 0/6] ovl: consistent_fd feature Amir Goldstein
                   ` (4 preceding siblings ...)
  2017-03-29 14:36 ` [PATCH 5/6] ovl: copy on read with consistent_fd feature Amir Goldstein
@ 2017-03-29 14:36 ` Amir Goldstein
  2017-03-30 11:26 ` [PATCH 7/7] ovl: prevent copy on read if no upper/work dir Amir Goldstein
  2017-03-30 11:34 ` [PATCH 0/6] ovl: consistent_fd feature Amir Goldstein
  7 siblings, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2017-03-29 14:36 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

With consistent_fd feature enabled, after copy up on open for read,
upperdentry remains an unlinked tmpfile.
On first open for write, temp upperdentry is linked to upper dir.

This keeps the file system inode allocated for as long as there are open
file descriptors and until overlay inode is evicted from cache.
After force reboot, file system will cleanup all temporary allocated
inodes (orphans).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/copy_up.c   | 74 ++++++++++++++++++++++++++++++++++++++++++------
 fs/overlayfs/inode.c     | 26 ++++++++++++-----
 fs/overlayfs/overlayfs.h |  5 +++-
 fs/overlayfs/ovl_entry.h | 10 ++++++-
 fs/overlayfs/super.c     |  7 +++++
 fs/overlayfs/util.c      | 27 ++++++++++++++++--
 6 files changed, 129 insertions(+), 20 deletions(-)

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 3053e33..09ed60f 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -321,14 +321,14 @@ static int ovl_copy_up_locked(struct dentry *workdir, struct dentry *upperdir,
 	if (err)
 		goto out_cleanup;
 
-	if (tmpfile)
-		err = ovl_do_link(temp, udir, upper, true);
-	else
+	if (!tmpfile)
 		err = ovl_do_rename(wdir, temp, udir, upper, 0);
+	else if (stat->nlink)
+		err = ovl_do_link(temp, udir, upper, true);
 	if (err)
 		goto out_cleanup;
 
-	newdentry = dget(tmpfile ? upper : temp);
+	newdentry = dget((tmpfile && stat->nlink) ? upper : temp);
 	ovl_dentry_update(dentry, newdentry);
 	ovl_inode_update(d_inode(dentry), d_inode(newdentry));
 
@@ -348,6 +348,43 @@ static int ovl_copy_up_locked(struct dentry *workdir, struct dentry *upperdir,
 }
 
 /*
+ * Link a temp upperdentry to upper dir
+ *
+ * After copy up on open for read, upperdentry remains unlinked.
+ * On first copy up on open for write upperdentry should be linked.
+ */
+static int ovl_link_ro_upper(struct dentry *upperdir, struct dentry *dentry,
+			     struct dentry *temp, struct kstat *stat,
+			     struct kstat *pstat)
+{
+	struct dentry *upper = NULL;
+	int err;
+
+	upper = lookup_one_len(dentry->d_name.name, upperdir,
+			       dentry->d_name.len);
+	err = PTR_ERR(upper);
+	if (IS_ERR(upper))
+		goto out;
+
+	err = ovl_do_link(temp, upperdir->d_inode, upper, true);
+	if (err)
+		goto out_dput;
+
+	/* inode is already hashed. only need to update upperdentry */
+	ovl_dentry_update(dentry, upper);
+
+	/* Restore timestamps on parent (best effort) */
+	ovl_set_timestamps(upperdir, pstat);
+
+	return err;
+
+out_dput:
+	dput(upper);
+out:
+	return err;
+}
+
+/*
  * Copy up a single dentry
  *
  * All renames start with copy up of source if necessary.  The actual
@@ -366,6 +403,7 @@ static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
 	struct path parentpath;
 	struct dentry *lowerdentry = lowerpath->dentry;
 	struct dentry *upperdir;
+	struct dentry *temp = NULL;
 	const char *link = NULL;
 	struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
 
@@ -388,6 +426,19 @@ static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
 			return PTR_ERR(link);
 	}
 
+	/* Maybe link a ro temp upper dentry on open for write? */
+	if (S_ISREG(stat->mode))
+		temp = ovl_dentry_ro_upper(dentry);
+	if (temp) {
+		inode_lock_nested(upperdir->d_inode, I_MUTEX_PARENT);
+		/* Make sure we did not race with another temp link */
+		if (likely(!ovl_dentry_upper(dentry)))
+			err = ovl_link_ro_upper(upperdir, dentry, temp, stat,
+						&pstat);
+		inode_unlock(upperdir->d_inode);
+		goto out_done;
+	}
+
 	/* Should we copyup with O_TMPFILE or with workdir? */
 	if (S_ISREG(stat->mode) && ofs->tmpfile) {
 		err = ovl_copy_up_start(dentry);
@@ -428,7 +479,7 @@ static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
 	return err;
 }
 
-int ovl_copy_up_flags(struct dentry *dentry, int flags)
+int ovl_copy_up_flags(struct dentry *dentry, int flags, bool rocopyup)
 {
 	int err = 0;
 	const struct cred *old_cred = ovl_override_creds(dentry->d_sb);
@@ -440,7 +491,8 @@ int ovl_copy_up_flags(struct dentry *dentry, int flags)
 		struct kstat stat;
 		enum ovl_path_type type = ovl_path_type(dentry);
 
-		if (OVL_TYPE_UPPER(type))
+		if (OVL_TYPE_UPPER(type) ||
+		    (rocopyup && OVL_TYPE_RO_UPPER(type)))
 			break;
 
 		next = dget(dentry);
@@ -459,9 +511,15 @@ int ovl_copy_up_flags(struct dentry *dentry, int flags)
 		ovl_path_lower(next, &lowerpath);
 		err = vfs_getattr(&lowerpath, &stat,
 				  STATX_BASIC_STATS, AT_STATX_SYNC_AS_STAT);
-		/* maybe truncate regular file. this has no effect on dirs */
+		/*
+		 * maybe truncate regular file and maybe copy up as unlinked
+		 * tempfile for readonly open. this has no effect on dirs.
+		 */
+		WARN_ON(stat.nlink == 0);
 		if (flags & O_TRUNC)
 			stat.size = 0;
+		else if (rocopyup)
+			stat.nlink = 0;
 		if (!err)
 			err = ovl_copy_up_one(parent, next, &lowerpath, &stat);
 
@@ -475,5 +533,5 @@ int ovl_copy_up_flags(struct dentry *dentry, int flags)
 
 int ovl_copy_up(struct dentry *dentry)
 {
-	return ovl_copy_up_flags(dentry, 0);
+	return ovl_copy_up_flags(dentry, 0, false);
 }
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index f2f55e1..2fb90be 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -230,8 +230,15 @@ struct posix_acl *ovl_get_acl(struct inode *inode, int type)
 	return acl;
 }
 
+/*
+ * Depending on open flags and overlay dentry type, determines if file needs
+ * to be copied up on open.  If *rocopyup is true, then files needs to be
+ * copied up to unlinked tmpfiles on open for read.  If this file has already
+ * been copied up to unlinked tmpfile or if this is an open for write, then
+ * *rocopyup will be set to false.
+ */
 static bool ovl_open_need_copy_up(int flags, enum ovl_path_type type,
-				  struct dentry *realdentry, bool rocopyup)
+				  struct dentry *realdentry, bool *rocopyup)
 {
 	if (OVL_TYPE_UPPER(type))
 		return false;
@@ -239,13 +246,17 @@ static bool ovl_open_need_copy_up(int flags, enum ovl_path_type type,
 	if (special_file(realdentry->d_inode->i_mode))
 		return false;
 
-	/* Copy up on open for read for consistent fd */
-	if (rocopyup)
-		return true;
+	/* Need copy up to unlinked tmpfile on open for read? */
+	if (*rocopyup &&
+	    (!S_ISREG(realdentry->d_inode->i_mode) ||
+	     OVL_TYPE_RO_UPPER(type)))
+		*rocopyup = false;
 
 	if (!(OPEN_FMODE(flags) & FMODE_WRITE) && !(flags & O_TRUNC))
-		return false;
+		return *rocopyup;
 
+	/* Open for write - need properly linked copy up */
+	*rocopyup = false;
 	return true;
 }
 
@@ -256,12 +267,13 @@ int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags,
 	struct path realpath;
 	enum ovl_path_type type = ovl_path_real(dentry, &realpath);
 
-	if (!ovl_open_need_copy_up(file_flags, type, realpath.dentry, rocopyup))
+	if (!ovl_open_need_copy_up(file_flags, type, realpath.dentry,
+				   &rocopyup))
 		return 0;
 
 	err = ovl_want_write(dentry);
 	if (!err) {
-		err = ovl_copy_up_flags(dentry, file_flags);
+		err = ovl_copy_up_flags(dentry, file_flags, rocopyup);
 		ovl_drop_write(dentry);
 	}
 
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index d13ad5f..cc76197 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -13,11 +13,13 @@ enum ovl_path_type {
 	__OVL_PATH_UPPER	= (1 << 0),
 	__OVL_PATH_MERGE	= (1 << 1),
 	__OVL_PATH_OPAQUE	= (1 << 2),
+	__OVL_PATH_RO_UPPER	= (1 << 3),
 };
 
 #define OVL_TYPE_UPPER(type)	((type) & __OVL_PATH_UPPER)
 #define OVL_TYPE_MERGE(type)	((type) & __OVL_PATH_MERGE)
 #define OVL_TYPE_OPAQUE(type)	((type) & __OVL_PATH_OPAQUE)
+#define OVL_TYPE_RO_UPPER(type)	((type) & __OVL_PATH_RO_UPPER)
 
 #define OVL_XATTR_PREFIX XATTR_TRUSTED_PREFIX "overlay."
 #define OVL_XATTR_OPAQUE OVL_XATTR_PREFIX "opaque"
@@ -162,6 +164,7 @@ void ovl_path_upper(struct dentry *dentry, struct path *path);
 void ovl_path_lower(struct dentry *dentry, struct path *path);
 enum ovl_path_type ovl_path_real(struct dentry *dentry, struct path *path);
 struct dentry *ovl_dentry_upper(struct dentry *dentry);
+struct dentry *ovl_dentry_ro_upper(struct dentry *dentry);
 struct dentry *ovl_dentry_lower(struct dentry *dentry);
 struct dentry *ovl_dentry_real(struct dentry *dentry);
 struct ovl_dir_cache *ovl_dir_cache(struct dentry *dentry);
@@ -240,6 +243,6 @@ void ovl_cleanup(struct inode *dir, struct dentry *dentry);
 
 /* copy_up.c */
 int ovl_copy_up(struct dentry *dentry);
-int ovl_copy_up_flags(struct dentry *dentry, int flags);
+int ovl_copy_up_flags(struct dentry *dentry, int flags, bool rocopyup);
 int ovl_copy_xattr(struct dentry *old, struct dentry *new);
 int ovl_set_attr(struct dentry *upper, struct kstat *stat);
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index c11a72d..87ade0b 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -40,7 +40,10 @@ enum ovl_path_type;
 /* private information held for every overlayfs dentry */
 struct ovl_entry {
 	struct dentry *__upperdentry;
-	struct ovl_dir_cache *cache;
+	union {
+		struct dentry *__roupperdentry; /* regular file */
+		struct ovl_dir_cache *cache; /* directory */
+	};
 	union {
 		struct {
 			u64 version;
@@ -60,3 +63,8 @@ static inline struct dentry *ovl_upperdentry_dereference(struct ovl_entry *oe)
 {
 	return lockless_dereference(oe->__upperdentry);
 }
+
+static inline struct dentry *ovl_roupperdentry_dereference(struct ovl_entry *oe)
+{
+	return lockless_dereference(oe->__roupperdentry);
+}
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index e5fd53a..ef159f8 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -47,6 +47,7 @@ static void ovl_dentry_release(struct dentry *dentry)
 		unsigned int i;
 
 		dput(oe->__upperdentry);
+		dput(oe->__roupperdentry);
 		kfree(oe->redirect);
 		for (i = 0; i < oe->numlower; i++)
 			dput(oe->lowerstack[i].dentry);
@@ -84,6 +85,12 @@ static struct dentry *ovl_d_real(struct dentry *dentry,
 	if (real && (!inode || inode == d_inode(real)))
 		return real;
 
+	if (rocopyup) {
+		real = ovl_dentry_ro_upper(dentry);
+		if (real)
+			return real;
+	}
+
 	real = ovl_dentry_lower(dentry);
 	if (!real)
 		goto bug;
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index be0a993..00cd2d6 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -90,12 +90,14 @@ enum ovl_path_type ovl_update_type(struct dentry *dentry)
 		type |= __OVL_PATH_UPPER;
 		if (oe->numlower)
 			type |= __OVL_PATH_MERGE;
+	} else if (oe->__roupperdentry) {
+		type |= __OVL_PATH_RO_UPPER;
 	} else {
 		if (oe->numlower > 1)
 			type |= __OVL_PATH_MERGE;
 	}
 	/*
-	 * Make sure type is consistent with __upperdentry before making it
+	 * Make sure type is consistent with __[ro]upperdentry before making it
 	 * visible to ovl_path_type().
 	 */
 	smp_wmb();
@@ -138,6 +140,13 @@ struct dentry *ovl_dentry_upper(struct dentry *dentry)
 	return ovl_upperdentry_dereference(oe);
 }
 
+struct dentry *ovl_dentry_ro_upper(struct dentry *dentry)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+
+	return ovl_roupperdentry_dereference(oe);
+}
+
 static struct dentry *__ovl_dentry_lower(struct ovl_entry *oe)
 {
 	return oe->numlower ? oe->lowerstack[0].dentry : NULL;
@@ -231,18 +240,30 @@ void ovl_dentry_set_redirect(struct dentry *dentry, const char *redirect)
 	oe->redirect = redirect;
 }
 
+/*
+ * May be called up to twice in the lifetime of an overlay dentry -
+ * the first time when updating a ro upper dentry with a tempfile (nlink == 0)
+ * and the second time when updating a linked upper dentry (nlink > 0).
+ * Linked upper must have the same inode as the temp ro upper.
+ */
 void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry)
 {
 	struct ovl_entry *oe = dentry->d_fsdata;
+	struct inode *inode = upperdentry->d_inode;
 
 	WARN_ON(!inode_is_locked(upperdentry->d_parent->d_inode));
 	WARN_ON(oe->__upperdentry);
+	if (WARN_ON(!inode))
+		return;
 	/*
 	 * Make sure upperdentry is consistent before making it visible to
-	 * ovl_upperdentry_dereference().
+	 * ovl_[ro]upperdentry_dereference()
 	 */
 	smp_wmb();
-	oe->__upperdentry = upperdentry;
+	if (inode->i_nlink)
+		oe->__upperdentry = upperdentry;
+	else
+		oe->__roupperdentry = upperdentry;
 	ovl_update_type(dentry);
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 7/7] ovl: prevent copy on read if no upper/work dir
  2017-03-29 14:36 [PATCH 0/6] ovl: consistent_fd feature Amir Goldstein
                   ` (5 preceding siblings ...)
  2017-03-29 14:36 ` [PATCH 6/6] ovl: link upper tempfile on open for write Amir Goldstein
@ 2017-03-30 11:26 ` Amir Goldstein
  2017-03-30 11:34 ` [PATCH 0/6] ovl: consistent_fd feature Amir Goldstein
  7 siblings, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2017-03-30 11:26 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

On ro mount there are only ro fd's, so all fd's of the same file
are always consistent on a ro overlay mount.  But, if overlay can
be later remounted rw, we need to copy on read anyway, so that
ro fd that was opened during ro mount will be consistent with
rw fd that is opened after remount rw.

Therefore, we need to disable consistent_fd feature only in case
there are no upper/work dirs, so overlay can never be mounted rw.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Miklos,

Patch 5 should have prevented copy on read for ro mount,
but forgot to do so.

This extra patch relaxes the ro mount constrain and also
enforces the stronger contrain of no copy up if no upper/work.

If you're ok with the patch series, I can either fold this
patch into patch 5 or fix patch 5 and keep this one separate.

Amir.

 fs/overlayfs/super.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index ef159f8..688c0b0 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -222,6 +222,12 @@ static int ovl_statfs(struct dentry *dentry, struct kstatfs *buf)
 	return err;
 }
 
+/* Will this overlay be forced to mount/remount ro? */
+static bool ovl_force_readonly(struct ovl_fs *ufs)
+{
+	return (!ufs->upper_mnt || !ufs->workdir);
+}
+
 /**
  * ovl_show_options
  *
@@ -243,7 +249,7 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
 	if (ufs->config.redirect_dir != ovl_redirect_dir_def)
 		seq_printf(m, ",redirect_dir=%s",
 			   ufs->config.redirect_dir ? "on" : "off");
-	if (!(sb->s_flags & MS_RDONLY) &&
+	if (!ovl_force_readonly(ufs) &&
 	    ufs->config.consistent_fd != ovl_consistent_fd_def)
 		seq_printf(m, ",consistent_fd=%s",
 			   ufs->config.consistent_fd ? "on" : "off");
@@ -254,7 +260,7 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
 {
 	struct ovl_fs *ufs = sb->s_fs_info;
 
-	if (!(*flags & MS_RDONLY) && (!ufs->upper_mnt || !ufs->workdir))
+	if (!(*flags & MS_RDONLY) && ovl_force_readonly(ufs))
 		return -EROFS;
 
 	return 0;
@@ -942,8 +948,14 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	if (!ufs->upper_mnt)
 		sb->s_flags |= MS_RDONLY;
 
-	/* Copy on read for consistent fd depends on clone support */
-	if (!ufs->cloneup)
+	/*
+	 * Copy on read for consistent fd depends on clone support.
+	 * On ro mount fd is always consistent, but if overlay can be
+	 * later remounted rw, we need to copy on read anyway, so that
+	 * ro fd that was opened during ro mount will be consistent with
+	 * rw fd that is opened after remount rw.
+	 */
+	if (!ufs->cloneup || ovl_force_readonly(ufs))
 		ufs->config.consistent_fd = false;
 
 	if (remote)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/6] ovl: copy on read with consistent_fd feature
  2017-03-29 14:36 ` [PATCH 5/6] ovl: copy on read with consistent_fd feature Amir Goldstein
@ 2017-03-30 11:28   ` Amir Goldstein
  2017-03-31 17:58   ` Vivek Goyal
  1 sibling, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2017-03-30 11:28 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

On Wed, Mar 29, 2017 at 5:36 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> Overlayfs copies up a file when the file is opened for write.  Writes
> to the returned file descriptor are not visible to file descriptors
> that were opened for read prior to copy up.
>
> If this config option is enabled then overlay filesystems will default
> to copy up a file that is opened for read to avoid this inconsistency.
> In this case, it is still possible to turn off consistent fd globally
> with the "consistent_fd=off" module option or on a filesystem instance
> basis with the "consistent_fd=off" mount option.
>
> The feature will not be turned on for an overlay mount unless all
> the layers are on the same underlying filesystem and this filesystem
> supports clone.
>
> This introduces a slight change in d_real() semantics. Now d_real()
> may return an error with NULL inode argument also in the zero open
> flags case. vfs API documentation has been updated.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  Documentation/filesystems/vfs.txt | 13 ++++++-------
>  fs/overlayfs/Kconfig              | 18 ++++++++++++++++++
>  fs/overlayfs/inode.c              | 27 ++++++++++++++++-----------
>  fs/overlayfs/overlayfs.h          |  4 +++-
>  fs/overlayfs/ovl_entry.h          |  2 ++
>  fs/overlayfs/super.c              | 37 +++++++++++++++++++++++++++++++++----
>  fs/overlayfs/util.c               |  7 +++++++
>  7 files changed, 85 insertions(+), 23 deletions(-)
>
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index 5692117..0dd317b 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -1088,22 +1088,21 @@ struct dentry_operations {
>         dentry being transited from.
>
>    d_real: overlay/union type filesystems implement this method to return one of
> -       the underlying dentries hidden by the overlay.  It is used in three
> +       the underlying dentries hidden by the overlay.  It is used in two
>         different modes:
>
>         Called from open it may need to copy-up the file depending on the
> -       supplied open flags.  This mode is selected with a non-zero flags
> -       argument.  In this mode the d_real method can return an error.
> +       supplied open flags.  It returns the upper real dentry if file was
> +       copied up or the topmost real underlying dentry otherwise.
> +       This mode is selected with a NULL inode argument.
> +       In this mode the d_real method can return an error.
>
>         Called from file_dentry() it returns the real dentry matching the inode
>         argument.  The real dentry may be from a lower layer already copied up,
>         but still referenced from the file.  This mode is selected with a
>         non-NULL inode argument.  This will always succeed.
>
> -       With NULL inode and zero flags the topmost real underlying dentry is
> -       returned.  This will always succeed.
> -
> -       This method is never called with both non-NULL inode and non-zero flags.
> +       This method is never called with non-NULL inode and non-zero open flags.
>
>  Each dentry has a pointer to its parent dentry, as well as a hash list
>  of child dentries. Child dentries are basically like files in a
> diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
> index 0daac51..7425862 100644
> --- a/fs/overlayfs/Kconfig
> +++ b/fs/overlayfs/Kconfig
> @@ -22,3 +22,21 @@ config OVERLAY_FS_REDIRECT_DIR
>           Note, that redirects are not backward compatible.  That is, mounting
>           an overlay which has redirects on a kernel that doesn't support this
>           feature will have unexpected results.
> +
> +config OVERLAY_FS_CONSISTENT_FD
> +       bool "Overlayfs: turn on consistent fd feature by default"
> +       depends on OVERLAY_FS
> +       help
> +         Overlayfs copies up a file when the file is opened for write.  Writes
> +          to the returned file descriptor are not visible to file descriptors
> +          that were opened for read prior to copy up.
> +
> +          If this config option is enabled then overlay filesystems will default
> +          to copy up a file that is opened for read to avoid this inconsistency.
> +          In this case, it is still possible to turn off consistent fd globally
> +          with the "consistent_fd=off" module option or on a filesystem instance
> +          basis with the "consistent_fd=off" mount option.
> +
> +          The feature will not be turned on for an overlay mount unless all
> +          the layers are on the same underlying filesystem and this filesystem
> +          supports clone.
> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> index 17b8418..f2f55e1 100644
> --- a/fs/overlayfs/inode.c
> +++ b/fs/overlayfs/inode.c
> @@ -231,7 +231,7 @@ struct posix_acl *ovl_get_acl(struct inode *inode, int type)
>  }
>
>  static bool ovl_open_need_copy_up(int flags, enum ovl_path_type type,
> -                                 struct dentry *realdentry)
> +                                 struct dentry *realdentry, bool rocopyup)
>  {
>         if (OVL_TYPE_UPPER(type))
>                 return false;
> @@ -239,25 +239,30 @@ static bool ovl_open_need_copy_up(int flags, enum ovl_path_type type,
>         if (special_file(realdentry->d_inode->i_mode))
>                 return false;
>
> +       /* Copy up on open for read for consistent fd */
> +       if (rocopyup)
> +               return true;
> +
>         if (!(OPEN_FMODE(flags) & FMODE_WRITE) && !(flags & O_TRUNC))
>                 return false;
>
>         return true;
>  }
>
> -int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags)
> +int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags,
> +                          bool rocopyup)
>  {
>         int err = 0;
>         struct path realpath;
> -       enum ovl_path_type type;
> -
> -       type = ovl_path_real(dentry, &realpath);
> -       if (ovl_open_need_copy_up(file_flags, type, realpath.dentry)) {
> -               err = ovl_want_write(dentry);
> -               if (!err) {
> -                       err = ovl_copy_up_flags(dentry, file_flags);
> -                       ovl_drop_write(dentry);
> -               }
> +       enum ovl_path_type type = ovl_path_real(dentry, &realpath);
> +
> +       if (!ovl_open_need_copy_up(file_flags, type, realpath.dentry, rocopyup))
> +               return 0;
> +
> +       err = ovl_want_write(dentry);
> +       if (!err) {
> +               err = ovl_copy_up_flags(dentry, file_flags);
> +               ovl_drop_write(dentry);
>         }
>
>         return err;
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index 079051e..d13ad5f 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -169,6 +169,7 @@ void ovl_set_dir_cache(struct dentry *dentry, struct ovl_dir_cache *cache);
>  bool ovl_dentry_is_opaque(struct dentry *dentry);
>  bool ovl_dentry_is_whiteout(struct dentry *dentry);
>  void ovl_dentry_set_opaque(struct dentry *dentry);
> +bool ovl_consistent_fd(struct super_block *sb);
>  bool ovl_redirect_dir(struct super_block *sb);
>  void ovl_clear_redirect_dir(struct super_block *sb);
>  const char *ovl_dentry_get_redirect(struct dentry *dentry);
> @@ -207,7 +208,8 @@ int ovl_xattr_get(struct dentry *dentry, const char *name,
>                   void *value, size_t size);
>  ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size);
>  struct posix_acl *ovl_get_acl(struct inode *inode, int type);
> -int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags);
> +int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags,
> +                          bool rocopyup);
>  int ovl_update_time(struct inode *inode, struct timespec *ts, int flags);
>  bool ovl_is_private_xattr(const char *name);
>
> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> index fb1210d..c11a72d 100644
> --- a/fs/overlayfs/ovl_entry.h
> +++ b/fs/overlayfs/ovl_entry.h
> @@ -14,6 +14,7 @@ struct ovl_config {
>         char *workdir;
>         bool default_permissions;
>         bool redirect_dir;
> +       bool consistent_fd;
>  };
>
>  /* private information held for overlayfs's superblock */
> @@ -30,6 +31,7 @@ struct ovl_fs {
>         bool tmpfile;   /* upper supports O_TMPFILE */
>         bool samefs;    /* all layers on same fs */
>         bool cloneup;   /* can clone from lower to upper */
> +       bool rocopyup;  /* copy up on open for read */
>         wait_queue_head_t copyup_wq;
>  };
>
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 75b93d6..e5fd53a 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -34,6 +34,11 @@ module_param_named(redirect_dir, ovl_redirect_dir_def, bool, 0644);
>  MODULE_PARM_DESC(ovl_redirect_dir_def,
>                  "Default to on or off for the redirect_dir feature");
>
> +static bool ovl_consistent_fd_def = IS_ENABLED(CONFIG_OVERLAY_FS_CONSISTENT_FD);
> +module_param_named(consistent_fd, ovl_consistent_fd_def, bool, 0644);
> +MODULE_PARM_DESC(ovl_consistent_fd_def,
> +                "Default to on or off for the consistent_fd feature");
> +
>  static void ovl_dentry_release(struct dentry *dentry)
>  {
>         struct ovl_entry *oe = dentry->d_fsdata;
> @@ -54,6 +59,10 @@ static struct dentry *ovl_d_real(struct dentry *dentry,
>                                  unsigned int open_flags)
>  {
>         struct dentry *real;
> +       bool rocopyup = !inode && ovl_consistent_fd(dentry->d_sb);
> +
> +       if (WARN_ON(open_flags && inode))
> +               return dentry;
>
>         if (!d_is_reg(dentry)) {
>                 if (!inode || inode == d_inode(dentry))
> @@ -64,8 +73,8 @@ static struct dentry *ovl_d_real(struct dentry *dentry,
>         if (d_is_negative(dentry))
>                 return dentry;
>
> -       if (open_flags) {
> -               int err = ovl_open_maybe_copy_up(dentry, open_flags);
> +       if (open_flags || rocopyup) {
> +               int err = ovl_open_maybe_copy_up(dentry, open_flags, rocopyup);
>
>                 if (err)
>                         return ERR_PTR(err);
> @@ -227,6 +236,10 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
>         if (ufs->config.redirect_dir != ovl_redirect_dir_def)
>                 seq_printf(m, ",redirect_dir=%s",
>                            ufs->config.redirect_dir ? "on" : "off");
> +       if (!(sb->s_flags & MS_RDONLY) &&
> +           ufs->config.consistent_fd != ovl_consistent_fd_def)
> +               seq_printf(m, ",consistent_fd=%s",
> +                          ufs->config.consistent_fd ? "on" : "off");
>         return 0;
>  }
>
> @@ -256,6 +269,8 @@ enum {
>         OPT_DEFAULT_PERMISSIONS,
>         OPT_REDIRECT_DIR_ON,
>         OPT_REDIRECT_DIR_OFF,
> +       OPT_CONSISTENT_FD_ON,
> +       OPT_CONSISTENT_FD_OFF,
>         OPT_ERR,
>  };
>
> @@ -266,6 +281,8 @@ static const match_table_t ovl_tokens = {
>         {OPT_DEFAULT_PERMISSIONS,       "default_permissions"},
>         {OPT_REDIRECT_DIR_ON,           "redirect_dir=on"},
>         {OPT_REDIRECT_DIR_OFF,          "redirect_dir=off"},
> +       {OPT_CONSISTENT_FD_ON,          "consistent_fd=on"},
> +       {OPT_CONSISTENT_FD_OFF,         "consistent_fd=off"},
>         {OPT_ERR,                       NULL}
>  };
>
> @@ -338,6 +355,14 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
>                         config->redirect_dir = false;
>                         break;
>
> +               case OPT_CONSISTENT_FD_ON:
> +                       config->consistent_fd = true;
> +                       break;
> +
> +               case OPT_CONSISTENT_FD_OFF:
> +                       config->consistent_fd = false;
> +                       break;
> +
>                 default:
>                         pr_err("overlayfs: unrecognized mount option \"%s\" or missing value\n", p);
>                         return -EINVAL;
> @@ -732,6 +757,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>
>         init_waitqueue_head(&ufs->copyup_wq);
>         ufs->config.redirect_dir = ovl_redirect_dir_def;
> +       ufs->config.consistent_fd = ovl_consistent_fd_def;
>         err = ovl_parse_opt((char *) data, &ufs->config);
>         if (err)
>                 goto out_free_config;
> @@ -862,11 +888,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>                         if (!err)
>                                 pr_warn("overlayfs: upper fs needs to support d_type.\n");
>
> -                       /* Check if upper/work fs supports O_TMPFILE */
> +                       /* Check if upper fs supports O_TMPFILE and clone */
>                         temp = ovl_do_tmpfile(ufs->workdir, S_IFREG | 0);
>                         ufs->tmpfile = !IS_ERR(temp);
>                         if (ufs->tmpfile) {
> -                               /* Check if upper/work supports clone */
>                                 if (temp->d_inode && temp->d_inode->i_fop &&
>                                     temp->d_inode->i_fop->clone_file_range)
>                                         ufs->cloneup = true;
> @@ -910,6 +935,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>         if (!ufs->upper_mnt)
>                 sb->s_flags |= MS_RDONLY;
>
> +       /* Copy on read for consistent fd depends on clone support */
> +       if (!ufs->cloneup)

I missed || (sb->s_flags & MS_RDONLY) here. posted patch 7/7 with
fix to ro mount + some other changes.

> +               ufs->config.consistent_fd = false;
> +
>         if (remote)
>                 sb->s_d_op = &ovl_reval_dentry_operations;
>         else
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index 159e851..be0a993 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -195,6 +195,13 @@ void ovl_dentry_set_opaque(struct dentry *dentry)
>         oe->__type |= __OVL_PATH_OPAQUE;
>  }
>
> +bool ovl_consistent_fd(struct super_block *sb)
> +{
> +       struct ovl_fs *ofs = sb->s_fs_info;
> +
> +       return ofs->config.consistent_fd;
> +}
> +
>  bool ovl_redirect_dir(struct super_block *sb)
>  {
>         struct ovl_fs *ofs = sb->s_fs_info;
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/6] ovl: consistent_fd feature
  2017-03-29 14:36 [PATCH 0/6] ovl: consistent_fd feature Amir Goldstein
                   ` (6 preceding siblings ...)
  2017-03-30 11:26 ` [PATCH 7/7] ovl: prevent copy on read if no upper/work dir Amir Goldstein
@ 2017-03-30 11:34 ` Amir Goldstein
  2017-04-06 15:20   ` Miklos Szeredi
  7 siblings, 1 reply; 25+ messages in thread
From: Amir Goldstein @ 2017-03-30 11:34 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

On Wed, Mar 29, 2017 at 5:36 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> Mikos,
>
> This patch set implement the 'simple' solution we discussed on LSF.
> For the special case of all overlays on the same fs with clone support,
> files are copied up on open for read (as O_TMPFILE) and linked to
> upperdir on first open for write.
>
> - Patches 1-2 are the refactoring I sent you earlier.  They are not
>   strictly needed for the consistent_fd feature - up to you.
> - Patches 3-4 test 'samefs' and 'cloneup' properties of underlying fs.
> - Patch 5 copies up open for read (for the special case).
> - Patch 6 defers linking the tmpfile to first open for write.
>
> Some of the design choices for patch 6 are questionable:
> the storing of tempfile in ovl_dentry_update(),
> temp dentry is freed only on overlay dentry release.
> awaiting your feedback about those choises.
>
> xfstests run of ./check -overlay -g quick passed on
> both ext4 (no clone) and xfs+reflink (yes clone).
> test overlay/016 ("Test ro/rw fd data inconsistecies")
> passes with xfs+reflink.
> test overlay/013 ("Test truncate running executable...")
> fails with xfs+reflink, because test expects the inconsistency.
>
> I also modified unionmount-testsuite and added ./run --ov --samefs
> to setup lower+upper on same base fs of your choise [1].
> It passes with base fs tmpfs (no clone) and base fs xfs+reflink.
>

FYI, I also used this test patch to easily verify when temp inode has been
allocated (on open for read) and when it was freed (on drop_caches),
simply by looking at the overlay inode number.

I don't think we want to make this patch official, because then inode
number can change several times during the lifetime of overlay
instead of just once and worse, changing between 2 subsequent reads...

diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 2fb90be..cd06425 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -62,10 +62,15 @@ static int ovl_getattr(const struct path *path,
struct kstat *stat,
 {
        struct dentry *dentry = path->dentry;
        struct path realpath;
+       enum ovl_path_type type = ovl_path_real(dentry, &realpath);
        const struct cred *old_cred;
        int err;

-       ovl_path_real(dentry, &realpath);
+       if (!OVL_TYPE_UPPER(type) && OVL_TYPE_RO_UPPER(type)) {
+               ovl_path_upper(dentry, &realpath);
+               realpath.dentry = ovl_dentry_ro_upper(dentry);
+       }
+
        old_cred = ovl_override_creds(dentry->d_sb);
        err = vfs_getattr(&realpath, stat, request_mask, flags);
        revert_creds(old_cred);

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/6] ovl: copy on read with consistent_fd feature
  2017-03-29 14:36 ` [PATCH 5/6] ovl: copy on read with consistent_fd feature Amir Goldstein
  2017-03-30 11:28   ` Amir Goldstein
@ 2017-03-31 17:58   ` Vivek Goyal
  2017-04-01  9:27     ` Amir Goldstein
  1 sibling, 1 reply; 25+ messages in thread
From: Vivek Goyal @ 2017-03-31 17:58 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, linux-unionfs

On Wed, Mar 29, 2017 at 05:36:05PM +0300, Amir Goldstein wrote:
> Overlayfs copies up a file when the file is opened for write.  Writes
> to the returned file descriptor are not visible to file descriptors
> that were opened for read prior to copy up.
> 
> If this config option is enabled then overlay filesystems will default
> to copy up a file that is opened for read to avoid this inconsistency.
> In this case, it is still possible to turn off consistent fd globally
> with the "consistent_fd=off" module option or on a filesystem instance
> basis with the "consistent_fd=off" mount option.

Hi Amir,

So all readers will pay a small cost of copying up file always (as temp
file). I am curious to know if that cost is significant.

Are there any other downsides of opting in for this behavior by default.
I am assuming page cache usage will not be higher due to clone operation.

Vivek


> 
> The feature will not be turned on for an overlay mount unless all
> the layers are on the same underlying filesystem and this filesystem
> supports clone.
> 
> This introduces a slight change in d_real() semantics. Now d_real()
> may return an error with NULL inode argument also in the zero open
> flags case. vfs API documentation has been updated.
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  Documentation/filesystems/vfs.txt | 13 ++++++-------
>  fs/overlayfs/Kconfig              | 18 ++++++++++++++++++
>  fs/overlayfs/inode.c              | 27 ++++++++++++++++-----------
>  fs/overlayfs/overlayfs.h          |  4 +++-
>  fs/overlayfs/ovl_entry.h          |  2 ++
>  fs/overlayfs/super.c              | 37 +++++++++++++++++++++++++++++++++----
>  fs/overlayfs/util.c               |  7 +++++++
>  7 files changed, 85 insertions(+), 23 deletions(-)
> 
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index 5692117..0dd317b 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -1088,22 +1088,21 @@ struct dentry_operations {
>  	dentry being transited from.
>  
>    d_real: overlay/union type filesystems implement this method to return one of
> -	the underlying dentries hidden by the overlay.  It is used in three
> +	the underlying dentries hidden by the overlay.  It is used in two
>  	different modes:
>  
>  	Called from open it may need to copy-up the file depending on the
> -	supplied open flags.  This mode is selected with a non-zero flags
> -	argument.  In this mode the d_real method can return an error.
> +	supplied open flags.  It returns the upper real dentry if file was
> +	copied up or the topmost real underlying dentry otherwise.
> +	This mode is selected with a NULL inode argument.
> +	In this mode the d_real method can return an error.
>  
>  	Called from file_dentry() it returns the real dentry matching the inode
>  	argument.  The real dentry may be from a lower layer already copied up,
>  	but still referenced from the file.  This mode is selected with a
>  	non-NULL inode argument.  This will always succeed.
>  
> -	With NULL inode and zero flags the topmost real underlying dentry is
> -	returned.  This will always succeed.
> -
> -	This method is never called with both non-NULL inode and non-zero flags.
> +	This method is never called with non-NULL inode and non-zero open flags.
>  
>  Each dentry has a pointer to its parent dentry, as well as a hash list
>  of child dentries. Child dentries are basically like files in a
> diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
> index 0daac51..7425862 100644
> --- a/fs/overlayfs/Kconfig
> +++ b/fs/overlayfs/Kconfig
> @@ -22,3 +22,21 @@ config OVERLAY_FS_REDIRECT_DIR
>  	  Note, that redirects are not backward compatible.  That is, mounting
>  	  an overlay which has redirects on a kernel that doesn't support this
>  	  feature will have unexpected results.
> +
> +config OVERLAY_FS_CONSISTENT_FD
> +	bool "Overlayfs: turn on consistent fd feature by default"
> +	depends on OVERLAY_FS
> +	help
> +	  Overlayfs copies up a file when the file is opened for write.  Writes
> +          to the returned file descriptor are not visible to file descriptors
> +          that were opened for read prior to copy up.
> +
> +          If this config option is enabled then overlay filesystems will default
> +          to copy up a file that is opened for read to avoid this inconsistency.
> +          In this case, it is still possible to turn off consistent fd globally
> +          with the "consistent_fd=off" module option or on a filesystem instance
> +          basis with the "consistent_fd=off" mount option.
> +
> +          The feature will not be turned on for an overlay mount unless all
> +          the layers are on the same underlying filesystem and this filesystem
> +          supports clone.
> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> index 17b8418..f2f55e1 100644
> --- a/fs/overlayfs/inode.c
> +++ b/fs/overlayfs/inode.c
> @@ -231,7 +231,7 @@ struct posix_acl *ovl_get_acl(struct inode *inode, int type)
>  }
>  
>  static bool ovl_open_need_copy_up(int flags, enum ovl_path_type type,
> -				  struct dentry *realdentry)
> +				  struct dentry *realdentry, bool rocopyup)
>  {
>  	if (OVL_TYPE_UPPER(type))
>  		return false;
> @@ -239,25 +239,30 @@ static bool ovl_open_need_copy_up(int flags, enum ovl_path_type type,
>  	if (special_file(realdentry->d_inode->i_mode))
>  		return false;
>  
> +	/* Copy up on open for read for consistent fd */
> +	if (rocopyup)
> +		return true;
> +
>  	if (!(OPEN_FMODE(flags) & FMODE_WRITE) && !(flags & O_TRUNC))
>  		return false;
>  
>  	return true;
>  }
>  
> -int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags)
> +int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags,
> +			   bool rocopyup)
>  {
>  	int err = 0;
>  	struct path realpath;
> -	enum ovl_path_type type;
> -
> -	type = ovl_path_real(dentry, &realpath);
> -	if (ovl_open_need_copy_up(file_flags, type, realpath.dentry)) {
> -		err = ovl_want_write(dentry);
> -		if (!err) {
> -			err = ovl_copy_up_flags(dentry, file_flags);
> -			ovl_drop_write(dentry);
> -		}
> +	enum ovl_path_type type = ovl_path_real(dentry, &realpath);
> +
> +	if (!ovl_open_need_copy_up(file_flags, type, realpath.dentry, rocopyup))
> +		return 0;
> +
> +	err = ovl_want_write(dentry);
> +	if (!err) {
> +		err = ovl_copy_up_flags(dentry, file_flags);
> +		ovl_drop_write(dentry);
>  	}
>  
>  	return err;
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index 079051e..d13ad5f 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -169,6 +169,7 @@ void ovl_set_dir_cache(struct dentry *dentry, struct ovl_dir_cache *cache);
>  bool ovl_dentry_is_opaque(struct dentry *dentry);
>  bool ovl_dentry_is_whiteout(struct dentry *dentry);
>  void ovl_dentry_set_opaque(struct dentry *dentry);
> +bool ovl_consistent_fd(struct super_block *sb);
>  bool ovl_redirect_dir(struct super_block *sb);
>  void ovl_clear_redirect_dir(struct super_block *sb);
>  const char *ovl_dentry_get_redirect(struct dentry *dentry);
> @@ -207,7 +208,8 @@ int ovl_xattr_get(struct dentry *dentry, const char *name,
>  		  void *value, size_t size);
>  ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size);
>  struct posix_acl *ovl_get_acl(struct inode *inode, int type);
> -int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags);
> +int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags,
> +			   bool rocopyup);
>  int ovl_update_time(struct inode *inode, struct timespec *ts, int flags);
>  bool ovl_is_private_xattr(const char *name);
>  
> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> index fb1210d..c11a72d 100644
> --- a/fs/overlayfs/ovl_entry.h
> +++ b/fs/overlayfs/ovl_entry.h
> @@ -14,6 +14,7 @@ struct ovl_config {
>  	char *workdir;
>  	bool default_permissions;
>  	bool redirect_dir;
> +	bool consistent_fd;
>  };
>  
>  /* private information held for overlayfs's superblock */
> @@ -30,6 +31,7 @@ struct ovl_fs {
>  	bool tmpfile;	/* upper supports O_TMPFILE */
>  	bool samefs;	/* all layers on same fs */
>  	bool cloneup;	/* can clone from lower to upper */
> +	bool rocopyup;	/* copy up on open for read */
>  	wait_queue_head_t copyup_wq;
>  };
>  
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 75b93d6..e5fd53a 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -34,6 +34,11 @@ module_param_named(redirect_dir, ovl_redirect_dir_def, bool, 0644);
>  MODULE_PARM_DESC(ovl_redirect_dir_def,
>  		 "Default to on or off for the redirect_dir feature");
>  
> +static bool ovl_consistent_fd_def = IS_ENABLED(CONFIG_OVERLAY_FS_CONSISTENT_FD);
> +module_param_named(consistent_fd, ovl_consistent_fd_def, bool, 0644);
> +MODULE_PARM_DESC(ovl_consistent_fd_def,
> +		 "Default to on or off for the consistent_fd feature");
> +
>  static void ovl_dentry_release(struct dentry *dentry)
>  {
>  	struct ovl_entry *oe = dentry->d_fsdata;
> @@ -54,6 +59,10 @@ static struct dentry *ovl_d_real(struct dentry *dentry,
>  				 unsigned int open_flags)
>  {
>  	struct dentry *real;
> +	bool rocopyup = !inode && ovl_consistent_fd(dentry->d_sb);
> +
> +	if (WARN_ON(open_flags && inode))
> +		return dentry;
>  
>  	if (!d_is_reg(dentry)) {
>  		if (!inode || inode == d_inode(dentry))
> @@ -64,8 +73,8 @@ static struct dentry *ovl_d_real(struct dentry *dentry,
>  	if (d_is_negative(dentry))
>  		return dentry;
>  
> -	if (open_flags) {
> -		int err = ovl_open_maybe_copy_up(dentry, open_flags);
> +	if (open_flags || rocopyup) {
> +		int err = ovl_open_maybe_copy_up(dentry, open_flags, rocopyup);
>  
>  		if (err)
>  			return ERR_PTR(err);
> @@ -227,6 +236,10 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
>  	if (ufs->config.redirect_dir != ovl_redirect_dir_def)
>  		seq_printf(m, ",redirect_dir=%s",
>  			   ufs->config.redirect_dir ? "on" : "off");
> +	if (!(sb->s_flags & MS_RDONLY) &&
> +	    ufs->config.consistent_fd != ovl_consistent_fd_def)
> +		seq_printf(m, ",consistent_fd=%s",
> +			   ufs->config.consistent_fd ? "on" : "off");
>  	return 0;
>  }
>  
> @@ -256,6 +269,8 @@ enum {
>  	OPT_DEFAULT_PERMISSIONS,
>  	OPT_REDIRECT_DIR_ON,
>  	OPT_REDIRECT_DIR_OFF,
> +	OPT_CONSISTENT_FD_ON,
> +	OPT_CONSISTENT_FD_OFF,
>  	OPT_ERR,
>  };
>  
> @@ -266,6 +281,8 @@ static const match_table_t ovl_tokens = {
>  	{OPT_DEFAULT_PERMISSIONS,	"default_permissions"},
>  	{OPT_REDIRECT_DIR_ON,		"redirect_dir=on"},
>  	{OPT_REDIRECT_DIR_OFF,		"redirect_dir=off"},
> +	{OPT_CONSISTENT_FD_ON,		"consistent_fd=on"},
> +	{OPT_CONSISTENT_FD_OFF,		"consistent_fd=off"},
>  	{OPT_ERR,			NULL}
>  };
>  
> @@ -338,6 +355,14 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
>  			config->redirect_dir = false;
>  			break;
>  
> +		case OPT_CONSISTENT_FD_ON:
> +			config->consistent_fd = true;
> +			break;
> +
> +		case OPT_CONSISTENT_FD_OFF:
> +			config->consistent_fd = false;
> +			break;
> +
>  		default:
>  			pr_err("overlayfs: unrecognized mount option \"%s\" or missing value\n", p);
>  			return -EINVAL;
> @@ -732,6 +757,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>  
>  	init_waitqueue_head(&ufs->copyup_wq);
>  	ufs->config.redirect_dir = ovl_redirect_dir_def;
> +	ufs->config.consistent_fd = ovl_consistent_fd_def;
>  	err = ovl_parse_opt((char *) data, &ufs->config);
>  	if (err)
>  		goto out_free_config;
> @@ -862,11 +888,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>  			if (!err)
>  				pr_warn("overlayfs: upper fs needs to support d_type.\n");
>  
> -			/* Check if upper/work fs supports O_TMPFILE */
> +			/* Check if upper fs supports O_TMPFILE and clone */
>  			temp = ovl_do_tmpfile(ufs->workdir, S_IFREG | 0);
>  			ufs->tmpfile = !IS_ERR(temp);
>  			if (ufs->tmpfile) {
> -				/* Check if upper/work supports clone */
>  				if (temp->d_inode && temp->d_inode->i_fop &&
>  				    temp->d_inode->i_fop->clone_file_range)
>  					ufs->cloneup = true;
> @@ -910,6 +935,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>  	if (!ufs->upper_mnt)
>  		sb->s_flags |= MS_RDONLY;
>  
> +	/* Copy on read for consistent fd depends on clone support */
> +	if (!ufs->cloneup)
> +		ufs->config.consistent_fd = false;
> +
>  	if (remote)
>  		sb->s_d_op = &ovl_reval_dentry_operations;
>  	else
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index 159e851..be0a993 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -195,6 +195,13 @@ void ovl_dentry_set_opaque(struct dentry *dentry)
>  	oe->__type |= __OVL_PATH_OPAQUE;
>  }
>  
> +bool ovl_consistent_fd(struct super_block *sb)
> +{
> +	struct ovl_fs *ofs = sb->s_fs_info;
> +
> +	return ofs->config.consistent_fd;
> +}
> +
>  bool ovl_redirect_dir(struct super_block *sb)
>  {
>  	struct ovl_fs *ofs = sb->s_fs_info;
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-unionfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/6] ovl: copy on read with consistent_fd feature
  2017-03-31 17:58   ` Vivek Goyal
@ 2017-04-01  9:27     ` Amir Goldstein
  2017-04-05 13:20       ` Amir Goldstein
  0 siblings, 1 reply; 25+ messages in thread
From: Amir Goldstein @ 2017-04-01  9:27 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Miklos Szeredi, linux-unionfs

On Fri, Mar 31, 2017 at 8:58 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Wed, Mar 29, 2017 at 05:36:05PM +0300, Amir Goldstein wrote:
>> Overlayfs copies up a file when the file is opened for write.  Writes
>> to the returned file descriptor are not visible to file descriptors
>> that were opened for read prior to copy up.
>>
>> If this config option is enabled then overlay filesystems will default
>> to copy up a file that is opened for read to avoid this inconsistency.
>> In this case, it is still possible to turn off consistent fd globally
>> with the "consistent_fd=off" module option or on a filesystem instance
>> basis with the "consistent_fd=off" mount option.
>
> Hi Amir,
>

Hi Vivek,

> So all readers will pay a small cost of copying up file always (as temp
> file). I am curious to know if that cost is significant.
>

Cost depends of course on the performance of allocating inode and clone
of specific fs. Essentially, the answer is the same as the question:
"what is the cost of open for write on overlay with clone up support".
But the cost is only for the first reader, before overlay dentry is in cache.
Also the cost of first reader is deducted from cost of first writer..

In some very preliminary filebench tests I ran on overlay over xfs+reflink
xfs open was 0.1ms
overlay first open was 17ms
overlay second open after file is already copied up was 1.2ms
overlay second open when overlay dentry is in cache was 0.1ms

> Are there any other downsides of opting in for this behavior by default.
> I am assuming page cache usage will not be higher due to clone operation.
>

Sadly, page cache is not shared between cloned file.
That's something Miklos is trying to look into.

So at this point, enabling consistent_fd should be done only for
use cases that really care about consistent fd and willing to pay
the cost of clone, extra (temporary) usage of page cache and
the extra I/O of reading those pages.

It so happens that I have such a use case...

>
>
>>
>> The feature will not be turned on for an overlay mount unless all
>> the layers are on the same underlying filesystem and this filesystem
>> supports clone.
>>
>> This introduces a slight change in d_real() semantics. Now d_real()
>> may return an error with NULL inode argument also in the zero open
>> flags case. vfs API documentation has been updated.
>>
>> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>> ---
>>  Documentation/filesystems/vfs.txt | 13 ++++++-------
>>  fs/overlayfs/Kconfig              | 18 ++++++++++++++++++
>>  fs/overlayfs/inode.c              | 27 ++++++++++++++++-----------
>>  fs/overlayfs/overlayfs.h          |  4 +++-
>>  fs/overlayfs/ovl_entry.h          |  2 ++
>>  fs/overlayfs/super.c              | 37 +++++++++++++++++++++++++++++++++----
>>  fs/overlayfs/util.c               |  7 +++++++
>>  7 files changed, 85 insertions(+), 23 deletions(-)
>>
>> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
>> index 5692117..0dd317b 100644
>> --- a/Documentation/filesystems/vfs.txt
>> +++ b/Documentation/filesystems/vfs.txt
>> @@ -1088,22 +1088,21 @@ struct dentry_operations {
>>       dentry being transited from.
>>
>>    d_real: overlay/union type filesystems implement this method to return one of
>> -     the underlying dentries hidden by the overlay.  It is used in three
>> +     the underlying dentries hidden by the overlay.  It is used in two
>>       different modes:
>>
>>       Called from open it may need to copy-up the file depending on the
>> -     supplied open flags.  This mode is selected with a non-zero flags
>> -     argument.  In this mode the d_real method can return an error.
>> +     supplied open flags.  It returns the upper real dentry if file was
>> +     copied up or the topmost real underlying dentry otherwise.
>> +     This mode is selected with a NULL inode argument.
>> +     In this mode the d_real method can return an error.
>>
>>       Called from file_dentry() it returns the real dentry matching the inode
>>       argument.  The real dentry may be from a lower layer already copied up,
>>       but still referenced from the file.  This mode is selected with a
>>       non-NULL inode argument.  This will always succeed.
>>
>> -     With NULL inode and zero flags the topmost real underlying dentry is
>> -     returned.  This will always succeed.
>> -
>> -     This method is never called with both non-NULL inode and non-zero flags.
>> +     This method is never called with non-NULL inode and non-zero open flags.
>>
>>  Each dentry has a pointer to its parent dentry, as well as a hash list
>>  of child dentries. Child dentries are basically like files in a
>> diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
>> index 0daac51..7425862 100644
>> --- a/fs/overlayfs/Kconfig
>> +++ b/fs/overlayfs/Kconfig
>> @@ -22,3 +22,21 @@ config OVERLAY_FS_REDIRECT_DIR
>>         Note, that redirects are not backward compatible.  That is, mounting
>>         an overlay which has redirects on a kernel that doesn't support this
>>         feature will have unexpected results.
>> +
>> +config OVERLAY_FS_CONSISTENT_FD
>> +     bool "Overlayfs: turn on consistent fd feature by default"
>> +     depends on OVERLAY_FS
>> +     help
>> +       Overlayfs copies up a file when the file is opened for write.  Writes
>> +          to the returned file descriptor are not visible to file descriptors
>> +          that were opened for read prior to copy up.
>> +
>> +          If this config option is enabled then overlay filesystems will default
>> +          to copy up a file that is opened for read to avoid this inconsistency.
>> +          In this case, it is still possible to turn off consistent fd globally
>> +          with the "consistent_fd=off" module option or on a filesystem instance
>> +          basis with the "consistent_fd=off" mount option.
>> +
>> +          The feature will not be turned on for an overlay mount unless all
>> +          the layers are on the same underlying filesystem and this filesystem
>> +          supports clone.
>> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
>> index 17b8418..f2f55e1 100644
>> --- a/fs/overlayfs/inode.c
>> +++ b/fs/overlayfs/inode.c
>> @@ -231,7 +231,7 @@ struct posix_acl *ovl_get_acl(struct inode *inode, int type)
>>  }
>>
>>  static bool ovl_open_need_copy_up(int flags, enum ovl_path_type type,
>> -                               struct dentry *realdentry)
>> +                               struct dentry *realdentry, bool rocopyup)
>>  {
>>       if (OVL_TYPE_UPPER(type))
>>               return false;
>> @@ -239,25 +239,30 @@ static bool ovl_open_need_copy_up(int flags, enum ovl_path_type type,
>>       if (special_file(realdentry->d_inode->i_mode))
>>               return false;
>>
>> +     /* Copy up on open for read for consistent fd */
>> +     if (rocopyup)
>> +             return true;
>> +
>>       if (!(OPEN_FMODE(flags) & FMODE_WRITE) && !(flags & O_TRUNC))
>>               return false;
>>
>>       return true;
>>  }
>>
>> -int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags)
>> +int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags,
>> +                        bool rocopyup)
>>  {
>>       int err = 0;
>>       struct path realpath;
>> -     enum ovl_path_type type;
>> -
>> -     type = ovl_path_real(dentry, &realpath);
>> -     if (ovl_open_need_copy_up(file_flags, type, realpath.dentry)) {
>> -             err = ovl_want_write(dentry);
>> -             if (!err) {
>> -                     err = ovl_copy_up_flags(dentry, file_flags);
>> -                     ovl_drop_write(dentry);
>> -             }
>> +     enum ovl_path_type type = ovl_path_real(dentry, &realpath);
>> +
>> +     if (!ovl_open_need_copy_up(file_flags, type, realpath.dentry, rocopyup))
>> +             return 0;
>> +
>> +     err = ovl_want_write(dentry);
>> +     if (!err) {
>> +             err = ovl_copy_up_flags(dentry, file_flags);
>> +             ovl_drop_write(dentry);
>>       }
>>
>>       return err;
>> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
>> index 079051e..d13ad5f 100644
>> --- a/fs/overlayfs/overlayfs.h
>> +++ b/fs/overlayfs/overlayfs.h
>> @@ -169,6 +169,7 @@ void ovl_set_dir_cache(struct dentry *dentry, struct ovl_dir_cache *cache);
>>  bool ovl_dentry_is_opaque(struct dentry *dentry);
>>  bool ovl_dentry_is_whiteout(struct dentry *dentry);
>>  void ovl_dentry_set_opaque(struct dentry *dentry);
>> +bool ovl_consistent_fd(struct super_block *sb);
>>  bool ovl_redirect_dir(struct super_block *sb);
>>  void ovl_clear_redirect_dir(struct super_block *sb);
>>  const char *ovl_dentry_get_redirect(struct dentry *dentry);
>> @@ -207,7 +208,8 @@ int ovl_xattr_get(struct dentry *dentry, const char *name,
>>                 void *value, size_t size);
>>  ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size);
>>  struct posix_acl *ovl_get_acl(struct inode *inode, int type);
>> -int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags);
>> +int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags,
>> +                        bool rocopyup);
>>  int ovl_update_time(struct inode *inode, struct timespec *ts, int flags);
>>  bool ovl_is_private_xattr(const char *name);
>>
>> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
>> index fb1210d..c11a72d 100644
>> --- a/fs/overlayfs/ovl_entry.h
>> +++ b/fs/overlayfs/ovl_entry.h
>> @@ -14,6 +14,7 @@ struct ovl_config {
>>       char *workdir;
>>       bool default_permissions;
>>       bool redirect_dir;
>> +     bool consistent_fd;
>>  };
>>
>>  /* private information held for overlayfs's superblock */
>> @@ -30,6 +31,7 @@ struct ovl_fs {
>>       bool tmpfile;   /* upper supports O_TMPFILE */
>>       bool samefs;    /* all layers on same fs */
>>       bool cloneup;   /* can clone from lower to upper */
>> +     bool rocopyup;  /* copy up on open for read */
>>       wait_queue_head_t copyup_wq;
>>  };
>>
>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>> index 75b93d6..e5fd53a 100644
>> --- a/fs/overlayfs/super.c
>> +++ b/fs/overlayfs/super.c
>> @@ -34,6 +34,11 @@ module_param_named(redirect_dir, ovl_redirect_dir_def, bool, 0644);
>>  MODULE_PARM_DESC(ovl_redirect_dir_def,
>>                "Default to on or off for the redirect_dir feature");
>>
>> +static bool ovl_consistent_fd_def = IS_ENABLED(CONFIG_OVERLAY_FS_CONSISTENT_FD);
>> +module_param_named(consistent_fd, ovl_consistent_fd_def, bool, 0644);
>> +MODULE_PARM_DESC(ovl_consistent_fd_def,
>> +              "Default to on or off for the consistent_fd feature");
>> +
>>  static void ovl_dentry_release(struct dentry *dentry)
>>  {
>>       struct ovl_entry *oe = dentry->d_fsdata;
>> @@ -54,6 +59,10 @@ static struct dentry *ovl_d_real(struct dentry *dentry,
>>                                unsigned int open_flags)
>>  {
>>       struct dentry *real;
>> +     bool rocopyup = !inode && ovl_consistent_fd(dentry->d_sb);
>> +
>> +     if (WARN_ON(open_flags && inode))
>> +             return dentry;
>>
>>       if (!d_is_reg(dentry)) {
>>               if (!inode || inode == d_inode(dentry))
>> @@ -64,8 +73,8 @@ static struct dentry *ovl_d_real(struct dentry *dentry,
>>       if (d_is_negative(dentry))
>>               return dentry;
>>
>> -     if (open_flags) {
>> -             int err = ovl_open_maybe_copy_up(dentry, open_flags);
>> +     if (open_flags || rocopyup) {
>> +             int err = ovl_open_maybe_copy_up(dentry, open_flags, rocopyup);
>>
>>               if (err)
>>                       return ERR_PTR(err);
>> @@ -227,6 +236,10 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
>>       if (ufs->config.redirect_dir != ovl_redirect_dir_def)
>>               seq_printf(m, ",redirect_dir=%s",
>>                          ufs->config.redirect_dir ? "on" : "off");
>> +     if (!(sb->s_flags & MS_RDONLY) &&
>> +         ufs->config.consistent_fd != ovl_consistent_fd_def)
>> +             seq_printf(m, ",consistent_fd=%s",
>> +                        ufs->config.consistent_fd ? "on" : "off");
>>       return 0;
>>  }
>>
>> @@ -256,6 +269,8 @@ enum {
>>       OPT_DEFAULT_PERMISSIONS,
>>       OPT_REDIRECT_DIR_ON,
>>       OPT_REDIRECT_DIR_OFF,
>> +     OPT_CONSISTENT_FD_ON,
>> +     OPT_CONSISTENT_FD_OFF,
>>       OPT_ERR,
>>  };
>>
>> @@ -266,6 +281,8 @@ static const match_table_t ovl_tokens = {
>>       {OPT_DEFAULT_PERMISSIONS,       "default_permissions"},
>>       {OPT_REDIRECT_DIR_ON,           "redirect_dir=on"},
>>       {OPT_REDIRECT_DIR_OFF,          "redirect_dir=off"},
>> +     {OPT_CONSISTENT_FD_ON,          "consistent_fd=on"},
>> +     {OPT_CONSISTENT_FD_OFF,         "consistent_fd=off"},
>>       {OPT_ERR,                       NULL}
>>  };
>>
>> @@ -338,6 +355,14 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
>>                       config->redirect_dir = false;
>>                       break;
>>
>> +             case OPT_CONSISTENT_FD_ON:
>> +                     config->consistent_fd = true;
>> +                     break;
>> +
>> +             case OPT_CONSISTENT_FD_OFF:
>> +                     config->consistent_fd = false;
>> +                     break;
>> +
>>               default:
>>                       pr_err("overlayfs: unrecognized mount option \"%s\" or missing value\n", p);
>>                       return -EINVAL;
>> @@ -732,6 +757,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>>
>>       init_waitqueue_head(&ufs->copyup_wq);
>>       ufs->config.redirect_dir = ovl_redirect_dir_def;
>> +     ufs->config.consistent_fd = ovl_consistent_fd_def;
>>       err = ovl_parse_opt((char *) data, &ufs->config);
>>       if (err)
>>               goto out_free_config;
>> @@ -862,11 +888,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>>                       if (!err)
>>                               pr_warn("overlayfs: upper fs needs to support d_type.\n");
>>
>> -                     /* Check if upper/work fs supports O_TMPFILE */
>> +                     /* Check if upper fs supports O_TMPFILE and clone */
>>                       temp = ovl_do_tmpfile(ufs->workdir, S_IFREG | 0);
>>                       ufs->tmpfile = !IS_ERR(temp);
>>                       if (ufs->tmpfile) {
>> -                             /* Check if upper/work supports clone */
>>                               if (temp->d_inode && temp->d_inode->i_fop &&
>>                                   temp->d_inode->i_fop->clone_file_range)
>>                                       ufs->cloneup = true;
>> @@ -910,6 +935,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>>       if (!ufs->upper_mnt)
>>               sb->s_flags |= MS_RDONLY;
>>
>> +     /* Copy on read for consistent fd depends on clone support */
>> +     if (!ufs->cloneup)
>> +             ufs->config.consistent_fd = false;
>> +
>>       if (remote)
>>               sb->s_d_op = &ovl_reval_dentry_operations;
>>       else
>> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
>> index 159e851..be0a993 100644
>> --- a/fs/overlayfs/util.c
>> +++ b/fs/overlayfs/util.c
>> @@ -195,6 +195,13 @@ void ovl_dentry_set_opaque(struct dentry *dentry)
>>       oe->__type |= __OVL_PATH_OPAQUE;
>>  }
>>
>> +bool ovl_consistent_fd(struct super_block *sb)
>> +{
>> +     struct ovl_fs *ofs = sb->s_fs_info;
>> +
>> +     return ofs->config.consistent_fd;
>> +}
>> +
>>  bool ovl_redirect_dir(struct super_block *sb)
>>  {
>>       struct ovl_fs *ofs = sb->s_fs_info;
>> --
>> 2.7.4
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-unionfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/6] ovl: copy on read with consistent_fd feature
  2017-04-01  9:27     ` Amir Goldstein
@ 2017-04-05 13:20       ` Amir Goldstein
  0 siblings, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2017-04-05 13:20 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs, Vivek Goyal

On Sat, Apr 1, 2017 at 12:27 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Fri, Mar 31, 2017 at 8:58 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> On Wed, Mar 29, 2017 at 05:36:05PM +0300, Amir Goldstein wrote:
>>> Overlayfs copies up a file when the file is opened for write.  Writes
>>> to the returned file descriptor are not visible to file descriptors
>>> that were opened for read prior to copy up.
>>>
>>> If this config option is enabled then overlay filesystems will default
>>> to copy up a file that is opened for read to avoid this inconsistency.
>>> In this case, it is still possible to turn off consistent fd globally
>>> with the "consistent_fd=off" module option or on a filesystem instance
>>> basis with the "consistent_fd=off" mount option.
>>
>> Hi Amir,
>>
>
> Hi Vivek,
>
>> So all readers will pay a small cost of copying up file always (as temp
>> file). I am curious to know if that cost is significant.
>>
>
[...]
>> Are there any other downsides of opting in for this behavior by default.
>> I am assuming page cache usage will not be higher due to clone operation.
>>
>
> Sadly, page cache is not shared between cloned file.
> That's something Miklos is trying to look into.
>
> So at this point, enabling consistent_fd should be done only for
> use cases that really care about consistent fd and willing to pay
> the cost of clone, extra usage of page cache and
> the extra I/O of reading those pages.
>

Miklos,

Did you get a change to look at the patches?
Is this aligned with what you had in mind?

Also, wrt Vivek's question on other impacts, I forgot to mention
that "temp read-only copy up" does an actual copy up of parent directory
ancestry, which does not go away when file is closed.

Which means that it may make sense to couple consistent_fd feature
with your old proposal to opt-in for copy up directory on getattr:
http://www.spinics.net/lists/linux-unionfs/msg00862.html

The logic being: if you care about ro/rw consistency and willing to pay
the cost of copy up of every parent dir of a file that has ever been read,
perhaps the extra cost of copy up dir on getattr is a small extra to pay
to get a bit more consistency?

Or perhaps you think that the cost of parent ancestry copy up
for open for read is too expensive to begin with and I should try to
get rid of this cost?

There are a few reasons I chose to copy up parents on rocopyup:
1. It keeps the code simpler (more code reuse with rw tmpfile copy up)
2. vfs_tmpfile() checks inode_permission(MAY_WRITE | MAY_EXEC)
    on parent dir inode and passes the inode to i_op->tmpfile().
    it's not clear to me if that dir inode is really important or if I
could just
    pass workdir inode or something(?)
3. inode_lock_nested(upperdir->d_inode) is used to serialize ro upper
    copy up with rw upper copy up -
    damn! not only should I have used copyup_wq for that purpose, but
    also I think I got the rocopyup-rocopyup race completely wrong :-/

I'll await your feedback on the series before I fix this race.

Amir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/6] ovl: consistent_fd feature
  2017-03-30 11:34 ` [PATCH 0/6] ovl: consistent_fd feature Amir Goldstein
@ 2017-04-06 15:20   ` Miklos Szeredi
  2017-04-06 15:37     ` Miklos Szeredi
                       ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Miklos Szeredi @ 2017-04-06 15:20 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-unionfs

On Thu, Mar 30, 2017 at 1:34 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Wed, Mar 29, 2017 at 5:36 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> Mikos,
>>
>> This patch set implement the 'simple' solution we discussed on LSF.
>> For the special case of all overlays on the same fs with clone support,
>> files are copied up on open for read (as O_TMPFILE) and linked to
>> upperdir on first open for write.
>>
>> - Patches 1-2 are the refactoring I sent you earlier.  They are not
>>   strictly needed for the consistent_fd feature - up to you.
>> - Patches 3-4 test 'samefs' and 'cloneup' properties of underlying fs.
>> - Patch 5 copies up open for read (for the special case).
>> - Patch 6 defers linking the tmpfile to first open for write.
>>
>> Some of the design choices for patch 6 are questionable:
>> the storing of tempfile in ovl_dentry_update(),
>> temp dentry is freed only on overlay dentry release.
>> awaiting your feedback about those choises.

My gut feel is that we should throw away the tempfile once all files
referring to the dentry are closed.  Memory pressure on dcache does
not seem like a good way to control the number inodes allocated on
underlying fs.

I'm hoping that this is going to be a temporary solution, because I
think it's unreasonably heavyweight compared to the rareness of the
issue it is solving.  I think this should be emphasized: this is for
the paranoid and it will cause a performance to degrade and resource
use to increase (although it may be an acceptable amount).

I also have a feeling that we must get the inode number from the lower
file in this case or it's going to break things (e.g. the pattern:
stat(), open(), fstat() verify st_dev/st_ino unchanged). Same goes for
atime, although that's a much smaller issue.

Yeah, yeah, copy-up has that issue, but lets remember that copy-up is
a rare event compared to open(O_RDONLY).

And we need to fix the constant inode number issue anyway.

So I thought about the constant inode number thing (for now only for
the samefs case) and here are my ideas:

 - need a database of upper_ino -> lower_ino mapping
 - on copy-up: append new mapping to database
 - on remove: delete mapping from database (since upper_ino might be reused)
 - on readdir and stat: check database and replace st_ino/d_ino if needed

For NFS export and hard-link-copy-up support, just need to extend this
with a reverse mapping.

The interesting questions are:

 - where to keep the database(s)
 - and how to cache them efficiently in the kernel.

I'd refrain from changes in disk format (i.e. those that rely on
backward incompatible features for underlying filesystems).  At least
for the first version.

Simplest solution is to keep mapping in the upper inode itself
(xattr).  Takes care of removal automatically.  Can be cached in
ovl_entry.  Makes readdir() really inefficient... need to cache
whether mapping needed for any directory entries in directory, which
complicates rename.

The other idea is to store the database separately from the upper tree
(this is what aufs does, apparently).  This works for reverse mapping
as well.  Makes all operations (except rename) more complicated.
Keeping whole mapping in kernel memory is prone to resource hogging
and DoS.   Could have a bitmap created by hashing the ino's that are
in the map and setting the bit for those.  Then only reach out to the
disk db if there's a possibility of hit.  Would still need to design
an efficient way to store and access the data in the db (and I have
zero experience with that sort of thing).

Thoughts?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/6] ovl: consistent_fd feature
  2017-04-06 15:20   ` Miklos Szeredi
@ 2017-04-06 15:37     ` Miklos Szeredi
  2017-04-06 16:25       ` Amir Goldstein
  2017-04-06 16:46     ` Amir Goldstein
  2017-04-08  3:03     ` J. R. Okajima
  2 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2017-04-06 15:37 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-unionfs

On Thu, Apr 6, 2017 at 5:20 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, Mar 30, 2017 at 1:34 PM, Amir Goldstein <amir73il@gmail.com> wrote:

> Simplest solution is to keep mapping in the upper inode itself
> (xattr).  Takes care of removal automatically.  Can be cached in
> ovl_entry.  Makes readdir() really inefficient... need to cache
> whether mapping needed for any directory entries in directory, which
> complicates rename.
>
> The other idea is to store the database separately from the upper tree
> (this is what aufs does, apparently).  This works for reverse mapping
> as well.  Makes all operations (except rename) more complicated.
> Keeping whole mapping in kernel memory is prone to resource hogging
> and DoS.   Could have a bitmap created by hashing the ino's that are
> in the map and setting the bit for those.  Then only reach out to the
> disk db if there's a possibility of hit.  Would still need to design
> an efficient way to store and access the data in the db (and I have
> zero experience with that sort of thing).

Hmm, could actually combine the two ideas:

Store mapping just in upper inode, walk upper tree on mount and fill
in bitmap.  Update bitmap as we go.

Reverse mapping still needs out-of-band solution, since walking the
whole upper tree each time we need to find a reverse map doesn't look
like a good solution.

Thanks,
Miklos

>
> Thoughts?
>
> Thanks,
> Miklos

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/6] ovl: consistent_fd feature
  2017-04-06 15:37     ` Miklos Szeredi
@ 2017-04-06 16:25       ` Amir Goldstein
  2017-04-07  9:32         ` Miklos Szeredi
  0 siblings, 1 reply; 25+ messages in thread
From: Amir Goldstein @ 2017-04-06 16:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

On Thu, Apr 6, 2017 at 6:37 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, Apr 6, 2017 at 5:20 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Thu, Mar 30, 2017 at 1:34 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>
>> Simplest solution is to keep mapping in the upper inode itself
>> (xattr).  Takes care of removal automatically.  Can be cached in
>> ovl_entry.  Makes readdir() really inefficient... need to cache
>> whether mapping needed for any directory entries in directory, which
>> complicates rename.

This feels right.
We anyway store a lot of cache per directory dentry, the extra mapping
doesn't sound so bad.
rename is heavy anyway ;-)
And its got the sb wide lock so we don't have to worry about that stuff...

>>
>> The other idea is to store the database separately from the upper tree
>> (this is what aufs does, apparently).  This works for reverse mapping
>> as well.  Makes all operations (except rename) more complicated.

This feels wrong.
Makes me not want to look at this future code...

>> Keeping whole mapping in kernel memory is prone to resource hogging
>> and DoS.   Could have a bitmap created by hashing the ino's that are
>> in the map and setting the bit for those.  Then only reach out to the
>> disk db if there's a possibility of hit.  Would still need to design
>> an efficient way to store and access the data in the db (and I have
>> zero experience with that sort of thing).
>
> Hmm, could actually combine the two ideas:
>
> Store mapping just in upper inode, walk upper tree on mount and fill
> in bitmap.  Update bitmap as we go.
>

Right. You only need a local directory mapping at readdir time, so
only need to scan that dir inodes into the dir_cache.
Instead of invalidating the entire dir cache on every dir operation,
we can try to do something more subtle, so most of the inode mapping
would be preserved.

> Reverse mapping still needs out-of-band solution, since walking the
> whole upper tree each time we need to find a reverse map doesn't look
> like a good solution.
>

Yes. but remember the optimization we talked about in LSF:

At least for userspace NFS daemon that need to use the open_by_handle_at()
API, with a handle it got from name_to_handle_at(), the handle contains
the parent inode and it is possible to reconstruct the path from the handle.

This means that short of redirected directories and hardlinked files, you get
get lower path from handle and then look up the same path in overlay mount
to restore the original dentry.

So you never need reverse mapping for directories that were not redirected.
As for files that are not hardlinks for which kernel nfsd holds a handle without
parent inode... need to see how bad it could be if nfsd used a non unique
handle (which includes parent ino), but had a compare_handle() op from fs
to canonalize the handles.

If restricting reverse mapping to just redirected dirs and hardlinked files
that becomes something that can live in memory (??) and the cost of keeping
a persistent store may be simple enough.

I am thinking linked list of redirect_fh stored in upper inodes (of copy up
hardlinks and redirected dirs). Loading the list at mount is just
traversing the list
and chaining all inodes in memory.
upper remove may need to update one (prev) inode and copy up (or redirect)
may need to update one (tail) inode.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/6] ovl: consistent_fd feature
  2017-04-06 15:20   ` Miklos Szeredi
  2017-04-06 15:37     ` Miklos Szeredi
@ 2017-04-06 16:46     ` Amir Goldstein
  2017-04-07  9:43       ` Miklos Szeredi
  2017-04-08  3:03     ` J. R. Okajima
  2 siblings, 1 reply; 25+ messages in thread
From: Amir Goldstein @ 2017-04-06 16:46 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

On Thu, Apr 6, 2017 at 6:20 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, Mar 30, 2017 at 1:34 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Wed, Mar 29, 2017 at 5:36 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>> Mikos,
>>>
>>> This patch set implement the 'simple' solution we discussed on LSF.
>>> For the special case of all overlays on the same fs with clone support,
>>> files are copied up on open for read (as O_TMPFILE) and linked to
>>> upperdir on first open for write.
>>>
>>> - Patches 1-2 are the refactoring I sent you earlier.  They are not
>>>   strictly needed for the consistent_fd feature - up to you.
>>> - Patches 3-4 test 'samefs' and 'cloneup' properties of underlying fs.
>>> - Patch 5 copies up open for read (for the special case).
>>> - Patch 6 defers linking the tmpfile to first open for write.
>>>
>>> Some of the design choices for patch 6 are questionable:
>>> the storing of tempfile in ovl_dentry_update(),
>>> temp dentry is freed only on overlay dentry release.
>>> awaiting your feedback about those choises.
>
> My gut feel is that we should throw away the tempfile once all files
> referring to the dentry are closed.  Memory pressure on dcache does
> not seem like a good way to control the number inodes allocated on
> underlying fs.
>

:-/ mmm but doing that would make the second open for read overhead
so much worse.
And not only the open itself, all temp inode pages that the first reader
read are going away leaving nothing around for the second reader.

> I'm hoping that this is going to be a temporary solution, because I
> think it's unreasonably heavyweight compared to the rareness of the
> issue it is solving.  I think this should be emphasized: this is for
> the paranoid and it will cause a performance to degrade and resource
> use to increase (although it may be an acceptable amount).
>

I agree its a big cost to pay when overlayfs is used to share
lower with many containers.
But maybe not so much for a single "development container"
use case.
If I am running a docker image and never access the lower
except though overlayfs from my single container, then I have
one copy of inode in cache one copy of the inode pages in cache.
Considering that for sane development I need all my dentries in
cache anyway, then I think that the cost of clone when dentry remains
in cache is really not that bad and the cost of an extra
allocated inode is something that modern fs (xfs, btrfs) can live with.

But of course it must be an opt-in feature...

> I also have a feeling that we must get the inode number from the lower
> file in this case or it's going to break things (e.g. the pattern:
> stat(), open(), fstat() verify st_dev/st_ino unchanged). Same goes for
> atime, although that's a much smaller issue.
>

Then how about the demo patch I sent (rocopyup on getattr).
It deals with the pattern above, but it breaks the pattern:

stat(), drop caches, stat()...

I believe that for the known issues with tar the first pattern is more important
Do you know which applications would suffer from second pattern?

> Yeah, yeah, copy-up has that issue, but lets remember that copy-up is
> a rare event compared to open(O_RDONLY).
>
> And we need to fix the constant inode number issue anyway.
>

Yes, but do you think we should delay consistent_fd until we have a full
solution for constant_ino?
Do you think it is acceptable to fix only stat (to return lower ino) and leave
readdir d_ino for later?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/6] ovl: consistent_fd feature
  2017-04-06 16:25       ` Amir Goldstein
@ 2017-04-07  9:32         ` Miklos Szeredi
  2017-04-07  9:56           ` Miklos Szeredi
  0 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2017-04-07  9:32 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-unionfs

On Thu, Apr 6, 2017 at 6:25 PM, Amir Goldstein <amir73il@gmail.com> wrote:

> I am thinking linked list of redirect_fh stored in upper inodes (of copy up
> hardlinks and redirected dirs). Loading the list at mount is just
> traversing the list
> and chaining all inodes in memory.
> upper remove may need to update one (prev) inode and copy up (or redirect)
> may need to update one (tail) inode.

I think you didn't take into account copied up and renamed
non-directories.  Those don't have redirect, yet need to find by lower
inode.

I have related idea:

When renaming a lower/merged directory or lower non-directory, always
add back-redirect pointer to moved object pointing to original
location, as well as forward-redirect to whiteout at original location
pointing to new location.   Possibly could do both with file handle.
Whenever either is moved/created/removed pointers need to be updated.
Hard linking a copied up file would result in duplication of the
forward pointers.  Solves the nfs export case, since we can always
find new location of objects based on location on lower layer.

Still doesn't solve the hard link copy-up, but now only a list of
partially copied up hard links is needed, which should be short
(unless someone is deliberately trying to mess with this, in which
case we really don't care about performance).

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/6] ovl: consistent_fd feature
  2017-04-06 16:46     ` Amir Goldstein
@ 2017-04-07  9:43       ` Miklos Szeredi
  2017-04-07 11:04         ` Amir Goldstein
  0 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2017-04-07  9:43 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-unionfs

On Thu, Apr 6, 2017 at 6:46 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> Yes, but do you think we should delay consistent_fd until we have a full
> solution for constant_ino?
> Do you think it is acceptable to fix only stat (to return lower ino) and leave
> readdir d_ino for later?

I don't know, because, as we discussed previously, these are really
luxury bugs, and there's no push for either from real use cases.  So
I'd add fixes in the order of becoming the obviously right fix.  I
think we are closer to getting the constant ino resolved properly than
consistent fd.

I really want to look into sharing pages between clones.
"Unfortunately" I'll be taking next week off, so...

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/6] ovl: consistent_fd feature
  2017-04-07  9:32         ` Miklos Szeredi
@ 2017-04-07  9:56           ` Miklos Szeredi
  2017-04-07 10:47             ` Amir Goldstein
  0 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2017-04-07  9:56 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-unionfs

On Fri, Apr 7, 2017 at 11:32 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, Apr 6, 2017 at 6:25 PM, Amir Goldstein <amir73il@gmail.com> wrote:

> When renaming a lower/merged directory or lower non-directory, always
> add back-redirect pointer to moved object pointing to original
> location, as well as forward-redirect to whiteout at original location
> pointing to new location.   Possibly could do both with file handle.
> Whenever either is moved/created/removed pointers need to be updated.
> Hard linking a copied up file would result in duplication of the
> forward pointers.  Solves the nfs export case, since we can always
> find new location of objects based on location on lower layer.

This breaks when an ancestor of a forward-redirect is moved/removed.

A linked list would still work, but all these are bit fragile.  I feel
it's better to keep this sort of info out-of-band.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/6] ovl: consistent_fd feature
  2017-04-07  9:56           ` Miklos Szeredi
@ 2017-04-07 10:47             ` Amir Goldstein
  2017-04-07 13:03               ` Miklos Szeredi
  0 siblings, 1 reply; 25+ messages in thread
From: Amir Goldstein @ 2017-04-07 10:47 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

On Fri, Apr 7, 2017 at 12:56 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Fri, Apr 7, 2017 at 11:32 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Thu, Apr 6, 2017 at 6:25 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>
>> When renaming a lower/merged directory or lower non-directory, always
>> add back-redirect pointer to moved object pointing to original
>> location, as well as forward-redirect to whiteout at original location
>> pointing to new location.   Possibly could do both with file handle.
>> Whenever either is moved/created/removed pointers need to be updated.
>> Hard linking a copied up file would result in duplication of the
>> forward pointers.  Solves the nfs export case, since we can always
>> find new location of objects based on location on lower layer.
>
> This breaks when an ancestor of a forward-redirect is moved/removed.

Right. but may still be feasible to mix this idea with a linked list
of only renamed dirs (topology transformation info).

>
> A linked list would still work, but all these are bit fragile.  I feel
> it's better to keep this sort of info out-of-band.
>

My main concern with out-of-band metadata is that we will have to sync
this metadata and it is easier to think on metadata transactions in
terms of changes
that are natively bundled together with the original inode
modification by the fs.

For that matter, I like the forward/backward redirect by fh idea, because
affected inodes are most of the time already part of the metadata transaction.
Also, backward redirect should be by path (like it is now) so it is
agnostic for more
renames and forward redirect should be by fh for the same reason and so it
is agnostic to moved object remove as well.

This is interesting enough for me to play with as part of "promotion"
for redirect_fh,
so I may just try to sketch some rough POC for NFS export.

Come to think about it, NFS export of regular file don't need to
follow renames at all:
- The handle for a regular file is always the handle for the real
lower or upper inode
- To decode a handle, create an O_TMPFILE style overlay dentry, which
is not linked
  to any path in overlay, but has the _upperdentry/lowerstack setup
- This way, nfsd can open a read-only handle from the ancient past (why not...)
- Directory handles would have to use the lowermost inode
- To decode a directory handle we need either forward redirect pointers on any
  new object created on top of lower dir or maintain a linked list  and hash or
  all renamed dirs

Am I missing something?

Amir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/6] ovl: consistent_fd feature
  2017-04-07  9:43       ` Miklos Szeredi
@ 2017-04-07 11:04         ` Amir Goldstein
  0 siblings, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2017-04-07 11:04 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

On Fri, Apr 7, 2017 at 12:43 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, Apr 6, 2017 at 6:46 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> Yes, but do you think we should delay consistent_fd until we have a full
>> solution for constant_ino?
>> Do you think it is acceptable to fix only stat (to return lower ino) and leave
>> readdir d_ino for later?
>
> I don't know, because, as we discussed previously, these are really
> luxury bugs, and there's no push for either from real use cases.  So
> I'd add fixes in the order of becoming the obviously right fix.  I
> think we are closer to getting the constant ino resolved properly than
> consistent fd.
>
> I really want to look into sharing pages between clones.

That makes sense.
I also really want you to look into sharing pages between clone ;-)

Keep in mind that even though you are looking into sharing pages,
I think it may be still safe to propagate modification to readers lazily.

What I mean (and I may very well be talking nonsense) is that if
you choose to modify the f_mapping->host of readers some time
after copy up, maybe its enough to make a checkpoint only in
read syscall entry points?
and handle MAP_SHARED by copy up (or rocopyup) on mmap.

> "Unfortunately" I'll be taking next week off, so...
>

That is most fortunate :-)

Anyway, I sorted out the dumb rocopy race and cleaning
up the code from my tendency to over-code-reuse-with-if-else,
so will send a nicer looking version of rocopyup next week.

Amir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/6] ovl: consistent_fd feature
  2017-04-07 10:47             ` Amir Goldstein
@ 2017-04-07 13:03               ` Miklos Szeredi
  2017-04-07 15:07                 ` Amir Goldstein
  0 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2017-04-07 13:03 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-unionfs

On Fri, Apr 7, 2017 at 12:47 PM, Amir Goldstein <amir73il@gmail.com> wrote:

> Come to think about it, NFS export of regular file don't need to
> follow renames at all:
> - The handle for a regular file is always the handle for the real
> lower or upper inode
> - To decode a handle, create an O_TMPFILE style overlay dentry, which
> is not linked
>   to any path in overlay, but has the _upperdentry/lowerstack setup

I don't think nfs will allow such a scheme.  NFS3 server is stateless,
which means there's no open/close in the protocol.   Hence we can't
copy-up on open(O_WR*) and return a different file handle for writing.
If client looks up a file currently on lower and we return file handle
based on lower file, then we must be able to decode that handle after
the file has been copied up and even after rename.  And this must work
reliably even if the overlay dentry is no longer in the dcache.

So there's no option, other than to have a reverse mapping somewhere.

One more idea:  do it out-of-band (e.g. under workdir) but do it as a
plain directory tree shadowing the lower trees that contains the
forward redirect information.  It spares us the implementation of the
database, since the filesystem does it for us.  Yes, it can get out of
sync with the overlay, but so can any mapping in-band or out-of-band.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/6] ovl: consistent_fd feature
  2017-04-07 13:03               ` Miklos Szeredi
@ 2017-04-07 15:07                 ` Amir Goldstein
  0 siblings, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2017-04-07 15:07 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-unionfs

On Fri, Apr 7, 2017 at 4:03 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>
>
> One more idea:  do it out-of-band (e.g. under workdir) but do it as a
> plain directory tree shadowing the lower trees that contains the
> forward redirect information.  It spares us the implementation of the
> database, since the filesystem does it for us.  Yes, it can get out of
> sync with the overlay, but so can any mapping in-band or out-of-band.
>

I like this.

"don't implement a database of files, because filesystem is already
a super complex and optimized database of files metadata" is a
design concept I feel strongly about.

It always bothered me that is_lower_positive() is not cached
in a single place.

I am not concerned about getting out of sync to to an
implementation bug.
I am concerned about not being able to program power fail safe
code that does not require calling fsync for metadata operations.

Because the shadow directory existence is not a problem
we can safely create it anytime before upper rename
and if we set a redirect_fh on the shadow dir, we can also
do that anytime before rename and we are also safe wrt
upper rmdir.

So forget the linked list construct.
Shadow tree on-disk construct is better in all aspects
and it is easier to debug.

You do realize that this feature is RO_COMPATIBLE
because if workdir will have nested dirs overlay in
old kernel will be mounted readonly (at least from some
kernel version), which makes me like this idea even more :-)

Several lower layers complicates things a bit because
path A could have different forward redirects to different
upper dirs from different lower layers.

I can try to do a POC that assumes there are no redirects
in the lower layers and then see how we proceed from there.

Amir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/6] ovl: consistent_fd feature
  2017-04-06 15:20   ` Miklos Szeredi
  2017-04-06 15:37     ` Miklos Szeredi
  2017-04-06 16:46     ` Amir Goldstein
@ 2017-04-08  3:03     ` J. R. Okajima
  2 siblings, 0 replies; 25+ messages in thread
From: J. R. Okajima @ 2017-04-08  3:03 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Amir Goldstein, linux-unionfs

Miklos Szeredi:
> The other idea is to store the database separately from the upper tree
> (this is what aufs does, apparently).  This works for reverse mapping
> as well.  Makes all operations (except rename) more complicated.
> Keeping whole mapping in kernel memory is prone to resource hogging
> and DoS.   Could have a bitmap created by hashing the ino's that are
	:::

I am afraid you are misunderstanding a little.
Aufs XINO (external inode number translation table) is a simple regular
file.
- created and unlinked in mounting aufs, kept in-use.
- the default path is "<first writable layer>/.aufs.xino".
- users can change the path by a mount option "xino=..."
- users can change the path on the fly by remounting anytime.
- users can stop using XINO by a mount/remount option "noxino"

It should not be a kernel memory pressure (unless users put XINO on
tmpfs or something).

XINO is not a so complicated operation. Receiving the inum on the layer
fs, returns the aufs inum. Its maintenance simply follows the lifetime of
aufs inode.
- in creating aufs inode, the mapping of the layer fs inum and the aufs
  inum is added.
- in destroying aufs inode, if the link count of the layer fs inode is
  zero, then the mapping is removed.
- in copy-up, the mapping is added obviously.


J. R. Okajima

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2017-04-08  3:03 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-29 14:36 [PATCH 0/6] ovl: consistent_fd feature Amir Goldstein
2017-03-29 14:36 ` [PATCH 1/6] ovl: store path type in dentry Amir Goldstein
2017-03-29 14:36 ` [PATCH 2/6] ovl: cram opaque boolean into type flags Amir Goldstein
2017-03-29 14:36 ` [PATCH 3/6] ovl: check if all layers are on the same fs Amir Goldstein
2017-03-29 14:36 ` [PATCH 4/6] ovl: check if clone from lower to upper is supported Amir Goldstein
2017-03-29 14:36 ` [PATCH 5/6] ovl: copy on read with consistent_fd feature Amir Goldstein
2017-03-30 11:28   ` Amir Goldstein
2017-03-31 17:58   ` Vivek Goyal
2017-04-01  9:27     ` Amir Goldstein
2017-04-05 13:20       ` Amir Goldstein
2017-03-29 14:36 ` [PATCH 6/6] ovl: link upper tempfile on open for write Amir Goldstein
2017-03-30 11:26 ` [PATCH 7/7] ovl: prevent copy on read if no upper/work dir Amir Goldstein
2017-03-30 11:34 ` [PATCH 0/6] ovl: consistent_fd feature Amir Goldstein
2017-04-06 15:20   ` Miklos Szeredi
2017-04-06 15:37     ` Miklos Szeredi
2017-04-06 16:25       ` Amir Goldstein
2017-04-07  9:32         ` Miklos Szeredi
2017-04-07  9:56           ` Miklos Szeredi
2017-04-07 10:47             ` Amir Goldstein
2017-04-07 13:03               ` Miklos Szeredi
2017-04-07 15:07                 ` Amir Goldstein
2017-04-06 16:46     ` Amir Goldstein
2017-04-07  9:43       ` Miklos Szeredi
2017-04-07 11:04         ` Amir Goldstein
2017-04-08  3:03     ` J. R. Okajima

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.