linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC 00/26] VFS based Union Mount (V2)
@ 2007-07-30 16:13 Jan Blunck
  2007-07-30 16:13 ` [RFC 01/26] [PATCH 14/18] shmem: convert to using splice instead of sendfile() Jan Blunck
                   ` (27 more replies)
  0 siblings, 28 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

Here is another post of the VFS based union mount implementation. Unlike the
traditional mount which hides the contents of the mount point, union mounts
present the merged view of the mount point and the mounted filesytem.

Recent changes:
- brand new union structure no longer tied to the dentryn, now works with bind
  mounts
- generic part of the whiteout patches extracted
- introduces MS_WHITEOUT to make the white-out patches independant of the
  union-mount stuff
- uses a singleton whiteout inode for the tmpfs filesystem (I need to fix this
  for ext2/3, too)
- renaming files on unions uses copyup now
- rewrote the union mount debugging code: it is now debugfs/relay based.
- random cleanups

I'm able to compile the kernel with this patches applied on a  3 layer union
mount with the seperate layers bind mounted to different locations. I haven't
done any performance tests since I think there is a more important topic
ahead: better readdir() support.

This series is against 2.6.22-rc6-mm1.

Comments are welcome,
Jan

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 01/26] [PATCH 14/18] shmem: convert to using splice instead of sendfile()
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 02/26] VFS: Export dput_path() and path_to_nameidata() Jan Blunck
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/shmem-convert-to-splice.diff --]
[-- Type: text/plain, Size: 3292 bytes --]

From: Hugh Dickins <hugh@veritas.com>

Remove shmem_file_sendfile and resurrect shmem_readpage, as used by tmpfs
to support loop and sendfile in 2.4 and 2.5.  Now tmpfs can support splice,
loop and sendfile in the simplest way, using generic_file_splice_read and
generic_file_splice_write (with the aid of shmem_prepare_write).

We could make some efficiency tweaks later, if there's a real need;
but this is stable and works well as is.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 mm/shmem.c |   40 ++++++++++++++++------------------------
 1 file changed, 16 insertions(+), 24 deletions(-)

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1109,8 +1109,8 @@ static int shmem_getpage(struct inode *i
 	 * Normally, filepage is NULL on entry, and either found
 	 * uptodate immediately, or allocated and zeroed, or read
 	 * in under swappage, which is then assigned to filepage.
-	 * But shmem_write_begin passes in a locked filepage,
-	 * which may be found not uptodate by other callers too,
+	 * But shmem_readpage and shmem_write_begin passes in a locked
+	 * filepage, which may be found not uptodate by other callers too,
 	 * and may need to be copied from the swappage read in.
 	 */
 repeat:
@@ -1454,9 +1454,18 @@ static const struct inode_operations shm
 static const struct inode_operations shmem_symlink_inline_operations;
 
 /*
- * Normally tmpfs makes no use of shmem_write_begin, but it
- * lets a tmpfs file be used read-write below the loop driver.
+ * Normally tmpfs avoids the use of shmem_readpage and shmem_write_begin;
+ * but providing them allows a tmpfs file to be used for splice, sendfile, and
+ * below the loop driver, in the generic fashion that many filesystems support.
  */
+static int shmem_readpage(struct file *file, struct page *page)
+{
+	struct inode *inode = page->mapping->host;
+	int error = shmem_getpage(inode, page->index, &page, SGP_CACHE, NULL);
+	unlock_page(page);
+	return error;
+}
+
 static int
 shmem_write_begin(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
@@ -1701,25 +1710,6 @@ static ssize_t shmem_file_read(struct fi
 	return desc.error;
 }
 
-static ssize_t shmem_file_sendfile(struct file *in_file, loff_t *ppos,
-			 size_t count, read_actor_t actor, void *target)
-{
-	read_descriptor_t desc;
-
-	if (!count)
-		return 0;
-
-	desc.written = 0;
-	desc.count = count;
-	desc.arg.data = target;
-	desc.error = 0;
-
-	do_shmem_file_read(in_file, ppos, &desc, actor);
-	if (desc.written)
-		return desc.written;
-	return desc.error;
-}
-
 static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb);
@@ -2376,6 +2366,7 @@ static const struct address_space_operat
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
 #ifdef CONFIG_TMPFS
+	.readpage	= shmem_readpage,
 	.write_begin	= shmem_write_begin,
 	.write_end	= shmem_write_end,
 #endif
@@ -2389,7 +2380,8 @@ static const struct file_operations shme
 	.read		= shmem_file_read,
 	.write		= shmem_file_write,
 	.fsync		= simple_sync_file,
-	.sendfile	= shmem_file_sendfile,
+	.splice_read	= generic_file_splice_read,
+	.splice_write	= generic_file_splice_write,
 #endif
 };
 

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 02/26] VFS: Export dput_path() and path_to_nameidata()
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
  2007-07-30 16:13 ` [RFC 01/26] [PATCH 14/18] shmem: convert to using splice instead of sendfile() Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 03/26] VFS: Make lookup_hash() return a struct path Jan Blunck
                   ` (25 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/vfs-export-dput_path.diff --]
[-- Type: text/plain, Size: 1374 bytes --]

This patch makes dput_path() and path_to_nameidata() general available.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/namei.c            |   16 ----------------
 include/linux/namei.h |   15 +++++++++++++++
 2 files changed, 15 insertions(+), 16 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -573,22 +573,6 @@ fail:
 	return PTR_ERR(link);
 }
 
-static inline void dput_path(struct path *path, struct nameidata *nd)
-{
-	dput(path->dentry);
-	if (path->mnt != nd->mnt)
-		mntput(path->mnt);
-}
-
-static inline void path_to_nameidata(struct path *path, struct nameidata *nd)
-{
-	dput(nd->dentry);
-	if (nd->mnt != path->mnt)
-		mntput(nd->mnt);
-	nd->mnt = path->mnt;
-	nd->dentry = path->dentry;
-}
-
 static __always_inline int __do_follow_link(struct path *path, struct nameidata *nd)
 {
 	int error;
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -119,5 +119,20 @@ static inline void pathput(struct path *
 	dput(path->dentry);
 	mntput(path->mnt);
 }
+static inline void dput_path(struct path *path, struct nameidata *nd)
+{
+	dput(path->dentry);
+	if (path->mnt != nd->mnt)
+		mntput(path->mnt);
+}
+
+static inline void path_to_nameidata(struct path *path, struct nameidata *nd)
+{
+	dput(nd->dentry);
+	if (nd->mnt != path->mnt)
+		mntput(nd->mnt);
+	nd->mnt = path->mnt;
+	nd->dentry = path->dentry;
+}
 
 #endif /* _LINUX_NAMEI_H */

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 03/26] VFS: Make lookup_hash() return a struct path
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
  2007-07-30 16:13 ` [RFC 01/26] [PATCH 14/18] shmem: convert to using splice instead of sendfile() Jan Blunck
  2007-07-30 16:13 ` [RFC 02/26] VFS: Export dput_path() and path_to_nameidata() Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 04/26] VFS: Make lookup_create() " Jan Blunck
                   ` (24 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/vfs-lookup_hash-returns-path.diff --]
[-- Type: text/plain, Size: 7214 bytes --]

This patch changes lookup_hash() into returning a struct path.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/namei.c |  113 ++++++++++++++++++++++++++++++-------------------------------
 1 file changed, 57 insertions(+), 56 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1297,27 +1297,27 @@ out:
  * needs parent already locked. Doesn't follow mounts.
  * SMP-safe.
  */
-static inline struct dentry * __lookup_hash(struct qstr *name, struct dentry *base, struct nameidata *nd)
+static int lookup_hash(struct nameidata *nd, struct qstr *name,
+		       struct path *path)
 {
-	struct dentry *dentry;
 	struct inode *inode;
 	int err;
 
-	inode = base->d_inode;
+	inode = nd->dentry->d_inode;
 
 	err = permission(inode, MAY_EXEC, nd);
-	dentry = ERR_PTR(err);
 	if (err)
 		goto out;
 
-	dentry = __lookup_hash_kern(name, base, nd);
+	path->mnt =  nd->mnt;
+	path->dentry = __lookup_hash_kern(name, nd->dentry, nd);
+	if (IS_ERR(path->dentry)) {
+		err = PTR_ERR(path->dentry);
+		path->dentry = NULL;
+		path->mnt = NULL;
+	}
 out:
-	return dentry;
-}
-
-static struct dentry *lookup_hash(struct nameidata *nd)
-{
-	return __lookup_hash(&nd->last, nd->dentry, nd);
+	return err;
 }
 
 /* SMP-safe */
@@ -1351,7 +1351,10 @@ struct dentry *lookup_one_len_nd(const c
 	err = __lookup_one_len(name, &this, base, len);
 	if (err)
 		return ERR_PTR(err);
-	return __lookup_hash(&this, base, nd);
+	err = permission(base->d_inode, MAY_EXEC, nd);
+	if (err)
+		return ERR_PTR(err);
+	return __lookup_hash_kern(&this, base, nd);
 }
 
 struct dentry *lookup_one_len_kern(const char *name, struct dentry *base, int len)
@@ -1709,12 +1712,10 @@ int open_namei(int dfd, const char *path
 	dir = nd->dentry;
 	nd->flags &= ~LOOKUP_PARENT;
 	mutex_lock(&dir->d_inode->i_mutex);
-	path.dentry = lookup_hash(nd);
-	path.mnt = nd->mnt;
+	error = lookup_hash(nd, &nd->last, &path);
 
 do_last:
-	error = PTR_ERR(path.dentry);
-	if (IS_ERR(path.dentry)) {
+	if (error) {
 		mutex_unlock(&dir->d_inode->i_mutex);
 		goto exit;
 	}
@@ -1817,8 +1818,7 @@ do_link:
 	}
 	dir = nd->dentry;
 	mutex_lock(&dir->d_inode->i_mutex);
-	path.dentry = lookup_hash(nd);
-	path.mnt = nd->mnt;
+	error = lookup_hash(nd, &nd->last, &path);
 	__putname(nd->last.name);
 	goto do_last;
 }
@@ -1835,7 +1835,8 @@ do_link:
  */
 struct dentry *lookup_create(struct nameidata *nd, int is_dir)
 {
-	struct dentry *dentry = ERR_PTR(-EEXIST);
+	struct path path = { .dentry = ERR_PTR(-EEXIST) } ;
+	int err;
 
 	mutex_lock_nested(&nd->dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	/*
@@ -1851,9 +1852,11 @@ struct dentry *lookup_create(struct name
 	/*
 	 * Do the final lookup.
 	 */
-	dentry = lookup_hash(nd);
-	if (IS_ERR(dentry))
+	err = lookup_hash(nd, &nd->last, &path);
+	if (err) {
+		path.dentry = ERR_PTR(err);
 		goto fail;
+	}
 
 	/*
 	 * Special case - lookup gave negative, but... we had foo/bar/
@@ -1861,14 +1864,16 @@ struct dentry *lookup_create(struct name
 	 * all is fine. Let's be bastards - you had / on the end, you've
 	 * been asking for (non-existent) directory. -ENOENT for you.
 	 */
-	if (!is_dir && nd->last.name[nd->last.len] && !dentry->d_inode)
+	if (!is_dir && nd->last.name[nd->last.len] && !path.dentry->d_inode)
 		goto enoent;
-	return dentry;
+	if (nd->mnt != path.mnt)
+		mntput(path.mnt);
+	return path.dentry;
 enoent:
-	dput(dentry);
-	dentry = ERR_PTR(-ENOENT);
+	dput_path(&path, nd);
+	path.dentry = ERR_PTR(-ENOENT);
 fail:
-	return dentry;
+	return path.dentry;
 }
 EXPORT_SYMBOL_GPL(lookup_create);
 
@@ -2075,7 +2080,7 @@ static long do_rmdir(int dfd, const char
 {
 	int error = 0;
 	char * name;
-	struct dentry *dentry;
+	struct path path;
 	struct nameidata nd;
 
 	name = getname(pathname);
@@ -2098,12 +2103,11 @@ static long do_rmdir(int dfd, const char
 			goto exit1;
 	}
 	mutex_lock_nested(&nd.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
-	dentry = lookup_hash(&nd);
-	error = PTR_ERR(dentry);
-	if (IS_ERR(dentry))
+	error = lookup_hash(&nd, &nd.last, &path);
+	if (error)
 		goto exit2;
-	error = vfs_rmdir(nd.dentry->d_inode, dentry);
-	dput(dentry);
+	error = vfs_rmdir(nd.dentry->d_inode, path.dentry);
+	dput_path(&path, &nd);
 exit2:
 	mutex_unlock(&nd.dentry->d_inode->i_mutex);
 exit1:
@@ -2158,7 +2162,7 @@ static long do_unlinkat(int dfd, const c
 {
 	int error = 0;
 	char * name;
-	struct dentry *dentry;
+	struct path path;
 	struct nameidata nd;
 	struct inode *inode = NULL;
 
@@ -2173,18 +2177,17 @@ static long do_unlinkat(int dfd, const c
 	if (nd.last_type != LAST_NORM)
 		goto exit1;
 	mutex_lock_nested(&nd.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
-	dentry = lookup_hash(&nd);
-	error = PTR_ERR(dentry);
-	if (!IS_ERR(dentry)) {
+	error = lookup_hash(&nd, &nd.last, &path);
+	if (!error) {
 		/* Why not before? Because we want correct error value */
 		if (nd.last.name[nd.last.len])
 			goto slashes;
-		inode = dentry->d_inode;
+		inode = path.dentry->d_inode;
 		if (inode)
 			atomic_inc(&inode->i_count);
-		error = vfs_unlink(nd.dentry->d_inode, dentry);
+		error = vfs_unlink(nd.dentry->d_inode, path.dentry);
 	exit2:
-		dput(dentry);
+		dput_path(&path, &nd);
 	}
 	mutex_unlock(&nd.dentry->d_inode->i_mutex);
 	if (inode)
@@ -2196,8 +2199,8 @@ exit:
 	return error;
 
 slashes:
-	error = !dentry->d_inode ? -ENOENT :
-		S_ISDIR(dentry->d_inode->i_mode) ? -EISDIR : -ENOTDIR;
+	error = !path.dentry->d_inode ? -ENOENT :
+		S_ISDIR(path.dentry->d_inode->i_mode) ? -EISDIR : -ENOTDIR;
 	goto exit2;
 }
 
@@ -2528,7 +2531,7 @@ static int do_rename(int olddfd, const c
 {
 	int error = 0;
 	struct dentry * old_dir, * new_dir;
-	struct dentry * old_dentry, *new_dentry;
+	struct path old, new;
 	struct dentry * trap;
 	struct nameidata oldnd, newnd;
 
@@ -2555,16 +2558,15 @@ static int do_rename(int olddfd, const c
 
 	trap = lock_rename(new_dir, old_dir);
 
-	old_dentry = lookup_hash(&oldnd);
-	error = PTR_ERR(old_dentry);
-	if (IS_ERR(old_dentry))
+	error = lookup_hash(&oldnd, &oldnd.last, &old);
+	if (error)
 		goto exit3;
 	/* source must exist */
 	error = -ENOENT;
-	if (!old_dentry->d_inode)
+	if (!old.dentry->d_inode)
 		goto exit4;
 	/* unless the source is a directory trailing slashes give -ENOTDIR */
-	if (!S_ISDIR(old_dentry->d_inode->i_mode)) {
+	if (!S_ISDIR(old.dentry->d_inode->i_mode)) {
 		error = -ENOTDIR;
 		if (oldnd.last.name[oldnd.last.len])
 			goto exit4;
@@ -2573,23 +2575,22 @@ static int do_rename(int olddfd, const c
 	}
 	/* source should not be ancestor of target */
 	error = -EINVAL;
-	if (old_dentry == trap)
+	if (old.dentry == trap)
 		goto exit4;
-	new_dentry = lookup_hash(&newnd);
-	error = PTR_ERR(new_dentry);
-	if (IS_ERR(new_dentry))
+	error = lookup_hash(&newnd, &newnd.last, &new);
+	if (error)
 		goto exit4;
 	/* target should not be an ancestor of source */
 	error = -ENOTEMPTY;
-	if (new_dentry == trap)
+	if (new.dentry == trap)
 		goto exit5;
 
-	error = vfs_rename(old_dir->d_inode, old_dentry,
-				   new_dir->d_inode, new_dentry);
+	error = vfs_rename(old_dir->d_inode, old.dentry,
+				   new_dir->d_inode, new.dentry);
 exit5:
-	dput(new_dentry);
+	dput_path(&new, &newnd);
 exit4:
-	dput(old_dentry);
+	dput_path(&old, &oldnd);
 exit3:
 	unlock_rename(new_dir, old_dir);
 exit2:

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 04/26] VFS: Make lookup_create() return a struct path
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (2 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 03/26] VFS: Make lookup_hash() return a struct path Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 05/26] VFS: cache_lookup() cleanup Jan Blunck
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/vfs-lookup_create-returns-path.diff --]
[-- Type: text/plain, Size: 9074 bytes --]

This patch changes lookup_create() into returning a struct path.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 arch/powerpc/platforms/cell/spufs/inode.c |   15 ++----
 fs/namei.c                                |   75 +++++++++++++-----------------
 include/linux/dcache.h                    |    1 
 include/linux/namei.h                     |    1 
 net/unix/af_unix.c                        |   17 +++---
 5 files changed, 50 insertions(+), 59 deletions(-)

--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -456,7 +456,7 @@ static struct file_system_type spufs_typ
 
 long spufs_create(struct nameidata *nd, unsigned int flags, mode_t mode)
 {
-	struct dentry *dentry;
+	struct path path;
 	int ret;
 
 	ret = -EINVAL;
@@ -475,26 +475,25 @@ long spufs_create(struct nameidata *nd, 
 			goto out;
 	}
 
-	dentry = lookup_create(nd, 1);
-	ret = PTR_ERR(dentry);
-	if (IS_ERR(dentry))
+	ret = lookup_create(nd, 1, &path);
+	if (ret)
 		goto out_dir;
 
 	ret = -EEXIST;
-	if (dentry->d_inode)
+	if (path.dentry->d_inode)
 		goto out_dput;
 
 	mode &= ~current->fs->umask;
 
 	if (flags & SPU_CREATE_GANG)
 		return spufs_create_gang(nd->dentry->d_inode,
-					dentry, nd->mnt, mode);
+					 path.dentry, path.mnt, mode);
 	else
 		return spufs_create_context(nd->dentry->d_inode,
-					dentry, nd->mnt, flags, mode);
+					    path.dentry, path.mnt, flags, mode);
 
 out_dput:
-	dput(dentry);
+	dput_path(&path, nd);
 out_dir:
 	mutex_unlock(&nd->dentry->d_inode->i_mutex);
 out:
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1833,10 +1833,9 @@ do_link:
  *
  * Returns with nd->dentry->d_inode->i_mutex locked.
  */
-struct dentry *lookup_create(struct nameidata *nd, int is_dir)
+int lookup_create(struct nameidata *nd, int is_dir, struct path *path)
 {
-	struct path path = { .dentry = ERR_PTR(-EEXIST) } ;
-	int err;
+	int err = -EEXIST;
 
 	mutex_lock_nested(&nd->dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	/*
@@ -1852,11 +1851,9 @@ struct dentry *lookup_create(struct name
 	/*
 	 * Do the final lookup.
 	 */
-	err = lookup_hash(nd, &nd->last, &path);
-	if (err) {
-		path.dentry = ERR_PTR(err);
+	err = lookup_hash(nd, &nd->last, path);
+	if (err)
 		goto fail;
-	}
 
 	/*
 	 * Special case - lookup gave negative, but... we had foo/bar/
@@ -1864,16 +1861,14 @@ struct dentry *lookup_create(struct name
 	 * all is fine. Let's be bastards - you had / on the end, you've
 	 * been asking for (non-existent) directory. -ENOENT for you.
 	 */
-	if (!is_dir && nd->last.name[nd->last.len] && !path.dentry->d_inode)
+	if (!is_dir && nd->last.name[nd->last.len] && !path->dentry->d_inode)
 		goto enoent;
-	if (nd->mnt != path.mnt)
-		mntput(path.mnt);
-	return path.dentry;
+	return 0;
 enoent:
-	dput_path(&path, nd);
-	path.dentry = ERR_PTR(-ENOENT);
+	dput_path(path, nd);
+	err = -ENOENT;
 fail:
-	return path.dentry;
+	return err;
 }
 EXPORT_SYMBOL_GPL(lookup_create);
 
@@ -1906,7 +1901,7 @@ asmlinkage long sys_mknodat(int dfd, con
 {
 	int error = 0;
 	char * tmp;
-	struct dentry * dentry;
+	struct path path;
 	struct nameidata nd;
 
 	if (S_ISDIR(mode))
@@ -1918,22 +1913,23 @@ asmlinkage long sys_mknodat(int dfd, con
 	error = do_path_lookup(dfd, tmp, LOOKUP_PARENT, &nd);
 	if (error)
 		goto out;
-	dentry = lookup_create(&nd, 0);
-	error = PTR_ERR(dentry);
+	error = lookup_create(&nd, 0, &path);
 
 	if (!IS_POSIXACL(nd.dentry->d_inode))
 		mode &= ~current->fs->umask;
-	if (!IS_ERR(dentry)) {
+	if (!error) {
 		switch (mode & S_IFMT) {
 		case 0: case S_IFREG:
-			error = vfs_create(nd.dentry->d_inode,dentry,mode,&nd);
+			error = vfs_create(nd.dentry->d_inode, path.dentry,
+					   mode, &nd);
 			break;
 		case S_IFCHR: case S_IFBLK:
-			error = vfs_mknod(nd.dentry->d_inode,dentry,mode,
-					new_decode_dev(dev));
+			error = vfs_mknod(nd.dentry->d_inode, path.dentry,
+					  mode, new_decode_dev(dev));
 			break;
 		case S_IFIFO: case S_IFSOCK:
-			error = vfs_mknod(nd.dentry->d_inode,dentry,mode,0);
+			error = vfs_mknod(nd.dentry->d_inode, path.dentry,
+					  mode, 0);
 			break;
 		case S_IFDIR:
 			error = -EPERM;
@@ -1941,7 +1937,7 @@ asmlinkage long sys_mknodat(int dfd, con
 		default:
 			error = -EINVAL;
 		}
-		dput(dentry);
+		dput_path(&path, &nd);
 	}
 	mutex_unlock(&nd.dentry->d_inode->i_mutex);
 	path_release(&nd);
@@ -1982,7 +1978,7 @@ asmlinkage long sys_mkdirat(int dfd, con
 {
 	int error = 0;
 	char * tmp;
-	struct dentry *dentry;
+	struct path path;
 	struct nameidata nd;
 
 	tmp = getname(pathname);
@@ -1993,15 +1989,14 @@ asmlinkage long sys_mkdirat(int dfd, con
 	error = do_path_lookup(dfd, tmp, LOOKUP_PARENT, &nd);
 	if (error)
 		goto out;
-	dentry = lookup_create(&nd, 1);
-	error = PTR_ERR(dentry);
-	if (IS_ERR(dentry))
+	error = lookup_create(&nd, 1, &path);
+	if (error)
 		goto out_unlock;
 
 	if (!IS_POSIXACL(nd.dentry->d_inode))
 		mode &= ~current->fs->umask;
-	error = vfs_mkdir(nd.dentry->d_inode, dentry, mode);
-	dput(dentry);
+	error = vfs_mkdir(nd.dentry->d_inode, path.dentry, mode);
+	dput_path(&path, &nd);
 out_unlock:
 	mutex_unlock(&nd.dentry->d_inode->i_mutex);
 	path_release(&nd);
@@ -2247,7 +2242,7 @@ asmlinkage long sys_symlinkat(const char
 	int error = 0;
 	char * from;
 	char * to;
-	struct dentry *dentry;
+	struct path path;
 	struct nameidata nd;
 
 	from = getname(oldname);
@@ -2261,13 +2256,12 @@ asmlinkage long sys_symlinkat(const char
 	error = do_path_lookup(newdfd, to, LOOKUP_PARENT, &nd);
 	if (error)
 		goto out;
-	dentry = lookup_create(&nd, 0);
-	error = PTR_ERR(dentry);
-	if (IS_ERR(dentry))
+	error = lookup_create(&nd, 0, &path);
+	if (error)
 		goto out_unlock;
 
-	error = vfs_symlink(nd.dentry->d_inode, dentry, from, S_IALLUGO);
-	dput(dentry);
+	error = vfs_symlink(nd.dentry->d_inode, path.dentry, from, S_IALLUGO);
+	dput_path(&path, &nd);
 out_unlock:
 	mutex_unlock(&nd.dentry->d_inode->i_mutex);
 	path_release(&nd);
@@ -2334,7 +2328,7 @@ asmlinkage long sys_linkat(int olddfd, c
 			   int newdfd, const char __user *newname,
 			   int flags)
 {
-	struct dentry *new_dentry;
+	struct path path;
 	struct nameidata nd, old_nd;
 	int error;
 	char * to;
@@ -2357,12 +2351,11 @@ asmlinkage long sys_linkat(int olddfd, c
 	error = -EXDEV;
 	if (old_nd.mnt != nd.mnt)
 		goto out_release;
-	new_dentry = lookup_create(&nd, 0);
-	error = PTR_ERR(new_dentry);
-	if (IS_ERR(new_dentry))
+	error = lookup_create(&nd, 0, &path);
+	if (error)
 		goto out_unlock;
-	error = vfs_link(old_nd.dentry, nd.dentry->d_inode, new_dentry);
-	dput(new_dentry);
+	error = vfs_link(old_nd.dentry, nd.dentry->d_inode, path.dentry);
+	dput_path(&path, &nd);
 out_unlock:
 	mutex_unlock(&nd.dentry->d_inode->i_mutex);
 out_release:
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -358,7 +358,6 @@ static inline int d_mountpoint(struct de
 
 extern struct vfsmount *lookup_mnt(struct vfsmount *, struct dentry *);
 extern struct vfsmount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
-extern struct dentry *lookup_create(struct nameidata *nd, int is_dir);
 
 extern int sysctl_vfs_cache_pressure;
 
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -91,6 +91,7 @@ static inline struct dentry *lookup_one_
 {
 	return lookup_one_len_nd(name, dir, len, NULL);
 }
+extern int lookup_create(struct nameidata *, int, struct path *);
 
 extern int follow_down(struct vfsmount **, struct dentry **);
 extern int follow_up(struct vfsmount **, struct dentry **);
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -749,8 +749,8 @@ static int unix_bind(struct socket *sock
 	struct sock *sk = sock->sk;
 	struct unix_sock *u = unix_sk(sk);
 	struct sockaddr_un *sunaddr=(struct sockaddr_un *)uaddr;
-	struct dentry * dentry = NULL;
 	struct nameidata nd;
+	struct path path;
 	int err;
 	unsigned hash;
 	struct unix_address *addr;
@@ -797,9 +797,8 @@ static int unix_bind(struct socket *sock
 		if (err)
 			goto out_mknod_parent;
 
-		dentry = lookup_create(&nd, 0);
-		err = PTR_ERR(dentry);
-		if (IS_ERR(dentry))
+		err = lookup_create(&nd, 0, &path);
+		if (err)
 			goto out_mknod_unlock;
 
 		/*
@@ -807,12 +806,11 @@ static int unix_bind(struct socket *sock
 		 */
 		mode = S_IFSOCK |
 		       (SOCK_INODE(sock)->i_mode & ~current->fs->umask);
-		err = vfs_mknod(nd.dentry->d_inode, dentry, mode, 0);
+		err = vfs_mknod(nd.dentry->d_inode, path.dentry, mode, 0);
 		if (err)
 			goto out_mknod_dput;
 		mutex_unlock(&nd.dentry->d_inode->i_mutex);
-		dput(nd.dentry);
-		nd.dentry = dentry;
+		path_to_nameidata(&path, &nd);
 
 		addr->hash = UNIX_HASH_SIZE;
 	}
@@ -829,7 +827,8 @@ static int unix_bind(struct socket *sock
 
 		list = &unix_socket_table[addr->hash];
 	} else {
-		list = &unix_socket_table[dentry->d_inode->i_ino & (UNIX_HASH_SIZE-1)];
+		list = &unix_socket_table[nd.dentry->d_inode->i_ino &
+					  (UNIX_HASH_SIZE-1)];
 		u->dentry = nd.dentry;
 		u->mnt    = nd.mnt;
 	}
@@ -847,7 +846,7 @@ out:
 	return err;
 
 out_mknod_dput:
-	dput(dentry);
+	dput_path(&path, &nd);
 out_mknod_unlock:
 	mutex_unlock(&nd.dentry->d_inode->i_mutex);
 	path_release(&nd);

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 05/26] VFS: cache_lookup() cleanup
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (3 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 04/26] VFS: Make lookup_create() " Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 06/26] VFS: Make real_lookup() return a struct path Jan Blunck
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/vfs-cache_lookup-cleanup.diff --]
[-- Type: text/plain, Size: 1271 bytes --]

cache_lookup() can directly use d_lookup() instead of calling __d_lookup()
first since rename_lock is a seq_lock.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/namei.c |   13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -403,15 +403,10 @@ do_revalidate(struct dentry *dentry, str
  * Internal lookup() using the new generic dcache.
  * SMP-safe
  */
-static struct dentry * cached_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
+static struct dentry *cache_lookup(struct dentry *parent, struct qstr *name,
+				   struct nameidata *nd)
 {
-	struct dentry * dentry = __d_lookup(parent, name);
-
-	/* lockess __d_lookup may fail due to concurrent d_move() 
-	 * in some unrelated directory, so try with d_lookup
-	 */
-	if (!dentry)
-		dentry = d_lookup(parent, name);
+	struct dentry *dentry = d_lookup(parent, name);
 
 	if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
 		dentry = do_revalidate(dentry, nd);
@@ -1276,7 +1271,7 @@ static inline struct dentry *__lookup_ha
 			goto out;
 	}
 
-	dentry = cached_lookup(base, name, nd);
+	dentry = cache_lookup(base, name, nd);
 	if (!dentry) {
 		struct dentry *new = d_alloc(base, name);
 		dentry = ERR_PTR(-ENOMEM);

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 06/26] VFS: Make real_lookup() return a struct path
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (4 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 05/26] VFS: cache_lookup() cleanup Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 07/26] VFS: Introduce dput() variante that maintains a kill-list Jan Blunck
                   ` (21 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/vfs-real_lookup-returns-path.diff --]
[-- Type: text/plain, Size: 3506 bytes --]

This patch changes real_lookup() into returning a struct path.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/namei.c |   77 ++++++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 48 insertions(+), 29 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -462,10 +462,11 @@ ok:
  * make sure that nobody added the entry to the dcache in the meantime..
  * SMP-safe
  */
-static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
+static int real_lookup(struct nameidata *nd, struct qstr *name,
+		       struct path *path)
 {
-	struct dentry * result;
-	struct inode *dir = parent->d_inode;
+	struct inode *dir = nd->dentry->d_inode;
+	int res = 0;
 
 	mutex_lock(&dir->i_mutex);
 	/*
@@ -482,19 +483,27 @@ static struct dentry * real_lookup(struc
 	 *
 	 * so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
 	 */
-	result = d_lookup(parent, name);
-	if (!result) {
-		struct dentry * dentry = d_alloc(parent, name);
-		result = ERR_PTR(-ENOMEM);
+	path->dentry = d_lookup(nd->dentry, name);
+	path->mnt = nd->mnt;
+	if (!path->dentry) {
+		struct dentry *dentry = d_alloc(nd->dentry, name);
 		if (dentry) {
-			result = dir->i_op->lookup(dir, dentry, nd);
-			if (result)
+			path->dentry = dir->i_op->lookup(dir, dentry, nd);
+			if (path->dentry) {
 				dput(dentry);
-			else
-				result = dentry;
+				if (IS_ERR(path->dentry)) {
+					res = PTR_ERR(path->dentry);
+					path->dentry = NULL;
+					path->mnt = NULL;
+				}
+			} else
+				path->dentry = dentry;
+		} else {
+			res = -ENOMEM;
+			path->mnt = NULL;
 		}
 		mutex_unlock(&dir->i_mutex);
-		return result;
+		return res;
 	}
 
 	/*
@@ -502,12 +511,20 @@ static struct dentry * real_lookup(struc
 	 * we waited on the semaphore. Need to revalidate.
 	 */
 	mutex_unlock(&dir->i_mutex);
-	if (result->d_op && result->d_op->d_revalidate) {
-		result = do_revalidate(result, nd);
-		if (!result)
-			result = ERR_PTR(-ENOENT);
+	if (path->dentry->d_op && path->dentry->d_op->d_revalidate) {
+		path->dentry = do_revalidate(path->dentry, nd);
+		if (!path->dentry) {
+			res = -ENOENT;
+			path->mnt = NULL;
+		}
+		if (IS_ERR(path->dentry)) {
+			res = PTR_ERR(path->dentry);
+			path->dentry = NULL;
+			path->mnt = NULL;
+		}
 	}
-	return result;
+
+	return res;
 }
 
 static int __emul_lookup_dentry(const char *, struct nameidata *);
@@ -748,35 +765,37 @@ static __always_inline void follow_dotdo
 static int do_lookup(struct nameidata *nd, struct qstr *name,
 		     struct path *path)
 {
-	struct vfsmount *mnt = nd->mnt;
-	struct dentry *dentry = __d_lookup(nd->dentry, name);
+	int err;
 
-	if (!dentry)
+	path->dentry = __d_lookup(nd->dentry, name);
+	path->mnt = nd->mnt;
+	if (!path->dentry)
 		goto need_lookup;
-	if (dentry->d_op && dentry->d_op->d_revalidate)
+	if (path->dentry->d_op && path->dentry->d_op->d_revalidate)
 		goto need_revalidate;
+
 done:
-	path->mnt = mnt;
-	path->dentry = dentry;
 	__follow_mount(path);
 	return 0;
 
 need_lookup:
-	dentry = real_lookup(nd->dentry, name, nd);
-	if (IS_ERR(dentry))
+	err = real_lookup(nd, name, path);
+	if (err)
 		goto fail;
 	goto done;
 
 need_revalidate:
-	dentry = do_revalidate(dentry, nd);
-	if (!dentry)
+	path->dentry = do_revalidate(path->dentry, nd);
+	if (!path->dentry)
 		goto need_lookup;
-	if (IS_ERR(dentry))
+	if (IS_ERR(path->dentry)) {
+		err = PTR_ERR(path->dentry);
 		goto fail;
+	}
 	goto done;
 
 fail:
-	return PTR_ERR(dentry);
+	return err;
 }
 
 /*

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 07/26] VFS: Introduce dput() variante that maintains a kill-list
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (5 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 06/26] VFS: Make real_lookup() return a struct path Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 08/26] VFS: Export lives_below_in_same_fs() Jan Blunck
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/vfs-dput_freelist.diff --]
[-- Type: text/plain, Size: 3264 bytes --]

This patch introduces a new variant of dput(). This becomes necessary to
prevent a recursive call to dput() from the union mount code.

  void __dput(struct dentry *dentry, struct list_head *list);

__dput() works mostly like the original dput() did. The main difference is
that it doesn't do a full d_kill() at the end but puts the dentry on a list as
soon as it isn't reachable anymore. Therefore the union mount code can savely
call __dput() when it wants to get rid of underlying dentry references during
a dput(). After calling __dput() the caller must make sure that on all
dentries __d_kill_final() is called. __d_kill_final() is actually doing the
dentry_iput() and is also dereferencing the parent.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/dcache.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 55 insertions(+), 5 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -129,19 +129,56 @@ static void dentry_iput(struct dentry * 
  *
  * If this is the root of the dentry tree, return NULL.
  */
-static struct dentry *d_kill(struct dentry *dentry)
+static struct dentry *__d_kill(struct dentry *dentry, struct list_head *list)
 {
 	struct dentry *parent;
 
 	list_del(&dentry->d_u.d_child);
 	dentry_stat.nr_dentry--;	/* For d_free, below */
-	/*drops the locks, at that point nobody can reach this dentry */
+
+	if (list) {
+		list_del_init(&dentry->d_alias);
+		/* at this point nobody can reach this dentry */
+		list_add(&dentry->d_lru, list);
+		spin_unlock(&dentry->d_lock);
+		spin_unlock(&dcache_lock);
+		return NULL;
+	}
+
+	/* drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
 	parent = dentry->d_parent;
 	d_free(dentry);
 	return dentry == parent ? NULL : parent;
 }
 
+void __dput(struct dentry *, struct list_head *);
+
+static void __d_kill_final(struct dentry *dentry, struct list_head *list)
+{
+	struct dentry *parent = dentry->d_parent;
+	struct inode *inode = dentry->d_inode;
+
+	if (inode) {
+		dentry->d_inode = NULL;
+		if (!inode->i_nlink)
+			fsnotify_inoderemove(inode);
+		if (dentry->d_op && dentry->d_op->d_iput)
+			dentry->d_op->d_iput(dentry, inode);
+		else
+			iput(inode);
+	}
+
+	d_free(dentry);
+	if (dentry != parent)
+		__dput(parent, list);
+}
+
+static struct dentry *d_kill(struct dentry *dentry)
+{
+	return __d_kill(dentry, NULL);
+}
+
 /* 
  * This is dput
  *
@@ -171,7 +208,7 @@ static struct dentry *d_kill(struct dent
  * no dcache lock, please.
  */
 
-void dput(struct dentry *dentry)
+void __dput(struct dentry *dentry, struct list_head *list)
 {
 	if (!dentry)
 		return;
@@ -215,14 +252,27 @@ kill_it:
 	 * delete it from there
 	 */
 	if (!list_empty(&dentry->d_lru)) {
-		list_del(&dentry->d_lru);
+		list_del_init(&dentry->d_lru);
 		dentry_stat.nr_unused--;
 	}
-	dentry = d_kill(dentry);
+
+	dentry = __d_kill(dentry, list);
 	if (dentry)
 		goto repeat;
 }
 
+void dput(struct dentry *dentry)
+{
+	LIST_HEAD(mortuary);
+
+	__dput(dentry, &mortuary);
+	while (!list_empty(&mortuary)) {
+		dentry = list_entry(mortuary.next, struct dentry, d_lru);
+		list_del(&dentry->d_lru);
+		__d_kill_final(dentry, &mortuary);
+	}
+}
+
 /**
  * d_invalidate - invalidate a dentry
  * @dentry: dentry to invalidate

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 08/26] VFS: Export lives_below_in_same_fs()
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (6 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 07/26] VFS: Introduce dput() variante that maintains a kill-list Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 09/26] linux/stat.h: Add the filetype white-out Jan Blunck
                   ` (19 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/vfs-export-lives_below_in_same_fs.diff --]
[-- Type: text/plain, Size: 1076 bytes --]

Export lives_below_in_same_fs() for use in union mount code.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/namespace.c        |    3 ++-
 include/linux/mount.h |    1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -793,7 +793,7 @@ static bool permit_mount(struct nameidat
 	return true;
 }
 
-static int lives_below_in_same_fs(struct dentry *d, struct dentry *dentry)
+int lives_below_in_same_fs(struct dentry *d, struct dentry *dentry)
 {
 	while (1) {
 		if (d == dentry)
@@ -803,6 +803,7 @@ static int lives_below_in_same_fs(struct
 		d = d->d_parent;
 	}
 }
+EXPORT_SYMBOL_GPL(lives_below_in_same_fs);
 
 struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry,
 					int flag, uid_t owner)
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -106,6 +106,7 @@ extern void shrink_submounts(struct vfsm
 
 extern spinlock_t vfsmount_lock;
 extern dev_t name_to_dev_t(char *name);
+extern int lives_below_in_same_fs(struct dentry *, struct dentry *);
 
 #endif
 #endif /* _LINUX_MOUNT_H */

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 09/26] linux/stat.h: Add the filetype white-out
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (7 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 08/26] VFS: Export lives_below_in_same_fs() Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 10/26] VFS white-out handling Jan Blunck
                   ` (18 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/stat-add-S_IFWHT.diff --]
[-- Type: text/plain, Size: 945 bytes --]

A white-out stops the VFS from further lookups of the white-out's name and
returns -ENOENT. This is the same behaviour as if the filename isn't
found. This can be used in combination with union mounts to virtually
delete (white-out) files by creating a file of this file type.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 include/linux/stat.h |    2 ++
 1 file changed, 2 insertions(+)

--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -10,6 +10,7 @@
 #if defined(__KERNEL__) || !defined(__GLIBC__) || (__GLIBC__ < 2)
 
 #define S_IFMT  00170000
+#define S_IFWHT  0160000	/* whiteout */
 #define S_IFSOCK 0140000
 #define S_IFLNK	 0120000
 #define S_IFREG  0100000
@@ -28,6 +29,7 @@
 #define S_ISBLK(m)	(((m) & S_IFMT) == S_IFBLK)
 #define S_ISFIFO(m)	(((m) & S_IFMT) == S_IFIFO)
 #define S_ISSOCK(m)	(((m) & S_IFMT) == S_IFSOCK)
+#define S_ISWHT(m)	(((m) & S_IFMT) == S_IFWHT)
 
 #define S_IRWXU 00700
 #define S_IRUSR 00400

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 10/26] VFS white-out handling
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (8 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 09/26] linux/stat.h: Add the filetype white-out Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 11/26] tmpfs white-out support Jan Blunck
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/vfs-whiteout.diff --]
[-- Type: text/plain, Size: 18366 bytes --]

Introduce white-out handling in the VFS.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/inode.c         |   22 ++
 fs/namei.c         |  417 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/readdir.c       |    6 
 include/linux/fs.h |    7 
 4 files changed, 441 insertions(+), 11 deletions(-)

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1410,6 +1410,26 @@ void __init inode_init(unsigned long mem
 		INIT_HLIST_HEAD(&inode_hashtable[loop]);
 }
 
+/*
+ * Dummy default file-operations:
+ * Never open a whiteout. This is always a bug.
+ */
+static int whiteout_no_open(struct inode *irrelevant, struct file *dontcare)
+{
+	printk("WARNING: at %s:%d %s(): Attempted to open a whiteout!\n",
+	       __FILE__, __LINE__, __FUNCTION__);
+	/*
+	 * Nobody should ever be able to open a whiteout. On the other hand
+	 * this isn't fatal so lets just print a warning message.
+	 */
+	WARN_ON(1);
+	return -ENXIO;
+}
+
+static struct file_operations def_wht_fops = {
+	.open		= whiteout_no_open,
+};
+
 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
 {
 	inode->i_mode = mode;
@@ -1423,6 +1443,8 @@ void init_special_inode(struct inode *in
 		inode->i_fop = &def_fifo_fops;
 	else if (S_ISSOCK(mode))
 		inode->i_fop = &bad_sock_fops;
+	else if (S_ISWHT(mode))
+		inode->i_fop = &def_wht_fops;
 	else
 		printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o)\n",
 		       mode);
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -887,7 +887,7 @@ static fastcall int __link_path_walk(con
 
 		err = -ENOENT;
 		inode = next.dentry->d_inode;
-		if (!inode)
+		if (!inode || S_ISWHT(inode->i_mode))
 			goto out_dput;
 		err = -ENOTDIR; 
 		if (!inode->i_op)
@@ -951,6 +951,8 @@ last_component:
 		err = -ENOENT;
 		if (!inode)
 			break;
+		if (S_ISWHT(inode->i_mode))
+			break;
 		if (lookup_flags & LOOKUP_DIRECTORY) {
 			err = -ENOTDIR; 
 			if (!inode->i_op || !inode->i_op->lookup)
@@ -1434,13 +1436,10 @@ static inline int check_sticky(struct in
  * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
  *     nfs_async_unlink().
  */
-static int may_delete(struct inode *dir,struct dentry *victim,int isdir)
+static int __may_delete(struct inode *dir, struct dentry *victim, int isdir)
 {
 	int error;
 
-	if (!victim->d_inode)
-		return -ENOENT;
-
 	BUG_ON(victim->d_parent->d_inode != dir);
 	audit_inode_child(victim->d_name.name, victim->d_inode, dir);
 
@@ -1466,6 +1465,14 @@ static int may_delete(struct inode *dir,
 	return 0;
 }
 
+static int may_delete(struct inode *dir, struct dentry *victim, int isdir)
+{
+	if (!victim->d_inode || S_ISWHT(victim->d_inode->i_mode))
+		return -ENOENT;
+
+	return __may_delete(dir, victim, isdir);
+}
+
 /*	Check whether we can create an object with dentry child in directory
  *  dir.
  *  1. We can't do it if child already exists (open has special treatment for
@@ -1477,7 +1484,7 @@ static int may_delete(struct inode *dir,
 static inline int may_create(struct inode *dir, struct dentry *child,
 			     struct nameidata *nd)
 {
-	if (child->d_inode)
+	if (child->d_inode && !S_ISWHT(child->d_inode->i_mode))
 		return -EEXIST;
 	if (IS_DEADDIR(dir))
 		return -ENOENT;
@@ -1559,6 +1566,13 @@ int vfs_create(struct inode *dir, struct
 	error = security_inode_create(dir, dentry, mode);
 	if (error)
 		return error;
+
+	if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+		error = vfs_unlink_whiteout(dir, dentry);
+		if (error)
+			return error;
+	}
+
 	DQUOT_INIT(dir);
 	error = dir->i_op->create(dir, dentry, mode, nd);
 	if (!error)
@@ -1741,7 +1755,7 @@ do_last:
 	}
 
 	/* Negative dentry, just create the file */
-	if (!path.dentry->d_inode) {
+	if (!path.dentry->d_inode || S_ISWHT(path.dentry->d_inode->i_mode)) {
 		error = open_namei_create(nd, &path, flag, mode);
 		if (error)
 			goto exit;
@@ -1903,6 +1917,12 @@ int vfs_mknod(struct inode *dir, struct 
 	if (error)
 		return error;
 
+	if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+		error = vfs_unlink_whiteout(dir, dentry);
+		if (error)
+			return error;
+	}
+
 	DQUOT_INIT(dir);
 	error = dir->i_op->mknod(dir, dentry, mode, dev);
 	if (!error)
@@ -1969,6 +1989,7 @@ asmlinkage long sys_mknod(const char __u
 int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 {
 	int error = may_create(dir, dentry, NULL);
+	int opaque = 0;
 
 	if (error)
 		return error;
@@ -1981,10 +2002,20 @@ int vfs_mkdir(struct inode *dir, struct 
 	if (error)
 		return error;
 
+	if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+		error = vfs_unlink_whiteout(dir, dentry);
+		if (error)
+			return error;
+		opaque = 1;
+	}
+
 	DQUOT_INIT(dir);
 	error = dir->i_op->mkdir(dir, dentry, mode);
-	if (!error)
+	if (!error) {
 		fsnotify_mkdir(dir, dentry);
+		if (opaque)
+			dentry->d_inode->i_flags |= S_OPAQUE;
+	}
 	return error;
 }
 
@@ -2025,6 +2056,360 @@ asmlinkage long sys_mkdir(const char __u
 	return sys_mkdirat(AT_FDCWD, pathname, mode);
 }
 
+static int filldir_is_empty(void *__buf, const char *name, int namlen,
+			    loff_t offset, u64 ino, unsigned int d_type)
+{
+	int *is_empty = (int *)__buf;
+
+	switch (namlen) {
+	case 2:
+		if (name[1] != '.')
+			break;
+	case 1:
+		if (name[0] != '.')
+			break;
+		return 0;
+	}
+
+	if (d_type == DT_WHT)
+		return 0;
+
+	(*is_empty) = 0;
+	return 0;
+}
+
+static int directory_is_empty(struct dentry *dentry, struct vfsmount *mnt)
+{
+	struct file *file;
+	int err;
+	int is_empty = 1;
+
+	BUG_ON(!S_ISDIR(dentry->d_inode->i_mode));
+
+	/* references for the file pointer */
+	dget(dentry);
+	mntget(mnt);
+
+	file = dentry_open(dentry, mnt, O_RDONLY);
+	if (IS_ERR(file))
+		return 0;
+
+	err = vfs_readdir(file, filldir_is_empty, &is_empty);
+
+	fput(file);
+	return is_empty;
+}
+
+/*
+ * We try to whiteout a dentry. dir is the parent of the whiteout.
+ * Whiteouts can be vfs_unlink'ed.
+ */
+int vfs_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	int err;
+
+	BUG_ON(dentry->d_parent->d_inode != dir);
+
+	/* from may_create() */
+	if (dentry->d_inode)
+		return -EEXIST;
+	if (IS_DEADDIR(dir))
+		return -ENOENT;
+	err = permission(dir, MAY_WRITE | MAY_EXEC, NULL);
+	if (err)
+		return err;
+
+	/* from may_delete() */
+	if (IS_APPEND(dir))
+		return -EPERM;
+	/* We don't call check_sticky() here because d_inode == NULL */
+
+	if (!dir->i_op || !dir->i_op->whiteout)
+		return -EOPNOTSUPP;
+
+	err = dir->i_op->whiteout(dir, dentry);
+	/* Ignore quota and fsnotify */
+	return err;
+}
+
+/* Checks on the victiom for whiteout */
+static inline int may_whiteout(struct dentry *victim, int isdir)
+{
+	if (!victim->d_inode || S_ISWHT(victim->d_inode->i_mode))
+		return -ENOENT;
+	if (IS_APPEND(victim->d_inode) || IS_IMMUTABLE(victim->d_inode))
+		return -EPERM;
+	if (isdir) {
+		if (!S_ISDIR(victim->d_inode->i_mode))
+			return -ENOTDIR;
+		if (IS_ROOT(victim))
+			return -EBUSY;
+	} else if (S_ISDIR(victim->d_inode->i_mode))
+		return -EISDIR;
+	if (victim->d_flags & DCACHE_NFSFS_RENAMED)
+		return -EBUSY;
+	return 0;
+}
+
+/*
+ * do_whiteout - whiteout a dentry, either when removing or renaming
+ * @dentry: the dentry to whiteout
+ *
+ * This is called by the VFS when removing or renaming files on an union mount.
+ * Must be called with nd->dentry->d_inode->i_mutex locked.
+ */
+static int do_whiteout(struct nameidata *nd, struct path *path, int isdir)
+{
+	struct path safe = { .dentry = dget(nd->dentry),
+			     .mnt = mntget(nd->mnt) };
+	struct dentry *dentry = path->dentry;
+	struct qstr name;
+	int err;
+
+	err = may_whiteout(dentry, isdir);
+	if (err)
+		goto out;
+
+	err = -ENOTEMPTY;
+	if (isdir && !directory_is_empty(path->dentry, path->mnt))
+		goto out;
+
+	/* safe the name for a later lookup */
+	err = -ENOMEM;
+	name.name = kmalloc(dentry->d_name.len, GFP_KERNEL);
+	if (!name.name)
+		goto out;
+	strncpy((char *)name.name, dentry->d_name.name, dentry->d_name.len);
+	name.len = dentry->d_name.len;
+	name.hash = dentry->d_name.hash;
+
+	/*
+	 * If the dentry to whiteout is on the topmost layer of
+	 * the union stack we must get rid of it first before
+	 * creating the whiteout.
+	 */
+	if (dentry->d_parent == nd->dentry) {
+		struct inode *dir = nd->dentry->d_inode;
+
+		if (isdir)
+			err = vfs_rmdir(dir, dentry);
+		else
+			err = vfs_unlink(dir, dentry);
+		if (err)
+			goto out_freename;
+	}
+
+	/*
+	 * Relookup the dentry to whiteout now. We should find a fresh negative
+	 * dentry by this time.
+	 */
+	dentry = __lookup_hash_kern(&name, nd->dentry, nd);
+	err = PTR_ERR(dentry);
+	if (IS_ERR(dentry))
+		goto out_freename;
+
+	dput(path->dentry);
+	if (path->mnt != safe.mnt)
+		mntput(path->mnt);
+	path->mnt = nd->mnt;
+	path->dentry = dentry;
+
+	err = vfs_whiteout(nd->dentry->d_inode, dentry);
+out_freename:
+	kfree(name.name);
+out:
+	pathput(&safe);
+	return err;
+}
+
+/*
+ * vfs_unlink_whiteout - Unlink a single whiteout from the system
+ * @dir: parent directory
+ * @dentry: the whiteout itself
+ *
+ * This is for unlinking a single whiteout. Don't use vfs_unlink() because we
+ * don't want any notification stuff etc. but basically it is the same stuff.
+ */
+int vfs_unlink_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	int error;
+
+	if (!dentry->d_inode)
+		return -ENOENT;
+
+	error = __may_delete(dir, dentry, 0);
+	if (error)
+		return error;
+
+	if (!dir->i_op || !dir->i_op->unlink)
+		return -EPERM;
+
+	DQUOT_INIT(dir);
+
+	mutex_lock(&dentry->d_inode->i_mutex);
+	if (d_mountpoint(dentry))
+		error = -EBUSY;
+	else {
+		error = security_inode_unlink(dir, dentry);
+		if (!error)
+			error = dir->i_op->unlink(dir, dentry);
+	}
+	mutex_unlock(&dentry->d_inode->i_mutex);
+
+	/*
+	 * We can call dentry_iput() since nobody could actually do something
+	 * useful with a whiteout. So dropping the reference to the inode
+	 * doesn't make a difference, does it?
+	 *
+	 * It turns the without dentry into a negative dentry ... hmm, couldn't
+	 * this race againt if(inode && S_ISWHT(inode->i_mode)) tests???
+	 */
+	if (!error) {
+		spin_lock(&dcache_lock);
+		spin_lock(&dentry->d_lock);
+		if (atomic_read(&dentry->d_count) == 1) {
+			struct inode *inode = dentry->d_inode;
+			dentry->d_inode = NULL;
+			list_del_init(&dentry->d_alias);
+			spin_unlock(&dentry->d_lock);
+			spin_unlock(&dcache_lock);
+			if (dentry->d_op && dentry->d_op->d_iput)
+				dentry->d_op->d_iput(dentry, inode);
+			else
+				iput(inode);
+		} else {
+			if (!d_unhashed(dentry))
+				__d_drop(dentry);
+			spin_unlock(&dentry->d_lock);
+			spin_unlock(&dcache_lock);
+			printk("WARNING: at %s:%d %s(): couldn't unlink\n",
+			       __FILE__, __LINE__, __FUNCTION__);
+			dump_stack();
+		}
+	}
+	return error;
+}
+
+static int __hash_one_len(const char *name, int len, struct qstr *this)
+{
+	unsigned long hash;
+	unsigned char c;
+
+	hash = init_name_hash();
+	while (len--) {
+		c = *(const unsigned char *)name++;
+		if (c == '/' || c == '\0')
+			return -EINVAL;
+		hash = partial_name_hash(c, hash);
+	}
+	this->hash = end_name_hash(hash);
+	return 0;
+}
+
+struct unlink_whiteout_dirent {
+	struct dentry *parent;
+	struct list_head list;
+};
+
+static int filldir_unlink_whiteouts(void *buf, const char *name, int namlen,
+				    loff_t offset, u64 ino,
+				    unsigned int d_type)
+{
+	struct unlink_whiteout_dirent *dirent = buf;
+	struct dentry *dentry;
+	struct qstr this;
+	int res;
+
+	if (d_type != DT_WHT)
+		return 0;
+
+	this.name = name;
+	this.len = namlen;
+	res = __hash_one_len(name, namlen, &this);
+	if (res)
+		return res;
+
+	dentry = __lookup_hash_kern(&this, dirent->parent, NULL);
+	if (IS_ERR(dentry))
+		return PTR_ERR(dentry);
+
+	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
+	__d_drop(dentry);
+	if (!list_empty(&dentry->d_lru)) {
+		list_del(&dentry->d_lru);
+		dentry_stat.nr_unused--;
+	}
+	list_add(&dentry->d_lru, &dirent->list);
+	spin_unlock(&dentry->d_lock);
+	spin_unlock(&dcache_lock);
+	return res;
+}
+
+/*
+ * do_unlink_whiteouts - remove all whiteouts of an "empty" directory
+ * @dentry: the directories dentry
+ *
+ * Before removing a directory from the file system, we have to make sure
+ * that there are no stale whiteouts in it. Therefore we call readdir() with
+ * a special filldir helper to remove all the whiteouts.
+ *
+ * XXX: Don't call any security and permission checks here (If we aren't
+ * allowed to go here, we shouldn't be here at all). Same with i_mutex, don't
+ * touch it here.
+ */
+static int do_unlink_whiteouts(struct dentry *dentry)
+{
+	struct file *file;
+	struct inode *inode;
+	struct unlink_whiteout_dirent dirent =
+		{ .list = LIST_HEAD_INIT(dirent.list),
+		  .parent = dentry };
+	struct dentry *n;
+	int res;
+
+	dget(dentry);
+
+	/*
+	 * FIXME: This is bad, because we really don't want to open a new
+	 * file in the kernel but readdir needs a file pointer
+	 */
+	file = dentry_open(dentry, NULL, O_RDWR);
+	if (IS_ERR(file)) {
+		printk(KERN_ERR "%s: dentry_open failed (%ld)\n",
+		       __FUNCTION__, PTR_ERR(file));
+		return PTR_ERR(file);
+	}
+
+	inode = file->f_path.dentry->d_inode;
+
+	res = -ENOTDIR;
+	if (!file->f_op || !file->f_op->readdir)
+		goto out_fput;
+
+	res = -ENOENT;
+	if (!IS_DEADDIR(inode)) {
+		res = file->f_op->readdir(file, &dirent,
+					  filldir_unlink_whiteouts);
+		file_accessed(file);
+	}
+
+	list_for_each_entry_safe(dentry, n, &dirent.list, d_lru) {
+		list_del_init(&dentry->d_lru);
+		res = vfs_unlink_whiteout(inode, dentry);
+		WARN_ON(res);
+		dput(dentry);
+	}
+
+out_fput:
+	fput(file);
+	if (unlikely(res))
+		printk(KERN_ERR "%s: readdir failed (%d)\n",
+		       __FUNCTION__, res);
+	return res;
+}
+
+
 /*
  * We try to drop the dentry early: we should have
  * a usage count of 2 if we're the only user of this
@@ -2064,18 +2449,22 @@ int vfs_rmdir(struct inode *dir, struct 
 
 	DQUOT_INIT(dir);
 
-	mutex_lock(&dentry->d_inode->i_mutex);
+	mutex_lock_nested(&dentry->d_inode->i_mutex, I_MUTEX_CHILD);
 	dentry_unhash(dentry);
 	if (d_mountpoint(dentry))
 		error = -EBUSY;
 	else {
 		error = security_inode_rmdir(dir, dentry);
 		if (!error) {
+			error = do_unlink_whiteouts(dentry);
+			if (error)
+				goto out;
 			error = dir->i_op->rmdir(dir, dentry);
 			if (!error)
 				dentry->d_inode->i_flags |= S_DEAD;
 		}
 	}
+out:
 	mutex_unlock(&dentry->d_inode->i_mutex);
 	if (!error) {
 		d_delete(dentry);
@@ -2243,6 +2632,12 @@ int vfs_symlink(struct inode *dir, struc
 	if (error)
 		return error;
 
+	if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+		error = vfs_unlink_whiteout(dir, dentry);
+		if (error)
+			return error;
+	}
+
 	DQUOT_INIT(dir);
 	error = dir->i_op->symlink(dir, dentry, oldname);
 	if (!error)
@@ -2296,7 +2691,7 @@ int vfs_link(struct dentry *old_dentry, 
 	struct inode *inode = old_dentry->d_inode;
 	int error;
 
-	if (!inode)
+	if (!inode || S_ISWHT(inode->i_mode))
 		return -ENOENT;
 
 	error = may_create(dir, new_dentry, NULL);
@@ -2570,7 +2965,7 @@ static int do_rename(int olddfd, const c
 		goto exit3;
 	/* source must exist */
 	error = -ENOENT;
-	if (!old.dentry->d_inode)
+	if (!old.dentry->d_inode || S_ISWHT(old.dentry->d_inode->i_mode))
 		goto exit4;
 	/* unless the source is a directory trailing slashes give -ENOTDIR */
 	if (!S_ISDIR(old.dentry->d_inode->i_mode)) {
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -148,6 +148,9 @@ static int filldir(void * __buf, const c
 	unsigned long d_ino;
 	int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(long));
 
+	if (d_type == DT_WHT)
+		return 0;
+
 	buf->error = -EINVAL;	/* only used if we fail.. */
 	if (reclen > buf->count)
 		return -EINVAL;
@@ -233,6 +236,9 @@ static int filldir64(void * __buf, const
 	struct getdents_callback64 * buf = (struct getdents_callback64 *) __buf;
 	int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 1, sizeof(u64));
 
+	if (d_type == DT_WHT)
+		return 0;
+
 	buf->error = -EINVAL;	/* only used if we fail.. */
 	if (reclen > buf->count)
 		return -EINVAL;
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -97,6 +97,7 @@ extern int dir_notify_enable;
 #define FS_BINARY_MOUNTDATA 2
 #define FS_HAS_SUBTYPE 4
 #define FS_SAFE 8		/* Safe to mount by unprivileged users */
+#define FS_WHT		8192	/* FS supports whiteout filetype */
 #define FS_REVAL_DOT	16384	/* Check the paths ".", ".." for staleness */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move()
 					 * during rename() internally.
@@ -130,6 +131,7 @@ extern int dir_notify_enable;
 #define MS_NO_LEASES	(1<<22)	/* fs does not support leases */
 #define MS_SETUSER	(1<<23) /* set mnt_uid to current user */
 #define MS_NOMNT	(1<<24) /* don't allow unprivileged submounts */
+#define MS_WHITEOUT	(1<<25)	/* fs does support white-out filetype */
 #define MS_ACTIVE	(1<<30)
 #define MS_NOUSER	(1<<31)
 
@@ -156,6 +158,7 @@ extern int dir_notify_enable;
 #define S_NOCMTIME	128	/* Do not update file c/mtime */
 #define S_SWAPFILE	256	/* Do not truncate: swapon got its bmaps */
 #define S_PRIVATE	512	/* Inode is fs-internal */
+#define S_OPAQUE	1024	/* Directory is opaque */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -190,6 +193,7 @@ extern int dir_notify_enable;
 #define IS_SWAPFILE(inode)	((inode)->i_flags & S_SWAPFILE)
 #define IS_PRIVATE(inode)	((inode)->i_flags & S_PRIVATE)
 #define IS_NO_LEASES(inode)	__IS_FLG(inode, MS_NO_LEASES)
+#define IS_OPAQUE(inode)	((inode)->i_flags & S_OPAQUE)
 
 /* the read-only stuff doesn't really belong here, but any other place is
    probably as bad and I don't want to create yet another include file. */
@@ -1087,6 +1091,8 @@ extern int vfs_link(struct dentry *, str
 extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_whiteout(struct inode *, struct dentry *);
+extern int vfs_unlink_whiteout(struct inode *, struct dentry *);
 
 /*
  * VFS dentry helper functions.
@@ -1212,6 +1218,7 @@ struct inode_operations {
 	int (*mkdir) (struct inode *,struct dentry *,int);
 	int (*rmdir) (struct inode *,struct dentry *);
 	int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+	int (*whiteout) (struct inode *, struct dentry *);
 	int (*rename) (struct inode *, struct dentry *,
 			struct inode *, struct dentry *);
 	int (*readlink) (struct dentry *, char __user *,int);

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 11/26] tmpfs white-out support
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (9 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 10/26] VFS white-out handling Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-08-01 15:13   ` Hugh Dickins
  2007-07-30 16:13 ` [RFC 12/26] ext2 " Jan Blunck
                   ` (16 subsequent siblings)
  27 siblings, 1 reply; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, Hugh Dickins; +Cc: Bharata B Rao

[-- Attachment #1: um/tmpfs-whiteout.diff --]
[-- Type: text/plain, Size: 3013 bytes --]

Introduce white-out support to tmpfs.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 include/linux/shmem_fs.h |    1 
 mm/shmem.c               |   54 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -33,6 +33,7 @@ struct shmem_sb_info {
 	int policy;		    /* Default NUMA memory alloc policy */
 	nodemask_t policy_nodes;    /* nodemask for preferred and bind */
 	spinlock_t    stat_lock;
+	struct inode *whiteout_inode;
 };
 
 static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1784,6 +1784,42 @@ static int shmem_create(struct inode *di
 }
 
 /*
+ * This is the whiteout support for tmpfs. It uses one singleton whiteout
+ * inode per superblock thus it is very similar to shmem_link().
+ */
+static int shmem_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	struct shmem_sb_info *sbinfo = SHMEM_SB(dir->i_sb);
+	struct inode *inode = sbinfo->whiteout_inode;
+
+	if (!(dir->i_sb->s_flags & MS_WHITEOUT))
+		return -EPERM;
+
+	/*
+	 * No ordinary (disk based) filesystem counts whiteouts as inodes;
+	 * but each new link needs a new dentry, pinning lowmem, and
+	 * tmpfs dentries cannot be pruned until they are unlinked.
+	 */
+	if (sbinfo->max_inodes) {
+		spin_lock(&sbinfo->stat_lock);
+		if (!sbinfo->free_inodes) {
+			spin_unlock(&sbinfo->stat_lock);
+			return -ENOSPC;
+		}
+		sbinfo->free_inodes--;
+		spin_unlock(&sbinfo->stat_lock);
+	}
+
+	dir->i_size += BOGO_DIRENT_SIZE;
+	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+	inc_nlink(inode);
+	atomic_inc(&inode->i_count);	/* New dentry reference */
+	dget(dentry);		/* Extra pinning count for the created dentry */
+	d_instantiate(dentry, inode);
+	return 0;
+}
+
+/*
  * Link a file..
  */
 static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
@@ -2231,6 +2267,9 @@ out:
 
 static void shmem_put_super(struct super_block *sb)
 {
+	struct shmem_sb_info *sbinfo = sb->s_fs_info;
+
+	iput(sbinfo->whiteout_inode);
 	kfree(sb->s_fs_info);
 	sb->s_fs_info = NULL;
 }
@@ -2305,6 +2344,19 @@ static int shmem_fill_super(struct super
 	if (!root)
 		goto failed_iput;
 	sb->s_root = root;
+
+#ifdef CONFIG_TMPFS
+	if (!(sb->s_flags & MS_NOUSER)) {
+		inode = shmem_get_inode(sb, S_IRUGO | S_IWUGO | S_IFWHT, 0);
+		if (!inode) {
+			dput(root);
+			goto failed;
+		}
+		sbinfo->whiteout_inode = inode;
+		sb->s_flags |= MS_WHITEOUT;
+	}
+#endif
+
 	return 0;
 
 failed_iput:
@@ -2410,6 +2462,7 @@ static const struct inode_operations shm
 	.rmdir		= shmem_rmdir,
 	.mknod		= shmem_mknod,
 	.rename		= shmem_rename,
+	.whiteout       = shmem_whiteout,
 #endif
 #ifdef CONFIG_TMPFS_POSIX_ACL
 	.setattr	= shmem_notify_change,
@@ -2464,6 +2517,7 @@ static struct file_system_type tmpfs_fs_
 	.name		= "tmpfs",
 	.get_sb		= shmem_get_sb,
 	.kill_sb	= kill_litter_super,
+	.fs_flags	= FS_WHT,
 };
 static struct vfsmount *shm_mnt;
 

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 12/26] ext2 white-out support
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (10 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 11/26] tmpfs white-out support Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-31  3:45   ` Theodore Tso
  2007-07-31 16:36   ` Josef Sipek
  2007-07-30 16:13 ` [RFC 13/26] ext3 whiteout support Jan Blunck
                   ` (15 subsequent siblings)
  27 siblings, 2 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/ext2-whiteout.diff --]
[-- Type: text/plain, Size: 3642 bytes --]

Introduce white-out support to ext2.

Known Bugs:
- Needs a reserved inode number for white-outs
- S_OPAQUE isn't persistently stored

Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
---
 fs/ext2/dir.c           |    2 ++
 fs/ext2/namei.c         |   18 ++++++++++++++++++
 fs/ext2/super.c         |    5 ++++-
 include/linux/ext2_fs.h |    4 ++++
 4 files changed, 28 insertions(+), 1 deletion(-)

--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -230,6 +230,7 @@ static unsigned char ext2_filetype_table
 	[EXT2_FT_FIFO]		= DT_FIFO,
 	[EXT2_FT_SOCK]		= DT_SOCK,
 	[EXT2_FT_SYMLINK]	= DT_LNK,
+	[EXT2_FT_WHT]		= DT_WHT,
 };
 
 #define S_SHIFT 12
@@ -241,6 +242,7 @@ static unsigned char ext2_type_by_mode[S
 	[S_IFIFO >> S_SHIFT]	= EXT2_FT_FIFO,
 	[S_IFSOCK >> S_SHIFT]	= EXT2_FT_SOCK,
 	[S_IFLNK >> S_SHIFT]	= EXT2_FT_SYMLINK,
+	[S_IFWHT >> S_SHIFT]	= EXT2_FT_WHT,
 };
 
 static inline void ext2_set_de_type(ext2_dirent *de, struct inode *inode)
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -288,6 +288,23 @@ static int ext2_rmdir (struct inode * di
 	return err;
 }
 
+static int ext2_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode;
+	int err;
+
+	inode = ext2_new_inode (dir, S_IFWHT | S_IRUGO);
+	err = PTR_ERR(inode);
+	if (IS_ERR(inode))
+		goto out;
+
+	init_special_inode(inode, inode->i_mode, 0);
+	mark_inode_dirty(inode);
+	err = ext2_add_nondir(dentry, inode);
+out:
+	return err;
+}
+
 static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
 	struct inode * new_dir,	struct dentry * new_dentry )
 {
@@ -382,6 +399,7 @@ const struct inode_operations ext2_dir_i
 	.mkdir		= ext2_mkdir,
 	.rmdir		= ext2_rmdir,
 	.mknod		= ext2_mknod,
+	.whiteout	= ext2_whiteout,
 	.rename		= ext2_rename,
 #ifdef CONFIG_EXT2_FS_XATTR
 	.setxattr	= generic_setxattr,
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -752,6 +752,9 @@ static int ext2_fill_super(struct super_
 	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
 				    EXT2_MOUNT_XIP if not */
 
+	if (EXT2_HAS_INCOMPAT_FEATURE(sb, EXT2_FEATURE_INCOMPAT_WHITEOUT))
+		sb->s_flags |= MS_WHITEOUT;
+
 	if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
 	    (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -1299,7 +1302,7 @@ static struct file_system_type ext2_fs_t
 	.name		= "ext2",
 	.get_sb		= ext2_get_sb,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_WHT,
 };
 
 static int __init init_ext2_fs(void)
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -61,6 +61,7 @@
 #define EXT2_ROOT_INO		 2	/* Root inode */
 #define EXT2_BOOT_LOADER_INO	 5	/* Boot loader inode */
 #define EXT2_UNDEL_DIR_INO	 6	/* Undelete directory inode */
+#define EXT2_WHT_INO		 7	/* Whiteout inode */
 
 /* First non-reserved inode for old ext2 filesystems */
 #define EXT2_GOOD_OLD_FIRST_INO	11
@@ -479,10 +480,12 @@ struct ext2_super_block {
 #define EXT3_FEATURE_INCOMPAT_RECOVER		0x0004
 #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV	0x0008
 #define EXT2_FEATURE_INCOMPAT_META_BG		0x0010
+#define EXT2_FEATURE_INCOMPAT_WHITEOUT		0x0020
 #define EXT2_FEATURE_INCOMPAT_ANY		0xffffffff
 
 #define EXT2_FEATURE_COMPAT_SUPP	EXT2_FEATURE_COMPAT_EXT_ATTR
 #define EXT2_FEATURE_INCOMPAT_SUPP	(EXT2_FEATURE_INCOMPAT_FILETYPE| \
+					 EXT2_FEATURE_INCOMPAT_WHITEOUT| \
 					 EXT2_FEATURE_INCOMPAT_META_BG)
 #define EXT2_FEATURE_RO_COMPAT_SUPP	(EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
 					 EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
@@ -549,6 +552,7 @@ enum {
 	EXT2_FT_FIFO,
 	EXT2_FT_SOCK,
 	EXT2_FT_SYMLINK,
+	EXT2_FT_WHT,
 	EXT2_FT_MAX
 };
 

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 13/26] ext3 whiteout support
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (11 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 12/26] ext2 " Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 14/26] union-mount: Documentation Jan Blunck
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/ext3-whiteout.diff --]
[-- Type: text/plain, Size: 4066 bytes --]

Introduce whiteout support for ext3.

- Needs a reserved inode number for white-outs
- S_OPAQUE isn't persistently stored

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/ext3/dir.c           |    3 ++-
 fs/ext3/namei.c         |   33 +++++++++++++++++++++++++++++++++
 fs/ext3/super.c         |    5 ++++-
 include/linux/ext3_fs.h |    5 ++++-
 4 files changed, 43 insertions(+), 3 deletions(-)

--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -29,7 +29,8 @@
 #include <linux/rbtree.h>
 
 static unsigned char ext3_filetype_table[] = {
-	DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
+	DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK,
+	DT_WHT
 };
 
 static int ext3_readdir(struct file *, void *, filldir_t);
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -1081,6 +1081,7 @@ static unsigned char ext3_type_by_mode[S
 	[S_IFIFO >> S_SHIFT]	= EXT3_FT_FIFO,
 	[S_IFSOCK >> S_SHIFT]	= EXT3_FT_SOCK,
 	[S_IFLNK >> S_SHIFT]	= EXT3_FT_SYMLINK,
+	[S_IFWHT >> S_SHIFT]	= EXT3_FT_WHT,
 };
 
 static inline void ext3_set_de_type(struct super_block *sb,
@@ -2070,6 +2071,37 @@ end_rmdir:
 	return retval;
 }
 
+static int ext3_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode;
+	int err, retries = 0;
+	handle_t *handle;
+
+retry:
+	handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir->i_sb) +
+					EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 +
+					2*EXT3_QUOTA_INIT_BLOCKS(dir->i_sb));
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	if (IS_DIRSYNC(dir))
+		handle->h_sync = 1;
+
+	inode = ext3_new_inode (handle, dir, S_IFWHT | S_IRUGO);
+	err = PTR_ERR(inode);
+	if (IS_ERR(inode))
+		goto out_stop;
+
+	init_special_inode(inode, inode->i_mode, 0);
+	err = ext3_add_nondir(handle, dentry, inode);
+
+out_stop:
+	ext3_journal_stop(handle);
+	if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
+		goto retry;
+	return err;
+}
+
 static int ext3_unlink(struct inode * dir, struct dentry *dentry)
 {
 	int retval;
@@ -2387,6 +2419,7 @@ const struct inode_operations ext3_dir_i
 	.mkdir		= ext3_mkdir,
 	.rmdir		= ext3_rmdir,
 	.mknod		= ext3_mknod,
+	.whiteout	= ext3_whiteout,
 	.rename		= ext3_rename,
 	.setattr	= ext3_setattr,
 #ifdef CONFIG_EXT3_FS_XATTR
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1500,6 +1500,9 @@ static int ext3_fill_super (struct super
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		((sbi->s_mount_opt & EXT3_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
 
+	if (EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_WHITEOUT))
+		sb->s_flags |= MS_WHITEOUT;
+
 	if (le32_to_cpu(es->s_rev_level) == EXT3_GOOD_OLD_REV &&
 	    (EXT3_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT3_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -2764,7 +2767,7 @@ static struct file_system_type ext3_fs_t
 	.name		= "ext3",
 	.get_sb		= ext3_get_sb,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_WHT,
 };
 
 static int __init init_ext3_fs(void)
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -63,6 +63,7 @@
 #define EXT3_UNDEL_DIR_INO	 6	/* Undelete directory inode */
 #define EXT3_RESIZE_INO		 7	/* Reserved group descriptors inode */
 #define EXT3_JOURNAL_INO	 8	/* Journal inode */
+#define EXT3_WHT_INO		 9	/* Whiteout inode */
 
 /* First non-reserved inode for old ext3 filesystems */
 #define EXT3_GOOD_OLD_FIRST_INO	11
@@ -582,6 +583,7 @@ static inline int ext3_valid_inum(struct
 #define EXT3_FEATURE_INCOMPAT_RECOVER		0x0004 /* Needs recovery */
 #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV	0x0008 /* Journal device */
 #define EXT3_FEATURE_INCOMPAT_META_BG		0x0010
+#define EXT3_FEATURE_INCOMPAT_WHITEOUT		0x0020
 
 #define EXT3_FEATURE_COMPAT_SUPP	EXT2_FEATURE_COMPAT_EXT_ATTR
 #define EXT3_FEATURE_INCOMPAT_SUPP	(EXT3_FEATURE_INCOMPAT_FILETYPE| \
@@ -648,8 +650,9 @@ struct ext3_dir_entry_2 {
 #define EXT3_FT_FIFO		5
 #define EXT3_FT_SOCK		6
 #define EXT3_FT_SYMLINK		7
+#define EXT3_FT_WHT		8
 
-#define EXT3_FT_MAX		8
+#define EXT3_FT_MAX		9
 
 /*
  * EXT3_DIR_PAD defines the directory entries boundaries

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 14/26] union-mount: Documentation
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (12 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 13/26] ext3 whiteout support Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 15/26] union-mount: Add union-mount mount flag Jan Blunck
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/union-mount-documentation.diff --]
[-- Type: text/plain, Size: 8424 bytes --]

Add simple documentation about union mounting in general and this
implementation in specific.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 Documentation/filesystems/union-mounts.txt |  172 +++++++++++++++++++++++++++++
 1 file changed, 172 insertions(+)

--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,172 @@
+VFS based Union Mounts
+----------------------
+
+ 1. What are "Union Mounts"
+ 2. The Union Stack
+ 3. The White-out Filetype
+ 4. Renaming Unions
+ 5. Directory Reading
+ 6. Known Problems
+ 7. References
+
+-------------------------------------------------------------------------------
+
+1. What are "Union Mounts"
+==========================
+
+Please note: this is NOT about UnionFS and it is NOT derived work!
+
+Traditionally the mount operation is opaque, which means that the content of
+the mount point, the directory where the file system is mounted on, is hidden
+by the content of the mounted file system's root directory until the file
+system is unmounted again. Unlike the traditional UNIX mount mechanism, that
+hides the contents of the mount point, a union mount presents a view as if
+both filesystems are merged together. Although only the topmost layer of the
+mount stack can be altered, it appears as if transparent file system mounts
+allow any file to be created, modified or deleted.
+
+Most people know the concepts and features of union mounts from other
+operating systems like Sun's Translucent Filesystem, Plan9 or BSD.
+
+Here are the key features of this implementation:
+- completely VFS based
+- does not change the namespace stacking
+- directory listings have duplicate entries removed
+- writable unions: only the topmost file system layer may be writable
+- writable unions: new white-out filetype handled inside the kernel
+
+-------------------------------------------------------------------------------
+
+2. The Union Stack
+==================
+
+The mounted file systems are organized in the "file system hierarchy" (tree of
+vfsmount structures), which keeps track about the stacking of file systems
+upon each other. The per-directory view on the file system hierarchy is called
+"mount stack" and reflects the order of file systems, which are mounted on a
+specific directory.
+
+Union mounts present a single unified view of the contents of two or more file
+systems as if they are merged together. Since the information which file
+system objects are part of a unified view is not directly available from the
+file system hierachy there is a need for a new structure. The file system
+objects, which are part of a unified view are ordered in a so-called "union
+stack". Only directoties can be part of a unified view.
+
+The link between two layers of the union stack is maintained using the
+union_mount structure (#include <linux/union.h>):
+
+struct union_mount {
+       atomic_t u_count;               /* reference count */
+       struct mutex u_mutex;
+       struct list_head u_unions;      /* list head for d_unions */
+       struct hlist_node u_hash;       /* list head for seaching */
+       struct hlist_node u_rhash;      /* list head for reverse seaching */
+
+       struct path u_this;             /* this is me */
+       struct path u_next;             /* this is what I overlay */
+};
+
+The union_mount structure holds a reference (dget,mntget) to the next lower
+layer of the union stack. Since a dentry can be part of multiple unions
+(e.g. with bind mounts) they are tied together via the d_unions field of the
+dentry structure.
+
+All union_mount structures are cached in two hash tables, one for lookups of
+the next lower layer of the union stack and one for reverse lookups of the
+next upper layer of the union stack. The reverse lookup is necessary to
+resolve CWD relative path lookups. For calculation of the hash value, the
+(dentry,vfsmount) pair is used. The u_this field is used for the hash table
+which is used in forward lookups and the u_next field for the reverse lookups.
+
+During every new mount (or mount propagation), a new union_mount structure is
+allocated. A reference to the mountpoint's vfsmount and dentry is taken and
+stored in the u_next field.  In almost the same manner an union_mount
+structure is created during the first time lookup of a directory within a
+union mount point. In this case the lookup proceeds to all lower layers of the
+union. Therefore the complete union stack is constructed during lookups.
+
+The union_mount structures of a dentry are destroyed when the dentry itself is
+destroyed. Therefore the dentry cache is indirectly driving the union_mount
+cache like this is done for inodes too. Please note that lower layer
+union_mount structures are kept in memory until the topmost dentry is
+destroyed.
+
+-------------------------------------------------------------------------------
+
+3. Writable Unions: The White-out Filetype and Copy-On-Open
+===========================================================
+
+The white-out filetype isn't new. It has been there for quite some time now
+but Linux's VFS hasn't used it yet. With the availability of union mount code
+inside the VFS the white-out filetype is getting important to support writable
+union mounts. For read-only union mounts support neither white-outs nor
+copy-on-open is necessary.
+
+The white-out filetype has the same function as negative dentries: they
+describe a filename which isn't there. The creation of white-outs needs
+lowlevel filesystem support. At the time of writing this, there is white-out
+support for tmpfs, ext2 and ext3 available. The VFS is extended to make the
+white-out handling transparent to all its users. The white-outs are not
+visible by the user-space.
+
+-------------------------------------------------------------------------------
+
+4. Renaming Unions
+==================
+
+Rename on union mounts has been handled in a lazy way: it returned -EXDEV.
+This works well for dirctories but not for regular files. Even a kernel build
+doesn't handle rename errors appropriate. Therefore when renaming regular
+files from a lower layer of the union stack it is copied to the topmost
+layer. If the file already resides on the topmost layer, the traditional
+rename method is used.
+
+-------------------------------------------------------------------------------
+
+5. Directory Reading
+====================
+
+As mentioned, union mounts represent a single view of multiple directories as
+if they are merged together. This is achieved by reading the contents of every
+directory on the union stack and by merging the result. When the directory
+listing is read via readdir() or getdents() system call, the union stack is
+traversed from the topmost layer of the union stack to the lowermost.
+
+Likewise with regular files, directories are seekable and the position of the
+following read is marked by the file position filp->f_pos. When reading from
+multiple directories, it is possible that the file position exceeds the inode
+size of the first directory. Therefore the file position is rearranged to
+select the correct directory in the union stack. This is done by substractiong
+the inode size if the file position exceeds it and selecting the next member
+of the union stack next.
+
+This worked well with filesystems like ext2 that used flat file directories.
+The directory entry offsets are arranged linear and are always smaller than
+the inode size of the directory. Modern filesystems have implemented
+directories differently and just return special cookies as directory entry
+offsets which are unrelated to the position in the directory or the inode
+size.
+
+-------------------------------------------------------------------------------
+
+6. Known Problems
+=================
+
+- currently it doesn't support seeking/readdir when d_off > i_size is possible
+- readdir() is a file operation
+- copyup() for other filetypes that reg and dir (e.g. for chown() on devices)
+
+-------------------------------------------------------------------------------
+
+7. References
+=============
+
+[1] http://marc.info/?l=linux-fsdevel&m=96035682927821&w=2
+[2] http://marc.info/?l=linux-fsdevel&m=117681527820133&w=2
+[3] http://marc.info/?l=linux-fsdevel&m=117913503200362&w=2
+[4] http://marc.info/?l=linux-fsdevel&m=118231827024394&w=2
+
+Authors:
+Jan Blunck <jblunck@suse.de>
+Bharata B Rao <bharata@linux.vnet.ibm.com>

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 15/26] union-mount: Add union-mount mount flag
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (13 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 14/26] union-mount: Documentation Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 16/26] union-mount: Introduce union_mount structure Jan Blunck
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/union-mount-mount_flag.diff --]
[-- Type: text/plain, Size: 1953 bytes --]

Introduce MNT_UNION and MS_UNION flags. You need additional patches for
util-linux for that to work.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/namespace.c        |    6 +++++-
 include/linux/fs.h    |    1 +
 include/linux/mount.h |    1 +
 3 files changed, 7 insertions(+), 1 deletion(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -437,6 +437,7 @@ static int show_vfsmnt(struct seq_file *
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
 		{ MNT_NOMNT, ",nomnt" },
+		{ MNT_UNION, ",union" },
 		{ 0, NULL }
 	};
 	struct proc_fs_info *fs_infop;
@@ -1558,9 +1559,12 @@ long do_mount(char *dev_name, char *dir_
 		mnt_flags |= MNT_RELATIME;
 	if (flags & MS_NOMNT)
 		mnt_flags |= MNT_NOMNT;
+	if (flags & MS_UNION)
+		mnt_flags |= MNT_UNION;
 
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
-		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_NOMNT);
+		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_NOMNT |
+		   MS_UNION );
 
 	/* ... and get the mountpoint */
 	retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -114,6 +114,7 @@ extern int dir_notify_enable;
 #define MS_REMOUNT	32	/* Alter flags of a mounted FS */
 #define MS_MANDLOCK	64	/* Allow mandatory locks on an FS */
 #define MS_DIRSYNC	128	/* Directory modifications are synchronous */
+#define MS_UNION	256
 #define MS_NOATIME	1024	/* Do not update access times. */
 #define MS_NODIRATIME	2048	/* Do not update directory access times */
 #define MS_BIND		4096
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -36,6 +36,7 @@ struct mnt_namespace;
 #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
 #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
 #define MNT_PNODE_MASK	0x3000	/* propagation flag mask */
+#define MNT_UNION	0x4000	/* if the vfsmount is a union mount */
 
 struct vfsmount {
 	struct list_head mnt_hash;

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 16/26] union-mount: Introduce union_mount structure
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (14 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 15/26] union-mount: Add union-mount mount flag Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-08-06  5:57   ` Bharata B Rao
  2007-07-30 16:13 ` [RFC 17/26] union-mount: Drive the union cache via dcache Jan Blunck
                   ` (11 subsequent siblings)
  27 siblings, 1 reply; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/union-mount-union-stack.diff --]
[-- Type: text/plain, Size: 13313 bytes --]

This patch adds the basic structures of VFS based union mounts. It is a new
implementation based on some of my old idea's that influenced Bharata B Rao
<bharata@linux.vnet.ibm.com> who came up with the proposal to let the
union_mount struct only point to the next layer in the union stack. I rewrote
nearly all of the central patches around lookup and the dcache interaction.

Advantages of the new implementation:
- the new union stack is no longer tied directly to one dentry
- the union stack enables dentries to be part of more than one union
  (bind mounts)
- it is unnecessary to traverse the union stack when de/referencing a dentry
- caching of union stack information still driven by dentry cache

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/Kconfig             |    8 +
 fs/Makefile            |    2 
 fs/dcache.c            |    4 
 fs/union.c             |  335 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dcache.h |    9 +
 include/linux/union.h  |   61 ++++++++
 6 files changed, 419 insertions(+)

--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -551,6 +551,14 @@ config INOTIFY_USER
 
 	  If unsure, say Y.
 
+config UNION_MOUNT
+       bool "Union mount support (EXPERIMENTAL)"
+       depends on EXPERIMENTAL
+       ---help---
+         If you say Y here, you will be able to mount file systems as
+         union mount stacks. This is a VFS based implementation and
+         should work with all file systems. If unsure, say N.
+
 config QUOTA
 	bool "Quota support"
 	help
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -49,6 +49,8 @@ obj-$(CONFIG_FS_POSIX_ACL)	+= posix_acl.
 obj-$(CONFIG_NFS_COMMON)	+= nfs_common/
 obj-$(CONFIG_GENERIC_ACL)	+= generic_acl.o
 
+obj-$(CONFIG_UNION_MOUNT)	+= union.o
+
 obj-$(CONFIG_QUOTA)		+= dquot.o
 obj-$(CONFIG_QFMT_V1)		+= quota_v1.o
 obj-$(CONFIG_QFMT_V2)		+= quota_v2.o
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -985,6 +985,10 @@ struct dentry *d_alloc(struct dentry * p
 #ifdef CONFIG_PROFILING
 	dentry->d_cookie = NULL;
 #endif
+#ifdef CONFIG_UNION_MOUNT
+	INIT_LIST_HEAD(&dentry->d_unions);
+	dentry->d_unionized = 0;
+#endif
 	INIT_HLIST_NODE(&dentry->d_hash);
 	INIT_LIST_HEAD(&dentry->d_lru);
 	INIT_LIST_HEAD(&dentry->d_subdirs);
--- /dev/null
+++ b/fs/union.c
@@ -0,0 +1,335 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
+ * Copyright (C) 2007 Novell Inc.
+ *
+ *   Author(s): Jan Blunck (j.blunck@tu-harburg.de)
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/bootmem.h>
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/hash.h>
+#include <linux/fs.h>
+#include <linux/union.h>
+
+/*
+ * This is borrowed from fs/inode.c. The hashtable for lookups. Somebody
+ * should try to make this good - I've just made it work.
+ */
+static unsigned int union_hash_mask __read_mostly;
+static unsigned int union_hash_shift __read_mostly;
+static struct hlist_head *union_hashtable __read_mostly;
+static unsigned int union_rhash_mask __read_mostly;
+static unsigned int union_rhash_shift __read_mostly;
+static struct hlist_head *union_rhashtable __read_mostly;
+
+/*
+ * Locking Rules:
+ * - dcache_lock (for union_rlookup() only)
+ * - union_lock
+ */
+DEFINE_SPINLOCK(union_lock);
+
+static struct kmem_cache *union_cache __read_mostly;
+
+static unsigned long hash(struct dentry *dentry, struct vfsmount *mnt)
+{
+	unsigned long tmp;
+
+	tmp = ((unsigned long)mnt * (unsigned long)dentry) ^
+		(GOLDEN_RATIO_PRIME + (unsigned long)mnt) / L1_CACHE_BYTES;
+	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> union_hash_shift);
+	return tmp & union_hash_mask;
+}
+
+static __initdata unsigned long union_hash_entries;
+
+static int __init set_union_hash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	union_hash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+
+__setup("union_hash_entries=", set_union_hash_entries);
+
+static int __init init_union(void)
+{
+	int loop;
+
+	union_cache = kmem_cache_create("union_mount",
+					sizeof(struct union_mount), 0,
+					SLAB_HWCACHE_ALIGN | SLAB_PANIC,
+					NULL, NULL);
+
+	union_hashtable = alloc_large_system_hash("Union-cache",
+						  sizeof(struct hlist_head),
+						  union_hash_entries,
+						  14,
+						  0,
+						  &union_hash_shift,
+						  &union_hash_mask,
+						  0);
+
+	for (loop = 0; loop < (1 << union_hash_shift); loop++)
+		INIT_HLIST_HEAD(&union_hashtable[loop]);
+
+
+	union_rhashtable = alloc_large_system_hash("rUnion-cache",
+						  sizeof(struct hlist_head),
+						  union_hash_entries,
+						  14,
+						  0,
+						  &union_rhash_shift,
+						  &union_rhash_mask,
+						  0);
+
+	for (loop = 0; loop < (1 << union_rhash_shift); loop++)
+		INIT_HLIST_HEAD(&union_rhashtable[loop]);
+
+	return 0;
+}
+
+fs_initcall(init_union);
+
+struct union_mount *union_alloc(struct dentry *this, struct vfsmount *this_mnt,
+				struct dentry *next, struct vfsmount *next_mnt)
+{
+	struct union_mount *um;
+
+	BUG_ON(!S_ISDIR(this->d_inode->i_mode));
+	BUG_ON(!S_ISDIR(next->d_inode->i_mode));
+
+	um = kmem_cache_alloc(union_cache, GFP_ATOMIC);
+	if (!um)
+		return NULL;
+
+	atomic_set(&um->u_count, 1);
+	INIT_LIST_HEAD(&um->u_unions);
+	INIT_HLIST_NODE(&um->u_hash);
+	INIT_HLIST_NODE(&um->u_rhash);
+
+	um->u_this.mnt = this_mnt;
+	um->u_this.dentry = this;
+	um->u_next.mnt = mntget(next_mnt);
+	um->u_next.dentry = dget(next);
+
+	return um;
+}
+
+struct union_mount *union_get(struct union_mount *um)
+{
+	BUG_ON(!atomic_read(&um->u_count));
+	atomic_inc(&um->u_count);
+	return um;
+}
+
+static int __union_put(struct union_mount *um)
+{
+	if (!atomic_dec_and_test(&um->u_count))
+		return 0;
+
+	BUG_ON(!hlist_unhashed(&um->u_hash));
+	BUG_ON(!hlist_unhashed(&um->u_rhash));
+
+	kmem_cache_free(union_cache, um);
+	return 1;
+}
+
+void union_put(struct union_mount *um)
+{
+	struct path tmp = um->u_next;
+
+	if (__union_put(um))
+		pathput(&tmp);
+}
+
+static void __union_hash(struct union_mount *um)
+{
+	hlist_add_head(&um->u_hash, union_hashtable +
+		       hash(um->u_this.dentry, um->u_this.mnt));
+	hlist_add_head(&um->u_rhash, union_rhashtable +
+		       hash(um->u_next.dentry, um->u_next.mnt));
+}
+
+static void __union_unhash(struct union_mount *um)
+{
+	hlist_del_init(&um->u_hash);
+	hlist_del_init(&um->u_rhash);
+}
+
+struct union_mount *union_lookup(struct dentry *dentry, struct vfsmount *mnt)
+{
+	struct hlist_head *head = union_hashtable + hash(dentry, mnt);
+	struct hlist_node *node;
+	struct union_mount *um;
+
+	hlist_for_each_entry(um, node, head, u_hash) {
+		if ((um->u_this.dentry == dentry) &&
+		    (um->u_this.mnt == mnt))
+			return um;
+	}
+
+	return NULL;
+}
+
+struct union_mount *union_rlookup(struct dentry *dentry, struct vfsmount *mnt)
+{
+	struct hlist_head *head = union_rhashtable + hash(dentry, mnt);
+	struct hlist_node *node;
+	struct union_mount *um;
+
+	hlist_for_each_entry(um, node, head, u_rhash) {
+		if ((um->u_next.dentry == dentry) &&
+		    (um->u_next.mnt == mnt))
+			return um;
+	}
+
+	return NULL;
+}
+
+/*
+ * is_unionized - check if a dentry lives on a union mounted file system
+ *
+ * This tests if a dentry is living on an union mounted file system by walking
+ * the file system hierarchy.
+ */
+int is_unionized(struct dentry *dentry, struct vfsmount *mnt)
+{
+	struct path this = { .mnt = mntget(mnt),
+			     .dentry = dget(dentry) };
+	struct vfsmount *tmp;
+
+	do {
+		/* check if there is an union mounted on top of us */
+		spin_lock(&vfsmount_lock);
+		list_for_each_entry(tmp, &this.mnt->mnt_mounts, mnt_child) {
+			if (!(tmp->mnt_flags & MNT_UNION))
+				continue;
+			/* Isn't this a bug? */
+			if (this.dentry->d_sb != tmp->mnt_mountpoint->d_sb)
+				continue;
+			if (lives_below_in_same_fs(this.dentry,
+						   tmp->mnt_mountpoint)) {
+				spin_unlock(&vfsmount_lock);
+				pathput(&this);
+				return 1;
+			}
+		}
+		spin_unlock(&vfsmount_lock);
+
+		/* check our mountpoint next */
+		tmp = mntget(this.mnt->mnt_parent);
+		dput(this.dentry);
+		this.dentry = dget(this.mnt->mnt_mountpoint);
+		mntput(this.mnt);
+		this.mnt = tmp;
+	} while (this.mnt != this.mnt->mnt_parent);
+
+	pathput(&this);
+	return 0;
+}
+
+int append_to_union(struct vfsmount *mnt, struct dentry *dentry,
+		    struct vfsmount *dest_mnt, struct dentry *dest_dentry)
+{
+	struct union_mount *this, *um;
+
+	BUG_ON(!IS_MNT_UNION(mnt));
+
+	this = union_alloc(dentry, mnt, dest_dentry, dest_mnt);
+	if (!this)
+		return -ENOMEM;
+
+	spin_lock(&union_lock);
+	um = union_lookup(dentry, mnt);
+	if (um) {
+		BUG_ON((um->u_next.dentry != dest_dentry) ||
+		       (um->u_next.mnt != dest_mnt));
+		spin_unlock(&union_lock);
+		union_put(this);
+		return 0;
+	}
+	__union_hash(this);
+	spin_unlock(&union_lock);
+	return 0;
+}
+
+/*
+ * follow_union_down - follow the union stack one layer down
+ *
+ * This is called to traverse the union stack from one layer to the next
+ * overlayed one. follow_union_down() is called by various lookup functions
+ * that are aware of union mounts.
+ *
+ * Returns none zero if followed to the next layer, zero otherwise.
+ */
+int follow_union_down(struct vfsmount **mnt, struct dentry **dentry)
+{
+	struct union_mount *um;
+
+	if (!IS_MNT_UNION(*mnt))
+		return 0;
+
+	spin_lock(&union_lock);
+	um = union_lookup(*dentry, *mnt);
+	spin_unlock(&union_lock);
+	if (um) {
+		pathget(&um->u_next);
+		dput(*dentry);
+		*dentry = um->u_next.dentry;
+		mntput(*mnt);
+		*mnt = um->u_next.mnt;
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * follow_union_mount - follow the union stack to the topmost layer
+ *
+ * This is called to traverse the union stack to the topmost layer. This is
+ * necessary for following parent pointers in an union mount.
+ *
+ * Returns none zero if followed to the topmost layer, zero otherwise.
+ */
+int follow_union_mount(struct vfsmount **mnt, struct dentry **dentry)
+{
+	struct union_mount *um;
+	int res = 0;
+
+	while (IS_UNION(*dentry)) {
+		spin_lock(&dcache_lock);
+		spin_lock(&union_lock);
+		um = union_rlookup(*dentry, *mnt);
+		if (um)
+			pathget(&um->u_this);
+		spin_unlock(&union_lock);
+		spin_unlock(&dcache_lock);
+
+		/*
+		 * Q: Aaargh, how do I validate the topmost dentry pointer?
+		 * A: Eeeeasy! We took the dcache_lock and union_lock. Since
+		 *    this protects from any dput'ng going on, we know that the
+		 *    dentry is valid since the union is unhashed under
+		 *    dcache_lock too.
+		 */
+		if (!um)
+			break;
+		dput(*dentry);
+		*dentry = um->u_this.dentry;
+		mntput(*mnt);
+		*mnt = um->u_this.mnt;
+		res = 1;
+	}
+
+	return res;
+}
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -93,6 +93,15 @@ struct dentry {
 	struct dentry *d_parent;	/* parent directory */
 	struct qstr d_name;
 
+#ifdef CONFIG_UNION_MOUNT
+	/*
+	 * The following fields are used by the VFS based union mount
+	 * implementation. Both are protected by union_lock!
+	 */
+	struct list_head d_unions;	/* list of union_mount's */
+	unsigned int d_unionized;	/* unions referencing this dentry */
+#endif
+
 	struct list_head d_lru;		/* LRU list */
 	/*
 	 * d_child and d_rcu can share memory
--- /dev/null
+++ b/include/linux/union.h
@@ -0,0 +1,61 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
+ * Copyright (C) 2007 Novell Inc.
+ *   Author(s): Jan Blunck (j.blunck@tu-harburg.de)
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef __LINUX_UNION_H
+#define __LINUX_UNION_H
+#ifdef __KERNEL__
+
+#include <linux/list.h>
+#include <asm/atomic.h>
+
+struct dentry;
+struct vfsmount;
+
+#ifdef CONFIG_UNION_MOUNT
+
+/*
+ * The new union mount structure.
+ */
+struct union_mount {
+	atomic_t u_count;		/* reference count */
+	struct mutex u_mutex;
+	struct list_head u_unions;	/* list head for d_unions */
+	struct hlist_node u_hash;	/* list head for seaching */
+	struct hlist_node u_rhash;	/* list head for reverse seaching */
+
+	struct path u_this;		/* this is me */
+	struct path u_next;		/* this is what I overlay */
+};
+
+#define IS_UNION(dentry)	(!list_empty(&(dentry)->d_unions) || \
+				 (dentry)->d_unionized)
+#define IS_MNT_UNION(mnt)	((mnt)->mnt_flags & MNT_UNION)
+
+extern int is_unionized(struct dentry *, struct vfsmount *);
+extern int append_to_union(struct vfsmount *, struct dentry *,
+			   struct vfsmount *, struct dentry *);
+extern int follow_union_down(struct vfsmount **, struct dentry **);
+extern int follow_union_mount(struct vfsmount **, struct dentry **);
+
+#else /* CONFIG_UNION_MOUNT */
+
+#define IS_UNION(x)			(0)
+#define IS_MNT_UNION(x)			(0)
+#define is_unionized(x, y)		(0)
+#define append_to_union(x1, y1, x2, y2)	({ BUG(); (0); })
+#define follow_union_down(x, y)		({ (0); })
+#define follow_union_mount(x, y)	({ (0); })
+
+#endif	/* CONFIG_UNION_MOUNT */
+#endif	/* __KERNEL__ */
+#endif	/* __LINUX_UNION_H */

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 17/26] union-mount: Drive the union cache via dcache
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (15 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 16/26] union-mount: Introduce union_mount structure Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 18/26] union-mount: Changes to the namespace handling Jan Blunck
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/union-mount-dentry-refcount.diff --]
[-- Type: text/plain, Size: 5771 bytes --]

If a dentry is removed from dentry cache because its usage count drops to
zero, the references to the underlying layer of the unions the dentry is in
are droped too. Therefore the union cache is driven by the dentry cache.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/dcache.c            |    8 +++++
 fs/union.c             |   72 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dcache.h |    8 +++++
 include/linux/union.h  |    6 ++++
 4 files changed, 94 insertions(+)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -18,6 +18,7 @@
 #include <linux/string.h>
 #include <linux/mm.h>
 #include <linux/fs.h>
+#include <linux/union.h>
 #include <linux/fsnotify.h>
 #include <linux/slab.h>
 #include <linux/init.h>
@@ -142,11 +143,14 @@ static struct dentry *__d_kill(struct de
 		list_add(&dentry->d_lru, list);
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
+		__shrink_d_unions(dentry, list);
 		return NULL;
 	}
 
 	/* drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
+	/* If the dentry was in an union delete them */
+	shrink_d_unions(dentry);
 	parent = dentry->d_parent;
 	d_free(dentry);
 	return dentry == parent ? NULL : parent;
@@ -721,6 +725,7 @@ static void shrink_dcache_for_umount_sub
 					iput(inode);
 			}
 
+			shrink_d_unions(dentry);
 			d_free(dentry);
 
 			/* finished when we fall off the top of the tree,
@@ -1464,7 +1469,9 @@ void d_delete(struct dentry * dentry)
 	spin_lock(&dentry->d_lock);
 	isdir = S_ISDIR(dentry->d_inode->i_mode);
 	if (atomic_read(&dentry->d_count) == 1) {
+		__d_drop_unions(dentry);
 		dentry_iput(dentry);
+		shrink_d_unions(dentry);
 		fsnotify_nameremove(dentry, isdir);
 
 		/* remove this and other inotify debug checks after 2.6.18 */
@@ -1478,6 +1485,7 @@ void d_delete(struct dentry * dentry)
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 
+	shrink_d_unions(dentry);
 	fsnotify_nameremove(dentry, isdir);
 }
 
--- a/fs/union.c
+++ b/fs/union.c
@@ -258,6 +258,8 @@ int append_to_union(struct vfsmount *mnt
 		union_put(this);
 		return 0;
 	}
+	list_add(&this->u_unions, &dentry->d_unions);
+	dest_dentry->d_unionized++;
 	__union_hash(this);
 	spin_unlock(&union_lock);
 	return 0;
@@ -333,3 +335,73 @@ int follow_union_mount(struct vfsmount *
 
 	return res;
 }
+
+/*
+ * This must be called when unhashing a dentry. This is called with dcache_lock
+ * and unhashes all unions this dentry is in.
+ */
+void __d_drop_unions(struct dentry *dentry)
+{
+	struct union_mount *this, *next;
+
+	spin_lock(&union_lock);
+	list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions)
+		__union_unhash(this);
+	spin_unlock(&union_lock);
+}
+
+/*
+ * This must be called after __d_drop_unions() without holding any locks.
+ * Note: The dentry might still be reachable via a lookup but at that time it
+ * already a negative dentry. Otherwise it would be unhashed. The union_mount
+ * structure itself is still reachable through mnt->mnt_unions (which we
+ * protect against with union_lock).
+ */
+void shrink_d_unions(struct dentry *dentry)
+{
+	struct union_mount *this, *next;
+
+repeat:
+	spin_lock(&union_lock);
+	list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions) {
+		BUG_ON(!hlist_unhashed(&this->u_hash));
+		BUG_ON(!hlist_unhashed(&this->u_rhash));
+		list_del(&this->u_unions);
+		this->u_next.dentry->d_unionized--;
+		spin_unlock(&union_lock);
+		union_put(this);
+		goto repeat;
+	}
+	spin_unlock(&union_lock);
+}
+
+extern void __dput(struct dentry *, struct list_head *);
+
+/*
+ * This is the special variant for use in dput() only.
+ */
+void __shrink_d_unions(struct dentry *dentry, struct list_head *list)
+{
+	struct union_mount *this, *next;
+
+	BUG_ON(!d_unhashed(dentry));
+
+repeat:
+	spin_lock(&union_lock);
+	list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions) {
+		struct dentry *n_dentry = this->u_next.dentry;
+		struct vfsmount *n_mnt = this->u_next.mnt;
+
+		BUG_ON(!hlist_unhashed(&this->u_hash));
+		BUG_ON(!hlist_unhashed(&this->u_rhash));
+		list_del(&this->u_unions);
+		this->u_next.dentry->d_unionized--;
+		spin_unlock(&union_lock);
+		if (__union_put(this)) {
+			__dput(n_dentry, list);
+			mntput(n_mnt);
+		}
+		goto repeat;
+	}
+	spin_unlock(&union_lock);
+}
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -204,12 +204,20 @@ extern spinlock_t dcache_lock;
  * __d_drop requires dentry->d_lock.
  */
 
+#ifdef CONFIG_UNION_MOUNT
+extern void __d_drop_unions(struct dentry *);
+#endif
+
 static inline void __d_drop(struct dentry *dentry)
 {
 	if (!(dentry->d_flags & DCACHE_UNHASHED)) {
 		dentry->d_flags |= DCACHE_UNHASHED;
 		hlist_del_rcu(&dentry->d_hash);
 	}
+#ifdef CONFIG_UNION_MOUNT
+	/* remove dentry from the union hashtable */
+	__d_drop_unions(dentry);
+#endif
 }
 
 static inline void d_drop(struct dentry *dentry)
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -46,6 +46,9 @@ extern int append_to_union(struct vfsmou
 			   struct vfsmount *, struct dentry *);
 extern int follow_union_down(struct vfsmount **, struct dentry **);
 extern int follow_union_mount(struct vfsmount **, struct dentry **);
+extern void __d_drop_unions(struct dentry *);
+extern void shrink_d_unions(struct dentry *);
+extern void __shrink_d_unions(struct dentry *, struct list_head *);
 
 #else /* CONFIG_UNION_MOUNT */
 
@@ -55,6 +58,9 @@ extern int follow_union_mount(struct vfs
 #define append_to_union(x1, y1, x2, y2)	({ BUG(); (0); })
 #define follow_union_down(x, y)		({ (0); })
 #define follow_union_mount(x, y)	({ (0); })
+#define __d_drop_unions(x)		do { } while (0)
+#define shrink_d_unions(x)		do { } while (0)
+#define __shrink_d_unions(x)		do { } while (0)
 
 #endif	/* CONFIG_UNION_MOUNT */
 #endif	/* __KERNEL__ */

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 18/26] union-mount: Changes to the namespace handling
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (16 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 17/26] union-mount: Drive the union cache via dcache Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-08-08 10:10   ` Bharata B Rao
  2007-07-30 16:13 ` [RFC 19/26] union-mount: Make lookup work for union-mounted file systems Jan Blunck
                   ` (9 subsequent siblings)
  27 siblings, 1 reply; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/union-mount-mount-changes.diff --]
[-- Type: text/plain, Size: 8685 bytes --]

Creates the proper struct union_mount when mounting something into a
union. If the topmost filesystem isn't capable of handling the white-out
filetype it could only be mount read-only.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/namespace.c        |   46 ++++++++++++++++++++++++++++++++++++++--
 fs/union.c            |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mount.h |    3 ++
 include/linux/union.h |    6 +++++
 4 files changed, 110 insertions(+), 2 deletions(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -25,6 +25,7 @@
 #include <linux/security.h>
 #include <linux/mount.h>
 #include <linux/ramfs.h>
+#include <linux/union.h>
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
 #include "pnode.h"
@@ -68,6 +69,9 @@ struct vfsmount *alloc_vfsmnt(const char
 		INIT_LIST_HEAD(&mnt->mnt_share);
 		INIT_LIST_HEAD(&mnt->mnt_slave_list);
 		INIT_LIST_HEAD(&mnt->mnt_slave);
+#ifdef CONFIG_UNION_MOUNT
+		INIT_LIST_HEAD(&mnt->mnt_unions);
+#endif
 		if (name) {
 			int size = strlen(name) + 1;
 			char *newname = kmalloc(size, GFP_KERNEL);
@@ -157,6 +161,7 @@ static void __touch_mnt_namespace(struct
 
 static void detach_mnt(struct vfsmount *mnt, struct nameidata *old_nd)
 {
+	detach_mnt_union(mnt);
 	old_nd->dentry = mnt->mnt_mountpoint;
 	old_nd->mnt = mnt->mnt_parent;
 	mnt->mnt_parent = mnt;
@@ -180,6 +185,7 @@ static void attach_mnt(struct vfsmount *
 	list_add_tail(&mnt->mnt_hash, mount_hashtable +
 			hash(nd->mnt, nd->dentry));
 	list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts);
+	attach_mnt_union(mnt, nd->mnt, nd->dentry);
 }
 
 /*
@@ -202,6 +208,7 @@ static void commit_tree(struct vfsmount 
 	list_add_tail(&mnt->mnt_hash, mount_hashtable +
 				hash(parent, mnt->mnt_mountpoint));
 	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
+	attach_mnt_union(mnt, mnt->mnt_parent, mnt->mnt_mountpoint);
 	touch_mnt_namespace(n);
 }
 
@@ -577,6 +584,7 @@ void release_mounts(struct list_head *he
 			struct dentry *dentry;
 			struct vfsmount *m;
 			spin_lock(&vfsmount_lock);
+			detach_mnt_union(mnt);
 			dentry = mnt->mnt_mountpoint;
 			m = mnt->mnt_parent;
 			mnt->mnt_mountpoint = mnt->mnt_root;
@@ -999,6 +1007,10 @@ static int do_change_type(struct nameida
 	if (nd->dentry != nd->mnt->mnt_root)
 		return -EINVAL;
 
+	/* Don't change the type of union mounts */
+	if (IS_MNT_UNION(nd->mnt))
+		return -EINVAL;
+
 	down_write(&namespace_sem);
 	spin_lock(&vfsmount_lock);
 	for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL))
@@ -1011,7 +1023,8 @@ static int do_change_type(struct nameida
 /*
  * do loopback mount.
  */
-static int do_loopback(struct nameidata *nd, char *old_name, int flags)
+static int do_loopback(struct nameidata *nd, char *old_name, int flags,
+		       int mnt_flags)
 {
 	int clone_flags = 0;
 	uid_t owner = 0;
@@ -1049,6 +1062,18 @@ static int do_loopback(struct nameidata 
 	if (IS_ERR(mnt))
 		goto out;
 
+	/*
+	 * Unions couldn't be writable if the filesystem doesn't know about
+	 * whiteouts
+	 */
+	err = -ENOTSUPP;
+	if ((mnt_flags & MNT_UNION) &&
+	    !(mnt->mnt_sb->s_flags & (MS_WHITEOUT|MS_RDONLY)))
+		goto out;
+
+	if (mnt_flags & MNT_UNION)
+		mnt->mnt_flags |= MNT_UNION;
+
 	err = graft_tree(mnt, nd);
 	if (err) {
 		LIST_HEAD(umount_list);
@@ -1121,6 +1146,13 @@ static int do_move_mount(struct nameidat
 	if (err)
 		return err;
 
+	/* moving to or from a union mount is not supported */
+	err = -EINVAL;
+	if (IS_MNT_UNION(nd->mnt))
+		goto exit;
+	if (IS_MNT_UNION(old_nd.mnt))
+		goto exit;
+
 	down_write(&namespace_sem);
 	while (d_mountpoint(nd->dentry) && follow_down(&nd->mnt, &nd->dentry))
 		;
@@ -1176,6 +1208,7 @@ out:
 	up_write(&namespace_sem);
 	if (!err)
 		path_release(&parent_nd);
+exit:
 	path_release(&old_nd);
 	return err;
 }
@@ -1253,6 +1286,15 @@ int do_add_mount(struct vfsmount *newmnt
 	if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
 		goto unlock;
 
+	/*
+	 * Unions couldn't be writable if the filesystem doesn't know about
+	 * whiteouts
+	 */
+	err = -ENOTSUPP;
+	if ((mnt_flags & MNT_UNION) &&
+	    !(newmnt->mnt_sb->s_flags & (MS_WHITEOUT|MS_RDONLY)))
+		goto unlock;
+
 	/* some flags may have been set earlier */
 	newmnt->mnt_flags |= mnt_flags;
 	if ((err = graft_tree(newmnt, nd)))
@@ -1579,7 +1621,7 @@ long do_mount(char *dev_name, char *dir_
 		retval = do_remount(&nd, flags & ~MS_REMOUNT, mnt_flags,
 				    data_page);
 	else if (flags & MS_BIND)
-		retval = do_loopback(&nd, dev_name, flags);
+		retval = do_loopback(&nd, dev_name, flags, mnt_flags);
 	else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
 		retval = do_change_type(&nd, flags);
 	else if (flags & MS_MOVE)
--- a/fs/union.c
+++ b/fs/union.c
@@ -114,6 +114,7 @@ struct union_mount *union_alloc(struct d
 
 	atomic_set(&um->u_count, 1);
 	INIT_LIST_HEAD(&um->u_unions);
+	INIT_LIST_HEAD(&um->u_list);
 	INIT_HLIST_NODE(&um->u_hash);
 	INIT_HLIST_NODE(&um->u_rhash);
 
@@ -258,6 +259,7 @@ int append_to_union(struct vfsmount *mnt
 		union_put(this);
 		return 0;
 	}
+	list_add(&this->u_list, &mnt->mnt_unions);
 	list_add(&this->u_unions, &dentry->d_unions);
 	dest_dentry->d_unionized++;
 	__union_hash(this);
@@ -366,6 +368,7 @@ repeat:
 	list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions) {
 		BUG_ON(!hlist_unhashed(&this->u_hash));
 		BUG_ON(!hlist_unhashed(&this->u_rhash));
+		list_del(&this->u_list);
 		list_del(&this->u_unions);
 		this->u_next.dentry->d_unionized--;
 		spin_unlock(&union_lock);
@@ -394,6 +397,7 @@ repeat:
 
 		BUG_ON(!hlist_unhashed(&this->u_hash));
 		BUG_ON(!hlist_unhashed(&this->u_rhash));
+		list_del(&this->u_list);
 		list_del(&this->u_unions);
 		this->u_next.dentry->d_unionized--;
 		spin_unlock(&union_lock);
@@ -405,3 +409,56 @@ repeat:
 	}
 	spin_unlock(&union_lock);
 }
+
+/*
+ * Remove all union_mounts structures belonging to this vfsmount from the
+ * union lookup hashtable and so on ...
+ */
+void shrink_mnt_unions(struct vfsmount *mnt)
+{
+	struct union_mount *this, *next;
+
+repeat:
+	spin_lock(&union_lock);
+	list_for_each_entry_safe(this, next, &mnt->mnt_unions, u_list) {
+		if (this->u_this.dentry == mnt->mnt_root)
+			continue;
+		__union_unhash(this);
+		list_del(&this->u_list);
+		list_del(&this->u_unions);
+		this->u_next.dentry->d_unionized--;
+		spin_unlock(&union_lock);
+		union_put(this);
+		goto repeat;
+	}
+	spin_unlock(&union_lock);
+}
+
+int attach_mnt_union(struct vfsmount *mnt, struct vfsmount *dest_mnt,
+		     struct dentry *dest_dentry)
+{
+	if (!IS_MNT_UNION(mnt))
+		return 0;
+
+	return append_to_union(mnt, mnt->mnt_root, dest_mnt, dest_dentry);
+}
+
+void detach_mnt_union(struct vfsmount *mnt)
+{
+	struct union_mount *um;
+
+	if (!IS_MNT_UNION(mnt))
+		return;
+
+	shrink_mnt_unions(mnt);
+
+	spin_lock(&union_lock);
+	um = union_lookup(mnt->mnt_root, mnt);
+	__union_unhash(um);
+	list_del(&um->u_list);
+	list_del(&um->u_unions);
+	um->u_next.dentry->d_unionized--;
+	spin_unlock(&union_lock);
+	union_put(um);
+	return;
+}
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -56,6 +56,9 @@ struct vfsmount {
 	struct list_head mnt_slave;	/* slave list entry */
 	struct vfsmount *mnt_master;	/* slave is on master->mnt_slave_list */
 	struct mnt_namespace *mnt_ns;	/* containing namespace */
+#ifdef CONFIG_UNION_MOUNT
+	struct list_head mnt_unions;	/* list of union_mount structures */
+#endif
 	/*
 	 * We put mnt_count & mnt_expiry_mark at the end of struct vfsmount
 	 * to let these frequently modified fields in a separate cache line
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -30,6 +30,7 @@ struct union_mount {
 	atomic_t u_count;		/* reference count */
 	struct mutex u_mutex;
 	struct list_head u_unions;	/* list head for d_unions */
+	struct list_head u_list;	/* list head for mnt_unions */
 	struct hlist_node u_hash;	/* list head for seaching */
 	struct hlist_node u_rhash;	/* list head for reverse seaching */
 
@@ -49,6 +50,9 @@ extern int follow_union_mount(struct vfs
 extern void __d_drop_unions(struct dentry *);
 extern void shrink_d_unions(struct dentry *);
 extern void __shrink_d_unions(struct dentry *, struct list_head *);
+extern int attach_mnt_union(struct vfsmount *, struct vfsmount *,
+			    struct dentry *);
+extern void detach_mnt_union(struct vfsmount *);
 
 #else /* CONFIG_UNION_MOUNT */
 
@@ -61,6 +65,8 @@ extern void __shrink_d_unions(struct den
 #define __d_drop_unions(x)		do { } while (0)
 #define shrink_d_unions(x)		do { } while (0)
 #define __shrink_d_unions(x)		do { } while (0)
+#define attach_mnt_union(x, y, z)	do { } while (0)
+#define detach_mnt_union(x)		do { } while (0)
 
 #endif	/* CONFIG_UNION_MOUNT */
 #endif	/* __KERNEL__ */

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 19/26] union-mount: Make lookup work for union-mounted file systems
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (17 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 18/26] union-mount: Changes to the namespace handling Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-08-09  5:42   ` Bharata B Rao
  2007-07-30 16:13 ` [RFC 20/26] union-mount: Simple union-mount readdir implementation Jan Blunck
                   ` (8 subsequent siblings)
  27 siblings, 1 reply; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/union-mount-lookup.diff --]
[-- Type: text/plain, Size: 16034 bytes --]

On union-mounted file systems the lookup function must also visit lower layers
of the union-stack when doing a lookup. This patches add support for
union-mounts to cached lookups and real lookups.

We have 3 different styles of lookup functions now:
- multiple pathname components, follow mounts, follow union, follow symlinks
- single pathname component, doesn't follow mounts, follow union, doesn't
  follow symlinks
- single pathname component doesn't follow mounts, doesn't follow unions,
  doesn't follow symlinks

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/namei.c            |  467 +++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/namei.h |    6 
 2 files changed, 465 insertions(+), 8 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -31,6 +31,7 @@
 #include <linux/file.h>
 #include <linux/fcntl.h>
 #include <linux/namei.h>
+#include <linux/union.h>
 #include <asm/namei.h>
 #include <asm/uaccess.h>
 
@@ -415,6 +416,167 @@ static struct dentry *cache_lookup(struc
 }
 
 /*
+ * cache_lookup_topmost - lookup the topmost (non-)negative dentry
+ *
+ * This is used for union mount lookups from dcache. The first non-negative
+ * dentry is searched on all layers of the union stack. Otherwise the topmost
+ * negative dentry is return.
+ */
+static int __cache_lookup_topmost(struct nameidata *nd, struct qstr *name,
+				  struct path *path)
+{
+	struct dentry *dentry;
+
+	dentry = d_lookup(nd->dentry, name);
+	if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
+		dentry = do_revalidate(dentry, nd);
+
+	/*
+	 * Remember the topmost negative dentry in case we don't find anything
+	 */
+	path->dentry = dentry;
+	path->mnt = dentry ? nd->mnt : NULL;
+
+	if (!dentry || dentry->d_inode)
+		return !dentry;
+
+	/* look for the first non-negative dentry */
+
+	while (follow_union_down(&nd->mnt, &nd->dentry)) {
+		dentry = d_hash_and_lookup(nd->dentry, name);
+
+		/*
+		 * If parts of the union stack are not in the dcache we need
+		 * to do a real lookup
+		 */
+		if (!dentry)
+			goto out_dput;
+
+		/*
+		 * If parts of the union don't survive the revalidation we
+		 * need to do a real lookup
+		 */
+		if (dentry->d_op && dentry->d_op->d_revalidate) {
+			dentry = do_revalidate(dentry, nd);
+			if (!dentry)
+				goto out_dput;
+		}
+
+		if (dentry->d_inode)
+			goto out_dput;
+
+		dput(dentry);
+	}
+
+	return !dentry;
+
+out_dput:
+	dput(path->dentry);
+	path->dentry = dentry;
+	path->mnt = dentry ? mntget(nd->mnt) : NULL;
+	return !dentry;
+}
+
+/*
+ * cache_lookup_union - lookup the rest of the union stack
+ *
+ * This is called after you have the topmost dentry in @path.
+ */
+static int __cache_lookup_union(struct nameidata *nd, struct qstr *name,
+				struct path *path)
+{
+	struct path last = *path;
+	struct dentry *dentry;
+
+	while (follow_union_down(&nd->mnt, &nd->dentry)) {
+		dentry = d_hash_and_lookup(nd->dentry, name);
+		if (!dentry)
+			return 1;
+
+		if (dentry->d_op && dentry->d_op->d_revalidate) {
+			dentry = do_revalidate(dentry, nd);
+			if (!dentry)
+				return 1;
+		}
+
+		if (!dentry->d_inode) {
+			dput(dentry);
+			continue;
+		}
+
+		/* only directories can be part of a union stack */
+		if (!S_ISDIR(dentry->d_inode->i_mode)) {
+			dput(dentry);
+			break;
+		}
+
+		/* now we know we found something "real"  */
+		append_to_union(last.mnt, last.dentry, nd->mnt, dentry);
+
+		if (last.dentry != path->dentry)
+			pathput(&last);
+		last.dentry = dentry;
+		last.mnt = mntget(nd->mnt);
+	}
+
+	if (last.dentry != path->dentry)
+		pathput(&last);
+
+	return 0;
+}
+
+/*
+ * cache_lookup - lookup a single pathname part from dcache
+ *
+ * This is a union mount capable version of what d_lookup() & revalidate()
+ * would do. This function returns a valid (union) dentry on success.
+ *
+ * Remember: On failure it means that parts of the union aren't cached. You
+ * should call real_lookup() afterwards to find the proper (union) dentry.
+ */
+static int cache_lookup_union(struct nameidata *nd, struct qstr *name,
+			      struct path *path)
+{
+	int res ;
+
+	if (!IS_MNT_UNION(nd->mnt)) {
+		path->dentry = cache_lookup(nd->dentry, name, nd);
+		path->mnt = path->dentry ? nd->mnt : NULL;
+		res = path->dentry ? 0 : 1;
+	} else {
+		struct path safe = { .dentry = nd->dentry, .mnt = nd->mnt };
+
+		pathget(&safe);
+		res = __cache_lookup_topmost(nd, name, path);
+		if (res)
+			goto out;
+
+		/* only directories can be part of a union stack */
+		if (!path->dentry->d_inode ||
+		    !S_ISDIR(path->dentry->d_inode->i_mode))
+			goto out;
+
+		/* revalidate the union */
+
+		/* continue lookup on the lower layers of the union */
+		res = __cache_lookup_union(nd, name, path);
+		if (res) {
+			dput(path->dentry);
+			if (path->mnt != safe.mnt)
+				mntput(path->mnt);
+			goto out;
+		}
+
+out:
+		path_release(nd);
+		nd->dentry = safe.dentry;
+		nd->mnt = safe.mnt;
+	}
+
+	return res;
+}
+
+/*
  * Short-cut version of permission(), for calling by
  * path_walk(), when dcache lock is held.  Combines parts
  * of permission() and generic_permission(), and tests ONLY for
@@ -527,6 +689,151 @@ static int real_lookup(struct nameidata 
 	return res;
 }
 
+static inline void copy_nameidata(struct nameidata *old, struct nameidata *new)
+{
+	memset(new, 0, sizeof(*new));
+
+	/* Maybbe we are called via lookup_hash() with a NULL nd argument */
+	if (old) {
+		new->flags = old->flags;
+		memcpy(&new->intent, &old->intent, sizeof(new->intent));
+	}
+}
+
+/*
+ * This is called when a dentries parent is union-mounted and we have
+ * to lookup the overlaid dentries. The lookup starts at the parents
+ * first overlaid dentry of the given dentry. Negative dentries are
+ * ignored and not included in the overlaid list.
+ *
+ * If we reach a dentry with restricted access, we just stop the lookup
+ * because we shouldn't see through that dentry. Same thing for dentry
+ * type mismatch and whiteouts.
+ *
+ * FIXME:
+ * - handle DT_WHT
+ * - handle union stacks in use
+ * - handle union stacks mounted upon union stacks
+ * - avoid unnecessary allocations of union locks
+ */
+static int __real_lookup_topmost(struct nameidata *nd, struct qstr *name,
+				 struct path *path)
+{
+	struct path next;
+	int err;
+
+	err = real_lookup(nd, name, path);
+	if (err)
+		return err;
+
+	if (path->dentry->d_inode)
+		return 0;
+
+	while (follow_union_down(&nd->mnt, &nd->dentry)) {
+		name->hash = full_name_hash(name->name, name->len);
+		if (nd->dentry->d_op && nd->dentry->d_op->d_hash) {
+			err = nd->dentry->d_op->d_hash(nd->dentry, name);
+			if (err < 0)
+				goto out;
+		}
+
+		err = real_lookup(nd, name, &next);
+		if (err)
+			goto out;
+
+		if (next.dentry->d_inode) {
+			dput(path->dentry);
+			mntget(next.mnt);
+			*path = next;
+			goto out;
+		}
+
+		dput(next.dentry);
+	}
+out:
+	if (err)
+		dput(path->dentry);
+	return err;
+}
+
+static int __real_lookup_union(struct nameidata *nd, struct qstr *name,
+			       struct path *path)
+{
+	struct path last = *path;
+	struct path next;
+	int err = 0;
+
+	while (follow_union_down(&nd->mnt, &nd->dentry)) {
+		/* We need to recompute the hash for lower layer lookups */
+		name->hash = full_name_hash(name->name, name->len);
+		if (nd->dentry->d_op && nd->dentry->d_op->d_hash) {
+			err = nd->dentry->d_op->d_hash(nd->dentry, name);
+			if (err < 0)
+				goto out;
+		}
+
+		err = real_lookup(nd, name, &next);
+		if (err)
+			goto out;
+
+		if (!next.dentry->d_inode) {
+			dput(next.dentry);
+			continue;
+		}
+
+		/* only directories can be part of a union stack */
+		if (!S_ISDIR(next.dentry->d_inode->i_mode)) {
+			dput(next.dentry);
+			break;
+		}
+
+		/* now we know we found something "real" */
+		append_to_union(last.mnt, last.dentry, next.mnt, next.dentry);
+
+		if (last.dentry != path->dentry)
+			pathput(&last);
+		last.dentry = next.dentry;
+		last.mnt = mntget(next.mnt);
+	}
+
+	if (last.dentry != path->dentry)
+		pathput(&last);
+out:
+	return err;
+}
+
+static int real_lookup_union(struct nameidata *nd, struct qstr *name,
+			     struct path *path)
+{
+	struct path safe = { .dentry = nd->dentry, .mnt = nd->mnt };
+	int res ;
+
+	pathget(&safe);
+	res = __real_lookup_topmost(nd, name, path);
+	if (res)
+		goto out;
+
+	/* only directories can be part of a union stack */
+	if (!path->dentry->d_inode ||
+	    !S_ISDIR(path->dentry->d_inode->i_mode))
+		goto out;
+
+	/* continue lookup on the lower layers of the union */
+	res = __real_lookup_union(nd, name, path);
+	if (res) {
+		dput(path->dentry);
+		if (path->mnt != safe.mnt)
+			mntput(path->mnt);
+		goto out;
+	}
+
+out:
+	path_release(nd);
+	nd->dentry = safe.dentry;
+	nd->mnt = safe.mnt;
+	return res;
+}
+
 static int __emul_lookup_dentry(const char *, struct nameidata *);
 
 /* SMP-safe */
@@ -755,6 +1062,7 @@ static __always_inline void follow_dotdo
 		nd->mnt = parent;
 	}
 	follow_mount(&nd->mnt, &nd->dentry);
+	follow_union_mount(&nd->mnt, &nd->dentry);
 }
 
 /*
@@ -767,6 +1075,9 @@ static int do_lookup(struct nameidata *n
 {
 	int err;
 
+	if (IS_MNT_UNION(nd->mnt))
+		goto need_union_lookup;
+
 	path->dentry = __d_lookup(nd->dentry, name);
 	path->mnt = nd->mnt;
 	if (!path->dentry)
@@ -775,7 +1086,12 @@ static int do_lookup(struct nameidata *n
 		goto need_revalidate;
 
 done:
-	__follow_mount(path);
+	if (nd->mnt != path->mnt) {
+		nd->um_flags |= LAST_LOWLEVEL;
+		follow_mount(&path->mnt, &path->dentry);
+	} else
+		__follow_mount(path);
+	follow_union_mount(&path->mnt, &path->dentry);
 	return 0;
 
 need_lookup:
@@ -784,6 +1100,16 @@ need_lookup:
 		goto fail;
 	goto done;
 
+need_union_lookup:
+	err = cache_lookup_union(nd, name, path);
+	if (!err && path->dentry)
+		goto done;
+
+	err = real_lookup_union(nd, name, path);
+	if (err)
+		goto fail;
+	goto done;
+
 need_revalidate:
 	path->dentry = do_revalidate(path->dentry, nd);
 	if (!path->dentry)
@@ -822,6 +1148,8 @@ static fastcall int __link_path_walk(con
 	if (nd->depth)
 		lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE);
 
+	follow_union_mount(&nd->mnt, &nd->dentry);
+
 	/* At this point we know we have a real path component. */
 	for(;;) {
 		unsigned long hash;
@@ -1111,6 +1439,7 @@ static int fastcall do_path_lookup(int d
 
 	nd->last_type = LAST_ROOT; /* if there are only slashes... */
 	nd->flags = flags;
+	nd->um_flags = 0;
 	nd->depth = 0;
 
 	if (*name=='/') {
@@ -1336,6 +1665,128 @@ out:
 	return err;
 }
 
+static int __hash_lookup_topmost(struct nameidata *nd, struct qstr *name,
+				 struct path *path)
+{
+	struct path next;
+	int err;
+
+	err = lookup_hash(nd, name, path);
+	if (err)
+		return err;
+
+	if (path->dentry->d_inode)
+		return 0;
+
+	while (follow_union_down(&nd->mnt, &nd->dentry)) {
+		name->hash = full_name_hash(name->name, name->len);
+		if (nd->dentry->d_op && nd->dentry->d_op->d_hash) {
+			err = nd->dentry->d_op->d_hash(nd->dentry, name);
+			if (err < 0)
+				goto out;
+		}
+
+		mutex_lock(&nd->dentry->d_inode->i_mutex);
+		err = lookup_hash(nd, name, &next);
+		mutex_unlock(&nd->dentry->d_inode->i_mutex);
+		if (err)
+			goto out;
+
+		if (next.dentry->d_inode) {
+			dput(path->dentry);
+			mntget(next.mnt);
+			*path = next;
+			goto out;
+		}
+
+		dput(next.dentry);
+	}
+out:
+	if (err)
+		dput(path->dentry);
+	return err;
+}
+
+static int __hash_lookup_union(struct nameidata *nd, struct qstr *name,
+			       struct path *path)
+{
+	struct path last = *path;
+	struct path next;
+	int err = 0;
+
+	while (follow_union_down(&nd->mnt, &nd->dentry)) {
+		/* We need to recompute the hash for lower layer lookups */
+		name->hash = full_name_hash(name->name, name->len);
+		if (nd->dentry->d_op && nd->dentry->d_op->d_hash) {
+			err = nd->dentry->d_op->d_hash(nd->dentry, name);
+			if (err < 0)
+				goto out;
+		}
+
+		mutex_lock(&nd->dentry->d_inode->i_mutex);
+		err = lookup_hash(nd, name, &next);
+		mutex_unlock(&nd->dentry->d_inode->i_mutex);
+		if (err)
+			goto out;
+
+		if (!next.dentry->d_inode) {
+			dput(next.dentry);
+			continue;
+		}
+
+		/* only directories can be part of a union stack */
+		if (!S_ISDIR(next.dentry->d_inode->i_mode)) {
+			dput(next.dentry);
+			break;
+		}
+
+		/* now we know we found something "real" */
+		append_to_union(last.mnt, last.dentry, next.mnt, next.dentry);
+
+		if (last.dentry != path->dentry)
+			pathput(&last);
+		last.dentry = next.dentry;
+		last.mnt = mntget(next.mnt);
+	}
+
+	if (last.dentry != path->dentry)
+		pathput(&last);
+out:
+	return err;
+}
+
+static int hash_lookup_union(struct nameidata *nd, struct qstr *name,
+			     struct path *path)
+{
+	struct path safe = { .dentry = nd->dentry, .mnt = nd->mnt };
+	int res ;
+
+	pathget(&safe);
+	res = __hash_lookup_topmost(nd, name, path);
+	if (res)
+		goto out;
+
+	/* only directories can be part of a union stack */
+	if (!path->dentry->d_inode ||
+	    !S_ISDIR(path->dentry->d_inode->i_mode))
+		goto out;
+
+	/* continue lookup on the lower layers of the union */
+	res = __hash_lookup_union(nd, name, path);
+	if (res) {
+		dput(path->dentry);
+		if (path->mnt != safe.mnt)
+			mntput(path->mnt);
+		goto out;
+	}
+
+out:
+	path_release(nd);
+	nd->dentry = safe.dentry;
+	nd->mnt = safe.mnt;
+	return res;
+}
+
 /* SMP-safe */
 static inline int __lookup_one_len(const char *name, struct qstr *this, struct dentry *base, int len)
 {
@@ -1740,7 +2191,7 @@ int open_namei(int dfd, const char *path
 	dir = nd->dentry;
 	nd->flags &= ~LOOKUP_PARENT;
 	mutex_lock(&dir->d_inode->i_mutex);
-	error = lookup_hash(nd, &nd->last, &path);
+	error = hash_lookup_union(nd, &nd->last, &path);
 
 do_last:
 	if (error) {
@@ -1846,7 +2297,7 @@ do_link:
 	}
 	dir = nd->dentry;
 	mutex_lock(&dir->d_inode->i_mutex);
-	error = lookup_hash(nd, &nd->last, &path);
+	error = hash_lookup_union(nd, &nd->last, &path);
 	__putname(nd->last.name);
 	goto do_last;
 }
@@ -1879,7 +2330,7 @@ int lookup_create(struct nameidata *nd, 
 	/*
 	 * Do the final lookup.
 	 */
-	err = lookup_hash(nd, &nd->last, path);
+	err = hash_lookup_union(nd, &nd->last, path);
 	if (err)
 		goto fail;
 
@@ -2501,7 +2952,7 @@ static long do_rmdir(int dfd, const char
 			goto exit1;
 	}
 	mutex_lock_nested(&nd.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
-	error = lookup_hash(&nd, &nd.last, &path);
+	error = hash_lookup_union(&nd, &nd.last, &path);
 	if (error)
 		goto exit2;
 	error = vfs_rmdir(nd.dentry->d_inode, path.dentry);
@@ -2575,7 +3026,7 @@ static long do_unlinkat(int dfd, const c
 	if (nd.last_type != LAST_NORM)
 		goto exit1;
 	mutex_lock_nested(&nd.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
-	error = lookup_hash(&nd, &nd.last, &path);
+	error = hash_lookup_union(&nd, &nd.last, &path);
 	if (!error) {
 		/* Why not before? Because we want correct error value */
 		if (nd.last.name[nd.last.len])
@@ -2960,7 +3411,7 @@ static int do_rename(int olddfd, const c
 
 	trap = lock_rename(new_dir, old_dir);
 
-	error = lookup_hash(&oldnd, &oldnd.last, &old);
+	error = hash_lookup_union(&oldnd, &oldnd.last, &old);
 	if (error)
 		goto exit3;
 	/* source must exist */
@@ -2979,7 +3430,7 @@ static int do_rename(int olddfd, const c
 	error = -EINVAL;
 	if (old.dentry == trap)
 		goto exit4;
-	error = lookup_hash(&newnd, &newnd.last, &new);
+	error = hash_lookup_union(&newnd, &newnd.last, &new);
 	if (error)
 		goto exit4;
 	/* target should not be an ancestor of source */
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -20,6 +20,7 @@ struct nameidata {
 	struct vfsmount *mnt;
 	struct qstr	last;
 	unsigned int	flags;
+	unsigned int	um_flags;
 	int		last_type;
 	unsigned	depth;
 	char *saved_names[MAX_NESTED_LINKS + 1];
@@ -40,6 +41,9 @@ struct path {
  */
 enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 
+#define LAST_UNION             0x01
+#define LAST_LOWLEVEL          0x02
+
 /*
  * The bitmask for a lookup event:
  *  - follow links at the end
@@ -55,6 +59,8 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LA
 #define LOOKUP_PARENT		16
 #define LOOKUP_NOALT		32
 #define LOOKUP_REVAL		64
+#define LOOKUP_TOPMOST	       128
+
 /*
  * Intent data
  */

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 20/26] union-mount: Simple union-mount readdir implementation
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (18 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 19/26] union-mount: Make lookup work for union-mounted file systems Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-08-06 11:08   ` Bharata B Rao
  2007-07-30 16:13 ` [RFC 21/26] union-mount: in-kernel file copy between union mounted filesystems Jan Blunck
                   ` (7 subsequent siblings)
  27 siblings, 1 reply; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/union-mount-readdir.diff --]
[-- Type: text/plain, Size: 11470 bytes --]

This is a very simple union mount readdir implementation. It modifies the
readdir routine to merge the entries of union mounted directories and
eliminate duplicates while walking the union stack.

  FIXME:
  This patch needs to be reworked! At the moment this only works for ext2 and
  tmpfs. All kind of index directories that return d_off > i_size don't work
  with this.

The directory entries are read starting from the top layer and they are
maintained in a cache. Subsequently when the entries from the bottom layers
of the union stack are read they are checked for duplicates (in the cache)
before being passed out to the user space. There can be multiple calls
to readdir/getdents routines for reading the entries of a single directory.
But union directory cache is not maitained across these calls. Instead
for every call, the previously read entries are re-read into the cache
and newly read entires are compared against these for duplicates before
being they are returned to user space.

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 fs/readdir.c          |   11 -
 fs/union.c            |  336 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/union.h |   25 +++
 3 files changed, 364 insertions(+), 8 deletions(-)

--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -16,13 +16,14 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/unistd.h>
+#include <linux/union.h>
 
 #include <asm/uaccess.h>
 
 int vfs_readdir(struct file *file, filldir_t filler, void *buf)
 {
-	struct inode *inode = file->f_path.dentry->d_inode;
 	int res = -ENOTDIR;
+
 	if (!file->f_op || !file->f_op->readdir)
 		goto out;
 
@@ -30,13 +31,7 @@ int vfs_readdir(struct file *file, filld
 	if (res)
 		goto out;
 
-	mutex_lock(&inode->i_mutex);
-	res = -ENOENT;
-	if (!IS_DEADDIR(inode)) {
-		res = file->f_op->readdir(file, buf, filler);
-		file_accessed(file);
-	}
-	mutex_unlock(&inode->i_mutex);
+	res = do_readdir(file, buf, filler);
 out:
 	return res;
 }
--- a/fs/union.c
+++ b/fs/union.c
@@ -18,6 +18,8 @@
 #include <linux/hash.h>
 #include <linux/fs.h>
 #include <linux/union.h>
+#include <linux/module.h>
+#include <linux/file.h>
 
 /*
  * This is borrowed from fs/inode.c. The hashtable for lookups. Somebody
@@ -462,3 +464,337 @@ void detach_mnt_union(struct vfsmount *m
 	union_put(um);
 	return;
 }
+
+
+/*
+ * Union mounts support for readdir.
+ */
+
+/* This is a copy from fs/readdir.c */
+struct getdents_callback {
+	struct linux_dirent __user *current_dir;
+	struct linux_dirent __user *previous;
+	int count;
+	int error;
+};
+
+/* The readdir union cache object */
+struct union_cache_entry {
+	struct list_head list;
+	struct qstr name;
+};
+
+static int union_cache_add_entry(struct list_head *list,
+				 const char *name, int namelen)
+{
+	struct union_cache_entry *this;
+	char *tmp_name;
+
+	this = kmalloc(sizeof(*this), GFP_KERNEL);
+	if (!this) {
+		printk(KERN_CRIT
+		       "union_cache_add_entry(): out of kernel memory\n");
+		return -ENOMEM;
+	}
+
+	tmp_name = kmalloc(namelen + 1, GFP_KERNEL);
+	if (!tmp_name) {
+		printk(KERN_CRIT
+		       "union_cache_add_entry(): out of kernel memory\n");
+		kfree(this);
+		return -ENOMEM;
+	}
+
+	this->name.name = tmp_name;
+	this->name.len = namelen;
+	this->name.hash = 0;
+	memcpy(tmp_name, name, namelen);
+	tmp_name[namelen] = 0;
+	INIT_LIST_HEAD(&this->list);
+	list_add(&this->list, list);
+	return 0;
+}
+
+static void union_cache_free(struct list_head *uc_list)
+{
+	struct list_head *p;
+	struct list_head *ptmp;
+	int count = 0;
+
+	list_for_each_safe(p, ptmp, uc_list) {
+		struct union_cache_entry *this;
+
+		this = list_entry(p, struct union_cache_entry, list);
+		list_del_init(&this->list);
+		kfree(this->name.name);
+		kfree(this);
+		count++;
+	}
+	return;
+}
+
+static int union_cache_find_entry(struct list_head *uc_list,
+				  const char *name, int namelen)
+{
+	struct union_cache_entry *p;
+	int ret = 0;
+
+	list_for_each_entry(p, uc_list, list) {
+		if (p->name.len != namelen)
+			continue;
+		if (strncmp(p->name.name, name, namelen) == 0) {
+			ret = 1;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * There are four filldir() wrapper necessary for the union mount readdir
+ * implementation:
+ *
+ * - filldir_topmost(): fills the union's readdir cache and the user space
+ *			buffer. This is only used for the topmost directory
+ *			in the union stack.
+ * - filldir_topmost_cacheonly(): only fills the union's readdir cache.
+ *			This is only used for the topmost directory in the
+ *			union stack.
+ * - filldir_overlaid(): fills the union's readdir cache and the user space
+ *			buffer. This is only used for directories on the
+ *			stack's lower layers.
+ * - filldir_overlaid_cacheonly(): only fills the union's readdir cache.
+ *			This is only used for directories on the stack's
+ *			lower layers.
+ */
+
+struct union_cache_callback {
+	struct getdents_callback *buf;	/* original getdents_callback */
+	struct list_head list;		/* list of union cache entries */
+	filldir_t filler;		/* the filldir() we should call */
+	loff_t offset;			/* base offset of our dirents */
+	loff_t count;			/* maximum number of bytes to "read" */
+};
+
+static int filldir_topmost(void *buf, const char *name, int namlen,
+			   loff_t offset, u64 ino, unsigned int d_type)
+{
+	struct union_cache_callback *cb = buf;
+
+	union_cache_add_entry(&cb->list, name, namlen);
+	return cb->filler(cb->buf, name, namlen, cb->offset + offset, ino,
+			  d_type);
+}
+
+static int filldir_topmost_cacheonly(void *buf, const char *name, int namlen,
+				     loff_t offset, u64 ino,
+				     unsigned int d_type)
+{
+	struct union_cache_callback *cb = buf;
+
+	if (offset > cb->count)
+		return -EINVAL;
+
+	union_cache_add_entry(&cb->list, name, namlen);
+	return 0;
+}
+
+static int filldir_overlaid(void *buf, const char *name, int namlen,
+			    loff_t offset, u64 ino, unsigned int d_type)
+{
+	struct union_cache_callback *cb = buf;
+
+	switch (namlen) {
+	case 2:
+		if (name[1] != '.')
+			break;
+	case 1:
+		if (name[0] != '.')
+			break;
+		return 0;
+	}
+
+	if (union_cache_find_entry(&cb->list, name, namlen))
+		return 0;
+
+	union_cache_add_entry(&cb->list, name, namlen);
+	return cb->filler(cb->buf, name, namlen, cb->offset + offset, ino,
+			  d_type);
+}
+
+static int filldir_overlaid_cacheonly(void *buf, const char *name, int namlen,
+				      loff_t offset, u64 ino,
+				      unsigned int d_type)
+{
+	struct union_cache_callback *cb = buf;
+
+	if (offset > cb->count)
+		return -EINVAL;
+
+	switch (namlen) {
+	case 2:
+		if (name[1] != '.')
+			break;
+	case 1:
+		if (name[0] != '.')
+			break;
+		return 0;
+	}
+
+	if (union_cache_find_entry(&cb->list, name, namlen))
+		return 0;
+
+	union_cache_add_entry(&cb->list, name, namlen);
+	return 0;
+}
+
+/*
+ * readdir_union_cache - A helper to fill the readdir cache
+ */
+static int readdir_union_cache(struct file *file, void *_buf, filldir_t filler)
+{
+	struct union_cache_callback *cb = _buf;
+	int old_count;
+	loff_t old_pos;
+	int res;
+
+	old_count = cb->count;
+	cb->count = ((file->f_pos > i_size_read(file->f_path.dentry->d_inode)) ?
+		      i_size_read(file->f_path.dentry->d_inode) :
+		      file->f_pos) & INT_MAX;
+	old_pos = file->f_pos;
+	file->f_pos = 0;
+	res = file->f_op->readdir(file, _buf, filler);
+	file->f_pos = old_pos;
+	cb->count = old_count;
+	return res;
+}
+
+/*
+ * readdir_union - A wrapper around ->readdir()
+ *
+ * This is a wrapper around the filesystems readdir(), which is walking
+ * the union stack and calls ->readdir() for every directory in the stack.
+ * The directory entries are read into the union mounts readdir cache to
+ * support whiteout's and duplicate removal.
+ */
+int readdir_union(struct file *file, void *buf, filldir_t filler)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct union_cache_callback cb;
+	struct path path;
+	loff_t offset = 0;
+	int res;
+
+	mutex_lock(&inode->i_mutex);
+	if (IS_DEADDIR(inode)) {
+		mutex_unlock(&inode->i_mutex);
+		return -ENOENT;
+	}
+
+	INIT_LIST_HEAD(&cb.list);
+	cb.buf = buf;
+	cb.filler = filler;
+	cb.offset = 0;
+	offset = i_size_read(file->f_path.dentry->d_inode);
+	cb.count = file->f_pos;
+
+	if (file->f_pos > 0) {
+		/*
+		 * We have already read from this dir, lets read that stuff to
+		 * our union-cache only
+		 */
+		res = readdir_union_cache(file, &cb,
+					  filldir_topmost_cacheonly);
+		if (res) {
+			mutex_unlock(&inode->i_mutex);
+			goto out;
+		}
+	}
+
+	if (file->f_pos < offset) {
+		res = file->f_op->readdir(file, &cb, filldir_topmost);
+		file_accessed(file);
+		if (res) {
+			mutex_unlock(&inode->i_mutex);
+			goto out;
+		}
+		/* We read until EOF of this directory */
+		file->f_pos = offset;
+	}
+
+	mutex_unlock(&inode->i_mutex);
+
+	path = file->f_path;
+	pathget(&path);
+	while (follow_union_down(&path.mnt, &path.dentry)) {
+		struct file *ftmp;
+
+		/* get path reference for filep */
+		pathget(&path);
+		ftmp = dentry_open(path.dentry, path.mnt,
+				   ((file->f_flags & ~(O_ACCMODE)) |
+				    O_RDONLY | O_DIRECTORY | O_NOATIME));
+		if (IS_ERR(ftmp)) {
+			res = PTR_ERR(ftmp);
+			break;
+		}
+
+		inode = path.dentry->d_inode;
+		mutex_lock(&inode->i_mutex);
+
+		/* rearrange the file position */
+		cb.offset += offset;
+		offset = i_size_read(inode);
+		ftmp->f_pos = file->f_pos - cb.offset;
+		cb.count = ftmp->f_pos;
+		if (ftmp->f_pos < 0) {
+			mutex_unlock(&inode->i_mutex);
+			fput(ftmp);
+			break;
+		}
+
+		res = -ENOENT;
+		if (IS_DEADDIR(inode))
+			goto out_fput;
+
+		if (ftmp->f_pos > 0) {
+			/*
+			 * We have already read from this dir, lets read that
+			 * stuff to our union-cache only
+			 */
+			res = readdir_union_cache(ftmp, &cb,
+						  filldir_overlaid_cacheonly);
+			if (res)
+				goto out_fput;
+		}
+
+		if (ftmp->f_pos < offset) {
+			res = ftmp->f_op->readdir(ftmp, &cb, filldir_overlaid);
+			file_accessed(ftmp);
+			if (res)
+				file->f_pos += ftmp->f_pos;
+			else
+				/*
+				 * We read until EOF of this directory, so lets
+				 * advance the f_pos by the maximum offset
+				 * (i_size) of this directory
+				 */
+				file->f_pos += offset;
+		}
+
+		file_accessed(ftmp);
+
+out_fput:
+		mutex_unlock(&inode->i_mutex);
+		fput(ftmp);
+
+		if (res)
+			break;
+	}
+	pathput(&path);
+out:
+	union_cache_free(&cb.list);
+	return res;
+}
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -53,6 +53,7 @@ extern void __shrink_d_unions(struct den
 extern int attach_mnt_union(struct vfsmount *, struct vfsmount *,
 			    struct dentry *);
 extern void detach_mnt_union(struct vfsmount *);
+extern int readdir_union(struct file *, void *, filldir_t);
 
 #else /* CONFIG_UNION_MOUNT */
 
@@ -69,5 +70,29 @@ extern void detach_mnt_union(struct vfsm
 #define detach_mnt_union(x)		do { } while (0)
 
 #endif	/* CONFIG_UNION_MOUNT */
+
+static inline int do_readdir(struct file *file, void *buf, filldir_t filler)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	int res;
+
+#ifdef CONFIG_UNION_MOUNT
+	if (IS_MNT_UNION(file->f_path.mnt))
+		res = readdir_union(file, buf, filler);
+	else
+#endif
+	{
+		mutex_lock(&inode->i_mutex);
+		res = -ENOENT;
+		if (!IS_DEADDIR(inode)) {
+			res = file->f_op->readdir(file, buf, filler);
+			file_accessed(file);
+		}
+		mutex_unlock(&inode->i_mutex);
+	}
+
+	return res;
+}
+
 #endif	/* __KERNEL__ */
 #endif	/* __LINUX_UNION_H */

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 21/26] union-mount: in-kernel file copy between union mounted filesystems
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (19 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 20/26] union-mount: Simple union-mount readdir implementation Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 22/26] union-mount: white-out changes for copy-on-open Jan Blunck
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/union-mount-copy-on-open.diff --]
[-- Type: text/plain, Size: 13178 bytes --]

This patch introduces in-kernel file copy between union mounted
filesystems. When a file is opened for writing but resides on a lower (thus
read-only) layer of the union stack it is copied to the topmost union layer
first.

This patch uses the do_splice() for doing the in-kernel file copy.

Signed-off-by: Bharata B Rao <bharata@in.ibm.com>
Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/namei.c            |   73 ++++++++++-
 fs/union.c            |  312 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/union.h |    9 +
 3 files changed, 389 insertions(+), 5 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -994,7 +994,7 @@ static int __follow_mount(struct path *p
 	return res;
 }
 
-static void follow_mount(struct vfsmount **mnt, struct dentry **dentry)
+void follow_mount(struct vfsmount **mnt, struct dentry **dentry)
 {
 	while (d_mountpoint(*dentry)) {
 		struct vfsmount *mounted = lookup_mnt(*mnt, *dentry);
@@ -1213,6 +1213,21 @@ static fastcall int __link_path_walk(con
 		if (err)
 			break;
 
+		if ((nd->flags & LOOKUP_TOPMOST) &&
+		    (nd->um_flags & LAST_LOWLEVEL)) {
+			struct dentry *dentry;
+
+			dentry = union_create_topmost(nd, &this, &next);
+			if (IS_ERR(dentry)) {
+				err = PTR_ERR(dentry);
+				goto out_dput;
+			}
+			dput_path(&next, nd);
+			next.mnt = nd->mnt;
+			next.dentry = dentry;
+			nd->um_flags &= ~LAST_LOWLEVEL;
+		}
+
 		err = -ENOENT;
 		inode = next.dentry->d_inode;
 		if (!inode || S_ISWHT(inode->i_mode))
@@ -1267,6 +1282,22 @@ last_component:
 		err = do_lookup(nd, &this, &next);
 		if (err)
 			break;
+
+		if ((nd->flags & LOOKUP_TOPMOST) &&
+		    (nd->um_flags & LAST_LOWLEVEL)) {
+			struct dentry *dentry;
+
+			dentry = union_create_topmost(nd, &this, &next);
+			if (IS_ERR(dentry)) {
+				err = PTR_ERR(dentry);
+				goto out_dput;
+			}
+			dput_path(&next, nd);
+			next.mnt = nd->mnt;
+			next.dentry = dentry;
+			nd->um_flags &= ~LAST_LOWLEVEL;
+		}
+
 		inode = next.dentry->d_inode;
 		if ((lookup_flags & LOOKUP_FOLLOW)
 		    && inode && inode->i_op && inode->i_op->follow_link) {
@@ -1755,7 +1786,7 @@ out:
 	return err;
 }
 
-static int hash_lookup_union(struct nameidata *nd, struct qstr *name,
+int hash_lookup_union(struct nameidata *nd, struct qstr *name,
 			     struct path *path)
 {
 	struct path safe = { .dentry = nd->dentry, .mnt = nd->mnt };
@@ -2169,6 +2200,11 @@ int open_namei(int dfd, const char *path
 					 nd, flag);
 		if (error)
 			return error;
+		if (flag & FMODE_WRITE) {
+			error = union_copyup(nd, flag);
+			if (error)
+				return error;
+		}
 		goto ok;
 	}
 
@@ -2188,6 +2224,16 @@ int open_namei(int dfd, const char *path
 	if (nd->last_type != LAST_NORM || nd->last.name[nd->last.len])
 		goto exit;
 
+	/*
+	 * If this dentry is on an union mount we need the topmost dentry here.
+	 * This creates all topmost directories on the path to this dentry too.
+	 */
+	if (is_unionized(nd->dentry, nd->mnt)) {
+		error = union_relookup_topmost(nd, nd->flags & ~LOOKUP_PARENT);
+		if (error)
+			goto exit;
+	}
+
 	dir = nd->dentry;
 	nd->flags &= ~LOOKUP_PARENT;
 	mutex_lock(&dir->d_inode->i_mutex);
@@ -2235,10 +2281,21 @@ do_last:
 	if (path.dentry->d_inode->i_op && path.dentry->d_inode->i_op->follow_link)
 		goto do_link;
 
-	path_to_nameidata(&path, nd);
 	error = -EISDIR;
 	if (path.dentry->d_inode && S_ISDIR(path.dentry->d_inode->i_mode))
-		goto exit;
+		goto exit_dput;
+
+	/*
+	 * If this file is on a lower layer of the union stack, copy it to the
+	 * topmost layer before opening it
+	 */
+	if (path.dentry->d_inode && (path.dentry->d_parent != dir)) {
+		error = __union_copyup(&path, nd, &path);
+		if (error)
+			goto exit_dput;
+	}
+
+	path_to_nameidata(&path, nd);
 ok:
 	error = may_open(nd, acc_mode, flag);
 	if (error)
@@ -3437,9 +3494,15 @@ static int do_rename(int olddfd, const c
 	error = -ENOTEMPTY;
 	if (new.dentry == trap)
 		goto exit5;
+	/* renaming on unions is done by the user-space */
+	error = -EXDEV;
+	if (is_unionized(oldnd.dentry, oldnd.mnt))
+		goto exit5;
+	if (is_unionized(newnd.dentry, newnd.mnt))
+		goto exit5;
 
 	error = vfs_rename(old_dir->d_inode, old.dentry,
-				   new_dir->d_inode, new.dentry);
+			   new_dir->d_inode, new.dentry);
 exit5:
 	dput_path(&new, &newnd);
 exit4:
--- a/fs/union.c
+++ b/fs/union.c
@@ -20,6 +20,11 @@
 #include <linux/union.h>
 #include <linux/module.h>
 #include <linux/file.h>
+#include <linux/mm.h>
+#include <linux/quotaops.h>
+#include <linux/dnotify.h>
+#include <linux/security.h>
+#include <linux/pipe_fs_i.h>
 
 /*
  * This is borrowed from fs/inode.c. The hashtable for lookups. Somebody
@@ -798,3 +803,310 @@ out:
 	union_cache_free(&cb.list);
 	return res;
 }
+
+/*
+ * Union mount copyup support
+ */
+
+extern int hash_lookup_union(struct nameidata *, struct qstr *, struct path *);
+extern void follow_mount(struct vfsmount **, struct dentry **);
+
+/*
+ * union_relookup_topmost - lookup and create the topmost path to dentry
+ * @nd: pointer to nameidata
+ * @flags: lookup flags
+ */
+int union_relookup_topmost(struct nameidata *nd, int flags)
+{
+	int err;
+	char *kbuf, *name;
+	struct nameidata this;
+
+	kbuf = (char *)__get_free_page(GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	name = d_path(nd->dentry, nd->mnt, kbuf, PAGE_SIZE);
+	err = PTR_ERR(name);
+	if (IS_ERR(name))
+		goto free_page;
+
+	err = path_lookup(name, flags|LOOKUP_CREATE|LOOKUP_TOPMOST, &this);
+	if (err)
+		goto free_page;
+
+	path_release(nd);
+	nd->dentry = this.dentry;
+	nd->mnt = this.mnt;
+
+	/*
+	 * the nd->flags should be unchanged
+	 */
+	BUG_ON(this.um_flags & LAST_LOWLEVEL);
+	nd->um_flags &= ~LAST_LOWLEVEL;
+ free_page:
+	free_page((unsigned long)kbuf);
+	return err;
+}
+
+static void __update_fs_pwd(struct path *path, struct dentry *dentry,
+			    struct vfsmount *mnt)
+{
+	struct dentry *old_pwd = NULL;
+	struct vfsmount *old_pwdmnt = NULL;
+
+	write_lock(&current->fs->lock);
+	if (current->fs->pwd == path->dentry) {
+		old_pwd = current->fs->pwd;
+		old_pwdmnt = current->fs->pwdmnt;
+		current->fs->pwdmnt = mntget(mnt);
+		current->fs->pwd = dget(dentry);
+	}
+	write_unlock(&current->fs->lock);
+
+	if (old_pwd) {
+		dput(old_pwd);
+		mntput(old_pwdmnt);
+	}
+
+	return;
+}
+
+/*
+ * union_create_topmost - create the topmost path component
+ * @nd: pointer to nameidata of the base directory
+ * @name: pointer to file name
+ * @path: pointer to path of the overlaid file
+ *
+ * This is called by __link_path_walk() to create the directories on a path
+ * when it is called with LOOKUP_TOPMOST.
+ */
+struct dentry *union_create_topmost(struct nameidata *nd, struct qstr *name,
+				    struct path *path)
+{
+	struct dentry *dentry, *parent = nd->dentry;
+	int res, mode = path->dentry->d_inode->i_mode;
+
+	if (parent->d_sb == path->dentry->d_sb)
+		return ERR_PTR(-EEXIST);
+
+	mutex_lock(&parent->d_inode->i_mutex);
+	dentry = lookup_one_len_nd(name->name, nd->dentry, name->len, nd);
+	if (IS_ERR(dentry))
+		goto out_unlock;
+
+	switch (mode & S_IFMT) {
+	case S_IFREG:
+		/*
+		 * FIXME: Does this make any sense in this case?
+		 * Special case - lookup gave negative, but... we had foo/bar/
+		 * From the vfs_mknod() POV we just have a negative dentry -
+		 * all is fine. Let's be bastards - you had / on the end,you've
+		 * been asking for (non-existent) directory. -ENOENT for you.
+		 */
+		if (name->name[name->len] && !dentry->d_inode) {
+			dput(dentry);
+			dentry = ERR_PTR(-ENOENT);
+			goto out_unlock;
+		}
+
+		res = vfs_create(parent->d_inode, dentry, mode, nd);
+		if (res) {
+			dput(dentry);
+			dentry = ERR_PTR(res);
+			goto out_unlock;
+		}
+		break;
+	case S_IFDIR:
+		res = vfs_mkdir(parent->d_inode, dentry, mode);
+		if (res) {
+			dput(dentry);
+			dentry = ERR_PTR(res);
+			goto out_unlock;
+		}
+
+		res = append_to_union(nd->mnt, dentry, path->mnt,
+				      path->dentry);
+		if (res) {
+			dput(dentry);
+			dentry = ERR_PTR(res);
+			goto out_unlock;
+		}
+		break;
+	default:
+		dput(dentry);
+		dentry = ERR_PTR(-EINVAL);
+		goto out_unlock;
+	}
+
+	/* Really necessary ??? */
+/*	__update_fs_pwd(path, dentry, nd->mnt); */
+
+ out_unlock:
+	mutex_unlock(&parent->d_inode->i_mutex);
+	return dentry;
+}
+
+static int union_copy_file(struct dentry *old_dentry, struct vfsmount *old_mnt,
+			   struct dentry *new_dentry, struct vfsmount *new_mnt)
+{
+	int ret;
+	size_t size;
+	loff_t offset;
+	struct file *old_file, *new_file;
+
+	dget(old_dentry);
+	mntget(old_mnt);
+	old_file = dentry_open(old_dentry, old_mnt, O_RDONLY);
+	if (IS_ERR(old_file))
+		return PTR_ERR(old_file);
+
+	dget(new_dentry);
+	mntget(new_mnt);
+	new_file = dentry_open(new_dentry, new_mnt, O_WRONLY);
+	ret = PTR_ERR(new_file);
+	if (IS_ERR(new_file))
+		goto fput_old;
+
+	size = i_size_read(old_file->f_path.dentry->d_inode);
+	if (((size_t)size != size) || ((ssize_t)size != size)) {
+		ret = -EFBIG;
+		goto fput_new;
+	}
+
+	offset = 0;
+	ret = do_splice_direct(old_file, &offset, new_file, size,
+			       SPLICE_F_MOVE);
+	if (ret >= 0)
+		ret = 0;
+ fput_new:
+	fput(new_file);
+ fput_old:
+	fput(old_file);
+	return ret;
+}
+
+/**
+ * __union_copyup - copy a file to the topmost directory
+ * @old: pointer to path of the old file name
+ * @new_nd: pointer to nameidata of the topmost directory
+ * @new: pointer to path of the new file name
+ *
+ * The topmost directory @new_nd must already be locked. Creates the topmost
+ * file if it doesn't exist yet.
+ */
+int __union_copyup(struct path *old, struct nameidata *new_nd, struct path *new)
+{
+	struct dentry *dentry;
+	int error;
+
+	/* Maybe this should be -EINVAL */
+	if (S_ISDIR(old->dentry->d_inode->i_mode))
+		return -EISDIR;
+
+	if (new_nd->dentry != new->dentry->d_parent) {
+		dentry = lookup_one_len_nd(new->dentry->d_name.name,
+					   new_nd->dentry,
+					   new->dentry->d_name.len, new_nd);
+		if (IS_ERR(dentry))
+			return PTR_ERR(dentry);
+		error = -EEXIST;
+		if (dentry->d_inode && !S_ISWHT(dentry->d_inode->i_mode))
+			goto out;
+	} else
+		dentry = new->dentry;
+
+	if (!dentry->d_inode) {
+		error = vfs_create(new_nd->dentry->d_inode, dentry,
+				   old->dentry->d_inode->i_mode, new_nd);
+		if (error)
+			goto out;
+	}
+
+	error = union_copy_file(old->dentry, old->mnt, dentry, new_nd->mnt);
+	if (error) {
+		/* FIXME: are there return value we should not BUG() on ? */
+		BUG_ON(vfs_unlink(new_nd->dentry->d_inode, dentry));
+		goto out;
+	}
+
+	dput_path(new, new_nd);
+	new->dentry = dentry;
+	new->mnt = new_nd->mnt;
+out:
+	if (new->dentry != dentry)
+		dput(dentry);
+	return error;
+}
+
+/*
+ * union_copyup - copy a file to the topmost layer of the union stack
+ * @nd: nameidata pointer to the file
+ * @flags: flags given to open_namei
+ */
+int union_copyup(struct nameidata *nd, int flags)
+{
+	struct qstr this;
+	char *name;
+	struct dentry *dir;
+	struct path path;
+	int err;
+
+	if (!is_unionized(nd->dentry, nd->mnt))
+		return 0;
+	if (!S_ISREG(nd->dentry->d_inode->i_mode))
+		return 0;
+
+	/* safe the name for hash_lookup_union() */
+	this.len = nd->dentry->d_name.len;
+	this.hash = nd->dentry->d_name.hash;
+	name = kmalloc(this.len + 1, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+	this.name = name;
+	memcpy(name, nd->dentry->d_name.name, nd->dentry->d_name.len);
+	name[this.len] = 0;
+
+	err = union_relookup_topmost(nd, nd->flags|LOOKUP_PARENT);
+	if (err) {
+		kfree(name);
+		return err;
+	}
+	nd->flags &= ~LOOKUP_PARENT;
+
+	dir = nd->dentry;
+	mutex_lock(&dir->d_inode->i_mutex);
+	err = hash_lookup_union(nd, &this, &path);
+	mutex_unlock(&dir->d_inode->i_mutex);
+	kfree(name);
+	if (err)
+		return err;
+
+	err = -ENOENT;
+	if (!path.dentry->d_inode)
+		goto exit_dput;
+
+	/* Necessary?! I guess not ... */
+	follow_mount(&path.mnt, &path.dentry);
+
+	err = -ENOENT;
+	if (!path.dentry->d_inode)
+		goto exit_dput;
+
+	err = -EISDIR;
+	if (!S_ISREG(path.dentry->d_inode->i_mode))
+		goto exit_dput;
+
+	if (path.dentry->d_parent != nd->dentry) {
+		err = __union_copyup(&path, nd, &path);
+		if (err)
+			goto exit_dput;
+	}
+
+	path_to_nameidata(&path, nd);
+	return 0;
+
+exit_dput:
+	dput_path(&path, nd);
+	return err;
+}
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -54,6 +54,11 @@ extern int attach_mnt_union(struct vfsmo
 			    struct dentry *);
 extern void detach_mnt_union(struct vfsmount *);
 extern int readdir_union(struct file *, void *, filldir_t);
+extern int union_relookup_topmost(struct nameidata *, int);
+extern struct dentry *union_create_topmost(struct nameidata *, struct qstr *,
+					   struct path *);
+extern int __union_copyup(struct path *, struct nameidata *, struct path *);
+extern int union_copyup(struct nameidata *, int);
 
 #else /* CONFIG_UNION_MOUNT */
 
@@ -68,6 +73,10 @@ extern int readdir_union(struct file *, 
 #define __shrink_d_unions(x)		do { } while (0)
 #define attach_mnt_union(x, y, z)	do { } while (0)
 #define detach_mnt_union(x)		do { } while (0)
+#define union_relookup_topmost(x, y)	({ BUG(); (0); })
+#define union_create_topmost(x, y, z)	({ BUG(); (NULL); })
+#define __union_copyup(x, y, z)		({ BUG(); (0); })
+#define union_copyup(x, y)		({ (0); })
 
 #endif	/* CONFIG_UNION_MOUNT */
 

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 22/26] union-mount: white-out changes for copy-on-open
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (20 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 21/26] union-mount: in-kernel file copy between union mounted filesystems Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 23/26] union-mount: copyup on rename Jan Blunck
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/union-mount-coo-whiteout.diff --]
[-- Type: text/plain, Size: 3058 bytes --]

When files on an upper layer of the union stack are removed we need to
white-out the removed filename.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/namei.c |   46 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 2 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2253,6 +2253,13 @@ do_last:
 
 	/* Negative dentry, just create the file */
 	if (!path.dentry->d_inode || S_ISWHT(path.dentry->d_inode->i_mode)) {
+		if (path.dentry->d_parent != dir) {
+			dput_path(&path, nd);
+			path.dentry = __lookup_hash_kern(&nd->last, dir, nd);
+			path.mnt = nd->mnt;
+			goto do_last;
+		}
+
 		error = open_namei_create(nd, &path, flag, mode);
 		if (error)
 			goto exit;
@@ -2373,6 +2380,16 @@ int lookup_create(struct nameidata *nd, 
 {
 	int err = -EEXIST;
 
+	if (is_unionized(nd->dentry, nd->mnt)) {
+		err = union_relookup_topmost(nd, nd->flags & ~LOOKUP_PARENT);
+		if (err) {
+			/* FIXME: This really sucks */
+			mutex_lock_nested(&nd->dentry->d_inode->i_mutex,
+					  I_MUTEX_PARENT);
+			goto fail;
+		}
+	}
+
 	mutex_lock_nested(&nd->dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	/*
 	 * Yucky last component or no last component at all?
@@ -2391,6 +2408,16 @@ int lookup_create(struct nameidata *nd, 
 	if (err)
 		goto fail;
 
+	/* Special case - we found a whiteout */
+	if (path->dentry->d_inode && S_ISWHT(path->dentry->d_inode->i_mode)) {
+		if (path->dentry->d_parent != nd->dentry) {
+			dput_path(path, nd);
+			path->dentry = __lookup_hash_kern(&nd->last, nd->dentry,
+							  nd);
+			path->mnt = nd->mnt;
+		}
+	}
+
 	/*
 	 * Special case - lookup gave negative, but... we had foo/bar/
 	 * From the vfs_mknod() POV we just have a negative dentry -
@@ -2682,6 +2709,15 @@ static int do_whiteout(struct nameidata 
 	if (isdir && !directory_is_empty(path->dentry, path->mnt))
 		goto out;
 
+	mutex_unlock(&nd->dentry->d_inode->i_mutex);
+	err = union_relookup_topmost(nd, nd->flags & ~LOOKUP_PARENT);
+	if (err) {
+		mutex_lock_nested(&nd->dentry->d_inode->i_mutex,
+				  I_MUTEX_PARENT);
+		goto out;
+	}
+	mutex_lock_nested(&nd->dentry->d_inode->i_mutex, I_MUTEX_PARENT);
+
 	/* safe the name for a later lookup */
 	err = -ENOMEM;
 	name.name = kmalloc(dentry->d_name.len, GFP_KERNEL);
@@ -3012,7 +3048,10 @@ static long do_rmdir(int dfd, const char
 	error = hash_lookup_union(&nd, &nd.last, &path);
 	if (error)
 		goto exit2;
-	error = vfs_rmdir(nd.dentry->d_inode, path.dentry);
+	if (is_unionized(nd.dentry, nd.mnt))
+		error = do_whiteout(&nd, &path, 1);
+	else
+		error = vfs_rmdir(nd.dentry->d_inode, path.dentry);
 	dput_path(&path, &nd);
 exit2:
 	mutex_unlock(&nd.dentry->d_inode->i_mutex);
@@ -3091,7 +3130,10 @@ static long do_unlinkat(int dfd, const c
 		inode = path.dentry->d_inode;
 		if (inode)
 			atomic_inc(&inode->i_count);
-		error = vfs_unlink(nd.dentry->d_inode, path.dentry);
+		if (is_unionized(nd.dentry, nd.mnt))
+			error = do_whiteout(&nd, &path, 0);
+		else
+			error = vfs_unlink(nd.dentry->d_inode, path.dentry);
 	exit2:
 		dput_path(&path, &nd);
 	}

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 23/26] union-mount: copyup on rename
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (21 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 22/26] union-mount: white-out changes for copy-on-open Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 24/26] union-mount: dont report EROFS for union mounts Jan Blunck
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/union-mount-coo-rename.diff --]
[-- Type: text/plain, Size: 5247 bytes --]

Add copyup renaming of regular files on union mounts. Directories are still
lazyly copied with the help of user-space.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/namei.c |  133 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 fs/union.c |    8 ++-
 2 files changed, 129 insertions(+), 12 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1491,6 +1491,8 @@ static int fastcall do_path_lookup(int d
 		nd->mnt = mntget(fs->pwdmnt);
 		nd->dentry = dget(fs->pwd);
 		read_unlock(&fs->lock);
+		/* Force a union_relookup() */
+		nd->um_flags = LAST_LOWLEVEL;
 	} else {
 		struct dentry *dentry;
 
@@ -3478,6 +3480,97 @@ int vfs_rename(struct inode *old_dir, st
 	return error;
 }
 
+int vfs_rename_union(struct nameidata *oldnd, struct path *old,
+		     struct nameidata *newnd, struct path *new)
+{
+	struct inode *old_dir = oldnd->dentry->d_inode;
+	struct inode *new_dir = newnd->dentry->d_inode;
+	struct qstr old_name;
+	char *name;
+	struct dentry *dentry;
+	int error;
+
+	if (old->dentry->d_inode == new->dentry->d_inode)
+		return 0;
+
+	error = may_whiteout(old->dentry, 0);
+	if (error)
+		return error;
+	if (!old_dir->i_op || !old_dir->i_op->whiteout)
+		return -EPERM;
+
+	if (!new->dentry->d_inode)
+		error = may_create(new_dir, new->dentry, NULL);
+	else
+		error = may_delete(new_dir, new->dentry, 0);
+	if (error)
+		return error;
+
+	DQUOT_INIT(old_dir);
+	DQUOT_INIT(new_dir);
+
+	error = security_inode_rename(old_dir, old->dentry,
+				      new_dir, new->dentry);
+	if (error)
+		return error;
+
+	error = -EBUSY;
+	if (d_mountpoint(old->dentry) || d_mountpoint(new->dentry))
+		return error;
+
+	error = -ENOMEM;
+	name = kmalloc(old->dentry->d_name.len, GFP_KERNEL);
+	if (!name)
+		return error;
+	strncpy(name, old->dentry->d_name.name, old->dentry->d_name.len);
+	name[old->dentry->d_name.len] = 0;
+	old_name.len = old->dentry->d_name.len;
+	old_name.hash = old->dentry->d_name.hash;
+	old_name.name = name;
+
+	/* possibly delete the existing new file */
+	if ((newnd->dentry == new->dentry->d_parent) && new->dentry->d_inode) {
+		/* FIXME: inode may be truncated while we hold a lock */
+		error = vfs_unlink(new_dir, new->dentry);
+		if (error)
+			goto freename;
+
+		dentry = __lookup_hash_kern(&new->dentry->d_name,
+					    newnd->dentry, newnd);
+		if (IS_ERR(dentry))
+			goto freename;
+
+		dput(new->dentry);
+		new->dentry = dentry;
+	}
+
+	/* copyup to the new file */
+	error = __union_copyup(old, newnd, new);
+	if (error)
+		goto freename;
+
+	/* whiteout the old file */
+	dentry = __lookup_hash_kern(&old_name, oldnd->dentry, oldnd);
+	error = PTR_ERR(dentry);
+	if (IS_ERR(dentry))
+		goto freename;
+	error = vfs_whiteout(old_dir, dentry);
+	dput(dentry);
+
+	/* FIXME: This is acutally unlink() && create() ... */
+/*
+	if (!error) {
+		const char *new_name = old_dentry->d_name.name;
+		fsnotify_move(old_dir, new_dir, old_name.name, new_name, 0,
+			      new_dentry->d_inode, old_dentry->d_inode);
+	}
+*/
+freename:
+	kfree(old_name.name);
+	return error;
+}
+
+
 static int do_rename(int olddfd, const char *oldname,
 			int newdfd, const char *newname)
 {
@@ -3495,10 +3588,7 @@ static int do_rename(int olddfd, const c
 	if (error)
 		goto exit1;
 
-	error = -EXDEV;
-	if (oldnd.mnt != newnd.mnt)
-		goto exit2;
-
+lock:
 	old_dir = oldnd.dentry;
 	error = -EBUSY;
 	if (oldnd.last_type != LAST_NORM)
@@ -3536,15 +3626,40 @@ static int do_rename(int olddfd, const c
 	error = -ENOTEMPTY;
 	if (new.dentry == trap)
 		goto exit5;
-	/* renaming on unions is done by the user-space */
+	/* renaming of directories on unions is done by the user-space */
 	error = -EXDEV;
-	if (is_unionized(oldnd.dentry, oldnd.mnt))
+	if (is_unionized(oldnd.dentry, oldnd.mnt) &&
+	    S_ISDIR(old.dentry->d_inode->i_mode))
 		goto exit5;
-	if (is_unionized(newnd.dentry, newnd.mnt))
+	/* renameing of other files on unions is done by copyup */
+	if ((is_unionized(oldnd.dentry, oldnd.mnt) &&
+	     (oldnd.um_flags & LAST_LOWLEVEL)) ||
+	    (is_unionized(newnd.dentry, newnd.mnt) &&
+	     (newnd.um_flags & LAST_LOWLEVEL))) {
+		dput_path(&new, &newnd);
+		dput_path(&old, &oldnd);
+		unlock_rename(new_dir, old_dir);
+		error = union_relookup_topmost(&oldnd,
+					       oldnd.flags & ~LOOKUP_PARENT);
+		if (error)
+			goto exit2;
+		error = union_relookup_topmost(&newnd,
+					       newnd.flags & ~LOOKUP_PARENT);
+		if (error)
+			goto exit2;
+		goto lock;
+	}
+
+	error = -EXDEV;
+	if (oldnd.mnt != newnd.mnt)
 		goto exit5;
 
-	error = vfs_rename(old_dir->d_inode, old.dentry,
-			   new_dir->d_inode, new.dentry);
+	if (is_unionized(oldnd.dentry, oldnd.mnt) &&
+	    (old.dentry->d_parent != oldnd.dentry))
+		error = vfs_rename_union(&oldnd, &old, &newnd, &new);
+	else
+		error = vfs_rename(old_dir->d_inode, old.dentry,
+				   new_dir->d_inode, new.dentry);
 exit5:
 	dput_path(&new, &newnd);
 exit4:
--- a/fs/union.c
+++ b/fs/union.c
@@ -1030,9 +1030,11 @@ int __union_copyup(struct path *old, str
 		goto out;
 	}
 
-	dput_path(new, new_nd);
-	new->dentry = dentry;
-	new->mnt = new_nd->mnt;
+	if (new->dentry != dentry) {
+		dput_path(new, new_nd);
+		new->dentry = dentry;
+		new->mnt = new_nd->mnt;
+	}
 out:
 	if (new->dentry != dentry)
 		dput(dentry);

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 24/26] union-mount: dont report EROFS for union mounts
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (22 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 23/26] union-mount: copyup on rename Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 25/26] union-mount: Debug Infrastructure Jan Blunck
                   ` (3 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/union-mount-access.diff --]
[-- Type: text/plain, Size: 611 bytes --]

SuS v2 requires we report a read only fs too. For union-mounts this is a very
expensive check. So I'm lazy and just disable the check if we are on a lower
layer of an union.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/open.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/open.c
+++ b/fs/open.c
@@ -483,7 +483,7 @@ asmlinkage long sys_faccessat(int dfd, c
 	   special_file(nd.dentry->d_inode->i_mode))
 		goto out_path_release;
 
-	if(IS_RDONLY(nd.dentry->d_inode))
+	if (!(nd.um_flags & LAST_LOWLEVEL) && IS_RDONLY(nd.dentry->d_inode))
 		res = -EROFS;
 
 out_path_release:

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 25/26] union-mount: Debug Infrastructure
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (23 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 24/26] union-mount: dont report EROFS for union mounts Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 16:13 ` [RFC 26/26] union-mount: Debug code Jan Blunck
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/union-mount-debug-infrastructure.diff --]
[-- Type: text/plain, Size: 10798 bytes --]

This adds debugfs/relay based debugging infrastructure helpful when doing
development of the union-mount code itself. The debgging output can be enabled
during runtime by:

 echo 1 > /proc/sys/fs/union-debug

This registers the relayfs files where the debug code is writing its output
to. There are different levels of debugging output available which can be ORed
together. For the valid sysctl values see include/linux/union_debug.h.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 include/linux/union_debug.h |   91 ++++++++++++++
 lib/Kconfig.debug           |    9 +
 lib/Makefile                |    2 
 lib/union_debug.c           |  268 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 370 insertions(+)

--- /dev/null
+++ b/include/linux/union_debug.h
@@ -0,0 +1,91 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
+ * Copyright (C) 2007 Novell Inc.
+ *   Author(s): Jan Blunck (j.blunck@tu-harburg.de)
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef __LINUX_UNION_DEBUG_H
+#define __LINUX_UNION_DEBUG_H
+
+#ifdef __KERNEL__
+
+#ifdef CONFIG_DEBUG_UNION_MOUNT
+
+#include <linux/sched.h>
+
+/* This is taken from klog debugging facility */
+extern void klog(const void *data, int len);
+extern void klog_printk(const char *fmt, ...);
+extern void klog_printk_dentry(const char *func, struct dentry *dentry);
+
+extern int sysctl_union_debug;
+
+#define UNION_MOUNT_DEBUG		1
+#define UNION_MOUNT_DEBUG_DCACHE	2
+#define UNION_MOUNT_DEBUG_LOCK		4
+#define UNION_MOUNT_DEBUG_READDIR	8
+#define UNION_MOUNT_DEBUG_LOOKUP	16
+
+#define UM_DEBUG(fmt, args...)						\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG)			\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);		\
+} while (0)
+#define UM_DEBUG_DENTRY(dentry)						\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG)			\
+		klog_printk_dentry(__FUNCTION__, (dentry));		\
+} while (0)
+#define UM_DEBUG_DCACHE(fmt, args...)					\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_DCACHE)		\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);		\
+} while (0)
+#define UM_DEBUG_DCACHE_DENTRY(dentry)					\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_DCACHE)		\
+		klog_printk_dentry(__FUNCTION__, (dentry));		\
+} while (0)
+#define UM_DEBUG_LOCK(fmt, args...)					\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_LOCK)		\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);		\
+} while (0)
+#define UM_DEBUG_READDIR(fmt, args...)					\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_READDIR)		\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);		\
+} while (0)
+#define UM_DEBUG_LOOKUP(fmt, args...)					\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_LOOKUP)		\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);		\
+} while (0)
+#define UM_DEBUG_LOOKUP_DENTRY(dentry)					\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_LOOKUP)		\
+		klog_printk_dentry(__FUNCTION__, (dentry));		\
+} while (0)
+
+#else	/* CONFIG_DEBUG_UNION_MOUNT */
+
+#define UM_DEBUG(fmt, args...)			do { /* empty */ } while (0)
+#define UM_DEBUG_DENTRY(fmt, args...)		do { /* empty */ } while (0)
+#define UM_DEBUG_DCACHE(fmt, args...)		do { /* empty */ } while (0)
+#define UM_DEBUG_DCACHE_DENTRY(fmt, args...)	do { /* empty */ } while (0)
+#define UM_DEBUG_LOCK(fmt, args...)		do { /* empty */ } while (0)
+#define UM_DEBUG_READDIR(fmt, args...)		do { /* empty */ } while (0)
+#define UM_DEBUG_LOOKUP_DENTRY(fmt, args...)	do { /* empty */ } while (0)
+#define UM_DEBUG_LOOKUP_DENTRY(fmt, args...)	do { /* empty */ } while (0)
+
+#endif	/* CONFIG_DEBUG_UNION_MOUNT */
+
+#endif	/* __KERNEL__ */
+#endif	/*  __LINUX_UNION_DEBUG_H */
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -393,6 +393,15 @@ config DEBUG_LIST
 
 	  If unsure, say N.
 
+config DEBUG_UNION_MOUNT
+	bool "Debug VFS based union mounts"
+	depends on DEBUG_KERNEL && UNION_MOUNT
+	select DEBUG_FS
+	default n
+	help
+	  If you say Y here, the union mount debugging code will be
+	  compiled in.
+
 config FRAME_POINTER
 	bool "Compile the kernel with frame pointers"
 	depends on DEBUG_KERNEL && (X86 || CRIS || M68K || M68KNOMMU || FRV || UML || S390 || AVR32 || SUPERH || BFIN)
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -35,6 +35,8 @@ obj-$(CONFIG_PLIST) += plist.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
 
+obj-$(CONFIG_DEBUG_UNION_MOUNT) += union_debug.o
+
 ifneq ($(CONFIG_HAVE_DEC_LOCK),y)
   lib-y += dec_and_lock.o
 endif
--- /dev/null
+++ b/lib/union_debug.c
@@ -0,0 +1,268 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2007 SUSE Linux
+ *   Author(s): Jan Blunck (jblunck@suse.de)
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/sysctl.h>
+#include <linux/init.h>
+#include <linux/relay.h>
+#include <linux/debugfs.h>
+
+int sysctl_union_debug;
+EXPORT_SYMBOL_GPL(sysctl_union_debug);
+
+static struct rchan *debug_rchan;
+static struct dentry *debug_logdir;
+#define SUBBUF_SIZE 262144
+#define N_SUBBUF 4
+
+static struct dentry *create_buf_file(const char *filename,
+				      struct dentry *parent, int mode,
+				      struct rchan_buf *buf, int *is_global)
+{
+	return debugfs_create_file(filename, mode, parent, buf,
+				   &relay_file_operations);
+}
+
+static int remove_buf_file(struct dentry *dentry)
+{
+	debugfs_remove(dentry);
+	return 0;
+}
+
+static int subbuf_start(struct rchan_buf *buf, void *subbuf, void *prev_subbuf,
+			unsigned int prev_padding)
+{
+	return 1;
+}
+
+static struct rchan_callbacks debug_relay_cb = {
+	.create_buf_file = create_buf_file,
+	.remove_buf_file = remove_buf_file,
+	.subbuf_start = subbuf_start,
+};
+
+static int union_debug_relay_init(void)
+{
+	struct dentry *dentry;
+	struct rchan *rchan;
+
+	if (!debug_logdir) {
+		dentry = debugfs_create_dir("union", NULL);
+		if (IS_ERR(dentry)) {
+			printk(KERN_INFO
+			       "%s: debugfs directory creation failed\n",
+			       __FUNCTION__);
+			return PTR_ERR(dentry);
+		}
+
+		debug_logdir = dentry;
+	}
+
+	if (!debug_rchan) {
+		rchan = relay_open("logfile", debug_logdir,
+				   SUBBUF_SIZE, N_SUBBUF,
+				   &debug_relay_cb, NULL);
+		if (!rchan) {
+			printk(KERN_INFO "%s: relay channel creation failed\n",
+			       __FUNCTION__);
+			debugfs_remove(debug_logdir);
+			return -ENOMEM;
+		}
+
+		debug_rchan = rchan;
+	}
+
+	return 0;
+}
+
+static void union_debug_relay_exit(void)
+{
+	if (debug_rchan)
+		relay_close(debug_rchan);
+	debug_rchan = NULL;
+	if (debug_logdir)
+		debugfs_remove(debug_logdir);
+	debug_logdir = NULL;
+}
+
+/*
+ * klog operations
+ */
+struct klog_operations
+{
+	/*
+	 * klog - called when klog called, same params
+	 */
+	void (*klog) (const void *data, int len);
+};
+
+/* maximum size of klog formatting buffer beyond which truncation will occur */
+#define KLOG_TMPBUF_SIZE (1024)
+/* per-cpu klog formatting temporary buffer */
+static char klog_buf[NR_CPUS][KLOG_TMPBUF_SIZE];
+
+/*
+ * do-nothing default klog handler, called if nothing registered
+ */
+static void default_klog(const void *data, int len)
+{
+}
+
+/*
+ * default klog operations, used if nothing registered
+ */
+static struct klog_operations default_klog_ops =
+{
+	.klog = default_klog,
+};
+
+static struct klog_operations *cur_klog_ops = &default_klog_ops;
+
+/**
+ *      register_klog_handler - register klog handler
+ *      @klog_ops: klog operations callbacks
+ *
+ *      replaces default klog handler with passed-in version
+ */
+int register_klog_handler(struct klog_operations *klog_ops)
+{
+	if (!klog_ops)
+		return -EINVAL;
+
+	if (!klog_ops->klog)
+		klog_ops->klog = default_klog;
+
+	cur_klog_ops = klog_ops;
+	return 0;
+}
+
+/**
+ *      unregister_klog_handler - unregister klog handler
+ *
+ *      default handler will be in effect after this
+ */
+void unregister_klog_handler(void)
+{
+	cur_klog_ops = &default_klog_ops;
+}
+
+/**
+ *      klog - send raw data to klog handler
+ */
+void klog(const void *data, int len)
+{
+	cur_klog_ops->klog(data, len);
+}
+
+/**
+ *      klog_printk - send a formatted string to the klog handler
+ *      @fmt: format string, same as printk
+ */
+void klog_printk(const char *fmt, ...)
+{
+	va_list args;
+	int len;
+	char *cbuf;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	cbuf = klog_buf[smp_processor_id()];
+	va_start(args, fmt);
+	len = vsnprintf(cbuf, KLOG_TMPBUF_SIZE, fmt, args);
+	va_end(args);
+	klog(cbuf, len);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(klog_printk);
+
+void klog_printk_dentry(const char *func, struct dentry *dentry)
+{
+	klog_printk("%s: %p{i=%p/%lx,c=%d,n=\"%s\"}\n",
+		    func,
+		    dentry,
+		    dentry->d_inode,
+		    dentry->d_inode ?
+		    dentry->d_inode->i_ino : 0UL,
+		    atomic_read(&dentry->d_count),
+		    dentry->d_name.name);
+}
+EXPORT_SYMBOL_GPL(klog_printk_dentry);
+
+static void log_data(const void *data, int len)
+{
+	relay_write(debug_rchan, data, len);
+}
+
+static struct klog_operations klog_handler =
+{
+	.klog = log_data,
+};
+
+static int union_debug_sysctl_handler(ctl_table *table, int write,
+				      struct file *file,
+				      void __user *buffer, size_t *length,
+				      loff_t *ppos)
+{
+	proc_dointvec_minmax(table, write, file, buffer, length, ppos);
+
+	if (!write)
+		return 0;
+
+	printk(KERN_INFO "sysctl.fs.union-debug: %d\n", sysctl_union_debug);
+
+	switch (sysctl_union_debug) {
+	case 0:
+		unregister_klog_handler();
+		union_debug_relay_exit();
+		break;
+	default:
+		union_debug_relay_init();
+		if (register_klog_handler(&klog_handler))
+			union_debug_relay_exit();
+		break;
+	}
+
+	return 0;
+}
+
+static ctl_table union_table[] = {
+	{
+		.ctl_name = CTL_UNNUMBERED,
+		.procname = "union-debug",
+		.data = &sysctl_union_debug,
+		.maxlen = sizeof(int),
+		.mode = 0644,
+		.proc_handler = &union_debug_sysctl_handler,
+	},
+	{.ctl_name = 0}
+};
+
+static ctl_table fs_root[] = {
+	{
+		.ctl_name = CTL_FS,
+		.procname = "fs",
+		.maxlen = 0,
+		.mode = 0555,
+		.child = union_table,
+	},
+	{.ctl_name = 0}
+};
+
+static struct ctl_table_header *sysctl_header;
+
+static int union_debug_init(void)
+{
+	sysctl_header = register_sysctl_table(fs_root);
+	return 0;
+}
+
+late_initcall(union_debug_init);

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC 26/26] union-mount: Debug code
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (24 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 25/26] union-mount: Debug Infrastructure Jan Blunck
@ 2007-07-30 16:13 ` Jan Blunck
  2007-07-30 18:23 ` [RFC 00/26] VFS based Union Mount (V2) Al Boldi
  2007-08-02  6:49 ` Bharata B Rao
  27 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-30 16:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Bharata B Rao

[-- Attachment #1: um/union-mount-debug.diff --]
[-- Type: text/plain, Size: 6676 bytes --]

Some debugging code itself.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/namei.c            |   26 ++++++++++++++++++++++++++
 fs/union.c            |   27 +++++++++++++++++++++++++++
 include/linux/namei.h |    4 ++++
 3 files changed, 57 insertions(+)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -32,6 +32,7 @@
 #include <linux/fcntl.h>
 #include <linux/namei.h>
 #include <linux/union.h>
+#include <linux/union_debug.h>
 #include <asm/namei.h>
 #include <asm/uaccess.h>
 
@@ -1794,11 +1795,15 @@ int hash_lookup_union(struct nameidata *
 	struct path safe = { .dentry = nd->dentry, .mnt = nd->mnt };
 	int res ;
 
+	UM_DEBUG_LOOKUP("name = \"%*s\"\n", name->len, name->name);
+
 	pathget(&safe);
 	res = __hash_lookup_topmost(nd, name, path);
 	if (res)
 		goto out;
 
+	UM_DEBUG_LOOKUP_DENTRY(path->dentry);
+
 	/* only directories can be part of a union stack */
 	if (!path->dentry->d_inode ||
 	    !S_ISDIR(path->dentry->d_inode->i_mode))
@@ -1813,6 +1818,7 @@ int hash_lookup_union(struct nameidata *
 		goto out;
 	}
 
+	UM_DEBUG_LOOKUP_DENTRY(path->dentry);
 out:
 	path_release(nd);
 	nd->dentry = safe.dentry;
@@ -2765,6 +2771,8 @@ out_freename:
 	kfree(name.name);
 out:
 	pathput(&safe);
+	UM_DEBUG("err = %d\n", err);
+	UM_DEBUG_DENTRY(dentry);
 	return err;
 }
 
@@ -2802,6 +2810,9 @@ int vfs_unlink_whiteout(struct inode *di
 	}
 	mutex_unlock(&dentry->d_inode->i_mutex);
 
+	UM_DEBUG("err = %d\n", error);
+	UM_DEBUG_DENTRY(dentry);
+
 	/*
 	 * We can call dentry_iput() since nobody could actually do something
 	 * useful with a whiteout. So dropping the reference to the inode
@@ -3490,6 +3501,10 @@ int vfs_rename_union(struct nameidata *o
 	struct dentry *dentry;
 	int error;
 
+	UM_DEBUG_DENTRY(old->dentry);
+	UM_DEBUG_DENTRY(new->dentry);
+/*	return -EPERM; */
+
 	if (old->dentry->d_inode == new->dentry->d_inode)
 		return 0;
 
@@ -3530,6 +3545,9 @@ int vfs_rename_union(struct nameidata *o
 
 	/* possibly delete the existing new file */
 	if ((newnd->dentry == new->dentry->d_parent) && new->dentry->d_inode) {
+		UM_DEBUG("unlink:\n");
+		UM_DEBUG_DENTRY(new->dentry);
+
 		/* FIXME: inode may be truncated while we hold a lock */
 		error = vfs_unlink(new_dir, new->dentry);
 		if (error)
@@ -3540,6 +3558,9 @@ int vfs_rename_union(struct nameidata *o
 		if (IS_ERR(dentry))
 			goto freename;
 
+		UM_DEBUG("new target:\n");
+		UM_DEBUG_DENTRY(new->dentry);
+
 		dput(new->dentry);
 		new->dentry = dentry;
 	}
@@ -3554,6 +3575,10 @@ int vfs_rename_union(struct nameidata *o
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto freename;
+
+	UM_DEBUG("whiteout:\n");
+	UM_DEBUG_DENTRY(dentry);
+
 	error = vfs_whiteout(old_dir, dentry);
 	dput(dentry);
 
@@ -3567,6 +3592,7 @@ int vfs_rename_union(struct nameidata *o
 */
 freename:
 	kfree(old_name.name);
+	UM_DEBUG("err = %d\n", error);
 	return error;
 }
 
--- a/fs/union.c
+++ b/fs/union.c
@@ -18,6 +18,7 @@
 #include <linux/hash.h>
 #include <linux/fs.h>
 #include <linux/union.h>
+#include <linux/union_debug.h>
 #include <linux/module.h>
 #include <linux/file.h>
 #include <linux/mm.h>
@@ -253,6 +254,9 @@ int append_to_union(struct vfsmount *mnt
 
 	BUG_ON(!IS_MNT_UNION(mnt));
 
+	UM_DEBUG_DENTRY(dentry);
+	UM_DEBUG_DENTRY(dest_dentry);
+
 	this = union_alloc(dentry, mnt, dest_dentry, dest_mnt);
 	if (!this)
 		return -ENOMEM;
@@ -822,6 +826,8 @@ int union_relookup_topmost(struct nameid
 	char *kbuf, *name;
 	struct nameidata this;
 
+	UM_DEBUG_DENTRY(nd->dentry);
+
 	kbuf = (char *)__get_free_page(GFP_KERNEL);
 	if (!kbuf)
 		return -ENOMEM;
@@ -838,6 +844,7 @@ int union_relookup_topmost(struct nameid
 	path_release(nd);
 	nd->dentry = this.dentry;
 	nd->mnt = this.mnt;
+	UM_DEBUG_DENTRY(nd->dentry);
 
 	/*
 	 * the nd->flags should be unchanged
@@ -846,6 +853,7 @@ int union_relookup_topmost(struct nameid
 	nd->um_flags &= ~LAST_LOWLEVEL;
  free_page:
 	free_page((unsigned long)kbuf);
+	UM_DEBUG("err = %d\n", err);
 	return err;
 }
 
@@ -895,6 +903,8 @@ struct dentry *union_create_topmost(stru
 	if (IS_ERR(dentry))
 		goto out_unlock;
 
+	UM_DEBUG_DENTRY(dentry);
+
 	switch (mode & S_IFMT) {
 	case S_IFREG:
 		/*
@@ -916,6 +926,9 @@ struct dentry *union_create_topmost(stru
 			dentry = ERR_PTR(res);
 			goto out_unlock;
 		}
+
+		UM_DEBUG_DENTRY(dentry);
+
 		break;
 	case S_IFDIR:
 		res = vfs_mkdir(parent->d_inode, dentry, mode);
@@ -925,6 +938,8 @@ struct dentry *union_create_topmost(stru
 			goto out_unlock;
 		}
 
+		UM_DEBUG_DENTRY(dentry);
+
 		res = append_to_union(nd->mnt, dentry, path->mnt,
 				      path->dentry);
 		if (res) {
@@ -944,6 +959,7 @@ struct dentry *union_create_topmost(stru
 
  out_unlock:
 	mutex_unlock(&parent->d_inode->i_mutex);
+	UM_DEBUG("err = %d\n", IS_ERR(dentry) ? PTR_ERR(dentry) : 0);
 	return dentry;
 }
 
@@ -980,9 +996,14 @@ static int union_copy_file(struct dentry
 	if (ret >= 0)
 		ret = 0;
  fput_new:
+	UM_DEBUG("new_dentry:\n");
+	UM_DEBUG_DENTRY(new_dentry);
 	fput(new_file);
  fput_old:
+	UM_DEBUG("old_dentry:\n");
+	UM_DEBUG_DENTRY(old_dentry);
 	fput(old_file);
+	UM_DEBUG("err = %d\n", ret);
 	return ret;
 }
 
@@ -1059,6 +1080,8 @@ int union_copyup(struct nameidata *nd, i
 	if (!S_ISREG(nd->dentry->d_inode->i_mode))
 		return 0;
 
+	UM_DEBUG_DENTRY(nd->dentry);
+
 	/* safe the name for hash_lookup_union() */
 	this.len = nd->dentry->d_name.len;
 	this.hash = nd->dentry->d_name.hash;
@@ -1084,6 +1107,8 @@ int union_copyup(struct nameidata *nd, i
 	if (err)
 		return err;
 
+	UM_DEBUG_DENTRY(path.dentry);
+
 	err = -ENOENT;
 	if (!path.dentry->d_inode)
 		goto exit_dput;
@@ -1106,9 +1131,11 @@ int union_copyup(struct nameidata *nd, i
 	}
 
 	path_to_nameidata(&path, nd);
+	UM_DEBUG("err = 0\n");
 	return 0;
 
 exit_dput:
 	dput_path(&path, nd);
+	UM_DEBUG("err = %d\n", err);
 	return err;
 }
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -117,17 +117,20 @@ static inline char *nd_get_link(struct n
 
 static inline void pathget(struct path *path)
 {
+	WARN_ON(path->dentry->d_sb != path->mnt->mnt_sb);
 	mntget(path->mnt);
 	dget(path->dentry);
 }
 
 static inline void pathput(struct path *path)
 {
+	WARN_ON(path->dentry->d_sb != path->mnt->mnt_sb);
 	dput(path->dentry);
 	mntput(path->mnt);
 }
 static inline void dput_path(struct path *path, struct nameidata *nd)
 {
+	WARN_ON(path->dentry->d_sb != path->mnt->mnt_sb);
 	dput(path->dentry);
 	if (path->mnt != nd->mnt)
 		mntput(path->mnt);
@@ -135,6 +138,7 @@ static inline void dput_path(struct path
 
 static inline void path_to_nameidata(struct path *path, struct nameidata *nd)
 {
+	WARN_ON(path->dentry->d_sb != path->mnt->mnt_sb);
 	dput(nd->dentry);
 	if (nd->mnt != path->mnt)
 		mntput(nd->mnt);

-- 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 00/26] VFS based Union Mount (V2)
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (25 preceding siblings ...)
  2007-07-30 16:13 ` [RFC 26/26] union-mount: Debug code Jan Blunck
@ 2007-07-30 18:23 ` Al Boldi
  2007-08-02  6:49 ` Bharata B Rao
  27 siblings, 0 replies; 65+ messages in thread
From: Al Boldi @ 2007-07-30 18:23 UTC (permalink / raw)
  To: Jan Blunck; +Cc: Bharata B Rao, linux-fsdevel, linux-kernel

Jan Blunck wrote:
> Here is another post of the VFS based union mount implementation. Unlike
> the traditional mount which hides the contents of the mount point, union
> mounts present the merged view of the mount point and the mounted
> filesytem.

Great!

> Recent changes:
> - brand new union structure no longer tied to the dentryn, now works with
> bind mounts
> - generic part of the whiteout patches extracted
> - introduces MS_WHITEOUT to make the white-out patches independant of the
>   union-mount stuff
> - uses a singleton whiteout inode for the tmpfs filesystem (I need to fix
> this for ext2/3, too)
> - renaming files on unions uses copyup now

I wonder if this copyup functionality could be generalized to induce CoW when 
modifying hard-linked files.  Does that sound feasible?

> - rewrote the union mount debugging code: it is now debugfs/relay based.
> - random cleanups
>
> I'm able to compile the kernel with this patches applied on a  3 layer
> union mount with the seperate layers bind mounted to different locations.
> I haven't done any performance tests since I think there is a more
> important topic ahead: better readdir() support.

What about the umount oops?  Did that get fixed?

> This series is against 2.6.22-rc6-mm1.

Things as big and important like this should probably also be diff'd against 
mainline, to increase testing input.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-30 16:13 ` [RFC 12/26] ext2 " Jan Blunck
@ 2007-07-31  3:45   ` Theodore Tso
  2007-07-31  7:44     ` Jan Blunck
  2007-07-31 16:36   ` Josef Sipek
  1 sibling, 1 reply; 65+ messages in thread
From: Theodore Tso @ 2007-07-31  3:45 UTC (permalink / raw)
  To: Jan Blunck; +Cc: linux-fsdevel, linux-kernel, Bharata B Rao

On Mon, Jul 30, 2007 at 06:13:35PM +0200, Jan Blunck wrote:
> Introduce white-out support to ext2.
> 
> Known Bugs:
> - Needs a reserved inode number for white-outs

You picked different reserved inodes for the ext2 and ext3
filesystems.  That's good for a NACK right there.  The codepoints
(i.e., reserved inode numbers, feature bit masks, etc.) for ext2,
ext3, and ext4 MUST not overlap.  After all, someone might use tune2fs
-j to convert an ext2 filesystem to ext3, and is it's REALLY BAD that
you're using a reserved inode of 7 for ext2, and 9 for ext3.

Also, I note that you have created a new INCOMPAT feature flag support
for whiteouts.  That's really unfortunate; we try to avoid introducing
incompatible feature flags unless absolutely necessary; note that even
adding a COMPAT feature flag means that you need a new version of
e2fsprogs if you want e2fsck to be willing to touch that filesystem.

So --- if you're looking for a way to add whiteout support to
ext2/ext3 without needing a feature bit, here's how.  We allocate a
new inode flag in struct ext3_inode.i_flags:

#define EXT2_WHTOUT_FL	 0x00040000

We also allocate a new field in the ext2 superblock to store the
"whiteout inode".  (Please coordinate with me so it's a superblock
field not in use by ext3/ext4, and so it's reserved so that no one
else uses it.)  The superblock field, call it s_whtout_ino, stores the
inode number for the "white out inode".

When you create a new whiteout file, the code checks sb->s_whtout_ino,
and if it is zero, it allocates a new inode, and creates it as a
zero-length regular file (i_mode |= S_IFREG) with the EXT2_WHTOUT_FL
flag set in the inode, and then store the inode number in
sb->s_whtout_ino.  If sb->s_whtout_ino is non-zero, you must read in
the inode and make sure that the EXT2_WHTOUT_FL is set.  If it is not,
then allocate a new whiteout inode as described previously.  Then link
the inode into the directory as before.

When reading an inode, if the EXT2_WHTOUT_FL flag is set, then set the
in-memory mode of the inode to be S_IFWHT.  

That's pretty much about it.  For cleanliness sake, it would be good
if ext2_delete_inode clears sb->s_whtout_ino if the last whiteout link
has been deleted, but it's strictly speaking not necessary.  If you do
it this way, the filesystem is completely backwards compatible; the
whiteout files will just appear to links to a normal zero-lenth file.

I wouldn't bother with setting the directory type field to be DT_WHT,
given that they will never be returned to userspace anyway.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-31  3:45   ` Theodore Tso
@ 2007-07-31  7:44     ` Jan Blunck
  2007-07-31  8:32       ` Andreas Dilger
  2007-07-31 10:53       ` Theodore Tso
  0 siblings, 2 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-31  7:44 UTC (permalink / raw)
  To: Theodore Tso, linux-fsdevel, linux-kernel, Bharata B Rao

On Mon, Jul 30, Theodore Tso wrote:

> On Mon, Jul 30, 2007 at 06:13:35PM +0200, Jan Blunck wrote:
> > Introduce white-out support to ext2.
> > 
> > Known Bugs:
> > - Needs a reserved inode number for white-outs
> 
> You picked different reserved inodes for the ext2 and ext3
> filesystems.  That's good for a NACK right there.  The codepoints
> (i.e., reserved inode numbers, feature bit masks, etc.) for ext2,
> ext3, and ext4 MUST not overlap.  After all, someone might use tune2fs
> -j to convert an ext2 filesystem to ext3, and is it's REALLY BAD that
> you're using a reserved inode of 7 for ext2, and 9 for ext3.

Ouch, right.

> Also, I note that you have created a new INCOMPAT feature flag support
> for whiteouts.  That's really unfortunate; we try to avoid introducing
> incompatible feature flags unless absolutely necessary; note that even
> adding a COMPAT feature flag means that you need a new version of
> e2fsprogs if you want e2fsck to be willing to touch that filesystem.
> 
> So --- if you're looking for a way to add whiteout support to
> ext2/ext3 without needing a feature bit, here's how.  We allocate a
> new inode flag in struct ext3_inode.i_flags:
> 
> #define EXT2_WHTOUT_FL	 0x00040000
> 
> We also allocate a new field in the ext2 superblock to store the
> "whiteout inode".  (Please coordinate with me so it's a superblock
> field not in use by ext3/ext4, and so it's reserved so that no one
> else uses it.)  The superblock field, call it s_whtout_ino, stores the
> inode number for the "white out inode".
> 
> When you create a new whiteout file, the code checks sb->s_whtout_ino,
> and if it is zero, it allocates a new inode, and creates it as a
> zero-length regular file (i_mode |= S_IFREG) with the EXT2_WHTOUT_FL
> flag set in the inode, and then store the inode number in
> sb->s_whtout_ino.  If sb->s_whtout_ino is non-zero, you must read in
> the inode and make sure that the EXT2_WHTOUT_FL is set.  If it is not,
> then allocate a new whiteout inode as described previously.  Then link
> the inode into the directory as before.
> 
> When reading an inode, if the EXT2_WHTOUT_FL flag is set, then set the
> in-memory mode of the inode to be S_IFWHT.  
> 
> That's pretty much about it.  For cleanliness sake, it would be good
> if ext2_delete_inode clears sb->s_whtout_ino if the last whiteout link
> has been deleted, but it's strictly speaking not necessary.  If you do
> it this way, the filesystem is completely backwards compatible; the
> whiteout files will just appear to links to a normal zero-lenth file.

Ok, this is pretty similar to the way I implemented this for tmpfs. The
problem is that the union mount code is explicitly checking if the filesystem
is supporting whiteout. I used to use a new filesystem flag (FS_WHITEOUT) for
this but thought that disk filesystem like ext2/3/4 will have problem with
that if you mount an old image. So I guess I still need a feature flag.

> I wouldn't bother with setting the directory type field to be DT_WHT,
> given that they will never be returned to userspace anyway.

At the moment I still rely on this for the current readdir implementation.
Viro already said that he doesn't want to see this (the readdir changes) in
the kernel but in userspace.

Thanks,
Jan

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-31  7:44     ` Jan Blunck
@ 2007-07-31  8:32       ` Andreas Dilger
  2007-07-31  9:08         ` Jan Blunck
  2007-07-31 10:53       ` Theodore Tso
  1 sibling, 1 reply; 65+ messages in thread
From: Andreas Dilger @ 2007-07-31  8:32 UTC (permalink / raw)
  To: Jan Blunck; +Cc: Theodore Tso, linux-fsdevel, linux-kernel, Bharata B Rao

On Jul 31, 2007  09:44 +0200, Jan Blunck wrote:
> Ok, this is pretty similar to the way I implemented this for tmpfs. The
> problem is that the union mount code is explicitly checking if the filesystem
> is supporting whiteout. I used to use a new filesystem flag (FS_WHITEOUT) for
> this but thought that disk filesystem like ext2/3/4 will have problem with
> that if you mount an old image. So I guess I still need a feature flag.

You also need whiteout support for extents.  This could be done with
unwritten extents potentially, or as I previously proposed (RFC) in
linux-ext4.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-31  8:32       ` Andreas Dilger
@ 2007-07-31  9:08         ` Jan Blunck
  0 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-31  9:08 UTC (permalink / raw)
  To: Theodore Tso, linux-fsdevel, linux-kernel, Bharata B Rao

On Tue, Jul 31, Andreas Dilger wrote:

> On Jul 31, 2007  09:44 +0200, Jan Blunck wrote:
> > Ok, this is pretty similar to the way I implemented this for tmpfs. The
> > problem is that the union mount code is explicitly checking if the filesystem
> > is supporting whiteout. I used to use a new filesystem flag (FS_WHITEOUT) for
> > this but thought that disk filesystem like ext2/3/4 will have problem with
> > that if you mount an old image. So I guess I still need a feature flag.
> 
> You also need whiteout support for extents.  This could be done with
> unwritten extents potentially, or as I previously proposed (RFC) in
> linux-ext4.

Maybe. But this is about something totally different: a whiteout filetype, an
existing file that when it is found make the VFS return -ENOENT.

Cheers,
Jan

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-31  7:44     ` Jan Blunck
  2007-07-31  8:32       ` Andreas Dilger
@ 2007-07-31 10:53       ` Theodore Tso
  2007-08-02 19:31         ` Pavel Machek
  1 sibling, 1 reply; 65+ messages in thread
From: Theodore Tso @ 2007-07-31 10:53 UTC (permalink / raw)
  To: Jan Blunck; +Cc: linux-fsdevel, linux-kernel, Bharata B Rao

On Tue, Jul 31, 2007 at 09:44:36AM +0200, Jan Blunck wrote:
> Ok, this is pretty similar to the way I implemented this for tmpfs. The
> problem is that the union mount code is explicitly checking if the filesystem
> is supporting whiteout. I used to use a new filesystem flag (FS_WHITEOUT) for
> this but thought that disk filesystem like ext2/3/4 will have problem with
> that if you mount an old image. So I guess I still need a feature flag.

Without the method I described to you, *any* ext2/3/4 filesystem will
support whiteouts (as long as you have the support code compiled into
the kernel :-), so there's no need for a feature flag.

> > I wouldn't bother with setting the directory type field to be DT_WHT,
> > given that they will never be returned to userspace anyway.
> 
> At the moment I still rely on this for the current readdir implementation.
> Viro already said that he doesn't want to see this (the readdir changes) in
> the kernel but in userspace.

Life gets very messy if you have to do this in userspace.  Example:
statically linked programs that were compiled with a version of glibc
that didn't know about whiteout records.  Unfortunately, the memory
needed to to collate directories entries so that whiteout records can
be dropped is painful enough that completely understand why Al doesn't
want to see this in userspace.  Unfortunately this is going to be one
of those things that will make union mounts problematic, compared to
something like unionfs.

						- Ted

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-30 16:13 ` [RFC 12/26] ext2 " Jan Blunck
  2007-07-31  3:45   ` Theodore Tso
@ 2007-07-31 16:36   ` Josef Sipek
  2007-07-31 17:00     ` Jan Blunck
                       ` (2 more replies)
  1 sibling, 3 replies; 65+ messages in thread
From: Josef Sipek @ 2007-07-31 16:36 UTC (permalink / raw)
  To: Jan Blunck; +Cc: linux-fsdevel, linux-kernel, Bharata B Rao

On Mon, Jul 30, 2007 at 06:13:35PM +0200, Jan Blunck wrote:
> Introduce white-out support to ext2.

I think storing whiteouts on the branches is wrong. It creates all sort of
nasty cases when people actually try to use unioning. Imagine a (no-so
unlikely) scenario where you have 2 unions, and they share a branch. If you
create a whiteout in one union on that shared branch, the whiteout magically
affects the other union as well! Whiteouts are a union-level construct, and
therefore storing them at the branch level is wrong.

If you store whiteouts on the branches, you'll probably want readdir to not
include them. That's relatively cheap if you have a whiteout bit in the
inode, but I don't think filesystems should be forced to use up rather
prescious inode bits for whiteouts/opaqueness [1].

Really the only sane way of keeping track of whiteouts seems some external
store. We did an experiment with Unionfs, and moving the whiteout handling
to effectively a "library" that did all the dirty work cleaned up the code
considerably [2,3].

> Known Bugs:
> - Needs a reserved inode number for white-outs
> - S_OPAQUE isn't persistently stored

Out of curiosity, how do you keep track of opaqueness while the fs is
mounted?

Josef 'Jeff' Sipek.

[1] http://www.mail-archive.com/linux-fsdevel@vger.kernel.org/msg02904.html
[2] http://www.filesystems.org/unionfs-odf.txt
[3] http://download.filesystems.org/unionfs/unionfs-2.0-odf/linux-2.6.20-rc6-odf1.diff.gz

-- 
UNIX is user-friendly ... it's just selective about who it's friends are

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-31 16:36   ` Josef Sipek
@ 2007-07-31 17:00     ` Jan Blunck
  2007-07-31 17:11       ` Josef Sipek
  2007-08-01 10:00       ` Hans-Peter Jansen
  2007-07-31 17:03     ` Mark Williamson
  2007-08-01 17:58     ` Jan Engelhardt
  2 siblings, 2 replies; 65+ messages in thread
From: Jan Blunck @ 2007-07-31 17:00 UTC (permalink / raw)
  To: Josef Sipek; +Cc: linux-fsdevel, linux-kernel, Bharata B Rao

On Tue, Jul 31, Josef Sipek wrote:

> On Mon, Jul 30, 2007 at 06:13:35PM +0200, Jan Blunck wrote:
> > Introduce white-out support to ext2.
> 
> I think storing whiteouts on the branches is wrong. It creates all sort of
> nasty cases when people actually try to use unioning. Imagine a (no-so
> unlikely) scenario where you have 2 unions, and they share a branch. If you
> create a whiteout in one union on that shared branch, the whiteout magically
> affects the other union as well! Whiteouts are a union-level construct, and
> therefore storing them at the branch level is wrong.

So you think that just because you mounted the filesystem somewhere else it
should look different? This is what sharing is all about. If you share a
filesystem you also share the removal of objects.

> If you store whiteouts on the branches, you'll probably want readdir to not
> include them. That's relatively cheap if you have a whiteout bit in the
> inode, but I don't think filesystems should be forced to use up rather
> prescious inode bits for whiteouts/opaqueness [1].

How filesystem implement the whiteout filetype is up to them.

> Really the only sane way of keeping track of whiteouts seems some external
> store. We did an experiment with Unionfs, and moving the whiteout handling
> to effectively a "library" that did all the dirty work cleaned up the code
> considerably [2,3].

Haven't checked if you could use ODF for a generic store for filesystems that
couldn't support whiteouts. This might be an interesting idea.

> > Known Bugs:
> > - Needs a reserved inode number for white-outs
> > - S_OPAQUE isn't persistently stored
> 
> Out of curiosity, how do you keep track of opaqueness while the fs is
> mounted?

Its an inode flag (S_OPAQUE).

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-31 16:36   ` Josef Sipek
  2007-07-31 17:00     ` Jan Blunck
@ 2007-07-31 17:03     ` Mark Williamson
  2007-07-31 17:16       ` Josef Sipek
  2007-08-01 17:58     ` Jan Engelhardt
  2 siblings, 1 reply; 65+ messages in thread
From: Mark Williamson @ 2007-07-31 17:03 UTC (permalink / raw)
  To: Josef Sipek; +Cc: Jan Blunck, linux-fsdevel, linux-kernel, Bharata B Rao

> Really the only sane way of keeping track of whiteouts seems some external
> store. We did an experiment with Unionfs, and moving the whiteout handling
> to effectively a "library" that did all the dirty work cleaned up the code
> considerably [2,3].

What about keeping track of whiteouts in a special file (or files) in the top 
level filesystem of the union?  For instance, having a /.whiteouts file at 
the root of the top FS in the stack, instead of storing union-specific data 
in the flags / inode numbers of the lower levels.

This file could also e.g. store the UUID of the lower level FS (if 
appropriate) so that in subsequent mounts (which might attempt a union with a 
different lower level branch) you can tell if the whiteouts have meaning.  
The whiteout history could be flushed by directly mounting the FS and doing 
rm .whiteouts.

This might avoid requiring a store external to the stack of filesystems and I 
believe it would solve the problem with shared branches and arbitrary 
stacking that you described?

I guess a rather similar effect could be had by somehow storing loopback 
mountable ODF filesystems in the top layer of a union somewhere (e.g. with 
the default path /.odf) and allowing the user to specify an alternate 
location at mount time if necessary.  So maybe these approaches are quite 
similar after all...

Cheers,
Mark

-- 
Dave: Just a question. What use is a unicyle with no seat?  And no pedals!
Mark: To answer a question with a question: What use is a skateboard?
Dave: Skateboards have wheels.
Mark: My wheel has a wheel!

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-31 17:00     ` Jan Blunck
@ 2007-07-31 17:11       ` Josef Sipek
  2007-08-01 15:23         ` Dave Kleikamp
  2007-08-02 10:26         ` Jan Blunck
  2007-08-01 10:00       ` Hans-Peter Jansen
  1 sibling, 2 replies; 65+ messages in thread
From: Josef Sipek @ 2007-07-31 17:11 UTC (permalink / raw)
  To: Jan Blunck; +Cc: linux-fsdevel, linux-kernel, Bharata B Rao

On Tue, Jul 31, 2007 at 07:00:12PM +0200, Jan Blunck wrote:
> On Tue, Jul 31, Josef Sipek wrote:
> 
> > On Mon, Jul 30, 2007 at 06:13:35PM +0200, Jan Blunck wrote:
> > > Introduce white-out support to ext2.
> > 
> > I think storing whiteouts on the branches is wrong. It creates all sort of
> > nasty cases when people actually try to use unioning. Imagine a (no-so
> > unlikely) scenario where you have 2 unions, and they share a branch. If you
> > create a whiteout in one union on that shared branch, the whiteout magically
> > affects the other union as well! Whiteouts are a union-level construct, and
> > therefore storing them at the branch level is wrong.
> 
> So you think that just because you mounted the filesystem somewhere else it
> should look different? This is what sharing is all about. If you share a
> filesystem you also share the removal of objects.

The removal happens at the union level, not the branch level. Say you have:

/a/
/b/foo
/c/foo

And you mount /u1 as a union of {a,b}, and /u2 as union of {a,c}.

$ find /u*
/u1
/u1/foo
/u2
/u2/foo
$ rm /u1/foo # this creates whiteout for "foo" in /a
$ find /u*
/u1
/u2

Is that what you'd expect as a user? I don't think so.

...
> > Really the only sane way of keeping track of whiteouts seems some external
> > store. We did an experiment with Unionfs, and moving the whiteout handling
> > to effectively a "library" that did all the dirty work cleaned up the code
> > considerably [2,3].
> 
> Haven't checked if you could use ODF for a generic store for filesystems that
> couldn't support whiteouts. This might be an interesting idea.
 
Yes, since the ODF is completely separate, you can use _any_ filesystem and
regardless of whether or not they support whiteouts.

Josef 'Jeff' Sipek.

-- 
Once you have their hardware. Never give it back.
(The First Rule of Hardware Acquisition)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-31 17:03     ` Mark Williamson
@ 2007-07-31 17:16       ` Josef Sipek
  0 siblings, 0 replies; 65+ messages in thread
From: Josef Sipek @ 2007-07-31 17:16 UTC (permalink / raw)
  To: Mark Williamson; +Cc: Jan Blunck, linux-fsdevel, linux-kernel, Bharata B Rao

On Tue, Jul 31, 2007 at 06:03:06PM +0100, Mark Williamson wrote:
> > Really the only sane way of keeping track of whiteouts seems some external
> > store. We did an experiment with Unionfs, and moving the whiteout handling
> > to effectively a "library" that did all the dirty work cleaned up the code
> > considerably [2,3].
> 
> What about keeping track of whiteouts in a special file (or files) in the top 
> level filesystem of the union?  For instance, having a /.whiteouts file at 
> the root of the top FS in the stack, instead of storing union-specific data 
> in the flags / inode numbers of the lower levels.
 
What is needed is a "filesystem" that has all the directory bits only. For
ODF, we opted to "abuse" existing filesystems to see if it actually helped
Unionfs, and I think it did help. Really, now what we (unionfs) need is a
cleanup of the ODF code, with a bit better defined interface.

...
> This might avoid requiring a store external to the stack of filesystems and I 
> believe it would solve the problem with shared branches and arbitrary 
> stacking that you described?

We generally did a loopback mount on a file. Very similar to your idea.

> I guess a rather similar effect could be had by somehow storing loopback 
> mountable ODF filesystems in the top layer of a union somewhere (e.g. with 
> the default path /.odf) and allowing the user to specify an alternate 
> location at mount time if necessary.  So maybe these approaches are quite 
> similar after all...

Very :) We forced the user to mount the fs in the odf loopback manually, but
there's no reason why we couldn't do an in-kernel mount on unionfs mount
time.

Josef 'Jeff' Sipek.

-- 
Once you have their hardware. Never give it back.
(The First Rule of Hardware Acquisition)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-31 17:00     ` Jan Blunck
  2007-07-31 17:11       ` Josef Sipek
@ 2007-08-01 10:00       ` Hans-Peter Jansen
  2007-08-01 11:43         ` Josef Sipek
  2007-08-01 18:01         ` Jan Engelhardt
  1 sibling, 2 replies; 65+ messages in thread
From: Hans-Peter Jansen @ 2007-08-01 10:00 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jan Blunck, Josef Sipek, linux-fsdevel, Bharata B Rao

Am Dienstag, 31. Juli 2007 19:00 schrieb Jan Blunck:
> On Tue, Jul 31, Josef Sipek wrote:
> > On Mon, Jul 30, 2007 at 06:13:35PM +0200, Jan Blunck wrote:
> > > Introduce white-out support to ext2.
> >
> > I think storing whiteouts on the branches is wrong. It creates all sort
> > of nasty cases when people actually try to use unioning. Imagine a
> > (no-so unlikely) scenario where you have 2 unions, and they share a
> > branch. If you create a whiteout in one union on that shared branch,
> > the whiteout magically affects the other union as well! Whiteouts are a
> > union-level construct, and therefore storing them at the branch level
> > is wrong.
>
> So you think that just because you mounted the filesystem somewhere else
> it should look different? This is what sharing is all about. If you share
> a filesystem you also share the removal of objects.

No. At least I don't. 

Usage case: I heavily depend on using union mounts in diskless nfs setups, 
since it drops the amount of administration of many systems _near_ one. It 
boils down on installing the distribution of your choice in a directory, 
union mount it ro, overlayed with a node private one (doing this in initrd 
on the client for several reasons), add a little boot and automatic setup 
machinery and be done. Since all changes are persistant, any system can be 
set up individually, and still mostly only one tree is needed to keep up to 
date.. Being in production in an office environment since two years without 
major hassle (*).

This setup is likely to be useful for virtualization needs, too, but side 
effects via the base directory from one node to another would render this 
setup void.

Cheers,
  Pete

*) The amount of administration work of any (necessary, unfortunately) 
VMware XP instance running on top of those diskless clients excels that of 
all diskless clients by an order of magnitude. 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-08-01 10:00       ` Hans-Peter Jansen
@ 2007-08-01 11:43         ` Josef Sipek
  2007-08-01 18:01         ` Jan Engelhardt
  1 sibling, 0 replies; 65+ messages in thread
From: Josef Sipek @ 2007-08-01 11:43 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: linux-kernel, Jan Blunck, linux-fsdevel, Bharata B Rao

On Wed, Aug 01, 2007 at 12:00:42PM +0200, Hans-Peter Jansen wrote:
> Am Dienstag, 31. Juli 2007 19:00 schrieb Jan Blunck:
> > On Tue, Jul 31, Josef Sipek wrote:
> > > On Mon, Jul 30, 2007 at 06:13:35PM +0200, Jan Blunck wrote:
> > > > Introduce white-out support to ext2.
> > >
> > > I think storing whiteouts on the branches is wrong. It creates all sort
> > > of nasty cases when people actually try to use unioning. Imagine a
> > > (no-so unlikely) scenario where you have 2 unions, and they share a
> > > branch. If you create a whiteout in one union on that shared branch,
> > > the whiteout magically affects the other union as well! Whiteouts are a
> > > union-level construct, and therefore storing them at the branch level
> > > is wrong.
> >
> > So you think that just because you mounted the filesystem somewhere else
> > it should look different? This is what sharing is all about. If you share
> > a filesystem you also share the removal of objects.
> 
> No. At least I don't. 
> 
> Usage case: I heavily depend on using union mounts in diskless nfs setups, 
> since it drops the amount of administration of many systems _near_ one. It 
> boils down on installing the distribution of your choice in a directory, 
> union mount it ro, overlayed with a node private one (doing this in initrd 
> on the client for several reasons),

You're not sharing the rw layer so it's a different scenario, and will not
have the problem I'm talking about. See my other post [1] for exact scenario
where storing whiteouts on a branch would cause problems.

> add a little boot and automatic setup 
> machinery and be done. Since all changes are persistant, any system can be 
> set up individually, and still mostly only one tree is needed to keep up to 
> date.. Being in production in an office environment since two years without 
> major hassle (*).

Unionfs is used by many people in this way.

Josef 'Jeff' Sipek.

[1] http://lkml.org/lkml/2007/7/31/365

-- 
Intellectuals solve problems; geniuses prevent them
		- Albert Einstein

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 11/26] tmpfs white-out support
  2007-07-30 16:13 ` [RFC 11/26] tmpfs white-out support Jan Blunck
@ 2007-08-01 15:13   ` Hugh Dickins
  2007-08-02  2:48     ` Matt Mackall
  0 siblings, 1 reply; 65+ messages in thread
From: Hugh Dickins @ 2007-08-01 15:13 UTC (permalink / raw)
  To: Jan Blunck; +Cc: linux-fsdevel, linux-kernel, Bharata B Rao

On Mon, 30 Jul 2007, Jan Blunck wrote:

> Introduce white-out support to tmpfs.
> 
> Signed-off-by: Jan Blunck <jblunck@suse.de>
> ---
>  include/linux/shmem_fs.h |    1 
>  mm/shmem.c               |   54 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 55 insertions(+)

I see there's debate about whether this (and its fellows) give the
right semantic to whiteouts; and I've not begun to think about that.

But as a patch to tmpfs for what you're trying to do, it looks just
about fine.  I say "just about" because the reference counting looks
right, but I wouldn't dare say that it _is_ right without testing.

And I'd probably want to add a minor adjustment, so that a mount with
nr_inodes=1000 could still support exactly 1000 inodes, despite your
allocating one for the whiteout (usually never used) at mount time.
But that can follow along later, no problem.

Hugh

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-31 17:11       ` Josef Sipek
@ 2007-08-01 15:23         ` Dave Kleikamp
  2007-08-01 18:44           ` Josef Sipek
  2007-08-02 10:26         ` Jan Blunck
  1 sibling, 1 reply; 65+ messages in thread
From: Dave Kleikamp @ 2007-08-01 15:23 UTC (permalink / raw)
  To: Josef Sipek; +Cc: Jan Blunck, linux-fsdevel, linux-kernel, Bharata B Rao

On Tue, 2007-07-31 at 13:11 -0400, Josef Sipek wrote:
> On Tue, Jul 31, 2007 at 07:00:12PM +0200, Jan Blunck wrote:
> > On Tue, Jul 31, Josef Sipek wrote:
> > 
> > > On Mon, Jul 30, 2007 at 06:13:35PM +0200, Jan Blunck wrote:
> > > > Introduce white-out support to ext2.
> > > 
> > > I think storing whiteouts on the branches is wrong. It creates all sort of
> > > nasty cases when people actually try to use unioning. Imagine a (no-so
> > > unlikely) scenario where you have 2 unions, and they share a branch. If you
> > > create a whiteout in one union on that shared branch, the whiteout magically
> > > affects the other union as well! Whiteouts are a union-level construct, and
> > > therefore storing them at the branch level is wrong.
> > 
> > So you think that just because you mounted the filesystem somewhere else it
> > should look different? This is what sharing is all about. If you share a
> > filesystem you also share the removal of objects.
> 
> The removal happens at the union level, not the branch level. Say you have:
> 
> /a/
> /b/foo
> /c/foo
> 
> And you mount /u1 as a union of {a,b}, and /u2 as union of {a,c}.

Who does this?  I'm assuming that a is the "top" layer.  Aren't union
mounts typically about sharing lower layers and having a separate rw
layer for each union mount?

> $ find /u*
> /u1
> /u1/foo
> /u2
> /u2/foo
> $ rm /u1/foo # this creates whiteout for "foo" in /a
> $ find /u*
> /u1
> /u2
> 
> Is that what you'd expect as a user? I don't think so.

That's exactly what I would expect.

If I were to:
$ echo "this is new" > /u1/foo

I would expect:
$ cat /u2/foo
this is new

So why should rm behave differently?

I haven't really been tuned into union mounts, so maybe I'm missing out
on something basic here.

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-31 16:36   ` Josef Sipek
  2007-07-31 17:00     ` Jan Blunck
  2007-07-31 17:03     ` Mark Williamson
@ 2007-08-01 17:58     ` Jan Engelhardt
  2007-08-01 18:03       ` Josef Sipek
  2 siblings, 1 reply; 65+ messages in thread
From: Jan Engelhardt @ 2007-08-01 17:58 UTC (permalink / raw)
  To: Josef Sipek; +Cc: Jan Blunck, linux-fsdevel, linux-kernel, Bharata B Rao


On Jul 31 2007 12:36, Josef Sipek wrote:
>[2] http://www.filesystems.org/unionfs-odf.txt

>Instead, the new ODF code stores whiteouts as hardlinks to a special
>(regular) zero-length file in odf (/odf/whiteout), and it stores opaqueness
>information for directories in the inode GID bits in an ODF file system
>(e.g., ext2, XFS, etc.) on the local machine.  This avoids the name-space
>pollution and avoids races with network file systems, while minimizing inode
>consummation in /odf.

Inode GID bits - are you reducing my 32 bits of gid_t to 31 bits?
That does not work out either.



	Jan
-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-08-01 10:00       ` Hans-Peter Jansen
  2007-08-01 11:43         ` Josef Sipek
@ 2007-08-01 18:01         ` Jan Engelhardt
  1 sibling, 0 replies; 65+ messages in thread
From: Jan Engelhardt @ 2007-08-01 18:01 UTC (permalink / raw)
  To: Hans-Peter Jansen
  Cc: linux-kernel, Jan Blunck, Josef Sipek, linux-fsdevel, Bharata B Rao


On Aug 1 2007 12:00, Hans-Peter Jansen wrote:
>
>*) The amount of administration work of any (necessary, unfortunately) 
>VMware XP instance running on top of those diskless clients excels that of 
>all diskless clients by an order of magnitude.

Hardly :)
Install XP, snapshot it when done. Copy .vmdk to 'all' machines.
On security upgrades, revert to snapshot (well - if the workflow allows it),
install, snapshot again. Etc.
Work: 1 1/2.


	Jan
-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-08-01 17:58     ` Jan Engelhardt
@ 2007-08-01 18:03       ` Josef Sipek
  0 siblings, 0 replies; 65+ messages in thread
From: Josef Sipek @ 2007-08-01 18:03 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Jan Blunck, linux-fsdevel, linux-kernel, Bharata B Rao

On Wed, Aug 01, 2007 at 07:58:49PM +0200, Jan Engelhardt wrote:
> 
> On Jul 31 2007 12:36, Josef Sipek wrote:
> >[2] http://www.filesystems.org/unionfs-odf.txt
> 
> >Instead, the new ODF code stores whiteouts as hardlinks to a special
> >(regular) zero-length file in odf (/odf/whiteout), and it stores opaqueness
> >information for directories in the inode GID bits in an ODF file system
> >(e.g., ext2, XFS, etc.) on the local machine.  This avoids the name-space
> >pollution and avoids races with network file systems, while minimizing inode
> >consummation in /odf.
> 
> Inode GID bits - are you reducing my 32 bits of gid_t to 31 bits?
> That does not work out either.

No. The ODF code just uses the GID bits to store extra info. The GID is
_NOT_ used to store the GID of the file. The GID of the file is still coming
from the branches.

Josef 'Jeff' Sipek.

-- 
I abhor a system designed for the "user", if that word is a coded pejorative
meaning "stupid and unsophisticated."
		- Ken Thompson

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-08-01 15:23         ` Dave Kleikamp
@ 2007-08-01 18:44           ` Josef Sipek
  2007-08-01 19:10             ` Dave Kleikamp
  2007-08-02  5:24             ` Ph. Marek
  0 siblings, 2 replies; 65+ messages in thread
From: Josef Sipek @ 2007-08-01 18:44 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: Jan Blunck, linux-fsdevel, linux-kernel, Bharata B Rao, hch

On Wed, Aug 01, 2007 at 10:23:29AM -0500, Dave Kleikamp wrote:
> On Tue, 2007-07-31 at 13:11 -0400, Josef Sipek wrote:
> > On Tue, Jul 31, 2007 at 07:00:12PM +0200, Jan Blunck wrote:
> > > On Tue, Jul 31, Josef Sipek wrote:
> > > 
> > > > On Mon, Jul 30, 2007 at 06:13:35PM +0200, Jan Blunck wrote:
> > > > > Introduce white-out support to ext2.
> > > > 
> > > > I think storing whiteouts on the branches is wrong. It creates all sort of
> > > > nasty cases when people actually try to use unioning. Imagine a (no-so
> > > > unlikely) scenario where you have 2 unions, and they share a branch. If you
> > > > create a whiteout in one union on that shared branch, the whiteout magically
> > > > affects the other union as well! Whiteouts are a union-level construct, and
> > > > therefore storing them at the branch level is wrong.
> > > 
> > > So you think that just because you mounted the filesystem somewhere else it
> > > should look different? This is what sharing is all about. If you share a
> > > filesystem you also share the removal of objects.
> > 
> > The removal happens at the union level, not the branch level. Say you have:
> > 
> > /a/
> > /b/foo
> > /c/foo
> > 
> > And you mount /u1 as a union of {a,b}, and /u2 as union of {a,c}.
> 
> Who does this?  I'm assuming that a is the "top" layer.  Aren't union
> mounts typically about sharing lower layers and having a separate rw
> layer for each union mount?
 
Alright not the greatest of examples, there is something to be said about
symmetry, so...let me try again :)

/a/
/b/bar		(whiteout for bar)
/c/foo/qwerty

Now, let's mount a union of {a,b,c}, and we'll see:

$ find /u
/u
/u/foo
/u/foo/qwerty
$ mv /u/foo /u/bar

Now what? How do you rename? Do you rename in the same branch (assuming it
is rw)? If you do, you'll get:

$ find /u
/u

Oops! There's a whiteout in /b that hides the directory in /c -- rename(2)
shouldn't make directory subtrees disappear.

There are two ways to solve this:

1) "cp -r" the entire subtree being renamed to highest-priority branch, and
rename there (you might have to recreate a series of directories to have a
place to "cp" to...so you got "cp -r" _AND_ "mkdir -p"-like code in the VFS!
1/2 a :) )

2) Don't store whiteouts within branches. This makes it really easy to
rename and remove the whiteout.

Sure, you could try to rename in-place and remove the whiteout, but what if
you have:

/a/
/b/bar		(whiteout)
/c/bar/blah
/d/foo/qwerty

$ mv /u/foo /u/bar

You can't just remove the whiteout, because that'd uncover the whited-out
directory bar in /c.

Josef 'Jeff' Sipek.

-- 
Bad pun of the week: The formula 1 control computer suffered from a race
condition

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-08-01 18:44           ` Josef Sipek
@ 2007-08-01 19:10             ` Dave Kleikamp
  2007-08-01 19:33               ` Josef Sipek
  2007-08-02  5:24             ` Ph. Marek
  1 sibling, 1 reply; 65+ messages in thread
From: Dave Kleikamp @ 2007-08-01 19:10 UTC (permalink / raw)
  To: Josef Sipek; +Cc: Jan Blunck, linux-fsdevel, linux-kernel, Bharata B Rao, hch

On Wed, 2007-08-01 at 14:44 -0400, Josef Sipek wrote:
> Alright not the greatest of examples, there is something to be said about
> symmetry, so...let me try again :)
> 
> /a/
> /b/bar		(whiteout for bar)
> /c/foo/qwerty
> 
> Now, let's mount a union of {a,b,c}, and we'll see:
> 
> $ find /u
> /u
> /u/foo
> /u/foo/qwerty
> $ mv /u/foo /u/bar
> 
> Now what? How do you rename? Do you rename in the same branch (assuming it
> is rw)?

Er, no.  According to Documentation/filesystems/union-mounts.txt, "only
the topmost layer of the mount stack can be altered".

> If you do, you'll get:
> 
> $ find /u
> /u
> 
> Oops! There's a whiteout in /b that hides the directory in /c -- rename(2)
> shouldn't make directory subtrees disappear.
> 
> There are two ways to solve this:
> 
> 1) "cp -r" the entire subtree being renamed to highest-priority branch, and
> rename there (you might have to recreate a series of directories to have a
> place to "cp" to...so you got "cp -r" _AND_ "mkdir -p"-like code in the VFS!
> 1/2 a :) )

I think this is the only alternative, given the design.

> 2) Don't store whiteouts within branches. This makes it really easy to
> rename and remove the whiteout.
> 
> Sure, you could try to rename in-place and remove the whiteout, but what if
> you have:
> 
> /a/
> /b/bar		(whiteout)
> /c/bar/blah
> /d/foo/qwerty
> 
> $ mv /u/foo /u/bar
> 
> You can't just remove the whiteout, because that'd uncover the whited-out
> directory bar in /c.
> 
> Josef 'Jeff' Sipek.
> 
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-08-01 19:10             ` Dave Kleikamp
@ 2007-08-01 19:33               ` Josef Sipek
  2007-08-01 19:52                 ` Dave Kleikamp
                                   ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Josef Sipek @ 2007-08-01 19:33 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: Jan Blunck, linux-fsdevel, linux-kernel, Bharata B Rao, hch

On Wed, Aug 01, 2007 at 02:10:31PM -0500, Dave Kleikamp wrote:
> On Wed, 2007-08-01 at 14:44 -0400, Josef Sipek wrote:
> > Alright not the greatest of examples, there is something to be said about
> > symmetry, so...let me try again :)
> > 
> > /a/
> > /b/bar		(whiteout for bar)
> > /c/foo/qwerty
> > 
> > Now, let's mount a union of {a,b,c}, and we'll see:
> > 
> > $ find /u
> > /u
> > /u/foo
> > /u/foo/qwerty
> > $ mv /u/foo /u/bar
> > 
> > Now what? How do you rename? Do you rename in the same branch (assuming it
> > is rw)?
> 
> Er, no.  According to Documentation/filesystems/union-mounts.txt, "only
> the topmost layer of the mount stack can be altered".
 
This brings up an very interesting (but painful) question...which makes more
sense? Allowing the modifications in only the top-most branch, or any branch
(given the user allows it at mount-time)?

This is really question to the community at large, not just you, Dave :)

> > 1) "cp -r" the entire subtree being renamed to highest-priority branch, and
> > rename there (you might have to recreate a series of directories to have a
> > place to "cp" to...so you got "cp -r" _AND_ "mkdir -p"-like code in the VFS!
> > 1/2 a :) )
> 
> I think this is the only alternative, given the design.
 
Right. Doing something like this at the filesystem level (as we do in
unionfs) seems less painful - filesystems are places full of all sorts of
nefarious activities to begin with. Having it in the VFS seems...even
uglier.

Josef 'Jeff' Sipek.

-- 
*NOTE: This message is ROT-13 encrypted twice for extra protection*

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-08-01 19:33               ` Josef Sipek
@ 2007-08-01 19:52                 ` Dave Kleikamp
  2007-08-01 22:06                   ` Erez Zadok
  2007-08-02 11:55                 ` Jan Blunck
  2007-08-02 17:50                 ` Jörn Engel
  2 siblings, 1 reply; 65+ messages in thread
From: Dave Kleikamp @ 2007-08-01 19:52 UTC (permalink / raw)
  To: Josef Sipek; +Cc: Jan Blunck, linux-fsdevel, linux-kernel, Bharata B Rao, hch

On Wed, 2007-08-01 at 15:33 -0400, Josef Sipek wrote:
> On Wed, Aug 01, 2007 at 02:10:31PM -0500, Dave Kleikamp wrote:
> > On Wed, 2007-08-01 at 14:44 -0400, Josef Sipek wrote:
> > > Now what? How do you rename? Do you rename in the same branch (assuming it
> > > is rw)?
> > 
> > Er, no.  According to Documentation/filesystems/union-mounts.txt, "only
> > the topmost layer of the mount stack can be altered".
> 
> This brings up an very interesting (but painful) question...which makes more
> sense? Allowing the modifications in only the top-most branch, or any branch
> (given the user allows it at mount-time)?

Your examples point out the complexity of trying to allow modifications
at lower levels.  It seems to me to be simpler (even if recursive copies
are needed) to leave it as proposed.

> This is really question to the community at large, not just you, Dave :)

I agree, but I have to add my $.02.

> > > 1) "cp -r" the entire subtree being renamed to highest-priority branch, and
> > > rename there (you might have to recreate a series of directories to have a
> > > place to "cp" to...so you got "cp -r" _AND_ "mkdir -p"-like code in the VFS!
> > > 1/2 a :) )
> > 
> > I think this is the only alternative, given the design.
> 
> Right. Doing something like this at the filesystem level (as we do in
> unionfs) seems less painful - filesystems are places full of all sorts of
> nefarious activities to begin with. Having it in the VFS seems...even
> uglier.

I haven't looked at either implementation close enough to offer an
opinion here that I would be able to defend.  I'm sure others have their
opinions.

> Josef 'Jeff' Sipek.
> 

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-08-01 19:52                 ` Dave Kleikamp
@ 2007-08-01 22:06                   ` Erez Zadok
  2007-08-02 12:05                     ` Jan Blunck
  0 siblings, 1 reply; 65+ messages in thread
From: Erez Zadok @ 2007-08-01 22:06 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Josef Sipek, Jan Blunck, linux-fsdevel, linux-kernel, Bharata B Rao, hch

In message <1185997941.18007.30.camel@kleikamp.austin.ibm.com>, Dave Kleikamp writes:
> On Wed, 2007-08-01 at 15:33 -0400, Josef Sipek wrote:
> > On Wed, Aug 01, 2007 at 02:10:31PM -0500, Dave Kleikamp wrote:
> > > On Wed, 2007-08-01 at 14:44 -0400, Josef Sipek wrote:
> > > > Now what? How do you rename? Do you rename in the same branch (assuming it
> > > > is rw)?
> > > 
> > > Er, no.  According to Documentation/filesystems/union-mounts.txt, "only
> > > the topmost layer of the mount stack can be altered".
> > 
> > This brings up an very interesting (but painful) question...which makes more
> > sense? Allowing the modifications in only the top-most branch, or any branch
> > (given the user allows it at mount-time)?
> 
> Your examples point out the complexity of trying to allow modifications
> at lower levels.  It seems to me to be simpler (even if recursive copies
> are needed) to leave it as proposed.
[...]

There are three other reasons why Unionfs and our users like to have
multiple writable branches:

1. If only the topmost layer is writable, then every little change tends to
   cause a copyup, which tends to clutter the top layer more quickly.  Some
   of our users didn't like that idea, while others explicitly wanted it --
   so we give them a choice to decide, on a per layer/branch whether it
   should be writable or readonly.

2. Some users unify different packages together.  Imagine you union under
   /union, several installed packages: /X11R6/{bin,man,lib,conf},
   /apache/{bin,man,lib,etc}, and /mysql/{bin,man,lib,etc}, and so on.  If a
   user modifies /union/apache/etc/apache.conf, they sometimes want
   apache.conf to remain in the writable branch it came from, not copied up.
   That way all apache related files are logically left where they came
   from, which makes administration easier.  Again, some users like to have
   multiple writable branches, and some don't -- so in Unionfs we give them
   the choice.  And yes, it does make our implementation more complex.

3. Some people use Unionfs in the scenario described in point #2 above, as a
   poor man's space- and load- distribution system.  Some of our users like
   the idea of controlling how much storage space they give each branch, and
   how much it might grow, and even how much CPU or I/O load might be placed
   on each of the lower filesystems which serve a given branch.  That way
   they worry less about the top-layer's space filling up more quickly than
   expected.  Now Unionfs was never designed to be a load-balancing f/s (we
   have RAIF for that, see <http://www.filesystems.org/project-raif.html>),
   but users seems to always find creative ways to [ab]use one's software in
   ways one never thought of. :-)

BTW, does Union Mounts copyup on meta-data changes (e.g., chmod, chgrp,
etc.)?

Erez.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 11/26] tmpfs white-out support
  2007-08-01 15:13   ` Hugh Dickins
@ 2007-08-02  2:48     ` Matt Mackall
  0 siblings, 0 replies; 65+ messages in thread
From: Matt Mackall @ 2007-08-02  2:48 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jan Blunck, linux-fsdevel, linux-kernel, Bharata B Rao

On Wed, Aug 01, 2007 at 04:13:46PM +0100, Hugh Dickins wrote:
> On Mon, 30 Jul 2007, Jan Blunck wrote:
> 
> > Introduce white-out support to tmpfs.
> > 
> > Signed-off-by: Jan Blunck <jblunck@suse.de>
> > ---
> >  include/linux/shmem_fs.h |    1 
> >  mm/shmem.c               |   54 +++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 55 insertions(+)
> 
> I see there's debate about whether this (and its fellows) give the
> right semantic to whiteouts; and I've not begun to think about that.
> 
> But as a patch to tmpfs for what you're trying to do, it looks just
> about fine.  I say "just about" because the reference counting looks
> right, but I wouldn't dare say that it _is_ right without testing.
> 
> And I'd probably want to add a minor adjustment, so that a mount with
> nr_inodes=1000 could still support exactly 1000 inodes, despite your
> allocating one for the whiteout (usually never used) at mount time.
> But that can follow along later, no problem.

Also, you might want to make sure whiteouts work with ramfs, which
replaces tmpfs when tmpfs is disabled.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-08-01 18:44           ` Josef Sipek
  2007-08-01 19:10             ` Dave Kleikamp
@ 2007-08-02  5:24             ` Ph. Marek
  2007-08-02 12:12               ` Jan Blunck
  1 sibling, 1 reply; 65+ messages in thread
From: Ph. Marek @ 2007-08-02  5:24 UTC (permalink / raw)
  To: Josef Sipek
  Cc: Dave Kleikamp, Jan Blunck, linux-fsdevel, linux-kernel,
	Bharata B Rao, hch

On Mittwoch, 1. August 2007, Josef Sipek wrote:
> Alright not the greatest of examples, there is something to be said about
> symmetry, so...let me try again :)
...
> Oops! There's a whiteout in /b that hides the directory in /c -- rename(2)
> shouldn't make directory subtrees disappear.
>
> There are two ways to solve this:
>
> 1) "cp -r" the entire subtree ...
>
> 2) Don't store whiteouts within branches ...
Sorry for making uninformed guesses, but if there are already special nodes 
(whiteout), why not extending them to some more general format - specifying a 
(source, destination) pair at the topmost level?
- A delete is a (source, NULL) pair
- A rename is a (source, destination) pair, which causes lookups on source to
  use the string destination in the lower branches.


Would that work?


Regards,

Phil


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 00/26] VFS based Union Mount (V2)
  2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
                   ` (26 preceding siblings ...)
  2007-07-30 18:23 ` [RFC 00/26] VFS based Union Mount (V2) Al Boldi
@ 2007-08-02  6:49 ` Bharata B Rao
  2007-08-02 10:17   ` Jan Blunck
  27 siblings, 1 reply; 65+ messages in thread
From: Bharata B Rao @ 2007-08-02  6:49 UTC (permalink / raw)
  To: Jan Blunck; +Cc: linux-fsdevel, linux-kernel

On Mon, Jul 30, 2007 at 06:13:23PM +0200, Jan Blunck wrote:
> Here is another post of the VFS based union mount implementation. Unlike the
> traditional mount which hides the contents of the mount point, union mounts
> present the merged view of the mount point and the mounted filesytem.

Doesn't compile without CONFIG_DEBUG_UNION_MOUNT.

fs/namei.c: In function `hash_lookup_union':
fs/namei.c:1798: error: implicit declaration of function `UM_DEBUG_LOOKUP'
make[1]: *** [fs/namei.o] Error 1

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 00/26] VFS based Union Mount (V2)
  2007-08-02  6:49 ` Bharata B Rao
@ 2007-08-02 10:17   ` Jan Blunck
  0 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-08-02 10:17 UTC (permalink / raw)
  To: Bharata B Rao; +Cc: linux-fsdevel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 638 bytes --]

On Thu, Aug 02, Bharata B Rao wrote:

> On Mon, Jul 30, 2007 at 06:13:23PM +0200, Jan Blunck wrote:
> > Here is another post of the VFS based union mount implementation. Unlike the
> > traditional mount which hides the contents of the mount point, union mounts
> > present the merged view of the mount point and the mounted filesytem.
> 
> Doesn't compile without CONFIG_DEBUG_UNION_MOUNT.
> 
> fs/namei.c: In function `hash_lookup_union':
> fs/namei.c:1798: error: implicit declaration of function `UM_DEBUG_LOOKUP'
> make[1]: *** [fs/namei.o] Error 1

Umm, typo in the debug infrastruture patch. Here is the fixed version.

Thanks,
Jan

[-- Attachment #2: union-mount-debug-infrastructure.diff --]
[-- Type: text/x-patch, Size: 10830 bytes --]

Subject: union-mount: Debug Infrastructure

This adds debugfs/relay based debugging infrastructure helpful when doing
development of the union-mount code itself. The debgging output can be enabled
during runtime by:

 echo 1 > /proc/sys/fs/union-debug

This registers the relayfs files where the debug code is writing its output
to. There are different levels of debugging output available which can be ORed
together. For the valid sysctl values see include/linux/union_debug.h.

Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 include/linux/union_debug.h |   91 ++++++++++++++
 lib/Kconfig.debug           |    9 +
 lib/Makefile                |    2 
 lib/union_debug.c           |  268 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 370 insertions(+)

--- /dev/null
+++ b/include/linux/union_debug.h
@@ -0,0 +1,91 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
+ * Copyright (C) 2007 Novell Inc.
+ *   Author(s): Jan Blunck (j.blunck@tu-harburg.de)
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef __LINUX_UNION_DEBUG_H
+#define __LINUX_UNION_DEBUG_H
+
+#ifdef __KERNEL__
+
+#ifdef CONFIG_DEBUG_UNION_MOUNT
+
+#include <linux/sched.h>
+
+/* This is taken from klog debugging facility */
+extern void klog(const void *data, int len);
+extern void klog_printk(const char *fmt, ...);
+extern void klog_printk_dentry(const char *func, struct dentry *dentry);
+
+extern int sysctl_union_debug;
+
+#define UNION_MOUNT_DEBUG		1
+#define UNION_MOUNT_DEBUG_DCACHE	2
+#define UNION_MOUNT_DEBUG_LOCK		4
+#define UNION_MOUNT_DEBUG_READDIR	8
+#define UNION_MOUNT_DEBUG_LOOKUP	16
+
+#define UM_DEBUG(fmt, args...)						\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG)			\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);		\
+} while (0)
+#define UM_DEBUG_DENTRY(dentry)						\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG)			\
+		klog_printk_dentry(__FUNCTION__, (dentry));		\
+} while (0)
+#define UM_DEBUG_DCACHE(fmt, args...)					\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_DCACHE)		\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);		\
+} while (0)
+#define UM_DEBUG_DCACHE_DENTRY(dentry)					\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_DCACHE)		\
+		klog_printk_dentry(__FUNCTION__, (dentry));		\
+} while (0)
+#define UM_DEBUG_LOCK(fmt, args...)					\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_LOCK)		\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);		\
+} while (0)
+#define UM_DEBUG_READDIR(fmt, args...)					\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_READDIR)		\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);		\
+} while (0)
+#define UM_DEBUG_LOOKUP(fmt, args...)					\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_LOOKUP)		\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);		\
+} while (0)
+#define UM_DEBUG_LOOKUP_DENTRY(dentry)					\
+do {									\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_LOOKUP)		\
+		klog_printk_dentry(__FUNCTION__, (dentry));		\
+} while (0)
+
+#else	/* CONFIG_DEBUG_UNION_MOUNT */
+
+#define UM_DEBUG(fmt, args...)			do { /* empty */ } while (0)
+#define UM_DEBUG_DENTRY(fmt, args...)		do { /* empty */ } while (0)
+#define UM_DEBUG_DCACHE(fmt, args...)		do { /* empty */ } while (0)
+#define UM_DEBUG_DCACHE_DENTRY(fmt, args...)	do { /* empty */ } while (0)
+#define UM_DEBUG_LOCK(fmt, args...)		do { /* empty */ } while (0)
+#define UM_DEBUG_READDIR(fmt, args...)		do { /* empty */ } while (0)
+#define UM_DEBUG_LOOKUP(fmt, args...)		do { /* empty */ } while (0)
+#define UM_DEBUG_LOOKUP_DENTRY(fmt, args...)	do { /* empty */ } while (0)
+
+#endif	/* CONFIG_DEBUG_UNION_MOUNT */
+
+#endif	/* __KERNEL__ */
+#endif	/*  __LINUX_UNION_DEBUG_H */
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -393,6 +393,15 @@ config DEBUG_LIST
 
 	  If unsure, say N.
 
+config DEBUG_UNION_MOUNT
+	bool "Debug VFS based union mounts"
+	depends on DEBUG_KERNEL && UNION_MOUNT
+	select DEBUG_FS
+	default n
+	help
+	  If you say Y here, the union mount debugging code will be
+	  compiled in.
+
 config FRAME_POINTER
 	bool "Compile the kernel with frame pointers"
 	depends on DEBUG_KERNEL && (X86 || CRIS || M68K || M68KNOMMU || FRV || UML || S390 || AVR32 || SUPERH || BFIN)
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -35,6 +35,8 @@ obj-$(CONFIG_PLIST) += plist.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
 
+obj-$(CONFIG_DEBUG_UNION_MOUNT) += union_debug.o
+
 ifneq ($(CONFIG_HAVE_DEC_LOCK),y)
   lib-y += dec_and_lock.o
 endif
--- /dev/null
+++ b/lib/union_debug.c
@@ -0,0 +1,268 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2007 SUSE Linux
+ *   Author(s): Jan Blunck (jblunck@suse.de)
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/sysctl.h>
+#include <linux/init.h>
+#include <linux/relay.h>
+#include <linux/debugfs.h>
+
+int sysctl_union_debug;
+EXPORT_SYMBOL_GPL(sysctl_union_debug);
+
+static struct rchan *debug_rchan;
+static struct dentry *debug_logdir;
+#define SUBBUF_SIZE 262144
+#define N_SUBBUF 4
+
+static struct dentry *create_buf_file(const char *filename,
+				      struct dentry *parent, int mode,
+				      struct rchan_buf *buf, int *is_global)
+{
+	return debugfs_create_file(filename, mode, parent, buf,
+				   &relay_file_operations);
+}
+
+static int remove_buf_file(struct dentry *dentry)
+{
+	debugfs_remove(dentry);
+	return 0;
+}
+
+static int subbuf_start(struct rchan_buf *buf, void *subbuf, void *prev_subbuf,
+			unsigned int prev_padding)
+{
+	return 1;
+}
+
+static struct rchan_callbacks debug_relay_cb = {
+	.create_buf_file = create_buf_file,
+	.remove_buf_file = remove_buf_file,
+	.subbuf_start = subbuf_start,
+};
+
+static int union_debug_relay_init(void)
+{
+	struct dentry *dentry;
+	struct rchan *rchan;
+
+	if (!debug_logdir) {
+		dentry = debugfs_create_dir("union", NULL);
+		if (IS_ERR(dentry)) {
+			printk(KERN_INFO
+			       "%s: debugfs directory creation failed\n",
+			       __FUNCTION__);
+			return PTR_ERR(dentry);
+		}
+
+		debug_logdir = dentry;
+	}
+
+	if (!debug_rchan) {
+		rchan = relay_open("logfile", debug_logdir,
+				   SUBBUF_SIZE, N_SUBBUF,
+				   &debug_relay_cb, NULL);
+		if (!rchan) {
+			printk(KERN_INFO "%s: relay channel creation failed\n",
+			       __FUNCTION__);
+			debugfs_remove(debug_logdir);
+			return -ENOMEM;
+		}
+
+		debug_rchan = rchan;
+	}
+
+	return 0;
+}
+
+static void union_debug_relay_exit(void)
+{
+	if (debug_rchan)
+		relay_close(debug_rchan);
+	debug_rchan = NULL;
+	if (debug_logdir)
+		debugfs_remove(debug_logdir);
+	debug_logdir = NULL;
+}
+
+/*
+ * klog operations
+ */
+struct klog_operations
+{
+	/*
+	 * klog - called when klog called, same params
+	 */
+	void (*klog) (const void *data, int len);
+};
+
+/* maximum size of klog formatting buffer beyond which truncation will occur */
+#define KLOG_TMPBUF_SIZE (1024)
+/* per-cpu klog formatting temporary buffer */
+static char klog_buf[NR_CPUS][KLOG_TMPBUF_SIZE];
+
+/*
+ * do-nothing default klog handler, called if nothing registered
+ */
+static void default_klog(const void *data, int len)
+{
+}
+
+/*
+ * default klog operations, used if nothing registered
+ */
+static struct klog_operations default_klog_ops =
+{
+	.klog = default_klog,
+};
+
+static struct klog_operations *cur_klog_ops = &default_klog_ops;
+
+/**
+ *      register_klog_handler - register klog handler
+ *      @klog_ops: klog operations callbacks
+ *
+ *      replaces default klog handler with passed-in version
+ */
+int register_klog_handler(struct klog_operations *klog_ops)
+{
+	if (!klog_ops)
+		return -EINVAL;
+
+	if (!klog_ops->klog)
+		klog_ops->klog = default_klog;
+
+	cur_klog_ops = klog_ops;
+	return 0;
+}
+
+/**
+ *      unregister_klog_handler - unregister klog handler
+ *
+ *      default handler will be in effect after this
+ */
+void unregister_klog_handler(void)
+{
+	cur_klog_ops = &default_klog_ops;
+}
+
+/**
+ *      klog - send raw data to klog handler
+ */
+void klog(const void *data, int len)
+{
+	cur_klog_ops->klog(data, len);
+}
+
+/**
+ *      klog_printk - send a formatted string to the klog handler
+ *      @fmt: format string, same as printk
+ */
+void klog_printk(const char *fmt, ...)
+{
+	va_list args;
+	int len;
+	char *cbuf;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	cbuf = klog_buf[smp_processor_id()];
+	va_start(args, fmt);
+	len = vsnprintf(cbuf, KLOG_TMPBUF_SIZE, fmt, args);
+	va_end(args);
+	klog(cbuf, len);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(klog_printk);
+
+void klog_printk_dentry(const char *func, struct dentry *dentry)
+{
+	klog_printk("%s: %p{i=%p/%lx,c=%d,n=\"%s\"}\n",
+		    func,
+		    dentry,
+		    dentry->d_inode,
+		    dentry->d_inode ?
+		    dentry->d_inode->i_ino : 0UL,
+		    atomic_read(&dentry->d_count),
+		    dentry->d_name.name);
+}
+EXPORT_SYMBOL_GPL(klog_printk_dentry);
+
+static void log_data(const void *data, int len)
+{
+	relay_write(debug_rchan, data, len);
+}
+
+static struct klog_operations klog_handler =
+{
+	.klog = log_data,
+};
+
+static int union_debug_sysctl_handler(ctl_table *table, int write,
+				      struct file *file,
+				      void __user *buffer, size_t *length,
+				      loff_t *ppos)
+{
+	proc_dointvec_minmax(table, write, file, buffer, length, ppos);
+
+	if (!write)
+		return 0;
+
+	printk(KERN_INFO "sysctl.fs.union-debug: %d\n", sysctl_union_debug);
+
+	switch (sysctl_union_debug) {
+	case 0:
+		unregister_klog_handler();
+		union_debug_relay_exit();
+		break;
+	default:
+		union_debug_relay_init();
+		if (register_klog_handler(&klog_handler))
+			union_debug_relay_exit();
+		break;
+	}
+
+	return 0;
+}
+
+static ctl_table union_table[] = {
+	{
+		.ctl_name = CTL_UNNUMBERED,
+		.procname = "union-debug",
+		.data = &sysctl_union_debug,
+		.maxlen = sizeof(int),
+		.mode = 0644,
+		.proc_handler = &union_debug_sysctl_handler,
+	},
+	{.ctl_name = 0}
+};
+
+static ctl_table fs_root[] = {
+	{
+		.ctl_name = CTL_FS,
+		.procname = "fs",
+		.maxlen = 0,
+		.mode = 0555,
+		.child = union_table,
+	},
+	{.ctl_name = 0}
+};
+
+static struct ctl_table_header *sysctl_header;
+
+static int union_debug_init(void)
+{
+	sysctl_header = register_sysctl_table(fs_root);
+	return 0;
+}
+
+late_initcall(union_debug_init);

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-31 17:11       ` Josef Sipek
  2007-08-01 15:23         ` Dave Kleikamp
@ 2007-08-02 10:26         ` Jan Blunck
  1 sibling, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-08-02 10:26 UTC (permalink / raw)
  To: Josef Sipek; +Cc: linux-fsdevel, linux-kernel, Bharata B Rao

On Tue, Jul 31, Josef Sipek wrote:

> > So you think that just because you mounted the filesystem somewhere else it
> > should look different? This is what sharing is all about. If you share a
> > filesystem you also share the removal of objects.
> 
> The removal happens at the union level, not the branch level. Say you have:
> 
> /a/
> /b/foo
> /c/foo
> 
> And you mount /u1 as a union of {a,b}, and /u2 as union of {a,c}.
> 
> $ find /u*
> /u1
> /u1/foo
> /u2
> /u2/foo
> $ rm /u1/foo # this creates whiteout for "foo" in /a
> $ find /u*
> /u1
> /u2
> 
> Is that what you'd expect as a user? I don't think so.
> 

Yes, although that might sound strange: you are sharing the topmost writable
layer. This is what I expect.

> > > store. We did an experiment with Unionfs, and moving the whiteout handling
> > > to effectively a "library" that did all the dirty work cleaned up the code
> > > considerably [2,3].
> > 
> > Haven't checked if you could use ODF for a generic store for filesystems that
> > couldn't support whiteouts. This might be an interesting idea.
>  
> Yes, since the ODF is completely separate, you can use _any_ filesystem and
> regardless of whether or not they support whiteouts.

Completely separate? It is totally tied to UnionFS and tries to work out
purely the problems that this kind of VFS emulating filesystems have.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-08-01 19:33               ` Josef Sipek
  2007-08-01 19:52                 ` Dave Kleikamp
@ 2007-08-02 11:55                 ` Jan Blunck
  2007-08-02 17:50                 ` Jörn Engel
  2 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-08-02 11:55 UTC (permalink / raw)
  To: Josef Sipek
  Cc: Dave Kleikamp, linux-fsdevel, linux-kernel, Bharata B Rao, hch

On Wed, Aug 01, Josef Sipek wrote:

> This brings up an very interesting (but painful) question...which makes more
> sense? Allowing the modifications in only the top-most branch, or any branch
> (given the user allows it at mount-time)?

My implementation is keeping things simple because of reason. There have been
many attempts to get unioning working on the filesystem layer. Most of them
failed because of complexity. E.g. BSD throwed away all of the filesystem
stacking support after they tried to fix unionfs for years. Writing to lower
layers is making things unnecessary complex. Therefore I left it out.

> > > 1) "cp -r" the entire subtree being renamed to highest-priority branch, and
> > > rename there (you might have to recreate a series of directories to have a
> > > place to "cp" to...so you got "cp -r" _AND_ "mkdir -p"-like code in the VFS!
> > > 1/2 a :) )
> > 
> > I think this is the only alternative, given the design.
>  
> Right. Doing something like this at the filesystem level (as we do in
> unionfs) seems less painful - filesystems are places full of all sorts of
> nefarious activities to begin with. Having it in the VFS seems...even
> uglier.

The userspace is doing it since I return -EXDEV. And that even comes for
free. I don't need to hack around and call back into VFS as you do. It is so
simple and straightforward in the VFS.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-08-01 22:06                   ` Erez Zadok
@ 2007-08-02 12:05                     ` Jan Blunck
  0 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-08-02 12:05 UTC (permalink / raw)
  To: Erez Zadok
  Cc: Dave Kleikamp, Josef Sipek, linux-fsdevel, linux-kernel,
	Bharata B Rao, hch

On Wed, Aug 01, Erez Zadok wrote:

> There are three other reasons why Unionfs and our users like to have
> multiple writable branches:
> 

...

>    And yes, it does make our implementation more complex.

And error-prone and unflexible wrt to changes. When XIP was introduced,
unionfs crashed all over this changes. I don't know if this has changed
yet. Not speaking of other issues like calling back into VFS (stack usage),
locking problems and so on.

> 3. Some people use Unionfs in the scenario described in point #2 above, as a
>    poor man's space- and load- distribution system.  Some of our users like
>    the idea of controlling how much storage space they give each branch, and
>    how much it might grow, and even how much CPU or I/O load might be placed
>    on each of the lower filesystems which serve a given branch.  That way
>    they worry less about the top-layer's space filling up more quickly than
>    expected.  Now Unionfs was never designed to be a load-balancing f/s (we
>    have RAIF for that, see <http://www.filesystems.org/project-raif.html>),
>    but users seems to always find creative ways to [ab]use one's software in
>    ways one never thought of. :-)

And this has nothing to do with unioning ...

> BTW, does Union Mounts copyup on meta-data changes (e.g., chmod, chgrp,
> etc.)?

No. But it was proposed during on of the last postings.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-08-02  5:24             ` Ph. Marek
@ 2007-08-02 12:12               ` Jan Blunck
  0 siblings, 0 replies; 65+ messages in thread
From: Jan Blunck @ 2007-08-02 12:12 UTC (permalink / raw)
  To: Ph. Marek
  Cc: Josef Sipek, Dave Kleikamp, linux-fsdevel, linux-kernel,
	Bharata B Rao, hch

On Thu, Aug 02, Ph. Marek wrote:

> On Mittwoch, 1. August 2007, Josef Sipek wrote:
> > Alright not the greatest of examples, there is something to be said about
> > symmetry, so...let me try again :)
> ...
> > Oops! There's a whiteout in /b that hides the directory in /c -- rename(2)
> > shouldn't make directory subtrees disappear.
> >
> > There are two ways to solve this:
> >
> > 1) "cp -r" the entire subtree ...
> >
> > 2) Don't store whiteouts within branches ...
> Sorry for making uninformed guesses, but if there are already special nodes 
> (whiteout), why not extending them to some more general format - specifying a 
> (source, destination) pair at the topmost level?
> - A delete is a (source, NULL) pair
> - A rename is a (source, destination) pair, which causes lookups on source to
>   use the string destination in the lower branches.

Originally I had the idea that whiteouts are a special kind of symlink. After
discussing that with various people sticked to the simplest approach.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-08-01 19:33               ` Josef Sipek
  2007-08-01 19:52                 ` Dave Kleikamp
  2007-08-02 11:55                 ` Jan Blunck
@ 2007-08-02 17:50                 ` Jörn Engel
  2007-08-02 18:15                   ` Jeremy Maitin-Shepard
  2 siblings, 1 reply; 65+ messages in thread
From: Jörn Engel @ 2007-08-02 17:50 UTC (permalink / raw)
  To: Josef Sipek
  Cc: Dave Kleikamp, Jan Blunck, linux-fsdevel, linux-kernel,
	Bharata B Rao, hch

On Wed, 1 August 2007 15:33:30 -0400, Josef Sipek wrote:
>  
> This brings up an very interesting (but painful) question...which makes more
> sense? Allowing the modifications in only the top-most branch, or any branch
> (given the user allows it at mount-time)?
> 
> This is really question to the community at large, not just you, Dave :)

Only write to top-most layer.

There are two reasons for this.  First it allows users to create a union
mount, test something (e.g. update the distribution) and remove every
trace from the test by umounting the top-most layer.  Such a thing can
be quite valuable.

The second reason is simplicity.  I personally couldn't even start to
describe the semantics.  If the user does a rename, which layer will the
change end up in?  What if source or target exist in multiple layers?
How to rename a directory in a lower layer containing a new file in an
upper layer?

Finding new and interesting corner cases for such a beast can be quite
entertaining.  And until someone has properly documented the semantics
for _all_ the corner cases, my enthusiasm is below freezing point.  Does
such a documentation exist?

Jörn

-- 
A surrounded army must be given a way out.
-- Sun Tzu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-08-02 17:50                 ` Jörn Engel
@ 2007-08-02 18:15                   ` Jeremy Maitin-Shepard
  0 siblings, 0 replies; 65+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-08-02 18:15 UTC (permalink / raw)
  To: linux-kernel

Jörn Engel <joern@logfs.org> writes:

> On Wed, 1 August 2007 15:33:30 -0400, Josef Sipek wrote:
>> 
>> This brings up an very interesting (but painful) question...which makes more
>> sense? Allowing the modifications in only the top-most branch, or any branch
>> (given the user allows it at mount-time)?
>> 
>> This is really question to the community at large, not just you, Dave :)

> Only write to top-most layer.

> There are two reasons for this.  First it allows users to create a union
> mount, test something (e.g. update the distribution) and remove every
> trace from the test by umounting the top-most layer.  Such a thing can
> be quite valuable.

Josef did specifically state that modification to the lower layers would
be allowed only if a special mount flag is given.

> The second reason is simplicity.  I personally couldn't even start to
> describe the semantics.  If the user does a rename, which layer will the
> change end up in?  What if source or target exist in multiple layers?
> How to rename a directory in a lower layer containing a new file in an
> upper layer?

> Finding new and interesting corner cases for such a beast can be quite
> entertaining.  And until someone has properly documented the semantics
> for _all_ the corner cases, my enthusiasm is below freezing point.  Does
> such a documentation exist?

I think that if someone can come up with consistent (and useful)
semantics for a mount option that allows modifications to other layers
as well, it would be a useful additional feature to support.  It seems
that it should be possible to add this feature at a later time in any
case.

Perhaps referring to the plan9 semantics could be helpful.

-- 
Jeremy Maitin-Shepard

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 12/26] ext2 white-out support
  2007-07-31 10:53       ` Theodore Tso
@ 2007-08-02 19:31         ` Pavel Machek
  0 siblings, 0 replies; 65+ messages in thread
From: Pavel Machek @ 2007-08-02 19:31 UTC (permalink / raw)
  To: Theodore Tso, Jan Blunck, linux-fsdevel, linux-kernel, Bharata B Rao

Hi!

> > > I wouldn't bother with setting the directory type field to be DT_WHT,
> > > given that they will never be returned to userspace anyway.
> > 
> > At the moment I still rely on this for the current readdir implementation.
> > Viro already said that he doesn't want to see this (the readdir changes) in
> > the kernel but in userspace.
> 
> Life gets very messy if you have to do this in userspace.  Example:
> statically linked programs that were compiled with a version of glibc
> that didn't know about whiteout records.  Unfortunately, the memory

WEll, also if root deletes something, it should be _gone_, and user
should not be able to work around that just by bringing statically
linked ls..
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 16/26] union-mount: Introduce union_mount structure
  2007-07-30 16:13 ` [RFC 16/26] union-mount: Introduce union_mount structure Jan Blunck
@ 2007-08-06  5:57   ` Bharata B Rao
  0 siblings, 0 replies; 65+ messages in thread
From: Bharata B Rao @ 2007-08-06  5:57 UTC (permalink / raw)
  To: Jan Blunck; +Cc: linux-fsdevel, linux-kernel

On Mon, Jul 30, 2007 at 06:13:39PM +0200, Jan Blunck wrote:
> +
> +int append_to_union(struct vfsmount *mnt, struct dentry *dentry,
> +		    struct vfsmount *dest_mnt, struct dentry *dest_dentry)
> +{
> +	struct union_mount *this, *um;
> +
> +	BUG_ON(!IS_MNT_UNION(mnt));
> +
> +	this = union_alloc(dentry, mnt, dest_dentry, dest_mnt);
> +	if (!this)
> +		return -ENOMEM;
> +
> +	spin_lock(&union_lock);
> +	um = union_lookup(dentry, mnt);
> +	if (um) {
> +		BUG_ON((um->u_next.dentry != dest_dentry) ||
> +		       (um->u_next.mnt != dest_mnt));
> +		spin_unlock(&union_lock);
> +		union_put(this);
> +		return 0;
> +	}
> +	__union_hash(this);
> +	spin_unlock(&union_lock);
> +	return 0;
> +}

This breaks if we append to union stack from outside of the union.
A particular case I hit is with a 3 layer union with a subdir union
between topmost and bottom layer. Now if you create the same-named
directory in the middle layer from outside of this union, you hit the
above BUG_ON. The below patch fixes this and it applies on top of all of
your patches.

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Direct additions to union stack from outside of the union is resulting in
BUG_ON. But this is a valid case and hence needs to be supported. Modify
append_to_union() to correctly handle this case.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 fs/namei.c            |    8 +++++---
 fs/union.c            |   32 ++++++++++++++++++++++++--------
 include/linux/union.h |    4 ++--
 3 files changed, 31 insertions(+), 13 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -512,7 +512,7 @@ static int __cache_lookup_union(struct n
 		}
 
 		/* now we know we found something "real"  */
-		append_to_union(last.mnt, last.dentry, nd->mnt, dentry);
+		append_to_union(last.mnt, last.dentry, nd->mnt, dentry, 1);
 
 		if (last.dentry != path->dentry)
 			pathput(&last);
@@ -789,7 +789,8 @@ static int __real_lookup_union(struct na
 		}
 
 		/* now we know we found something "real" */
-		append_to_union(last.mnt, last.dentry, next.mnt, next.dentry);
+		append_to_union(last.mnt, last.dentry,
+				next.mnt, next.dentry, 1);
 
 		if (last.dentry != path->dentry)
 			pathput(&last);
@@ -1775,7 +1776,8 @@ static int __hash_lookup_union(struct na
 		}
 
 		/* now we know we found something "real" */
-		append_to_union(last.mnt, last.dentry, next.mnt, next.dentry);
+		append_to_union(last.mnt, last.dentry,
+				next.mnt, next.dentry, 1);
 
 		if (last.dentry != path->dentry)
 			pathput(&last);
--- a/fs/union.c
+++ b/fs/union.c
@@ -248,7 +248,8 @@ int is_unionized(struct dentry *dentry, 
 }
 
 int append_to_union(struct vfsmount *mnt, struct dentry *dentry,
-		    struct vfsmount *dest_mnt, struct dentry *dest_dentry)
+		    struct vfsmount *dest_mnt, struct dentry *dest_dentry,
+		    int from_lookup)
 {
 	struct union_mount *this, *um;
 
@@ -264,11 +265,26 @@ int append_to_union(struct vfsmount *mnt
 	spin_lock(&union_lock);
 	um = union_lookup(dentry, mnt);
 	if (um) {
-		BUG_ON((um->u_next.dentry != dest_dentry) ||
-		       (um->u_next.mnt != dest_mnt));
-		spin_unlock(&union_lock);
-		union_put(this);
-		return 0;
+		if (um->u_next.dentry == dest_dentry &&
+				um->u_next.mnt == dest_mnt) {
+			spin_unlock(&union_lock);
+			union_put(this);
+			return 0;
+		}
+		if (from_lookup) {
+			__union_unhash(um);
+			list_del(&um->u_list);
+			list_del(&um->u_unions);
+			um->u_next.dentry->d_unionized--;
+			spin_unlock(&union_lock);
+			union_put(um);
+			spin_lock(&union_lock);
+		} else {
+			BUG();
+			spin_unlock(&union_lock);
+			union_put(this);
+			return 0;
+		}
 	}
 	list_add(&this->u_list, &mnt->mnt_unions);
 	list_add(&this->u_unions, &dentry->d_unions);
@@ -451,7 +467,7 @@ int attach_mnt_union(struct vfsmount *mn
 	if (!IS_MNT_UNION(mnt))
 		return 0;
 
-	return append_to_union(mnt, mnt->mnt_root, dest_mnt, dest_dentry);
+	return append_to_union(mnt, mnt->mnt_root, dest_mnt, dest_dentry, 0);
 }
 
 void detach_mnt_union(struct vfsmount *mnt)
@@ -941,7 +957,7 @@ struct dentry *union_create_topmost(stru
 		UM_DEBUG_DENTRY(dentry);
 
 		res = append_to_union(nd->mnt, dentry, path->mnt,
-				      path->dentry);
+				      path->dentry, 0);
 		if (res) {
 			dput(dentry);
 			dentry = ERR_PTR(res);
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -44,7 +44,7 @@ struct union_mount {
 
 extern int is_unionized(struct dentry *, struct vfsmount *);
 extern int append_to_union(struct vfsmount *, struct dentry *,
-			   struct vfsmount *, struct dentry *);
+			   struct vfsmount *, struct dentry *, int);
 extern int follow_union_down(struct vfsmount **, struct dentry **);
 extern int follow_union_mount(struct vfsmount **, struct dentry **);
 extern void __d_drop_unions(struct dentry *);
@@ -65,7 +65,7 @@ extern int union_copyup(struct nameidata
 #define IS_UNION(x)			(0)
 #define IS_MNT_UNION(x)			(0)
 #define is_unionized(x, y)		(0)
-#define append_to_union(x1, y1, x2, y2)	({ BUG(); (0); })
+#define append_to_union(x1, y1, x2, y2, z)	({ BUG(); (0); })
 #define follow_union_down(x, y)		({ (0); })
 #define follow_union_mount(x, y)	({ (0); })
 #define __d_drop_unions(x)		do { } while (0)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 20/26] union-mount: Simple union-mount readdir implementation
  2007-07-30 16:13 ` [RFC 20/26] union-mount: Simple union-mount readdir implementation Jan Blunck
@ 2007-08-06 11:08   ` Bharata B Rao
  0 siblings, 0 replies; 65+ messages in thread
From: Bharata B Rao @ 2007-08-06 11:08 UTC (permalink / raw)
  To: Jan Blunck; +Cc: linux-fsdevel, linux-kernel

On Mon, Jul 30, 2007 at 06:13:43PM +0200, Jan Blunck wrote:
> --- a/include/linux/union.h
> +++ b/include/linux/union.h
> @@ -53,6 +53,7 @@ extern void __shrink_d_unions(struct den
>  extern int attach_mnt_union(struct vfsmount *, struct vfsmount *,
>  			    struct dentry *);
>  extern void detach_mnt_union(struct vfsmount *);
> +extern int readdir_union(struct file *, void *, filldir_t);
> 
>  #else /* CONFIG_UNION_MOUNT */
> 
> @@ -69,5 +70,29 @@ extern void detach_mnt_union(struct vfsm
>  #define detach_mnt_union(x)		do { } while (0)
> 
>  #endif	/* CONFIG_UNION_MOUNT */
> +
> +static inline int do_readdir(struct file *file, void *buf, filldir_t filler)
> +{
> +	struct inode *inode = file->f_path.dentry->d_inode;
> +	int res;
> +
> +#ifdef CONFIG_UNION_MOUNT
> +	if (IS_MNT_UNION(file->f_path.mnt))
> +		res = readdir_union(file, buf, filler);
> +	else
> +#endif
> +	{
> +		mutex_lock(&inode->i_mutex);
> +		res = -ENOENT;
> +		if (!IS_DEADDIR(inode)) {
> +			res = file->f_op->readdir(file, buf, filler);
> +			file_accessed(file);
> +		}
> +		mutex_unlock(&inode->i_mutex);
> +	}
> +
> +	return res;
> +}

Here you are doing readdir_union for all the directories under a union
mount point, which is an overhead (building the readdir cache). Here is
the fix:

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Within a union mount point, there can be directories which don't have a
union stack underneath them. And readdir() doesn't have to maintain a cache
of dirents for such directories. But the current patch maintains the cache
for such directories. Fix this.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 fs/union.c            |   17 +++++++++++++++++
 include/linux/union.h |   11 ++++++-----
 2 files changed, 23 insertions(+), 5 deletions(-)

--- a/fs/union.c
+++ b/fs/union.c
@@ -366,6 +366,23 @@ int follow_union_mount(struct vfsmount *
 }
 
 /*
+ * is_dir_unioned - check if the directory represented by @mnt and @dentry
+ * has a union stack underneath.
+ *
+ * Returns true if a union stack exists at this directory level, else returns
+ * false.
+ */
+int is_dir_unioned(struct vfsmount *mnt, struct dentry *dentry)
+{
+	struct union_mount *um;
+
+	spin_lock(&union_lock);
+	um = union_lookup(dentry, mnt);
+	spin_unlock(&union_lock);
+	return um ? 1: 0;
+}
+
+/*
  * This must be called when unhashing a dentry. This is called with dcache_lock
  * and unhashes all unions this dentry is in.
  */
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -54,6 +54,7 @@ extern int attach_mnt_union(struct vfsmo
 			    struct dentry *);
 extern void detach_mnt_union(struct vfsmount *);
 extern int readdir_union(struct file *, void *, filldir_t);
+extern int is_dir_unioned(struct vfsmount *, struct dentry *);
 extern int union_relookup_topmost(struct nameidata *, int);
 extern struct dentry *union_create_topmost(struct nameidata *, struct qstr *,
 					   struct path *);
@@ -82,13 +83,14 @@ extern int union_copyup(struct nameidata
 
 static inline int do_readdir(struct file *file, void *buf, filldir_t filler)
 {
-	struct inode *inode = file->f_path.dentry->d_inode;
 	int res;
+	struct inode *inode = file->f_path.dentry->d_inode;
 
 #ifdef CONFIG_UNION_MOUNT
-	if (IS_MNT_UNION(file->f_path.mnt))
-		res = readdir_union(file, buf, filler);
-	else
+	if (IS_MNT_UNION(file->f_path.mnt)) {
+		if (is_dir_unioned(file->f_path.mnt, file->f_path.dentry))
+			res = readdir_union(file, buf, filler);
+	} else
 #endif
 	{
 		mutex_lock(&inode->i_mutex);
@@ -99,7 +101,6 @@ static inline int do_readdir(struct file
 		}
 		mutex_unlock(&inode->i_mutex);
 	}
-
 	return res;
 }
 
Regards,
Bharata.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 18/26] union-mount: Changes to the namespace handling
  2007-07-30 16:13 ` [RFC 18/26] union-mount: Changes to the namespace handling Jan Blunck
@ 2007-08-08 10:10   ` Bharata B Rao
  0 siblings, 0 replies; 65+ messages in thread
From: Bharata B Rao @ 2007-08-08 10:10 UTC (permalink / raw)
  To: Jan Blunck; +Cc: linux-fsdevel, linux-kernel

On Mon, Jul 30, 2007 at 06:13:41PM +0200, Jan Blunck wrote:
> Creates the proper struct union_mount when mounting something into a
> union. If the topmost filesystem isn't capable of handling the white-out
> filetype it could only be mount read-only.
>

Jan,

I think it is important to allow pivot_root of union mount points. Here
is an attempt to achieve that.

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Allow pivot_root to  work with union mount points.

If the current root filesystem is a union, then allow pivot_root
only if it's last component is a root, which allows it to be detached as a
complete union. Similarly if the new root filesystem is a union,
it's last component should be a root, so that it can be completely detached
from it's current mount point as a union and mounted back as root filesystem.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 fs/namespace.c        |   17 +++++++++++++++--
 fs/union.c            |   12 ++++++++++++
 include/linux/union.h |    2 ++
 3 files changed, 29 insertions(+), 2 deletions(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -171,6 +171,13 @@ static void detach_mnt(struct vfsmount *
 	old_nd->dentry->d_mounted--;
 }
 
+static void detach_last_mnt(struct nameidata *nd, struct nameidata *old_nd)
+{
+	if (IS_MNT_UNION(nd->mnt))
+		while (follow_union_down(&nd->mnt, &nd->dentry));
+	detach_mnt(nd->mnt, old_nd);
+}
+
 void mnt_set_mountpoint(struct vfsmount *mnt, struct dentry *dentry,
 			struct vfsmount *child_mnt)
 {
@@ -1878,6 +1885,9 @@ asmlinkage long sys_pivot_root(const cha
 	if (!check_mnt(new_nd.mnt))
 		goto out1;
 
+	if (IS_MNT_UNION(new_nd.mnt) && !last_union_is_root(&new_nd))
+		goto out1;
+
 	error = __user_walk(put_old, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &old_nd);
 	if (error)
 		goto out1;
@@ -1901,6 +1911,9 @@ asmlinkage long sys_pivot_root(const cha
 		goto out2;
 	if (!check_mnt(user_nd.mnt))
 		goto out2;
+	if (IS_MNT_UNION(user_nd.mnt) && !last_union_is_root(&user_nd))
+		goto out2;
+
 	error = -ENOENT;
 	if (IS_DEADDIR(new_nd.dentry->d_inode))
 		goto out2;
@@ -1934,8 +1947,8 @@ asmlinkage long sys_pivot_root(const cha
 			goto out3;
 	} else if (!is_subdir(old_nd.dentry, new_nd.dentry))
 		goto out3;
-	detach_mnt(new_nd.mnt, &parent_nd);
-	detach_mnt(user_nd.mnt, &root_parent);
+	detach_last_mnt(&new_nd, &parent_nd);
+	detach_last_mnt(&user_nd, &root_parent);
 	attach_mnt(user_nd.mnt, &old_nd);     /* mount old root on put_old */
 	attach_mnt(new_nd.mnt, &root_parent); /* mount new_root on / */
 	touch_mnt_namespace(current->nsproxy->mnt_ns);
--- a/fs/union.c
+++ b/fs/union.c
@@ -507,6 +507,18 @@ void detach_mnt_union(struct vfsmount *m
 	return;
 }
 
+/*
+ * last_union_is_root - Check if the last component of the union stack
+ * is a root.
+ */
+int last_union_is_root(struct nameidata *nd)
+{
+	struct vfsmount *mnt = nd->mnt;
+	struct dentry *dentry = nd->dentry;
+
+	while (follow_union_down(&mnt, &dentry));
+	return IS_ROOT(dentry) ? 1: 0;
+}
 
 /*
  * Union mounts support for readdir.
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -53,6 +53,7 @@ extern void __shrink_d_unions(struct den
 extern int attach_mnt_union(struct vfsmount *, struct vfsmount *,
 			    struct dentry *);
 extern void detach_mnt_union(struct vfsmount *);
+extern int last_union_is_root(struct nameidata *nd);
 extern int readdir_union(struct file *, void *, filldir_t);
 extern int is_dir_unioned(struct path *);
 extern int union_relookup_topmost(struct nameidata *, int);
@@ -74,6 +75,7 @@ extern int union_copyup(struct nameidata
 #define __shrink_d_unions(x)		do { } while (0)
 #define attach_mnt_union(x, y, z)	do { } while (0)
 #define detach_mnt_union(x)		do { } while (0)
+#define last_union_is_root(x)		({ (0); })
 #define union_relookup_topmost(x, y)	({ BUG(); (0); })
 #define union_create_topmost(x, y, z)	({ BUG(); (NULL); })
 #define __union_copyup(x, y, z)		({ BUG(); (0); })

This applies on top of your patchset plus my subsequent fixes [1] and
[2].

[1] http://lkml.org/lkml/2007/8/6/10.
[2] http://lkml.org/lkml/2007/8/6/114.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC 19/26] union-mount: Make lookup work for union-mounted file systems
  2007-07-30 16:13 ` [RFC 19/26] union-mount: Make lookup work for union-mounted file systems Jan Blunck
@ 2007-08-09  5:42   ` Bharata B Rao
  0 siblings, 0 replies; 65+ messages in thread
From: Bharata B Rao @ 2007-08-09  5:42 UTC (permalink / raw)
  To: Jan Blunck; +Cc: linux-fsdevel, linux-kernel

On Mon, Jul 30, 2007 at 06:13:42PM +0200, Jan Blunck wrote:
> On union-mounted file systems the lookup function must also visit lower layers
> of the union-stack when doing a lookup. This patches add support for
> union-mounts to cached lookups and real lookups.
> 
> We have 3 different styles of lookup functions now:
> - multiple pathname components, follow mounts, follow union, follow symlinks
> - single pathname component, doesn't follow mounts, follow union, doesn't
>   follow symlinks
> - single pathname component doesn't follow mounts, doesn't follow unions,
>   doesn't follow symlinks
> 
<snip>
> +static int hash_lookup_union(struct nameidata *nd, struct qstr *name,
> +			     struct path *path)
> +{

Jan,

Looks like there is a lot of code duplication b/n lookup_hash versions and
real_lookup versions for union mounts. Is there a reason for doing it
this way? I believe that with a little effort we should be able to get
rid of the above hash_lookup_union() completely and can instead use
real_lookup_union() variants from lookup_hash() also.

The reason I say this is, I can't see any _real_ difference b/n
real_lookup() and __lookup_hash_kern().  While the former does a seqlock
protected(for concurrent renames) dcache lookup followed by a ->lookup(),
the latter does an extra lock free dcache lookup, followed by seqlock
protected dcache lookup and a ->lookup() on failure.

Do you want me to cook up a patch for this Jan ?

Aside from that, it would help if someone could throw some light on the history
of __lookup_hash_kern. I wonder why real_lookup wasn't be used instead.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2007-08-09  5:43 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-30 16:13 [RFC 00/26] VFS based Union Mount (V2) Jan Blunck
2007-07-30 16:13 ` [RFC 01/26] [PATCH 14/18] shmem: convert to using splice instead of sendfile() Jan Blunck
2007-07-30 16:13 ` [RFC 02/26] VFS: Export dput_path() and path_to_nameidata() Jan Blunck
2007-07-30 16:13 ` [RFC 03/26] VFS: Make lookup_hash() return a struct path Jan Blunck
2007-07-30 16:13 ` [RFC 04/26] VFS: Make lookup_create() " Jan Blunck
2007-07-30 16:13 ` [RFC 05/26] VFS: cache_lookup() cleanup Jan Blunck
2007-07-30 16:13 ` [RFC 06/26] VFS: Make real_lookup() return a struct path Jan Blunck
2007-07-30 16:13 ` [RFC 07/26] VFS: Introduce dput() variante that maintains a kill-list Jan Blunck
2007-07-30 16:13 ` [RFC 08/26] VFS: Export lives_below_in_same_fs() Jan Blunck
2007-07-30 16:13 ` [RFC 09/26] linux/stat.h: Add the filetype white-out Jan Blunck
2007-07-30 16:13 ` [RFC 10/26] VFS white-out handling Jan Blunck
2007-07-30 16:13 ` [RFC 11/26] tmpfs white-out support Jan Blunck
2007-08-01 15:13   ` Hugh Dickins
2007-08-02  2:48     ` Matt Mackall
2007-07-30 16:13 ` [RFC 12/26] ext2 " Jan Blunck
2007-07-31  3:45   ` Theodore Tso
2007-07-31  7:44     ` Jan Blunck
2007-07-31  8:32       ` Andreas Dilger
2007-07-31  9:08         ` Jan Blunck
2007-07-31 10:53       ` Theodore Tso
2007-08-02 19:31         ` Pavel Machek
2007-07-31 16:36   ` Josef Sipek
2007-07-31 17:00     ` Jan Blunck
2007-07-31 17:11       ` Josef Sipek
2007-08-01 15:23         ` Dave Kleikamp
2007-08-01 18:44           ` Josef Sipek
2007-08-01 19:10             ` Dave Kleikamp
2007-08-01 19:33               ` Josef Sipek
2007-08-01 19:52                 ` Dave Kleikamp
2007-08-01 22:06                   ` Erez Zadok
2007-08-02 12:05                     ` Jan Blunck
2007-08-02 11:55                 ` Jan Blunck
2007-08-02 17:50                 ` Jörn Engel
2007-08-02 18:15                   ` Jeremy Maitin-Shepard
2007-08-02  5:24             ` Ph. Marek
2007-08-02 12:12               ` Jan Blunck
2007-08-02 10:26         ` Jan Blunck
2007-08-01 10:00       ` Hans-Peter Jansen
2007-08-01 11:43         ` Josef Sipek
2007-08-01 18:01         ` Jan Engelhardt
2007-07-31 17:03     ` Mark Williamson
2007-07-31 17:16       ` Josef Sipek
2007-08-01 17:58     ` Jan Engelhardt
2007-08-01 18:03       ` Josef Sipek
2007-07-30 16:13 ` [RFC 13/26] ext3 whiteout support Jan Blunck
2007-07-30 16:13 ` [RFC 14/26] union-mount: Documentation Jan Blunck
2007-07-30 16:13 ` [RFC 15/26] union-mount: Add union-mount mount flag Jan Blunck
2007-07-30 16:13 ` [RFC 16/26] union-mount: Introduce union_mount structure Jan Blunck
2007-08-06  5:57   ` Bharata B Rao
2007-07-30 16:13 ` [RFC 17/26] union-mount: Drive the union cache via dcache Jan Blunck
2007-07-30 16:13 ` [RFC 18/26] union-mount: Changes to the namespace handling Jan Blunck
2007-08-08 10:10   ` Bharata B Rao
2007-07-30 16:13 ` [RFC 19/26] union-mount: Make lookup work for union-mounted file systems Jan Blunck
2007-08-09  5:42   ` Bharata B Rao
2007-07-30 16:13 ` [RFC 20/26] union-mount: Simple union-mount readdir implementation Jan Blunck
2007-08-06 11:08   ` Bharata B Rao
2007-07-30 16:13 ` [RFC 21/26] union-mount: in-kernel file copy between union mounted filesystems Jan Blunck
2007-07-30 16:13 ` [RFC 22/26] union-mount: white-out changes for copy-on-open Jan Blunck
2007-07-30 16:13 ` [RFC 23/26] union-mount: copyup on rename Jan Blunck
2007-07-30 16:13 ` [RFC 24/26] union-mount: dont report EROFS for union mounts Jan Blunck
2007-07-30 16:13 ` [RFC 25/26] union-mount: Debug Infrastructure Jan Blunck
2007-07-30 16:13 ` [RFC 26/26] union-mount: Debug code Jan Blunck
2007-07-30 18:23 ` [RFC 00/26] VFS based Union Mount (V2) Al Boldi
2007-08-02  6:49 ` Bharata B Rao
2007-08-02 10:17   ` Jan Blunck

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).