linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/13] overlay filesystem: request for inclusion (v14)
@ 2012-08-15 15:48 Miklos Szeredi
  2012-08-15 15:48 ` [PATCH 01/13] vfs: add i_op->open() Miklos Szeredi
                   ` (13 more replies)
  0 siblings, 14 replies; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-15 15:48 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

Here's the latest version of the overlayfs series.

Git tree is here:

  git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs.v14

Please consider for 3.7.

Thanks,
Miklos


---
Andy Whitcroft (3):
      overlayfs: add statfs support
      ovl: switch to __inode_permission()
      overlayfs: copy up i_uid/i_gid from the underlying inode

Erez Zadok (1):
      overlayfs: implement show_options

Miklos Szeredi (6):
      vfs: add i_op->open()
      vfs: export do_splice_direct() to modules
      vfs: introduce clone_private_mount()
      overlay filesystem
      fs: limit filesystem stacking depth
      vfs: export __inode_permission() to modules

Neil Brown (1):
      overlay: overlay filesystem documentation

Robin Dong (2):
      overlayfs: fix possible leak in ovl_new_inode
      overlayfs: create new inode in ovl_link

---
 Documentation/filesystems/Locking       |    2 +
 Documentation/filesystems/overlayfs.txt |  199 +++++++++
 Documentation/filesystems/vfs.txt       |    7 +
 MAINTAINERS                             |    7 +
 fs/Kconfig                              |    1 +
 fs/Makefile                             |    1 +
 fs/ecryptfs/main.c                      |    7 +
 fs/internal.h                           |    5 -
 fs/namei.c                              |   10 +-
 fs/namespace.c                          |   18 +
 fs/open.c                               |   23 +-
 fs/overlayfs/Kconfig                    |    4 +
 fs/overlayfs/Makefile                   |    7 +
 fs/overlayfs/copy_up.c                  |  385 ++++++++++++++++++
 fs/overlayfs/dir.c                      |  604 ++++++++++++++++++++++++++++
 fs/overlayfs/inode.c                    |  372 +++++++++++++++++
 fs/overlayfs/overlayfs.h                |   70 ++++
 fs/overlayfs/readdir.c                  |  566 ++++++++++++++++++++++++++
 fs/overlayfs/super.c                    |  665 +++++++++++++++++++++++++++++++
 fs/splice.c                             |    1 +
 include/linux/fs.h                      |   14 +
 include/linux/mount.h                   |    3 +
 22 files changed, 2961 insertions(+), 10 deletions(-)
 create mode 100644 Documentation/filesystems/overlayfs.txt
 create mode 100644 fs/overlayfs/Kconfig
 create mode 100644 fs/overlayfs/Makefile
 create mode 100644 fs/overlayfs/copy_up.c
 create mode 100644 fs/overlayfs/dir.c
 create mode 100644 fs/overlayfs/inode.c
 create mode 100644 fs/overlayfs/overlayfs.h
 create mode 100644 fs/overlayfs/readdir.c
 create mode 100644 fs/overlayfs/super.c

------------------------------------------------------------------------------
Changes from v13 to v14

- update to 3.6

- copy i_uid/i_gid from the underlying inode (patch by Andy Whitcroft)

------------------------------------------------------------------------------
Changes from v12 to v13

- create new inode in ovl_link (patch by Robin Dong)

- switch to __inode_permission() (patch by Andy Whitcroft)

------------------------------------------------------------------------------
Changes from v11 to v12

- update to for-next of vfs tree

- split __dentry_open argument cleanup patch from vfs-add-i_op-open.patch

- change i_op->open and vfs_open so that they take "struct file *"

------------------------------------------------------------------------------
Changes from v10 to v11

- fix overlayfs over overlayfs

- improve stack use of lookup and readdir

- add limitations to documentation

- make lower mount read-only

- update permission and fsync to new API

------------------------------------------------------------------------------
Changes from v9 to v10

- prevent d_delete() from turning upperdentry negative (reported by
  Erez Zadok)

- show mount options in /proc/mounts and friends (patch by Erez Zadok)

- fix off-by-one error in readdir (reported by Jordi Pujol)

------------------------------------------------------------------------------
Changes from v8 to v9

- support xattr on tmpfs

- fix build after split-up

- fix remove after rename (reported by Jordi Pujol)

- fix rename failure case

------------------------------------------------------------------------------
Changes from v7 to v8:

- split overlayfs.c into smaller files

- fix locking for copy up (reported by Al Viro)

- locking analysis of copy up vs. directory rename added as a comment

- tested with lockdep, fixed one lock annotation

- other bug fixes

------------------------------------------------------------------------------
Changes from v6 to v7

- added patches from Felix Fietkau to fix deadlocks on jffs2

- optimized directory removal

- properly clean up after copy-up and other failures

------------------------------------------------------------------------------
Changes from v5 to v6

- optimize directory merging

  o use rbtree for weeding out duplicates

  o use a cursor for current position within the stream

- instead of f_op->open_other(), implement i_op->open()

- don't share inodes for non-directory dentries - for now.  I hope
  this can come back once RCU lookup code has settled.

- misc bug fixes

------------------------------------------------------------------------------
Changes from v4 to v5

- fix copying up if fs doesn't support xattrs (Andy Whitcroft)

- clone mounts to be used internally to access the underlying
  filesystems

------------------------------------------------------------------------------
Changes from v3 to v4

- export security_inode_permission to allow overlayfs to be modular
  (Andy Whitcroft)

- add statfs support (Andy Whitcroft)

- change BUG_ON to WARN_ON

- Revert "vfs: add flag to allow rename to same inode", instead
  introduce s_op->is_same_inode()

- overlayfs: fix rename to self

- fix whiteout after rename

------------------------------------------------------------------------------
Changes from v2 to v3

 - Minimal remount support.  As overlayfs reflects the 'readonly'
   mount status in write-access to the upper filesystem, we must
   handle remount and either drop or take write access when the ro
   status changes. (NeilBrown)

 - Use correct seek function for directories.  It is incorrect to call
   generic_llseek_file on a file from a different filesystem.  For
   that we must use the seek function that the filesystem defines,
   which is called by vfs_llseek.  Also, we only want to seek the
   realfile when is_real is true.  Otherwise we just want to update
   our own f_pos pointer, so use generic_llseek_file for
   that. (NeilBrown)

 - Initialise is_real before use.  The previous patch can use
   od->is_real before it is properly initialised is llseek is called
   before readdir.  So factor out the initialisation of is_real and
   call it from both readdir and llseek when f_pos is 0. (NeilBrown)

 - Rename ovl_fill_cache to ovl_dir_read (NeilBrown)

 - Tiny optimisation in open_other handling (NeilBrown)

 - Assorted updates to Documentation/filesystems/overlayfs.txt (NeilBrown)

 - Make copy-up work for >=4G files, make it killable during copy-up.
   Need to fix recovery after a failed/interrupted copy-up.

 - Store and reference upper/lower dentries in overlay dentries.
   Store and reference upper/lower vfsmounts in overlay superblock.

 - Add necessary barriers for setting upper dentry in copyup and for
   retrieving upper dentry locklessly.

 - Make sure the right file is used for directory fsync() after
   copy-up.

 - Add locking to ovl_dir_llseek() to prevent concurrent call of
   ovl_dir_reset() with ovl_dir_read().

 - Get rid of ovl_dentry_iput().  The VFS doesn't provide enough
   locking for this function that the contents of ->d_fsdata could be
   safely updated.

 - After copying up a non-directory unhash the dentry.  This way the
   lower dentry ref, which is no longer necessary, can go away.  This
   revealed a use-after-free bug in truncate handling in
   fs/namei.c:finish_open().

 - Fix if a copy-up happens between the follow_linka the put_link
   calls.

 - Replace some WARN_ONs with BUG_ON.  Some things just _really_
   shouldn't happen.

 - Extract common code from ovl_unlink and ovl_rmdir to a helper
   function.

 - After unlink and rmdir unhash the dentry.  This will get rid of the
   lower and upper dentry references after there are no more users of
   the deleted dentry.  This is a safe replacement for the removed
   ->d_iput() functionality.

 - Added checks to unlink, rmdir and rename to verify that the
   parent-child relationship in the upper filesystem matches that of
   the overlay.  This is necessary to prevent crash and/or corruption
   if the upper filesystem topology is being modified while part of
   the overlay.

 - Optimize checking whiteout and opaque attributes.

 - Optimize copy-up on truncate: don't copy up whole file before
   truncating

 - Misc bug fixes

------------------------------------------------------------------------------
Changes from v1 to v2

 - rename "hybrid union filesystem" to "overlay filesystem" or overlayfs

 - added documentation written by Neil

 - correct st_dev for directories (reported by Neil)

 - use getattr() to get attributes from the underlying filesystems,
   this means that now an overlay filesystem itself can be the lower,
   read-only layer of another overlay

 - listxattr filters out private extended attributes

 - get write ref on the upper layer on mount unless the overlay
   itself is mounted read-only

 - raise capabilities for copy up, dealing with whiteouts and opaque
   directories.  Now the overlay works for non-root users as well

 - "rm -rf" didn't work correctly in all cases if the directory was
   copied up between opendir and the first readdir, this is now fixed
   (and the directory operations consolidated)

 - simplified copy up, this broke optimization for truncate and
   open(O_TRUNC) (now file is copied up to be immediately truncated,
   will fix)

 - st_nlink for merged directories set to 1, this is an "illegal"
   value that normal filesystems never have but some use it to
   indicate that the number of subdirectories is unknown.  Utilities
   (find, ...) seem to tolerate this well.

 - misc fixes I forgot about



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 01/13] vfs: add i_op->open()
  2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
@ 2012-08-15 15:48 ` Miklos Szeredi
  2012-08-15 17:21   ` J. Bruce Fields
  2012-08-15 15:48 ` [PATCH 02/13] vfs: export do_splice_direct() to modules Miklos Szeredi
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-15 15:48 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

From: Miklos Szeredi <mszeredi@suse.cz>

Add a new inode operation i_op->open().  This is for stacked
filesystems that want to return a struct file from a different
filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 Documentation/filesystems/Locking |    2 ++
 Documentation/filesystems/vfs.txt |    7 +++++++
 fs/namei.c                        |    9 ++++++---
 fs/open.c                         |   23 +++++++++++++++++++++--
 include/linux/fs.h                |    2 ++
 5 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 0f103e3..d222b6a 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -64,6 +64,7 @@ prototypes:
 	int (*atomic_open)(struct inode *, struct dentry *,
 				struct file *, unsigned open_flag,
 				umode_t create_mode, int *opened);
+	int (*dentry_open)(struct dentry *, struct file *, const struct cred *);
 
 locking rules:
 	all may block
@@ -92,6 +93,7 @@ removexattr:	yes
 fiemap:		no
 update_time:	no
 atomic_open:	yes
+open:		no
 
 	Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on
 victim.
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 065aa2d..f53d93c 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -367,6 +367,7 @@ struct inode_operations {
 	int (*atomic_open)(struct inode *, struct dentry *,
 				struct file *, unsigned open_flag,
 				umode_t create_mode, int *opened);
+	int (*dentry_open)(struct dentry *, struct file *, const struct cred *);
 };
 
 Again, all methods are called without any locks being held, unless
@@ -696,6 +697,12 @@ struct address_space_operations {
   	but instead uses bmap to find out where the blocks in the file
   	are and uses those addresses directly.
 
+  dentry_open: this is an alternative to f_op->open(), the difference is that
+	this method may open a file not necessarily originating from the same
+	filesystem as the one i_op->open() was called on.  It may be
+	useful for stacking filesystems which want to allow native I/O directly
+	on underlying files.
+
 
   invalidatepage: If a page has PagePrivate set, then invalidatepage
         will be called when part or all of the page is to be removed
diff --git a/fs/namei.c b/fs/namei.c
index 1b46439..ac2526d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2816,9 +2816,12 @@ finish_open_created:
 	error = may_open(&nd->path, acc_mode, open_flag);
 	if (error)
 		goto out;
-	file->f_path.mnt = nd->path.mnt;
-	error = finish_open(file, nd->path.dentry, NULL, opened);
-	if (error) {
+
+	BUG_ON(*opened & FILE_OPENED); /* once it's opened, it's opened */
+	error = vfs_open(&nd->path, file, current_cred());
+	if (!error) {
+		*opened |= FILE_OPENED;
+	} else {
 		if (error == -EOPENSTALE)
 			goto stale_open;
 		goto out;
diff --git a/fs/open.c b/fs/open.c
index f3d96e7..c5a8cac 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -787,8 +787,7 @@ struct file *dentry_open(const struct path *path, int flags,
 		return ERR_PTR(error);
 
 	f->f_flags = flags;
-	f->f_path = *path;
-	error = do_dentry_open(f, NULL, cred);
+	error = vfs_open(path, f, cred);
 	if (!error) {
 		error = open_check_o_direct(f);
 		if (error) {
@@ -803,6 +802,26 @@ struct file *dentry_open(const struct path *path, int flags,
 }
 EXPORT_SYMBOL(dentry_open);
 
+/**
+ * vfs_open - open the file at the given path
+ * @path: path to open
+ * @filp: newly allocated file with f_flag initialized
+ * @cred: credentials to use
+ */
+int vfs_open(const struct path *path, struct file *filp,
+	     const struct cred *cred)
+{
+	struct inode *inode = path->dentry->d_inode;
+
+	if (inode->i_op->dentry_open)
+		return inode->i_op->dentry_open(path->dentry, filp, cred);
+	else {
+		filp->f_path = *path;
+		return do_dentry_open(filp, NULL, cred);
+	}
+}
+EXPORT_SYMBOL(vfs_open);
+
 static void __put_unused_fd(struct files_struct *files, unsigned int fd)
 {
 	struct fdtable *fdt = files_fdtable(files);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 38dba16..abc7a53 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1836,6 +1836,7 @@ struct inode_operations {
 	int (*atomic_open)(struct inode *, struct dentry *,
 			   struct file *, unsigned open_flag,
 			   umode_t create_mode, int *opened);
+	int (*dentry_open)(struct dentry *, struct file *, const struct cred *);
 } ____cacheline_aligned;
 
 struct seq_file;
@@ -2201,6 +2202,7 @@ extern long do_sys_open(int dfd, const char __user *filename, int flags,
 extern struct file *filp_open(const char *, int, umode_t);
 extern struct file *file_open_root(struct dentry *, struct vfsmount *,
 				   const char *, int);
+extern int vfs_open(const struct path *, struct file *, const struct cred *);
 extern struct file * dentry_open(const struct path *, int, const struct cred *);
 extern int filp_close(struct file *, fl_owner_t id);
 extern char * getname(const char __user *);
-- 
1.7.7


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 02/13] vfs: export do_splice_direct() to modules
  2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
  2012-08-15 15:48 ` [PATCH 01/13] vfs: add i_op->open() Miklos Szeredi
@ 2012-08-15 15:48 ` Miklos Szeredi
  2012-08-15 15:48 ` [PATCH 03/13] vfs: introduce clone_private_mount() Miklos Szeredi
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-15 15:48 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

From: Miklos Szeredi <mszeredi@suse.cz>

Export do_splice_direct() to modules.  Needed by overlay filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/splice.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 41514dd..2695a60 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1308,6 +1308,7 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 
 	return ret;
 }
+EXPORT_SYMBOL(do_splice_direct);
 
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			       struct pipe_inode_info *opipe,
-- 
1.7.7


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 03/13] vfs: introduce clone_private_mount()
  2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
  2012-08-15 15:48 ` [PATCH 01/13] vfs: add i_op->open() Miklos Szeredi
  2012-08-15 15:48 ` [PATCH 02/13] vfs: export do_splice_direct() to modules Miklos Szeredi
@ 2012-08-15 15:48 ` Miklos Szeredi
  2012-08-15 15:48 ` [PATCH 04/13] overlay filesystem Miklos Szeredi
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-15 15:48 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

From: Miklos Szeredi <mszeredi@suse.cz>

Overlayfs needs a private clone of the mount, so create a function for
this and export to modules.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/namespace.c        |   18 ++++++++++++++++++
 include/linux/mount.h |    3 +++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 4d31f73..b4712ea 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1387,6 +1387,24 @@ void drop_collected_mounts(struct vfsmount *mnt)
 	release_mounts(&umount_list);
 }
 
+struct vfsmount *clone_private_mount(struct path *path)
+{
+	struct mount *old_mnt = real_mount(path->mnt);
+	struct mount *new_mnt;
+
+	if (IS_MNT_UNBINDABLE(old_mnt))
+		return ERR_PTR(-EINVAL);
+
+	down_read(&namespace_sem);
+	new_mnt = clone_mnt(old_mnt, path->dentry, CL_PRIVATE);
+	up_read(&namespace_sem);
+	if (!new_mnt)
+		return ERR_PTR(-ENOMEM);
+
+	return &new_mnt->mnt;
+}
+EXPORT_SYMBOL_GPL(clone_private_mount);
+
 int iterate_mounts(int (*f)(struct vfsmount *, void *), void *arg,
 		   struct vfsmount *root)
 {
diff --git a/include/linux/mount.h b/include/linux/mount.h
index d7029f4..344a262 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -66,6 +66,9 @@ extern void mnt_pin(struct vfsmount *mnt);
 extern void mnt_unpin(struct vfsmount *mnt);
 extern int __mnt_is_readonly(struct vfsmount *mnt);
 
+struct path;
+extern struct vfsmount *clone_private_mount(struct path *path);
+
 struct file_system_type;
 extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
 				      int flags, const char *name,
-- 
1.7.7


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 04/13] overlay filesystem
  2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
                   ` (2 preceding siblings ...)
  2012-08-15 15:48 ` [PATCH 03/13] vfs: introduce clone_private_mount() Miklos Szeredi
@ 2012-08-15 15:48 ` Miklos Szeredi
  2012-08-16  6:24   ` Eric W. Biederman
  2012-08-15 15:48 ` [PATCH 05/13] overlayfs: add statfs support Miklos Szeredi
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-15 15:48 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

From: Miklos Szeredi <mszeredi@suse.cz>

Overlayfs allows one, usually read-write, directory tree to be
overlaid onto another, read-only directory tree.  All modifications
go to the upper, writable layer.

This type of mechanism is most often used for live CDs but there's a
wide variety of other uses.

The implementation differs from other "union filesystem"
implementations in that after a file is opened all operations go
directly to the underlying, lower or upper, filesystems.  This
simplifies the implementation and allows native performance in these
cases.

The dentry tree is duplicated from the underlying filesystems, this
enables fast cached lookups without adding special support into the
VFS.  This uses slightly more memory than union mounts, but dentries
are relatively small.

Currently inodes are duplicated as well, but it is a possible
optimization to share inodes for non-directories.

Opening non directories results in the open forwarded to the
underlying filesystem.  This makes the behavior very similar to union
mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file
descriptors).

Usage:

  mount -t overlay -olowerdir=/lower,upperdir=/upper overlay /mnt

Supported:

 - all operations

Missing:

 - Currently a crash in the middle of copy-up, rename, unlink, rmdir or create
   over a whiteout may result in filesystem corruption on the overlay level.
   IOW these operations need to become atomic or at least the corruption needs
   to be detected.


The following cotributions have been folded into this patch:

Neil Brown <neilb@suse.de>:
 - minimal remount support
 - use correct seek function for directories
 - initialise is_real before use
 - rename ovl_fill_cache to ovl_dir_read

Felix Fietkau <nbd@openwrt.org>:
 - fix a deadlock in ovl_dir_read_merged
 - fix a deadlock in ovl_remove_whiteouts

Erez Zadok <ezk@fsl.cs.sunysb.edu>
 - fix cleanup after WARN_ON

Sedat Dilek <sedat.dilek@googlemail.com>
 - fix up permission to confirm to new API

Also thanks to the following people for testing and reporting bugs:

  Jordi Pujol <jordipujolp@gmail.com>
  Andy Whitcroft <apw@canonical.com>
  Michal Suchanek <hramrach@centrum.cz>
  Felix Fietkau <nbd@openwrt.org>
  Erez Zadok <ezk@fsl.cs.sunysb.edu>
  Randy Dunlap <rdunlap@xenotime.net>

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/Kconfig               |    1 +
 fs/Makefile              |    1 +
 fs/overlayfs/Kconfig     |    4 +
 fs/overlayfs/Makefile    |    7 +
 fs/overlayfs/copy_up.c   |  385 +++++++++++++++++++++++++++++
 fs/overlayfs/dir.c       |  598 +++++++++++++++++++++++++++++++++++++++++++++
 fs/overlayfs/inode.c     |  379 ++++++++++++++++++++++++++++
 fs/overlayfs/overlayfs.h |   64 +++++
 fs/overlayfs/readdir.c   |  566 ++++++++++++++++++++++++++++++++++++++++++
 fs/overlayfs/super.c     |  611 ++++++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 2616 insertions(+), 0 deletions(-)
 create mode 100644 fs/overlayfs/Kconfig
 create mode 100644 fs/overlayfs/Makefile
 create mode 100644 fs/overlayfs/copy_up.c
 create mode 100644 fs/overlayfs/dir.c
 create mode 100644 fs/overlayfs/inode.c
 create mode 100644 fs/overlayfs/overlayfs.h
 create mode 100644 fs/overlayfs/readdir.c
 create mode 100644 fs/overlayfs/super.c

diff --git a/fs/Kconfig b/fs/Kconfig
index f95ae3a..e0c5d43 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -67,6 +67,7 @@ source "fs/quota/Kconfig"
 
 source "fs/autofs4/Kconfig"
 source "fs/fuse/Kconfig"
+source "fs/overlayfs/Kconfig"
 
 config CUSE
 	tristate "Character device in Userspace support"
diff --git a/fs/Makefile b/fs/Makefile
index 2fb9779..fcd9788 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_QNX6FS_FS)		+= qnx6/
 obj-$(CONFIG_AUTOFS4_FS)	+= autofs4/
 obj-$(CONFIG_ADFS_FS)		+= adfs/
 obj-$(CONFIG_FUSE_FS)		+= fuse/
+obj-$(CONFIG_OVERLAYFS_FS)	+= overlayfs/
 obj-$(CONFIG_UDF_FS)		+= udf/
 obj-$(CONFIG_SUN_OPENPROMFS)	+= openpromfs/
 obj-$(CONFIG_OMFS_FS)		+= omfs/
diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
new file mode 100644
index 0000000..c4517da
--- /dev/null
+++ b/fs/overlayfs/Kconfig
@@ -0,0 +1,4 @@
+config OVERLAYFS_FS
+	tristate "Overlay filesystem support"
+	help
+	  Add support for overlay filesystem.
diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
new file mode 100644
index 0000000..8f91889
--- /dev/null
+++ b/fs/overlayfs/Makefile
@@ -0,0 +1,7 @@
+#
+# Makefile for the overlay filesystem.
+#
+
+obj-$(CONFIG_OVERLAYFS_FS) += overlayfs.o
+
+overlayfs-objs := super.o inode.o dir.o readdir.o copy_up.o
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
new file mode 100644
index 0000000..87dbeee
--- /dev/null
+++ b/fs/overlayfs/copy_up.c
@@ -0,0 +1,385 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/splice.h>
+#include <linux/xattr.h>
+#include <linux/security.h>
+#include <linux/uaccess.h>
+#include <linux/sched.h>
+#include "overlayfs.h"
+
+#define OVL_COPY_UP_CHUNK_SIZE (1 << 20)
+
+static int ovl_copy_up_xattr(struct dentry *old, struct dentry *new)
+{
+	ssize_t list_size, size;
+	char *buf, *name, *value;
+	int error;
+
+	if (!old->d_inode->i_op->getxattr ||
+	    !new->d_inode->i_op->getxattr)
+		return 0;
+
+	list_size = vfs_listxattr(old, NULL, 0);
+	if (list_size <= 0) {
+		if (list_size == -EOPNOTSUPP)
+			return 0;
+		return list_size;
+	}
+
+	buf = kzalloc(list_size, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	error = -ENOMEM;
+	value = kmalloc(XATTR_SIZE_MAX, GFP_KERNEL);
+	if (!value)
+		goto out;
+
+	list_size = vfs_listxattr(old, buf, list_size);
+	if (list_size <= 0) {
+		error = list_size;
+		goto out_free_value;
+	}
+
+	for (name = buf; name < (buf + list_size); name += strlen(name) + 1) {
+		size = vfs_getxattr(old, name, value, XATTR_SIZE_MAX);
+		if (size <= 0) {
+			error = size;
+			goto out_free_value;
+		}
+		error = vfs_setxattr(new, name, value, size, 0);
+		if (error)
+			goto out_free_value;
+	}
+
+out_free_value:
+	kfree(value);
+out:
+	kfree(buf);
+	return error;
+}
+
+static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
+{
+	struct file *old_file;
+	struct file *new_file;
+	int error = 0;
+
+	if (len == 0)
+		return 0;
+
+	old_file = ovl_path_open(old, O_RDONLY);
+	if (IS_ERR(old_file))
+		return PTR_ERR(old_file);
+
+	new_file = ovl_path_open(new, O_WRONLY);
+	if (IS_ERR(new_file)) {
+		error = PTR_ERR(new_file);
+		goto out_fput;
+	}
+
+	/* FIXME: copy up sparse files efficiently */
+	while (len) {
+		loff_t offset = new_file->f_pos;
+		size_t this_len = OVL_COPY_UP_CHUNK_SIZE;
+		long bytes;
+
+		if (len < this_len)
+			this_len = len;
+
+		if (signal_pending_state(TASK_KILLABLE, current)) {
+			error = -EINTR;
+			break;
+		}
+
+		bytes = do_splice_direct(old_file, &offset, new_file, this_len,
+				 SPLICE_F_MOVE);
+		if (bytes <= 0) {
+			error = bytes;
+			break;
+		}
+
+		len -= bytes;
+	}
+
+	fput(new_file);
+out_fput:
+	fput(old_file);
+	return error;
+}
+
+static char *ovl_read_symlink(struct dentry *realdentry)
+{
+	int res;
+	char *buf;
+	struct inode *inode = realdentry->d_inode;
+	mm_segment_t old_fs;
+
+	res = -EINVAL;
+	if (!inode->i_op->readlink)
+		goto err;
+
+	res = -ENOMEM;
+	buf = (char *) __get_free_page(GFP_KERNEL);
+	if (!buf)
+		goto err;
+
+	old_fs = get_fs();
+	set_fs(get_ds());
+	/* The cast to a user pointer is valid due to the set_fs() */
+	res = inode->i_op->readlink(realdentry,
+				    (char __user *)buf, PAGE_SIZE - 1);
+	set_fs(old_fs);
+	if (res < 0) {
+		free_page((unsigned long) buf);
+		goto err;
+	}
+	buf[res] = '\0';
+
+	return buf;
+
+err:
+	return ERR_PTR(res);
+}
+
+static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
+{
+	struct iattr attr = {
+		.ia_valid =
+		     ATTR_ATIME | ATTR_MTIME | ATTR_ATIME_SET | ATTR_MTIME_SET,
+		.ia_atime = stat->atime,
+		.ia_mtime = stat->mtime,
+	};
+
+	return notify_change(upperdentry, &attr);
+}
+
+static int ovl_set_mode(struct dentry *upperdentry, umode_t mode)
+{
+	struct iattr attr = {
+		.ia_valid = ATTR_MODE,
+		.ia_mode = mode,
+	};
+
+	return notify_change(upperdentry, &attr);
+}
+
+static int ovl_copy_up_locked(struct dentry *upperdir, struct dentry *dentry,
+			      struct path *lowerpath, struct kstat *stat,
+			      const char *link)
+{
+	int err;
+	struct path newpath;
+	umode_t mode = stat->mode;
+
+	/* Can't properly set mode on creation because of the umask */
+	stat->mode &= S_IFMT;
+
+	ovl_path_upper(dentry, &newpath);
+	WARN_ON(newpath.dentry);
+	newpath.dentry = ovl_upper_create(upperdir, dentry, stat, link);
+	if (IS_ERR(newpath.dentry))
+		return PTR_ERR(newpath.dentry);
+
+	if (S_ISREG(stat->mode)) {
+		err = ovl_copy_up_data(lowerpath, &newpath, stat->size);
+		if (err)
+			goto err_remove;
+	}
+
+	err = ovl_copy_up_xattr(lowerpath->dentry, newpath.dentry);
+	if (err)
+		goto err_remove;
+
+	mutex_lock(&newpath.dentry->d_inode->i_mutex);
+	if (!S_ISLNK(stat->mode))
+		err = ovl_set_mode(newpath.dentry, mode);
+	if (!err)
+		err = ovl_set_timestamps(newpath.dentry, stat);
+	mutex_unlock(&newpath.dentry->d_inode->i_mutex);
+	if (err)
+		goto err_remove;
+
+	ovl_dentry_update(dentry, newpath.dentry);
+
+	/*
+	 * Easiest way to get rid of the lower dentry reference is to
+	 * drop this dentry.  This is neither needed nor possible for
+	 * directories.
+	 */
+	if (!S_ISDIR(stat->mode))
+		d_drop(dentry);
+
+	return 0;
+
+err_remove:
+	if (S_ISDIR(stat->mode))
+		vfs_rmdir(upperdir->d_inode, newpath.dentry);
+	else
+		vfs_unlink(upperdir->d_inode, newpath.dentry);
+
+	dput(newpath.dentry);
+
+	return err;
+}
+
+/*
+ * Copy up a single dentry
+ *
+ * Directory renames only allowed on "pure upper" (already created on
+ * upper filesystem, never copied up).  Directories which are on lower or
+ * are merged may not be renamed.  For these -EXDEV is returned and
+ * userspace has to deal with it.  This means, when copying up a
+ * directory we can rely on it and ancestors being stable.
+ *
+ * Non-directory renames start with copy up of source if necessary.  The
+ * actual rename will only proceed once the copy up was successful.  Copy
+ * up uses upper parent i_mutex for exclusion.  Since rename can change
+ * d_parent it is possible that the copy up will lock the old parent.  At
+ * that point the file will have already been copied up anyway.
+ */
+static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
+			   struct path *lowerpath, struct kstat *stat)
+{
+	int err;
+	struct kstat pstat;
+	struct path parentpath;
+	struct dentry *upperdir;
+	const struct cred *old_cred;
+	struct cred *override_cred;
+	char *link = NULL;
+
+	ovl_path_upper(parent, &parentpath);
+	upperdir = parentpath.dentry;
+
+	err = vfs_getattr(parentpath.mnt, parentpath.dentry, &pstat);
+	if (err)
+		return err;
+
+	if (S_ISLNK(stat->mode)) {
+		link = ovl_read_symlink(lowerpath->dentry);
+		if (IS_ERR(link))
+			return PTR_ERR(link);
+	}
+
+	err = -ENOMEM;
+	override_cred = prepare_creds();
+	if (!override_cred)
+		goto out_free_link;
+
+	override_cred->fsuid = stat->uid;
+	override_cred->fsgid = stat->gid;
+	/*
+	 * CAP_SYS_ADMIN for copying up extended attributes
+	 * CAP_DAC_OVERRIDE for create
+	 * CAP_FOWNER for chmod, timestamp update
+	 * CAP_FSETID for chmod
+	 * CAP_MKNOD for mknod
+	 */
+	cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+	cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
+	cap_raise(override_cred->cap_effective, CAP_FOWNER);
+	cap_raise(override_cred->cap_effective, CAP_FSETID);
+	cap_raise(override_cred->cap_effective, CAP_MKNOD);
+	old_cred = override_creds(override_cred);
+
+	mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+	if (ovl_path_type(dentry) != OVL_PATH_LOWER) {
+		err = 0;
+	} else {
+		err = ovl_copy_up_locked(upperdir, dentry, lowerpath,
+					 stat, link);
+		if (!err) {
+			/* Restore timestamps on parent (best effort) */
+			ovl_set_timestamps(upperdir, &pstat);
+		}
+	}
+
+	mutex_unlock(&upperdir->d_inode->i_mutex);
+
+	revert_creds(old_cred);
+	put_cred(override_cred);
+
+out_free_link:
+	if (link)
+		free_page((unsigned long) link);
+
+	return err;
+}
+
+int ovl_copy_up(struct dentry *dentry)
+{
+	int err;
+
+	err = 0;
+	while (!err) {
+		struct dentry *next;
+		struct dentry *parent;
+		struct path lowerpath;
+		struct kstat stat;
+		enum ovl_path_type type = ovl_path_type(dentry);
+
+		if (type != OVL_PATH_LOWER)
+			break;
+
+		next = dget(dentry);
+		/* find the topmost dentry not yet copied up */
+		for (;;) {
+			parent = dget_parent(next);
+
+			type = ovl_path_type(parent);
+			if (type != OVL_PATH_LOWER)
+				break;
+
+			dput(next);
+			next = parent;
+		}
+
+		ovl_path_lower(next, &lowerpath);
+		err = vfs_getattr(lowerpath.mnt, lowerpath.dentry, &stat);
+		if (!err)
+			err = ovl_copy_up_one(parent, next, &lowerpath, &stat);
+
+		dput(parent);
+		dput(next);
+	}
+
+	return err;
+}
+
+/* Optimize by not copying up the file first and truncating later */
+int ovl_copy_up_truncate(struct dentry *dentry, loff_t size)
+{
+	int err;
+	struct kstat stat;
+	struct path lowerpath;
+	struct dentry *parent = dget_parent(dentry);
+
+	err = ovl_copy_up(parent);
+	if (err)
+		goto out_dput_parent;
+
+	ovl_path_lower(dentry, &lowerpath);
+	err = vfs_getattr(lowerpath.mnt, lowerpath.dentry, &stat);
+	if (err)
+		goto out_dput_parent;
+
+	if (size < stat.size)
+		stat.size = size;
+
+	err = ovl_copy_up_one(parent, dentry, &lowerpath, &stat);
+
+out_dput_parent:
+	dput(parent);
+	return err;
+}
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
new file mode 100644
index 0000000..6b50823
--- /dev/null
+++ b/fs/overlayfs/dir.c
@@ -0,0 +1,598 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/xattr.h>
+#include <linux/security.h>
+#include <linux/cred.h>
+#include "overlayfs.h"
+
+static const char *ovl_whiteout_symlink = "(overlay-whiteout)";
+
+static int ovl_whiteout(struct dentry *upperdir, struct dentry *dentry)
+{
+	int err;
+	struct dentry *newdentry;
+	const struct cred *old_cred;
+	struct cred *override_cred;
+
+	/* FIXME: recheck lower dentry to see if whiteout is really needed */
+
+	err = -ENOMEM;
+	override_cred = prepare_creds();
+	if (!override_cred)
+		goto out;
+
+	/*
+	 * CAP_SYS_ADMIN for setxattr
+	 * CAP_DAC_OVERRIDE for symlink creation
+	 * CAP_FOWNER for unlink in sticky directory
+	 */
+	cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+	cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
+	cap_raise(override_cred->cap_effective, CAP_FOWNER);
+	override_cred->fsuid = 0;
+	override_cred->fsgid = 0;
+	old_cred = override_creds(override_cred);
+
+	newdentry = lookup_one_len(dentry->d_name.name, upperdir,
+				   dentry->d_name.len);
+	err = PTR_ERR(newdentry);
+	if (IS_ERR(newdentry))
+		goto out_put_cred;
+
+	/* Just been removed within the same locked region */
+	WARN_ON(newdentry->d_inode);
+
+	err = vfs_symlink(upperdir->d_inode, newdentry, ovl_whiteout_symlink);
+	if (err)
+		goto out_dput;
+
+	ovl_dentry_version_inc(dentry->d_parent);
+
+	err = vfs_setxattr(newdentry, ovl_whiteout_xattr, "y", 1, 0);
+	if (err)
+		vfs_unlink(upperdir->d_inode, newdentry);
+
+out_dput:
+	dput(newdentry);
+out_put_cred:
+	revert_creds(old_cred);
+	put_cred(override_cred);
+out:
+	if (err) {
+		/*
+		 * There's no way to recover from failure to whiteout.
+		 * What should we do?  Log a big fat error and... ?
+		 */
+		printk(KERN_ERR "overlayfs: ERROR - failed to whiteout '%s'\n",
+		       dentry->d_name.name);
+	}
+
+	return err;
+}
+
+static struct dentry *ovl_lookup_create(struct dentry *upperdir,
+					struct dentry *template)
+{
+	int err;
+	struct dentry *newdentry;
+	struct qstr *name = &template->d_name;
+
+	newdentry = lookup_one_len(name->name, upperdir, name->len);
+	if (IS_ERR(newdentry))
+		return newdentry;
+
+	if (newdentry->d_inode) {
+		const struct cred *old_cred;
+		struct cred *override_cred;
+
+		/* No need to check whiteout if lower parent is non-existent */
+		err = -EEXIST;
+		if (!ovl_dentry_lower(template->d_parent))
+			goto out_dput;
+
+		if (!S_ISLNK(newdentry->d_inode->i_mode))
+			goto out_dput;
+
+		err = -ENOMEM;
+		override_cred = prepare_creds();
+		if (!override_cred)
+			goto out_dput;
+
+		/*
+		 * CAP_SYS_ADMIN for getxattr
+		 * CAP_FOWNER for unlink in sticky directory
+		 */
+		cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+		cap_raise(override_cred->cap_effective, CAP_FOWNER);
+		old_cred = override_creds(override_cred);
+
+		err = -EEXIST;
+		if (ovl_is_whiteout(newdentry))
+			err = vfs_unlink(upperdir->d_inode, newdentry);
+
+		revert_creds(old_cred);
+		put_cred(override_cred);
+		if (err)
+			goto out_dput;
+
+		dput(newdentry);
+		newdentry = lookup_one_len(name->name, upperdir, name->len);
+		if (IS_ERR(newdentry)) {
+			ovl_whiteout(upperdir, template);
+			return newdentry;
+		}
+
+		/*
+		 * Whiteout just been successfully removed, parent
+		 * i_mutex is still held, there's no way the lookup
+		 * could return positive.
+		 */
+		WARN_ON(newdentry->d_inode);
+	}
+
+	return newdentry;
+
+out_dput:
+	dput(newdentry);
+	return ERR_PTR(err);
+}
+
+struct dentry *ovl_upper_create(struct dentry *upperdir, struct dentry *dentry,
+				struct kstat *stat, const char *link)
+{
+	int err;
+	struct dentry *newdentry;
+	struct inode *dir = upperdir->d_inode;
+
+	newdentry = ovl_lookup_create(upperdir, dentry);
+	if (IS_ERR(newdentry))
+		goto out;
+
+	switch (stat->mode & S_IFMT) {
+	case S_IFREG:
+		err = vfs_create(dir, newdentry, stat->mode, NULL);
+		break;
+
+	case S_IFDIR:
+		err = vfs_mkdir(dir, newdentry, stat->mode);
+		break;
+
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFIFO:
+	case S_IFSOCK:
+		err = vfs_mknod(dir, newdentry, stat->mode, stat->rdev);
+		break;
+
+	case S_IFLNK:
+		err = vfs_symlink(dir, newdentry, link);
+		break;
+
+	default:
+		err = -EPERM;
+	}
+	if (err) {
+		if (ovl_dentry_is_opaque(dentry))
+			ovl_whiteout(upperdir, dentry);
+		dput(newdentry);
+		newdentry = ERR_PTR(err);
+	} else if (WARN_ON(!newdentry->d_inode)) {
+		/*
+		 * Not quite sure if non-instantiated dentry is legal or not.
+		 * VFS doesn't seem to care so check and warn here.
+		 */
+		dput(newdentry);
+		newdentry = ERR_PTR(-ENOENT);
+	}
+
+out:
+	return newdentry;
+
+}
+
+static int ovl_set_opaque(struct dentry *upperdentry)
+{
+	int err;
+	const struct cred *old_cred;
+	struct cred *override_cred;
+
+	override_cred = prepare_creds();
+	if (!override_cred)
+		return -ENOMEM;
+
+	/* CAP_SYS_ADMIN for setxattr of "trusted" namespace */
+	cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+	old_cred = override_creds(override_cred);
+	err = vfs_setxattr(upperdentry, ovl_opaque_xattr, "y", 1, 0);
+	revert_creds(old_cred);
+	put_cred(override_cred);
+
+	return err;
+}
+
+static int ovl_remove_opaque(struct dentry *upperdentry)
+{
+	int err;
+	const struct cred *old_cred;
+	struct cred *override_cred;
+
+	override_cred = prepare_creds();
+	if (!override_cred)
+		return -ENOMEM;
+
+	/* CAP_SYS_ADMIN for removexattr of "trusted" namespace */
+	cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+	old_cred = override_creds(override_cred);
+	err = vfs_removexattr(upperdentry, ovl_opaque_xattr);
+	revert_creds(old_cred);
+	put_cred(override_cred);
+
+	return err;
+}
+
+static int ovl_dir_getattr(struct vfsmount *mnt, struct dentry *dentry,
+			 struct kstat *stat)
+{
+	int err;
+	enum ovl_path_type type;
+	struct path realpath;
+
+	type = ovl_path_real(dentry, &realpath);
+	err = vfs_getattr(realpath.mnt, realpath.dentry, stat);
+	if (err)
+		return err;
+
+	stat->dev = dentry->d_sb->s_dev;
+	stat->ino = dentry->d_inode->i_ino;
+
+	/*
+	 * It's probably not worth it to count subdirs to get the
+	 * correct link count.  nlink=1 seems to pacify 'find' and
+	 * other utilities.
+	 */
+	if (type == OVL_PATH_MERGE)
+		stat->nlink = 1;
+
+	return 0;
+}
+
+static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev,
+			     const char *link)
+{
+	int err;
+	struct dentry *newdentry;
+	struct dentry *upperdir;
+	struct inode *inode;
+	struct kstat stat = {
+		.mode = mode,
+		.rdev = rdev,
+	};
+
+	err = -ENOMEM;
+	inode = ovl_new_inode(dentry->d_sb, mode, dentry->d_fsdata);
+	if (!inode)
+		goto out;
+
+	err = ovl_copy_up(dentry->d_parent);
+	if (err)
+		goto out_iput;
+
+	upperdir = ovl_dentry_upper(dentry->d_parent);
+	mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+
+	newdentry = ovl_upper_create(upperdir, dentry, &stat, link);
+	err = PTR_ERR(newdentry);
+	if (IS_ERR(newdentry))
+		goto out_unlock;
+
+	ovl_dentry_version_inc(dentry->d_parent);
+	if (ovl_dentry_is_opaque(dentry) && S_ISDIR(mode)) {
+		err = ovl_set_opaque(newdentry);
+		if (err) {
+			vfs_rmdir(upperdir->d_inode, newdentry);
+			ovl_whiteout(upperdir, dentry);
+			goto out_dput;
+		}
+	}
+	ovl_dentry_update(dentry, newdentry);
+	d_instantiate(dentry, inode);
+	inode = NULL;
+	newdentry = NULL;
+	err = 0;
+
+out_dput:
+	dput(newdentry);
+out_unlock:
+	mutex_unlock(&upperdir->d_inode->i_mutex);
+out_iput:
+	iput(inode);
+out:
+	return err;
+}
+
+static int ovl_create(struct inode *dir, struct dentry *dentry, umode_t mode,
+		      bool excl)
+{
+	return ovl_create_object(dentry, (mode & 07777) | S_IFREG, 0, NULL);
+}
+
+static int ovl_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	return ovl_create_object(dentry, (mode & 07777) | S_IFDIR, 0, NULL);
+}
+
+static int ovl_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
+		     dev_t rdev)
+{
+	return ovl_create_object(dentry, mode, rdev, NULL);
+}
+
+static int ovl_symlink(struct inode *dir, struct dentry *dentry,
+			 const char *link)
+{
+	return ovl_create_object(dentry, S_IFLNK, 0, link);
+}
+
+static int ovl_do_remove(struct dentry *dentry, bool is_dir)
+{
+	int err;
+	enum ovl_path_type type;
+	struct path realpath;
+	struct dentry *upperdir;
+
+	err = ovl_copy_up(dentry->d_parent);
+	if (err)
+		return err;
+
+	upperdir = ovl_dentry_upper(dentry->d_parent);
+	mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+	type = ovl_path_real(dentry, &realpath);
+	if (type != OVL_PATH_LOWER) {
+		err = -ESTALE;
+		if (realpath.dentry->d_parent != upperdir)
+			goto out_d_drop;
+
+		/* FIXME: create whiteout up front and rename to target */
+
+		if (is_dir)
+			err = vfs_rmdir(upperdir->d_inode, realpath.dentry);
+		else
+			err = vfs_unlink(upperdir->d_inode, realpath.dentry);
+		if (err)
+			goto out_d_drop;
+
+		ovl_dentry_version_inc(dentry->d_parent);
+	}
+
+	if (type != OVL_PATH_UPPER || ovl_dentry_is_opaque(dentry))
+		err = ovl_whiteout(upperdir, dentry);
+
+	/*
+	 * Keeping this dentry hashed would mean having to release
+	 * upperpath/lowerpath, which could only be done if we are the
+	 * sole user of this dentry.  Too tricky...  Just unhash for
+	 * now.
+	 */
+out_d_drop:
+	d_drop(dentry);
+	mutex_unlock(&upperdir->d_inode->i_mutex);
+
+	return err;
+}
+
+static int ovl_unlink(struct inode *dir, struct dentry *dentry)
+{
+	return ovl_do_remove(dentry, false);
+}
+
+
+static int ovl_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	int err;
+	enum ovl_path_type type;
+
+	type = ovl_path_type(dentry);
+	if (type != OVL_PATH_UPPER) {
+		err = ovl_check_empty_and_clear(dentry, type);
+		if (err)
+			return err;
+	}
+
+	return ovl_do_remove(dentry, true);
+}
+
+static int ovl_link(struct dentry *old, struct inode *newdir,
+		    struct dentry *new)
+{
+	int err;
+	struct dentry *olddentry;
+	struct dentry *newdentry;
+	struct dentry *upperdir;
+
+	err = ovl_copy_up(old);
+	if (err)
+		goto out;
+
+	err = ovl_copy_up(new->d_parent);
+	if (err)
+		goto out;
+
+	upperdir = ovl_dentry_upper(new->d_parent);
+	mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+	newdentry = ovl_lookup_create(upperdir, new);
+	err = PTR_ERR(newdentry);
+	if (IS_ERR(newdentry))
+		goto out_unlock;
+
+	olddentry = ovl_dentry_upper(old);
+	err = vfs_link(olddentry, upperdir->d_inode, newdentry);
+	if (!err) {
+		if (WARN_ON(!newdentry->d_inode)) {
+			dput(newdentry);
+			err = -ENOENT;
+			goto out_unlock;
+		}
+
+		ovl_dentry_version_inc(new->d_parent);
+		ovl_dentry_update(new, newdentry);
+
+		ihold(old->d_inode);
+		d_instantiate(new, old->d_inode);
+	} else {
+		if (ovl_dentry_is_opaque(new))
+			ovl_whiteout(upperdir, new);
+		dput(newdentry);
+	}
+out_unlock:
+	mutex_unlock(&upperdir->d_inode->i_mutex);
+out:
+	return err;
+
+}
+
+static int ovl_rename(struct inode *olddir, struct dentry *old,
+			struct inode *newdir, struct dentry *new)
+{
+	int err;
+	enum ovl_path_type old_type;
+	enum ovl_path_type new_type;
+	struct dentry *old_upperdir;
+	struct dentry *new_upperdir;
+	struct dentry *olddentry;
+	struct dentry *newdentry;
+	struct dentry *trap;
+	bool old_opaque;
+	bool new_opaque;
+	bool new_create = false;
+	bool is_dir = S_ISDIR(old->d_inode->i_mode);
+
+	/* Don't copy up directory trees */
+	old_type = ovl_path_type(old);
+	if (old_type != OVL_PATH_UPPER && is_dir)
+		return -EXDEV;
+
+	if (new->d_inode) {
+		new_type = ovl_path_type(new);
+
+		if (new_type == OVL_PATH_LOWER && old_type == OVL_PATH_LOWER) {
+			if (ovl_dentry_lower(old)->d_inode ==
+			    ovl_dentry_lower(new)->d_inode)
+				return 0;
+		}
+		if (new_type != OVL_PATH_LOWER && old_type != OVL_PATH_LOWER) {
+			if (ovl_dentry_upper(old)->d_inode ==
+			    ovl_dentry_upper(new)->d_inode)
+				return 0;
+		}
+
+		if (new_type != OVL_PATH_UPPER &&
+		    S_ISDIR(new->d_inode->i_mode)) {
+			err = ovl_check_empty_and_clear(new, new_type);
+			if (err)
+				return err;
+		}
+	} else {
+		new_type = OVL_PATH_UPPER;
+	}
+
+	err = ovl_copy_up(old);
+	if (err)
+		return err;
+
+	err = ovl_copy_up(new->d_parent);
+	if (err)
+		return err;
+
+	old_upperdir = ovl_dentry_upper(old->d_parent);
+	new_upperdir = ovl_dentry_upper(new->d_parent);
+
+	trap = lock_rename(new_upperdir, old_upperdir);
+
+	olddentry = ovl_dentry_upper(old);
+	newdentry = ovl_dentry_upper(new);
+	if (newdentry) {
+		dget(newdentry);
+	} else {
+		new_create = true;
+		newdentry = ovl_lookup_create(new_upperdir, new);
+		err = PTR_ERR(newdentry);
+		if (IS_ERR(newdentry))
+			goto out_unlock;
+	}
+
+	err = -ESTALE;
+	if (olddentry->d_parent != old_upperdir)
+		goto out_dput;
+	if (newdentry->d_parent != new_upperdir)
+		goto out_dput;
+	if (olddentry == trap)
+		goto out_dput;
+	if (newdentry == trap)
+		goto out_dput;
+
+	old_opaque = ovl_dentry_is_opaque(old);
+	new_opaque = ovl_dentry_is_opaque(new) || new_type != OVL_PATH_UPPER;
+
+	if (is_dir && !old_opaque && new_opaque) {
+		err = ovl_set_opaque(olddentry);
+		if (err)
+			goto out_dput;
+	}
+
+	err = vfs_rename(old_upperdir->d_inode, olddentry,
+			 new_upperdir->d_inode, newdentry);
+
+	if (err) {
+		if (new_create && ovl_dentry_is_opaque(new))
+			ovl_whiteout(new_upperdir, new);
+		if (is_dir && !old_opaque && new_opaque)
+			ovl_remove_opaque(olddentry);
+		goto out_dput;
+	}
+
+	if (old_type != OVL_PATH_UPPER || old_opaque)
+		err = ovl_whiteout(old_upperdir, old);
+	if (is_dir && old_opaque && !new_opaque)
+		ovl_remove_opaque(olddentry);
+
+	if (old_opaque != new_opaque)
+		ovl_dentry_set_opaque(old, new_opaque);
+
+	ovl_dentry_version_inc(old->d_parent);
+	ovl_dentry_version_inc(new->d_parent);
+
+out_dput:
+	dput(newdentry);
+out_unlock:
+	unlock_rename(new_upperdir, old_upperdir);
+	return err;
+}
+
+const struct inode_operations ovl_dir_inode_operations = {
+	.lookup		= ovl_lookup,
+	.atomic_open	= ovl_atomic_open,
+	.mkdir		= ovl_mkdir,
+	.symlink	= ovl_symlink,
+	.unlink		= ovl_unlink,
+	.rmdir		= ovl_rmdir,
+	.rename		= ovl_rename,
+	.link		= ovl_link,
+	.setattr	= ovl_setattr,
+	.create		= ovl_create,
+	.mknod		= ovl_mknod,
+	.permission	= ovl_permission,
+	.getattr	= ovl_dir_getattr,
+	.setxattr	= ovl_setxattr,
+	.getxattr	= ovl_getxattr,
+	.listxattr	= ovl_listxattr,
+	.removexattr	= ovl_removexattr,
+};
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
new file mode 100644
index 0000000..787761f
--- /dev/null
+++ b/fs/overlayfs/inode.c
@@ -0,0 +1,379 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/xattr.h>
+#include "overlayfs.h"
+
+int ovl_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	struct dentry *upperdentry;
+	int err;
+
+	if ((attr->ia_valid & ATTR_SIZE) && !ovl_dentry_upper(dentry))
+		err = ovl_copy_up_truncate(dentry, attr->ia_size);
+	else
+		err = ovl_copy_up(dentry);
+	if (err)
+		return err;
+
+	upperdentry = ovl_dentry_upper(dentry);
+
+	if (attr->ia_valid & (ATTR_KILL_SUID|ATTR_KILL_SGID))
+		attr->ia_valid &= ~ATTR_MODE;
+
+	mutex_lock(&upperdentry->d_inode->i_mutex);
+	err = notify_change(upperdentry, attr);
+	mutex_unlock(&upperdentry->d_inode->i_mutex);
+
+	return err;
+}
+
+static int ovl_getattr(struct vfsmount *mnt, struct dentry *dentry,
+			 struct kstat *stat)
+{
+	struct path realpath;
+
+	ovl_path_real(dentry, &realpath);
+	return vfs_getattr(realpath.mnt, realpath.dentry, stat);
+}
+
+int ovl_permission(struct inode *inode, int mask)
+{
+	struct ovl_entry *oe;
+	struct dentry *alias = NULL;
+	struct inode *realinode;
+	struct dentry *realdentry;
+	bool is_upper;
+	int err;
+
+	if (S_ISDIR(inode->i_mode)) {
+		oe = inode->i_private;
+	} else if (mask & MAY_NOT_BLOCK) {
+		return -ECHILD;
+	} else {
+		/*
+		 * For non-directories find an alias and get the info
+		 * from there.
+		 */
+		alias = d_find_any_alias(inode);
+		if (WARN_ON(!alias))
+			return -ENOENT;
+
+		oe = alias->d_fsdata;
+	}
+
+	realdentry = ovl_entry_real(oe, &is_upper);
+
+	/* Careful in RCU walk mode */
+	realinode = ACCESS_ONCE(realdentry->d_inode);
+	if (!realinode) {
+		WARN_ON(!(mask & MAY_NOT_BLOCK));
+		err = -ENOENT;
+		goto out_dput;
+	}
+
+	if (mask & MAY_WRITE) {
+		umode_t mode = realinode->i_mode;
+
+		/*
+		 * Writes will always be redirected to upper layer, so
+		 * ignore lower layer being read-only.
+		 *
+		 * If the overlay itself is read-only then proceed
+		 * with the permission check, don't return EROFS.
+		 * This will only happen if this is the lower layer of
+		 * another overlayfs.
+		 *
+		 * If upper fs becomes read-only after the overlay was
+		 * constructed return EROFS to prevent modification of
+		 * upper layer.
+		 */
+		err = -EROFS;
+		if (is_upper && !IS_RDONLY(inode) && IS_RDONLY(realinode) &&
+		    (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
+			goto out_dput;
+
+		/*
+		 * Nobody gets write access to an immutable file.
+		 */
+		err = -EACCES;
+		if (IS_IMMUTABLE(realinode))
+			goto out_dput;
+	}
+
+	if (realinode->i_op->permission)
+		err = realinode->i_op->permission(realinode, mask);
+	else
+		err = generic_permission(realinode, mask);
+out_dput:
+	dput(alias);
+	return err;
+}
+
+
+struct ovl_link_data {
+	struct dentry *realdentry;
+	void *cookie;
+};
+
+static void *ovl_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+	void *ret;
+	struct dentry *realdentry;
+	struct inode *realinode;
+
+	realdentry = ovl_dentry_real(dentry);
+	realinode = realdentry->d_inode;
+
+	if (WARN_ON(!realinode->i_op->follow_link))
+		return ERR_PTR(-EPERM);
+
+	ret = realinode->i_op->follow_link(realdentry, nd);
+	if (IS_ERR(ret))
+		return ret;
+
+	if (realinode->i_op->put_link) {
+		struct ovl_link_data *data;
+
+		data = kmalloc(sizeof(struct ovl_link_data), GFP_KERNEL);
+		if (!data) {
+			realinode->i_op->put_link(realdentry, nd, ret);
+			return ERR_PTR(-ENOMEM);
+		}
+		data->realdentry = realdentry;
+		data->cookie = ret;
+
+		return data;
+	} else {
+		return NULL;
+	}
+}
+
+static void ovl_put_link(struct dentry *dentry, struct nameidata *nd, void *c)
+{
+	struct inode *realinode;
+	struct ovl_link_data *data = c;
+
+	if (!data)
+		return;
+
+	realinode = data->realdentry->d_inode;
+	realinode->i_op->put_link(data->realdentry, nd, data->cookie);
+	kfree(data);
+}
+
+static int ovl_readlink(struct dentry *dentry, char __user *buf, int bufsiz)
+{
+	struct path realpath;
+	struct inode *realinode;
+
+	ovl_path_real(dentry, &realpath);
+	realinode = realpath.dentry->d_inode;
+
+	if (!realinode->i_op->readlink)
+		return -EINVAL;
+
+	touch_atime(&realpath);
+
+	return realinode->i_op->readlink(realpath.dentry, buf, bufsiz);
+}
+
+
+static bool ovl_is_private_xattr(const char *name)
+{
+	return strncmp(name, "trusted.overlay.", 14) == 0;
+}
+
+int ovl_setxattr(struct dentry *dentry, const char *name,
+		 const void *value, size_t size, int flags)
+{
+	int err;
+	struct dentry *upperdentry;
+
+	if (ovl_is_private_xattr(name))
+		return -EPERM;
+
+	err = ovl_copy_up(dentry);
+	if (err)
+		return err;
+
+	upperdentry = ovl_dentry_upper(dentry);
+	return  vfs_setxattr(upperdentry, name, value, size, flags);
+}
+
+ssize_t ovl_getxattr(struct dentry *dentry, const char *name,
+		     void *value, size_t size)
+{
+	if (ovl_path_type(dentry->d_parent) == OVL_PATH_MERGE &&
+	    ovl_is_private_xattr(name))
+		return -ENODATA;
+
+	return vfs_getxattr(ovl_dentry_real(dentry), name, value, size);
+}
+
+ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size)
+{
+	ssize_t res;
+	int off;
+
+	res = vfs_listxattr(ovl_dentry_real(dentry), list, size);
+	if (res <= 0 || size == 0)
+		return res;
+
+	if (ovl_path_type(dentry->d_parent) != OVL_PATH_MERGE)
+		return res;
+
+	/* filter out private xattrs */
+	for (off = 0; off < res;) {
+		char *s = list + off;
+		size_t slen = strlen(s) + 1;
+
+		BUG_ON(off + slen > res);
+
+		if (ovl_is_private_xattr(s)) {
+			res -= slen;
+			memmove(s, s + slen, res - off);
+		} else {
+			off += slen;
+		}
+	}
+
+	return res;
+}
+
+int ovl_removexattr(struct dentry *dentry, const char *name)
+{
+	int err;
+	struct path realpath;
+	enum ovl_path_type type;
+
+	if (ovl_path_type(dentry->d_parent) == OVL_PATH_MERGE &&
+	    ovl_is_private_xattr(name))
+		return -ENODATA;
+
+	type = ovl_path_real(dentry, &realpath);
+	if (type == OVL_PATH_LOWER) {
+		err = vfs_getxattr(realpath.dentry, name, NULL, 0);
+		if (err < 0)
+			return err;
+
+		err = ovl_copy_up(dentry);
+		if (err)
+			return err;
+
+		ovl_path_upper(dentry, &realpath);
+	}
+
+	return vfs_removexattr(realpath.dentry, name);
+}
+
+static bool ovl_open_need_copy_up(int flags, enum ovl_path_type type,
+				  struct dentry *realdentry)
+{
+	if (type != OVL_PATH_LOWER)
+		return false;
+
+	if (special_file(realdentry->d_inode->i_mode))
+		return false;
+
+	if (!(OPEN_FMODE(flags) & FMODE_WRITE) && !(flags & O_TRUNC))
+		return false;
+
+	return true;
+}
+
+static int ovl_dentry_open(struct dentry *dentry, struct file *file,
+		    const struct cred *cred)
+{
+	int err;
+	struct path realpath;
+	enum ovl_path_type type;
+
+	type = ovl_path_real(dentry, &realpath);
+	if (ovl_open_need_copy_up(file->f_flags, type, realpath.dentry)) {
+		if (file->f_flags & O_TRUNC)
+			err = ovl_copy_up_truncate(dentry, 0);
+		else
+			err = ovl_copy_up(dentry);
+		if (err)
+			return err;
+
+		ovl_path_upper(dentry, &realpath);
+	}
+
+	return vfs_open(&realpath, file, cred);
+}
+
+static const struct inode_operations ovl_file_inode_operations = {
+	.setattr	= ovl_setattr,
+	.permission	= ovl_permission,
+	.getattr	= ovl_getattr,
+	.setxattr	= ovl_setxattr,
+	.getxattr	= ovl_getxattr,
+	.listxattr	= ovl_listxattr,
+	.removexattr	= ovl_removexattr,
+	.dentry_open	= ovl_dentry_open,
+};
+
+static const struct inode_operations ovl_symlink_inode_operations = {
+	.setattr	= ovl_setattr,
+	.follow_link	= ovl_follow_link,
+	.put_link	= ovl_put_link,
+	.readlink	= ovl_readlink,
+	.getattr	= ovl_getattr,
+	.setxattr	= ovl_setxattr,
+	.getxattr	= ovl_getxattr,
+	.listxattr	= ovl_listxattr,
+	.removexattr	= ovl_removexattr,
+};
+
+struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
+			    struct ovl_entry *oe)
+{
+	struct inode *inode;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return NULL;
+
+	mode &= S_IFMT;
+
+	inode->i_ino = get_next_ino();
+	inode->i_mode = mode;
+	inode->i_flags |= S_NOATIME | S_NOCMTIME;
+
+	switch (mode) {
+	case S_IFDIR:
+		inode->i_private = oe;
+		inode->i_op = &ovl_dir_inode_operations;
+		inode->i_fop = &ovl_dir_operations;
+		break;
+
+	case S_IFLNK:
+		inode->i_op = &ovl_symlink_inode_operations;
+		break;
+
+	case S_IFREG:
+	case S_IFSOCK:
+	case S_IFBLK:
+	case S_IFCHR:
+	case S_IFIFO:
+		inode->i_op = &ovl_file_inode_operations;
+		break;
+
+	default:
+		WARN(1, "illegal file type: %i\n", mode);
+		inode = NULL;
+	}
+
+	return inode;
+
+}
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
new file mode 100644
index 0000000..fe1241d
--- /dev/null
+++ b/fs/overlayfs/overlayfs.h
@@ -0,0 +1,64 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+struct ovl_entry;
+
+enum ovl_path_type {
+	OVL_PATH_UPPER,
+	OVL_PATH_MERGE,
+	OVL_PATH_LOWER,
+};
+
+extern const char *ovl_opaque_xattr;
+extern const char *ovl_whiteout_xattr;
+extern const struct dentry_operations ovl_dentry_operations;
+
+enum ovl_path_type ovl_path_type(struct dentry *dentry);
+u64 ovl_dentry_version_get(struct dentry *dentry);
+void ovl_dentry_version_inc(struct dentry *dentry);
+void ovl_path_upper(struct dentry *dentry, struct path *path);
+void ovl_path_lower(struct dentry *dentry, struct path *path);
+enum ovl_path_type ovl_path_real(struct dentry *dentry, struct path *path);
+struct dentry *ovl_dentry_upper(struct dentry *dentry);
+struct dentry *ovl_dentry_lower(struct dentry *dentry);
+struct dentry *ovl_dentry_real(struct dentry *dentry);
+struct dentry *ovl_entry_real(struct ovl_entry *oe, bool *is_upper);
+bool ovl_dentry_is_opaque(struct dentry *dentry);
+void ovl_dentry_set_opaque(struct dentry *dentry, bool opaque);
+bool ovl_is_whiteout(struct dentry *dentry);
+void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry);
+struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
+			  unsigned int flags);
+struct file *ovl_path_open(struct path *path, int flags);
+
+struct dentry *ovl_upper_create(struct dentry *upperdir, struct dentry *dentry,
+				struct kstat *stat, const char *link);
+
+/* readdir.c */
+extern const struct file_operations ovl_dir_operations;
+int ovl_check_empty_and_clear(struct dentry *dentry, enum ovl_path_type type);
+
+/* inode.c */
+int ovl_setattr(struct dentry *dentry, struct iattr *attr);
+int ovl_permission(struct inode *inode, int mask);
+int ovl_setxattr(struct dentry *dentry, const char *name,
+		 const void *value, size_t size, int flags);
+ssize_t ovl_getxattr(struct dentry *dentry, const char *name,
+		     void *value, size_t size);
+ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size);
+int ovl_removexattr(struct dentry *dentry, const char *name);
+
+struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
+			    struct ovl_entry *oe);
+/* dir.c */
+extern const struct inode_operations ovl_dir_inode_operations;
+
+/* copy_up.c */
+int ovl_copy_up(struct dentry *dentry);
+int ovl_copy_up_truncate(struct dentry *dentry, loff_t size);
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
new file mode 100644
index 0000000..0797efb
--- /dev/null
+++ b/fs/overlayfs/readdir.c
@@ -0,0 +1,566 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/xattr.h>
+#include <linux/rbtree.h>
+#include <linux/security.h>
+#include <linux/cred.h>
+#include "overlayfs.h"
+
+struct ovl_cache_entry {
+	const char *name;
+	unsigned int len;
+	unsigned int type;
+	u64 ino;
+	bool is_whiteout;
+	struct list_head l_node;
+	struct rb_node node;
+};
+
+struct ovl_readdir_data {
+	struct rb_root *root;
+	struct list_head *list;
+	struct list_head *middle;
+	struct dentry *dir;
+	int count;
+	int err;
+};
+
+struct ovl_dir_file {
+	bool is_real;
+	bool is_cached;
+	struct list_head cursor;
+	u64 cache_version;
+	struct list_head cache;
+	struct file *realfile;
+};
+
+static struct ovl_cache_entry *ovl_cache_entry_from_node(struct rb_node *n)
+{
+	return container_of(n, struct ovl_cache_entry, node);
+}
+
+static struct ovl_cache_entry *ovl_cache_entry_find(struct rb_root *root,
+						    const char *name, int len)
+{
+	struct rb_node *node = root->rb_node;
+	int cmp;
+
+	while (node) {
+		struct ovl_cache_entry *p = ovl_cache_entry_from_node(node);
+
+		cmp = strncmp(name, p->name, len);
+		if (cmp > 0)
+			node = p->node.rb_right;
+		else if (cmp < 0 || len < p->len)
+			node = p->node.rb_left;
+		else
+			return p;
+	}
+
+	return NULL;
+}
+
+static struct ovl_cache_entry *ovl_cache_entry_new(const char *name, int len,
+						   u64 ino, unsigned int d_type)
+{
+	struct ovl_cache_entry *p;
+
+	p = kmalloc(sizeof(*p) + len + 1, GFP_KERNEL);
+	if (p) {
+		char *name_copy = (char *) (p + 1);
+		memcpy(name_copy, name, len);
+		name_copy[len] = '\0';
+		p->name = name_copy;
+		p->len = len;
+		p->type = d_type;
+		p->ino = ino;
+		p->is_whiteout = false;
+	}
+
+	return p;
+}
+
+static int ovl_cache_entry_add_rb(struct ovl_readdir_data *rdd,
+				  const char *name, int len, u64 ino,
+				  unsigned int d_type)
+{
+	struct rb_node **newp = &rdd->root->rb_node;
+	struct rb_node *parent = NULL;
+	struct ovl_cache_entry *p;
+
+	while (*newp) {
+		int cmp;
+		struct ovl_cache_entry *tmp;
+
+		parent = *newp;
+		tmp = ovl_cache_entry_from_node(*newp);
+		cmp = strncmp(name, tmp->name, len);
+		if (cmp > 0)
+			newp = &tmp->node.rb_right;
+		else if (cmp < 0 || len < tmp->len)
+			newp = &tmp->node.rb_left;
+		else
+			return 0;
+	}
+
+	p = ovl_cache_entry_new(name, len, ino, d_type);
+	if (p == NULL)
+		return -ENOMEM;
+
+	list_add_tail(&p->l_node, rdd->list);
+	rb_link_node(&p->node, parent, newp);
+	rb_insert_color(&p->node, rdd->root);
+
+	return 0;
+}
+
+static int ovl_fill_lower(void *buf, const char *name, int namelen,
+			    loff_t offset, u64 ino, unsigned int d_type)
+{
+	struct ovl_readdir_data *rdd = buf;
+	struct ovl_cache_entry *p;
+
+	rdd->count++;
+	p = ovl_cache_entry_find(rdd->root, name, namelen);
+	if (p) {
+		list_move_tail(&p->l_node, rdd->middle);
+	} else {
+		p = ovl_cache_entry_new(name, namelen, ino, d_type);
+		if (p == NULL)
+			rdd->err = -ENOMEM;
+		else
+			list_add_tail(&p->l_node, rdd->middle);
+	}
+
+	return rdd->err;
+}
+
+static void ovl_cache_free(struct list_head *list)
+{
+	struct ovl_cache_entry *p;
+	struct ovl_cache_entry *n;
+
+	list_for_each_entry_safe(p, n, list, l_node)
+		kfree(p);
+
+	INIT_LIST_HEAD(list);
+}
+
+static int ovl_fill_upper(void *buf, const char *name, int namelen,
+			  loff_t offset, u64 ino, unsigned int d_type)
+{
+	struct ovl_readdir_data *rdd = buf;
+
+	rdd->count++;
+	return ovl_cache_entry_add_rb(rdd, name, namelen, ino, d_type);
+}
+
+static inline int ovl_dir_read(struct path *realpath,
+			       struct ovl_readdir_data *rdd, filldir_t filler)
+{
+	struct file *realfile;
+	int err;
+
+	realfile = ovl_path_open(realpath, O_RDONLY | O_DIRECTORY);
+	if (IS_ERR(realfile))
+		return PTR_ERR(realfile);
+
+	do {
+		rdd->count = 0;
+		rdd->err = 0;
+		err = vfs_readdir(realfile, filler, rdd);
+		if (err >= 0)
+			err = rdd->err;
+	} while (!err && rdd->count);
+	fput(realfile);
+
+	return 0;
+}
+
+static void ovl_dir_reset(struct file *file)
+{
+	struct ovl_dir_file *od = file->private_data;
+	enum ovl_path_type type = ovl_path_type(file->f_path.dentry);
+
+	if (ovl_dentry_version_get(file->f_path.dentry) != od->cache_version) {
+		list_del_init(&od->cursor);
+		ovl_cache_free(&od->cache);
+		od->is_cached = false;
+	}
+	WARN_ON(!od->is_real && type != OVL_PATH_MERGE);
+	if (od->is_real && type == OVL_PATH_MERGE) {
+		fput(od->realfile);
+		od->realfile = NULL;
+		od->is_real = false;
+	}
+}
+
+static int ovl_dir_mark_whiteouts(struct ovl_readdir_data *rdd)
+{
+	struct ovl_cache_entry *p;
+	struct dentry *dentry;
+	const struct cred *old_cred;
+	struct cred *override_cred;
+
+	override_cred = prepare_creds();
+	if (!override_cred) {
+		ovl_cache_free(rdd->list);
+		return -ENOMEM;
+	}
+
+	/*
+	 * CAP_SYS_ADMIN for getxattr
+	 * CAP_DAC_OVERRIDE for lookup
+	 */
+	cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+	cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
+	old_cred = override_creds(override_cred);
+
+	mutex_lock(&rdd->dir->d_inode->i_mutex);
+	list_for_each_entry(p, rdd->list, l_node) {
+		if (p->type != DT_LNK)
+			continue;
+
+		dentry = lookup_one_len(p->name, rdd->dir, p->len);
+		if (IS_ERR(dentry))
+			continue;
+
+		p->is_whiteout = ovl_is_whiteout(dentry);
+		dput(dentry);
+	}
+	mutex_unlock(&rdd->dir->d_inode->i_mutex);
+
+	revert_creds(old_cred);
+	put_cred(override_cred);
+
+	return 0;
+}
+
+static inline int ovl_dir_read_merged(struct path *upperpath,
+				      struct path *lowerpath,
+				      struct ovl_readdir_data *rdd)
+{
+	int err;
+	struct rb_root root = RB_ROOT;
+	struct list_head middle;
+
+	rdd->root = &root;
+	if (upperpath->dentry) {
+		rdd->dir = upperpath->dentry;
+		err = ovl_dir_read(upperpath, rdd, ovl_fill_upper);
+		if (err)
+			goto out;
+
+		err = ovl_dir_mark_whiteouts(rdd);
+		if (err)
+			goto out;
+	}
+	/*
+	 * Insert lowerpath entries before upperpath ones, this allows
+	 * offsets to be reasonably constant
+	 */
+	list_add(&middle, rdd->list);
+	rdd->middle = &middle;
+	err = ovl_dir_read(lowerpath, rdd, ovl_fill_lower);
+	list_del(&middle);
+out:
+	rdd->root = NULL;
+
+	return err;
+}
+
+static void ovl_seek_cursor(struct ovl_dir_file *od, loff_t pos)
+{
+	struct list_head *l;
+	loff_t off;
+
+	l = od->cache.next;
+	for (off = 0; off < pos; off++) {
+		if (l == &od->cache)
+			break;
+		l = l->next;
+	}
+	list_move_tail(&od->cursor, l);
+}
+
+static int ovl_readdir(struct file *file, void *buf, filldir_t filler)
+{
+	struct ovl_dir_file *od = file->private_data;
+	int res;
+
+	if (!file->f_pos)
+		ovl_dir_reset(file);
+
+	if (od->is_real) {
+		res = vfs_readdir(od->realfile, filler, buf);
+		file->f_pos = od->realfile->f_pos;
+
+		return res;
+	}
+
+	if (!od->is_cached) {
+		struct path lowerpath;
+		struct path upperpath;
+		struct ovl_readdir_data rdd = { .list = &od->cache };
+
+		ovl_path_lower(file->f_path.dentry, &lowerpath);
+		ovl_path_upper(file->f_path.dentry, &upperpath);
+
+		res = ovl_dir_read_merged(&upperpath, &lowerpath, &rdd);
+		if (res) {
+			ovl_cache_free(rdd.list);
+			return res;
+		}
+
+		od->cache_version = ovl_dentry_version_get(file->f_path.dentry);
+		od->is_cached = true;
+
+		ovl_seek_cursor(od, file->f_pos);
+	}
+
+	while (od->cursor.next != &od->cache) {
+		int over;
+		loff_t off;
+		struct ovl_cache_entry *p;
+
+		p = list_entry(od->cursor.next, struct ovl_cache_entry, l_node);
+		off = file->f_pos;
+		if (!p->is_whiteout) {
+			over = filler(buf, p->name, p->len, off, p->ino,
+				      p->type);
+			if (over)
+				break;
+		}
+		file->f_pos++;
+		list_move(&od->cursor, &p->l_node);
+	}
+
+	return 0;
+}
+
+static loff_t ovl_dir_llseek(struct file *file, loff_t offset, int origin)
+{
+	loff_t res;
+	struct ovl_dir_file *od = file->private_data;
+
+	mutex_lock(&file->f_dentry->d_inode->i_mutex);
+	if (!file->f_pos)
+		ovl_dir_reset(file);
+
+	if (od->is_real) {
+		res = vfs_llseek(od->realfile, offset, origin);
+		file->f_pos = od->realfile->f_pos;
+	} else {
+		res = -EINVAL;
+
+		switch (origin) {
+		case SEEK_CUR:
+			offset += file->f_pos;
+			break;
+		case SEEK_SET:
+			break;
+		default:
+			goto out_unlock;
+		}
+		if (offset < 0)
+			goto out_unlock;
+
+		if (offset != file->f_pos) {
+			file->f_pos = offset;
+			if (od->is_cached)
+				ovl_seek_cursor(od, offset);
+		}
+		res = offset;
+	}
+out_unlock:
+	mutex_unlock(&file->f_dentry->d_inode->i_mutex);
+
+	return res;
+}
+
+static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
+			 int datasync)
+{
+	struct ovl_dir_file *od = file->private_data;
+
+	/* May need to reopen directory if it got copied up */
+	if (!od->realfile) {
+		struct path upperpath;
+
+		ovl_path_upper(file->f_path.dentry, &upperpath);
+		od->realfile = ovl_path_open(&upperpath, O_RDONLY);
+		if (IS_ERR(od->realfile))
+			return PTR_ERR(od->realfile);
+	}
+
+	return vfs_fsync_range(od->realfile, start, end, datasync);
+}
+
+static int ovl_dir_release(struct inode *inode, struct file *file)
+{
+	struct ovl_dir_file *od = file->private_data;
+
+	list_del(&od->cursor);
+	ovl_cache_free(&od->cache);
+	if (od->realfile)
+		fput(od->realfile);
+	kfree(od);
+
+	return 0;
+}
+
+static int ovl_dir_open(struct inode *inode, struct file *file)
+{
+	struct path realpath;
+	struct file *realfile;
+	struct ovl_dir_file *od;
+	enum ovl_path_type type;
+
+	od = kzalloc(sizeof(struct ovl_dir_file), GFP_KERNEL);
+	if (!od)
+		return -ENOMEM;
+
+	type = ovl_path_real(file->f_path.dentry, &realpath);
+	realfile = ovl_path_open(&realpath, file->f_flags);
+	if (IS_ERR(realfile)) {
+		kfree(od);
+		return PTR_ERR(realfile);
+	}
+	INIT_LIST_HEAD(&od->cache);
+	INIT_LIST_HEAD(&od->cursor);
+	od->is_cached = false;
+	od->realfile = realfile;
+	od->is_real = (type != OVL_PATH_MERGE);
+	file->private_data = od;
+
+	return 0;
+}
+
+const struct file_operations ovl_dir_operations = {
+	.read		= generic_read_dir,
+	.open		= ovl_dir_open,
+	.readdir	= ovl_readdir,
+	.llseek		= ovl_dir_llseek,
+	.fsync		= ovl_dir_fsync,
+	.release	= ovl_dir_release,
+};
+
+static int ovl_check_empty_dir(struct dentry *dentry, struct list_head *list)
+{
+	int err;
+	struct path lowerpath;
+	struct path upperpath;
+	struct ovl_cache_entry *p;
+	struct ovl_readdir_data rdd = { .list = list };
+
+	ovl_path_upper(dentry, &upperpath);
+	ovl_path_lower(dentry, &lowerpath);
+
+	err = ovl_dir_read_merged(&upperpath, &lowerpath, &rdd);
+	if (err)
+		return err;
+
+	err = 0;
+
+	list_for_each_entry(p, list, l_node) {
+		if (p->is_whiteout)
+			continue;
+
+		if (p->name[0] == '.') {
+			if (p->len == 1)
+				continue;
+			if (p->len == 2 && p->name[1] == '.')
+				continue;
+		}
+		err = -ENOTEMPTY;
+		break;
+	}
+
+	return err;
+}
+
+static int ovl_remove_whiteouts(struct dentry *dir, struct list_head *list)
+{
+	struct path upperpath;
+	struct dentry *upperdir;
+	struct ovl_cache_entry *p;
+	const struct cred *old_cred;
+	struct cred *override_cred;
+	int err;
+
+	ovl_path_upper(dir, &upperpath);
+	upperdir = upperpath.dentry;
+
+	override_cred = prepare_creds();
+	if (!override_cred)
+		return -ENOMEM;
+
+	/*
+	 * CAP_DAC_OVERRIDE for lookup and unlink
+	 * CAP_SYS_ADMIN for setxattr of "trusted" namespace
+	 * CAP_FOWNER for unlink in sticky directory
+	 */
+	cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
+	cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+	cap_raise(override_cred->cap_effective, CAP_FOWNER);
+	old_cred = override_creds(override_cred);
+
+	err = vfs_setxattr(upperdir, ovl_opaque_xattr, "y", 1, 0);
+	if (err)
+		goto out_revert_creds;
+
+	mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+	list_for_each_entry(p, list, l_node) {
+		struct dentry *dentry;
+		int ret;
+
+		if (!p->is_whiteout)
+			continue;
+
+		dentry = lookup_one_len(p->name, upperdir, p->len);
+		if (IS_ERR(dentry)) {
+			printk(KERN_WARNING
+			    "overlayfs: failed to lookup whiteout %.*s: %li\n",
+			    p->len, p->name, PTR_ERR(dentry));
+			continue;
+		}
+		ret = vfs_unlink(upperdir->d_inode, dentry);
+		dput(dentry);
+		if (ret)
+			printk(KERN_WARNING
+			    "overlayfs: failed to unlink whiteout %.*s: %i\n",
+			    p->len, p->name, ret);
+	}
+	mutex_unlock(&upperdir->d_inode->i_mutex);
+
+out_revert_creds:
+	revert_creds(old_cred);
+	put_cred(override_cred);
+
+	return err;
+}
+
+int ovl_check_empty_and_clear(struct dentry *dentry, enum ovl_path_type type)
+{
+	int err;
+	LIST_HEAD(list);
+
+	err = ovl_check_empty_dir(dentry, &list);
+	if (!err && type == OVL_PATH_MERGE)
+		err = ovl_remove_whiteouts(dentry, &list);
+
+	ovl_cache_free(&list);
+
+	return err;
+}
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
new file mode 100644
index 0000000..02deecd
--- /dev/null
+++ b/fs/overlayfs/super.c
@@ -0,0 +1,611 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/xattr.h>
+#include <linux/security.h>
+#include <linux/mount.h>
+#include <linux/slab.h>
+#include <linux/parser.h>
+#include <linux/module.h>
+#include <linux/cred.h>
+#include <linux/sched.h>
+#include "overlayfs.h"
+
+MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
+MODULE_DESCRIPTION("Overlay filesystem");
+MODULE_LICENSE("GPL");
+
+struct ovl_fs {
+	struct vfsmount *upper_mnt;
+	struct vfsmount *lower_mnt;
+};
+
+struct ovl_entry {
+	/*
+	 * Keep "double reference" on upper dentries, so that
+	 * d_delete() doesn't think it's OK to reset d_inode to NULL.
+	 */
+	struct dentry *__upperdentry;
+	struct dentry *lowerdentry;
+	union {
+		struct {
+			u64 version;
+			bool opaque;
+		};
+		struct rcu_head rcu;
+	};
+};
+
+const char *ovl_whiteout_xattr = "trusted.overlay.whiteout";
+const char *ovl_opaque_xattr = "trusted.overlay.opaque";
+
+
+enum ovl_path_type ovl_path_type(struct dentry *dentry)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+
+	if (oe->__upperdentry) {
+		if (oe->lowerdentry && S_ISDIR(dentry->d_inode->i_mode))
+			return OVL_PATH_MERGE;
+		else
+			return OVL_PATH_UPPER;
+	} else {
+		return OVL_PATH_LOWER;
+	}
+}
+
+static struct dentry *ovl_upperdentry_dereference(struct ovl_entry *oe)
+{
+	struct dentry *upperdentry = ACCESS_ONCE(oe->__upperdentry);
+	smp_read_barrier_depends();
+	return upperdentry;
+}
+
+void ovl_path_upper(struct dentry *dentry, struct path *path)
+{
+	struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
+	struct ovl_entry *oe = dentry->d_fsdata;
+
+	path->mnt = ofs->upper_mnt;
+	path->dentry = ovl_upperdentry_dereference(oe);
+}
+
+void ovl_path_lower(struct dentry *dentry, struct path *path)
+{
+	struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
+	struct ovl_entry *oe = dentry->d_fsdata;
+
+	path->mnt = ofs->lower_mnt;
+	path->dentry = oe->lowerdentry;
+}
+
+enum ovl_path_type ovl_path_real(struct dentry *dentry, struct path *path)
+{
+
+	enum ovl_path_type type = ovl_path_type(dentry);
+
+	if (type == OVL_PATH_LOWER)
+		ovl_path_lower(dentry, path);
+	else
+		ovl_path_upper(dentry, path);
+
+	return type;
+}
+
+struct dentry *ovl_dentry_upper(struct dentry *dentry)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+
+	return ovl_upperdentry_dereference(oe);
+}
+
+struct dentry *ovl_dentry_lower(struct dentry *dentry)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+
+	return oe->lowerdentry;
+}
+
+struct dentry *ovl_dentry_real(struct dentry *dentry)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+	struct dentry *realdentry;
+
+	realdentry = ovl_upperdentry_dereference(oe);
+	if (!realdentry)
+		realdentry = oe->lowerdentry;
+
+	return realdentry;
+}
+
+struct dentry *ovl_entry_real(struct ovl_entry *oe, bool *is_upper)
+{
+	struct dentry *realdentry;
+
+	realdentry = ovl_upperdentry_dereference(oe);
+	if (realdentry) {
+		*is_upper = true;
+	} else {
+		realdentry = oe->lowerdentry;
+		*is_upper = false;
+	}
+	return realdentry;
+}
+
+bool ovl_dentry_is_opaque(struct dentry *dentry)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+	return oe->opaque;
+}
+
+void ovl_dentry_set_opaque(struct dentry *dentry, bool opaque)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+	oe->opaque = opaque;
+}
+
+void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+
+	WARN_ON(!mutex_is_locked(&upperdentry->d_parent->d_inode->i_mutex));
+	WARN_ON(oe->__upperdentry);
+	BUG_ON(!upperdentry->d_inode);
+	smp_wmb();
+	oe->__upperdentry = dget(upperdentry);
+}
+
+void ovl_dentry_version_inc(struct dentry *dentry)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+
+	WARN_ON(!mutex_is_locked(&dentry->d_inode->i_mutex));
+	oe->version++;
+}
+
+u64 ovl_dentry_version_get(struct dentry *dentry)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+
+	WARN_ON(!mutex_is_locked(&dentry->d_inode->i_mutex));
+	return oe->version;
+}
+
+bool ovl_is_whiteout(struct dentry *dentry)
+{
+	int res;
+	char val;
+
+	if (!dentry)
+		return false;
+	if (!dentry->d_inode)
+		return false;
+	if (!S_ISLNK(dentry->d_inode->i_mode))
+		return false;
+
+	res = vfs_getxattr(dentry, ovl_whiteout_xattr, &val, 1);
+	if (res == 1 && val == 'y')
+		return true;
+
+	return false;
+}
+
+static bool ovl_is_opaquedir(struct dentry *dentry)
+{
+	int res;
+	char val;
+
+	if (!S_ISDIR(dentry->d_inode->i_mode))
+		return false;
+
+	res = vfs_getxattr(dentry, ovl_opaque_xattr, &val, 1);
+	if (res == 1 && val == 'y')
+		return true;
+
+	return false;
+}
+
+static void ovl_entry_free(struct rcu_head *head)
+{
+	struct ovl_entry *oe = container_of(head, struct ovl_entry, rcu);
+	kfree(oe);
+}
+
+static void ovl_dentry_release(struct dentry *dentry)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+
+	if (oe) {
+		dput(oe->__upperdentry);
+		dput(oe->__upperdentry);
+		dput(oe->lowerdentry);
+		call_rcu(&oe->rcu, ovl_entry_free);
+	}
+}
+
+const struct dentry_operations ovl_dentry_operations = {
+	.d_release = ovl_dentry_release,
+};
+
+static struct ovl_entry *ovl_alloc_entry(void)
+{
+	return kzalloc(sizeof(struct ovl_entry), GFP_KERNEL);
+}
+
+static inline struct dentry *ovl_lookup_real(struct dentry *dir,
+					     struct qstr *name)
+{
+	struct dentry *dentry;
+
+	mutex_lock(&dir->d_inode->i_mutex);
+	dentry = lookup_one_len(name->name, dir, name->len);
+	mutex_unlock(&dir->d_inode->i_mutex);
+
+	if (IS_ERR(dentry)) {
+		if (PTR_ERR(dentry) == -ENOENT)
+			dentry = NULL;
+	} else if (!dentry->d_inode) {
+		dput(dentry);
+		dentry = NULL;
+	}
+	return dentry;
+}
+
+static int ovl_do_lookup(struct dentry *dentry)
+{
+	struct ovl_entry *oe;
+	struct dentry *upperdir;
+	struct dentry *lowerdir;
+	struct dentry *upperdentry = NULL;
+	struct dentry *lowerdentry = NULL;
+	struct inode *inode = NULL;
+	int err;
+
+	err = -ENOMEM;
+	oe = ovl_alloc_entry();
+	if (!oe)
+		goto out;
+
+	upperdir = ovl_dentry_upper(dentry->d_parent);
+	lowerdir = ovl_dentry_lower(dentry->d_parent);
+
+	if (upperdir) {
+		upperdentry = ovl_lookup_real(upperdir, &dentry->d_name);
+		err = PTR_ERR(upperdentry);
+		if (IS_ERR(upperdentry))
+			goto out_put_dir;
+
+		if (lowerdir && upperdentry &&
+		    (S_ISLNK(upperdentry->d_inode->i_mode) ||
+		     S_ISDIR(upperdentry->d_inode->i_mode))) {
+			const struct cred *old_cred;
+			struct cred *override_cred;
+
+			err = -ENOMEM;
+			override_cred = prepare_creds();
+			if (!override_cred)
+				goto out_dput_upper;
+
+			/* CAP_SYS_ADMIN needed for getxattr */
+			cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+			old_cred = override_creds(override_cred);
+
+			if (ovl_is_opaquedir(upperdentry)) {
+				oe->opaque = true;
+			} else if (ovl_is_whiteout(upperdentry)) {
+				dput(upperdentry);
+				upperdentry = NULL;
+				oe->opaque = true;
+			}
+			revert_creds(old_cred);
+			put_cred(override_cred);
+		}
+	}
+	if (lowerdir && !oe->opaque) {
+		lowerdentry = ovl_lookup_real(lowerdir, &dentry->d_name);
+		err = PTR_ERR(lowerdentry);
+		if (IS_ERR(lowerdentry))
+			goto out_dput_upper;
+	}
+
+	if (lowerdentry && upperdentry &&
+	    (!S_ISDIR(upperdentry->d_inode->i_mode) ||
+	     !S_ISDIR(lowerdentry->d_inode->i_mode))) {
+		dput(lowerdentry);
+		lowerdentry = NULL;
+		oe->opaque = true;
+	}
+
+	if (lowerdentry || upperdentry) {
+		struct dentry *realdentry;
+
+		realdentry = upperdentry ? upperdentry : lowerdentry;
+		err = -ENOMEM;
+		inode = ovl_new_inode(dentry->d_sb, realdentry->d_inode->i_mode,
+				      oe);
+		if (!inode)
+			goto out_dput;
+	}
+
+	if (upperdentry)
+		oe->__upperdentry = dget(upperdentry);
+
+	if (lowerdentry)
+		oe->lowerdentry = lowerdentry;
+
+	dentry->d_fsdata = oe;
+	dentry->d_op = &ovl_dentry_operations;
+	d_add(dentry, inode);
+
+	return 0;
+
+out_dput:
+	dput(lowerdentry);
+out_dput_upper:
+	dput(upperdentry);
+out_put_dir:
+	kfree(oe);
+out:
+	return err;
+}
+
+struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
+			  unsigned int flags)
+{
+	int err = ovl_do_lookup(dentry);
+
+	if (err)
+		return ERR_PTR(err);
+
+	return NULL;
+}
+
+struct file *ovl_path_open(struct path *path, int flags)
+{
+	path_get(path);
+	return dentry_open(path, flags, current_cred());
+}
+
+static void ovl_put_super(struct super_block *sb)
+{
+	struct ovl_fs *ufs = sb->s_fs_info;
+
+	if (!(sb->s_flags & MS_RDONLY))
+		mnt_drop_write(ufs->upper_mnt);
+
+	mntput(ufs->upper_mnt);
+	mntput(ufs->lower_mnt);
+
+	kfree(ufs);
+}
+
+static int ovl_remount_fs(struct super_block *sb, int *flagsp, char *data)
+{
+	int flags = *flagsp;
+	struct ovl_fs *ufs = sb->s_fs_info;
+
+	/* When remounting rw or ro, we need to adjust the write access to the
+	 * upper fs.
+	 */
+	if (((flags ^ sb->s_flags) & MS_RDONLY) == 0)
+		/* No change to readonly status */
+		return 0;
+
+	if (flags & MS_RDONLY) {
+		mnt_drop_write(ufs->upper_mnt);
+		return 0;
+	} else
+		return mnt_want_write(ufs->upper_mnt);
+}
+
+static const struct super_operations ovl_super_operations = {
+	.put_super	= ovl_put_super,
+	.remount_fs	= ovl_remount_fs,
+};
+
+struct ovl_config {
+	char *lowerdir;
+	char *upperdir;
+};
+
+enum {
+	Opt_lowerdir,
+	Opt_upperdir,
+	Opt_err,
+};
+
+static const match_table_t ovl_tokens = {
+	{Opt_lowerdir,			"lowerdir=%s"},
+	{Opt_upperdir,			"upperdir=%s"},
+	{Opt_err,			NULL}
+};
+
+static int ovl_parse_opt(char *opt, struct ovl_config *config)
+{
+	char *p;
+
+	config->upperdir = NULL;
+	config->lowerdir = NULL;
+
+	while ((p = strsep(&opt, ",")) != NULL) {
+		int token;
+		substring_t args[MAX_OPT_ARGS];
+
+		if (!*p)
+			continue;
+
+		token = match_token(p, ovl_tokens, args);
+		switch (token) {
+		case Opt_upperdir:
+			kfree(config->upperdir);
+			config->upperdir = match_strdup(&args[0]);
+			if (!config->upperdir)
+				return -ENOMEM;
+			break;
+
+		case Opt_lowerdir:
+			kfree(config->lowerdir);
+			config->lowerdir = match_strdup(&args[0]);
+			if (!config->lowerdir)
+				return -ENOMEM;
+			break;
+
+		default:
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int ovl_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct path lowerpath;
+	struct path upperpath;
+	struct inode *root_inode;
+	struct dentry *root_dentry;
+	struct ovl_entry *oe;
+	struct ovl_fs *ufs;
+	struct ovl_config config;
+	int err;
+
+	err = ovl_parse_opt((char *) data, &config);
+	if (err)
+		goto out;
+
+	err = -EINVAL;
+	if (!config.upperdir || !config.lowerdir) {
+		printk(KERN_ERR "overlayfs: missing upperdir or lowerdir\n");
+		goto out_free_config;
+	}
+
+	err = -ENOMEM;
+	ufs = kmalloc(sizeof(struct ovl_fs), GFP_KERNEL);
+	if (!ufs)
+		goto out_free_config;
+
+	oe = ovl_alloc_entry();
+	if (oe == NULL)
+		goto out_free_ufs;
+
+	err = kern_path(config.upperdir, LOOKUP_FOLLOW, &upperpath);
+	if (err)
+		goto out_free_oe;
+
+	err = kern_path(config.lowerdir, LOOKUP_FOLLOW, &lowerpath);
+	if (err)
+		goto out_put_upperpath;
+
+	err = -ENOTDIR;
+	if (!S_ISDIR(upperpath.dentry->d_inode->i_mode) ||
+	    !S_ISDIR(lowerpath.dentry->d_inode->i_mode))
+		goto out_put_lowerpath;
+
+	ufs->upper_mnt = clone_private_mount(&upperpath);
+	err = PTR_ERR(ufs->upper_mnt);
+	if (IS_ERR(ufs->upper_mnt)) {
+		printk(KERN_ERR "overlayfs: failed to clone upperpath\n");
+		goto out_put_lowerpath;
+	}
+
+	ufs->lower_mnt = clone_private_mount(&lowerpath);
+	err = PTR_ERR(ufs->lower_mnt);
+	if (IS_ERR(ufs->lower_mnt)) {
+		printk(KERN_ERR "overlayfs: failed to clone lowerpath\n");
+		goto out_put_upper_mnt;
+	}
+
+	/*
+	 * Make lower_mnt R/O.  That way fchmod/fchown on lower file
+	 * will fail instead of modifying lower fs.
+	 */
+	ufs->lower_mnt->mnt_flags |= MNT_READONLY;
+
+	/* If the upper fs is r/o, we mark overlayfs r/o too */
+	if (ufs->upper_mnt->mnt_sb->s_flags & MS_RDONLY)
+		sb->s_flags |= MS_RDONLY;
+
+	if (!(sb->s_flags & MS_RDONLY)) {
+		err = mnt_want_write(ufs->upper_mnt);
+		if (err)
+			goto out_put_lower_mnt;
+	}
+
+	err = -ENOMEM;
+	root_inode = ovl_new_inode(sb, S_IFDIR, oe);
+	if (!root_inode)
+		goto out_drop_write;
+
+	root_dentry = d_make_root(root_inode);
+	if (!root_dentry)
+		goto out_drop_write;
+
+	mntput(upperpath.mnt);
+	mntput(lowerpath.mnt);
+
+	oe->__upperdentry = dget(upperpath.dentry);
+	oe->lowerdentry = lowerpath.dentry;
+
+	root_dentry->d_fsdata = oe;
+	root_dentry->d_op = &ovl_dentry_operations;
+
+	sb->s_op = &ovl_super_operations;
+	sb->s_root = root_dentry;
+	sb->s_fs_info = ufs;
+
+	return 0;
+
+out_drop_write:
+	if (!(sb->s_flags & MS_RDONLY))
+		mnt_drop_write(ufs->upper_mnt);
+out_put_lower_mnt:
+	mntput(ufs->lower_mnt);
+out_put_upper_mnt:
+	mntput(ufs->upper_mnt);
+out_put_lowerpath:
+	path_put(&lowerpath);
+out_put_upperpath:
+	path_put(&upperpath);
+out_free_oe:
+	kfree(oe);
+out_free_ufs:
+	kfree(ufs);
+out_free_config:
+	kfree(config.lowerdir);
+	kfree(config.upperdir);
+out:
+	return err;
+}
+
+static struct dentry *ovl_mount(struct file_system_type *fs_type, int flags,
+				const char *dev_name, void *raw_data)
+{
+	return mount_nodev(fs_type, flags, raw_data, ovl_fill_super);
+}
+
+static struct file_system_type ovl_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "overlayfs",
+	.mount		= ovl_mount,
+	.kill_sb	= kill_anon_super,
+};
+
+static int __init ovl_init(void)
+{
+	return register_filesystem(&ovl_fs_type);
+}
+
+static void __exit ovl_exit(void)
+{
+	unregister_filesystem(&ovl_fs_type);
+}
+
+module_init(ovl_init);
+module_exit(ovl_exit);
-- 
1.7.7


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 05/13] overlayfs: add statfs support
  2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
                   ` (3 preceding siblings ...)
  2012-08-15 15:48 ` [PATCH 04/13] overlay filesystem Miklos Szeredi
@ 2012-08-15 15:48 ` Miklos Szeredi
  2012-08-17 18:20   ` Ben Hutchings
  2012-08-15 15:48 ` [PATCH 06/13] overlayfs: implement show_options Miklos Szeredi
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-15 15:48 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

From: Andy Whitcroft <apw@canonical.com>

Add support for statfs to the overlayfs filesystem.  As the upper layer
is the target of all write operations assume that the space in that
filesystem is the space in the overlayfs.  There will be some inaccuracy as
overwriting a file will copy it up and consume space we were not expecting,
but it is better than nothing.

Use the upper layer dentry and mount from the overlayfs root inode,
passing the statfs call to that filesystem.

Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/overlayfs/super.c |   20 ++++++++++++++++++++
 1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 02deecd..484753b 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -406,9 +406,29 @@ static int ovl_remount_fs(struct super_block *sb, int *flagsp, char *data)
 		return mnt_want_write(ufs->upper_mnt);
 }
 
+/**
+ * ovl_statfs
+ * @sb: The overlayfs super block
+ * @buf: The struct kstatfs to fill in with stats
+ *
+ * Get the filesystem statistics.  As writes always target the upper layer
+ * filesystem pass the statfs to the same filesystem.
+ */
+static int ovl_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+	struct dentry *root_dentry = dentry->d_sb->s_root;
+	struct path path;
+	ovl_path_upper(root_dentry, &path);
+
+	if (!path.dentry->d_sb->s_op->statfs)
+		return -ENOSYS;
+	return path.dentry->d_sb->s_op->statfs(path.dentry, buf);
+}
+
 static const struct super_operations ovl_super_operations = {
 	.put_super	= ovl_put_super,
 	.remount_fs	= ovl_remount_fs,
+	.statfs		= ovl_statfs,
 };
 
 struct ovl_config {
-- 
1.7.7


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 06/13] overlayfs: implement show_options
  2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
                   ` (4 preceding siblings ...)
  2012-08-15 15:48 ` [PATCH 05/13] overlayfs: add statfs support Miklos Szeredi
@ 2012-08-15 15:48 ` Miklos Szeredi
  2012-08-15 15:48 ` [PATCH 07/13] overlay: overlay filesystem documentation Miklos Szeredi
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-15 15:48 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi,
	Erez Zadok

From: Erez Zadok <ezk@fsl.cs.sunysb.edu>

This is useful because of the stacking nature of overlayfs.  Users like to
find out (via /proc/mounts) which lower/upper directory were used at mount
time.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/overlayfs/super.c |   63 ++++++++++++++++++++++++++++++++++----------------
 1 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 484753b..69a2099 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -17,17 +17,27 @@
 #include <linux/module.h>
 #include <linux/cred.h>
 #include <linux/sched.h>
+#include <linux/seq_file.h>
 #include "overlayfs.h"
 
 MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
 MODULE_DESCRIPTION("Overlay filesystem");
 MODULE_LICENSE("GPL");
 
+struct ovl_config {
+	char *lowerdir;
+	char *upperdir;
+};
+
+/* private information held for overlayfs's superblock */
 struct ovl_fs {
 	struct vfsmount *upper_mnt;
 	struct vfsmount *lower_mnt;
+	/* pathnames of lower and upper dirs, for show_options */
+	struct ovl_config config;
 };
 
+/* private information held for every overlayfs dentry */
 struct ovl_entry {
 	/*
 	 * Keep "double reference" on upper dentries, so that
@@ -384,6 +394,8 @@ static void ovl_put_super(struct super_block *sb)
 	mntput(ufs->upper_mnt);
 	mntput(ufs->lower_mnt);
 
+	kfree(ufs->config.lowerdir);
+	kfree(ufs->config.upperdir);
 	kfree(ufs);
 }
 
@@ -425,15 +437,27 @@ static int ovl_statfs(struct dentry *dentry, struct kstatfs *buf)
 	return path.dentry->d_sb->s_op->statfs(path.dentry, buf);
 }
 
+/**
+ * ovl_show_options
+ *
+ * Prints the mount options for a given superblock.
+ * Returns zero; does not fail.
+ */
+static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
+{
+	struct super_block *sb = dentry->d_sb;
+	struct ovl_fs *ufs = sb->s_fs_info;
+
+	seq_printf(m, ",lowerdir=%s", ufs->config.lowerdir);
+	seq_printf(m, ",upperdir=%s", ufs->config.upperdir);
+	return 0;
+}
+
 static const struct super_operations ovl_super_operations = {
 	.put_super	= ovl_put_super,
 	.remount_fs	= ovl_remount_fs,
 	.statfs		= ovl_statfs,
-};
-
-struct ovl_config {
-	char *lowerdir;
-	char *upperdir;
+	.show_options	= ovl_show_options,
 };
 
 enum {
@@ -493,33 +517,32 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	struct dentry *root_dentry;
 	struct ovl_entry *oe;
 	struct ovl_fs *ufs;
-	struct ovl_config config;
 	int err;
 
-	err = ovl_parse_opt((char *) data, &config);
-	if (err)
+	err = -ENOMEM;
+	ufs = kmalloc(sizeof(struct ovl_fs), GFP_KERNEL);
+	if (!ufs)
 		goto out;
 
+	err = ovl_parse_opt((char *) data, &ufs->config);
+	if (err)
+		goto out_free_ufs;
+
 	err = -EINVAL;
-	if (!config.upperdir || !config.lowerdir) {
+	if (!ufs->config.upperdir || !ufs->config.lowerdir) {
 		printk(KERN_ERR "overlayfs: missing upperdir or lowerdir\n");
 		goto out_free_config;
 	}
 
-	err = -ENOMEM;
-	ufs = kmalloc(sizeof(struct ovl_fs), GFP_KERNEL);
-	if (!ufs)
-		goto out_free_config;
-
 	oe = ovl_alloc_entry();
 	if (oe == NULL)
-		goto out_free_ufs;
+		goto out_free_config;
 
-	err = kern_path(config.upperdir, LOOKUP_FOLLOW, &upperpath);
+	err = kern_path(ufs->config.upperdir, LOOKUP_FOLLOW, &upperpath);
 	if (err)
 		goto out_free_oe;
 
-	err = kern_path(config.lowerdir, LOOKUP_FOLLOW, &lowerpath);
+	err = kern_path(ufs->config.lowerdir, LOOKUP_FOLLOW, &lowerpath);
 	if (err)
 		goto out_put_upperpath;
 
@@ -595,11 +618,11 @@ out_put_upperpath:
 	path_put(&upperpath);
 out_free_oe:
 	kfree(oe);
+out_free_config:
+	kfree(ufs->config.lowerdir);
+	kfree(ufs->config.upperdir);
 out_free_ufs:
 	kfree(ufs);
-out_free_config:
-	kfree(config.lowerdir);
-	kfree(config.upperdir);
 out:
 	return err;
 }
-- 
1.7.7


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 07/13] overlay: overlay filesystem documentation
  2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
                   ` (5 preceding siblings ...)
  2012-08-15 15:48 ` [PATCH 06/13] overlayfs: implement show_options Miklos Szeredi
@ 2012-08-15 15:48 ` Miklos Szeredi
  2012-08-15 19:53   ` J. Bruce Fields
  2012-09-10  1:47   ` Jan Engelhardt
  2012-08-15 15:48 ` [PATCH 08/13] fs: limit filesystem stacking depth Miklos Szeredi
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-15 15:48 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

From: Neil Brown <neilb@suse.de>

Document the overlay filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 Documentation/filesystems/overlayfs.txt |  199 +++++++++++++++++++++++++++++++
 MAINTAINERS                             |    7 +
 2 files changed, 206 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/overlayfs.txt

diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
new file mode 100644
index 0000000..7161dc3
--- /dev/null
+++ b/Documentation/filesystems/overlayfs.txt
@@ -0,0 +1,199 @@
+Written by: Neil Brown <neilb@suse.de>
+
+Overlay Filesystem
+==================
+
+This document describes a prototype for a new approach to providing
+overlay-filesystem functionality in Linux (sometimes referred to as
+union-filesystems).  An overlay-filesystem tries to present a
+filesystem which is the result over overlaying one filesystem on top
+of the other.
+
+The result will inevitably fail to look exactly like a normal
+filesystem for various technical reasons.  The expectation is that
+many use cases will be able to ignore these differences.
+
+This approach is 'hybrid' because the objects that appear in the
+filesystem do not all appear to belong to that filesystem.  In many
+cases an object accessed in the union will be indistinguishable
+from accessing the corresponding object from the original filesystem.
+This is most obvious from the 'st_dev' field returned by stat(2).
+
+While directories will report an st_dev from the overlay-filesystem,
+all non-directory objects will report an st_dev from the lower or
+upper filesystem that is providing the object.  Similarly st_ino will
+only be unique when combined with st_dev, and both of these can change
+over the lifetime of a non-directory object.  Many applications and
+tools ignore these values and will not be affected.
+
+Upper and Lower
+---------------
+
+An overlay filesystem combines two filesystems - an 'upper' filesystem
+and a 'lower' filesystem.  When a name exists in both filesystems, the
+object in the 'upper' filesystem is visible while the object in the
+'lower' filesystem is either hidden or, in the case of directories,
+merged with the 'upper' object.
+
+It would be more correct to refer to an upper and lower 'directory
+tree' rather than 'filesystem' as it is quite possible for both
+directory trees to be in the same filesystem and there is no
+requirement that the root of a filesystem be given for either upper or
+lower.
+
+The lower filesystem can be any filesystem supported by Linux and does
+not need to be writable.  The lower filesystem can even be another
+overlayfs.  The upper filesystem will normally be writable and if it
+is it must support the creation of trusted.* extended attributes, and
+must provide valid d_type in readdir responses, at least for symbolic
+links - so NFS is not suitable.
+
+A read-only overlay of two read-only filesystems may use any
+filesystem type.
+
+Directories
+-----------
+
+Overlaying mainly involved directories.  If a given name appears in both
+upper and lower filesystems and refers to a non-directory in either,
+then the lower object is hidden - the name refers only to the upper
+object.
+
+Where both upper and lower objects are directories, a merged directory
+is formed.
+
+At mount time, the two directories given as mount options are combined
+into a merged directory:
+
+  mount -t overlayfs overlayfs -olowerdir=/lower,upperdir=/upper /overlay
+
+Then whenever a lookup is requested in such a merged directory, the
+lookup is performed in each actual directory and the combined result
+is cached in the dentry belonging to the overlay filesystem.  If both
+actual lookups find directories, both are stored and a merged
+directory is created, otherwise only one is stored: the upper if it
+exists, else the lower.
+
+Only the lists of names from directories are merged.  Other content
+such as metadata and extended attributes are reported for the upper
+directory only.  These attributes of the lower directory are hidden.
+
+whiteouts and opaque directories
+--------------------------------
+
+In order to support rm and rmdir without changing the lower
+filesystem, an overlay filesystem needs to record in the upper filesystem
+that files have been removed.  This is done using whiteouts and opaque
+directories (non-directories are always opaque).
+
+The overlay filesystem uses extended attributes with a
+"trusted.overlay."  prefix to record these details.
+
+A whiteout is created as a symbolic link with target
+"(overlay-whiteout)" and with xattr "trusted.overlay.whiteout" set to "y".
+When a whiteout is found in the upper level of a merged directory, any
+matching name in the lower level is ignored, and the whiteout itself
+is also hidden.
+
+A directory is made opaque by setting the xattr "trusted.overlay.opaque"
+to "y".  Where the upper filesystem contains an opaque directory, any
+directory in the lower filesystem with the same name is ignored.
+
+readdir
+-------
+
+When a 'readdir' request is made on a merged directory, the upper and
+lower directories are each read and the name lists merged in the
+obvious way (upper is read first, then lower - entries that already
+exist are not re-added).  This merged name list is cached in the
+'struct file' and so remains as long as the file is kept open.  If the
+directory is opened and read by two processes at the same time, they
+will each have separate caches.  A seekdir to the start of the
+directory (offset 0) followed by a readdir will cause the cache to be
+discarded and rebuilt.
+
+This means that changes to the merged directory do not appear while a
+directory is being read.  This is unlikely to be noticed by many
+programs.
+
+seek offsets are assigned sequentially when the directories are read.
+Thus if
+  - read part of a directory
+  - remember an offset, and close the directory
+  - re-open the directory some time later
+  - seek to the remembered offset
+
+there may be little correlation between the old and new locations in
+the list of filenames, particularly if anything has changed in the
+directory.
+
+Readdir on directories that are not merged is simply handled by the
+underlying directory (upper or lower).
+
+
+Non-directories
+---------------
+
+Objects that are not directories (files, symlinks, device-special
+files etc.) are presented either from the upper or lower filesystem as
+appropriate.  When a file in the lower filesystem is accessed in a way
+the requires write-access, such as opening for write access, changing
+some metadata etc., the file is first copied from the lower filesystem
+to the upper filesystem (copy_up).  Note that creating a hard-link
+also requires copy_up, though of course creation of a symlink does
+not.
+
+The copy_up may turn out to be unnecessary, for example if the file is
+opened for read-write but the data is not modified.
+
+The copy_up process first makes sure that the containing directory
+exists in the upper filesystem - creating it and any parents as
+necessary.  It then creates the object with the same metadata (owner,
+mode, mtime, symlink-target etc.) and then if the object is a file, the
+data is copied from the lower to the upper filesystem.  Finally any
+extended attributes are copied up.
+
+Once the copy_up is complete, the overlay filesystem simply
+provides direct access to the newly created file in the upper
+filesystem - future operations on the file are barely noticed by the
+overlay filesystem (though an operation on the name of the file such as
+rename or unlink will of course be noticed and handled).
+
+
+Non-standard behavior
+---------------------
+
+The copy_up operation essentially creates a new, identical file and
+moves it over to the old name.  The new file may be on a different
+filesystem, so both st_dev and st_ino of the file may change.
+
+Any open files referring to this inode will access the old data and
+metadata.  Similarly any file locks obtained before copy_up will not
+apply to the copied up file.
+
+On a file is opened with O_RDONLY fchmod(2), fchown(2), futimesat(2)
+and fsetxattr(2) will fail with EROFS.
+
+If a file with multiple hard links is copied up, then this will
+"break" the link.  Changes will not be propagated to other names
+referring to the same inode.
+
+Symlinks in /proc/PID/ and /proc/PID/fd which point to a non-directory
+object in overlayfs will not contain vaid absolute paths, only
+relative paths leading up to the filesystem's root.  This will be
+fixed in the future.
+
+Some operations are not atomic, for example a crash during copy_up or
+rename will leave the filesystem in an inconsitent state.  This will
+be addressed in the future.
+
+Changes to underlying filesystems
+---------------------------------
+
+Offline changes, when the overlay is not mounted, are allowed to either
+the upper or the lower trees.
+
+Changes to the underlying filesystems while part of a mounted overlay
+filesystem are not allowed.  If the underlying filesystem is changed,
+the behavior of the overlay is undefined, though it will not result in
+a crash or deadlock.
diff --git a/MAINTAINERS b/MAINTAINERS
index 94b823f..d843a51 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5099,6 +5099,13 @@ F:	drivers/scsi/osd/
 F:	include/scsi/osd_*
 F:	fs/exofs/
 
+OVERLAYFS FILESYSTEM
+M:	Miklos Szeredi <miklos@szeredi.hu>
+L:	linux-fsdevel@vger.kernel.org
+S:	Supported
+F:	fs/overlayfs/*
+F:	Documentation/filesystems/overlayfs.txt
+
 P54 WIRELESS DRIVER
 M:	Christian Lamparter <chunkeey@googlemail.com>
 L:	linux-wireless@vger.kernel.org
-- 
1.7.7


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 08/13] fs: limit filesystem stacking depth
  2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
                   ` (6 preceding siblings ...)
  2012-08-15 15:48 ` [PATCH 07/13] overlay: overlay filesystem documentation Miklos Szeredi
@ 2012-08-15 15:48 ` Miklos Szeredi
  2012-08-16  8:02   ` Sedat Dilek
  2012-08-15 15:48 ` [PATCH 09/13] overlayfs: fix possible leak in ovl_new_inode Miklos Szeredi
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-15 15:48 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

From: Miklos Szeredi <mszeredi@suse.cz>

Add a simple read-only counter to super_block that indicates deep this
is in the stack of filesystems.  Previously ecryptfs was the only
stackable filesystem and it explicitly disallowed multiple layers of
itself.

Overlayfs, however, can be stacked recursively and also may be stacked
on top of ecryptfs or vice versa.

To limit the kernel stack usage we must limit the depth of the
filesystem stack.  Initially the limit is set to 2.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/ecryptfs/main.c   |    7 +++++++
 fs/overlayfs/super.c |   10 ++++++++++
 include/linux/fs.h   |   11 +++++++++++
 3 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index 2768138..344fb2c 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -565,6 +565,13 @@ static struct dentry *ecryptfs_mount(struct file_system_type *fs_type, int flags
 	s->s_maxbytes = path.dentry->d_sb->s_maxbytes;
 	s->s_blocksize = path.dentry->d_sb->s_blocksize;
 	s->s_magic = ECRYPTFS_SUPER_MAGIC;
+	s->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
+
+	rc = -EINVAL;
+	if (s->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
+		printk(KERN_ERR "eCryptfs: maximum fs stacking depth exceeded\n");
+		goto out_free;
+	}
 
 	inode = ecryptfs_get_inode(path.dentry->d_inode, s);
 	rc = PTR_ERR(inode);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 69a2099..64d2695 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -551,6 +551,16 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	    !S_ISDIR(lowerpath.dentry->d_inode->i_mode))
 		goto out_put_lowerpath;
 
+	sb->s_stack_depth = max(upperpath.mnt->mnt_sb->s_stack_depth,
+				lowerpath.mnt->mnt_sb->s_stack_depth) + 1;
+
+	err = -EINVAL;
+	if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
+		printk(KERN_ERR "overlayfs: maximum fs stacking depth exceeded\n");
+		goto out_put_lowerpath;
+	}
+
+
 	ufs->upper_mnt = clone_private_mount(&upperpath);
 	err = PTR_ERR(ufs->upper_mnt);
 	if (IS_ERR(ufs->upper_mnt)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index abc7a53..1265e24 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -505,6 +505,12 @@ struct iattr {
  */
 #include <linux/quota.h>
 
+/*
+ * Maximum number of layers of fs stack.  Needs to be limited to
+ * prevent kernel stack overflow
+ */
+#define FILESYSTEM_MAX_STACK_DEPTH 2
+
 /** 
  * enum positive_aop_returns - aop return codes with specific semantics
  *
@@ -1579,6 +1585,11 @@ struct super_block {
 
 	/* Being remounted read-only */
 	int s_readonly_remount;
+
+	/*
+	 * Indicates how deep in a filesystem stack this SB is
+	 */
+	int s_stack_depth;
 };
 
 /* superblock cache pruning functions */
-- 
1.7.7


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 09/13] overlayfs: fix possible leak in ovl_new_inode
  2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
                   ` (7 preceding siblings ...)
  2012-08-15 15:48 ` [PATCH 08/13] fs: limit filesystem stacking depth Miklos Szeredi
@ 2012-08-15 15:48 ` Miklos Szeredi
  2012-08-15 15:48 ` [PATCH 10/13] overlayfs: create new inode in ovl_link Miklos Szeredi
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-15 15:48 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi,
	Robin Dong, Robin Dong

From: Robin Dong <hao.bigrat@gmail.com>

After allocating a new inode, if the mode of inode is incorrect, we should
release it by iput().

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/overlayfs/inode.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 787761f..e854720 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -371,6 +371,7 @@ struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
 
 	default:
 		WARN(1, "illegal file type: %i\n", mode);
+		iput(inode);
 		inode = NULL;
 	}
 
-- 
1.7.7


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 10/13] overlayfs: create new inode in ovl_link
  2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
                   ` (8 preceding siblings ...)
  2012-08-15 15:48 ` [PATCH 09/13] overlayfs: fix possible leak in ovl_new_inode Miklos Szeredi
@ 2012-08-15 15:48 ` Miklos Szeredi
  2012-08-15 15:48 ` [PATCH 11/13] vfs: export __inode_permission() to modules Miklos Szeredi
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-15 15:48 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi,
	Robin Dong, Robin Dong

From: Robin Dong <hao.bigrat@gmail.com>

Imaging using ext4 as upperdir which has a file "hello" and lowdir is
totally empty.

1. mount -t overlayfs overlayfs -o lowerdir=/lower,upperdir=/upper /overlay
2. cd /overlay
3. ln hello bye

then the overlayfs code will call vfs_link to create a real ext4
dentry for "bye" and create
a new overlayfs dentry point to overlayfs inode (which standed for
"hello"). That means:
	two overlayfs dentries and only one overlayfs inode.

and then

4. umount /overlay
5. mount -t overlayfs overlayfs -o lowerdir=/lower,upperdir=/upper
/overlay (again)
6. cd /overlay
7. ls hello bye

the overlayfs will create two inodes(one for the "hello", another
for the "bye") and two dentries (each point a inode).That means:
	two dentries and two inodes.

As above, with different order of "create link" and "mount", the
result is not the same.

In order to make the behavior coherent, we need to create inode in ovl_link.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/overlayfs/dir.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 6b50823..40650c4 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -417,6 +417,7 @@ static int ovl_link(struct dentry *old, struct inode *newdir,
 	struct dentry *olddentry;
 	struct dentry *newdentry;
 	struct dentry *upperdir;
+	struct inode *newinode;
 
 	err = ovl_copy_up(old);
 	if (err)
@@ -441,13 +442,17 @@ static int ovl_link(struct dentry *old, struct inode *newdir,
 			err = -ENOENT;
 			goto out_unlock;
 		}
+		newinode = ovl_new_inode(old->d_sb, newdentry->d_inode->i_mode,
+				new->d_fsdata);
+		if (!newinode)
+			goto link_fail;
 
 		ovl_dentry_version_inc(new->d_parent);
 		ovl_dentry_update(new, newdentry);
 
-		ihold(old->d_inode);
-		d_instantiate(new, old->d_inode);
+		d_instantiate(new, newinode);
 	} else {
+link_fail:
 		if (ovl_dentry_is_opaque(new))
 			ovl_whiteout(upperdir, new);
 		dput(newdentry);
@@ -579,7 +584,6 @@ out_unlock:
 
 const struct inode_operations ovl_dir_inode_operations = {
 	.lookup		= ovl_lookup,
-	.atomic_open	= ovl_atomic_open,
 	.mkdir		= ovl_mkdir,
 	.symlink	= ovl_symlink,
 	.unlink		= ovl_unlink,
-- 
1.7.7


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 11/13] vfs: export __inode_permission() to modules
  2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
                   ` (9 preceding siblings ...)
  2012-08-15 15:48 ` [PATCH 10/13] overlayfs: create new inode in ovl_link Miklos Szeredi
@ 2012-08-15 15:48 ` Miklos Szeredi
  2012-08-15 17:17   ` Sedat Dilek
  2012-08-15 15:48 ` [PATCH 12/13] ovl: switch to __inode_permission() Miklos Szeredi
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-15 15:48 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

From: Miklos Szeredi <mszeredi@suse.cz>

We need to be able to check inode permissions (but not filesystem implied
permissions) for stackable filesystems.  Expose this interface for overlayfs.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/internal.h      |    5 -----
 fs/namei.c         |    1 +
 include/linux/fs.h |    1 +
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 371bcc4..8578209 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -42,11 +42,6 @@ static inline int __sync_blockdev(struct block_device *bdev, int wait)
 extern void __init chrdev_init(void);
 
 /*
- * namei.c
- */
-extern int __inode_permission(struct inode *, int);
-
-/*
  * namespace.c
  */
 extern int copy_mount_options(const void __user *, unsigned long *);
diff --git a/fs/namei.c b/fs/namei.c
index ac2526d..9be439a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -348,6 +348,7 @@ int __inode_permission(struct inode *inode, int mask)
 
 	return security_inode_permission(inode, mask);
 }
+EXPORT_SYMBOL(__inode_permission);
 
 /**
  * sb_permission - Check superblock-level permissions
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1265e24..d573703 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2418,6 +2418,7 @@ extern sector_t bmap(struct inode *, sector_t);
 #endif
 extern int notify_change(struct dentry *, struct iattr *);
 extern int inode_permission(struct inode *, int);
+extern int __inode_permission(struct inode *, int);
 extern int generic_permission(struct inode *, int);
 
 static inline bool execute_ok(struct inode *inode)
-- 
1.7.7


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 12/13] ovl: switch to __inode_permission()
  2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
                   ` (10 preceding siblings ...)
  2012-08-15 15:48 ` [PATCH 11/13] vfs: export __inode_permission() to modules Miklos Szeredi
@ 2012-08-15 15:48 ` Miklos Szeredi
  2012-08-15 16:59   ` Casey Schaufler
  2012-08-15 15:48 ` [PATCH 13/13] overlayfs: copy up i_uid/i_gid from the underlying inode Miklos Szeredi
  2012-08-15 17:14 ` [PATCH 00/13] overlay filesystem: request for inclusion (v14) Sedat Dilek
  13 siblings, 1 reply; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-15 15:48 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

From: Andy Whitcroft <apw@canonical.com>

When checking permissions on an overlayfs inode we do not take into
account either device cgroup restrictions nor security permissions.
This allows a user to mount an overlayfs layer over a restricted device
directory and by pass those permissions to open otherwise restricted
files.

Switch over to __inode_permissions.

Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/overlayfs/inode.c |   12 +-----------
 1 files changed, 1 insertions(+), 11 deletions(-)

diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index e854720..f3a534f 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -100,19 +100,9 @@ int ovl_permission(struct inode *inode, int mask)
 		if (is_upper && !IS_RDONLY(inode) && IS_RDONLY(realinode) &&
 		    (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
 			goto out_dput;
-
-		/*
-		 * Nobody gets write access to an immutable file.
-		 */
-		err = -EACCES;
-		if (IS_IMMUTABLE(realinode))
-			goto out_dput;
 	}
 
-	if (realinode->i_op->permission)
-		err = realinode->i_op->permission(realinode, mask);
-	else
-		err = generic_permission(realinode, mask);
+	err = __inode_permission(realinode, mask);
 out_dput:
 	dput(alias);
 	return err;
-- 
1.7.7


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 13/13] overlayfs: copy up i_uid/i_gid from the underlying inode
  2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
                   ` (11 preceding siblings ...)
  2012-08-15 15:48 ` [PATCH 12/13] ovl: switch to __inode_permission() Miklos Szeredi
@ 2012-08-15 15:48 ` Miklos Szeredi
  2012-08-15 17:14 ` [PATCH 00/13] overlay filesystem: request for inclusion (v14) Sedat Dilek
  13 siblings, 0 replies; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-15 15:48 UTC (permalink / raw)
  To: viro
  Cc: linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

From: Andy Whitcroft <apw@canonical.com>

YAMA et al rely on on i_uid/i_gid to be populated in order to perform
their checks.  While these really cannot be guarenteed as the underlying
filesystem may not even have the concept, they are expected to be filled
when possible.  To quote Al Viro:

    "Ideally, yes, we'd want to have ->i_uid used only by fs-specific
     code and helpers used by that fs (including those that are
     implicit defaults). [...]   In practice we have enough places
     where uid/gid is used directly to make setting them practically
     a requirement - places like /proc/<pid>/ can get away with
     not doing that, but only because shitloads of syscalls are
     not allowed on those anyway, permissions or no permissions.
     In anything general-purpose you really need to set it."

Copy up the underlying filesystem information into the overlayfs inode
when we create it.

BugLink: http://bugs.launchpad.net/bugs/944386
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/overlayfs/dir.c       |    2 ++
 fs/overlayfs/inode.c     |    2 ++
 fs/overlayfs/overlayfs.h |    6 ++++++
 fs/overlayfs/super.c     |    1 +
 4 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 40650c4..c4446c4 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -304,6 +304,7 @@ static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev,
 		}
 	}
 	ovl_dentry_update(dentry, newdentry);
+	ovl_copyattr(newdentry->d_inode, inode);
 	d_instantiate(dentry, inode);
 	inode = NULL;
 	newdentry = NULL;
@@ -446,6 +447,7 @@ static int ovl_link(struct dentry *old, struct inode *newdir,
 				new->d_fsdata);
 		if (!newinode)
 			goto link_fail;
+		ovl_copyattr(upperdir->d_inode, newinode);
 
 		ovl_dentry_version_inc(new->d_parent);
 		ovl_dentry_update(new, newdentry);
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index f3a534f..e7ab09b 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -31,6 +31,8 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 
 	mutex_lock(&upperdentry->d_inode->i_mutex);
 	err = notify_change(upperdentry, attr);
+	if (!err)
+		ovl_copyattr(upperdentry->d_inode, dentry->d_inode);
 	mutex_unlock(&upperdentry->d_inode->i_mutex);
 
 	return err;
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index fe1241d..1cba38f 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -56,6 +56,12 @@ int ovl_removexattr(struct dentry *dentry, const char *name);
 
 struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
 			    struct ovl_entry *oe);
+static inline void ovl_copyattr(struct inode *from, struct inode *to)
+{
+	to->i_uid = from->i_uid;
+	to->i_gid = from->i_gid;
+}
+
 /* dir.c */
 extern const struct inode_operations ovl_dir_inode_operations;
 
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 64d2695..9808408 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -343,6 +343,7 @@ static int ovl_do_lookup(struct dentry *dentry)
 				      oe);
 		if (!inode)
 			goto out_dput;
+		ovl_copyattr(realdentry->d_inode, inode);
 	}
 
 	if (upperdentry)
-- 
1.7.7


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 12/13] ovl: switch to __inode_permission()
  2012-08-15 15:48 ` [PATCH 12/13] ovl: switch to __inode_permission() Miklos Szeredi
@ 2012-08-15 16:59   ` Casey Schaufler
  2012-08-15 17:07     ` Andy Whitcroft
  0 siblings, 1 reply; 38+ messages in thread
From: Casey Schaufler @ 2012-08-15 16:59 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi,
	Casey Schaufler

On 8/15/2012 8:48 AM, Miklos Szeredi wrote:
> From: Andy Whitcroft <apw@canonical.com>
>
> When checking permissions on an overlayfs inode we do not take into
> account either device cgroup restrictions nor security permissions.
> This allows a user to mount an overlayfs layer over a restricted device
> directory and by pass those permissions to open otherwise restricted
> files.

Why is this a good idea? Either you're not including enough context
about the conditions under which this can occur, or you're suggesting
the introduction of a trivial mechanism for bypassing all file access
controls. This does not seem right.


>
> Switch over to __inode_permissions.
>
> Signed-off-by: Andy Whitcroft <apw@canonical.com>
> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> ---
>  fs/overlayfs/inode.c |   12 +-----------
>  1 files changed, 1 insertions(+), 11 deletions(-)
>
> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> index e854720..f3a534f 100644
> --- a/fs/overlayfs/inode.c
> +++ b/fs/overlayfs/inode.c
> @@ -100,19 +100,9 @@ int ovl_permission(struct inode *inode, int mask)
>  		if (is_upper && !IS_RDONLY(inode) && IS_RDONLY(realinode) &&
>  		    (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
>  			goto out_dput;
> -
> -		/*
> -		 * Nobody gets write access to an immutable file.
> -		 */
> -		err = -EACCES;
> -		if (IS_IMMUTABLE(realinode))
> -			goto out_dput;
>  	}
>  
> -	if (realinode->i_op->permission)
> -		err = realinode->i_op->permission(realinode, mask);
> -	else
> -		err = generic_permission(realinode, mask);
> +	err = __inode_permission(realinode, mask);
>  out_dput:
>  	dput(alias);
>  	return err;


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 12/13] ovl: switch to __inode_permission()
  2012-08-15 16:59   ` Casey Schaufler
@ 2012-08-15 17:07     ` Andy Whitcroft
  2012-08-15 17:34       ` Casey Schaufler
  0 siblings, 1 reply; 38+ messages in thread
From: Andy Whitcroft @ 2012-08-15 17:07 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Miklos Szeredi, viro, linux-fsdevel, linux-kernel, hch, torvalds,
	akpm, nbd, neilb, hramrach, jordipujolp, ezk, ricwheeler,
	dhowells, hpj, sedat.dilek, penberg, goran.cetusic, romain,
	mszeredi

On Wed, Aug 15, 2012 at 09:59:51AM -0700, Casey Schaufler wrote:
> On 8/15/2012 8:48 AM, Miklos Szeredi wrote:
> > From: Andy Whitcroft <apw@canonical.com>
> >
> > When checking permissions on an overlayfs inode we do not take into
> > account either device cgroup restrictions nor security permissions.
> > This allows a user to mount an overlayfs layer over a restricted device
> > directory and by pass those permissions to open otherwise restricted
> > files.
> 
> Why is this a good idea? Either you're not including enough context
> about the conditions under which this can occur, or you're suggesting
> the introduction of a trivial mechanism for bypassing all file access
> controls. This does not seem right.

It is stating that the unprotected case is how things was before this
patch switches us over to __inode_permisssions.  The patch is closing
the hole indicated.

-apw
> >
> > Switch over to __inode_permissions.
> >
> > Signed-off-by: Andy Whitcroft <apw@canonical.com>
> > Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> > ---
> >  fs/overlayfs/inode.c |   12 +-----------
> >  1 files changed, 1 insertions(+), 11 deletions(-)
> >
> > diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> > index e854720..f3a534f 100644
> > --- a/fs/overlayfs/inode.c
> > +++ b/fs/overlayfs/inode.c
> > @@ -100,19 +100,9 @@ int ovl_permission(struct inode *inode, int mask)
> >  		if (is_upper && !IS_RDONLY(inode) && IS_RDONLY(realinode) &&
> >  		    (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
> >  			goto out_dput;
> > -
> > -		/*
> > -		 * Nobody gets write access to an immutable file.
> > -		 */
> > -		err = -EACCES;
> > -		if (IS_IMMUTABLE(realinode))
> > -			goto out_dput;
> >  	}
> >  
> > -	if (realinode->i_op->permission)
> > -		err = realinode->i_op->permission(realinode, mask);
> > -	else
> > -		err = generic_permission(realinode, mask);
> > +	err = __inode_permission(realinode, mask);
> >  out_dput:
> >  	dput(alias);
> >  	return err;
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 00/13] overlay filesystem: request for inclusion (v14)
  2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
                   ` (12 preceding siblings ...)
  2012-08-15 15:48 ` [PATCH 13/13] overlayfs: copy up i_uid/i_gid from the underlying inode Miklos Szeredi
@ 2012-08-15 17:14 ` Sedat Dilek
  13 siblings, 0 replies; 38+ messages in thread
From: Sedat Dilek @ 2012-08-15 17:14 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi,
	Stephen Rothwell

On Wed, Aug 15, 2012 at 5:48 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> Here's the latest version of the overlayfs series.
>
> Git tree is here:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs.v14
>
> Please consider for 3.7.
>

[ Stephen (linux-next) maintainer ]

Wouldn't it be a good idea to ask Stephen to include your tree into linux-next?
Many people build linux-next manually or with a lot of rand-configs on
a daily base.

With Linux-3.6-rc1 merge-window I have noticed (for the first time) a
lot of merges which got pulled in "blindly" with the remark "matured
in linux-next".
So, chances will be higher that OverlayFS is more and better tested
and might be accepted (hopefully).
My 0.02EUR.

Your patch "vfs: export __inode_permission() to modules" can go
through Al's vfs tree?
When I saw [1] in the first pile of vfs patches for v3.6-rc1 (first
bits of/for union-mounts!) I talked with Andy that the export is
needed when OverlayFS is built as a module.
But I will comment separately on this.

- Sedat -

[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=0bdaea9017b9d2b9996e153a71ee03555969b80e

> Thanks,
> Miklos
>
>
> ---
> Andy Whitcroft (3):
>       overlayfs: add statfs support
>       ovl: switch to __inode_permission()
>       overlayfs: copy up i_uid/i_gid from the underlying inode
>
> Erez Zadok (1):
>       overlayfs: implement show_options
>
> Miklos Szeredi (6):
>       vfs: add i_op->open()
>       vfs: export do_splice_direct() to modules
>       vfs: introduce clone_private_mount()
>       overlay filesystem
>       fs: limit filesystem stacking depth
>       vfs: export __inode_permission() to modules
>
> Neil Brown (1):
>       overlay: overlay filesystem documentation
>
> Robin Dong (2):
>       overlayfs: fix possible leak in ovl_new_inode
>       overlayfs: create new inode in ovl_link
>
> ---
>  Documentation/filesystems/Locking       |    2 +
>  Documentation/filesystems/overlayfs.txt |  199 +++++++++
>  Documentation/filesystems/vfs.txt       |    7 +
>  MAINTAINERS                             |    7 +
>  fs/Kconfig                              |    1 +
>  fs/Makefile                             |    1 +
>  fs/ecryptfs/main.c                      |    7 +
>  fs/internal.h                           |    5 -
>  fs/namei.c                              |   10 +-
>  fs/namespace.c                          |   18 +
>  fs/open.c                               |   23 +-
>  fs/overlayfs/Kconfig                    |    4 +
>  fs/overlayfs/Makefile                   |    7 +
>  fs/overlayfs/copy_up.c                  |  385 ++++++++++++++++++
>  fs/overlayfs/dir.c                      |  604 ++++++++++++++++++++++++++++
>  fs/overlayfs/inode.c                    |  372 +++++++++++++++++
>  fs/overlayfs/overlayfs.h                |   70 ++++
>  fs/overlayfs/readdir.c                  |  566 ++++++++++++++++++++++++++
>  fs/overlayfs/super.c                    |  665 +++++++++++++++++++++++++++++++
>  fs/splice.c                             |    1 +
>  include/linux/fs.h                      |   14 +
>  include/linux/mount.h                   |    3 +
>  22 files changed, 2961 insertions(+), 10 deletions(-)
>  create mode 100644 Documentation/filesystems/overlayfs.txt
>  create mode 100644 fs/overlayfs/Kconfig
>  create mode 100644 fs/overlayfs/Makefile
>  create mode 100644 fs/overlayfs/copy_up.c
>  create mode 100644 fs/overlayfs/dir.c
>  create mode 100644 fs/overlayfs/inode.c
>  create mode 100644 fs/overlayfs/overlayfs.h
>  create mode 100644 fs/overlayfs/readdir.c
>  create mode 100644 fs/overlayfs/super.c
>
> ------------------------------------------------------------------------------
> Changes from v13 to v14
>
> - update to 3.6
>
> - copy i_uid/i_gid from the underlying inode (patch by Andy Whitcroft)
>
> ------------------------------------------------------------------------------
> Changes from v12 to v13
>
> - create new inode in ovl_link (patch by Robin Dong)
>
> - switch to __inode_permission() (patch by Andy Whitcroft)
>
> ------------------------------------------------------------------------------
> Changes from v11 to v12
>
> - update to for-next of vfs tree
>
> - split __dentry_open argument cleanup patch from vfs-add-i_op-open.patch
>
> - change i_op->open and vfs_open so that they take "struct file *"
>
> ------------------------------------------------------------------------------
> Changes from v10 to v11
>
> - fix overlayfs over overlayfs
>
> - improve stack use of lookup and readdir
>
> - add limitations to documentation
>
> - make lower mount read-only
>
> - update permission and fsync to new API
>
> ------------------------------------------------------------------------------
> Changes from v9 to v10
>
> - prevent d_delete() from turning upperdentry negative (reported by
>   Erez Zadok)
>
> - show mount options in /proc/mounts and friends (patch by Erez Zadok)
>
> - fix off-by-one error in readdir (reported by Jordi Pujol)
>
> ------------------------------------------------------------------------------
> Changes from v8 to v9
>
> - support xattr on tmpfs
>
> - fix build after split-up
>
> - fix remove after rename (reported by Jordi Pujol)
>
> - fix rename failure case
>
> ------------------------------------------------------------------------------
> Changes from v7 to v8:
>
> - split overlayfs.c into smaller files
>
> - fix locking for copy up (reported by Al Viro)
>
> - locking analysis of copy up vs. directory rename added as a comment
>
> - tested with lockdep, fixed one lock annotation
>
> - other bug fixes
>
> ------------------------------------------------------------------------------
> Changes from v6 to v7
>
> - added patches from Felix Fietkau to fix deadlocks on jffs2
>
> - optimized directory removal
>
> - properly clean up after copy-up and other failures
>
> ------------------------------------------------------------------------------
> Changes from v5 to v6
>
> - optimize directory merging
>
>   o use rbtree for weeding out duplicates
>
>   o use a cursor for current position within the stream
>
> - instead of f_op->open_other(), implement i_op->open()
>
> - don't share inodes for non-directory dentries - for now.  I hope
>   this can come back once RCU lookup code has settled.
>
> - misc bug fixes
>
> ------------------------------------------------------------------------------
> Changes from v4 to v5
>
> - fix copying up if fs doesn't support xattrs (Andy Whitcroft)
>
> - clone mounts to be used internally to access the underlying
>   filesystems
>
> ------------------------------------------------------------------------------
> Changes from v3 to v4
>
> - export security_inode_permission to allow overlayfs to be modular
>   (Andy Whitcroft)
>
> - add statfs support (Andy Whitcroft)
>
> - change BUG_ON to WARN_ON
>
> - Revert "vfs: add flag to allow rename to same inode", instead
>   introduce s_op->is_same_inode()
>
> - overlayfs: fix rename to self
>
> - fix whiteout after rename
>
> ------------------------------------------------------------------------------
> Changes from v2 to v3
>
>  - Minimal remount support.  As overlayfs reflects the 'readonly'
>    mount status in write-access to the upper filesystem, we must
>    handle remount and either drop or take write access when the ro
>    status changes. (NeilBrown)
>
>  - Use correct seek function for directories.  It is incorrect to call
>    generic_llseek_file on a file from a different filesystem.  For
>    that we must use the seek function that the filesystem defines,
>    which is called by vfs_llseek.  Also, we only want to seek the
>    realfile when is_real is true.  Otherwise we just want to update
>    our own f_pos pointer, so use generic_llseek_file for
>    that. (NeilBrown)
>
>  - Initialise is_real before use.  The previous patch can use
>    od->is_real before it is properly initialised is llseek is called
>    before readdir.  So factor out the initialisation of is_real and
>    call it from both readdir and llseek when f_pos is 0. (NeilBrown)
>
>  - Rename ovl_fill_cache to ovl_dir_read (NeilBrown)
>
>  - Tiny optimisation in open_other handling (NeilBrown)
>
>  - Assorted updates to Documentation/filesystems/overlayfs.txt (NeilBrown)
>
>  - Make copy-up work for >=4G files, make it killable during copy-up.
>    Need to fix recovery after a failed/interrupted copy-up.
>
>  - Store and reference upper/lower dentries in overlay dentries.
>    Store and reference upper/lower vfsmounts in overlay superblock.
>
>  - Add necessary barriers for setting upper dentry in copyup and for
>    retrieving upper dentry locklessly.
>
>  - Make sure the right file is used for directory fsync() after
>    copy-up.
>
>  - Add locking to ovl_dir_llseek() to prevent concurrent call of
>    ovl_dir_reset() with ovl_dir_read().
>
>  - Get rid of ovl_dentry_iput().  The VFS doesn't provide enough
>    locking for this function that the contents of ->d_fsdata could be
>    safely updated.
>
>  - After copying up a non-directory unhash the dentry.  This way the
>    lower dentry ref, which is no longer necessary, can go away.  This
>    revealed a use-after-free bug in truncate handling in
>    fs/namei.c:finish_open().
>
>  - Fix if a copy-up happens between the follow_linka the put_link
>    calls.
>
>  - Replace some WARN_ONs with BUG_ON.  Some things just _really_
>    shouldn't happen.
>
>  - Extract common code from ovl_unlink and ovl_rmdir to a helper
>    function.
>
>  - After unlink and rmdir unhash the dentry.  This will get rid of the
>    lower and upper dentry references after there are no more users of
>    the deleted dentry.  This is a safe replacement for the removed
>    ->d_iput() functionality.
>
>  - Added checks to unlink, rmdir and rename to verify that the
>    parent-child relationship in the upper filesystem matches that of
>    the overlay.  This is necessary to prevent crash and/or corruption
>    if the upper filesystem topology is being modified while part of
>    the overlay.
>
>  - Optimize checking whiteout and opaque attributes.
>
>  - Optimize copy-up on truncate: don't copy up whole file before
>    truncating
>
>  - Misc bug fixes
>
> ------------------------------------------------------------------------------
> Changes from v1 to v2
>
>  - rename "hybrid union filesystem" to "overlay filesystem" or overlayfs
>
>  - added documentation written by Neil
>
>  - correct st_dev for directories (reported by Neil)
>
>  - use getattr() to get attributes from the underlying filesystems,
>    this means that now an overlay filesystem itself can be the lower,
>    read-only layer of another overlay
>
>  - listxattr filters out private extended attributes
>
>  - get write ref on the upper layer on mount unless the overlay
>    itself is mounted read-only
>
>  - raise capabilities for copy up, dealing with whiteouts and opaque
>    directories.  Now the overlay works for non-root users as well
>
>  - "rm -rf" didn't work correctly in all cases if the directory was
>    copied up between opendir and the first readdir, this is now fixed
>    (and the directory operations consolidated)
>
>  - simplified copy up, this broke optimization for truncate and
>    open(O_TRUNC) (now file is copied up to be immediately truncated,
>    will fix)
>
>  - st_nlink for merged directories set to 1, this is an "illegal"
>    value that normal filesystems never have but some use it to
>    indicate that the number of subdirectories is unknown.  Utilities
>    (find, ...) seem to tolerate this well.
>
>  - misc fixes I forgot about
>
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 11/13] vfs: export __inode_permission() to modules
  2012-08-15 15:48 ` [PATCH 11/13] vfs: export __inode_permission() to modules Miklos Szeredi
@ 2012-08-15 17:17   ` Sedat Dilek
  0 siblings, 0 replies; 38+ messages in thread
From: Sedat Dilek @ 2012-08-15 17:17 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

On Wed, Aug 15, 2012 at 5:48 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> From: Miklos Szeredi <mszeredi@suse.cz>
>
> We need to be able to check inode permissions (but not filesystem implied
> permissions) for stackable filesystems.  Expose this interface for overlayfs.
>

This patch can go through Al's vfs tree?
It's an addendum for the below patch "VFS: Split inode_permission()"
[1] from the first pile of vfs patches for v3.6-rc1 (first bits of/for
union-mounts!).
I talked with Andy on #ubuntu-kernel that the export is needed when
OverlayFS is built as a module.

Comments?

- Sedat -

[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=0bdaea9017b9d2b9996e153a71ee03555969b80e

> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> ---
>  fs/internal.h      |    5 -----
>  fs/namei.c         |    1 +
>  include/linux/fs.h |    1 +
>  3 files changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/fs/internal.h b/fs/internal.h
> index 371bcc4..8578209 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -42,11 +42,6 @@ static inline int __sync_blockdev(struct block_device *bdev, int wait)
>  extern void __init chrdev_init(void);
>
>  /*
> - * namei.c
> - */
> -extern int __inode_permission(struct inode *, int);
> -
> -/*
>   * namespace.c
>   */
>  extern int copy_mount_options(const void __user *, unsigned long *);
> diff --git a/fs/namei.c b/fs/namei.c
> index ac2526d..9be439a 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -348,6 +348,7 @@ int __inode_permission(struct inode *inode, int mask)
>
>         return security_inode_permission(inode, mask);
>  }
> +EXPORT_SYMBOL(__inode_permission);
>
>  /**
>   * sb_permission - Check superblock-level permissions
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 1265e24..d573703 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2418,6 +2418,7 @@ extern sector_t bmap(struct inode *, sector_t);
>  #endif
>  extern int notify_change(struct dentry *, struct iattr *);
>  extern int inode_permission(struct inode *, int);
> +extern int __inode_permission(struct inode *, int);
>  extern int generic_permission(struct inode *, int);
>
>  static inline bool execute_ok(struct inode *inode)
> --
> 1.7.7
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/13] vfs: add i_op->open()
  2012-08-15 15:48 ` [PATCH 01/13] vfs: add i_op->open() Miklos Szeredi
@ 2012-08-15 17:21   ` J. Bruce Fields
  2012-08-15 20:28     ` NeilBrown
  0 siblings, 1 reply; 38+ messages in thread
From: J. Bruce Fields @ 2012-08-15 17:21 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

On Wed, Aug 15, 2012 at 05:48:08PM +0200, Miklos Szeredi wrote:
> From: Miklos Szeredi <mszeredi@suse.cz>
> 
> Add a new inode operation i_op->open().  This is for stacked

Shouldn't that "->open()" be "->dentry_open()" ?

--b.

> filesystems that want to return a struct file from a different
> filesystem.
> 
> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> ---
>  Documentation/filesystems/Locking |    2 ++
>  Documentation/filesystems/vfs.txt |    7 +++++++
>  fs/namei.c                        |    9 ++++++---
>  fs/open.c                         |   23 +++++++++++++++++++++--
>  include/linux/fs.h                |    2 ++
>  5 files changed, 38 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index 0f103e3..d222b6a 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -64,6 +64,7 @@ prototypes:
>  	int (*atomic_open)(struct inode *, struct dentry *,
>  				struct file *, unsigned open_flag,
>  				umode_t create_mode, int *opened);
> +	int (*dentry_open)(struct dentry *, struct file *, const struct cred *);
>  
>  locking rules:
>  	all may block
> @@ -92,6 +93,7 @@ removexattr:	yes
>  fiemap:		no
>  update_time:	no
>  atomic_open:	yes
> +open:		no
>  
>  	Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on
>  victim.
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index 065aa2d..f53d93c 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -367,6 +367,7 @@ struct inode_operations {
>  	int (*atomic_open)(struct inode *, struct dentry *,
>  				struct file *, unsigned open_flag,
>  				umode_t create_mode, int *opened);
> +	int (*dentry_open)(struct dentry *, struct file *, const struct cred *);
>  };
>  
>  Again, all methods are called without any locks being held, unless
> @@ -696,6 +697,12 @@ struct address_space_operations {
>    	but instead uses bmap to find out where the blocks in the file
>    	are and uses those addresses directly.
>  
> +  dentry_open: this is an alternative to f_op->open(), the difference is that
> +	this method may open a file not necessarily originating from the same
> +	filesystem as the one i_op->open() was called on.  It may be
> +	useful for stacking filesystems which want to allow native I/O directly
> +	on underlying files.
> +
>  
>    invalidatepage: If a page has PagePrivate set, then invalidatepage
>          will be called when part or all of the page is to be removed
> diff --git a/fs/namei.c b/fs/namei.c
> index 1b46439..ac2526d 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2816,9 +2816,12 @@ finish_open_created:
>  	error = may_open(&nd->path, acc_mode, open_flag);
>  	if (error)
>  		goto out;
> -	file->f_path.mnt = nd->path.mnt;
> -	error = finish_open(file, nd->path.dentry, NULL, opened);
> -	if (error) {
> +
> +	BUG_ON(*opened & FILE_OPENED); /* once it's opened, it's opened */
> +	error = vfs_open(&nd->path, file, current_cred());
> +	if (!error) {
> +		*opened |= FILE_OPENED;
> +	} else {
>  		if (error == -EOPENSTALE)
>  			goto stale_open;
>  		goto out;
> diff --git a/fs/open.c b/fs/open.c
> index f3d96e7..c5a8cac 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -787,8 +787,7 @@ struct file *dentry_open(const struct path *path, int flags,
>  		return ERR_PTR(error);
>  
>  	f->f_flags = flags;
> -	f->f_path = *path;
> -	error = do_dentry_open(f, NULL, cred);
> +	error = vfs_open(path, f, cred);
>  	if (!error) {
>  		error = open_check_o_direct(f);
>  		if (error) {
> @@ -803,6 +802,26 @@ struct file *dentry_open(const struct path *path, int flags,
>  }
>  EXPORT_SYMBOL(dentry_open);
>  
> +/**
> + * vfs_open - open the file at the given path
> + * @path: path to open
> + * @filp: newly allocated file with f_flag initialized
> + * @cred: credentials to use
> + */
> +int vfs_open(const struct path *path, struct file *filp,
> +	     const struct cred *cred)
> +{
> +	struct inode *inode = path->dentry->d_inode;
> +
> +	if (inode->i_op->dentry_open)
> +		return inode->i_op->dentry_open(path->dentry, filp, cred);
> +	else {
> +		filp->f_path = *path;
> +		return do_dentry_open(filp, NULL, cred);
> +	}
> +}
> +EXPORT_SYMBOL(vfs_open);
> +
>  static void __put_unused_fd(struct files_struct *files, unsigned int fd)
>  {
>  	struct fdtable *fdt = files_fdtable(files);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 38dba16..abc7a53 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1836,6 +1836,7 @@ struct inode_operations {
>  	int (*atomic_open)(struct inode *, struct dentry *,
>  			   struct file *, unsigned open_flag,
>  			   umode_t create_mode, int *opened);
> +	int (*dentry_open)(struct dentry *, struct file *, const struct cred *);
>  } ____cacheline_aligned;
>  
>  struct seq_file;
> @@ -2201,6 +2202,7 @@ extern long do_sys_open(int dfd, const char __user *filename, int flags,
>  extern struct file *filp_open(const char *, int, umode_t);
>  extern struct file *file_open_root(struct dentry *, struct vfsmount *,
>  				   const char *, int);
> +extern int vfs_open(const struct path *, struct file *, const struct cred *);
>  extern struct file * dentry_open(const struct path *, int, const struct cred *);
>  extern int filp_close(struct file *, fl_owner_t id);
>  extern char * getname(const char __user *);
> -- 
> 1.7.7
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 12/13] ovl: switch to __inode_permission()
  2012-08-15 17:07     ` Andy Whitcroft
@ 2012-08-15 17:34       ` Casey Schaufler
  0 siblings, 0 replies; 38+ messages in thread
From: Casey Schaufler @ 2012-08-15 17:34 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Miklos Szeredi, viro, linux-fsdevel, linux-kernel, hch, torvalds,
	akpm, nbd, neilb, hramrach, jordipujolp, ezk, ricwheeler,
	dhowells, hpj, sedat.dilek, penberg, goran.cetusic, romain,
	mszeredi

On 8/15/2012 10:07 AM, Andy Whitcroft wrote:
> On Wed, Aug 15, 2012 at 09:59:51AM -0700, Casey Schaufler wrote:
>> On 8/15/2012 8:48 AM, Miklos Szeredi wrote:
>>> From: Andy Whitcroft <apw@canonical.com>
>>>
>>> When checking permissions on an overlayfs inode we do not take into
>>> account either device cgroup restrictions nor security permissions.
>>> This allows a user to mount an overlayfs layer over a restricted device
>>> directory and by pass those permissions to open otherwise restricted
>>> files.
>> Why is this a good idea? Either you're not including enough context
>> about the conditions under which this can occur, or you're suggesting
>> the introduction of a trivial mechanism for bypassing all file access
>> controls. This does not seem right.
> It is stating that the unprotected case is how things was before this
> patch switches us over to __inode_permisssions.  The patch is closing
> the hole indicated.

Well, that's good then. Carry on.

>
> -apw
>>> Switch over to __inode_permissions.
>>>
>>> Signed-off-by: Andy Whitcroft <apw@canonical.com>
>>> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
>>> ---
>>>  fs/overlayfs/inode.c |   12 +-----------
>>>  1 files changed, 1 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
>>> index e854720..f3a534f 100644
>>> --- a/fs/overlayfs/inode.c
>>> +++ b/fs/overlayfs/inode.c
>>> @@ -100,19 +100,9 @@ int ovl_permission(struct inode *inode, int mask)
>>>  		if (is_upper && !IS_RDONLY(inode) && IS_RDONLY(realinode) &&
>>>  		    (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
>>>  			goto out_dput;
>>> -
>>> -		/*
>>> -		 * Nobody gets write access to an immutable file.
>>> -		 */
>>> -		err = -EACCES;
>>> -		if (IS_IMMUTABLE(realinode))
>>> -			goto out_dput;
>>>  	}
>>>  
>>> -	if (realinode->i_op->permission)
>>> -		err = realinode->i_op->permission(realinode, mask);
>>> -	else
>>> -		err = generic_permission(realinode, mask);
>>> +	err = __inode_permission(realinode, mask);
>>>  out_dput:
>>>  	dput(alias);
>>>  	return err;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 07/13] overlay: overlay filesystem documentation
  2012-08-15 15:48 ` [PATCH 07/13] overlay: overlay filesystem documentation Miklos Szeredi
@ 2012-08-15 19:53   ` J. Bruce Fields
  2012-08-16 10:09     ` Miklos Szeredi
  2012-09-10  1:47   ` Jan Engelhardt
  1 sibling, 1 reply; 38+ messages in thread
From: J. Bruce Fields @ 2012-08-15 19:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

Sorry, just trivial typos:

On Wed, Aug 15, 2012 at 05:48:14PM +0200, Miklos Szeredi wrote:
> +Directories
> +-----------
> +
> +Overlaying mainly involved directories.  If a given name appears in both

s/involved/involves/.

> +Non-standard behavior
> +---------------------
> +
> +The copy_up operation essentially creates a new, identical file and
> +moves it over to the old name.  The new file may be on a different
> +filesystem, so both st_dev and st_ino of the file may change.
> +
> +Any open files referring to this inode will access the old data and
> +metadata.  Similarly any file locks obtained before copy_up will not
> +apply to the copied up file.
> +
> +On a file is opened with O_RDONLY fchmod(2), fchown(2), futimesat(2)

s/is //

> +and fsetxattr(2) will fail with EROFS.
> +
> +If a file with multiple hard links is copied up, then this will
> +"break" the link.  Changes will not be propagated to other names
> +referring to the same inode.
> +
> +Symlinks in /proc/PID/ and /proc/PID/fd which point to a non-directory
> +object in overlayfs will not contain vaid absolute paths, only

s/vaid/valid/

--b.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/13] vfs: add i_op->open()
  2012-08-15 17:21   ` J. Bruce Fields
@ 2012-08-15 20:28     ` NeilBrown
  2012-08-16 10:10       ` Miklos Szeredi
  0 siblings, 1 reply; 38+ messages in thread
From: NeilBrown @ 2012-08-15 20:28 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Miklos Szeredi, viro, linux-fsdevel, linux-kernel, hch, torvalds,
	akpm, apw, nbd, hramrach, jordipujolp, ezk, ricwheeler, dhowells,
	hpj, sedat.dilek, penberg, goran.cetusic, romain, mszeredi

[-- Attachment #1: Type: text/plain, Size: 1571 bytes --]

On Wed, 15 Aug 2012 13:21:50 -0400 "J. Bruce Fields" <bfields@fieldses.org>
wrote:

> On Wed, Aug 15, 2012 at 05:48:08PM +0200, Miklos Szeredi wrote:
> > From: Miklos Szeredi <mszeredi@suse.cz>
> > 
> > Add a new inode operation i_op->open().  This is for stacked
> 
> Shouldn't that "->open()" be "->dentry_open()" ?
> 
> --b.
> 
> > filesystems that want to return a struct file from a different
> > filesystem.
> > 
> > Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> > ---
> >  Documentation/filesystems/Locking |    2 ++
> >  Documentation/filesystems/vfs.txt |    7 +++++++
> >  fs/namei.c                        |    9 ++++++---
> >  fs/open.c                         |   23 +++++++++++++++++++++--
> >  include/linux/fs.h                |    2 ++
> >  5 files changed, 38 insertions(+), 5 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> > index 0f103e3..d222b6a 100644
> > --- a/Documentation/filesystems/Locking
> > +++ b/Documentation/filesystems/Locking
> > @@ -64,6 +64,7 @@ prototypes:
> >  	int (*atomic_open)(struct inode *, struct dentry *,
> >  				struct file *, unsigned open_flag,
> >  				umode_t create_mode, int *opened);
> > +	int (*dentry_open)(struct dentry *, struct file *, const struct cred *);
> >  
> >  locking rules:
> >  	all may block
> > @@ -92,6 +93,7 @@ removexattr:	yes
> >  fiemap:		no
> >  update_time:	no
> >  atomic_open:	yes
> > +open:		no
> >  

and that last line should be
    +dentry_open:       no
??

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 04/13] overlay filesystem
  2012-08-15 15:48 ` [PATCH 04/13] overlay filesystem Miklos Szeredi
@ 2012-08-16  6:24   ` Eric W. Biederman
  2012-08-16 10:25     ` Miklos Szeredi
  0 siblings, 1 reply; 38+ messages in thread
From: Eric W. Biederman @ 2012-08-16  6:24 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

Miklos Szeredi <miklos@szeredi.hu> writes:

Minor nits below.

> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> new file mode 100644
> index 0000000..6b50823
> --- /dev/null
> +++ b/fs/overlayfs/dir.c
> @@ -0,0 +1,598 @@
> +/*
> + *
> + * Copyright (C) 2011 Novell Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/namei.h>
> +#include <linux/xattr.h>
> +#include <linux/security.h>
> +#include <linux/cred.h>
> +#include "overlayfs.h"
> +
> +static const char *ovl_whiteout_symlink = "(overlay-whiteout)";
> +
> +static int ovl_whiteout(struct dentry *upperdir, struct dentry *dentry)
> +{
> +	int err;
> +	struct dentry *newdentry;
> +	const struct cred *old_cred;
> +	struct cred *override_cred;
> +
> +	/* FIXME: recheck lower dentry to see if whiteout is really
>  	needed */

Is that FIXME still valid?

> +	err = -ENOMEM;
> +	override_cred = prepare_creds();
> +	if (!override_cred)
> +		goto out;
> +
> +	/*
> +	 * CAP_SYS_ADMIN for setxattr
> +	 * CAP_DAC_OVERRIDE for symlink creation
> +	 * CAP_FOWNER for unlink in sticky directory
> +	 */
> +	cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
> +	cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
> +	cap_raise(override_cred->cap_effective, CAP_FOWNER);
> +	override_cred->fsuid = 0;
> +	override_cred->fsgid = 0;

Could you please make these GLOBAL_ROOT_UID and GLOBAL_ROOT_GID
instead of 0?  Otherwise this code won't compile with the usernamespace
bits enabled.

> +	old_cred = override_creds(override_cred);

Eric

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 08/13] fs: limit filesystem stacking depth
  2012-08-15 15:48 ` [PATCH 08/13] fs: limit filesystem stacking depth Miklos Szeredi
@ 2012-08-16  8:02   ` Sedat Dilek
  2012-08-16  8:30     ` Sedat Dilek
  2012-08-16 10:42     ` Miklos Szeredi
  0 siblings, 2 replies; 38+ messages in thread
From: Sedat Dilek @ 2012-08-16  8:02 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

[-- Attachment #1: Type: text/plain, Size: 4316 bytes --]

On Wed, Aug 15, 2012 at 5:48 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> From: Miklos Szeredi <mszeredi@suse.cz>
>
> Add a simple read-only counter to super_block that indicates deep this
> is in the stack of filesystems.  Previously ecryptfs was the only
> stackable filesystem and it explicitly disallowed multiple layers of
> itself.
>
> Overlayfs, however, can be stacked recursively and also may be stacked
> on top of ecryptfs or vice versa.
>
> To limit the kernel stack usage we must limit the depth of the
> filesystem stack.  Initially the limit is set to 2.
>

Hi,

I have tested OverlayFS for a long time with "fs-stack-depth=3".
The original OverlayFS test-case script  from Jordi was slightly
modified (see "testcase-ovl-v3.sh").
I have sent my test-results to Andy and Jordi (tested with the
patchset from Andy against Linux-v3.4 [1] with Ext4-FS).
The attached test-case script *requires* "fs-stack-depth=3" to run
properly (patch attached).

So, I have 2 questions:

[1] FS-stack-limitation

Is a "fs-stack-depth>=2" (like "3") critical?
Is your setting to "2" just a defensive (and initial) one?
Can you explain your choice a bit more as ecryptFS is involved in this
limitation, too.

[2] Test-Case/Use-Case scripts

It would be *very very very* helpful if you could provide or even ship
in the Linux-kernel a test-case/use-case script, Thanks!
Maybe describe "Documentation/filesystems/overlayfs.txt" would be a good place?
Helpful could be a "simple" and a "complex" testing scenario?

- Sedat -


[1] http://git.kernel.org/?p=linux/kernel/git/apw/overlayfs.git;a=shortlog;h=refs/heads/overlayfs.v12apw1

> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> ---
>  fs/ecryptfs/main.c   |    7 +++++++
>  fs/overlayfs/super.c |   10 ++++++++++
>  include/linux/fs.h   |   11 +++++++++++
>  3 files changed, 28 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
> index 2768138..344fb2c 100644
> --- a/fs/ecryptfs/main.c
> +++ b/fs/ecryptfs/main.c
> @@ -565,6 +565,13 @@ static struct dentry *ecryptfs_mount(struct file_system_type *fs_type, int flags
>         s->s_maxbytes = path.dentry->d_sb->s_maxbytes;
>         s->s_blocksize = path.dentry->d_sb->s_blocksize;
>         s->s_magic = ECRYPTFS_SUPER_MAGIC;
> +       s->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
> +
> +       rc = -EINVAL;
> +       if (s->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> +               printk(KERN_ERR "eCryptfs: maximum fs stacking depth exceeded\n");
> +               goto out_free;
> +       }
>
>         inode = ecryptfs_get_inode(path.dentry->d_inode, s);
>         rc = PTR_ERR(inode);
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 69a2099..64d2695 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -551,6 +551,16 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>             !S_ISDIR(lowerpath.dentry->d_inode->i_mode))
>                 goto out_put_lowerpath;
>
> +       sb->s_stack_depth = max(upperpath.mnt->mnt_sb->s_stack_depth,
> +                               lowerpath.mnt->mnt_sb->s_stack_depth) + 1;
> +
> +       err = -EINVAL;
> +       if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> +               printk(KERN_ERR "overlayfs: maximum fs stacking depth exceeded\n");
> +               goto out_put_lowerpath;
> +       }
> +
> +
>         ufs->upper_mnt = clone_private_mount(&upperpath);
>         err = PTR_ERR(ufs->upper_mnt);
>         if (IS_ERR(ufs->upper_mnt)) {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index abc7a53..1265e24 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -505,6 +505,12 @@ struct iattr {
>   */
>  #include <linux/quota.h>
>
> +/*
> + * Maximum number of layers of fs stack.  Needs to be limited to
> + * prevent kernel stack overflow
> + */
> +#define FILESYSTEM_MAX_STACK_DEPTH 2
> +
>  /**
>   * enum positive_aop_returns - aop return codes with specific semantics
>   *
> @@ -1579,6 +1585,11 @@ struct super_block {
>
>         /* Being remounted read-only */
>         int s_readonly_remount;
> +
> +       /*
> +        * Indicates how deep in a filesystem stack this SB is
> +        */
> +       int s_stack_depth;
>  };
>
>  /* superblock cache pruning functions */
> --
> 1.7.7
>

[-- Attachment #2: testcase-ovl-v3.sh --]
[-- Type: application/x-sh, Size: 4801 bytes --]

[-- Attachment #3: 0001-fs-Increase-limit-filesystem-stacking-depth-to-3.patch --]
[-- Type: application/octet-stream, Size: 796 bytes --]

From 48e8d478c7afb3fcd9f75b310adda15b48258a54 Mon Sep 17 00:00:00 2001
From: Sedat Dilek <sedat.dilek@gmail.com>
Date: Sat, 28 Apr 2012 17:11:16 +0200
Subject: [PATCH] fs: Increase limit filesystem stacking depth to 3


Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
---
 include/linux/fs.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index fdd1d38..ad333e5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -493,7 +493,7 @@ struct iattr {
  * Maximum number of layers of fs stack.  Needs to be limited to
  * prevent kernel stack overflow
  */
-#define FILESYSTEM_MAX_STACK_DEPTH 2
+#define FILESYSTEM_MAX_STACK_DEPTH 3
 
 /** 
  * enum positive_aop_returns - aop return codes with specific semantics
-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 08/13] fs: limit filesystem stacking depth
  2012-08-16  8:02   ` Sedat Dilek
@ 2012-08-16  8:30     ` Sedat Dilek
  2012-08-16 10:42     ` Miklos Szeredi
  1 sibling, 0 replies; 38+ messages in thread
From: Sedat Dilek @ 2012-08-16  8:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, jordipujolp, ezk, ricwheeler, dhowells, hpj, sedat.dilek,
	penberg, goran.cetusic, romain, mszeredi

On Thu, Aug 16, 2012 at 10:02 AM, Sedat Dilek <sedat.dilek@gmail.com> wrote:
> On Wed, Aug 15, 2012 at 5:48 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> From: Miklos Szeredi <mszeredi@suse.cz>
>>
>> Add a simple read-only counter to super_block that indicates deep this
>> is in the stack of filesystems.  Previously ecryptfs was the only
>> stackable filesystem and it explicitly disallowed multiple layers of
>> itself.
>>
>> Overlayfs, however, can be stacked recursively and also may be stacked
>> on top of ecryptfs or vice versa.
>>
>> To limit the kernel stack usage we must limit the depth of the
>> filesystem stack.  Initially the limit is set to 2.
>>
>
> Hi,
>
> I have tested OverlayFS for a long time with "fs-stack-depth=3".
> The original OverlayFS test-case script  from Jordi was slightly
> modified (see "testcase-ovl-v3.sh").
> I have sent my test-results to Andy and Jordi (tested with the
> patchset from Andy against Linux-v3.4 [1] with Ext4-FS).
> The attached test-case script *requires* "fs-stack-depth=3" to run
> properly (patch attached).
>
> So, I have 2 questions:
>
> [1] FS-stack-limitation
>
> Is a "fs-stack-depth>=2" (like "3") critical?
> Is your setting to "2" just a defensive (and initial) one?
> Can you explain your choice a bit more as ecryptFS is involved in this
> limitation, too.
>
> [2] Test-Case/Use-Case scripts
>
> It would be *very very very* helpful if you could provide or even ship
> in the Linux-kernel a test-case/use-case script, Thanks!
> Maybe describe "Documentation/filesystems/overlayfs.txt" would be a good place?
> Helpful could be a "simple" and a "complex" testing scenario?
>
> - Sedat -
>
>
> [1] http://git.kernel.org/?p=linux/kernel/git/apw/overlayfs.git;a=shortlog;h=refs/heads/overlayfs.v12apw1
>

[ Removing <hramrach@centrum.cz> (Email-address seems to be dead) ]

Just for the sake of completeness and some notes:

Jordi dropped your "fs-stack-depth" patch completely in his tests.
AFAICS he uses OverlayFS for his debian-based own distro.
I put some references to his OverlayFS work in [1] and [2].

Pesonally, I have tested against Debian/sid.
With Ubuntu/precise I got a rebel (just wanting a stable development
environment than wasting my time with fixing unstable packages - as
there is Debian/wheezy freeze sid (unstable) branch/distribution
cannot be called a "working environment") and did my recent testings
on this platform.

- Sedat -

[1] http://livenet.selfip.com/ftp/debian/overlayfs/
[2] http://livenet.selfip.com/ftp/debian/overlayfs/test/


>> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
>> ---
>>  fs/ecryptfs/main.c   |    7 +++++++
>>  fs/overlayfs/super.c |   10 ++++++++++
>>  include/linux/fs.h   |   11 +++++++++++
>>  3 files changed, 28 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
>> index 2768138..344fb2c 100644
>> --- a/fs/ecryptfs/main.c
>> +++ b/fs/ecryptfs/main.c
>> @@ -565,6 +565,13 @@ static struct dentry *ecryptfs_mount(struct file_system_type *fs_type, int flags
>>         s->s_maxbytes = path.dentry->d_sb->s_maxbytes;
>>         s->s_blocksize = path.dentry->d_sb->s_blocksize;
>>         s->s_magic = ECRYPTFS_SUPER_MAGIC;
>> +       s->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
>> +
>> +       rc = -EINVAL;
>> +       if (s->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
>> +               printk(KERN_ERR "eCryptfs: maximum fs stacking depth exceeded\n");
>> +               goto out_free;
>> +       }
>>
>>         inode = ecryptfs_get_inode(path.dentry->d_inode, s);
>>         rc = PTR_ERR(inode);
>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>> index 69a2099..64d2695 100644
>> --- a/fs/overlayfs/super.c
>> +++ b/fs/overlayfs/super.c
>> @@ -551,6 +551,16 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>>             !S_ISDIR(lowerpath.dentry->d_inode->i_mode))
>>                 goto out_put_lowerpath;
>>
>> +       sb->s_stack_depth = max(upperpath.mnt->mnt_sb->s_stack_depth,
>> +                               lowerpath.mnt->mnt_sb->s_stack_depth) + 1;
>> +
>> +       err = -EINVAL;
>> +       if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
>> +               printk(KERN_ERR "overlayfs: maximum fs stacking depth exceeded\n");
>> +               goto out_put_lowerpath;
>> +       }
>> +
>> +
>>         ufs->upper_mnt = clone_private_mount(&upperpath);
>>         err = PTR_ERR(ufs->upper_mnt);
>>         if (IS_ERR(ufs->upper_mnt)) {
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index abc7a53..1265e24 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -505,6 +505,12 @@ struct iattr {
>>   */
>>  #include <linux/quota.h>
>>
>> +/*
>> + * Maximum number of layers of fs stack.  Needs to be limited to
>> + * prevent kernel stack overflow
>> + */
>> +#define FILESYSTEM_MAX_STACK_DEPTH 2
>> +
>>  /**
>>   * enum positive_aop_returns - aop return codes with specific semantics
>>   *
>> @@ -1579,6 +1585,11 @@ struct super_block {
>>
>>         /* Being remounted read-only */
>>         int s_readonly_remount;
>> +
>> +       /*
>> +        * Indicates how deep in a filesystem stack this SB is
>> +        */
>> +       int s_stack_depth;
>>  };
>>
>>  /* superblock cache pruning functions */
>> --
>> 1.7.7
>>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 07/13] overlay: overlay filesystem documentation
  2012-08-15 19:53   ` J. Bruce Fields
@ 2012-08-16 10:09     ` Miklos Szeredi
  0 siblings, 0 replies; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-16 10:09 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, jordipujolp, ezk, ricwheeler, dhowells, hpj, sedat.dilek,
	penberg, goran.cetusic, romain

"J. Bruce Fields" <bfields@fieldses.org> writes:

> Sorry, just trivial typos:

Thanks, fixed.

Miklos

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 01/13] vfs: add i_op->open()
  2012-08-15 20:28     ` NeilBrown
@ 2012-08-16 10:10       ` Miklos Szeredi
  0 siblings, 0 replies; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-16 10:10 UTC (permalink / raw)
  To: NeilBrown
  Cc: J. Bruce Fields, viro, linux-fsdevel, linux-kernel, hch,
	torvalds, akpm, apw, nbd, jordipujolp, ezk, ricwheeler, dhowells,
	hpj, sedat.dilek, penberg, goran.cetusic, romain

NeilBrown <neilb@suse.de> writes:

> On Wed, 15 Aug 2012 13:21:50 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> wrote:
>
>> On Wed, Aug 15, 2012 at 05:48:08PM +0200, Miklos Szeredi wrote:
>> > From: Miklos Szeredi <mszeredi@suse.cz>
>> > 
>> > Add a new inode operation i_op->open().  This is for stacked
>> 
>> Shouldn't that "->open()" be "->dentry_open()" ?
>> 
>> --b.
>> 
>> > filesystems that want to return a struct file from a different
>> > filesystem.
>> > 
>> > Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
>> > ---
>> >  Documentation/filesystems/Locking |    2 ++
>> >  Documentation/filesystems/vfs.txt |    7 +++++++
>> >  fs/namei.c                        |    9 ++++++---
>> >  fs/open.c                         |   23 +++++++++++++++++++++--
>> >  include/linux/fs.h                |    2 ++
>> >  5 files changed, 38 insertions(+), 5 deletions(-)
>> > 
>> > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
>> > index 0f103e3..d222b6a 100644
>> > --- a/Documentation/filesystems/Locking
>> > +++ b/Documentation/filesystems/Locking
>> > @@ -64,6 +64,7 @@ prototypes:
>> >  	int (*atomic_open)(struct inode *, struct dentry *,
>> >  				struct file *, unsigned open_flag,
>> >  				umode_t create_mode, int *opened);
>> > +	int (*dentry_open)(struct dentry *, struct file *, const struct cred *);
>> >  
>> >  locking rules:
>> >  	all may block
>> > @@ -92,6 +93,7 @@ removexattr:	yes
>> >  fiemap:		no
>> >  update_time:	no
>> >  atomic_open:	yes
>> > +open:		no
>> >  
>
> and that last line should be
>     +dentry_open:       no
> ??

Yes.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 04/13] overlay filesystem
  2012-08-16  6:24   ` Eric W. Biederman
@ 2012-08-16 10:25     ` Miklos Szeredi
  0 siblings, 0 replies; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-16 10:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain

ebiederm@xmission.com (Eric W. Biederman) writes:

> Miklos Szeredi <miklos@szeredi.hu> writes:
>
> Minor nits below.
>
>> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
>> new file mode 100644
>> index 0000000..6b50823
>> --- /dev/null
>> +++ b/fs/overlayfs/dir.c
>> @@ -0,0 +1,598 @@
>> +/*
>> + *
>> + * Copyright (C) 2011 Novell Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include <linux/fs.h>
>> +#include <linux/namei.h>
>> +#include <linux/xattr.h>
>> +#include <linux/security.h>
>> +#include <linux/cred.h>
>> +#include "overlayfs.h"
>> +
>> +static const char *ovl_whiteout_symlink = "(overlay-whiteout)";
>> +
>> +static int ovl_whiteout(struct dentry *upperdir, struct dentry *dentry)
>> +{
>> +	int err;
>> +	struct dentry *newdentry;
>> +	const struct cred *old_cred;
>> +	struct cred *override_cred;
>> +
>> +	/* FIXME: recheck lower dentry to see if whiteout is really
>>  	needed */
>
> Is that FIXME still valid?

It is, but it's not an important feature.  Lacking this will mean once a
file/directory is marked whiteout or opaque on the upper filesystem it
will remain so forever even after the file/directory it is masking out
has been removed from the lower filesystem.

However this cannot be observed by looking at the overlay, only by
looking at the underlying filesystems.


>
>> +	err = -ENOMEM;
>> +	override_cred = prepare_creds();
>> +	if (!override_cred)
>> +		goto out;
>> +
>> +	/*
>> +	 * CAP_SYS_ADMIN for setxattr
>> +	 * CAP_DAC_OVERRIDE for symlink creation
>> +	 * CAP_FOWNER for unlink in sticky directory
>> +	 */
>> +	cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
>> +	cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
>> +	cap_raise(override_cred->cap_effective, CAP_FOWNER);
>> +	override_cred->fsuid = 0;
>> +	override_cred->fsgid = 0;
>
> Could you please make these GLOBAL_ROOT_UID and GLOBAL_ROOT_GID
> instead of 0?  Otherwise this code won't compile with the usernamespace
> bits enabled.

Okay.

Thanks for the review.

Miklos

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 08/13] fs: limit filesystem stacking depth
  2012-08-16  8:02   ` Sedat Dilek
  2012-08-16  8:30     ` Sedat Dilek
@ 2012-08-16 10:42     ` Miklos Szeredi
  2012-08-16 13:24       ` Sedat Dilek
  1 sibling, 1 reply; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-16 10:42 UTC (permalink / raw)
  To: sedat.dilek
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, jordipujolp, ezk, ricwheeler, dhowells, hpj, sedat.dilek,
	penberg, goran.cetusic, romain

Sedat Dilek <sedat.dilek@gmail.com> writes:

> On Wed, Aug 15, 2012 at 5:48 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> From: Miklos Szeredi <mszeredi@suse.cz>
>>
>> Add a simple read-only counter to super_block that indicates deep this
>> is in the stack of filesystems.  Previously ecryptfs was the only
>> stackable filesystem and it explicitly disallowed multiple layers of
>> itself.
>>
>> Overlayfs, however, can be stacked recursively and also may be stacked
>> on top of ecryptfs or vice versa.
>>
>> To limit the kernel stack usage we must limit the depth of the
>> filesystem stack.  Initially the limit is set to 2.
>>
>
> Hi,
>
> I have tested OverlayFS for a long time with "fs-stack-depth=3".
> The original OverlayFS test-case script  from Jordi was slightly
> modified (see "testcase-ovl-v3.sh").
> I have sent my test-results to Andy and Jordi (tested with the
> patchset from Andy against Linux-v3.4 [1] with Ext4-FS).
> The attached test-case script *requires* "fs-stack-depth=3" to run
> properly (patch attached).
>
> So, I have 2 questions:
>
> [1] FS-stack-limitation
>
> Is a "fs-stack-depth>=2" (like "3") critical?
> Is your setting to "2" just a defensive (and initial) one?
> Can you explain your choice a bit more as ecryptFS is involved in this
> limitation, too.

If directly stacking filesystems like this on top of each other
(ecryptfs is currently the only filesystem that does this in mainline)
then the call chain can get too long and the kernel stack overflow.

Yes, setting it to 2 is defensive, it would need more stack depth
analysis to see what an acceptable number would be.


> [2] Test-Case/Use-Case scripts
>
> It would be *very very very* helpful if you could provide or even ship
> in the Linux-kernel a test-case/use-case script, Thanks!

Sure, I could add Andy's test script under the tools/ directory.  But I
don't understand why exactly it needs the stacking depth to be
increased.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 08/13] fs: limit filesystem stacking depth
  2012-08-16 10:42     ` Miklos Szeredi
@ 2012-08-16 13:24       ` Sedat Dilek
  2012-09-03 15:05         ` Miklos Szeredi
  0 siblings, 1 reply; 38+ messages in thread
From: Sedat Dilek @ 2012-08-16 13:24 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, jordipujolp, ezk, ricwheeler, dhowells, hpj, sedat.dilek,
	penberg, goran.cetusic, romain

[-- Attachment #1: Type: text/plain, Size: 2848 bytes --]

On Thu, Aug 16, 2012 at 12:42 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> Sedat Dilek <sedat.dilek@gmail.com> writes:
>
>> On Wed, Aug 15, 2012 at 5:48 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>> From: Miklos Szeredi <mszeredi@suse.cz>
>>>
>>> Add a simple read-only counter to super_block that indicates deep this
>>> is in the stack of filesystems.  Previously ecryptfs was the only
>>> stackable filesystem and it explicitly disallowed multiple layers of
>>> itself.
>>>
>>> Overlayfs, however, can be stacked recursively and also may be stacked
>>> on top of ecryptfs or vice versa.
>>>
>>> To limit the kernel stack usage we must limit the depth of the
>>> filesystem stack.  Initially the limit is set to 2.
>>>
>>
>> Hi,
>>
>> I have tested OverlayFS for a long time with "fs-stack-depth=3".
>> The original OverlayFS test-case script  from Jordi was slightly
>> modified (see "testcase-ovl-v3.sh").
>> I have sent my test-results to Andy and Jordi (tested with the
>> patchset from Andy against Linux-v3.4 [1] with Ext4-FS).
>> The attached test-case script *requires* "fs-stack-depth=3" to run
>> properly (patch attached).
>>
>> So, I have 2 questions:
>>
>> [1] FS-stack-limitation
>>
>> Is a "fs-stack-depth>=2" (like "3") critical?
>> Is your setting to "2" just a defensive (and initial) one?
>> Can you explain your choice a bit more as ecryptFS is involved in this
>> limitation, too.
>
> If directly stacking filesystems like this on top of each other
> (ecryptfs is currently the only filesystem that does this in mainline)
> then the call chain can get too long and the kernel stack overflow.
>
> Yes, setting it to 2 is defensive, it would need more stack depth
> analysis to see what an acceptable number would be.
>

Can you describe such an analysis method (in case you need help for testing it)?

>
>> [2] Test-Case/Use-Case scripts
>>
>> It would be *very very very* helpful if you could provide or even ship
>> in the Linux-kernel a test-case/use-case script, Thanks!
>
> Sure, I could add Andy's test script under the tools/ directory.  But I
> don't understand why exactly it needs the stacking depth to be
> increased.
>

No, it was Jordi's test-case script :-).
Unfortunately, my modified version had a brownbag included and will
not run (forgot a comment sign).
v4 attached is included in the atched tarball (see scripts/).

I have added my test-results against a slightly modified Linux-Next
(next-20120816) kernel (see patches/).

All relevant material is in the TAR-XZ archive (see also attached ls-lR.txt).

AFAICS Jordi is creating 3x Upper/Lower/Root dirs/mounts/etc., that's
why a "fs-stack-max-depth=3" is minimum requirement.
( Just FYI: The "LOG-24G" log-file below
TEST-3.6.0-rc1-next20120816-1-iniza-generic-DLq/ has detailed
informations. )

Hope this helps you.

- Sedat -

> Thanks,
> Miklos

[-- Attachment #2: ls-lR.txt --]
[-- Type: text/plain, Size: 2201 bytes --]

.:
total 20
drwxr-xr-x 9 wearefam wearefam 4096 Aug 16 15:06 TEST-3.6.0-rc1-next20120816-1-iniza-generic-DLq
drwxrwxr-x 2 wearefam wearefam 4096 Aug 16 15:06 kernel-config
drwxrwxr-x 2 wearefam wearefam 4096 Aug 16 15:10 logs
-rw-rw-r-- 1 wearefam wearefam    0 Aug 16 15:11 ls-lR.txt
drwxrwxr-x 2 wearefam wearefam 4096 Aug 16 15:11 patches
drwxrwxr-x 2 wearefam wearefam 4096 Aug 16 15:05 scripts

./TEST-3.6.0-rc1-next20120816-1-iniza-generic-DLq:
total 4736
drwxr-xr-x 2 wearefam wearefam      4096 Aug 16 14:54 COW-r0b
-rw-r--r-- 1 wearefam wearefam 134217728 Aug 16 14:54 COWFILE-ZdO
-rw-r--r-- 1 wearefam wearefam    127944 Aug 16 14:54 LOG-24G
drwxr-xr-x 2 wearefam wearefam      4096 Aug 16 14:54 ROOT-RO-DmL
drwxr-xr-x 2 wearefam wearefam      4096 Aug 16 14:54 ROOT-RO-dxr
drwxr-xr-x 2 wearefam wearefam      4096 Aug 16 14:54 ROOT-sxo
drwxr-xr-x 2 wearefam wearefam      4096 Aug 16 14:54 UPPER-OCW
drwxr-xr-x 2 wearefam wearefam      4096 Aug 16 14:54 UPPER-ULU
drwxr-xr-x 2 wearefam wearefam      4096 Aug 16 14:54 UPPER-nJm
-rw-r--r-- 1 wearefam wearefam      4096 Aug 16 14:54 WORK-6S5.squashfs
-rw-r--r-- 1 wearefam wearefam      4096 Aug 16 14:54 WORK-gSq.squashfs
-rw-r--r-- 1 wearefam wearefam      4096 Aug 16 14:54 WORK-wBd.squashfs

./TEST-3.6.0-rc1-next20120816-1-iniza-generic-DLq/COW-r0b:
total 0

./TEST-3.6.0-rc1-next20120816-1-iniza-generic-DLq/ROOT-RO-DmL:
total 0

./TEST-3.6.0-rc1-next20120816-1-iniza-generic-DLq/ROOT-RO-dxr:
total 0

./TEST-3.6.0-rc1-next20120816-1-iniza-generic-DLq/ROOT-sxo:
total 0

./TEST-3.6.0-rc1-next20120816-1-iniza-generic-DLq/UPPER-OCW:
total 0

./TEST-3.6.0-rc1-next20120816-1-iniza-generic-DLq/UPPER-ULU:
total 0

./TEST-3.6.0-rc1-next20120816-1-iniza-generic-DLq/UPPER-nJm:
total 0

./kernel-config:
total 144
-rw-r--r-- 1 wearefam wearefam 145381 Aug 16 13:42 config-3.6.0-rc1-next20120816-1-iniza-generic

./logs:
total 56
-rw-rw-r-- 1 wearefam wearefam 54421 Aug 16 15:09 dmesg_3.6.0-rc1-next20120816-1-iniza-generic_HIDDEN.txt

./patches:
total 136
-rw-rw-r-- 1 wearefam wearefam 137929 Aug 16 12:47 3.6.0-rc1-next20120816-1-iniza-generic.patch

./scripts:
total 8
-rwxr-xr-x 1 wearefam wearefam 4925 Aug 16 14:58 testcase-ovl-v4.sh

[-- Attachment #3: overlayfs-v14_tested-by-dileks.tar.xz --]
[-- Type: application/octet-stream, Size: 107384 bytes --]

[-- Attachment #4: overlayfs-v14_tested-by-dileks.tar.xz.sha256sum --]
[-- Type: application/octet-stream, Size: 104 bytes --]

89d4c6e58a550fe6f414e2066bb9a982dfc5375586f6393ab0d3baac00b88eed  overlayfs-v14_tested-by-dileks.tar.xz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 05/13] overlayfs: add statfs support
  2012-08-15 15:48 ` [PATCH 05/13] overlayfs: add statfs support Miklos Szeredi
@ 2012-08-17 18:20   ` Ben Hutchings
  2012-08-29 22:48     ` Miklos Szeredi
  0 siblings, 1 reply; 38+ messages in thread
From: Ben Hutchings @ 2012-08-17 18:20 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi

On Wed, Aug 15, 2012 at 05:48:12PM +0200, Miklos Szeredi wrote:
> From: Andy Whitcroft <apw@canonical.com>
> 
> Add support for statfs to the overlayfs filesystem.  As the upper layer
> is the target of all write operations assume that the space in that
> filesystem is the space in the overlayfs.  There will be some inaccuracy as
> overwriting a file will copy it up and consume space we were not expecting,
> but it is better than nothing.
> 
> Use the upper layer dentry and mount from the overlayfs root inode,
> passing the statfs call to that filesystem.
[...] 
> +/**
> + * ovl_statfs
> + * @sb: The overlayfs super block
> + * @buf: The struct kstatfs to fill in with stats
> + *
> + * Get the filesystem statistics.  As writes always target the upper layer
> + * filesystem pass the statfs to the same filesystem.
> + */
> +static int ovl_statfs(struct dentry *dentry, struct kstatfs *buf)
> +{
> +	struct dentry *root_dentry = dentry->d_sb->s_root;
> +	struct path path;
> +	ovl_path_upper(root_dentry, &path);
> +
> +	if (!path.dentry->d_sb->s_op->statfs)
> +		return -ENOSYS;
> +	return path.dentry->d_sb->s_op->statfs(path.dentry, buf);
> +}
[...]

In case f_namelen differs between the upper and lower filesystems, you
need to return the greater of the two.

Should f_type be overridden to indicate overlayfs?  I'm not sure what
userland is likely to do with f_type.  (For presentation to the user,
it should get the mount type name with getmntent() or libmount.  And
that will just work.)

Ben.

-- 
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
                                                              - Albert Camus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 05/13] overlayfs: add statfs support
  2012-08-17 18:20   ` Ben Hutchings
@ 2012-08-29 22:48     ` Miklos Szeredi
  2012-08-30  5:54       ` Ben Hutchings
  0 siblings, 1 reply; 38+ messages in thread
From: Miklos Szeredi @ 2012-08-29 22:48 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain

Ben Hutchings <ben@decadent.org.uk> writes:

> On Wed, Aug 15, 2012 at 05:48:12PM +0200, Miklos Szeredi wrote:
>> From: Andy Whitcroft <apw@canonical.com>
>> 
>> Add support for statfs to the overlayfs filesystem.  As the upper layer
>> is the target of all write operations assume that the space in that
>> filesystem is the space in the overlayfs.  There will be some inaccuracy as
>> overwriting a file will copy it up and consume space we were not expecting,
>> but it is better than nothing.
>> 
>> Use the upper layer dentry and mount from the overlayfs root inode,
>> passing the statfs call to that filesystem.
> [...] 
>> +/**
>> + * ovl_statfs
>> + * @sb: The overlayfs super block
>> + * @buf: The struct kstatfs to fill in with stats
>> + *
>> + * Get the filesystem statistics.  As writes always target the upper layer
>> + * filesystem pass the statfs to the same filesystem.
>> + */
>> +static int ovl_statfs(struct dentry *dentry, struct kstatfs *buf)
>> +{
>> +	struct dentry *root_dentry = dentry->d_sb->s_root;
>> +	struct path path;
>> +	ovl_path_upper(root_dentry, &path);
>> +
>> +	if (!path.dentry->d_sb->s_op->statfs)
>> +		return -ENOSYS;
>> +	return path.dentry->d_sb->s_op->statfs(path.dentry, buf);
>> +}
> [...]
>
> In case f_namelen differs between the upper and lower filesystems, you
> need to return the greater of the two.

Maybe.  I've never seen any app use f_namelen for anything useful.

>
> Should f_type be overridden to indicate overlayfs?  I'm not sure what
> userland is likely to do with f_type.  (For presentation to the user,
> it should get the mount type name with getmntent() or libmount.  And
> that will just work.)

Yeah we could do that.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 05/13] overlayfs: add statfs support
  2012-08-29 22:48     ` Miklos Szeredi
@ 2012-08-30  5:54       ` Ben Hutchings
  2012-08-31 12:47         ` J. R. Okajima
  0 siblings, 1 reply; 38+ messages in thread
From: Ben Hutchings @ 2012-08-30  5:54 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain

[-- Attachment #1: Type: text/plain, Size: 1850 bytes --]

On Thu, 2012-08-30 at 00:48 +0200, Miklos Szeredi wrote:
> Ben Hutchings <ben@decadent.org.uk> writes:
> 
> > On Wed, Aug 15, 2012 at 05:48:12PM +0200, Miklos Szeredi wrote:
> >> From: Andy Whitcroft <apw@canonical.com>
> >> 
> >> Add support for statfs to the overlayfs filesystem.  As the upper layer
> >> is the target of all write operations assume that the space in that
> >> filesystem is the space in the overlayfs.  There will be some inaccuracy as
> >> overwriting a file will copy it up and consume space we were not expecting,
> >> but it is better than nothing.
> >> 
> >> Use the upper layer dentry and mount from the overlayfs root inode,
> >> passing the statfs call to that filesystem.
> > [...] 
> >> +/**
> >> + * ovl_statfs
> >> + * @sb: The overlayfs super block
> >> + * @buf: The struct kstatfs to fill in with stats
> >> + *
> >> + * Get the filesystem statistics.  As writes always target the upper layer
> >> + * filesystem pass the statfs to the same filesystem.
> >> + */
> >> +static int ovl_statfs(struct dentry *dentry, struct kstatfs *buf)
> >> +{
> >> +	struct dentry *root_dentry = dentry->d_sb->s_root;
> >> +	struct path path;
> >> +	ovl_path_upper(root_dentry, &path);
> >> +
> >> +	if (!path.dentry->d_sb->s_op->statfs)
> >> +		return -ENOSYS;
> >> +	return path.dentry->d_sb->s_op->statfs(path.dentry, buf);
> >> +}
> > [...]
> >
> > In case f_namelen differs between the upper and lower filesystems, you
> > need to return the greater of the two.
> 
> Maybe.  I've never seen any app use f_namelen for anything useful.
[...]

If I'm not mistaken, glibc uses it to implement pathconf(_PC_NAME_MAX),
which may be used by applications in conjunction with readdir_r().

Ben.

-- 
Ben Hutchings
Quantity is no substitute for quality, but it's the only one we've got.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 05/13] overlayfs: add statfs support
  2012-08-30  5:54       ` Ben Hutchings
@ 2012-08-31 12:47         ` J. R. Okajima
  0 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2012-08-31 12:47 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Miklos Szeredi, viro, linux-fsdevel, linux-kernel, hch, torvalds,
	akpm, apw, nbd, neilb, hramrach, jordipujolp, ezk, ricwheeler,
	dhowells, hpj, sedat.dilek, penberg, goran.cetusic, romain


Ben Hutchings:
> If I'm not mistaken, glibc uses it to implement pathconf(_PC_NAME_MAX),
> which may be used by applications in conjunction with readdir_r().

I agree pathconf(3) depends upon the filesystem type. I've posted about
the _PC_LINK_MAX and statfs(2). See my past post and its thread in detail.

[RFC 0/5] pathconf(3) with _PC_LINK_MAX
<http://marc.info/?l=linux-kernel&m=126008634810706&w=2>


J. R. Okajima


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 08/13] fs: limit filesystem stacking depth
  2012-08-16 13:24       ` Sedat Dilek
@ 2012-09-03 15:05         ` Miklos Szeredi
  0 siblings, 0 replies; 38+ messages in thread
From: Miklos Szeredi @ 2012-09-03 15:05 UTC (permalink / raw)
  To: sedat.dilek
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, jordipujolp, ezk, ricwheeler, dhowells, hpj, sedat.dilek,
	penberg, goran.cetusic, romain

Sedat Dilek <sedat.dilek@gmail.com> writes:

>>
>> Yes, setting it to 2 is defensive, it would need more stack depth
>> analysis to see what an acceptable number would be.
>>
>
> Can you describe such an analysis method (in case you need help for
> testing it)?


I attached a systemtap script (x86-64 specific) which prints stack use
for stacked overlayfs filesystems.  Interpret output like this:

7288    0  ovl_lookup (bash/5721)
^          ^
|          |
|          +- function name (process/pid)
|
+------------ current stack use

7080  208          ovl_permission (bash/5721)
      ^    ^
      |    |
      |    +- stacking depth indicated by indentation
      |
      +------ stack increase from previous stacking level


You can try it on various setups (overlayfs being used as the lower
and/or upper level) and executing various filesystem operations.

Looks like "copy up" is the most stack hungry operation, it may be worth
trying to reduce its stack usage.

Thanks,
Miklos
---

global rec_level, stacks

probe
	kernel.function("ovl_permission"),
	kernel.function("ovl_getattr"),
	kernel.function("ovl_dir_getattr"),
	kernel.function("ovl_setattr"),
	kernel.function("ovl_setxattr"),
	kernel.function("ovl_listxattr"),
	kernel.function("ovl_removexattr"),
	kernel.function("ovl_dentry_open"),
	kernel.function("ovl_lookup"),
	kernel.function("ovl_mkdir"),
	kernel.function("ovl_symlink"),
	kernel.function("ovl_unlink"),
	kernel.function("ovl_rmdir"),
	kernel.function("ovl_rename"),
	kernel.function("ovl_link"),
	kernel.function("ovl_create"),
	kernel.function("ovl_mknod"),
	kernel.function("ovl_follow_link"),
	kernel.function("ovl_put_link"),
	kernel.function("ovl_readlink"),
	kernel.function("ovl_dir_open"),
	kernel.function("ovl_readdir"),
	kernel.function("ovl_dir_llseek"),
	kernel.function("ovl_dir_fsync"),
	kernel.function("ovl_dir_release"),
	kernel.function("ovl_dentry_release"),
	kernel.function("ovl_put_super"),
	kernel.function("ovl_statfs")
{
	tid = tid();
	i = rec_level[tid]++;
	stack_rem = u_register("rbp") & 0x1fff;
	stacks[tid, i] = stack_rem;
	delta = i > 0 ? stacks[tid, i - 1] - stack_rem : 0;
	printf("%4i %4i %-*s %s (%s/%i)\n", stack_rem, delta, i * 8, "", probefunc(), execname(), tid);
}

probe
	kernel.function("ovl_permission").return,
	kernel.function("ovl_getattr").return,
	kernel.function("ovl_dir_getattr").return,
	kernel.function("ovl_setattr").return,
	kernel.function("ovl_setxattr").return,
	kernel.function("ovl_listxattr").return,
	kernel.function("ovl_removexattr").return,
	kernel.function("ovl_dentry_open").return,
	kernel.function("ovl_lookup").return,
	kernel.function("ovl_mkdir").return,
	kernel.function("ovl_symlink").return,
	kernel.function("ovl_unlink").return,
	kernel.function("ovl_rmdir").return,
	kernel.function("ovl_rename").return,
	kernel.function("ovl_link").return,
	kernel.function("ovl_create").return,
	kernel.function("ovl_mknod").return,
	kernel.function("ovl_follow_link").return,
	kernel.function("ovl_put_link").return,
	kernel.function("ovl_readlink").return,
	kernel.function("ovl_dir_open").return,
	kernel.function("ovl_readdir").return,
	kernel.function("ovl_dir_llseek").return,
	kernel.function("ovl_dir_fsync").return,
	kernel.function("ovl_dir_release").return,
	kernel.function("ovl_dentry_release").return,
	kernel.function("ovl_put_super").return,
	kernel.function("ovl_statfs").return
{
	rec_level[tid()]--;
}



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 07/13] overlay: overlay filesystem documentation
  2012-08-15 15:48 ` [PATCH 07/13] overlay: overlay filesystem documentation Miklos Szeredi
  2012-08-15 19:53   ` J. Bruce Fields
@ 2012-09-10  1:47   ` Jan Engelhardt
  2012-09-10  3:18     ` NeilBrown
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Engelhardt @ 2012-09-10  1:47 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: viro, linux-fsdevel, linux-kernel, hch, torvalds, akpm, apw, nbd,
	neilb, hramrach, jordipujolp, ezk, ricwheeler, dhowells, hpj,
	sedat.dilek, penberg, goran.cetusic, romain, mszeredi


On Wednesday 2012-08-15 17:48, Miklos Szeredi wrote:
>[...]
>+This is most obvious from the 'st_dev' field returned by stat(2).
>+
>+While directories will report an st_dev from the overlay-filesystem,
>+all non-directory objects will report an st_dev from the lower or
>+upper filesystem that is providing the object.

That would seem to render `rsync --one-filesystem` unusable?
(or similar options in other tools)



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 07/13] overlay: overlay filesystem documentation
  2012-09-10  1:47   ` Jan Engelhardt
@ 2012-09-10  3:18     ` NeilBrown
  0 siblings, 0 replies; 38+ messages in thread
From: NeilBrown @ 2012-09-10  3:18 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Miklos Szeredi, viro, linux-fsdevel, linux-kernel, hch, torvalds,
	akpm, apw, nbd, hramrach, jordipujolp, ezk, ricwheeler, dhowells,
	hpj, sedat.dilek, penberg, goran.cetusic, romain, mszeredi

[-- Attachment #1: Type: text/plain, Size: 952 bytes --]

On Mon, 10 Sep 2012 03:47:04 +0200 (CEST) Jan Engelhardt <jengelh@inai.de>
wrote:

> 
> On Wednesday 2012-08-15 17:48, Miklos Szeredi wrote:
> >[...]
> >+This is most obvious from the 'st_dev' field returned by stat(2).
> >+
> >+While directories will report an st_dev from the overlay-filesystem,
> >+all non-directory objects will report an st_dev from the lower or
> >+upper filesystem that is providing the object.
> 
> That would seem to render `rsync --one-filesystem` unusable?
> (or similar options in other tools)
> 

Would it?  Have you tested? or examine source code? or just guessed?

I quick look at rsync and du suggest that you only consider --one-file-system
when looking at a directory - they assume files are in the same filesystem as
their parent.
I cannot promise that everything would work exactly as expected, but I
suspect most things will..... though that might  depend on your expectations.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 05/13] overlayfs: add statfs support
  2012-09-20 18:55 [PATCH 00/13] overlay filesystem: request for inclusion (v15) Miklos Szeredi
@ 2012-09-20 18:55 ` Miklos Szeredi
  0 siblings, 0 replies; 38+ messages in thread
From: Miklos Szeredi @ 2012-09-20 18:55 UTC (permalink / raw)
  To: viro, torvalds
  Cc: linux-fsdevel, linux-kernel, hch, akpm, apw, nbd, neilb,
	jordipujolp, ezk, dhowells, sedat.dilek, hooanon05, mszeredi

From: Andy Whitcroft <apw@canonical.com>

Add support for statfs to the overlayfs filesystem.  As the upper layer
is the target of all write operations assume that the space in that
filesystem is the space in the overlayfs.  There will be some inaccuracy as
overwriting a file will copy it up and consume space we were not expecting,
but it is better than nothing.

Use the upper layer dentry and mount from the overlayfs root inode,
passing the statfs call to that filesystem.

Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/overlayfs/super.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 02deecd..928b1b1 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -17,15 +17,19 @@
 #include <linux/module.h>
 #include <linux/cred.h>
 #include <linux/sched.h>
+#include <linux/statfs.h>
 #include "overlayfs.h"
 
 MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
 MODULE_DESCRIPTION("Overlay filesystem");
 MODULE_LICENSE("GPL");
 
+#define OVERLAYFS_SUPER_MAGIC 0x794c764f
+
 struct ovl_fs {
 	struct vfsmount *upper_mnt;
 	struct vfsmount *lower_mnt;
+	long lower_namelen;
 };
 
 struct ovl_entry {
@@ -406,9 +410,36 @@ static int ovl_remount_fs(struct super_block *sb, int *flagsp, char *data)
 		return mnt_want_write(ufs->upper_mnt);
 }
 
+/**
+ * ovl_statfs
+ * @sb: The overlayfs super block
+ * @buf: The struct kstatfs to fill in with stats
+ *
+ * Get the filesystem statistics.  As writes always target the upper layer
+ * filesystem pass the statfs to the same filesystem.
+ */
+static int ovl_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+	struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
+	struct dentry *root_dentry = dentry->d_sb->s_root;
+	struct path path;
+	int err;
+
+	ovl_path_upper(root_dentry, &path);
+
+	err = vfs_statfs(&path, buf);
+	if (!err) {
+		buf->f_namelen = max(buf->f_namelen, ofs->lower_namelen);
+		buf->f_type = OVERLAYFS_SUPER_MAGIC;
+	}
+
+	return err;
+}
+
 static const struct super_operations ovl_super_operations = {
 	.put_super	= ovl_put_super,
 	.remount_fs	= ovl_remount_fs,
+	.statfs		= ovl_statfs,
 };
 
 struct ovl_config {
@@ -474,6 +505,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	struct ovl_entry *oe;
 	struct ovl_fs *ufs;
 	struct ovl_config config;
+	struct kstatfs statfs;
 	int err;
 
 	err = ovl_parse_opt((char *) data, &config);
@@ -508,6 +540,13 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	    !S_ISDIR(lowerpath.dentry->d_inode->i_mode))
 		goto out_put_lowerpath;
 
+	err = vfs_statfs(&lowerpath, &statfs);
+	if (err) {
+		printk(KERN_ERR "overlayfs: statfs failed on lowerpath\n");
+		goto out_put_lowerpath;
+	}
+	ufs->lower_namelen = statfs.f_namelen;
+
 	ufs->upper_mnt = clone_private_mount(&upperpath);
 	err = PTR_ERR(ufs->upper_mnt);
 	if (IS_ERR(ufs->upper_mnt)) {
@@ -556,6 +595,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	root_dentry->d_fsdata = oe;
 	root_dentry->d_op = &ovl_dentry_operations;
 
+	sb->s_magic = OVERLAYFS_SUPER_MAGIC;
 	sb->s_op = &ovl_super_operations;
 	sb->s_root = root_dentry;
 	sb->s_fs_info = ufs;
-- 
1.7.7


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2012-09-20 18:58 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-15 15:48 [PATCH 00/13] overlay filesystem: request for inclusion (v14) Miklos Szeredi
2012-08-15 15:48 ` [PATCH 01/13] vfs: add i_op->open() Miklos Szeredi
2012-08-15 17:21   ` J. Bruce Fields
2012-08-15 20:28     ` NeilBrown
2012-08-16 10:10       ` Miklos Szeredi
2012-08-15 15:48 ` [PATCH 02/13] vfs: export do_splice_direct() to modules Miklos Szeredi
2012-08-15 15:48 ` [PATCH 03/13] vfs: introduce clone_private_mount() Miklos Szeredi
2012-08-15 15:48 ` [PATCH 04/13] overlay filesystem Miklos Szeredi
2012-08-16  6:24   ` Eric W. Biederman
2012-08-16 10:25     ` Miklos Szeredi
2012-08-15 15:48 ` [PATCH 05/13] overlayfs: add statfs support Miklos Szeredi
2012-08-17 18:20   ` Ben Hutchings
2012-08-29 22:48     ` Miklos Szeredi
2012-08-30  5:54       ` Ben Hutchings
2012-08-31 12:47         ` J. R. Okajima
2012-08-15 15:48 ` [PATCH 06/13] overlayfs: implement show_options Miklos Szeredi
2012-08-15 15:48 ` [PATCH 07/13] overlay: overlay filesystem documentation Miklos Szeredi
2012-08-15 19:53   ` J. Bruce Fields
2012-08-16 10:09     ` Miklos Szeredi
2012-09-10  1:47   ` Jan Engelhardt
2012-09-10  3:18     ` NeilBrown
2012-08-15 15:48 ` [PATCH 08/13] fs: limit filesystem stacking depth Miklos Szeredi
2012-08-16  8:02   ` Sedat Dilek
2012-08-16  8:30     ` Sedat Dilek
2012-08-16 10:42     ` Miklos Szeredi
2012-08-16 13:24       ` Sedat Dilek
2012-09-03 15:05         ` Miklos Szeredi
2012-08-15 15:48 ` [PATCH 09/13] overlayfs: fix possible leak in ovl_new_inode Miklos Szeredi
2012-08-15 15:48 ` [PATCH 10/13] overlayfs: create new inode in ovl_link Miklos Szeredi
2012-08-15 15:48 ` [PATCH 11/13] vfs: export __inode_permission() to modules Miklos Szeredi
2012-08-15 17:17   ` Sedat Dilek
2012-08-15 15:48 ` [PATCH 12/13] ovl: switch to __inode_permission() Miklos Szeredi
2012-08-15 16:59   ` Casey Schaufler
2012-08-15 17:07     ` Andy Whitcroft
2012-08-15 17:34       ` Casey Schaufler
2012-08-15 15:48 ` [PATCH 13/13] overlayfs: copy up i_uid/i_gid from the underlying inode Miklos Szeredi
2012-08-15 17:14 ` [PATCH 00/13] overlay filesystem: request for inclusion (v14) Sedat Dilek
2012-09-20 18:55 [PATCH 00/13] overlay filesystem: request for inclusion (v15) Miklos Szeredi
2012-09-20 18:55 ` [PATCH 05/13] overlayfs: add statfs support Miklos Szeredi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).