From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754107Ab2BUTxI (ORCPT <rfc822;w@1wt.eu>);
	Tue, 21 Feb 2012 14:53:08 -0500
Received: from mx1.redhat.com ([209.132.183.28]:11417 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753170Ab2BUTw4 (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Tue, 21 Feb 2012 14:52:56 -0500
Organization: Red Hat UK Ltd. Registered Address: Red Hat UK Ltd, Amberley
 Place, 107-111 Peascod Street, Windsor, Berkshire, SI4 1TE, United
 Kingdom.
 Registered in England and Wales under Company Registration No. 3798903
From: David Howells <dhowells@redhat.com>
Subject: [PATCH 18/73] union-mount: Union mounts documentation [ver #2]
To: linux-fsdevel@vger.kernel.org, viro@ZenIV.linux.org.uk,
        valerie.aurora@gmail.com
Cc: linux-kernel@vger.kernel.org, David Howells <dhowells@redhat.com>
Date: Tue, 21 Feb 2012 17:59:47 +0000
Message-ID: <20120221175947.25235.58759.stgit@warthog.procyon.org.uk>
In-Reply-To: <20120221175721.25235.8901.stgit@warthog.procyon.org.uk>
References: <20120221175721.25235.8901.stgit@warthog.procyon.org.uk>
User-Agent: StGIT/0.14.3
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

From: Valerie Aurora <vaurora@redhat.com>

Document design and implementation of union mounts (a.k.a. writable overlays).

With corrections from Andreas Gruenbacher <agruen@suse.de>.

Original-author: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
---
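For illustration only (not part of the patch): a minimal user-space sketch of
the lookup rules described under "Terminology" in the document below
(whiteouts, opaque flags, fallthrus).  All names here (struct layer_dir,
union_lookup_sketch(), example_stack) are hypothetical and exist only for this
example; the in-kernel rules are implemented by __lookup_union() and operate
on dentries, not on arrays like these.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical entry kinds, used only by this sketch. */
enum entry_kind { ENT_NORMAL, ENT_WHITEOUT, ENT_FALLTHRU };

struct entry {
	const char *name;
	enum entry_kind kind;
};

/* One directory as seen in a single layer of the union. */
struct layer_dir {
	int opaque;			/* opaque flag set on this directory */
	const struct entry *entries;
	size_t nr_entries;
};

static const struct entry *find_entry(const struct layer_dir *dir,
				      const char *name)
{
	for (size_t i = 0; i < dir->nr_entries; i++)
		if (strcmp(dir->entries[i].name, name) == 0)
			return &dir->entries[i];
	return NULL;
}

/*
 * Walk the layers from topmost (index 0) downwards.  Returns the index of
 * the layer that provides a visible entry for @name, or -1 if none does.
 */
int union_lookup_sketch(const struct layer_dir *layers, size_t nr_layers,
			const char *name)
{
	assert(layers != NULL && name != NULL);

	for (size_t i = 0; i < nr_layers; i++) {
		const struct entry *e = find_entry(&layers[i], name);

		if (e && e->kind == ENT_NORMAL)
			return (int)i;	/* first visible entry wins */
		if (e && e->kind == ENT_WHITEOUT)
			return -1;	/* whiteout blocks the lookup */
		if (!e && layers[i].opaque)
			return -1;	/* opaque dir, no fallthru entry */
		/* fallthru entry or non-opaque miss: continue downwards */
	}
	return -1;
}

/* Example stack: a writable top layer over one read-only bottom layer. */
static const struct entry top_entries[] = {
	{ "newfile", ENT_NORMAL },
	{ "oldfile", ENT_WHITEOUT },
	{ "shared",  ENT_FALLTHRU },
};
static const struct entry bottom_entries[] = {
	{ "oldfile", ENT_NORMAL },
	{ "shared",  ENT_NORMAL },
	{ "hidden",  ENT_NORMAL },
};
static const struct layer_dir example_stack[] = {
	{ 1, top_entries, 3 },
	{ 0, bottom_entries, 3 },
};
```

With example_stack, "newfile" resolves in layer 0 and "shared" falls through to
layer 1.  "oldfile" is hidden by its whiteout, and "hidden" is not found
because the top directory is opaque and has no fallthru for it.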

 Documentation/filesystems/union-mounts.txt |  712 ++++++++++++++++++++++++++++
 1 files changed, 712 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/union-mounts.txt
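
For illustration only (not part of the patch): the "Hard read-only file
systems" section below describes a per-superblock count of hard read-only
users.  A minimal user-space sketch of that bookkeeping follows.  struct
sb_sketch, hard_ro_get(), hard_ro_put(), remount_rw_sketch() and the error
codes returned are assumptions made for this example, not the names or values
used by the patch set.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant superblock state. */
struct sb_sketch {
	bool read_only;			/* currently mounted read-only */
	unsigned int hard_ro_users;	/* unions using this fs as a lower layer */
};

/* Called when a writable overlay is union mounted over @sb. */
int hard_ro_get(struct sb_sketch *sb)
{
	assert(sb != NULL);
	if (!sb->read_only)
		return -EINVAL;		/* assumed error: lower layer is writable */
	sb->hard_ro_users++;
	return 0;
}

/* Called when the union mount using @sb as a lower layer goes away. */
void hard_ro_put(struct sb_sketch *sb)
{
	assert(sb != NULL && sb->hard_ro_users > 0);
	sb->hard_ro_users--;
}

/* A remount read-write is refused while any union still depends on @sb. */
int remount_rw_sketch(struct sb_sketch *sb)
{
	assert(sb != NULL);
	if (sb->hard_ro_users > 0)
		return -EBUSY;		/* assumed error code */
	sb->read_only = false;
	return 0;
}
```

The point of the count is that the read-only check is enforced for the whole
lifetime of the union, not only at the moment the overlay is mounted.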

diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
new file mode 100644
index 0000000..596bfe6
--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,712 @@
+Union mounts (a.k.a. writable overlays)
+=======================================
+
+This document describes the architecture and current status of union mounts,
+also known as writable overlays.
+
+In this document:
+ - Overview of union mounts
+ - Terminology
+ - VFS implementation
+ - Locking strategy
+ - VFS-fs interface
+ - Userland support
+ - NFS interaction
+ - Status
+ - Contributing to union mounts
+
+Overview
+========
+
+A union mount layers one read-write file system over one or more read-only file
+systems, with all writes going to the writable file system.  The namespace of
+both file systems appears as a combined whole to userland, with files and
+directories on the writable file system covering up any files or directories
+with matching pathnames on the read-only file system.  The read-write file
+system is the "topmost" or "upper" file system and the read-only file systems
+are the "lower" file systems.  A few use cases:
+
+- Root file system on CD with writes saved to hard drive (LiveCD)
+- Multiple virtual machines with the same starting root file system
+- Cluster with NFS mounted root on clients
+
+Most if not all of these problems could be solved with a COW block device or a
+clustered file system (including NFS mounts).  However, for some use cases,
+sharing is more efficient and better performing if done at the file system
+namespace level.  COW block devices only increase their divergence as time goes
+on, and a fully coherent writable file system is unnecessary synchronization
+overhead if no other client needs to see the writes.
+
+What union mounts are not
+-------------------------
+
+Union mounts are not a general-purpose unioning file system.  They do not
+provide a generic "union of namespaces" operation for an arbitrary number of
+file systems.  Many interesting features can be implemented with a generic
+unioning facility: dynamic insertion and removal of branches, write policies
+based on space available, online upgrade, etc.  Some unioning file systems that
+do this are UnionFS and AUFS.
+
+Terminology
+===========
+
+The main physical metaphor for union mounts is that a writable file system is
+mounted "on top" of a read-only file system.  Lookups start at the "topmost"
+read-write file system and travel "down" to the "bottom" read-only file system
+only if no blocking entry exists on the top layer.
+
+Topmost layer: The read-write file system.  Lookups begin here.
+
+Bottom layer: The read-only file system.  Lookups end here.
+
+Path: Combination of the vfsmount and dentry structure.
+
+Follow down: Given a path from the top layer, find the corresponding path on
+the bottom layer.
+
+Follow up: Given a path from the bottom layer, find the corresponding path on
+the top layer.
+
+Whiteout: A directory entry in the top layer that prevents lookups from
+travelling down to the bottom layer.  Created on unlink()/rmdir() if a
+corresponding directory entry exists in the bottom layer.
+
+Opaque flag: A flag on a directory in the top layer that prevents lookups of
+entries in this directory from travelling down to the bottom layer (unless
+there is an explicit fallthru entry allowing that for a particular entry).  Set
+on creation of any new directory in the topmost layer (that is, a directory
+that does not have any matching visible directory below it).
+
+Fallthru: A directory entry which allows lookups to "fall through" to the
+bottom layer for that exact directory entry.  This serves as a placeholder for
+directory entries from the bottom layer during readdir().  Fallthrus override
+opaque flags.
+
+File copyup: Create a file on the top layer that has the same metadata and
+contents as the file with the same pathname on the bottom layer.
+
+Directory copyup: Copy up the visible directory entries from the bottom layer
+as fallthrus in the matching top layer directory.  Mark the directory opaque to
+avoid unnecessary negative lookups on the bottom layer.
+
+Examples
+========
+
+What happens when I...
+
+- creat() /newfile -> creates on topmost layer
+- unlink() /oldfile -> creates a whiteout on topmost layer
+- Edit /existingfile -> copies up to top layer at open(O_WRONLY/O_RDWR) time
+- truncate /existingfile -> copies up to topmost layer + N bytes if specified
+- utimes()/chmod()/chown()/etc. -> copies up to topmost layer
+- mkdir() /newdir -> creates opaque dir on topmost layer
+- rmdir() /olddir -> creates a whiteout on topmost layer
+- mkdir() /olddir after above -> creates opaque dir on topmost layer
+- readdir() /shareddir -> copies up entries from bottom layer as
+    fallthrus, processes duplicates and whiteouts
+- link() /oldfile /newlink -> copies up /oldfile, creates /newlink on
+    topmost layer
+- symlink() /oldfile /symlink -> nothing special
+- rename() /oldfile /newfile -> copies up /oldfile to /newfile on top layer,
+    whiteouts /oldfile
+- rename() /olddir /newdir -> EXDEV
+- rename() /topmost_only_dir /topmost_only_dir2 -> success
+- stat() /oldfile -> inode & dev from lower layer
+- stat() /newfile -> inode & dev from topmost layer
+- readdir() /shareddir -> d_ino & d_type from lower layer on fallthrus
+
+Getting to a root file system with union mounts:
+
+- Mount the base read-only file system as the root file system
+- Mount the read-only file system again on /newroot
+- Mount the read-write layer on /newroot:
+   # mount -o union /dev/sda /newroot
+- pivot_root to /newroot
+- Start init
+
+See scripts/pivot.sh in the UML devkit linked to from:
+
+http://valerieaurora.org/union/
+
+VFS implementation
+==================
+
+Union mounts are implemented as an integral part of the VFS, rather than as a
+VFS client file system (i.e., a stacked file system like unionfs or ecryptfs).
+Implementing unioning inside the VFS eliminates the need for duplicate copies
+of VFS data structures, unnecessary indirection, and code duplication, but
+requires very maintainable, low overhead code.  Union mounts require no change
+to file systems serving as the read-only layer, and require some minor support
+from file systems serving as the read-write layer.  File systems that want to
+be the writable layer must implement the new ->whiteout() and ->fallthru()
+inode operations, which create special dummy directory entries.
+
+The union mounts code must accomplish the following major tasks:
+
+1) Pass lookups through to the lower level file system.
+2) Copy files and directories up to the topmost layer when written.
+3) Create whiteouts and fallthrus as necessary.
+
+VFS objects and union mounts
+----------------------------
+
+First, some VFS basics:
+
+The VFS allows multiple mounts of the same file system.  For example, /dev/sda
+can be mounted at /usr and also at /mnt.  The same file system can be mounted
+read-only at one point and read-write at another.  Each of these mounts has its
+own vfsmount data structure in the kernel.  However, each underlying file
+system has exactly one in-kernel superblock structure no matter how many times
+it is mounted.  All the separate vfsmounts for the same file system reference
+the same superblock data structure.
+
+Directory entries are cached by the VFS in dentry structures.  The VFS keeps
+one dentry structure for each file or directory in a file system, no matter how
+many times it is mounted.  Each dentry represents only one element of a path
+name.  When the VFS looks up a pathname (e.g., "/sbin/init"), the result is a
+combination of vfsmount and dentry.  This <mnt,dentry> pair is usually stored
+in a kernel structure named "path", which is simply two pointers, one to the
+vfsmount and one to the dentry.  A "struct path" is this structure; a pathname
+is a string like "/etc/fstab".
+
+In union mounts, a file system can only be the topmost layer for one union
+mount.  A file system can be part of multiple union mounts if it is a read-only
+layer.  So dentries in the read-only layers can be part of multiple unions,
+while a dentry in the read-write layer can only be part of one union.
+
+union_stack structure
+---------------------
+
+The first job of union mounts is to map directories from the topmost layer to
+directories with the same pathname in the lower layer.  That is, given the
+<mnt,dentry> pair for a directory pathname in the topmost layer, we need to
+find all the <mnt,dentry> pairs for the directory with the same pathname in the
+lower layer.  We do this with the union_stack structure, which is an array
+containing struct paths (mnt, dentry pointer pairs) for each directory unioned
+with the topmost directory.  The array is pointed to from the new d_union_stack
+member of struct dentry.
+
+/*
+ * The union_stack structure.  It is an array of struct paths of
+ * directories below the topmost directory in a unioned directory.  The
+ * topmost dentry has a pointer to this structure.  The topmost dentry
+ * can only be part of one union, so we can reference it from the
+ * dentry, but lower dentries can be part of multiple union stacks.
+ *
+ * The number of dirs actually allocated is kept in the superblock,
+ * s_union_count.
+ */
+struct union_stack {
+	struct path u_dirs[0];
+};
+
+This structure is flexible enough to support an arbitrary number of layers of
+unioned file systems.  Since there can be more than two layers, this section
+will talk about mapping "upper" directories to "lower" directories, instead of
+"topmost" directories to "bottom" directories.
+
+Traversing the union stack
+--------------------------
+
+The paths in the union_stack structure for a particular pathname are called
+collectively the union stack for that directory.  To traverse the union stack,
+iterate through the number of layers in the union (stored in sb->s_union_count)
+with union_find_dir().  Example: freeing the union stack:
+
+void d_free_unions(struct dentry *topmost)
+{
+	struct path *path;
+	unsigned int i, layers = topmost->d_sb->s_union_count;
+
+	if (!IS_DIR_UNIONED(topmost))
+		return;
+
+	for (i = 0; i < layers; i++) {
+		path = union_find_dir(topmost, i);
+		if (path->mnt)
+			path_put(path);
+	}
+	kfree(topmost->d_union_stack);
+	topmost->d_union_stack = NULL;
+}
+
+Code paths
+----------
+
+Union mounts modify the following key code paths in the VFS:
+
+- mount()/umount()
+- Pathname lookup
+- Any path that modifies an existing file
+
+Mount
+-----
+
+Union mounts are created in two steps:
+
+1. Mount the read-only layer file systems read-only in the usual manner, all on
+the same mountpoint.  Submounts are permitted as long as they are also
+read-only and not shared (part of a mount propagation group).
+
+2. Mount the top layer with the "-o union" option at the same mountpoint.  All
+read-only file systems mounted at this mountpoint will be included in the union
+mount.
+
+The bottom layers must be read-only and the top layer must be read-write and
+support whiteouts and fallthrus.  A file system that supports whiteouts and
+fallthrus indicates this by setting the MS_WHITEOUT and MS_FALLTHRU flags in
+the superblock.  Currently, the top layer is forced to "noatime" to avoid a
+copyup on every access of a file.  Supporting atime with the current
+infrastructure would require a copyup on every open().  The "relatime" option
+would be equally efficient if the atime is the same or more recent than the
+mtime/ctime for every object on the read-only file system, and if the 24-hour
+timeout on relatime was disabled.  However, this is probably not worthwhile for
+the majority of union mount use cases.
+
+File systems can only be union mounted at their root directories, for
+simplicity and performance.
+
+pivot_root() to a union mounted file system is supported.  The recommended way
+to get to a union mounted root file system is to boot with the read-only mount
+as the root file system, construct the union mount on an entirely new mount,
+and pivot_root() to the new union mount root.  Attempting to union mount the
+root file system later in boot will result in covering other file systems,
+e.g., /proc, which isn't permitted in the current code and is a bad idea
+anyway.
+
+Hard read-only file systems
+---------------------------
+
+Union mounts require the lower layer of the file system to be read-only.
+However, in Linux, any individual file system may be mounted at multiple places
+in the namespace, and a file system can be changed from read-only to read-write
+while still mounted.  Thus, simply checking that the bottom layer is read-only
+at the time the writable overlay is mounted over it is pointless, since at any
+time the bottom layer may become read-write.
+
+We have to guarantee that a file system will be read-only for as long as it is
+the bottom layer of a union mount.  To do this, we track the number of hard
+read-only users of a file system in its VFS superblock structure.  When we
+union mount a writable overlay over a file system, we increment its read-only
+user count.  The file system can only be mounted read-write if its read-only
+user count is zero.
+
+Todo:
+
+- Support hard read-only NFS mounts.  See discussion here:
+
+  http://markmail.org/message/3mkgnvo4pswxd7lp
+
+Pathname lookup
+---------------
+
+Pathname lookup in a unioned directory traverses down the union stack for the
+parent directory, looking up each pathname element in each layer of the file
+system (according to the rules of whiteouts, fallthrus, and opaque flags).  At
+mount time, the union stack for the root directory of the file system is
+created, and the union stack creation for every other unioned directory in the
+file system is bootstrapped using the already-existing union stack of the
+directory's parent.  In order to simplify the code greatly, every visible
+directory on the lower file system is required to have a matching directory on
+the upper file system.  If this matching directory does not already exist, it
+is created during pathname lookup.  Therefore, each unioned directory is the
+child of another unioned directory (or is the root directory of the file
+system).
+
+The actual union lookup function is called in the following code paths:
+
+do_lookup()->do_union_lookup()->lookup_union()->__lookup_union()
+lookup_hash()->lookup_union()->__lookup_union()
+
+__lookup_union() is where the rules of whiteouts, fallthrus, and opaque flags
+are actually implemented.  __lookup_union() returns either the first visible
+dentry, or a negative dentry from the topmost file system if no matching dentry
+exists.  If it finds a directory, it looks up any potential matching lower
+layer directories.  If it finds a lower layer directory, it first creates the
+topmost dir if necessary via union_create_topmost_dir(), and then calls
+union_add_dir() to append the lower directory to the end of the union stack.
+
+Note that not all directories in a union mount are unioned, only those with
+matching directories on the lower layer.  The macro IS_DIR_UNIONED() is a
+cheap, constant time way to check if a directory is unioned, while
+IS_MNT_UNION() checks if the entire mount is unioned (and therefore whether the
+directory in question is potentially unioned).
+
+Currently, lookup of a negative dentry or a directory with no matching
+directories below it requires a lookup in every directory in the union stack
+every time it is looked up.  We could avoid subsequent lookups by adding the
+equivalent of a negative dcache entry.
+
+File copyup
+-----------
+
+Any system call that alters the data or metadata of a file on the bottom layer,
+or creates or changes a hard link to it, will trigger a copyup of the target
+file from the lower layer to the topmost layer:
+
+ - open(O_WRONLY | O_RDWR | O_APPEND)
+ - truncate()/open(O_TRUNC)
+ - link()
+ - rename()
+ - chmod()
+ - chown()/lchown()
+ - utimes()
+ - setxattr()/lsetxattr()
+
+Copyup due to open(O_WRONLY | O_RDWR) has already occurred by the time of:
+
+ - write()
+ - ftruncate()
+ - writable mmap()
+
+The following system calls will fail on an O_RDONLY fd for a lower-layer file:
+
+ - fchmod()
+ - fchown()
+ - fsetxattr()
+ - futimens()
+
+Contrary to common sense, the above system calls are defined to succeed on
+O_RDONLY fds.  The idea seems to be that the O_RDONLY/O_WRONLY/O_RDWR flags
+only apply to the actual file data, not to any form of metadata (times, owner,
+mode, or even extended attributes).  Applications making these system calls on
+O_RDONLY fds are correct according to the standard and work on non-union
+mounts.  They will need to be rewritten (O_RDONLY -> O_RDWR) to work on union
+mounts.  We suspect this usage is uncommon.
+
+This deviation from standard is due to technical limitations of the union mount
+implementation.  Specifically, we would need to replace an open file descriptor
+from the lower layer with an open file descriptor for a file with matching
+pathname and contents on the upper layer, which is difficult to do.  We avoid
+this in other system calls by doing the copyup before the file is opened.
+Unionfs doesn't encounter this problem because it creates a dummy file struct
+which redirects or fans out operations to the struct files for the underlying
+file systems.
+
+From an application's point of view, the result of an in-kernel file copyup is
+the logical equivalent of another application updating the file via the
+rename() pattern: creat() a new file, copy the data over, make changes to the
+copy, and rename() over the old version.  Any existing open file descriptors
+for that file (including those in the same application) refer to a now
+invisible object that used to have the same pathname.  Only opens that occur
+after the copyup will see updates to the file.
+
+Permission checks
+-----------------
+
+We want to be sure we have the correct permissions to actually succeed in a
+system call before copying a file up to avoid unnecessary IO.  At present, the
+permission check for a single system call may be spread out over many hundreds
+of lines of code (e.g., open()).  In order to check permissions, we
+occasionally need to determine if there is a writable overlay on top of this
+inode.  This requires a full path, but often we only have the inode at this
+point.  In particular, inode_permission() returns EROFS if the inode is on a
+read-only file system, which is the wrong answer if there is a writable overlay
+mounted on top of it.
+
+The current solution is to split out the file-system-wide permission checks
+from the per-inode permission checks.  inode_permission() becomes:
+
+sb_permission()
+__inode_permission()
+
+inode_permission() calls sb_permission() and __inode_permission() on the same
+path.  We create path_permission() which calls sb_permission() on the parent
+directory from the top layer, and __inode_permission() on the target on the
+lower layer.  This gets us the correct write permissions considering that the
+file will be copied up.
+
+Todo:
+
+  - Currently, we don't deal with differing directory permissions at
+    different levels of the stack.  This is a bug.
+
+Impact on non-union kernels and mounts
+--------------------------------------
+
+Union-related data structures, extra fields, and function calls are #ifdef'd
+out at the function/macro level with CONFIG_UNION_MOUNT in nearly all cases
+(see fs/union.h).  When CONFIG_UNION_MOUNT is enabled, struct dentry has one
+more pointer, reducing the size of dentry names stored in the dentry itself by
+4 to 8 bytes.
+
+Todo:
+
+ - Do performance tests
+
+Locking strategy
+================
+
+The current union mount locking strategy is based on the following
+rules:
+
+* The lower layer file system is always read-only
+* The topmost file system is always read-write
+  => A file system can never be a topmost and lower layer at the same time
+
+Additionally, the topmost layer may only be mounted exactly once.  Don't think
+of the topmost layer as a separate independent file system; when it is part of
+a union mount, it is only a file system in conjunction with the read-only
+bottom layer.  The read-only bottom layer is an independent file system in and
+of itself and can be mounted elsewhere, including as the bottom layer for
+another union mount.
+
+Thus, we may define a stable locking order in terms of top layer and bottom
+layer locks, since a top layer is never a bottom layer and a bottom layer is
+never a top layer.  Another simplifying assumption is that all directories in a
+pathname exist on the top layer, as they are created step-by-step during
+lookup.  This prevents us from ever having to walk backwards up the path
+creating directory entries, which can get complicated.  By implication, parent
+directory paths during any operation (rename(), unlink(), etc.) are from the
+top layer.  Dentries for directories from the bottom layer are only ever seen
+or used by the lookup code.
+
+The two major problems we avoid with the above rules are:
+
+Lock ordering: Imagine two union stacks with the same two file systems: A
+mounted over B, and B mounted over A.  Sometimes locks on objects in both A and
+B will have to be held simultaneously.  What order should they be acquired in?
+Simply acquiring them from top to bottom will create a lock-ordering problem -
+one thread acquires lock on object from A and then tries for a lock on object
+from B, while another thread grabs the lock on object from B and then waits for
+the lock on object from A.  Some other lock ordering must be defined.
+
+Movement/change/disappearance of objects on multiple layers: A variety of nasty
+corner cases arise when more than one layer is changing at the same time.
+Changes in the directory topology and their effect on inheritance are of
+special concern.  Al Viro's canonical email on the subject:
+
+http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html
+
+We don't try to solve any of these cases, just avoid them in the first place.
+
+Todo: Prevent top layer from being mounted more than once.
+
+Cross-layer interactions
+------------------------
+
+The VFS code simultaneously holds references to and/or modifies objects from
+both the top and bottom layers in the following cases:
+
+Path lookup:
+
+Grabs i_mutex on bottom layer while holding i_mutex on top layer directory
+inode.
+
+File copyup:
+
+Holds i_mutex on the parent directory from the top layer while copying up file
+from lower layer.
+
+link():
+
+File copyup of target while holding i_mutex on parent directory on top layer.
+Followed by a normal link() operation.
+
+rename():
+
+Holds s_vfs_rename_mutex on the top layer, i_mutex of the source's parent dir
+(top layer), and i_mutex of the target's parent dir (also top layer) while
+looking up and copying the bottom layer target and also creating the whiteout.
+
+Notes on rename():
+
+First, renaming of directories returns EXDEV.  It's not at all reasonable to
+recursively copy directory trees and userspace has to handle this case anyway.
+An exception is rename() of directories that exist only on the topmost layer;
+this succeeds.
+
+Rename involves three steps on a union mount: (1) copyup of the file from the
+bottom layer, (2) rename of the new top-layer copy to the target in the usual
+manner, (3) creation of a whiteout covering the source of the rename.
+
+Directory copyup:
+
+Directory entries are copied up on the first readdir().  We hold the top layer
+directory i_mutex throughout and sequentially acquire and drop the i_mutex for
+each lower layer directory.
+
+VFS-fs interface
+================
+
+Read-only layer: No support necessary other than enforcement of really really
+read-only semantics (done by VFS for local file systems).
+
+Writable layer: Must implement two new inode operations:
+
+int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+int (*fallthru) (struct inode *, struct dentry *);
+
+And set the MS_WHITEOUT and MS_FALLTHRU flags to indicate support of
+these operations.
+
+Todo:
+
+- Implement whiteouts and fallthrus in ext3
+- Implement whiteouts and fallthrus in btrfs
+
+Supported file systems
+----------------------
+
+Any file system can be a read-only layer.  File systems must explicitly support
+whiteouts and fallthrus in order to be a read-write layer.  This patch set
+implements whiteouts for ext2, tmpfs, and jffs2.  We have tested ext2, tmpfs,
+and iso9660 as the read-only layer.
+
+Todo:
+ - Test corner cases of case-insensitive/oversensitive file systems
+
+NFS interaction
+===============
+
+NFS is currently not supported as either type of layer.  NFS as read-only layer
+requires support from the server to honor the read-only guarantee needed for
+the bottom layer.  To do this, the server needs to revoke access to clients
+requesting read-only file systems if the exported file system is remounted
+read-write or unmounted (during which arbitrary changes can occur).  Some
+recent discussion:
+
+http://markmail.org/message/3mkgnvo4pswxd7lp
+
+NFS as the read-write layer would require implementation of the ->whiteout()
+and ->fallthru() methods.  DT_WHT directory entries are theoretically already
+supported.
+
+Also, technically the requirement for a readdir() cookie that is stable across
+reboots comes only from file systems exported via NFSv2:
+
+http://oss.oracle.com/pipermail/btrfs-devel/2008-January/000463.html
+
+Todo:
+
+- Guarantee really really read-only on NFS exports
+- Implement whiteout()/fallthru() for NFS
+
+Userland support
+================
+
+The mount command must support the "-o union" mount option and pass the
+corresponding MS_UNION flag to the kernel.  A util-linux git tree with union
+mount support is here:
+
+git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git
+
+File system utilities must support whiteouts and fallthrus.  An e2fsprogs git
+tree with union mount support is here:
+
+git://git.kernel.org/pub/scm/fs/ext2/val/e2fsprogs.git
+
+Currently, whiteout directory entries are not returned to userland.  While the
+directory type for whiteouts, DT_WHT, has been defined for many years, very
+little userland code handles them.  Userland will never see fallthru directory
+entries.
+
+Known non-POSIX behaviors
+-------------------------
+
+- Any writing system call (unlink()/chmod()/etc.) can return ENOSPC or EIO
+
+  Most programs are not tested and don't work well under conditions of ENOSPC.
+  The solution is to add more disk space.
+
+- Link count may be wrong for files on bottom layer with > 1 link count
+
+  A file may have more than one hard link to it.  When a file with multiple
+  hard links is copied up, any other hard links pointing to the same inode will
+  remain unchanged.  If the file is looked up via one of the hard links on the
+  read-only layer, it will have the original link count (which is off by one at
+  this point).  An example:
+
+  /bin/link1 -> inode 100
+  /etc/link2 -> inode 100
+
+  inode 100 will have link count 2.
+
+  # echo "blah" > /bin/link1
+
+  Now /bin/link1 will be copied up to the topmost layer.  But /etc/link2 will
+  still point to the original inode 100, and its link count will still be 2.
+
+- Link count on directories will be wrong before readdir() (fixable)
+- File copyup is the logical equivalent of an update via copy +
+  rename().  Any existing open file descriptors will continue to refer
+  to the read-only copy on the bottom layer and will not see any
+  changes that occur after the copy-up.
+- rename() of directory may fail with EXDEV
+- fchmod()/fchown()/futimens()/fsetxattr() fail on O_RDONLY fds
+
+Status
+======
+
+The current union mounts implementation is feature-complete on local file
+systems and passes an extensive union mounts test suite, available in the union
+mounts Usermode Linux-based development kit:
+
+http://valerieaurora.org/union/union_mount_devkit.tar.gz
+
+The whiteout code has had some non-trivial level of review and testing, but
+much of the code has had no external review or testing outside the authors'
+machines.
+
+The latest version is available at:
+
+git://git.kernel.org/pub/scm/linux/kernel/git/val/linux-2.6.git
+
+Check the union mounts web page for the name of the latest branch:
+
+http://valerieaurora.org/union/
+
+Todo:
+
+- Run more tests (e.g., XFS test suite)
+- Get review from VFS maintainers
+
+Non-features
+------------
+
+Features we do not currently plan to support in union mounts:
+
+Online upgrade: E.g., installing software on a file system NFS exported to
+clients while the clients are still up and running.  Allowing the read-only
+bottom layer of a union mount to change invalidates our locking strategy.
+
+Recursive copying of directories: E.g., implementing rename() across layers for
+directories.  Doing an in-kernel copy of a single file is bad enough.
+Recursively copying a directory is a big no-no.
+
+Read-only top layer: The readdir() strategy fundamentally requires the ability
+to create persistent directory entries on the top layer file system (which may
+be tmpfs).  However, you can union two read-only file systems by union mounting
+a third file system (such as tmpfs) over the two read-only file systems.
+Numerous alternatives to this readdir() strategy (including in-kernel or
+in-application caching) exist and are compatible with union mounts with its
+writing-readdir() implementation disabled.  Creating a readdir() cookie that is
+stable across multiple readdir()s requires one of:
+
+- Write to stable storage (e.g., fallthru dentries)
+- Non-evictable kernel memory cache (doesn't handle NFS server reboot)
+- Per-application caching by glibc readdir()
+
+Often these features are supported by other unioning file systems or by other
+versions of union mounts.
+
+Contributing to union mounts
+============================
+
+The union mounts web page is here:
+
+http://valerieaurora.org/union/
+
+It links to:
+
+ - All git repositories
+ - Documentation
+ - An entire self-contained UML-based dev kit with README, etc.
+
+The best mailing list for discussing union mounts is:
+
+linux-fsdevel@vger.kernel.org
+
+http://vger.kernel.org/vger-lists.html#linux-fsdevel
+
+Thank you for reading!