From: Valerie Aurora <vaurora@redhat.com>
To: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Miklos Szeredi <miklos@szeredi.hu>, Jan Blunck <jblunck@suse.de>,
	Christoph Hellwig <hch@infradead.org>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Valerie Aurora <vaurora@redhat.com>
Subject: [PATCH 17/38] union-mount: Union mounts documentation
Date: Fri,  6 Aug 2010 15:35:03 -0700
Message-ID: <1281134124-17041-18-git-send-email-vaurora@redhat.com>
In-Reply-To: <1281134124-17041-1-git-send-email-vaurora@redhat.com>

Document design and implementation of union mounts (a.k.a. writable
overlays).

Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
 Documentation/filesystems/union-mounts.txt |  752 ++++++++++++++++++++++++++++
 1 files changed, 752 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/union-mounts.txt

diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
new file mode 100644
index 0000000..977a2b5
--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,752 @@
+Union mounts (a.k.a. writable overlays)
+=======================================
+
+This document describes the architecture and current status of union
+mounts, also known as writable overlays.
+
+In this document:
+ - Overview of union mounts
+ - Terminology
+ - Examples
+ - VFS implementation
+ - Locking strategy
+ - VFS-fs interface
+ - NFS interaction
+ - Userland support
+ - Status
+ - Contributing to union mounts
+
+Overview
+========
+
+A union mount layers one read-write file system over one or more
+read-only file systems, with all writes going to the writable file
+system.  The namespace of both file systems appears as a combined
+whole to userland, with files and directories on the writable file
+system covering up any files or directories with matching pathnames on
+the read-only file system.  The read-write file system is the
+"topmost" or "upper" file system and the read-only file systems are
+the "lower" file systems.  A few use cases:
+
+- Root file system on CD with writes saved to hard drive (LiveCD)
+- Multiple virtual machines with the same starting root file system
+- Cluster with NFS mounted root on clients
+
+Most if not all of these problems could be solved with a COW block
+device or a clustered file system (including NFS mounts).  However, for
+some use cases, sharing is more efficient and better performing if
+done at the file system namespace level.  COW block devices only
+increase their divergence as time goes on, and a fully coherent
+writable file system is unnecessary synchronization overhead if no
+other client needs to see the writes.
+
+What union mounts are not
+-------------------------
+
+Union mounts are not a general-purpose unioning file system.  They do
+not provide a generic "union of namespaces" operation for an arbitrary
+number of file systems.  Many interesting features can be implemented
+with a generic unioning facility: dynamic insertion and removal of
+branches, write policies based on space available, online upgrade,
+etc.  Some unioning file systems that do this are UnionFS and AUFS.
+
+Terminology
+===========
+
+The main physical metaphor for union mounts is that a writable file
+system is mounted "on top" of a read-only file system.  Lookups start
+at the "topmost" read-write file system and travel "down" to the
+"bottom" read-only file system only if no blocking entry exists on the
+top layer.
+
+Topmost layer: The read-write file system.  Lookups begin here.
+
+Bottom layer: The read-only file system.  Lookups end here.
+
+Path: Combination of the vfsmount and dentry structure.
+
+Follow down: Given a path from the top layer, find the corresponding
+path on the bottom layer.
+
+Follow up: Given a path from the bottom layer, find the corresponding
+path on the top layer.
+
+Whiteout: A directory entry in the top layer that prevents lookups
+from travelling down to the bottom layer.  Created on unlink()/rmdir()
+if a corresponding directory entry exists in the bottom layer.
+
+Opaque flag: A flag on a directory in the top layer that prevents
+lookups of entries in this directory from travelling down to the
+bottom layer (unless there is an explicit fallthru entry allowing that
+for a particular entry).  Set on creation of a directory that replaces
+a whiteout, and after a directory copyup.
+
+Fallthru: A directory entry which allows lookups to "fall through" to
+the bottom layer for that exact directory entry.  This serves as a
+placeholder for directory entries from the bottom layer during
+readdir().  Fallthrus override opaque flags.
+
+File copyup: Create a file on the top layer that has the same metadata
+and contents as the file with the same pathname on the bottom layer.
+
+Directory copyup: Copy up the visible directory entries from the
+bottom layer as fallthrus in the matching top layer directory.  Mark
+the directory opaque to avoid unnecessary negative lookups on the
+bottom layer.
+
+Examples
+========
+
+What happens when I...
+
+- creat() /newfile -> creates on topmost layer
+- unlink() /oldfile -> creates a whiteout on topmost layer
+- Edit /existingfile -> copies up to top layer at open(O_WRONLY/O_RDWR) time
+- truncate /existingfile -> copies up to topmost layer + N bytes if specified
+- touch()/chmod()/chown()/etc. -> copies up to topmost layer
+- mkdir() /newdir -> creates on topmost layer
+- rmdir() /olddir -> creates a whiteout on topmost layer
+- mkdir() /olddir after above -> creates on topmost layer w/ opaque flag
+- readdir() /shareddir -> copies up entries from bottom layer as fallthrus
+- link() /oldfile /newlink -> copies up /oldfile, creates /newlink on topmost layer
+- symlink() /oldfile /symlink -> nothing special
+- rename() /oldfile /newfile -> copies up /oldfile to /newfile on top layer
+- rename() /olddir /newdir -> EXDEV
+- rename() /topmost_only_dir /topmost_only_dir2 -> success
+
+Getting to a root file system with union mounts:
+
+- Mount the base read-only file system as the root file system
+- Mount the read-only file system again on /newroot
+- Mount the read-write layer on /newroot:
+   # mount -o union /dev/sda /newroot
+- pivot_root to /newroot
+- Start init
+
+See scripts/pivot.sh in the UML devkit linked to from:
+
+http://valerieaurora.org/union/
+
+VFS implementation
+==================
+
+Union mounts are implemented as an integral part of the VFS, rather
+than as a VFS client file system (i.e., a stacked file system like
+unionfs or ecryptfs).  Implementing unioning inside the VFS eliminates
+the need for duplicate copies of VFS data structures, unnecessary
+indirection, and code duplication, but requires very maintainable,
+low-to-zero overhead code.  Union mounts require no change to file
+systems serving as the read-only layer, and require some minor
+support from file systems serving as the read-write layer.  File
+systems that want to be the writable layer must implement the new
+->whiteout() and ->fallthru() inode operations, which create special
+dummy directory entries.
+
+The union mounts code must accomplish the following major tasks:
+
+1) Pass lookups through to the lower level file system.
+2) Copy files and directories up to the topmost layer when written.
+3) Create whiteouts and fallthrus as necessary.
+
+VFS objects and union mounts
+----------------------------
+
+First, some VFS basics:
+
+The VFS allows multiple mounts of the same file system.  For example,
+/dev/sda can be mounted at /usr and also at /mnt.  The same file
+system can be mounted read-only at one point and read-write at
+another.  Each of these mounts has its own vfsmount data structure in
+the kernel.  However, each underlying file system has exactly one
+in-kernel superblock structure no matter how many times it is mounted.
+All the separate vfsmounts for the same file system reference the same
+superblock data structure.
+
+Directory entries are cached by the VFS in dentry structures.  The VFS
+keeps one dentry structure for each file or directory in a file
+system, no matter how many times it is mounted.  Each dentry
+represents only one element of a path name.  When the VFS looks up a
+pathname (e.g., "/sbin/init"), the result is a combination of vfsmount
+and dentry.  This <mnt,dentry> pair is usually stored in a kernel
+structure named "path", which is simply two pointers, one to the
+vfsmount and one to the dentry.  A "struct path" is this structure; a
+pathname is a string like "/etc/fstab".
+
+In union mounts, a file system can only be the topmost layer for one
+union mount.  A file system can be part of multiple union mounts if it
+is a read-only layer.  So dentries in the read-only layers can be part
+of multiple unions, while a dentry in the read-write layer can only be
+part of one union.
+
+union_dir structure
+-------------------
+
+The first job of union mounts is to map directories from the topmost
+layer to directories with the same pathname in the lower layer.  That
+is, given the <mnt,dentry> pair for a directory pathname in the
+topmost layer, we need to find all the <mnt,dentry> pairs for the
+directory with the same pathname in the lower layer.  We do this with
+a singly linked list rooted in the dentry from the topmost layer.  The
+linked list is the union_dir structure:
+
+/*
+ * The union_dir structure.  Basically just a singly-linked list with
+ * a pointer to the referenced dentry, whose head is d_union_dir in
+ * the dentry of the topmost directory.  We can't link this list
+ * purely through list elements in the dentry because lower layer
+ * dentries can be part of multiple union stacks.  However, the
+ * topmost dentry is only part of one union stack.  So we point at the
+ * lower layer dentries through a linked list rooted in the topmost
+ * dentry.
+ */
+struct union_dir {
+	struct path u_this;		/* this is me */
+	struct union_dir *u_lower;	/* this is what I overlay */
+};
+
+This structure is flexible enough to support an arbitrary number of
+layers of unioned file systems. (The current code is tested only with
+two layers but should allow more layers.) Since there can be more than
+two layers, this section will talk about mapping "upper" directories
+to "lower" directories, instead of "topmost" directories to "bottom"
+directories.
+
+At the time of a union mount, we allocate a union_dir structure to link
+the root directory of the upper layer to the root directory of the
+lower layer and put the pointer to it in the d_union_dir field of
+struct dentry:
+
+struct dentry {
+[...]
+#ifdef CONFIG_UNION_MOUNT
+	struct union_dir *d_union_dir;	/* head of union stack */
+#endif
+[...]
+};
+
+
+Traversing the union stack
+--------------------------
+
+The set of union_dir structures referring to a particular pathname is
+collectively called the union stack for that directory.  Only lookup
+needs to traverse the union stack - walk down the list of paths
+beginning with the topmost.  This is open-coded:
+
+static int __lookup_union(struct nameidata *nd, struct qstr *name,
+			  struct path *topmost)
+{
+[...]
+	/* next_ud points at the tail link of the union dir list for this dentry */
+	struct union_dir **next_ud = &topmost->dentry->d_union_dir;
+[...]
+	/* Go through each dir underlying the parent, looking for a match */
+	for (ud = nd->path.dentry->d_union_dir; ud != NULL; ud = ud->u_lower) {
+[...]
+		next_ud = &(*next_ud)->u_lower;
+	}
+}
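+
+The next_ud idiom above (a pointer to the link that should receive the
+next element) can be illustrated outside the kernel.  The following is
+a minimal userspace sketch, not the kernel code: the model_* names are
+hypothetical, an integer layer id stands in for struct path, and the
+list is assumed to hold only the lower directories.
+

```c
#include <stdlib.h>

/* Hypothetical userspace stand-in for one element of a union stack. */
struct model_union_dir {
	int layer_id;				/* stands in for struct path u_this */
	struct model_union_dir *u_lower;	/* next directory further down */
};

/*
 * Append a lower directory at *tail.  Returns the address of the new
 * element's u_lower link, which is the tail for the next append, or
 * NULL if the allocation fails.
 */
struct model_union_dir **model_union_add_dir(struct model_union_dir **tail,
					     int layer_id)
{
	struct model_union_dir *ud = malloc(sizeof(*ud));

	if (ud == NULL)
		return NULL;
	ud->layer_id = layer_id;
	ud->u_lower = NULL;
	*tail = ud;
	return &ud->u_lower;
}

/* Walk the stack from the head down and count its elements. */
int model_union_depth(const struct model_union_dir *head)
{
	int depth = 0;

	for (; head != NULL; head = head->u_lower)
		depth++;
	return depth;
}

/* Release every element of the stack. */
void model_union_free(struct model_union_dir *head)
{
	while (head != NULL) {
		struct model_union_dir *next = head->u_lower;

		free(head);
		head = next;
	}
}
```

+
+Starting with tail = &head and feeding each returned pointer into the
+next call appends in order without re-walking the list, which is what
+the next_ud = &(*next_ud)->u_lower step in the loop achieves.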
+
+Code paths
+----------
+
+Union mounts modify the following key code paths in the VFS:
+
+- mount()/umount()
+- Pathname lookup
+- Any path that modifies an existing file
+
+Mount
+-----
+
+Union mounts are created in two steps:
+
+1. Mount the read-only layer file systems read-only in the usual
+manner, all on the same mountpoint.  Submounts are permitted as long
+as they are also read-only and not shared (part of a mount propagation
+group).
+
+2. Mount the top layer with the "-o union" option at the same
+mountpoint.  All read-only file systems mounted at this mountpoint
+will be included in the union mount.
+
+The bottom layers must be read-only and the top layer must be
+read-write and support whiteouts and fallthrus.  A file system that
+supports whiteouts and fallthrus indicates this by setting the
+MS_WHITEOUT flag in the superblock.  Currently, the top layer is
+forced to "noatime" to avoid a copyup on every access of a file.
+Supporting atime with the current infrastructure would require a
+copyup on every open().  The "relatime" option would be equally
+efficient if the atime is the same as or more recent than the mtime/ctime
+for every object on the read-only file system, and if the 24-hour
+timeout on relatime was disabled.  However, this is probably not
+worthwhile for the majority of union mount use cases.
+
+File systems can only be union mounted at their root directories.
+Without this restriction, some VFS operations must always do a
+union_lookup() - requiring a global lock - in order to find out if a
+path is potentially unioned.  With this restriction, we can tell if a
+path is potentially unioned by checking a flag in the vfsmount.
+
+pivot_root() to a union mounted file system is supported.  The
+recommended way to get to a union mounted root file system is to boot
+with the read-only mount as the root file system, construct the union
+mount on an entirely new mount, and pivot_root() to the new union
+mount root.  Attempting to union mount the root file system later in
+boot will result in covering other file systems, e.g., /proc, which
+isn't permitted in the current code and is a bad idea anyway.
+
+Hard read-only file systems
+---------------------------
+
+Union mounts require the lower layer of the file system to be
+read-only.  However, in Linux, any individual file system may be
+mounted at multiple places in the namespace, and a file system can be
+changed from read-only to read-write while still mounted.  Thus, simply
+checking that the bottom layer is read-only at the time the writable
+overlay is mounted over it is pointless, since at any time the bottom
+layer may become read-write.
+
+We have to guarantee that a file system will be read-only for as long
+as it is the bottom layer of a union mount.  To do this, we track the
+number of hard read-only users of a file system in its VFS superblock
+structure.  When we union mount a writable overlay over a file system,
+we increment its read-only user count.  The file system can only be
+mounted read-write if its read-only users count is zero.
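+
+The counting rule can be sketched as a small userspace model.  This is
+an illustration only: the model_* names, the structure layout, and the
+error codes (-EINVAL, -EBUSY) are assumptions, not the interface added
+by this patch set.
+

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant superblock state. */
struct model_sb {
	bool read_only;		/* current mount state of the file system */
	int hard_ro_users;	/* union mounts using it as a lower layer */
};

/* Called when a writable overlay is union mounted over this file system. */
int model_union_attach_lower(struct model_sb *sb)
{
	if (!sb->read_only)
		return -EINVAL;	/* lower layers must already be read-only */
	sb->hard_ro_users++;
	return 0;
}

/* Called when the union mount using this file system goes away. */
void model_union_detach_lower(struct model_sb *sb)
{
	if (sb->hard_ro_users > 0)
		sb->hard_ro_users--;
}

/* A read-write (re)mount is refused while any hard read-only user exists. */
int model_remount_rw(struct model_sb *sb)
{
	if (sb->hard_ro_users > 0)
		return -EBUSY;
	sb->read_only = false;
	return 0;
}
```

+
+In this model a remount attempt fails for as long as one union mount
+depends on the file system, and succeeds once the count is back to zero.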
+
+Todo:
+
+- Support hard read-only NFS mounts.  See discussion here:
+
+  http://markmail.org/message/3mkgnvo4pswxd7lp
+
+Pathname lookup
+---------------
+
+Pathname lookup in a unioned directory traverses down the union stack
+for the parent directory, looking up each pathname element in each
+layer of the file system (according to the rules of whiteouts,
+fallthrus, and opaque flags).  At mount time, the union stack for the
+root directory of the file system is created, and the union stack
+creation for every other unioned directory in the file system is
+boot-strapped using the already-existing union stack of the
+directory's parent.  In order to simplify the code greatly, every
+visible directory on the lower file system is required to have a
+matching directory on the upper file system.  This matching directory
+is created during pathname lookup if it does not already exist.
+Therefore, each unioned directory is the child of another unioned
+directory (or is the root directory of the file system).
+
+The actual union lookup function is called in the following code
+paths:
+
+do_lookup()->do_union_lookup()->lookup_union()->__lookup_union()
+lookup_hash()->lookup_union()->__lookup_union()
+
+__lookup_union() is where the rules of whiteouts, fallthrus, and
+opaque flags are actually implemented.  __lookup_union() returns
+either the first visible dentry, or a negative dentry from the topmost
+file system if no matching dentry exists.  If it finds a directory, it
+looks up any potential matching lower layer directories.  If it finds
+a lower layer directory, it first creates the topmost dir if necessary
+via union_create_topmost_dir(), and then calls union_add_dir() to
+append the lower directory to the end of the union stack.
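+
+The visibility rules for a single name can be sketched as follows.
+This is a userspace model under stated assumptions, not
+__lookup_union() itself: all names are hypothetical, directories are
+flat arrays, and whiteouts, fallthrus, and the opaque flag are assumed
+to exist only on the topmost layer (index 0).
+

```c
#include <stddef.h>
#include <string.h>

enum model_type { MODEL_FILE, MODEL_WHITEOUT, MODEL_FALLTHRU };

struct model_entry {
	const char *name;
	enum model_type type;
};

struct model_dir {
	const struct model_entry *entries;
	size_t count;
	int opaque;		/* only meaningful on the topmost layer */
};

/* Linear search of one directory for an entry with the given name. */
static const struct model_entry *model_find(const struct model_dir *dir,
					    const char *name)
{
	for (size_t i = 0; i < dir->count; i++)
		if (strcmp(dir->entries[i].name, name) == 0)
			return &dir->entries[i];
	return NULL;
}

/*
 * Return the index of the layer that supplies "name" (0 is the topmost
 * layer), or -1 if the name is not visible in the union.
 */
int model_lookup_union(const struct model_dir *layers, size_t nlayers,
		       const char *name)
{
	const struct model_entry *e;

	if (nlayers == 0)
		return -1;

	e = model_find(&layers[0], name);
	if (e != NULL) {
		if (e->type == MODEL_WHITEOUT)
			return -1;	/* whiteout blocks the lower layers */
		if (e->type == MODEL_FILE)
			return 0;	/* topmost entry covers the rest */
		/* MODEL_FALLTHRU: continue below, even if dir is opaque */
	} else if (layers[0].opaque) {
		return -1;		/* opaque dir: no implicit fall through */
	}

	for (size_t i = 1; i < nlayers; i++)
		if (model_find(&layers[i], name) != NULL)
			return (int)i;
	return -1;
}
```

+
+Note how a fallthru entry is the only way past an opaque directory,
+which matches "Fallthrus override opaque flags" in the terminology
+section.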
+
+Note that not all directories in a union mount are unioned, only those
+with matching directories on the lower layer.  The macro
+IS_DIR_UNIONED() is a cheap, constant time way to check if a directory
+is unioned, while IS_MNT_UNION() checks if the entire mount is unioned
+(and therefore whether the directory in question is potentially
+unioned).
+
+Currently, lookup of a negative dentry in a unioned directory requires
+a lookup in every directory in the union stack every time it is looked
+up.  We could avoid subsequent lookups by adding a negative union
+cache entry, exactly the way negative dentries are cached.
+
+File copyup
+-----------
+
+Any system call that alters the data or metadata of a file on the
+bottom layer, or creates or changes a hard link to it, will trigger a
+copyup of the target file from the lower layer to the topmost layer:
+
+ - open(O_WRONLY | O_RDWR | O_APPEND)
+ - truncate()/open(O_TRUNC)
+ - link()
+ - rename()
+ - chmod()
+ - chown()/lchown()
+ - utimes()
+ - setxattr()/lsetxattr()
+
+Copyup of a file due to open(O_WRONLY/O_RDWR) has already occurred when:
+
+ - write()
+ - ftruncate()
+ - writable mmap()
+
+The following system calls will fail on an fd opened O_RDONLY:
+
+ - fchmod()
+ - fchown()
+ - fsetxattr()
+ - futimens()
+
+Contrary to common sense, the above system calls are defined to
+succeed on O_RDONLY fds.  The idea seems to be that the
+O_RDONLY/O_RDWR/O_WRONLY flags only apply to the actual file data, not
+to any form of metadata (times, owner, mode, or even extended
+attributes).  Applications making these system calls on O_RDONLY fds
+are correct according to the standard and work on non-union-mounts.
+They will need to be rewritten (O_RDONLY -> O_RDWR) to work on union
+mounts.  We suspect this usage is uncommon.
+
+This deviation from the standard is due to technical limitations of the
+union mount implementation.  Specifically, we would need to replace an
+open file descriptor from the lower layer with an open file descriptor
+for a file with matching pathname and contents on the upper layer,
+which is difficult to do.  We avoid this in other system calls by
+doing the copyup before the file is opened.  Unionfs doesn't encounter
+this problem because it creates a dummy file struct which redirects or
+fans out operations to the struct files for the underlying file
+systems.
+
+From an application's point of view, the result of an in-kernel file
+copyup is the logical equivalent of another application updating the
+file via the rename() pattern: creat() a new file, copy the data over,
+make changes to the copy, and rename() over the old version.  Any
+existing open file descriptors for that file (including those in the
+same application) refer to a now invisible object that used to have
+the same pathname.  Only opens that occur after the copyup will see
+updates to the file.
+
+Permission checks
+-----------------
+
+We want to be sure we have the correct permissions to actually succeed
+in a system call before copying a file up to avoid unnecessary IO.  At
+present, the permission check for a single system call may be spread
+out over many hundreds of lines of code (e.g., open()).  In order to
+check permissions, we occasionally need to determine if there is a
+writable overlay on top of this inode.  This requires a full path, but
+often we only have the inode at this point.  In particular,
+inode_permission() returns EROFS if the inode is on a read-only file
+system, which is the wrong answer if there is a writable overlay
+mounted on top of it.
+
+The current solution is to split out the file-system-wide permission
+checks from the per-inode permission checks.  inode_permission()
+becomes:
+
+sb_permission()
+__inode_permission()
+
+inode_permission() calls sb_permission() and __inode_permission() on
+the same path.  We create path_permission() which calls
+sb_permission() on the parent directory from the top layer, and
+__inode_permission() on the target on the lower layer.  This gets us
+the correct write permissions considering that the file will be copied
+up.
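+
+The split can be sketched with a userspace model.  The names, the
+structure layouts, and the single "writable" bit are hypothetical
+simplifications; only the shape of the calls (sb_permission() taken
+from the top layer parent, __inode_permission() from the lower layer
+target) follows the description above.
+

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_MAY_WRITE 0x2

struct perm_sb {
	bool read_only;
};

struct perm_inode {
	const struct perm_sb *sb;
	bool writable;		/* stands in for the mode/ACL checks */
};

/* File-system-wide check: models sb_permission(). */
int model_sb_permission(const struct perm_sb *sb, int mask)
{
	if ((mask & MODEL_MAY_WRITE) && sb->read_only)
		return -EROFS;
	return 0;
}

/* Per-inode check: models __inode_permission(). */
int model_inode_only_permission(const struct perm_inode *inode, int mask)
{
	if ((mask & MODEL_MAY_WRITE) && !inode->writable)
		return -EACCES;
	return 0;
}

/* Both checks against the same object: models inode_permission(). */
int model_inode_permission(const struct perm_inode *inode, int mask)
{
	int err = model_sb_permission(inode->sb, mask);

	return err ? err : model_inode_only_permission(inode, mask);
}

/*
 * Superblock check against the top layer parent directory, inode check
 * against the lower layer target: models path_permission().
 */
int model_path_permission(const struct perm_inode *top_parent,
			  const struct perm_inode *lower_target, int mask)
{
	int err = model_sb_permission(top_parent->sb, mask);

	return err ? err : model_inode_only_permission(lower_target, mask);
}
```

+
+With a read-only lower file system, model_inode_permission() on the
+target reports -EROFS, while model_path_permission() reports success
+whenever the inode itself would allow the write.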
+
+Todo:
+
+  - Currently, we don't deal with differing directory permissions at
+    different levels of the stack.  This is a bug.
+
+Impact on non-union kernels and mounts
+--------------------------------------
+
+Union-related data structures, extra fields, and function calls are
+#ifdef'd out at the function/macro level with CONFIG_UNION_MOUNT in
+nearly all cases (see fs/union.h).
+
+Todo:
+
+ - Do performance tests
+
+Locking strategy
+================
+
+The current union mount locking strategy is based on the following
+rules:
+
+* The lower layer file system is always read-only
+* The topmost file system is always read-write
+  => A file system can never be a topmost and lower layer at the same time
+
+Additionally, the topmost layer may only be mounted exactly once.
+Don't think of the topmost layer as a separate independent file
+system; when it is part of a union mount, it is only a file system in
+conjunction with the read-only bottom layer.  The read-only bottom
+layer is an independent file system in and of itself and can be
+mounted elsewhere, including as the bottom layer for another union
+mount.
+
+Thus, we may define a stable locking order in terms of top layer and
+bottom layer locks, since a top layer is never a bottom layer and a
+bottom layer is never a top layer.  Another simplifying assumption is
+that all directories in a pathname exist on the top layer, as they are
+created step-by-step during lookup.  This prevents us from ever having
+to walk backwards up the path creating directory entries, which can
+get complicated.  By implication, parent directory paths during any
+operation (rename(), unlink(), etc.) are from the top layer.  Dentries
+for directories from the bottom layer are only ever seen or used by
+the lookup code.
+
+The two major problems we avoid with the above rules are:
+
+Lock ordering: Imagine two union stacks with the same two file
+systems: A mounted over B, and B mounted over A.  Sometimes locks on
+objects in both A and B will have to be held simultaneously.  What
+order should they be acquired in?  Simply acquiring them from top to
+bottom will create a lock-ordering problem - one thread acquires the
+lock on an object from A and then tries for the lock on an object from
+B, while another thread grabs the lock on the object from B and then
+waits for the lock on the object from A.  Some other lock ordering
+must be defined.
+
+Movement/change/disappearance of objects on multiple layers: A variety
+of nasty corner cases arise when more than one layer is changing at
+the same time.  Changes in the directory topology and their effect on
+inheritance are of special concern.  Al Viro's canonical email on the
+subject:
+
+http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html
+
+We don't try to solve any of these cases, just avoid them in the first
+place.
+
+Todo: Prevent top layer from being mounted more than once.
+
+Cross-layer interactions
+------------------------
+
+The VFS code simultaneously holds references to and/or modifies
+objects from both the top and bottom layers in the following cases:
+
+Path lookup:
+
+Grabs i_mutex on bottom layer while holding i_mutex on top layer
+directory inode.
+
+File copyup:
+
+Holds i_mutex on the parent directory from the top layer while copying
+up file from lower layer.
+
+link():
+
+File copyup of target while holding i_mutex on parent directory on top
+layer.  Followed by a normal link() operation.
+
+rename():
+
+Holds s_vfs_rename_mutex on the top layer, i_mutex of the source's
+parent dir (top layer), and i_mutex of the target's parent dir (also
+top layer) while looking up and copying the bottom layer target and
+also creating the whiteout.
+
+Notes on rename():
+
+First, renaming of directories returns EXDEV.  It's not at all
+reasonable to recursively copy directory trees and userspace has to
+handle this case anyway.  An exception is rename() of directories that
+exist only on the topmost layer; this succeeds.
+
+Rename involves three steps on a union mount: (1) copyup of the file
+from the bottom layer, (2) rename of the new top-layer copy to the
+target in the usual manner, (3) creation of a whiteout covering the
+source of the rename.
+
+Directory copyup:
+
+Directory entries are copied up on the first readdir().  We hold the
+top layer directory i_mutex throughout and sequentially acquire and
+drop the i_mutex for each lower layer directory.
+
+VFS-fs interface
+================
+
+Read-only layer: No support necessary other than enforcement of really
+really read-only semantics (done by VFS for local file systems).
+
+Writable layer: Must implement two new inode operations:
+
+int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+int (*fallthru) (struct inode *, struct dentry *);
+
+And set the MS_WHITEOUT flag to indicate support of these operations.
+
+Todo:
+
+- Return inode of underlying file in d_ino in readdir()
+- Implement whiteouts and fallthrus in ext3
+- Implement whiteouts and fallthrus in btrfs
+
+Supported file systems
+----------------------
+
+Any file system can be a read-only layer.  File systems must
+explicitly support whiteouts and fallthrus in order to be a read-write
+layer.  This patch set implements whiteouts for ext2, tmpfs, and
+jffs2.  We have tested ext2, tmpfs, and iso9660 as the read-only
+layer.
+
+Todo:
+ - Test corner cases of case-insensitive/oversensitive file systems
+
+NFS interaction
+===============
+
+NFS is currently not supported as either type of layer.  NFS as
+read-only layer requires support from the server to honor the
+read-only guarantee needed for the bottom layer.  To do this, the
+server needs to revoke access to clients requesting read-only file
+systems if the exported file system is remounted read-write or
+unmounted (during which arbitrary changes can occur).  Some recent
+discussion:
+
+http://markmail.org/message/3mkgnvo4pswxd7lp
+
+NFS as the read-write layer would require implementation of the
+->whiteout() and ->fallthru() methods.  DT_WHT directory entries are
+theoretically already supported.
+
+Also, technically the requirement for a readdir() cookie that is
+stable across reboots comes only from file systems exported via NFSv2:
+
+http://oss.oracle.com/pipermail/btrfs-devel/2008-January/000463.html
+
+Todo:
+
+- Guarantee really really read-only on NFS exports
+- Implement whiteout()/fallthru() for NFS
+
+Userland support
+================
+
+The mount command must support the "-o union" mount option and pass
+the corresponding MS_UNION flag to the kernel.  A util-linux git
+tree with union mount support is here:
+
+git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git
+
+File system utilities must support whiteouts and fallthrus.  An
+e2fsprogs git tree with union mount support is here:
+
+git://git.kernel.org/pub/scm/fs/ext2/val/e2fsprogs.git
+
+Currently, whiteout directory entries are not returned to userland.
+While the directory type for whiteouts, DT_WHT, has been defined for
+many years, very little userland code handles them.  Userland will
+never see fallthru directory entries.
+
+Known non-POSIX behaviors
+-------------------------
+
+- Any writing system call (unlink()/chmod()/etc.) can return ENOSPC or EIO
+
+  Most programs are not tested and don't work well under conditions of
+  ENOSPC.  The solution is to add more disk space.
+
+- Link count may be wrong for files on bottom layer with > 1 link count
+
+  A file may have more than one hard link to it.  When a file with
+  multiple hard links is copied up, any other hard links pointing to
+  the same inode will remain unchanged.  If the file is looked up via
+  one of the hard links on the read-only layer, it will have the
+  original link count (which is off by one at this point).  An
+  example:
+
+  /bin/link1 -> inode 100
+  /etc/link2 -> inode 100
+
+  inode 100 will have link count 2.
+
+  # echo "blah" > /bin/link1
+
+  Now /bin/link1 will be copied up to the topmost layer.  But
+  /etc/link2 will still point to the original inode 100, and its link
+  count will still be 2.
+
+- Link count on directories will be wrong before readdir() (fixable)
+- File copyup is the logical equivalent of an update via copy +
+  rename().  Any existing open file descriptors will continue to refer
+  to the read-only copy on the bottom layer and will not see any
+  changes that occur after the copy-up.
+- rename() of directory may fail with EXDEV
+- inode number in d_ino of struct dirent will be wrong for fallthrus
+- fchmod()/fchown()/futimens()/fsetxattr() fail on O_RDONLY fds
+
+Status
+======
+
+The current union mounts implementation is feature-complete on local
+file systems and passes an extensive union mounts test suite,
+available in the union mounts Usermode Linux-based development kit:
+
+http://valerieaurora.org/union/union_mount_devkit.tar.gz
+
+The whiteout code has had some non-trivial level of review and
+testing, but much of the code has had no external review or testing
+outside the authors' machines.
+
+The latest version is available at:
+
+git://git.kernel.org/pub/scm/linux/kernel/git/val/linux-2.6.git
+
+Check the union mounts web page for the name of the latest branch:
+
+http://valerieaurora.org/union/
+
+Todo:
+
+- Run more tests (e.g., XFS test suite)
+- Get review from VFS maintainers
+
+Non-features
+------------
+
+Features we do not currently plan to support in union mounts:
+
+Online upgrade: E.g., installing software on a file system NFS
+exported to clients while the clients are still up and running.
+Allowing the read-only bottom layer of a union mount to change
+invalidates our locking strategy.
+
+Recursive copying of directories: E.g., implementing rename() across
+layers for directories.  Doing an in-kernel copy of a single file is
+bad enough.  Recursively copying a directory is a big no-no.
+
+Read-only top layer: The readdir() strategy fundamentally requires the
+ability to create persistent directory entries on the top layer file
+system (which may be tmpfs).  Numerous alternatives (including
+in-kernel or in-application caching) exist and are compatible with
+union mounts with its writing-readdir() implementation disabled.
+Creating a readdir() cookie that is stable across multiple readdir()s
+requires one of:
+
+- Write to stable storage (e.g., fallthru dentries)
+- Non-evictable kernel memory cache (doesn't handle NFS server reboot)
+- Per-application caching by glibc readdir()
+
+Often these features are supported by other unioning file systems or
+by other versions of union mounts.
+
+Contributing to union mounts
+============================
+
+The union mounts web page is here:
+
+http://valerieaurora.org/union/
+
+It links to:
+
+ - All git repositories
+ - Documentation
+ - An entire self-contained UML-based dev kit with README, etc.
+
+The best mailing list for discussing union mounts is:
+
+linux-fsdevel@vger.kernel.org
+
+http://vger.kernel.org/vger-lists.html#linux-fsdevel
+
+Thank you for reading!
-- 
1.6.3.3


