linux-kernel.vger.kernel.org archive mirror
* [UNIONFS] 00/42 Unionfs and related patches review
@ 2007-12-10  2:41 Erez Zadok
  2007-12-10  2:41 ` [PATCH 01/42] Unionfs: filesystems documentation index Erez Zadok
                   ` (42 more replies)
  0 siblings, 43 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel


Al, Christoph, and Andrew,

As per your request, I'm posting for review the unionfs code (and related
code) that's in my korg tree against mainline (v2.6.24-rc4-190-g94545ba).
This code is nearly identical to what's in -mm (the mm code has a couple of
additional things that depend on mm-specific patches that aren't in mainline
yet).

I really tried to keep this message short, by offering pointers to more
info, but still there's a bunch of info here.

Andrew, you've asked me to list the main issues that came about in
discussions regarding unionfs, and how they were addressed.  So I've
reviewed my notes from OLS'06, LSF'07, and OLS'07, as well as assorted
postings in mailing lists, and I came up with this prioritized list (in
descending priority order):

1. cache coherency
2. nameidata handling
3. namespace pollution
4. use of ioctls for branch management

(1) Cache coherency: by far, the biggest concern had been around cache
    coherency: what happens if someone modifies a lower object
    (file/dir/etc.).  I met with Mike Halcrow in October and we discussed
    stacking in general; Mike also emphasized that cache-coherency was one
    of his most pressing concerns in ecryptfs.

At OLS'06, several suggestions were made, including fancy tricks to hide the
lower namespace or "lock" it so users have readonly access.  None of these
solutions would have been able to easily handle the problem of an existing
open file descriptor on a lower file, and they might have required
significant VFS changes.  Moreover, unionfs users actually want to modify
lower branches directly, and then be able to see their changes reflected in
the union immediately.  So we explored a number of ideas.  We feel that the
VFS is complex enough so we tried our best to handle cache-coherency inside
unionfs.  The solution we have implemented is to compare the mtime/ctime of
upper/lower objects during revalidation (esp. of dentries); and if the lower
times are newer, we reconstruct the union object (drop the older objects,
and re-lookup them).  This time-based cache-coherency works well and is
similar to the NFS model.  Because Unionfs users tend to have a burst of
activity on lower branches, our current cache-coherency also defers the
revalidation actions until absolutely needed, so this idea tends to also be
more efficient for the common usage patterns.  More details about how we
handle cache-coherency are available in our
Documentation/filesystems/unionfs/concepts.txt file.
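
To make the time-based scheme concrete, here is a minimal userspace sketch
in Python.  It is not unionfs kernel code: the class names, fields, and the
"rebuilds" counter are hypothetical and only model the idea described above.
The upper object remembers the newest lower mtime/ctime, and work happens
only when revalidate() is called, never when a lower object changes.

```python
from dataclasses import dataclass


@dataclass
class LowerObject:
    """Stand-in for a lower inode; only the two timestamps matter here."""
    mtime: float
    ctime: float


@dataclass
class UnionObject:
    """Stand-in for a union-level object stacked on one or more lower objects."""
    lowers: list
    mtime: float = 0.0
    ctime: float = 0.0
    rebuilds: int = 0  # hypothetical counter, used only to observe the deferral

    def __post_init__(self):
        self.sync_times()

    def sync_times(self):
        # Upper times are the max() of all the lower times.
        self.mtime = max(low.mtime for low in self.lowers)
        self.ctime = max(low.ctime for low in self.lowers)

    def is_stale(self):
        # Stale if any lower object carries a newer mtime or ctime.
        return any(low.mtime > self.mtime or low.ctime > self.ctime
                   for low in self.lowers)

    def revalidate(self):
        """Rebuild the upper object if needed; return True if it was rebuilt."""
        if not self.is_stale():
            return False
        self.rebuilds += 1   # models "drop the older objects and re-lookup"
        self.sync_times()
        return True


lower = LowerObject(mtime=100.0, ctime=100.0)
union = UnionObject(lowers=[lower])
lower.mtime = 105.0   # burst of direct changes on the lower branch...
lower.mtime = 110.0   # ...costs nothing at the union level
rebuilt = union.revalidate()   # the work is deferred until this call
```

Two lower modifications result in a single rebuild, which is the deferral
property the paragraph above relies on.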

That said, we're now developing some VFS patches that would allow lower file
systems to more directly inform the upper objects about such (mtime)
changes.  We're exploring a couple of different options but our key goals
are to (a) minimize VFS changes and (b) avoid any changes to lower file
systems.

(2) nameidata handling.  Another important question raised (esp. by NFS
    people) was how we handle struct nameidata.  The VFS passes nameidata
    structs to file systems, and some file systems use that.  We used to
    either pass NULL or the upper nd to the lower f/s.  That caused NULL
    de-refs inside nfsv4, among other problems.  We now create our own
    nameidata structure, fill it up as needed (esp. for intent data), and
    pass it down.  We do this every time we call any VFS function that takes
    a nameidata (e.g., vfs_create).  This seems to work well.

There's been some discussion on lkml about splitting struct nameidata in
two, one of which would handle just the intent information.  I'd like to see
that happen, maybe even help, because right now we pass a whole large-ish
struct nameidata for just a couple of intent bits of information that the
lower f/s needs.

(3) namespace pollution.  Unioning readonly and readwrite directories
    requires the ability to mask, or white-out, files that are being deleted
    from a readonly directory.  Unionfs does this in a portable way, by
    creating .wh.XXX files to indicate that file XXX has been whited-out.
    This works well on many file systems, but it tends to clutter lower
    branches with these .wh.* files.  We recently optimized our whiteout
    creation algorithm so it minimizes the number of conditions in which
    whiteouts are created, and that helped some people a lot.  But still, if
    you unify a readonly and writeable branch, and you try to delete a file
    from the readonly branch/medium, there's no way to avoid creating some
    sort of a whiteout.  BTW, of course, these whiteouts are completely
    hidden from the view of the user who accesses files/dirs via the union.
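
As an illustration only, the masking effect of .wh.* files can be modelled
in a few lines of userspace Python.  The function name and the use of sets
of names per branch are assumptions made for this sketch; it also assumes
that a whiteout masks names only in lower-priority branches.

```python
WH_PREFIX = ".wh."


def unified_view(branches):
    """Return the names visible through the union.

    branches: list of sets of file names, index 0 = highest priority.
    """
    visible = set()
    masked = set()
    for names in branches:
        for name in names:
            # Whiteout files themselves are never shown to the user.
            if not name.startswith(WH_PREFIX) and name not in masked:
                visible.add(name)
        # Whiteouts found in this branch mask names in the branches below it.
        masked.update(n[len(WH_PREFIX):] for n in names
                      if n.startswith(WH_PREFIX))
    return sorted(visible)


# Branch 0 is read-write, branch 1 is read-only and holds 'foo' and 'bar'.
before_delete = unified_view([set(), {"foo", "bar"}])
# Deleting 'foo' through the union creates .wh.foo on the writable branch.
after_delete = unified_view([{".wh.foo"}, {"foo", "bar"}])
```

Here before_delete lists both names, while after_delete lists only 'bar':
the read-only branch is untouched, and neither 'foo' nor '.wh.foo' appears
in the union.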

In the long run, we really hope to see native whiteout support in Linux (ala
BSD).  Of course, this would require a change to the VFS and several native
file systems (possibly even a change to the on-disk format), so we realize
that this isn't likely to happen soon.  If/when native whiteout support
becomes available, unionfs could easily use it.  Until that time, we have
lots of users who want to use unionfs on top of numerous different file
systems, and so we have to do the next best thing wrt whiteouts.

This is a good point to mention that the version of unionfs in -mm is 2.1.x.
We have been working on a newer and still experimental version of Unionfs,
called "Unionfs with On-Disk Format" or Unionfs-ODF.  Unionfs-ODF uses a
small persistent store (e.g., a small ext2 partition) to store whiteouts in,
among other info; this moves the union-level meta-data (e.g., whiteouts),
outside the lower file systems, and thus eliminates the need to create .wh.*
files.  Unionfs-ODF has other useful benefits, and you can get more detail
about it here: <http://www.filesystems.org/unionfs-odf.txt>.  We recently
sync'ed up our unionfs 2.1 and unionfs-odf releases and we're tracking
Linus's tree for both.  IOW, every fix and user-visible feature that has
gone into unionfs in -mm, is now also in unionfs-odf.  Our intent is to
continue to develop both versions, and gradually move features from
unionfs-odf into unionfs 2.1; this would be possible even if/after
unionfs-2.1 gets merged, because the changes will all be internal to the
implementation, and users won't need to change the way they, say, mount a
union or manipulate its branches.

(4) branch management.  One of the most useful features of unioning is to be
    able to add/remove branches from the union.  We used to do this via
    ioctl's, which was considered racy, unclean, and non-atomic (only one
    branch-manipulation operation at a time).  We now do that via the
    remount interface, and allow users to pass multiple branch-manipulation
    commands, which are handled as one action.
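
The "handled as one action" behavior can be sketched in userspace Python as
follows.  This is a model, not the kernel's option parser: the function
name, the (path, mode) tuples, and the all-or-nothing error handling are
assumptions of the sketch.  The option syntax (add=, del=, mode=) follows
the usage documentation in this series.

```python
def apply_remount(branches, options):
    """Apply a comma-separated remount option string to a branch list.

    branches: list of (path, mode) tuples, index 0 = highest priority.
    Returns a new list; on any error raises ValueError and leaves the
    input list untouched, so the commands take effect together or not at all.
    """
    new = list(branches)  # work on a copy

    def index_of(path):
        for i, (p, _) in enumerate(new):
            if p == path:
                return i
        raise ValueError("no such branch: " + path)

    def split_mode(spec):
        path, sep, mode = spec.partition("=")
        mode = mode if sep else "rw"  # rw is the documented default
        if mode not in ("ro", "rw"):
            raise ValueError("bad mode: " + mode)
        return path, mode

    for opt in options.split(","):
        key, _, value = opt.partition("=")
        if key == "add":
            before, colon, spec = value.partition(":")
            if not colon:          # add=/foo      -> new highest priority
                spec, pos = value, 0
            elif before == "":     # add=:/foo     -> new lowest priority
                pos = len(new)
            else:                  # add=/bar:/foo -> insert before /bar
                pos = index_of(before)
            new.insert(pos, split_mode(spec))
        elif key == "del":
            del new[index_of(value)]
        elif key == "mode":
            path, mode = split_mode(value)
            new[index_of(path)] = (path, mode)
        elif key in ("remount", "ro", "rw", "incgen"):
            continue               # union-wide options, not modelled here
        else:
            raise ValueError("unknown option: " + opt)
    return new


current = [("/foo", "rw"), ("/bar", "ro")]
updated = apply_remount(current, "remount,add=/new,mode=/foo=ro")
```

After the call, updated has /new as the writable top branch with /foo
demoted to read-only, while current is left as it was.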


* GENERAL

I should note that my philosophy in developing any stackable file system had
been to minimize changes to the VFS, and to not change any lower file system
whatsoever: that ensures that unionfs couldn't affect the stability or
performance of the rest of the kernel.  Still, some of the things unionfs
does could possibly be done more cleanly and easily at the VFS level (e.g.,
better hooks for cache coherency).

Unionfs 2.1.x is currently maintained on 2.6.9 and all major kernels since
2.6.18, all the way to Linus's latest 2.6.24-rc tree and -mm.  We've got a
lot of users who use unionfs in more creative ways than even we could think of,
and this has helped us find the RIGHT set of features to please the users,
as well as stabilize the code.  Before every new release, we test the new
code on all versions using ltp-full, parallel compiles, and our own
unionfs-aware regression suite which exercises unionfs's unique features
(e.g., copy-up).

I therefore believe that unionfs is in a good enough shape now to be
considered for merging in 2.6.25.  The user-visible behavior isn't likely to
change; and any changes to the VFS to better support stacking, could be
handled internally in subsequent kernels without affecting how users use
unionfs.  Aside from greater exposure to stackable file systems and unionfs,
I think one of the other important benefits of a merge could be that we'd
have more than one stackable f/s in the kernel (i.e., ecryptfs and unionfs);
this would allow us to slowly and gradually generalize the VFS so it can
better support stackable file systems.


Lastly, diffstats:

 Documentation/filesystems/00-INDEX             |    2 
 Documentation/filesystems/unionfs/00-INDEX     |   10 
 Documentation/filesystems/unionfs/concepts.txt |  199 ++++
 Documentation/filesystems/unionfs/issues.txt   |   24 
 Documentation/filesystems/unionfs/rename.txt   |   31 
 Documentation/filesystems/unionfs/usage.txt    |  115 ++
 MAINTAINERS                                    |    9 
 fs/Kconfig                                     |   53 -
 fs/Makefile                                    |    1 
 fs/drop_caches.c                               |    4 
 fs/ecryptfs/dentry.c                           |    2 
 fs/ecryptfs/inode.c                            |    6 
 fs/ecryptfs/main.c                             |    2 
 fs/namei.c                                     |    1 
 fs/stack.c                                     |   30 
 fs/unionfs/Makefile                            |   13 
 fs/unionfs/commonfops.c                        |  827 +++++++++++++++++
 fs/unionfs/copyup.c                            |  897 +++++++++++++++++++
 fs/unionfs/debug.c                             |  532 +++++++++++
 fs/unionfs/dentry.c                            |  498 ++++++++++
 fs/unionfs/dirfops.c                           |  290 ++++++
 fs/unionfs/dirhelper.c                         |  272 +++++
 fs/unionfs/fanout.h                            |  355 +++++++
 fs/unionfs/file.c                              |  227 ++++
 fs/unionfs/inode.c                             | 1154 +++++++++++++++++++++++++
 fs/unionfs/lookup.c                            |  652 ++++++++++++++
 fs/unionfs/main.c                              |  783 ++++++++++++++++
 fs/unionfs/mmap.c                              |  338 +++++++
 fs/unionfs/rdstate.c                           |  285 ++++++
 fs/unionfs/rename.c                            |  533 +++++++++++
 fs/unionfs/sioq.c                              |  119 ++
 fs/unionfs/sioq.h                              |   92 +
 fs/unionfs/subr.c                              |  242 +++++
 fs/unionfs/super.c                             | 1020 ++++++++++++++++++++++
 fs/unionfs/union.h                             |  591 ++++++++++++
 fs/unionfs/unlink.c                            |  236 +++++
 fs/unionfs/xattr.c                             |  153 +++
 include/linux/fs_stack.h                       |   21 
 include/linux/magic.h                          |    2 
 include/linux/mm.h                             |    2 
 include/linux/namei.h                          |   13 
 include/linux/union_fs.h                       |   24 
 42 files changed, 10624 insertions(+), 36 deletions(-)


Thanks,
Erez.


* [PATCH 01/42] Unionfs: filesystems documentation index
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10 14:47   ` Jan Engelhardt
  2007-12-10  2:41 ` [PATCH 02/42] Unionfs: unionfs " Erez Zadok
                   ` (41 subsequent siblings)
  42 siblings, 1 reply; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 Documentation/filesystems/00-INDEX |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 1de155e..b168331 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -96,6 +96,8 @@ udf.txt
 	- info and mount options for the UDF filesystem.
 ufs.txt
 	- info on the ufs filesystem.
+unionfs/
+	- info on the unionfs filesystem
 vfat.txt
 	- info on using the VFAT filesystem used in Windows NT and Windows 95
 vfs.txt
-- 
1.5.2.2



* [PATCH 02/42] Unionfs: unionfs documentation index
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
  2007-12-10  2:41 ` [PATCH 01/42] Unionfs: filesystems documentation index Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 03/42] Unionfs: documentation for general concepts Erez Zadok
                   ` (40 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 Documentation/filesystems/unionfs/00-INDEX |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/unionfs/00-INDEX

diff --git a/Documentation/filesystems/unionfs/00-INDEX b/Documentation/filesystems/unionfs/00-INDEX
new file mode 100644
index 0000000..96fdf67
--- /dev/null
+++ b/Documentation/filesystems/unionfs/00-INDEX
@@ -0,0 +1,10 @@
+00-INDEX
+	- this file.
+concepts.txt
+	- A brief introduction of concepts.
+issues.txt
+	- A summary of known issues with unionfs.
+rename.txt
+	- Information regarding rename operations.
+usage.txt
+	- Usage information and examples.
-- 
1.5.2.2



* [PATCH 03/42] Unionfs: documentation for general concepts
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
  2007-12-10  2:41 ` [PATCH 01/42] Unionfs: filesystems documentation index Erez Zadok
  2007-12-10  2:41 ` [PATCH 02/42] Unionfs: unionfs " Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 04/42] Unionfs: usage documentation for users Erez Zadok
                   ` (39 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 Documentation/filesystems/unionfs/concepts.txt |  199 ++++++++++++++++++++++++
 1 files changed, 199 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/unionfs/concepts.txt

diff --git a/Documentation/filesystems/unionfs/concepts.txt b/Documentation/filesystems/unionfs/concepts.txt
new file mode 100644
index 0000000..7654ccc
--- /dev/null
+++ b/Documentation/filesystems/unionfs/concepts.txt
@@ -0,0 +1,199 @@
+Unionfs 2.1 CONCEPTS:
+=====================
+
+This file describes the concepts needed by a namespace unification file
+system.
+
+
+Branch Priority:
+================
+
+Each branch is assigned a unique priority - starting from 0 (highest
+priority).  No two branches can have the same priority.
+
+
+Branch Mode:
+============
+
+Each branch is assigned a mode - read-write or read-only. This allows
+directories on media mounted read-write to be used in a read-only manner.
+
+
+Whiteouts:
+==========
+
+A whiteout removes a file name from the namespace. Whiteouts are needed when
+one attempts to remove a file on a read-only branch.
+
+Suppose we have a two-branch union, where branch 0 is read-write and branch
+1 is read-only. And a file 'foo' on branch 1:
+
+./b0/
+./b1/
+./b1/foo
+
+The unified view would simply be:
+
+./union/
+./union/foo
+
+Since 'foo' is stored on a read-only branch, it cannot be removed. A
+whiteout is used to remove the name 'foo' from the unified namespace. Again,
+since branch 1 is read-only, the whiteout cannot be created there. So, we
+try on a higher priority (lower numerically) branch and create the whiteout
+there.
+
+./b0/
+./b0/.wh.foo
+./b1/
+./b1/foo
+
+Later, when Unionfs traverses branches (due to lookup or readdir), it
+eliminates 'foo' from the namespace (as well as the whiteout itself.)
+
+
+Duplicate Elimination:
+======================
+
+It is possible for files on different branches to have the same name.
+Unionfs then has to select which instance of the file to show to the user.
+Given the fact that each branch has a priority associated with it, the
+simplest solution is to take the instance from the highest priority
+(numerically lowest value) and "hide" the others.
+
+
+Copyup:
+=======
+
+When a change is made to a file's data or meta-data, the change has to be
+stored somewhere. The best way is to create a copy of the original file on
+a branch that is writable, and then redirect the write through to this
+copy. The copy must be made on a higher priority branch so
+that lookup and readdir return this newer "version" of the file rather than
+the original (see duplicate elimination).
+
+
+Cache Coherency:
+================
+
+Unionfs users often want to be able to modify files and directories directly
+on the lower branches, and have those changes be visible at the Unionfs
+level.  This means that data (e.g., pages) and meta-data (dentries, inodes,
+open files, etc.) have to be synchronized between the upper and lower
+layers.  In other words, the newest changes from a layer below have to be
+propagated to the Unionfs layer above.  If the two layers are not in sync, a
+cache incoherency ensues, which could lead to application failures and even
+oopses.  The Linux kernel, however, has a rather limited set of mechanisms
+to ensure this inter-layer cache coherency---so Unionfs has to do most of
+the hard work on its own.
+
+Maintaining Invariants:
+
+The way Unionfs ensures cache coherency is as follows.  At each entry point
+to a Unionfs file system method, we call a utility function to validate the
+primary objects of this method.  Generally, we call unionfs_file_revalidate
+on open files, and __unionfs_d_revalidate_chain on dentries (which also
+validates inodes).  These utility functions check to see whether the upper
+Unionfs object is in sync with any of the lower objects that it represents.
+The checks we perform include whether the Unionfs superblock has a newer
+generation number, or if any of the lower objects mtime's or ctime's are
+newer.  (Note: generation numbers change when branch-management commands are
+issued, so in a way, maintaining cache coherency is also very important for
+branch-management.)  If indeed we determine that any Unionfs object is no
+longer in sync with its lower counterparts, then we rebuild that object
+similarly to how we do so for branch-management.
+
+While rebuilding Unionfs's objects, we also purge any page mappings and
+truncate inode pages (see fs/unionfs/dentry.c:purge_inode_data).  This is to
+ensure that Unionfs will re-get the newer data from the lower branches.  We
+perform this purging only if the Unionfs operation in question is a reading
+operation; if Unionfs is performing a data writing operation (e.g., ->write,
+->commit_write, etc.) then we do NOT flush the lower mappings/pages: this is
+because (1) a self-deadlock could occur and (2) the upper Unionfs pages are
+considered more authoritative anyway, as they are newer and will overwrite
+any lower pages.
+
+Unionfs maintains the following important invariant regarding mtime's,
+ctime's, and atime's: the upper inode object's times are the max() of all of
+the lower ones.  For non-directory objects, there's only one object below,
+so the mapping is simple; for directory objects, there could be multiple
+lower objects and we have to sync up with the newest one of all the lower
+ones.  This invariant is important to maintain, especially for directories
+(besides, we need this to be POSIX compliant).  A union could comprise
+multiple writable branches, each of which could change.  If we don't reflect
+the newest possible mtime/ctime, some applications could fail.  For example,
+NFSv2/v3 exports check for newer directory mtimes on the server to determine
+if the client-side attribute cache should be purged.
+
+To maintain these important invariants, of course, Unionfs carefully
+synchronizes upper and lower times in various places.  For example, if we
+copy-up a file to a top-level branch, the parent directory where the file
+was copied up to will now have a new mtime: so after a successful copy-up,
+we sync up with the new top-level branch's parent directory mtime.
+
+Implementation:
+
+This cache-coherency implementation is efficient because it defers any
+synchronizing between the upper and lower layers until absolutely needed.
+Consider the common situation where users perform a lot of lower
+changes, such as untarring a whole package.  While these take place,
+typically the user doesn't access the files via Unionfs; only after the
+lower changes are done, does the user try to access the lower files.  With
+our cache-coherency implementation, the entirety of the changes to the lower
+branches will not result in a single CPU cycle spent at the Unionfs level
+until the user invokes a system call that goes through Unionfs.
+
+We have considered two alternate cache-coherency designs.  (1) Using the
+dentry/inode notify functionality to register interest in finding out about
+any lower changes.  This is a somewhat limited and also a heavy-handed
+approach which could result in many notifications to the Unionfs layer upon
+each small change at the lower layer (imagine a file being modified multiple
+times in rapid succession).  (2) Rewriting the VFS to support explicit
+callbacks from lower objects to upper objects.  We began exploring such an
+implementation, but found it to be very complicated--it would have resulted
+in massive VFS/MM changes which are unlikely to be accepted by the LKML
+community.  We therefore believe that our current cache-coherency design and
+implementation represent the best approach at this time.
+
+Limitations:
+
+Our implementation works as long as a user process causes Unionfs to be
+called, directly or indirectly, even if only to do ->d_revalidate: at that
+point we will have purged the stale Unionfs data, and the process will see
+the new data.  For example, a process that continually re-reads the same
+file's data will see the NEW data once the lower file has changed, upon the
+next read(2) syscall (even if the file is still open!).  However, this
+doesn't work when the process re-reads the open file's
+data via mmap(2) (unless the user unmaps/closes the file and remaps/reopens
+it).  Once we respond to ->readpage(s), then the kernel maps the page into
+the process's address space and there doesn't appear to be a way to force
+the kernel to invalidate those pages/mappings, and force the process to
+re-issue ->readpage.  If there's a way to invalidate active mappings and
+force a ->readpage, let us know please (invalidate_inode_pages2 doesn't do
+the trick).
+
+Our current Unionfs code has to perform many file-revalidation calls.  It
+would be really nice if the VFS would export an optional file system hook
+->file_revalidate (similarly to dentry->d_revalidate) that will be called
+before each VFS op that has a "struct file" in it.
+
+Certain file systems have micro-second granularity (or better) for inode
+times, and asynchronous actions could cause those times to change with some
+small delay.  In such cases, Unionfs may see a changed inode time that only
+differs by a tiny fraction of a second: such a change may be a false
+positive indication that the lower object has changed, whereas if unionfs
+waits a little longer, that false indication will not be seen.  (These false
+positives are harmless, because they would at most cause unionfs to
+re-validate an object that may need no revalidation, and print a debugging
+message that clutters the console/logs.)  Therefore, to minimize the chances
+of these situations, we delay the detection of changed times by a small
+factor of a few seconds, called UNIONFS_MIN_CC_TIME (which defaults to 3
+seconds, as does NFS).  This means that we will detect the change, only a
+couple of seconds later, if indeed the time change persists in the lower
+file object.  This delayed detection has an added performance benefit: we
+reduce the number of times that unionfs has to revalidate objects, in case
+there's a lot of concurrent activity on both the upper and lower objects,
+for the same file(s).  Lastly, this delayed time attribute detection is
+similar to how NFS clients operate (e.g., acregmin).
+
+For more information, see <http://unionfs.filesystems.org/>.
-- 
1.5.2.2



* [PATCH 04/42] Unionfs: usage documentation for users
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (2 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 03/42] Unionfs: documentation for general concepts Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 05/42] Unionfs: documentation for any known issues Erez Zadok
                   ` (38 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 Documentation/filesystems/unionfs/usage.txt |  115 +++++++++++++++++++++++++++
 1 files changed, 115 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/unionfs/usage.txt

diff --git a/Documentation/filesystems/unionfs/usage.txt b/Documentation/filesystems/unionfs/usage.txt
new file mode 100644
index 0000000..a6b1aca
--- /dev/null
+++ b/Documentation/filesystems/unionfs/usage.txt
@@ -0,0 +1,115 @@
+Unionfs is a stackable unification file system, which can appear to merge
+the contents of several directories (branches), while keeping their physical
+content separate.  Unionfs is useful for unified source tree management,
+merged contents of split CD-ROM, merged separate software package
+directories, data grids, and more.  Unionfs allows any mix of read-only and
+read-write branches, as well as insertion and deletion of branches anywhere
+in the fan-out.  To maintain Unix semantics, Unionfs handles elimination of
+duplicates, partial-error conditions, and more.
+
+# mount -t unionfs -o branch-option[,union-options[,...]] none MOUNTPOINT
+
+The available branch-option for the mount command is:
+
+	dirs=branch[=ro|=rw][:...]
+
+specifies a colon-separated list of the directories that compose the union.
+Directories that come earlier in the list have a higher precedence than
+those which come later. Additionally, read-only or read-write permissions of
+the branch can be specified by appending =ro or =rw (default) to each
+directory.
+
+Syntax:
+
+	dirs=/branch1[=ro|=rw]:/branch2[=ro|=rw]:...:/branchN[=ro|=rw]
+
+Example:
+
+	dirs=/writable_branch=rw:/read-only_branch=ro
+
+
+DYNAMIC BRANCH MANAGEMENT AND REMOUNTS
+======================================
+
+You can remount a union and change its overall mode, or reconfigure the
+branches, as follows.
+
+To downgrade a union from read-write to read-only:
+
+# mount -t unionfs -o remount,ro none MOUNTPOINT
+
+To upgrade a union from read-only to read-write:
+
+# mount -t unionfs -o remount,rw none MOUNTPOINT
+
+To delete a branch /foo, regardless of where it is in the current union:
+
+# mount -t unionfs -o remount,del=/foo none MOUNTPOINT
+
+To insert (add) a branch /foo before /bar:
+
+# mount -t unionfs -o remount,add=/bar:/foo none MOUNTPOINT
+
+To insert (add) a branch /foo (with the "rw" mode flag) before /bar:
+
+# mount -t unionfs -o remount,add=/bar:/foo=rw none MOUNTPOINT
+
+To insert (add) a branch /foo (in "rw" mode) at the very beginning (i.e., a
+new highest-priority branch), you can use the above syntax, or use a
+shorthand version as follows:
+
+# mount -t unionfs -o remount,add=/foo none MOUNTPOINT
+
+To append a branch to the very end (new lowest-priority branch):
+
+# mount -t unionfs -o remount,add=:/foo none MOUNTPOINT
+
+To append a branch to the very end (new lowest-priority branch), in
+read-only mode:
+
+# mount -t unionfs -o remount,add=:/foo=ro none MOUNTPOINT
+
+Finally, to change the mode of one existing branch, say /foo, from read-only
+to read-write, and change /bar from read-write to read-only:
+
+# mount -t unionfs -o remount,mode=/foo=rw,mode=/bar=ro none MOUNTPOINT
+
+Note: in Unionfs 2.x, you cannot set the leftmost branch to readonly because
+then Unionfs won't have any writable place for copyups to take place.
+Moreover, the VFS can get confused when it tries to modify something in a
+file system mounted read-write, but isn't permitted to write to it.
+Instead, you should set the whole union as readonly, as described above.
+If, however, you must set the leftmost branch as readonly, perhaps so you
+can get a snapshot of it at a point in time, then you should insert a new
+writable top-level branch, and mark the one you want as readonly.  This can
+be accomplished as follows, assuming that /foo is your current leftmost
+branch:
+
+# mount -t tmpfs -o size=NNN none /new
+# mount -t unionfs -o remount,add=/new,mode=/foo=ro none MOUNTPOINT
+<do what you want safely in /foo>
+# mount -t unionfs -o remount,del=/new,mode=/foo=rw none MOUNTPOINT
+<check if there's anything in /new you want to preserve>
+# umount /new
+
+CACHE CONSISTENCY
+=================
+
+If you modify any file on any of the lower branches directly, while there is
+a Unionfs 2.1 mounted above any of those branches, you should tell Unionfs
+to purge its caches and re-get the objects.  To do that, you have to
+increment the generation number of the superblock using the following
+command:
+
+# mount -t unionfs -o remount,incgen none MOUNTPOINT
+
+Note that the older way of incrementing the generation number using an
+ioctl, is no longer supported in Unionfs 2.0 and newer.  Ioctls in general
+are not encouraged.  Plus, an ioctl is a per-file concept, whereas the
+generation number is a per-file-system concept.  Worse, such an ioctl
+requires an open file, which then has to be invalidated by the very nature
+of the generation number increase (read: the old generation increase ioctl
+was pretty racy).
+
+
+For more information, see <http://unionfs.filesystems.org/>.
-- 
1.5.2.2



* [PATCH 05/42] Unionfs: documentation for any known issues
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (3 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 04/42] Unionfs: usage documentation for users Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 06/42] Unionfs: documentation about renaming operations Erez Zadok
                   ` (37 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 Documentation/filesystems/unionfs/issues.txt |   24 ++++++++++++++++++++++++
 1 files changed, 24 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/unionfs/issues.txt

diff --git a/Documentation/filesystems/unionfs/issues.txt b/Documentation/filesystems/unionfs/issues.txt
new file mode 100644
index 0000000..9db1d70
--- /dev/null
+++ b/Documentation/filesystems/unionfs/issues.txt
@@ -0,0 +1,24 @@
+KNOWN Unionfs 2.1 ISSUES:
+=========================
+
+1. Unionfs should not use lookup_one_len() on the underlying f/s as it
+   confuses NFSv4.  Currently, unionfs_lookup() passes lookup intents to the
+   lower file-system, which eliminates part of the problem.  The remaining
+   calls to lookup_one_len() may need to be changed to pass an intent.  We are
+   currently introducing VFS changes to fs/namei.c's do_path_lookup() to
+   allow proper file lookup and opening in stackable file systems.
+
+2. Lockdep (a debugging feature) isn't aware of stacking, and so it
+   incorrectly complains about locking problems.  The problem boils down to
+   this: Lockdep considers all objects of a certain type to be in the same
+   class, for example, all inodes.  Lockdep doesn't like to see a lock held
+   on two inodes within the same task, and warns that it could lead to a
+   deadlock.  However, stackable file systems do precisely that: they lock
+   an upper object, and then a lower object, in a strict order to avoid
+   locking problems; in addition, Unionfs, as a fan-out file system, may
+   have to lock several lower inodes.  We are currently looking into Lockdep
+   to see how to make it aware of stackable file systems.  In the meantime,
+   if you get any warnings from Lockdep, you can safely ignore them (or feel
+   free to report them to the Unionfs maintainers, just to be sure).
+
+For more information, see <http://unionfs.filesystems.org/>.
-- 
1.5.2.2


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 06/42] Unionfs: documentation about renaming operations
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (4 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 05/42] Unionfs: documentation for any known issues Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 07/42] Unionfs maintainers Erez Zadok
                   ` (36 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 Documentation/filesystems/unionfs/rename.txt |   31 ++++++++++++++++++++++++++
 1 files changed, 31 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/unionfs/rename.txt

diff --git a/Documentation/filesystems/unionfs/rename.txt b/Documentation/filesystems/unionfs/rename.txt
new file mode 100644
index 0000000..e20bb82
--- /dev/null
+++ b/Documentation/filesystems/unionfs/rename.txt
@@ -0,0 +1,31 @@
+Rename is a complex beast. The following table shows which rename(2) operations
+should succeed and which should fail.
+
+o: success
+E: error (either unionfs or vfs)
+X: EXDEV
+
+none = file does not exist
+file = file is a file
+dir  = file is an empty directory
+child= file is a non-empty directory
+wh   = file is a directory containing only whiteouts; this makes it logically
+		empty
+
+                      none    file    dir     child   wh
+file                  o       o       E       E       E
+dir                   o       E       o       E       o
+child                 X       E       X       E       X
+wh                    o       E       o       E       o
+
+
+Renaming directories:
+=====================
+
+Whenever an empty (either physically or logically) directory is being renamed,
+the following sequence of events should take place:
+
+1) Remove whiteouts from both source and destination directory
+2) Rename source to destination
+3) Make destination opaque to prevent anything under it from showing up
+
-- 
1.5.2.2
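
The rename table in rename.txt can be read as a small lookup matrix.  The
sketch below encodes it that way.  It is only an illustration and not code
from the patch: the enum and function names are hypothetical, and it assumes
that the rows of the table describe the rename source and the columns
describe the destination, which the table itself does not state.

```c
#include <assert.h>

/* Hypothetical names for the row (source) and column (destination) kinds. */
enum src_kind { SRC_FILE, SRC_DIR, SRC_CHILD, SRC_WH, SRC_KINDS };
enum dst_kind { DST_NONE, DST_FILE, DST_DIR, DST_CHILD, DST_WH, DST_KINDS };

/*
 * One string per source row, one character per destination column:
 * 'o' = success, 'E' = error (unionfs or VFS), 'X' = EXDEV.
 */
static const char rename_table[SRC_KINDS][DST_KINDS + 1] = {
	"ooEEE",	/* file  */
	"oEoEo",	/* dir   */
	"XEXEX",	/* child */
	"oEoEo",	/* wh    */
};

/* Look up the expected outcome of renaming `src` onto `dst`. */
static char rename_outcome(enum src_kind src, enum dst_kind dst)
{
	assert(src < SRC_KINDS && dst < DST_KINDS);
	return rename_table[src][dst];
}
```

Under that reading, rename_outcome(SRC_CHILD, DST_NONE) yields 'X': renaming
a non-empty directory to a new name is expected to return EXDEV, whereas a
directory that holds only whiteouts behaves like an empty one.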


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 07/42] Unionfs maintainers
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (5 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 06/42] Unionfs: documentation about renaming operations Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 08/42] Makefile: hook to compile unionfs Erez Zadok
                   ` (35 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 MAINTAINERS |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index f3d7256..95f16f0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3805,6 +3805,15 @@ L:	linux-kernel@vger.kernel.org
 W:	http://www.kernel.dk
 S:	Maintained
 
+UNIONFS
+P:	Erez Zadok
+M:	ezk@cs.sunysb.edu
+P:	Josef "Jeff" Sipek
+M:	jsipek@cs.sunysb.edu
+L:	unionfs@filesystems.org
+W:	http://unionfs.filesystems.org
+S:	Maintained
+
 USB ACM DRIVER
 P:	Oliver Neukum
 M:	oliver@neukum.name
-- 
1.5.2.2


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 08/42] Makefile: hook to compile unionfs
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (6 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 07/42] Unionfs maintainers Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 09/42] Unionfs: main Makefile Erez Zadok
                   ` (34 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/Makefile |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/Makefile b/fs/Makefile
index 500cf15..e202288 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -118,3 +118,4 @@ obj-$(CONFIG_HPPFS)		+= hppfs/
 obj-$(CONFIG_DEBUG_FS)		+= debugfs/
 obj-$(CONFIG_OCFS2_FS)		+= ocfs2/
 obj-$(CONFIG_GFS2_FS)           += gfs2/
+obj-$(CONFIG_UNION_FS)		+= unionfs/
-- 
1.5.2.2


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 09/42] Unionfs: main Makefile
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (7 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 08/42] Makefile: hook to compile unionfs Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 10/42] Unionfs: fanout header definitions Erez Zadok
                   ` (33 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/Makefile |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/Makefile

diff --git a/fs/unionfs/Makefile b/fs/unionfs/Makefile
new file mode 100644
index 0000000..17ca4a7
--- /dev/null
+++ b/fs/unionfs/Makefile
@@ -0,0 +1,13 @@
+obj-$(CONFIG_UNION_FS) += unionfs.o
+
+unionfs-y := subr.o dentry.o file.o inode.o main.o super.o \
+	rdstate.o copyup.o dirhelper.o rename.o unlink.o \
+	lookup.o commonfops.o dirfops.o sioq.o mmap.o
+
+unionfs-$(CONFIG_UNION_FS_XATTR) += xattr.o
+
+unionfs-$(CONFIG_UNION_FS_DEBUG) += debug.o
+
+ifeq ($(CONFIG_UNION_FS_DEBUG),y)
+EXTRA_CFLAGS += -DDEBUG
+endif
-- 
1.5.2.2


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 10/42] Unionfs: fanout header definitions
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (8 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 09/42] Unionfs: main Makefile Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 11/42] Unionfs: main header file Erez Zadok
                   ` (32 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/fanout.h |  355 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 355 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/fanout.h

diff --git a/fs/unionfs/fanout.h b/fs/unionfs/fanout.h
new file mode 100644
index 0000000..864383e
--- /dev/null
+++ b/fs/unionfs/fanout.h
@@ -0,0 +1,355 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef _FANOUT_H_
+#define _FANOUT_H_
+
+/*
+ * Inode to private data
+ *
+ * Since we use containers and the struct inode is _inside_ the
+ * unionfs_inode_info structure, UNIONFS_I will always (given a non-NULL
+ * inode pointer), return a valid non-NULL pointer.
+ */
+static inline struct unionfs_inode_info *UNIONFS_I(const struct inode *inode)
+{
+	return container_of(inode, struct unionfs_inode_info, vfs_inode);
+}
+
+#define ibstart(ino) (UNIONFS_I(ino)->bstart)
+#define ibend(ino) (UNIONFS_I(ino)->bend)
+
+/* Superblock to private data */
+#define UNIONFS_SB(super) ((struct unionfs_sb_info *)(super)->s_fs_info)
+#define sbstart(sb) 0
+#define sbend(sb) (UNIONFS_SB(sb)->bend)
+#define sbmax(sb) (UNIONFS_SB(sb)->bend + 1)
+#define sbhbid(sb) (UNIONFS_SB(sb)->high_branch_id)
+
+/* File to private Data */
+#define UNIONFS_F(file) ((struct unionfs_file_info *)((file)->private_data))
+#define fbstart(file) (UNIONFS_F(file)->bstart)
+#define fbend(file) (UNIONFS_F(file)->bend)
+
+/* macros to manipulate branch IDs stored in our superblock */
+static inline int branch_id(struct super_block *sb, int index)
+{
+	BUG_ON(!sb || index < 0);
+	return UNIONFS_SB(sb)->data[index].branch_id;
+}
+
+static inline void set_branch_id(struct super_block *sb, int index, int val)
+{
+	BUG_ON(!sb || index < 0);
+	UNIONFS_SB(sb)->data[index].branch_id = val;
+}
+
+static inline void new_branch_id(struct super_block *sb, int index)
+{
+	BUG_ON(!sb || index < 0);
+	set_branch_id(sb, index, ++UNIONFS_SB(sb)->high_branch_id);
+}
+
+/*
+ * Find new index of matching branch with an existing superblock of a known
+ * (possibly old) id.  This is needed because branches could have been
+ * added/deleted causing the branches of any open files to shift.
+ *
+ * @sb: the new superblock which may have new/different branch IDs
+ * @id: the old/existing id we're looking for
+ * Returns index of newly found branch (0 or greater), -1 otherwise.
+ */
+static inline int branch_id_to_idx(struct super_block *sb, int id)
+{
+	int i;
+	for (i = 0; i < sbmax(sb); i++) {
+		if (branch_id(sb, i) == id)
+			return i;
+	}
+	/* in the non-ODF code, this should really never happen */
+	printk(KERN_WARNING "unionfs: cannot find branch with id %d\n", id);
+	return -1;
+}
+
+/* File to lower file. */
+static inline struct file *unionfs_lower_file(const struct file *f)
+{
+	BUG_ON(!f);
+	return UNIONFS_F(f)->lower_files[fbstart(f)];
+}
+
+static inline struct file *unionfs_lower_file_idx(const struct file *f,
+						  int index)
+{
+	BUG_ON(!f || index < 0);
+	return UNIONFS_F(f)->lower_files[index];
+}
+
+static inline void unionfs_set_lower_file_idx(struct file *f, int index,
+					      struct file *val)
+{
+	BUG_ON(!f || index < 0);
+	UNIONFS_F(f)->lower_files[index] = val;
+	/* save branch ID (may be redundant?) */
+	UNIONFS_F(f)->saved_branch_ids[index] =
+		branch_id((f)->f_path.dentry->d_sb, index);
+}
+
+static inline void unionfs_set_lower_file(struct file *f, struct file *val)
+{
+	BUG_ON(!f);
+	unionfs_set_lower_file_idx((f), fbstart(f), (val));
+}
+
+/* Inode to lower inode. */
+static inline struct inode *unionfs_lower_inode(const struct inode *i)
+{
+	BUG_ON(!i);
+	return UNIONFS_I(i)->lower_inodes[ibstart(i)];
+}
+
+static inline struct inode *unionfs_lower_inode_idx(const struct inode *i,
+						    int index)
+{
+	BUG_ON(!i || index < 0);
+	return UNIONFS_I(i)->lower_inodes[index];
+}
+
+static inline void unionfs_set_lower_inode_idx(struct inode *i, int index,
+					       struct inode *val)
+{
+	BUG_ON(!i || index < 0);
+	UNIONFS_I(i)->lower_inodes[index] = val;
+}
+
+static inline void unionfs_set_lower_inode(struct inode *i, struct inode *val)
+{
+	BUG_ON(!i);
+	UNIONFS_I(i)->lower_inodes[ibstart(i)] = val;
+}
+
+/* Superblock to lower superblock. */
+static inline struct super_block *unionfs_lower_super(
+					const struct super_block *sb)
+{
+	BUG_ON(!sb);
+	return UNIONFS_SB(sb)->data[sbstart(sb)].sb;
+}
+
+static inline struct super_block *unionfs_lower_super_idx(
+					const struct super_block *sb,
+					int index)
+{
+	BUG_ON(!sb || index < 0);
+	return UNIONFS_SB(sb)->data[index].sb;
+}
+
+static inline void unionfs_set_lower_super_idx(struct super_block *sb,
+					       int index,
+					       struct super_block *val)
+{
+	BUG_ON(!sb || index < 0);
+	UNIONFS_SB(sb)->data[index].sb = val;
+}
+
+static inline void unionfs_set_lower_super(struct super_block *sb,
+					   struct super_block *val)
+{
+	BUG_ON(!sb);
+	UNIONFS_SB(sb)->data[sbstart(sb)].sb = val;
+}
+
+/* Branch count macros. */
+static inline int branch_count(const struct super_block *sb, int index)
+{
+	BUG_ON(!sb || index < 0);
+	return atomic_read(&UNIONFS_SB(sb)->data[index].open_files);
+}
+
+static inline void set_branch_count(struct super_block *sb, int index, int val)
+{
+	BUG_ON(!sb || index < 0);
+	atomic_set(&UNIONFS_SB(sb)->data[index].open_files, val);
+}
+
+static inline void branchget(struct super_block *sb, int index)
+{
+	BUG_ON(!sb || index < 0);
+	atomic_inc(&UNIONFS_SB(sb)->data[index].open_files);
+}
+
+static inline void branchput(struct super_block *sb, int index)
+{
+	BUG_ON(!sb || index < 0);
+	atomic_dec(&UNIONFS_SB(sb)->data[index].open_files);
+}
+
+/* Dentry macros */
+static inline struct unionfs_dentry_info *UNIONFS_D(const struct dentry *dent)
+{
+	BUG_ON(!dent);
+	return dent->d_fsdata;
+}
+
+static inline int dbstart(const struct dentry *dent)
+{
+	BUG_ON(!dent);
+	return UNIONFS_D(dent)->bstart;
+}
+
+static inline void set_dbstart(struct dentry *dent, int val)
+{
+	BUG_ON(!dent);
+	UNIONFS_D(dent)->bstart = val;
+}
+
+static inline int dbend(const struct dentry *dent)
+{
+	BUG_ON(!dent);
+	return UNIONFS_D(dent)->bend;
+}
+
+static inline void set_dbend(struct dentry *dent, int val)
+{
+	BUG_ON(!dent);
+	UNIONFS_D(dent)->bend = val;
+}
+
+static inline int dbopaque(const struct dentry *dent)
+{
+	BUG_ON(!dent);
+	return UNIONFS_D(dent)->bopaque;
+}
+
+static inline void set_dbopaque(struct dentry *dent, int val)
+{
+	BUG_ON(!dent);
+	UNIONFS_D(dent)->bopaque = val;
+}
+
+static inline void unionfs_set_lower_dentry_idx(struct dentry *dent, int index,
+						struct dentry *val)
+{
+	BUG_ON(!dent || index < 0);
+	UNIONFS_D(dent)->lower_paths[index].dentry = val;
+}
+
+static inline struct dentry *unionfs_lower_dentry_idx(
+				const struct dentry *dent,
+				int index)
+{
+	BUG_ON(!dent || index < 0);
+	return UNIONFS_D(dent)->lower_paths[index].dentry;
+}
+
+static inline struct dentry *unionfs_lower_dentry(const struct dentry *dent)
+{
+	BUG_ON(!dent);
+	return unionfs_lower_dentry_idx(dent, dbstart(dent));
+}
+
+static inline void unionfs_set_lower_mnt_idx(struct dentry *dent, int index,
+					     struct vfsmount *mnt)
+{
+	BUG_ON(!dent || index < 0);
+	UNIONFS_D(dent)->lower_paths[index].mnt = mnt;
+}
+
+static inline struct vfsmount *unionfs_lower_mnt_idx(
+					const struct dentry *dent,
+					int index)
+{
+	BUG_ON(!dent || index < 0);
+	return UNIONFS_D(dent)->lower_paths[index].mnt;
+}
+
+static inline struct vfsmount *unionfs_lower_mnt(const struct dentry *dent)
+{
+	BUG_ON(!dent);
+	return unionfs_lower_mnt_idx(dent, dbstart(dent));
+}
+
+/* Macros for locking a dentry. */
+static inline void unionfs_lock_dentry(struct dentry *d)
+{
+	BUG_ON(!d);
+	mutex_lock(&UNIONFS_D(d)->lock);
+}
+
+static inline void unionfs_unlock_dentry(struct dentry *d)
+{
+	BUG_ON(!d);
+	mutex_unlock(&UNIONFS_D(d)->lock);
+}
+
+static inline void verify_locked(struct dentry *d)
+{
+	BUG_ON(!d);
+	BUG_ON(!mutex_is_locked(&UNIONFS_D(d)->lock));
+}
+
+/* copy a/m/ctime from the lower branch with the newest times */
+static inline void unionfs_copy_attr_times(struct inode *upper)
+{
+	int bindex;
+	struct inode *lower;
+
+	if (!upper || ibstart(upper) < 0)
+		return;
+	for (bindex = ibstart(upper); bindex <= ibend(upper); bindex++) {
+		lower = unionfs_lower_inode_idx(upper, bindex);
+		if (!lower)
+			continue; /* not all lower dir objects may exist */
+		if (unlikely(timespec_compare(&upper->i_mtime,
+					      &lower->i_mtime) < 0))
+			upper->i_mtime = lower->i_mtime;
+		if (unlikely(timespec_compare(&upper->i_ctime,
+					      &lower->i_ctime) < 0))
+			upper->i_ctime = lower->i_ctime;
+		if (unlikely(timespec_compare(&upper->i_atime,
+					      &lower->i_atime) < 0))
+			upper->i_atime = lower->i_atime;
+	}
+}
+
+/*
+ * A unionfs/fanout version of fsstack_copy_attr_all.  Uses
+ * unionfs_get_nlinks to properly calculate the number of links to a file.
+ * Also, copies the max() of all a/m/ctimes for all lower inodes (which is
+ * important if the lower inode is a directory type)
+ */
+static inline void unionfs_copy_attr_all(struct inode *dest,
+					 const struct inode *src)
+{
+	dest->i_mode = src->i_mode;
+	dest->i_uid = src->i_uid;
+	dest->i_gid = src->i_gid;
+	dest->i_rdev = src->i_rdev;
+
+	unionfs_copy_attr_times(dest);
+
+	dest->i_blkbits = src->i_blkbits;
+	dest->i_flags = src->i_flags;
+
+	/*
+	 * Update the nlinks AFTER updating the above fields, because the
+	 * get_links callback may depend on them.
+	 */
+	dest->i_nlink = unionfs_get_nlinks(dest);
+}
+
+#endif	/* not _FANOUT_H_ */
-- 
1.5.2.2
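
The comment above UNIONFS_I() in fanout.h says that, because the VFS inode is
embedded inside unionfs_inode_info, the conversion always yields a valid
pointer for a non-NULL inode.  The userspace sketch below illustrates why.
It is not the kernel code: struct inode is a one-field stand-in, the private
structure is trimmed to three members, and container_of() is re-implemented
with offsetof() instead of using the kernel's macro.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace version of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for the VFS inode; the real one has many more fields. */
struct inode {
	unsigned long i_ino;
};

/* Trimmed-down private data with the VFS inode embedded, not pointed to. */
struct unionfs_inode_info {
	int bstart;
	int bend;
	struct inode vfs_inode;
};

/* Step back from the embedded inode to the structure that contains it. */
static struct unionfs_inode_info *UNIONFS_I(const struct inode *inode)
{
	assert(inode != NULL);
	return container_of(inode, struct unionfs_inode_info, vfs_inode);
}
```

Since the result is pure pointer arithmetic on an address inside the
containing object, there is no separate allocation that could be missing,
which is why this accessor, unlike the others in the header, has no NULL
check of its own.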


^ permalink raw reply related	[flat|nested] 54+ messages in thread
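
The branch ID helpers in fanout.h (patch 10/42) exist because a branch's
index can shift when branches are added or deleted, while its ID stays
fixed.  The sketch below shows the remapping idea in plain userspace C.  It
is an illustration only: struct branch and the array-based signature are
hypothetical stand-ins for the per-superblock data that the real
branch_id_to_idx() walks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-branch data kept in the superblock. */
struct branch {
	int branch_id;	/* unique ID assigned when the branch was added */
};

/*
 * Return the current index of the branch whose ID is `id`, or -1 if no
 * such branch exists any more (for example, because it was removed).
 */
static int branch_id_to_idx(const struct branch *branches, int nbranches,
			    int id)
{
	int i;

	assert(branches != NULL && nbranches >= 0);
	for (i = 0; i < nbranches; i++) {
		if (branches[i].branch_id == id)
			return i;
	}
	return -1;
}
```

For example, if a file was opened on the branch at index 1 with ID 2, and
the branch at index 0 is later removed, a lookup by the saved ID returns the
new index 0, so the open file can be pointed at the right branch again.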

* [PATCH 11/42] Unionfs: main header file
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (9 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 10/42] Unionfs: fanout header definitions Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 12/42] Unionfs: common file copyup/revalidation operations Erez Zadok
                   ` (31 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/union.h |  591 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 591 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/union.h

diff --git a/fs/unionfs/union.h b/fs/unionfs/union.h
new file mode 100644
index 0000000..20bff7b
--- /dev/null
+++ b/fs/unionfs/union.h
@@ -0,0 +1,591 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef _UNION_H_
+#define _UNION_H_
+
+#include <linux/dcache.h>
+#include <linux/file.h>
+#include <linux/list.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/page-flags.h>
+#include <linux/pagemap.h>
+#include <linux/poll.h>
+#include <linux/security.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/smp_lock.h>
+#include <linux/statfs.h>
+#include <linux/string.h>
+#include <linux/vmalloc.h>
+#include <linux/writeback.h>
+#include <linux/buffer_head.h>
+#include <linux/xattr.h>
+#include <linux/fs_stack.h>
+#include <linux/magic.h>
+#include <linux/log2.h>
+#include <linux/poison.h>
+#include <linux/mman.h>
+#include <linux/backing-dev.h>
+
+#include <asm/system.h>
+
+#include <linux/union_fs.h>
+
+/* the file system name */
+#define UNIONFS_NAME "unionfs"
+
+/* unionfs root inode number */
+#define UNIONFS_ROOT_INO     1
+
+/* number of times we try to get a unique temporary file name */
+#define GET_TMPNAM_MAX_RETRY	5
+
+/* maximum number of branches we support, to avoid memory blowup */
+#define UNIONFS_MAX_BRANCHES	128
+
+/* minimum time (seconds) required for time-based cache-coherency */
+#define UNIONFS_MIN_CC_TIME	3
+
+/* Operations vectors defined in specific files. */
+extern struct file_operations unionfs_main_fops;
+extern struct file_operations unionfs_dir_fops;
+extern struct inode_operations unionfs_main_iops;
+extern struct inode_operations unionfs_dir_iops;
+extern struct inode_operations unionfs_symlink_iops;
+extern struct super_operations unionfs_sops;
+extern struct dentry_operations unionfs_dops;
+extern struct address_space_operations unionfs_aops;
+
+/* How long should an entry be allowed to persist */
+#define RDCACHE_JIFFIES	(5*HZ)
+
+/* file private data. */
+struct unionfs_file_info {
+	int bstart;
+	int bend;
+	atomic_t generation;
+
+	struct unionfs_dir_state *rdstate;
+	struct file **lower_files;
+	int *saved_branch_ids; /* IDs of branches when file was opened */
+};
+
+/* unionfs inode data in memory */
+struct unionfs_inode_info {
+	int bstart;
+	int bend;
+	atomic_t generation;
+	int stale;
+	/* Stuff for readdir over NFS. */
+	spinlock_t rdlock;
+	struct list_head readdircache;
+	int rdcount;
+	int hashsize;
+	int cookie;
+
+	/* The lower inodes */
+	struct inode **lower_inodes;
+
+	struct inode vfs_inode;
+};
+
+/* unionfs dentry data in memory */
+struct unionfs_dentry_info {
+	/*
+	 * The mutex is used to lock the dentry as soon as we get into a
+	 * unionfs function from the VFS.  Our lock ordering is that children
+	 * go before their parents.
+	 */
+	struct mutex lock;
+	int bstart;
+	int bend;
+	int bopaque;
+	int bcount;
+	atomic_t generation;
+	struct path *lower_paths;
+};
+
+/* These are the pointers to our various objects. */
+struct unionfs_data {
+	struct super_block *sb;
+	atomic_t open_files;	/* number of open files on branch */
+	int branchperms;
+	int branch_id;		/* unique branch ID at re/mount time */
+};
+
+/* unionfs super-block data in memory */
+struct unionfs_sb_info {
+	int bend;
+
+	atomic_t generation;
+
+	/*
+	 * This rwsem is used to make sure that a branch management
+	 * operation...
+	 *   1) will not begin before all currently in-flight operations
+	 *      complete.
+	 *   2) any new operations do not execute until the currently
+	 *      running branch management operation completes.
+	 *
+	 * The write_lock_owner records the PID of the task which grabbed
+	 * the rw_sem for writing.  If the same task also tries to grab the
+	 * read lock, we allow it.  This prevents a self-deadlock when
+	 * branch-management is used on a pivot_root'ed union, because we
+	 * have to ->lookup paths which belong to the same union.
+	 */
+	struct rw_semaphore rwsem;
+	pid_t write_lock_owner;	/* PID of rw_sem owner (write lock) */
+	int high_branch_id;	/* last unique branch ID given */
+	struct unionfs_data *data;
+};
+
+/*
+ * structure for making the linked list of entries by readdir on left branch
+ * to compare with entries on right branch
+ */
+struct filldir_node {
+	struct list_head file_list;	/* list for directory entries */
+	char *name;		/* name entry */
+	int hash;		/* name hash */
+	int namelen;		/* name len since name is not 0 terminated */
+
+	/*
+	 * we can check for duplicate whiteouts and files in the same branch
+	 * in order to return -EIO.
+	 */
+	int bindex;
+
+	/* is this a whiteout entry? */
+	int whiteout;
+
+	/* Inline name, so we don't need to separately kmalloc small ones */
+	char iname[DNAME_INLINE_LEN_MIN];
+};
+
+/* Directory hash table. */
+struct unionfs_dir_state {
+	unsigned int cookie;	/* the cookie, based off of rdversion */
+	unsigned int offset;	/* The entry we have returned. */
+	int bindex;
+	loff_t dirpos;		/* offset within the lower level directory */
+	int size;		/* How big is the hash table? */
+	int hashentries;	/* How many entries have been inserted? */
+	unsigned long access;
+
+	/* This cache list is used when the inode keeps us around. */
+	struct list_head cache;
+	struct list_head list[0];
+};
+
+/* externs needed for fanout.h or sioq.h */
+extern int unionfs_get_nlinks(const struct inode *inode);
+
+/* include miscellaneous macros */
+#include "fanout.h"
+#include "sioq.h"
+
+/* externs for cache creation/deletion routines */
+extern void unionfs_destroy_filldir_cache(void);
+extern int unionfs_init_filldir_cache(void);
+extern int unionfs_init_inode_cache(void);
+extern void unionfs_destroy_inode_cache(void);
+extern int unionfs_init_dentry_cache(void);
+extern void unionfs_destroy_dentry_cache(void);
+
+/* Initialize and free readdir-specific state. */
+extern int init_rdstate(struct file *file);
+extern struct unionfs_dir_state *alloc_rdstate(struct inode *inode,
+					       int bindex);
+extern struct unionfs_dir_state *find_rdstate(struct inode *inode,
+					      loff_t fpos);
+extern void free_rdstate(struct unionfs_dir_state *state);
+extern int add_filldir_node(struct unionfs_dir_state *rdstate,
+			    const char *name, int namelen, int bindex,
+			    int whiteout);
+extern struct filldir_node *find_filldir_node(struct unionfs_dir_state *rdstate,
+					      const char *name, int namelen,
+					      int is_whiteout);
+
+extern struct dentry **alloc_new_dentries(int objs);
+extern struct unionfs_data *alloc_new_data(int objs);
+
+/* We can only use 32-bits of offset for rdstate --- blech! */
+#define DIREOF (0xfffff)
+#define RDOFFBITS 20		/* This is the number of bits in DIREOF. */
+#define MAXRDCOOKIE (0xfff)
+/* Turn an rdstate into an offset. */
+static inline off_t rdstate2offset(struct unionfs_dir_state *buf)
+{
+	off_t tmp;
+
+	tmp = ((buf->cookie & MAXRDCOOKIE) << RDOFFBITS)
+		| (buf->offset & DIREOF);
+	return tmp;
+}
+
+static inline void unionfs_read_lock(struct super_block *sb)
+{
+	if (UNIONFS_SB(sb)->write_lock_owner &&
+	    UNIONFS_SB(sb)->write_lock_owner == current->pid)
+		return;
+	down_read(&UNIONFS_SB(sb)->rwsem);
+}
+static inline void unionfs_read_unlock(struct super_block *sb)
+{
+	if (UNIONFS_SB(sb)->write_lock_owner &&
+	    UNIONFS_SB(sb)->write_lock_owner == current->pid)
+		return;
+	up_read(&UNIONFS_SB(sb)->rwsem);
+}
+static inline void unionfs_write_lock(struct super_block *sb)
+{
+	down_write(&UNIONFS_SB(sb)->rwsem);
+	UNIONFS_SB(sb)->write_lock_owner = current->pid;
+}
+static inline void unionfs_write_unlock(struct super_block *sb)
+{
+	up_write(&UNIONFS_SB(sb)->rwsem);
+	UNIONFS_SB(sb)->write_lock_owner = 0;
+}
+
+static inline void unionfs_double_lock_dentry(struct dentry *d1,
+					      struct dentry *d2)
+{
+	if (d2 < d1) {
+		struct dentry *tmp = d1;
+		d1 = d2;
+		d2 = tmp;
+	}
+	unionfs_lock_dentry(d1);
+	unionfs_lock_dentry(d2);
+}
+
+extern int new_dentry_private_data(struct dentry *dentry);
+extern void free_dentry_private_data(struct dentry *dentry);
+extern void update_bstart(struct dentry *dentry);
+extern int init_lower_nd(struct nameidata *nd, unsigned int flags);
+extern void release_lower_nd(struct nameidata *nd, int err);
+
+/*
+ * EXTERNALS:
+ */
+
+/* replicates the directory structure up to given dentry in given branch */
+extern struct dentry *create_parents(struct inode *dir, struct dentry *dentry,
+				     const char *name, int bindex);
+extern int make_dir_opaque(struct dentry *dir, int bindex);
+
+/* partial lookup */
+extern int unionfs_partial_lookup(struct dentry *dentry);
+
+/*
+ * Pass a unionfs dentry and an index and it will try to create a whiteout
+ * in branch 'index'.
+ *
+ * On error, it will proceed to a branch to the left.
+ */
+extern int create_whiteout(struct dentry *dentry, int start);
+/* copies a file from dbstart to newbindex branch */
+extern int copyup_file(struct inode *dir, struct file *file, int bstart,
+		       int newbindex, loff_t size);
+extern int copyup_named_file(struct inode *dir, struct file *file,
+			     char *name, int bstart, int new_bindex,
+			     loff_t len);
+/* copies a dentry from dbstart to newbindex branch */
+extern int copyup_dentry(struct inode *dir, struct dentry *dentry,
+			 int bstart, int new_bindex, const char *name,
+			 int namelen, struct file **copyup_file, loff_t len);
+/* helper functions for post-copyup actions */
+extern void unionfs_postcopyup_setmnt(struct dentry *dentry);
+extern void unionfs_postcopyup_release(struct dentry *dentry);
+
+extern int remove_whiteouts(struct dentry *dentry,
+			    struct dentry *lower_dentry, int bindex);
+
+extern int do_delete_whiteouts(struct dentry *dentry, int bindex,
+			       struct unionfs_dir_state *namelist);
+
+/* Is this directory empty: 0 if it is empty, -ENOTEMPTY if not. */
+extern int check_empty(struct dentry *dentry,
+		       struct unionfs_dir_state **namelist);
+/* Delete whiteouts from this directory in branch bindex. */
+extern int delete_whiteouts(struct dentry *dentry, int bindex,
+			    struct unionfs_dir_state *namelist);
+
+/* Re-lookup a lower dentry. */
+extern int unionfs_refresh_lower_dentry(struct dentry *dentry, int bindex);
+
+extern void unionfs_reinterpose(struct dentry *this_dentry);
+extern struct super_block *unionfs_duplicate_super(struct super_block *sb);
+
+/* Locking functions. */
+extern int unionfs_setlk(struct file *file, int cmd, struct file_lock *fl);
+extern int unionfs_getlk(struct file *file, struct file_lock *fl);
+
+/* Common file operations. */
+extern int unionfs_file_revalidate(struct file *file, bool willwrite);
+extern int unionfs_open(struct inode *inode, struct file *file);
+extern int unionfs_file_release(struct inode *inode, struct file *file);
+extern int unionfs_flush(struct file *file, fl_owner_t id);
+extern long unionfs_ioctl(struct file *file, unsigned int cmd,
+			  unsigned long arg);
+extern int unionfs_fsync(struct file *file, struct dentry *dentry,
+			 int datasync);
+extern int unionfs_fasync(int fd, struct file *file, int flag);
+
+/* Inode operations */
+extern int unionfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+			  struct inode *new_dir, struct dentry *new_dentry);
+extern int unionfs_unlink(struct inode *dir, struct dentry *dentry);
+extern int unionfs_rmdir(struct inode *dir, struct dentry *dentry);
+
+extern bool __unionfs_d_revalidate_chain(struct dentry *dentry,
+					 struct nameidata *nd, bool willwrite);
+extern bool is_newer_lower(const struct dentry *dentry);
+
+/* The values for unionfs_interpose's flag. */
+#define INTERPOSE_DEFAULT	0
+#define INTERPOSE_LOOKUP	1
+#define INTERPOSE_REVAL		2
+#define INTERPOSE_REVAL_NEG	3
+#define INTERPOSE_PARTIAL	4
+
+extern struct dentry *unionfs_interpose(struct dentry *this_dentry,
+					struct super_block *sb, int flag);
+
+#ifdef CONFIG_UNION_FS_XATTR
+/* Extended attribute functions. */
+extern void *unionfs_xattr_alloc(size_t size, size_t limit);
+static inline void unionfs_xattr_kfree(const void *p)
+{
+	kfree(p);
+}
+extern ssize_t unionfs_getxattr(struct dentry *dentry, const char *name,
+				void *value, size_t size);
+extern int unionfs_removexattr(struct dentry *dentry, const char *name);
+extern ssize_t unionfs_listxattr(struct dentry *dentry, char *list,
+				 size_t size);
+extern int unionfs_setxattr(struct dentry *dentry, const char *name,
+			    const void *value, size_t size, int flags);
+#endif /* CONFIG_UNION_FS_XATTR */
+
+/* The root directory is unhashed, but isn't deleted. */
+static inline int d_deleted(struct dentry *d)
+{
+	return d_unhashed(d) && (d != d->d_sb->s_root);
+}
+
+struct dentry *unionfs_lookup_backend(struct dentry *dentry,
+				      struct nameidata *nd, int lookupmode);
+
+/* unionfs_permission, check if we should bypass error to facilitate copyup */
+#define IS_COPYUP_ERR(err) ((err) == -EROFS)
+
+/* unionfs_open, check if we need to copyup the file */
+#define OPEN_WRITE_FLAGS (O_WRONLY | O_RDWR | O_APPEND)
+#define IS_WRITE_FLAG(flag) ((flag) & OPEN_WRITE_FLAGS)
+
+static inline int branchperms(const struct super_block *sb, int index)
+{
+	BUG_ON(index < 0);
+	return UNIONFS_SB(sb)->data[index].branchperms;
+}
+
+static inline int set_branchperms(struct super_block *sb, int index, int perms)
+{
+	BUG_ON(index < 0);
+	UNIONFS_SB(sb)->data[index].branchperms = perms;
+	return perms;
+}
+
+/* Is this file on a read-only branch? */
+static inline int is_robranch_super(const struct super_block *sb, int index)
+{
+	int ret;
+
+	ret = (!(branchperms(sb, index) & MAY_WRITE)) ? -EROFS : 0;
+	return ret;
+}
+
+/* Is this file on a read-only branch? */
+static inline int is_robranch_idx(const struct dentry *dentry, int index)
+{
+	struct super_block *lower_sb;
+
+	BUG_ON(index < 0);
+
+	if (!(branchperms(dentry->d_sb, index) & MAY_WRITE))
+		return -EROFS;
+
+	lower_sb = unionfs_lower_super_idx(dentry->d_sb, index);
+	BUG_ON(lower_sb == NULL);
+	/*
+	 * test sb flags directly, not IS_RDONLY(lower_inode) because the
+	 * lower_dentry could be a negative.
+	 */
+	if (lower_sb->s_flags & MS_RDONLY)
+		return -EROFS;
+
+	return 0;
+}
+
+static inline int is_robranch(const struct dentry *dentry)
+{
+	int index;
+
+	index = UNIONFS_D(dentry)->bstart;
+	BUG_ON(index < 0);
+
+	return is_robranch_idx(dentry, index);
+}
+
+/* What do we use for whiteouts. */
+#define UNIONFS_WHPFX ".wh."
+#define UNIONFS_WHLEN 4
+/*
+ * If a directory contains this file, then it is opaque.  We start with the
+ * .wh. flag so that it is blocked by lookup.
+ */
+#define UNIONFS_DIR_OPAQUE_NAME "__dir_opaque"
+#define UNIONFS_DIR_OPAQUE UNIONFS_WHPFX UNIONFS_DIR_OPAQUE_NAME
+
+/*
+ * EXTERNALS:
+ */
+extern char *alloc_whname(const char *name, int len);
+extern int check_branch(struct nameidata *nd);
+extern int parse_branch_mode(const char *name, int *perms);
+
+/*
+ * These two functions are here because it is kind of daft to copy and paste
+ * the contents of the two functions to 32+ places in unionfs
+ */
+static inline struct dentry *lock_parent(struct dentry *dentry)
+{
+	struct dentry *dir = dget(dentry->d_parent);
+
+	mutex_lock(&dir->d_inode->i_mutex);
+	return dir;
+}
+
+static inline void unlock_dir(struct dentry *dir)
+{
+	mutex_unlock(&dir->d_inode->i_mutex);
+	dput(dir);
+}
+
+static inline struct vfsmount *unionfs_mntget(struct dentry *dentry,
+					      int bindex)
+{
+	struct vfsmount *mnt;
+
+	BUG_ON(!dentry || bindex < 0);
+
+	mnt = mntget(unionfs_lower_mnt_idx(dentry, bindex));
+#ifdef CONFIG_UNION_FS_DEBUG
+	if (!mnt)
+		pr_debug("unionfs: mntget: mnt=%p bindex=%d\n",
+			 mnt, bindex);
+#endif /* CONFIG_UNION_FS_DEBUG */
+
+	return mnt;
+}
+
+static inline void unionfs_mntput(struct dentry *dentry, int bindex)
+{
+	struct vfsmount *mnt;
+
+	if (!dentry && bindex < 0)
+		return;
+	BUG_ON(!dentry || bindex < 0);
+
+	mnt = unionfs_lower_mnt_idx(dentry, bindex);
+#ifdef CONFIG_UNION_FS_DEBUG
+	/*
+	 * Directories can have NULL lower objects in between start/end, but
+	 * NOT if at the start/end range.  We cannot verify that this dentry
+	 * is a type=DIR, because it may already be a negative dentry.  But
+	 * if dbstart is greater than dbend, we know that this couldn't have
+	 * been a regular file: it had to have been a directory.
+	 */
+	if (!mnt && !(bindex > dbstart(dentry) && bindex < dbend(dentry)))
+		pr_debug("unionfs: mntput: mnt=%p bindex=%d\n", mnt, bindex);
+#endif /* CONFIG_UNION_FS_DEBUG */
+	mntput(mnt);
+}
+
+#ifdef CONFIG_UNION_FS_DEBUG
+
+/* useful for tracking code reachability */
+#define UDBG pr_debug("DBG:%s:%s:%d\n", __FILE__, __FUNCTION__, __LINE__)
+
+#define unionfs_check_inode(i)	__unionfs_check_inode((i),	\
+	__FILE__, __FUNCTION__, __LINE__)
+#define unionfs_check_dentry(d)	__unionfs_check_dentry((d),	\
+	__FILE__, __FUNCTION__, __LINE__)
+#define unionfs_check_file(f)	__unionfs_check_file((f),	\
+	__FILE__, __FUNCTION__, __LINE__)
+#define unionfs_check_nd(n)	__unionfs_check_nd((n),		\
+	__FILE__, __FUNCTION__, __LINE__)
+#define show_branch_counts(sb)	__show_branch_counts((sb),	\
+	__FILE__, __FUNCTION__, __LINE__)
+#define show_inode_times(i)	__show_inode_times((i),		\
+	__FILE__, __FUNCTION__, __LINE__)
+#define show_dinode_times(d)	__show_dinode_times((d),	\
+	__FILE__, __FUNCTION__, __LINE__)
+#define show_inode_counts(i)	__show_inode_counts((i),	\
+	__FILE__, __FUNCTION__, __LINE__)
+
+extern void __unionfs_check_inode(const struct inode *inode, const char *fname,
+				  const char *fxn, int line);
+extern void __unionfs_check_dentry(const struct dentry *dentry,
+				   const char *fname, const char *fxn,
+				   int line);
+extern void __unionfs_check_file(const struct file *file,
+				 const char *fname, const char *fxn, int line);
+extern void __unionfs_check_nd(const struct nameidata *nd,
+			       const char *fname, const char *fxn, int line);
+extern void __show_branch_counts(const struct super_block *sb,
+				 const char *file, const char *fxn, int line);
+extern void __show_inode_times(const struct inode *inode,
+			       const char *file, const char *fxn, int line);
+extern void __show_dinode_times(const struct dentry *dentry,
+				const char *file, const char *fxn, int line);
+extern void __show_inode_counts(const struct inode *inode,
+				const char *file, const char *fxn, int line);
+
+#else /* not CONFIG_UNION_FS_DEBUG */
+
+/* we leave useful hooks for these check functions throughout the code */
+#define unionfs_check_inode(i)		do { } while (0)
+#define unionfs_check_dentry(d)		do { } while (0)
+#define unionfs_check_file(f)		do { } while (0)
+#define unionfs_check_nd(n)		do { } while (0)
+#define show_branch_counts(sb)		do { } while (0)
+#define show_inode_times(i)		do { } while (0)
+#define show_dinode_times(d)		do { } while (0)
+#define show_inode_counts(i)		do { } while (0)
+
+#endif /* not CONFIG_UNION_FS_DEBUG */
+
+#endif	/* not _UNION_H_ */
-- 
1.5.2.2


^ permalink raw reply related	[flat|nested] 54+ messages in thread
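
The whiteout constants defined in union.h above (UNIONFS_WHPFX, UNIONFS_WHLEN, UNIONFS_DIR_OPAQUE) can be illustrated with a small userspace sketch. This is not the patch's alloc_whname(), whose body is not part of this message; make_whname() and is_whiteout() are hypothetical names chosen for illustration, and malloc() stands in for whatever allocator the kernel code uses.

```c
#include <stdlib.h>
#include <string.h>

/* Constants copied from union.h in the patch above. */
#define UNIONFS_WHPFX ".wh."
#define UNIONFS_WHLEN 4
#define UNIONFS_DIR_OPAQUE_NAME "__dir_opaque"
#define UNIONFS_DIR_OPAQUE UNIONFS_WHPFX UNIONFS_DIR_OPAQUE_NAME

/*
 * Hypothetical helper (assumption, not from the patch): build the
 * whiteout name ".wh.<name>" in a freshly malloc'ed buffer.
 * Returns NULL if the allocation fails; the caller frees the result.
 */
static char *make_whname(const char *name, size_t len)
{
	char *buf = malloc(UNIONFS_WHLEN + len + 1);

	if (!buf)
		return NULL;
	memcpy(buf, UNIONFS_WHPFX, UNIONFS_WHLEN);
	memcpy(buf + UNIONFS_WHLEN, name, len);
	buf[UNIONFS_WHLEN + len] = '\0';
	return buf;
}

/* Hypothetical helper: does a lower-level name carry the whiteout prefix? */
static int is_whiteout(const char *name)
{
	return strncmp(name, UNIONFS_WHPFX, UNIONFS_WHLEN) == 0;
}
```

Because UNIONFS_DIR_OPAQUE is built from the same prefix, the opaque-directory marker ".wh.__dir_opaque" is itself recognized as a whiteout name, which is what the header comment means by "blocked by lookup".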

* [PATCH 12/42] Unionfs: common file copyup/revalidation operations
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (10 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 11/42] Unionfs: main header file Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 13/42] Unionfs: basic file operations Erez Zadok
                   ` (30 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Includes open, ioctl, and flush operations.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/commonfops.c |  827 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 827 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/commonfops.c

diff --git a/fs/unionfs/commonfops.c b/fs/unionfs/commonfops.c
new file mode 100644
index 0000000..f714e2f
--- /dev/null
+++ b/fs/unionfs/commonfops.c
@@ -0,0 +1,827 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+/*
+ * 1) Copyup the file
+ * 2) Rename the file to '.unionfs<original inode#><counter>' - obviously
+ * stolen from NFS's silly rename
+ */
+static int copyup_deleted_file(struct file *file, struct dentry *dentry,
+			       int bstart, int bindex)
+{
+	static unsigned int counter;
+	const int i_inosize = sizeof(dentry->d_inode->i_ino) * 2;
+	const int countersize = sizeof(counter) * 2;
+	const int nlen = sizeof(".unionfs") + i_inosize + countersize - 1;
+	char name[nlen + 1];
+	int err;
+	struct dentry *tmp_dentry = NULL;
+	struct dentry *lower_dentry;
+	struct dentry *lower_dir_dentry = NULL;
+
+	lower_dentry = unionfs_lower_dentry_idx(dentry, bstart);
+
+	sprintf(name, ".unionfs%*.*lx",
+		i_inosize, i_inosize, lower_dentry->d_inode->i_ino);
+
+	/*
+	 * Loop, looking for an unused temp name to copyup to.
+	 *
+	 * It's somewhat silly that we look for a free temp name in the
+	 * source branch (bstart) instead of the dest branch (bindex), where
+	 * the final name will be created.  We _will_ catch it if somehow
+	 * the name exists in the dest branch, but it'd be nice to catch it
+	 * sooner rather than later.
+	 */
+retry:
+	tmp_dentry = NULL;
+	do {
+		char *suffix = name + nlen - countersize;
+
+		dput(tmp_dentry);
+		counter++;
+		sprintf(suffix, "%*.*x", countersize, countersize, counter);
+
+		pr_debug("unionfs: trying to rename %s to %s\n",
+			 dentry->d_name.name, name);
+
+		tmp_dentry = lookup_one_len(name, lower_dentry->d_parent,
+					    nlen);
+		if (IS_ERR(tmp_dentry)) {
+			err = PTR_ERR(tmp_dentry);
+			goto out;
+		}
+	} while (tmp_dentry->d_inode != NULL);	/* need negative dentry */
+	dput(tmp_dentry);
+
+	err = copyup_named_file(dentry->d_parent->d_inode, file, name, bstart,
+				bindex,
+				i_size_read(file->f_path.dentry->d_inode));
+	if (err) {
+		if (unlikely(err == -EEXIST))
+			goto retry;
+		goto out;
+	}
+
+	/* bring it to the same state as an unlinked file */
+	lower_dentry = unionfs_lower_dentry_idx(dentry, dbstart(dentry));
+	if (!unionfs_lower_inode_idx(dentry->d_inode, bindex)) {
+		atomic_inc(&lower_dentry->d_inode->i_count);
+		unionfs_set_lower_inode_idx(dentry->d_inode, bindex,
+					    lower_dentry->d_inode);
+	}
+	lower_dir_dentry = lock_parent(lower_dentry);
+	err = vfs_unlink(lower_dir_dentry->d_inode, lower_dentry);
+	unlock_dir(lower_dir_dentry);
+
+out:
+	if (!err)
+		unionfs_check_dentry(dentry);
+	return err;
+}
+
+/*
+ * put all references held by upper struct file and free lower file pointer
+ * array
+ */
+static void cleanup_file(struct file *file)
+{
+	int bindex, bstart, bend;
+	struct file **lower_files;
+	struct file *lower_file;
+	struct super_block *sb = file->f_path.dentry->d_sb;
+
+	lower_files = UNIONFS_F(file)->lower_files;
+	bstart = fbstart(file);
+	bend = fbend(file);
+
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		int i;	/* holds (possibly) updated branch index */
+		int old_bid;
+
+		lower_file = unionfs_lower_file_idx(file, bindex);
+		if (!lower_file)
+			continue;
+
+		/*
+		 * Find new index of matching branch with an open
+		 * file, since branches could have been added or
+		 * deleted causing the one with open files to shift.
+		 */
+		old_bid = UNIONFS_F(file)->saved_branch_ids[bindex];
+		i = branch_id_to_idx(sb, old_bid);
+		if (unlikely(i < 0)) {
+			printk(KERN_ERR "unionfs: no superblock for "
+			       "file %p\n", file);
+			continue;
+		}
+
+		/* decrement count of open files */
+		branchput(sb, i);
+		/*
+		 * fput will perform an mntput for us on the correct branch.
+		 * Although we're using the file's old branch configuration,
+		 * bindex, which is the old index, correctly points to the
+		 * right branch in the file's branch list.  In other words,
+		 * we're going to mntput the correct branch even if branches
+		 * have been added/removed.
+		 */
+		fput(lower_file);
+		UNIONFS_F(file)->lower_files[bindex] = NULL;
+		UNIONFS_F(file)->saved_branch_ids[bindex] = -1;
+	}
+
+	UNIONFS_F(file)->lower_files = NULL;
+	kfree(lower_files);
+	kfree(UNIONFS_F(file)->saved_branch_ids);
+	/* set to NULL because caller needs to know whether to kfree on error */
+	UNIONFS_F(file)->saved_branch_ids = NULL;
+}
+
+/* open all lower files for a given file */
+static int open_all_files(struct file *file)
+{
+	int bindex, bstart, bend, err = 0;
+	struct file *lower_file;
+	struct dentry *lower_dentry;
+	struct dentry *dentry = file->f_path.dentry;
+	struct super_block *sb = dentry->d_sb;
+
+	bstart = dbstart(dentry);
+	bend = dbend(dentry);
+
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+		if (!lower_dentry)
+			continue;
+
+		dget(lower_dentry);
+		unionfs_mntget(dentry, bindex);
+		branchget(sb, bindex);
+
+		lower_file =
+			dentry_open(lower_dentry,
+				    unionfs_lower_mnt_idx(dentry, bindex),
+				    file->f_flags);
+		if (IS_ERR(lower_file)) {
+			branchput(sb, bindex);	/* undo the branchget above */
+			err = PTR_ERR(lower_file);
+			goto out;
+		}
+		unionfs_set_lower_file_idx(file, bindex, lower_file);
+	}
+out:
+	return err;
+}
+
+/* open the highest priority file for a given upper file */
+static int open_highest_file(struct file *file, bool willwrite)
+{
+	int bindex, bstart, bend, err = 0;
+	struct file *lower_file;
+	struct dentry *lower_dentry;
+	struct dentry *dentry = file->f_path.dentry;
+	struct inode *parent_inode = dentry->d_parent->d_inode;
+	struct super_block *sb = dentry->d_sb;
+
+	bstart = dbstart(dentry);
+	bend = dbend(dentry);
+
+	lower_dentry = unionfs_lower_dentry(dentry);
+	if (willwrite && IS_WRITE_FLAG(file->f_flags) && is_robranch(dentry)) {
+		for (bindex = bstart - 1; bindex >= 0; bindex--) {
+			err = copyup_file(parent_inode, file, bstart, bindex,
+					  i_size_read(dentry->d_inode));
+			if (!err)
+				break;
+		}
+		atomic_set(&UNIONFS_F(file)->generation,
+			   atomic_read(&UNIONFS_I(dentry->d_inode)->
+				       generation));
+		goto out;
+	}
+
+	dget(lower_dentry);
+	unionfs_mntget(dentry, bstart);
+	lower_file = dentry_open(lower_dentry,
+				 unionfs_lower_mnt_idx(dentry, bstart),
+				 file->f_flags);
+	if (IS_ERR(lower_file)) {
+		err = PTR_ERR(lower_file);
+		goto out;
+	}
+	branchget(sb, bstart);
+	unionfs_set_lower_file(file, lower_file);
+	/* Fix up the position. */
+	lower_file->f_pos = file->f_pos;
+
+	memcpy(&lower_file->f_ra, &file->f_ra, sizeof(struct file_ra_state));
+out:
+	return err;
+}
+
+/* perform a delayed copyup of a read-write file on a read-only branch */
+static int do_delayed_copyup(struct file *file)
+{
+	int bindex, bstart, bend, err = 0;
+	struct dentry *dentry = file->f_path.dentry;
+	struct inode *parent_inode = dentry->d_parent->d_inode;
+
+	bstart = fbstart(file);
+	bend = fbend(file);
+
+	BUG_ON(!S_ISREG(dentry->d_inode->i_mode));
+
+	unionfs_check_file(file);
+	unionfs_check_dentry(dentry);
+	for (bindex = bstart - 1; bindex >= 0; bindex--) {
+		if (!d_deleted(dentry))
+			err = copyup_file(parent_inode, file, bstart,
+					  bindex,
+					  i_size_read(dentry->d_inode));
+		else
+			err = copyup_deleted_file(file, dentry, bstart,
+						  bindex);
+
+		if (!err)
+			break;
+	}
+	if (err || (bstart <= fbstart(file)))
+		goto out;
+	bend = fbend(file);
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		if (unionfs_lower_file_idx(file, bindex)) {
+			branchput(dentry->d_sb, bindex);
+			fput(unionfs_lower_file_idx(file, bindex));
+			unionfs_set_lower_file_idx(file, bindex, NULL);
+		}
+		if (unionfs_lower_mnt_idx(dentry, bindex)) {
+			unionfs_mntput(dentry, bindex);
+			unionfs_set_lower_mnt_idx(dentry, bindex, NULL);
+		}
+		if (unionfs_lower_dentry_idx(dentry, bindex)) {
+			BUG_ON(!dentry->d_inode);
+			iput(unionfs_lower_inode_idx(dentry->d_inode, bindex));
+			unionfs_set_lower_inode_idx(dentry->d_inode, bindex,
+						    NULL);
+			dput(unionfs_lower_dentry_idx(dentry, bindex));
+			unionfs_set_lower_dentry_idx(dentry, bindex, NULL);
+		}
+	}
+	/* for reg file, we only open it "once" */
+	fbend(file) = fbstart(file);
+	set_dbend(dentry, dbstart(dentry));
+	ibend(dentry->d_inode) = ibstart(dentry->d_inode);
+
+out:
+	unionfs_check_file(file);
+	unionfs_check_dentry(dentry);
+	return err;
+}
+
+/*
+ * Revalidate the struct file
+ * @file: file to revalidate
+ * @willwrite: true if caller may cause changes to the file; false otherwise.
+ */
+int unionfs_file_revalidate(struct file *file, bool willwrite)
+{
+	struct super_block *sb;
+	struct dentry *dentry;
+	int sbgen, fgen, dgen;
+	int bstart, bend;
+	int size;
+	int err = 0;
+
+	dentry = file->f_path.dentry;
+	unionfs_lock_dentry(dentry);
+	sb = dentry->d_sb;
+
+	/*
+	 * First revalidate the dentry inside struct file,
+	 * but not unhashed dentries.
+	 */
+	if (unlikely(!d_deleted(dentry) &&
+		     !__unionfs_d_revalidate_chain(dentry, NULL, willwrite))) {
+		err = -ESTALE;
+		goto out_nofree;
+	}
+
+	sbgen = atomic_read(&UNIONFS_SB(sb)->generation);
+	dgen = atomic_read(&UNIONFS_D(dentry)->generation);
+	fgen = atomic_read(&UNIONFS_F(file)->generation);
+
+	BUG_ON(sbgen > dgen);
+
+	/*
+	 * There are two cases we are interested in.  The first is if the
+	 * file's generation is lower than the superblock's.  The second
+	 * is if someone has copied up this file from underneath us, in
+	 * which case the file's start branch no longer matches the
+	 * dentry's.  In both cases we need to refresh things.
+	if (unlikely(!d_deleted(dentry) &&
+		     (sbgen > fgen || dbstart(dentry) != fbstart(file)))) {
+		/* save orig branch ID */
+		int orig_brid =
+			UNIONFS_F(file)->saved_branch_ids[fbstart(file)];
+
+		/* First we throw out the existing files. */
+		cleanup_file(file);
+
+		/* Now we reopen the file(s) as in unionfs_open. */
+		bstart = fbstart(file) = dbstart(dentry);
+		bend = fbend(file) = dbend(dentry);
+
+		size = sizeof(struct file *) * sbmax(sb);
+		UNIONFS_F(file)->lower_files = kzalloc(size, GFP_KERNEL);
+		if (unlikely(!UNIONFS_F(file)->lower_files)) {
+			err = -ENOMEM;
+			goto out;
+		}
+		size = sizeof(int) * sbmax(sb);
+		UNIONFS_F(file)->saved_branch_ids = kzalloc(size, GFP_KERNEL);
+		if (unlikely(!UNIONFS_F(file)->saved_branch_ids)) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		if (S_ISDIR(dentry->d_inode->i_mode)) {
+			/* We need to open all the files. */
+			err = open_all_files(file);
+			if (err)
+				goto out;
+		} else {
+			int new_brid;
+			/* We only open the highest priority branch. */
+			err = open_highest_file(file, willwrite);
+			if (err)
+				goto out;
+			new_brid = UNIONFS_F(file)->
+			  saved_branch_ids[fbstart(file)];
+			if (unlikely(new_brid != orig_brid && sbgen > fgen)) {
+				/*
+				 * If we re-opened the file on a different
+				 * branch than the original one, and this
+				 * was due to a new branch inserted, then
+				 * update the mnt counts of the old and new
+				 * branches accordingly.
+				 */
+				unionfs_mntget(dentry, bstart);
+				unionfs_mntput(sb->s_root,
+					       branch_id_to_idx(sb, orig_brid));
+			}
+		}
+		atomic_set(&UNIONFS_F(file)->generation,
+			   atomic_read(
+				   &UNIONFS_I(dentry->d_inode)->generation));
+	}
+
+	/* Copyup on the first write to a file on a readonly branch. */
+	if (willwrite && IS_WRITE_FLAG(file->f_flags) &&
+	    !IS_WRITE_FLAG(unionfs_lower_file(file)->f_flags) &&
+	    is_robranch(dentry)) {
+		pr_debug("unionfs: do delay copyup of \"%s\"\n",
+			 dentry->d_name.name);
+		err = do_delayed_copyup(file);
+	}
+
+out:
+	if (err) {
+		kfree(UNIONFS_F(file)->lower_files);
+		kfree(UNIONFS_F(file)->saved_branch_ids);
+	}
+out_nofree:
+	if (!err)
+		unionfs_check_file(file);
+	unionfs_unlock_dentry(dentry);
+	return err;
+}
+
+/* unionfs_open helper function: open a directory */
+static int __open_dir(struct inode *inode, struct file *file)
+{
+	struct dentry *lower_dentry;
+	struct file *lower_file;
+	int bindex, bstart, bend;
+	struct vfsmount *mnt;
+
+	bstart = fbstart(file) = dbstart(file->f_path.dentry);
+	bend = fbend(file) = dbend(file->f_path.dentry);
+
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_dentry =
+			unionfs_lower_dentry_idx(file->f_path.dentry, bindex);
+		if (!lower_dentry)
+			continue;
+
+		dget(lower_dentry);
+		unionfs_mntget(file->f_path.dentry, bindex);
+		mnt = unionfs_lower_mnt_idx(file->f_path.dentry, bindex);
+		lower_file = dentry_open(lower_dentry, mnt, file->f_flags);
+		if (IS_ERR(lower_file))
+			return PTR_ERR(lower_file);
+
+		unionfs_set_lower_file_idx(file, bindex, lower_file);
+
+		/*
+		 * The branchget goes after the open, because otherwise
+		 * we would miss the reference on release.
+		 */
+		branchget(inode->i_sb, bindex);
+	}
+
+	return 0;
+}
+
+/* unionfs_open helper function: open a file */
+static int __open_file(struct inode *inode, struct file *file)
+{
+	struct dentry *lower_dentry;
+	struct file *lower_file;
+	int lower_flags;
+	int bindex, bstart, bend;
+
+	lower_dentry = unionfs_lower_dentry(file->f_path.dentry);
+	lower_flags = file->f_flags;
+
+	bstart = fbstart(file) = dbstart(file->f_path.dentry);
+	bend = fbend(file) = dbend(file->f_path.dentry);
+
+	/*
+	 * check for the permission for lower file.  If the error is
+	 * COPYUP_ERR, copyup the file.
+	 */
+	if (lower_dentry->d_inode && is_robranch(file->f_path.dentry)) {
+		/*
+		 * if the open will change the file, copy it up; otherwise
+		 * defer it.
+		 */
+		if (lower_flags & O_TRUNC) {
+			int size = 0;
+			int err = -EROFS;
+
+			/* copyup the file */
+			for (bindex = bstart - 1; bindex >= 0; bindex--) {
+				err = copyup_file(
+					file->f_path.dentry->d_parent->d_inode,
+					file, bstart, bindex, size);
+				if (!err)
+					break;
+			}
+			return err;
+		} else {
+			lower_flags &= ~(OPEN_WRITE_FLAGS);
+		}
+	}
+
+	dget(lower_dentry);
+
+	/*
+	 * dentry_open will decrement mnt refcnt if err.
+	 * otherwise fput() will do an mntput() for us upon file close.
+	 */
+	unionfs_mntget(file->f_path.dentry, bstart);
+	lower_file =
+		dentry_open(lower_dentry,
+			    unionfs_lower_mnt_idx(file->f_path.dentry, bstart),
+			    lower_flags);
+	if (IS_ERR(lower_file))
+		return PTR_ERR(lower_file);
+
+	unionfs_set_lower_file(file, lower_file);
+	branchget(inode->i_sb, bstart);
+
+	return 0;
+}
+
+int unionfs_open(struct inode *inode, struct file *file)
+{
+	int err = 0;
+	struct file *lower_file = NULL;
+	struct dentry *dentry = NULL;
+	int bindex = 0, bstart = 0, bend = 0;
+	int size;
+
+	unionfs_read_lock(inode->i_sb);
+
+	file->private_data =
+		kzalloc(sizeof(struct unionfs_file_info), GFP_KERNEL);
+	if (unlikely(!UNIONFS_F(file))) {
+		err = -ENOMEM;
+		goto out_nofree;
+	}
+	fbstart(file) = -1;
+	fbend(file) = -1;
+	atomic_set(&UNIONFS_F(file)->generation,
+		   atomic_read(&UNIONFS_I(inode)->generation));
+
+	size = sizeof(struct file *) * sbmax(inode->i_sb);
+	UNIONFS_F(file)->lower_files = kzalloc(size, GFP_KERNEL);
+	if (unlikely(!UNIONFS_F(file)->lower_files)) {
+		err = -ENOMEM;
+		goto out;
+	}
+	size = sizeof(int) * sbmax(inode->i_sb);
+	UNIONFS_F(file)->saved_branch_ids = kzalloc(size, GFP_KERNEL);
+	if (unlikely(!UNIONFS_F(file)->saved_branch_ids)) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	dentry = file->f_path.dentry;
+	unionfs_lock_dentry(dentry);
+
+	bstart = fbstart(file) = dbstart(dentry);
+	bend = fbend(file) = dbend(dentry);
+
+	/*
+	 * open all directories and make the unionfs file struct point to
+	 * these lower file structs
+	 */
+	if (S_ISDIR(inode->i_mode))
+		err = __open_dir(inode, file);	/* open a dir */
+	else
+		err = __open_file(inode, file);	/* open a file */
+
+	/* free the allocated resources, and fput the opened files */
+	if (err) {
+		for (bindex = bstart; bindex <= bend; bindex++) {
+			lower_file = unionfs_lower_file_idx(file, bindex);
+			if (!lower_file)
+				continue;
+
+			branchput(file->f_path.dentry->d_sb, bindex);
+			/* fput calls dput for lower_dentry */
+			fput(lower_file);
+		}
+	}
+
+	/* XXX: should this unlock be moved to the function bottom? */
+	unionfs_unlock_dentry(dentry);
+
+out:
+	if (err) {
+		kfree(UNIONFS_F(file)->lower_files);
+		kfree(UNIONFS_F(file)->saved_branch_ids);
+		kfree(UNIONFS_F(file));
+	}
+out_nofree:
+	unionfs_check_inode(inode);
+	if (!err) {
+		unionfs_check_file(file);
+		unionfs_check_dentry(file->f_path.dentry->d_parent);
+	}
+	unionfs_read_unlock(inode->i_sb);
+	return err;
+}
+
+/*
+ * release all lower object references & free the file info structure
+ *
+ * No need to grab sb info's rwsem.
+ */
+int unionfs_file_release(struct inode *inode, struct file *file)
+{
+	struct file *lower_file = NULL;
+	struct unionfs_file_info *fileinfo;
+	struct unionfs_inode_info *inodeinfo;
+	struct super_block *sb = inode->i_sb;
+	struct dentry *dentry = file->f_path.dentry;
+	int bindex, bstart, bend;
+	int fgen, err = 0;
+
+	unionfs_read_lock(sb);
+	/*
+	 * Yes, we have to revalidate this file even if it's being released.
+	 * This is important for open-but-unlinked files, as well as mmap
+	 * support.
+	 */
+	err = unionfs_file_revalidate(file, true);
+	if (unlikely(err))
+		goto out;
+	unionfs_check_file(file);
+	fileinfo = UNIONFS_F(file);
+	BUG_ON(file->f_path.dentry->d_inode != inode);
+	inodeinfo = UNIONFS_I(inode);
+
+	/* fput all the lower files */
+	fgen = atomic_read(&fileinfo->generation);
+	bstart = fbstart(file);
+	bend = fbend(file);
+
+	unionfs_lock_dentry(dentry);
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_file = unionfs_lower_file_idx(file, bindex);
+
+		if (lower_file) {
+			fput(lower_file);
+			branchput(sb, bindex);
+		}
+
+		/* if there are no more refs to the dentry, dput it */
+		if (d_deleted(dentry)) {
+			dput(unionfs_lower_dentry_idx(dentry, bindex));
+			unionfs_set_lower_dentry_idx(dentry, bindex, NULL);
+		}
+	}
+	unionfs_unlock_dentry(dentry);
+
+	kfree(fileinfo->lower_files);
+	kfree(fileinfo->saved_branch_ids);
+
+	if (fileinfo->rdstate) {
+		fileinfo->rdstate->access = jiffies;
+		spin_lock(&inodeinfo->rdlock);
+		inodeinfo->rdcount++;
+		list_add_tail(&fileinfo->rdstate->cache,
+			      &inodeinfo->readdircache);
+		mark_inode_dirty(inode);
+		spin_unlock(&inodeinfo->rdlock);
+		fileinfo->rdstate = NULL;
+	}
+	kfree(fileinfo);
+
+out:
+	unionfs_read_unlock(sb);
+	return err;
+}
+
+/* pass the ioctl to the lower fs */
+static long do_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct file *lower_file;
+	int err;
+
+	lower_file = unionfs_lower_file(file);
+
+	err = -ENOTTY;
+	if (!lower_file || !lower_file->f_op)
+		goto out;
+	if (lower_file->f_op->unlocked_ioctl) {
+		err = lower_file->f_op->unlocked_ioctl(lower_file, cmd, arg);
+	} else if (lower_file->f_op->ioctl) {
+		lock_kernel();
+		err = lower_file->f_op->ioctl(
+			lower_file->f_path.dentry->d_inode,
+			lower_file, cmd, arg);
+		unlock_kernel();
+	}
+
+out:
+	return err;
+}
+
+/*
+ * return to user-space the branch indices containing the file in question
+ *
+ * We use fd_set, and therefore the number of branches is limited to
+ * FD_SETSIZE, which is currently 1024 - plenty for most people
+ */
+static int unionfs_ioctl_queryfile(struct file *file, unsigned int cmd,
+				   unsigned long arg)
+{
+	int err = 0;
+	fd_set branchlist;
+	int bstart = 0, bend = 0, bindex = 0;
+	int orig_bstart, orig_bend;
+	struct dentry *dentry, *lower_dentry;
+	struct vfsmount *mnt;
+
+	dentry = file->f_path.dentry;
+	unionfs_lock_dentry(dentry);
+	orig_bstart = dbstart(dentry);
+	orig_bend = dbend(dentry);
+	err = unionfs_partial_lookup(dentry);
+	if (err)
+		goto out;
+	bstart = dbstart(dentry);
+	bend = dbend(dentry);
+
+	FD_ZERO(&branchlist);
+
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+		if (!lower_dentry)
+			continue;
+		if (likely(lower_dentry->d_inode))
+			FD_SET(bindex, &branchlist);
+		/* purge any lower objects after partial_lookup */
+		if (bindex < orig_bstart || bindex > orig_bend) {
+			dput(lower_dentry);
+			unionfs_set_lower_dentry_idx(dentry, bindex, NULL);
+			iput(unionfs_lower_inode_idx(dentry->d_inode, bindex));
+			unionfs_set_lower_inode_idx(dentry->d_inode, bindex,
+						    NULL);
+			mnt = unionfs_lower_mnt_idx(dentry, bindex);
+			if (!mnt)
+				continue;
+			unionfs_mntput(dentry, bindex);
+			unionfs_set_lower_mnt_idx(dentry, bindex, NULL);
+		}
+	}
+	/* restore original dentry's offsets */
+	set_dbstart(dentry, orig_bstart);
+	set_dbend(dentry, orig_bend);
+	ibstart(dentry->d_inode) = orig_bstart;
+	ibend(dentry->d_inode) = orig_bend;
+
+	err = copy_to_user((void __user *)arg, &branchlist, sizeof(fd_set));
+	if (unlikely(err))
+		err = -EFAULT;
+
+out:
+	unionfs_unlock_dentry(dentry);
+	return err < 0 ? err : bend;
+}
+
+long unionfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	long err;
+
+	unionfs_read_lock(file->f_path.dentry->d_sb);
+
+	err = unionfs_file_revalidate(file, true);
+	if (unlikely(err))
+		goto out;
+
+	/* check if asked for local commands */
+	switch (cmd) {
+	case UNIONFS_IOCTL_INCGEN:
+		/* Increment the superblock generation count */
+		pr_info("unionfs: incgen ioctl deprecated; "
+			"use \"-o remount,incgen\"\n");
+		err = -ENOSYS;
+		break;
+
+	case UNIONFS_IOCTL_QUERYFILE:
+		/* Return list of branches containing the given file */
+		err = unionfs_ioctl_queryfile(file, cmd, arg);
+		break;
+
+	default:
+		/* pass the ioctl down */
+		err = do_ioctl(file, cmd, arg);
+		break;
+	}
+
+out:
+	unionfs_check_file(file);
+	unionfs_read_unlock(file->f_path.dentry->d_sb);
+	return err;
+}
+
+int unionfs_flush(struct file *file, fl_owner_t id)
+{
+	int err = 0;
+	struct file *lower_file = NULL;
+	struct dentry *dentry = file->f_path.dentry;
+	int bindex, bstart, bend;
+
+	unionfs_read_lock(dentry->d_sb);
+
+	err = unionfs_file_revalidate(file, true);
+	if (unlikely(err))
+		goto out;
+	unionfs_check_file(file);
+
+	bstart = fbstart(file);
+	bend = fbend(file);
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_file = unionfs_lower_file_idx(file, bindex);
+
+		if (lower_file && lower_file->f_op &&
+		    lower_file->f_op->flush) {
+			err = lower_file->f_op->flush(lower_file, id);
+			if (err)
+				goto out;
+		}
+
+	}
+
+	/* on success, update our times */
+	unionfs_copy_attr_times(dentry->d_inode);
+	/* parent time could have changed too (async) */
+	unionfs_copy_attr_times(dentry->d_parent->d_inode);
+
+out:
+	unionfs_check_file(file);
+	unionfs_read_unlock(dentry->d_sb);
+	return err;
+}
-- 
1.5.2.2


^ permalink raw reply related	[flat|nested] 54+ messages in thread
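
The deferred copyup in the patch above rests on two flag tests: __open_file() strips the write flags when opening a lower file on a read-only branch, and unionfs_file_revalidate() later triggers the copyup when the upper file was opened for writing but the lower one was not. A minimal userspace sketch of just that flag logic follows; lower_open_flags() and needs_delayed_copyup() are hypothetical names, the ro_branch parameter stands in for is_robranch(), and the O_TRUNC case (which copies up immediately in the patch) is left out.

```c
#include <fcntl.h>
#include <stdbool.h>

/* Macros copied from union.h in this series. */
#define OPEN_WRITE_FLAGS (O_WRONLY | O_RDWR | O_APPEND)
#define IS_WRITE_FLAG(flag) ((flag) & OPEN_WRITE_FLAGS)

/*
 * Hypothetical helper: flags used to open the lower file.  On a
 * read-only branch the write flags are dropped, deferring the copyup.
 */
static int lower_open_flags(int upper_flags, bool ro_branch)
{
	if (ro_branch)
		return upper_flags & ~OPEN_WRITE_FLAGS;
	return upper_flags;
}

/*
 * Hypothetical helper: the delayed-copyup condition.  It holds only on
 * a write-intent operation, for a file the user opened for writing,
 * whose lower file was opened without write flags on a read-only branch.
 */
static bool needs_delayed_copyup(int upper_flags, int lower_flags,
				 bool willwrite, bool ro_branch)
{
	return willwrite && IS_WRITE_FLAG(upper_flags) &&
	       !IS_WRITE_FLAG(lower_flags) && ro_branch;
}
```

With these tests, a file opened O_RDWR on a read-only branch costs nothing extra until the first operation that passes willwrite as true; read-only operations never trigger the copy.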

* [PATCH 13/42] Unionfs: basic file operations
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (11 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 12/42] Unionfs: common file copyup/revalidation operations Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 14/42] Unionfs: lower-level copyup routines Erez Zadok
                   ` (29 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Includes read, write, mmap, fsync, and fasync.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/file.c |  227 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 227 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/file.c

diff --git a/fs/unionfs/file.c b/fs/unionfs/file.c
new file mode 100644
index 0000000..c922173
--- /dev/null
+++ b/fs/unionfs/file.c
@@ -0,0 +1,227 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+static ssize_t unionfs_read(struct file *file, char __user *buf,
+			    size_t count, loff_t *ppos)
+{
+	int err;
+
+	unionfs_read_lock(file->f_path.dentry->d_sb);
+	err = unionfs_file_revalidate(file, false);
+	if (unlikely(err))
+		goto out;
+	unionfs_check_file(file);
+
+	err = do_sync_read(file, buf, count, ppos);
+
+out:
+	unionfs_check_file(file);
+	unionfs_read_unlock(file->f_path.dentry->d_sb);
+	return err;
+}
+
+static ssize_t unionfs_write(struct file *file, const char __user *buf,
+			     size_t count, loff_t *ppos)
+{
+	int err = 0;
+
+	unionfs_read_lock(file->f_path.dentry->d_sb);
+	err = unionfs_file_revalidate(file, true);
+	if (unlikely(err))
+		goto out;
+	unionfs_check_file(file);
+
+	err = do_sync_write(file, buf, count, ppos);
+	/* update our inode times upon a successful lower write */
+	if (err >= 0) {
+		unionfs_copy_attr_times(file->f_path.dentry->d_inode);
+		unionfs_check_file(file);
+	}
+
+out:
+	unionfs_read_unlock(file->f_path.dentry->d_sb);
+	return err;
+}
+
+static int unionfs_file_readdir(struct file *file, void *dirent,
+				filldir_t filldir)
+{
+	return -ENOTDIR;
+}
+
+static int unionfs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	int err = 0;
+	bool willwrite;
+	struct file *lower_file;
+
+	unionfs_read_lock(file->f_path.dentry->d_sb);
+
+	/* This might be deferred to mmap's writepage */
+	willwrite = ((vma->vm_flags | VM_SHARED | VM_WRITE) == vma->vm_flags);
+	err = unionfs_file_revalidate(file, willwrite);
+	if (unlikely(err))
+		goto out;
+	unionfs_check_file(file);
+
+	/*
+	 * File systems which do not implement ->writepage may use
+	 * generic_file_readonly_mmap as their ->mmap op.  If you call
+	 * generic_file_readonly_mmap with VM_WRITE, you'd get an -EINVAL.
+	 * But we cannot call the lower ->mmap op, so we can't tell that
+	 * writeable mappings won't work.  Therefore, our only choice is to
+	 * check if the lower file system supports the ->writepage, and if
+	 * not, return EINVAL (the same error that
+	 * generic_file_readonly_mmap returns in that case).
+	 */
+	lower_file = unionfs_lower_file(file);
+	if (willwrite && !lower_file->f_mapping->a_ops->writepage) {
+		err = -EINVAL;
+		printk(KERN_ERR "unionfs: branch %d file system does not "
+		       "support writeable mmap\n", fbstart(file));
+	} else {
+		err = generic_file_mmap(file, vma);
+		if (err)
+			printk(KERN_ERR
+			       "unionfs: generic_file_mmap failed %d\n", err);
+	}
+
+out:
+	if (!err) {
+		/* copyup could cause parent dir times to change */
+		unionfs_copy_attr_times(file->f_path.dentry->d_parent->d_inode);
+		unionfs_check_file(file);
+		unionfs_check_dentry(file->f_path.dentry->d_parent);
+	}
+	unionfs_read_unlock(file->f_path.dentry->d_sb);
+	return err;
+}
+
+int unionfs_fsync(struct file *file, struct dentry *dentry, int datasync)
+{
+	int bindex, bstart, bend;
+	struct file *lower_file;
+	struct dentry *lower_dentry;
+	struct inode *lower_inode, *inode;
+	int err = -EINVAL;
+
+	unionfs_read_lock(file->f_path.dentry->d_sb);
+	err = unionfs_file_revalidate(file, true);
+	if (unlikely(err))
+		goto out;
+	unionfs_check_file(file);
+
+	bstart = fbstart(file);
+	bend = fbend(file);
+	if (bstart < 0 || bend < 0)
+		goto out;
+
+	inode = dentry->d_inode;
+	if (unlikely(!inode)) {
+		printk(KERN_ERR
+		       "unionfs: null inode in unionfs_fsync\n");
+		goto out;
+	}
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_inode = unionfs_lower_inode_idx(inode, bindex);
+		if (!lower_inode || !lower_inode->i_fop->fsync)
+			continue;
+		lower_file = unionfs_lower_file_idx(file, bindex);
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+		mutex_lock(&lower_inode->i_mutex);
+		err = lower_inode->i_fop->fsync(lower_file,
+						lower_dentry,
+						datasync);
+		mutex_unlock(&lower_inode->i_mutex);
+		if (err)
+			goto out;
+	}
+
+	unionfs_copy_attr_times(inode);
+
+out:
+	unionfs_check_file(file);
+	unionfs_read_unlock(file->f_path.dentry->d_sb);
+	return err;
+}
+
+int unionfs_fasync(int fd, struct file *file, int flag)
+{
+	int bindex, bstart, bend;
+	struct file *lower_file;
+	struct dentry *dentry;
+	struct inode *lower_inode, *inode;
+	int err = 0;
+
+	unionfs_read_lock(file->f_path.dentry->d_sb);
+	err = unionfs_file_revalidate(file, true);
+	if (unlikely(err))
+		goto out;
+	unionfs_check_file(file);
+
+	bstart = fbstart(file);
+	bend = fbend(file);
+	if (bstart < 0 || bend < 0)
+		goto out;
+
+	dentry = file->f_path.dentry;
+	inode = dentry->d_inode;
+	if (unlikely(!inode)) {
+		printk(KERN_ERR
+		       "unionfs: null inode in unionfs_fasync\n");
+		goto out;
+	}
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_inode = unionfs_lower_inode_idx(inode, bindex);
+		if (!lower_inode || !lower_inode->i_fop->fasync)
+			continue;
+		lower_file = unionfs_lower_file_idx(file, bindex);
+		mutex_lock(&lower_inode->i_mutex);
+		err = lower_inode->i_fop->fasync(fd, lower_file, flag);
+		mutex_unlock(&lower_inode->i_mutex);
+		if (err)
+			goto out;
+	}
+
+	unionfs_copy_attr_times(inode);
+
+out:
+	unionfs_check_file(file);
+	unionfs_read_unlock(file->f_path.dentry->d_sb);
+	return err;
+}
+
+struct file_operations unionfs_main_fops = {
+	.llseek		= generic_file_llseek,
+	.read		= unionfs_read,
+	.aio_read	= generic_file_aio_read,
+	.write		= unionfs_write,
+	.aio_write	= generic_file_aio_write,
+	.readdir	= unionfs_file_readdir,
+	.unlocked_ioctl	= unionfs_ioctl,
+	.mmap		= unionfs_mmap,
+	.open		= unionfs_open,
+	.flush		= unionfs_flush,
+	.release	= unionfs_file_release,
+	.fsync		= unionfs_fsync,
+	.fasync		= unionfs_fasync,
+	.splice_read	= generic_file_splice_read,
+	.splice_write	= generic_file_splice_write,
+};
-- 
1.5.2.2



* [PATCH 14/42] Unionfs: lower-level copyup routines
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (12 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 13/42] Unionfs: basic file operations Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 15/42] Unionfs: dentry revalidation Erez Zadok
                   ` (28 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
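Reviewer note on __copyup_reg_data(): the data copy is a bounded loop that
moves at most len bytes through a single PAGE_SIZE buffer, one chunk per
iteration, and stops early when the source runs out.  A minimal userspace
sketch of that pattern follows, using in-memory buffers in place of the
lower files.  DEMO_CHUNK and demo_copy_chunks() are hypothetical names, and
unlike the patch the sketch subtracts the bytes actually transferred rather
than a fixed PAGE_SIZE per pass.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_CHUNK 4096	/* hypothetical stand-in for PAGE_SIZE */

/*
 * Copy at most len bytes from src (which holds src_len bytes) into dst,
 * one chunk at a time through a bounce buffer.  Returns the number of
 * bytes copied.  dst must have room for min(len, src_len) bytes.
 */
static size_t demo_copy_chunks(char *dst, const char *src,
			       size_t src_len, size_t len)
{
	char buf[DEMO_CHUNK];
	size_t done = 0;

	while (len > 0 && done < src_len) {
		size_t want = len < DEMO_CHUNK ? len : DEMO_CHUNK;
		size_t avail = src_len - done;
		/* like a read, this step may yield less than requested */
		size_t got = want < avail ? want : avail;

		memcpy(buf, src + done, got);	/* the "read" */
		memcpy(dst + done, buf, got);	/* the "write" */
		done += got;
		len -= got;
	}
	return done;
}
```

Asking for more bytes than the source holds simply copies what is there,
which mirrors the loop ending once a read returns nothing.
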
 fs/unionfs/copyup.c |  897 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 897 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/copyup.c
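
Reviewer note on copyup_xattrs(): the name list filled in by
vfs_listxattr() is a run of back-to-back NUL-terminated strings, and the
patch steps through it with name_list += strlen(name_list) + 1 until it
reaches an empty name.  A minimal userspace sketch of walking such a list
follows.  demo_count_names() is a hypothetical helper, and it adds an
explicit end-of-buffer bound that the patch itself does not use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count the names in a buffer of back-to-back NUL-terminated strings.
 * list_size is the number of valid bytes; the last of them must be a NUL.
 */
static size_t demo_count_names(const char *list, size_t list_size)
{
	const char *p = list;
	const char *end = list + list_size;
	size_t n = 0;

	/* stop at the end of the buffer or at an empty name */
	while (p < end && *p) {
		n++;
		p += strlen(p) + 1;	/* skip the name and its NUL */
	}
	return n;
}
```

The same stepping expression drives the get/set loop in the patch; the
sketch only counts the names instead of copying their values.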

diff --git a/fs/unionfs/copyup.c b/fs/unionfs/copyup.c
new file mode 100644
index 0000000..3fe4865
--- /dev/null
+++ b/fs/unionfs/copyup.c
@@ -0,0 +1,897 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+/*
+ * For detailed explanation of copyup see:
+ * Documentation/filesystems/unionfs/concepts.txt
+ */
+
+#ifdef CONFIG_UNION_FS_XATTR
+/* copyup all extended attrs for a given dentry */
+static int copyup_xattrs(struct dentry *old_lower_dentry,
+			 struct dentry *new_lower_dentry)
+{
+	int err = 0;
+	ssize_t list_size = -1;
+	char *name_list = NULL;
+	char *attr_value = NULL;
+	char *name_list_buf = NULL;
+
+	/* query the actual size of the xattr list */
+	list_size = vfs_listxattr(old_lower_dentry, NULL, 0);
+	if (list_size <= 0) {
+		err = list_size;
+		goto out;
+	}
+
+	/* allocate space for the actual list */
+	name_list = unionfs_xattr_alloc(list_size + 1, XATTR_LIST_MAX);
+	if (unlikely(!name_list || IS_ERR(name_list))) {
+		err = PTR_ERR(name_list);
+		goto out;
+	}
+
+	name_list_buf = name_list; /* save for kfree at end */
+
+	/* now get the actual xattr list of the source file */
+	list_size = vfs_listxattr(old_lower_dentry, name_list, list_size);
+	if (list_size <= 0) {
+		err = list_size;
+		goto out;
+	}
+
+	/* allocate space to hold each xattr's value */
+	attr_value = unionfs_xattr_alloc(XATTR_SIZE_MAX, XATTR_SIZE_MAX);
+	if (unlikely(!attr_value || IS_ERR(attr_value))) {
+		err = PTR_ERR(attr_value);
+		goto out;
+	}
+
+	/* in a loop, get and set each xattr from src to dst file */
+	while (*name_list) {
+		ssize_t size;
+
+		/* Lock here since vfs_getxattr doesn't lock for us */
+		mutex_lock(&old_lower_dentry->d_inode->i_mutex);
+		size = vfs_getxattr(old_lower_dentry, name_list,
+				    attr_value, XATTR_SIZE_MAX);
+		mutex_unlock(&old_lower_dentry->d_inode->i_mutex);
+		if (size < 0) {
+			err = size;
+			goto out;
+		}
+		if (size > XATTR_SIZE_MAX) {
+			err = -E2BIG;
+			goto out;
+		}
+		/* Don't lock here since vfs_setxattr does it for us. */
+		err = vfs_setxattr(new_lower_dentry, name_list, attr_value,
+				   size, 0);
+		/*
+		 * SELinux depends on "security.*" xattrs, so to maintain
+		 * the security of copied-up files, if SELinux is active,
+		 * then we must copy these xattrs as well.  So we need to
+		 * temporarily get FOWNER privileges.
+		 * XXX: move entire copyup code to SIOQ.
+		 */
+		if (err == -EPERM && !capable(CAP_FOWNER)) {
+			cap_raise(current->cap_effective, CAP_FOWNER);
+			err = vfs_setxattr(new_lower_dentry, name_list,
+					   attr_value, size, 0);
+			cap_lower(current->cap_effective, CAP_FOWNER);
+		}
+		if (err < 0)
+			goto out;
+		name_list += strlen(name_list) + 1;
+	}
+out:
+	unionfs_xattr_kfree(name_list_buf);
+	unionfs_xattr_kfree(attr_value);
+	/* Ignore if xattr isn't supported */
+	if (err == -ENOTSUPP || err == -EOPNOTSUPP)
+		err = 0;
+	return err;
+}
+#endif /* CONFIG_UNION_FS_XATTR */
+
+/*
+ * Determine the mode based on the copyup flags, and the existing dentry.
+ *
+ * Handle file systems which may not support certain options.  For example
+ * jffs2 doesn't allow one to chmod a symlink.  So we ignore such harmless
+ * errors, rather than propagating them up, which results in copyup errors
+ * and errors returned back to users.
+ */
+static int copyup_permissions(struct super_block *sb,
+			      struct dentry *old_lower_dentry,
+			      struct dentry *new_lower_dentry)
+{
+	struct inode *i = old_lower_dentry->d_inode;
+	struct iattr newattrs;
+	int err;
+
+	newattrs.ia_atime = i->i_atime;
+	newattrs.ia_mtime = i->i_mtime;
+	newattrs.ia_ctime = i->i_ctime;
+	newattrs.ia_gid = i->i_gid;
+	newattrs.ia_uid = i->i_uid;
+	newattrs.ia_valid = ATTR_CTIME | ATTR_ATIME | ATTR_MTIME |
+		ATTR_ATIME_SET | ATTR_MTIME_SET | ATTR_FORCE |
+		ATTR_GID | ATTR_UID;
+	err = notify_change(new_lower_dentry, &newattrs);
+	if (err)
+		goto out;
+
+	/* now try to change the mode and ignore EOPNOTSUPP on symlinks */
+	newattrs.ia_mode = i->i_mode;
+	newattrs.ia_valid = ATTR_MODE | ATTR_FORCE;
+	err = notify_change(new_lower_dentry, &newattrs);
+	if (err == -EOPNOTSUPP &&
+	    S_ISLNK(new_lower_dentry->d_inode->i_mode)) {
+		printk(KERN_WARNING
+		       "unionfs: changing \"%s\" symlink mode unsupported\n",
+		       new_lower_dentry->d_name.name);
+		err = 0;
+	}
+
+out:
+	return err;
+}
+
+/*
+ * create the new device/file/directory - callers use copyup_permissions()
+ * to copy up times and mode
+ *
+ * if the object being copied up is a regular file, the file is only created,
+ * the contents have to be copied up separately
+ */
+static int __copyup_ndentry(struct dentry *old_lower_dentry,
+			    struct dentry *new_lower_dentry,
+			    struct dentry *new_lower_parent_dentry,
+			    char *symbuf)
+{
+	int err = 0;
+	umode_t old_mode = old_lower_dentry->d_inode->i_mode;
+	struct sioq_args args;
+
+	if (S_ISDIR(old_mode)) {
+		args.mkdir.parent = new_lower_parent_dentry->d_inode;
+		args.mkdir.dentry = new_lower_dentry;
+		args.mkdir.mode = old_mode;
+
+		run_sioq(__unionfs_mkdir, &args);
+		err = args.err;
+	} else if (S_ISLNK(old_mode)) {
+		args.symlink.parent = new_lower_parent_dentry->d_inode;
+		args.symlink.dentry = new_lower_dentry;
+		args.symlink.symbuf = symbuf;
+		args.symlink.mode = old_mode;
+
+		run_sioq(__unionfs_symlink, &args);
+		err = args.err;
+	} else if (S_ISBLK(old_mode) || S_ISCHR(old_mode) ||
+		   S_ISFIFO(old_mode) || S_ISSOCK(old_mode)) {
+		args.mknod.parent = new_lower_parent_dentry->d_inode;
+		args.mknod.dentry = new_lower_dentry;
+		args.mknod.mode = old_mode;
+		args.mknod.dev = old_lower_dentry->d_inode->i_rdev;
+
+		run_sioq(__unionfs_mknod, &args);
+		err = args.err;
+	} else if (S_ISREG(old_mode)) {
+		struct nameidata nd;
+		err = init_lower_nd(&nd, LOOKUP_CREATE);
+		if (unlikely(err < 0))
+			goto out;
+		args.create.nd = &nd;
+		args.create.parent = new_lower_parent_dentry->d_inode;
+		args.create.dentry = new_lower_dentry;
+		args.create.mode = old_mode;
+
+		run_sioq(__unionfs_create, &args);
+		err = args.err;
+		release_lower_nd(&nd, err);
+	} else {
+		printk(KERN_CRIT "unionfs: unknown inode type %d\n",
+		       old_mode);
+		BUG();
+	}
+
+out:
+	return err;
+}
+
+static int __copyup_reg_data(struct dentry *dentry,
+			     struct dentry *new_lower_dentry, int new_bindex,
+			     struct dentry *old_lower_dentry, int old_bindex,
+			     struct file **copyup_file, loff_t len)
+{
+	struct super_block *sb = dentry->d_sb;
+	struct file *input_file;
+	struct file *output_file;
+	struct vfsmount *output_mnt;
+	mm_segment_t old_fs;
+	char *buf = NULL;
+	ssize_t read_bytes, write_bytes;
+	loff_t size;
+	int err = 0;
+
+	/* open old file */
+	unionfs_mntget(dentry, old_bindex);
+	branchget(sb, old_bindex);
+	/* dentry_open calls dput and mntput if it returns an error */
+	input_file = dentry_open(old_lower_dentry,
+				 unionfs_lower_mnt_idx(dentry, old_bindex),
+				 O_RDONLY | O_LARGEFILE);
+	if (IS_ERR(input_file)) {
+		dput(old_lower_dentry);
+		err = PTR_ERR(input_file);
+		goto out;
+	}
+	if (unlikely(!input_file->f_op || !input_file->f_op->read)) {
+		err = -EINVAL;
+		goto out_close_in;
+	}
+
+	/* open new file */
+	dget(new_lower_dentry);
+	output_mnt = unionfs_mntget(sb->s_root, new_bindex);
+	branchget(sb, new_bindex);
+	output_file = dentry_open(new_lower_dentry, output_mnt,
+				  O_RDWR | O_LARGEFILE);
+	if (IS_ERR(output_file)) {
+		err = PTR_ERR(output_file);
+		goto out_close_in2;
+	}
+	if (unlikely(!output_file->f_op || !output_file->f_op->write)) {
+		err = -EINVAL;
+		goto out_close_out;
+	}
+
+	/* allocating a buffer */
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (unlikely(!buf)) {
+		err = -ENOMEM;
+		goto out_close_out;
+	}
+
+	input_file->f_pos = 0;
+	output_file->f_pos = 0;
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+
+	size = len;
+	err = 0;
+	do {
+		if (len >= PAGE_SIZE)
+			size = PAGE_SIZE;
+		else if ((len < PAGE_SIZE) && (len > 0))
+			size = len;
+
+		len -= PAGE_SIZE;
+
+		read_bytes =
+			input_file->f_op->read(input_file,
+					       (char __user *)buf, size,
+					       &input_file->f_pos);
+		if (read_bytes <= 0) {
+			err = read_bytes;
+			break;
+		}
+
+		write_bytes =
+			output_file->f_op->write(output_file,
+						 (char __user *)buf,
+						 read_bytes,
+						 &output_file->f_pos);
+		if ((write_bytes < 0) || (write_bytes < read_bytes)) {
+			err = write_bytes;
+			break;
+		}
+	} while ((read_bytes > 0) && (len > 0));
+
+	set_fs(old_fs);
+
+	kfree(buf);
+
+	if (!err)
+		err = output_file->f_op->fsync(output_file,
+					       new_lower_dentry, 0);
+
+	if (err)
+		goto out_close_out;
+
+	if (copyup_file) {
+		*copyup_file = output_file;
+		goto out_close_in;
+	}
+
+out_close_out:
+	fput(output_file);
+
+out_close_in2:
+	branchput(sb, new_bindex);
+
+out_close_in:
+	fput(input_file);
+
+out:
+	branchput(sb, old_bindex);
+
+	return err;
+}
+
+/*
+ * dput the lower references for old and new dentry & clear a lower dentry
+ * pointer
+ */
+static void __clear(struct dentry *dentry, struct dentry *old_lower_dentry,
+		    int old_bstart, int old_bend,
+		    struct dentry *new_lower_dentry, int new_bindex)
+{
+	/* get rid of the lower dentry and all its traces */
+	unionfs_set_lower_dentry_idx(dentry, new_bindex, NULL);
+	set_dbstart(dentry, old_bstart);
+	set_dbend(dentry, old_bend);
+
+	dput(new_lower_dentry);
+	dput(old_lower_dentry);
+}
+
+/*
+ * Copy up a dentry to a file of specified name.
+ *
+ * @dir: used to pull the ->i_sb to access other branches
+ * @dentry: the non-negative dentry whose lower_inode we should copy
+ * @bstart: the branch of the lower_inode to copy from
+ * @new_bindex: the branch to create the new file in
+ * @name: the name of the file to create
+ * @namelen: length of @name
+ * @copyup_file: the "struct file" to return (optional)
+ * @len: how many bytes to copy-up?
+ */
+int copyup_dentry(struct inode *dir, struct dentry *dentry, int bstart,
+		  int new_bindex, const char *name, int namelen,
+		  struct file **copyup_file, loff_t len)
+{
+	struct dentry *new_lower_dentry;
+	struct dentry *old_lower_dentry = NULL;
+	struct super_block *sb;
+	int err = 0;
+	int old_bindex;
+	int old_bstart;
+	int old_bend;
+	struct dentry *new_lower_parent_dentry = NULL;
+	mm_segment_t oldfs;
+	char *symbuf = NULL;
+
+	verify_locked(dentry);
+
+	old_bindex = bstart;
+	old_bstart = dbstart(dentry);
+	old_bend = dbend(dentry);
+
+	BUG_ON(new_bindex < 0);
+	BUG_ON(new_bindex >= old_bindex);
+
+	sb = dir->i_sb;
+
+	err = is_robranch_super(sb, new_bindex);
+	if (err)
+		goto out;
+
+	/* Create the directory structure above this dentry. */
+	new_lower_dentry = create_parents(dir, dentry, name, new_bindex);
+	if (IS_ERR(new_lower_dentry)) {
+		err = PTR_ERR(new_lower_dentry);
+		goto out;
+	}
+
+	old_lower_dentry = unionfs_lower_dentry_idx(dentry, old_bindex);
+	/* we conditionally dput this old_lower_dentry at end of function */
+	dget(old_lower_dentry);
+
+	/* For symlinks, we must read the link before we lock the directory. */
+	if (S_ISLNK(old_lower_dentry->d_inode->i_mode)) {
+
+		symbuf = kmalloc(PATH_MAX, GFP_KERNEL);
+		if (unlikely(!symbuf)) {
+			__clear(dentry, old_lower_dentry,
+				old_bstart, old_bend,
+				new_lower_dentry, new_bindex);
+			err = -ENOMEM;
+			goto out_free;
+		}
+
+		oldfs = get_fs();
+		set_fs(KERNEL_DS);
+		err = old_lower_dentry->d_inode->i_op->readlink(
+			old_lower_dentry,
+			(char __user *)symbuf,
+			PATH_MAX);
+		set_fs(oldfs);
+		if (err < 0) {
+			__clear(dentry, old_lower_dentry,
+				old_bstart, old_bend,
+				new_lower_dentry, new_bindex);
+			goto out_free;
+		}
+		symbuf[err] = '\0';
+	}
+
+	/* Now we lock the parent, and create the object in the new branch. */
+	new_lower_parent_dentry = lock_parent(new_lower_dentry);
+
+	/* create the new inode */
+	err = __copyup_ndentry(old_lower_dentry, new_lower_dentry,
+			       new_lower_parent_dentry, symbuf);
+
+	if (err) {
+		__clear(dentry, old_lower_dentry,
+			old_bstart, old_bend,
+			new_lower_dentry, new_bindex);
+		goto out_unlock;
+	}
+
+	/* We actually copyup the file here. */
+	if (S_ISREG(old_lower_dentry->d_inode->i_mode))
+		err = __copyup_reg_data(dentry, new_lower_dentry, new_bindex,
+					old_lower_dentry, old_bindex,
+					copyup_file, len);
+	if (err)
+		goto out_unlink;
+
+	/* Set permissions. */
+	err = copyup_permissions(sb, old_lower_dentry, new_lower_dentry);
+	if (err)
+		goto out_unlink;
+
+#ifdef CONFIG_UNION_FS_XATTR
+	/* SELinux uses extended attributes for permissions. */
+	err = copyup_xattrs(old_lower_dentry, new_lower_dentry);
+	if (err)
+		goto out_unlink;
+#endif /* CONFIG_UNION_FS_XATTR */
+
+	/* do not allow files getting deleted to be re-interposed */
+	if (!d_deleted(dentry))
+		unionfs_reinterpose(dentry);
+
+	goto out_unlock;
+
+out_unlink:
+	/*
+	 * copyup failed, because we possibly ran out of space or
+	 * quota, or something else happened so let's unlink; we don't
+	 * really care about the return value of vfs_unlink
+	 */
+	vfs_unlink(new_lower_parent_dentry->d_inode, new_lower_dentry);
+
+	if (copyup_file) {
+		/* need to close the file */
+
+		fput(*copyup_file);
+		branchput(sb, new_bindex);
+	}
+
+	/*
+	 * TODO: should we reset the error to something like -EIO?
+	 *
+	 * If we don't reset, the user may get some nonsensical errors, but
+	 * on the other hand, if we reset to EIO, we guarantee that the user
+	 * will get a "confusing" error message.
+	 */
+
+out_unlock:
+	unlock_dir(new_lower_parent_dentry);
+
+out_free:
+	/*
+	 * If old_lower_dentry was a directory or a symlink, we need to dput
+	 * it.  If it was a file, then it was already dput indirectly by other
+	 * functions we call above which operate on regular files.
+	 */
+	if (old_lower_dentry && old_lower_dentry->d_inode &&
+	    (S_ISDIR(old_lower_dentry->d_inode->i_mode) ||
+	     S_ISLNK(old_lower_dentry->d_inode->i_mode)))
+		dput(old_lower_dentry);
+	kfree(symbuf);
+
+	if (err)
+		goto out;
+	if (!S_ISDIR(dentry->d_inode->i_mode)) {
+		unionfs_postcopyup_release(dentry);
+		if (!unionfs_lower_inode(dentry->d_inode)) {
+			/*
+			 * If we got here, then we copied up to an
+			 * unlinked-open file, whose name is .unionfsXXXXX.
+			 */
+			struct inode *inode = new_lower_dentry->d_inode;
+			atomic_inc(&inode->i_count);
+			unionfs_set_lower_inode_idx(dentry->d_inode,
+						    ibstart(dentry->d_inode),
+						    inode);
+		}
+	}
+	unionfs_postcopyup_setmnt(dentry);
+	/* sync inode times from copied-up inode to our inode */
+	unionfs_copy_attr_times(dentry->d_inode);
+	unionfs_check_inode(dir);
+	unionfs_check_dentry(dentry);
+out:
+	return err;
+}
+
+/*
+ * This function creates a copy of a file represented by 'file' which
+ * currently resides in branch 'bstart' to branch 'new_bindex'.  The copy
+ * will be named "name".
+ */
+int copyup_named_file(struct inode *dir, struct file *file, char *name,
+		      int bstart, int new_bindex, loff_t len)
+{
+	int err = 0;
+	struct file *output_file = NULL;
+
+	err = copyup_dentry(dir, file->f_path.dentry, bstart, new_bindex,
+			    name, strlen(name), &output_file, len);
+	if (!err) {
+		fbstart(file) = new_bindex;
+		unionfs_set_lower_file_idx(file, new_bindex, output_file);
+	}
+
+	return err;
+}
+
+/*
+ * This function creates a copy of a file represented by 'file' which
+ * currently resides in branch 'bstart' to branch 'new_bindex'.
+ */
+int copyup_file(struct inode *dir, struct file *file, int bstart,
+		int new_bindex, loff_t len)
+{
+	int err = 0;
+	struct file *output_file = NULL;
+	struct dentry *dentry = file->f_path.dentry;
+
+	err = copyup_dentry(dir, dentry, bstart, new_bindex,
+			    dentry->d_name.name, dentry->d_name.len,
+			    &output_file, len);
+	if (!err) {
+		fbstart(file) = new_bindex;
+		unionfs_set_lower_file_idx(file, new_bindex, output_file);
+	}
+
+	return err;
+}
+
+/* purge a dentry's lower-branch states (dput/mntput, etc.) */
+static void __cleanup_dentry(struct dentry *dentry, int bindex,
+			     int old_bstart, int old_bend)
+{
+	int loop_start;
+	int loop_end;
+	int new_bstart = -1;
+	int new_bend = -1;
+	int i;
+
+	loop_start = min(old_bstart, bindex);
+	loop_end = max(old_bend, bindex);
+
+	/*
+	 * This loop sets the bstart and bend for the new dentry by
+	 * traversing from left to right.  It also dputs all negative
+	 * dentries except bindex
+	 */
+	for (i = loop_start; i <= loop_end; i++) {
+		if (!unionfs_lower_dentry_idx(dentry, i))
+			continue;
+
+		if (i == bindex) {
+			new_bend = i;
+			if (new_bstart < 0)
+				new_bstart = i;
+			continue;
+		}
+
+		if (!unionfs_lower_dentry_idx(dentry, i)->d_inode) {
+			dput(unionfs_lower_dentry_idx(dentry, i));
+			unionfs_set_lower_dentry_idx(dentry, i, NULL);
+
+			unionfs_mntput(dentry, i);
+			unionfs_set_lower_mnt_idx(dentry, i, NULL);
+		} else {
+			if (new_bstart < 0)
+				new_bstart = i;
+			new_bend = i;
+		}
+	}
+
+	if (new_bstart < 0)
+		new_bstart = bindex;
+	if (new_bend < 0)
+		new_bend = bindex;
+	set_dbstart(dentry, new_bstart);
+	set_dbend(dentry, new_bend);
+
+}
+
+/* set lower inode ptr and update bstart & bend if necessary */
+static void __set_inode(struct dentry *upper, struct dentry *lower,
+			int bindex)
+{
+	unionfs_set_lower_inode_idx(upper->d_inode, bindex,
+				    igrab(lower->d_inode));
+	if (likely(ibstart(upper->d_inode) > bindex))
+		ibstart(upper->d_inode) = bindex;
+	if (likely(ibend(upper->d_inode) < bindex))
+		ibend(upper->d_inode) = bindex;
+
+}
+
+/* set lower dentry ptr and update bstart & bend if necessary */
+static void __set_dentry(struct dentry *upper, struct dentry *lower,
+			 int bindex)
+{
+	unionfs_set_lower_dentry_idx(upper, bindex, lower);
+	if (likely(dbstart(upper) > bindex))
+		set_dbstart(upper, bindex);
+	if (likely(dbend(upper) < bindex))
+		set_dbend(upper, bindex);
+}
+
+/*
+ * This function replicates the directory structure up-to given dentry
+ * in the bindex branch.
+ */
+struct dentry *create_parents(struct inode *dir, struct dentry *dentry,
+			      const char *name, int bindex)
+{
+	int err;
+	struct dentry *child_dentry;
+	struct dentry *parent_dentry;
+	struct dentry *lower_parent_dentry = NULL;
+	struct dentry *lower_dentry = NULL;
+	const char *childname;
+	unsigned int childnamelen;
+	int nr_dentry;
+	int count = 0;
+	int old_bstart;
+	int old_bend;
+	struct dentry **path = NULL;
+	struct super_block *sb;
+
+	verify_locked(dentry);
+
+	err = is_robranch_super(dir->i_sb, bindex);
+	if (err) {
+		lower_dentry = ERR_PTR(err);
+		goto out;
+	}
+
+	old_bstart = dbstart(dentry);
+	old_bend = dbend(dentry);
+
+	lower_dentry = ERR_PTR(-ENOMEM);
+
+	/* There is no sense allocating any less than the minimum. */
+	nr_dentry = 1;
+	path = kmalloc(nr_dentry * sizeof(struct dentry *), GFP_KERNEL);
+	if (unlikely(!path))
+		goto out;
+
+	/* assume the negative dentry of unionfs as the parent dentry */
+	parent_dentry = dentry;
+
+	/*
+	 * This loop finds the first parent that exists in the given branch.
+	 * We start building the directory structure from there.  At the end
+	 * of the loop, the following should hold:
+	 *  - child_dentry is the first nonexistent child
+	 *  - parent_dentry is the first existent parent
+	 *  - path[0] is the deepest child
+	 *  - path[count] is the first child to create
+	 */
+	do {
+		child_dentry = parent_dentry;
+
+		/* find the parent directory dentry in unionfs */
+		parent_dentry = child_dentry->d_parent;
+		unionfs_lock_dentry(parent_dentry);
+
+		/* find out the lower_parent_dentry in the given branch */
+		lower_parent_dentry =
+			unionfs_lower_dentry_idx(parent_dentry, bindex);
+
+		/* grow path table */
+		if (count == nr_dentry) {
+			void *p;
+
+			nr_dentry *= 2;
+			p = krealloc(path, nr_dentry * sizeof(struct dentry *),
+				     GFP_KERNEL);
+			if (unlikely(!p)) {
+				lower_dentry = ERR_PTR(-ENOMEM);
+				goto out;
+			}
+			path = p;
+		}
+
+		/* store the child dentry */
+		path[count++] = child_dentry;
+	} while (!lower_parent_dentry);
+	count--;
+
+	sb = dentry->d_sb;
+
+	/*
+	 * This code goes between the begin/end labels and basically
+	 * emulates a while(child_dentry != dentry), only cleaner and
+	 * shorter than what would be a much longer while loop.
+	 */
+begin:
+	/* get lower parent dir in the current branch */
+	lower_parent_dentry = unionfs_lower_dentry_idx(parent_dentry, bindex);
+	unionfs_unlock_dentry(parent_dentry);
+
+	/* init the values to lookup */
+	childname = child_dentry->d_name.name;
+	childnamelen = child_dentry->d_name.len;
+
+	if (child_dentry != dentry) {
+		/* lookup child in the underlying file system */
+		lower_dentry = lookup_one_len(childname, lower_parent_dentry,
+					      childnamelen);
+		if (IS_ERR(lower_dentry))
+			goto out;
+	} else {
+		/*
+		 * Is the name a whiteout of the child name?  Look up the
+		 * whiteout child in the underlying file system
+		 */
+		lower_dentry = lookup_one_len(name, lower_parent_dentry,
+					      strlen(name));
+		if (IS_ERR(lower_dentry))
+			goto out;
+
+		/* Replace the current dentry (if any) with the new one */
+		dput(unionfs_lower_dentry_idx(dentry, bindex));
+		unionfs_set_lower_dentry_idx(dentry, bindex,
+					     lower_dentry);
+
+		__cleanup_dentry(dentry, bindex, old_bstart, old_bend);
+		goto out;
+	}
+
+	if (lower_dentry->d_inode) {
+		/*
+		 * since this already exists we dput to avoid
+		 * multiple references on the same dentry
+		 */
+		dput(lower_dentry);
+	} else {
+		struct sioq_args args;
+
+		/* it's a negative dentry, create a new dir */
+		lower_parent_dentry = lock_parent(lower_dentry);
+
+		args.mkdir.parent = lower_parent_dentry->d_inode;
+		args.mkdir.dentry = lower_dentry;
+		args.mkdir.mode = child_dentry->d_inode->i_mode;
+
+		run_sioq(__unionfs_mkdir, &args);
+		err = args.err;
+
+		if (!err)
+			err = copyup_permissions(dir->i_sb, child_dentry,
+						 lower_dentry);
+		unlock_dir(lower_parent_dentry);
+		if (err) {
+			struct inode *inode = lower_dentry->d_inode;
+			/*
+			 * If we get here, it means that we created a new
+			 * dentry+inode, but copying permissions failed.
+			 * Therefore, we should delete this inode and dput
+			 * the dentry so as not to leave cruft behind.
+			 */
+			if (lower_dentry->d_op && lower_dentry->d_op->d_iput)
+				lower_dentry->d_op->d_iput(lower_dentry,
+							   inode);
+			else
+				iput(inode);
+			lower_dentry->d_inode = NULL;
+			dput(lower_dentry);
+			lower_dentry = ERR_PTR(err);
+			goto out;
+		}
+
+	}
+
+	__set_inode(child_dentry, lower_dentry, bindex);
+	__set_dentry(child_dentry, lower_dentry, bindex);
+	/*
+	 * update times of this dentry, but also the parent, because if
+	 * we changed, the parent may have changed too.
+	 */
+	unionfs_copy_attr_times(parent_dentry->d_inode);
+	unionfs_copy_attr_times(child_dentry->d_inode);
+
+	parent_dentry = child_dentry;
+	child_dentry = path[--count];
+	goto begin;
+out:
+	/* cleanup any leftover locks from the do/while loop above */
+	if (IS_ERR(lower_dentry))
+		while (count)
+			unionfs_unlock_dentry(path[count--]);
+	kfree(path);
+	return lower_dentry;
+}
+
+/*
+ * Post-copyup helper to ensure we have valid mnts: set lower mnt of
+ * dentry+parents to the first parent node that has an mnt.
+ */
+void unionfs_postcopyup_setmnt(struct dentry *dentry)
+{
+	struct dentry *parent, *hasone;
+	int bindex = dbstart(dentry);
+
+	if (unionfs_lower_mnt_idx(dentry, bindex))
+		return;
+	hasone = dentry->d_parent;
+	/* this loop should stop at root dentry */
+	while (!unionfs_lower_mnt_idx(hasone, bindex))
+		hasone = hasone->d_parent;
+	parent = dentry;
+	while (!unionfs_lower_mnt_idx(parent, bindex)) {
+		unionfs_set_lower_mnt_idx(parent, bindex,
+					  unionfs_mntget(hasone, bindex));
+		parent = parent->d_parent;
+	}
+}
+
+/*
+ * Post-copyup helper to release all non-directory source objects of a
+ * copied-up file.  Regular files should have only one lower object.
+ */
+void unionfs_postcopyup_release(struct dentry *dentry)
+{
+	int bindex;
+
+	BUG_ON(S_ISDIR(dentry->d_inode->i_mode));
+	for (bindex = dbstart(dentry)+1; bindex <= dbend(dentry); bindex++) {
+		if (unionfs_lower_mnt_idx(dentry, bindex)) {
+			unionfs_mntput(dentry, bindex);
+			unionfs_set_lower_mnt_idx(dentry, bindex, NULL);
+		}
+		if (unionfs_lower_dentry_idx(dentry, bindex)) {
+			dput(unionfs_lower_dentry_idx(dentry, bindex));
+			unionfs_set_lower_dentry_idx(dentry, bindex, NULL);
+			iput(unionfs_lower_inode_idx(dentry->d_inode, bindex));
+			unionfs_set_lower_inode_idx(dentry->d_inode, bindex,
+						    NULL);
+		}
+	}
+	bindex = dbstart(dentry);
+	set_dbend(dentry, bindex);
+	ibend(dentry->d_inode) = ibstart(dentry->d_inode) = bindex;
+}
-- 
1.5.2.2
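
For readers following unionfs_postcopyup_setmnt() above, here is a rough
user-space sketch of the same walk: find the nearest ancestor that already
has a mount, then assign it to the dentry and every mount-less ancestor in
between.  The names (struct dnode, inherit_mnt) are made up for
illustration and are not part of the patch.  The sketch assumes the root
always has a mount, as the kernel comment implies, and it does not model
the per-node reference that unionfs_mntget() takes.

```c
#include <stddef.h>

/* Hypothetical stand-in for a unionfs dentry on a single branch. */
struct dnode {
	struct dnode *parent;	/* the root is its own parent */
	const char *mnt;	/* stand-in for a vfsmount; NULL if unset */
};

/*
 * Give d, and every ancestor lacking a mount, the mount of the nearest
 * ancestor that has one.  Assumes the root always has a mount, otherwise
 * the first loop would never terminate.
 */
static void inherit_mnt(struct dnode *d)
{
	struct dnode *hasone, *cur;

	if (d->mnt)
		return;		/* nothing to do */

	/* find the first ancestor that has a mount */
	hasone = d->parent;
	while (!hasone->mnt)
		hasone = hasone->parent;

	/* fill in d and the mount-less ancestors below hasone */
	for (cur = d; !cur->mnt; cur = cur->parent)
		cur->mnt = hasone->mnt;
}
```

With a chain root -> a -> b in which only the root has a mount, calling
inherit_mnt(&b) leaves both a and b pointing at the root's mount.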


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 15/42] Unionfs: dentry revalidation
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (13 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 14/42] Unionfs: lower-level copyup routines Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 16/42] Unionfs: lower-level lookup routines Erez Zadok
                   ` (27 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Includes d_release methods and cache-coherency support for dentries.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/dentry.c |  498 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 498 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/dentry.c

diff --git a/fs/unionfs/dentry.c b/fs/unionfs/dentry.c
new file mode 100644
index 0000000..7d27987
--- /dev/null
+++ b/fs/unionfs/dentry.c
@@ -0,0 +1,498 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+/*
+ * Revalidate a single dentry.
+ * Assume that dentry's info node is locked.
+ * Assume that parent(s) are all valid already, but
+ * the child may not yet be valid.
+ * Returns true if valid, false otherwise.
+ */
+static bool __unionfs_d_revalidate_one(struct dentry *dentry,
+				       struct nameidata *nd)
+{
+	bool valid = true;	/* default is valid */
+	struct dentry *lower_dentry;
+	int bindex, bstart, bend;
+	int sbgen, dgen;
+	int positive = 0;
+	int locked = 0;
+	int interpose_flag;
+	struct nameidata lowernd; /* TODO: be gentler to the stack */
+
+	if (nd)
+		memcpy(&lowernd, nd, sizeof(struct nameidata));
+	else
+		memset(&lowernd, 0, sizeof(struct nameidata));
+
+	verify_locked(dentry);
+
+	/* if the dentry is unhashed, do NOT revalidate */
+	if (d_deleted(dentry))
+		goto out;
+
+	BUG_ON(dbstart(dentry) == -1);
+	if (dentry->d_inode)
+		positive = 1;
+	dgen = atomic_read(&UNIONFS_D(dentry)->generation);
+	sbgen = atomic_read(&UNIONFS_SB(dentry->d_sb)->generation);
+	/*
+	 * If we are working on an unconnected dentry, then there is no
+	 * revalidation to be done, because this file does not exist within
+	 * the namespace, and Unionfs operates on the namespace, not data.
+	 */
+	if (unlikely(sbgen != dgen)) {
+		struct dentry *result;
+		int pdgen;
+
+		/* The root entry should always be valid */
+		BUG_ON(IS_ROOT(dentry));
+
+		/* We can't work correctly if our parent isn't valid. */
+		pdgen = atomic_read(&UNIONFS_D(dentry->d_parent)->generation);
+		BUG_ON(pdgen != sbgen);	/* should never happen here */
+
+		/* Free the pointers for our inodes and this dentry. */
+		bstart = dbstart(dentry);
+		bend = dbend(dentry);
+		if (bstart >= 0) {
+			struct dentry *lower_dentry;
+			for (bindex = bstart; bindex <= bend; bindex++) {
+				lower_dentry =
+					unionfs_lower_dentry_idx(dentry,
+								 bindex);
+				dput(lower_dentry);
+			}
+		}
+		set_dbstart(dentry, -1);
+		set_dbend(dentry, -1);
+
+		interpose_flag = INTERPOSE_REVAL_NEG;
+		if (positive) {
+			interpose_flag = INTERPOSE_REVAL;
+			/*
+			 * During BRM, the VFS could already hold a lock on
+			 * a file being read, so don't lock it again
+			 * (deadlock), but if you lock it in this function,
+			 * then release it here too.
+			 */
+			if (!mutex_is_locked(&dentry->d_inode->i_mutex)) {
+				mutex_lock(&dentry->d_inode->i_mutex);
+				locked = 1;
+			}
+
+			bstart = ibstart(dentry->d_inode);
+			bend = ibend(dentry->d_inode);
+			if (bstart >= 0) {
+				struct inode *lower_inode;
+				for (bindex = bstart; bindex <= bend;
+				     bindex++) {
+					lower_inode =
+						unionfs_lower_inode_idx(
+							dentry->d_inode,
+							bindex);
+					iput(lower_inode);
+				}
+			}
+			kfree(UNIONFS_I(dentry->d_inode)->lower_inodes);
+			UNIONFS_I(dentry->d_inode)->lower_inodes = NULL;
+			ibstart(dentry->d_inode) = -1;
+			ibend(dentry->d_inode) = -1;
+			if (locked)
+				mutex_unlock(&dentry->d_inode->i_mutex);
+		}
+
+		result = unionfs_lookup_backend(dentry, &lowernd,
+						interpose_flag);
+		if (result) {
+			if (IS_ERR(result)) {
+				valid = false;
+				goto out;
+			}
+			/*
+			 * current unionfs_lookup_backend() doesn't return
+			 * a valid dentry
+			 */
+			dput(dentry);
+			dentry = result;
+		}
+
+		if (unlikely(positive && UNIONFS_I(dentry->d_inode)->stale)) {
+			make_bad_inode(dentry->d_inode);
+			d_drop(dentry);
+			valid = false;
+			goto out;
+		}
+		goto out;
+	}
+
+	/* The revalidation must occur across all branches */
+	bstart = dbstart(dentry);
+	bend = dbend(dentry);
+	BUG_ON(bstart == -1);
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+		if (!lower_dentry || !lower_dentry->d_op
+		    || !lower_dentry->d_op->d_revalidate)
+			continue;
+		/*
+		 * Don't pass nameidata to lower file system, because we
+		 * don't want an arbitrary lower file being opened or
+		 * returned to us: it may be useless to us because of the
+		 * fanout nature of unionfs (cf. file/directory open-file
+		 * invariants).  We will open lower files as and when needed
+		 * later on.
+		 */
+		if (!lower_dentry->d_op->d_revalidate(lower_dentry, NULL))
+			valid = false;
+	}
+
+	if (!dentry->d_inode)
+		valid = false;
+
+	if (valid) {
+		/*
+		 * If we get here, and we copy the meta-data from the lower
+		 * inode to our inode, then it is vital that we have already
+		 * purged all unionfs-level file data.  We do that in the
+		 * caller (__unionfs_d_revalidate_chain) by calling
+		 * purge_inode_data.
+		 */
+		unionfs_copy_attr_all(dentry->d_inode,
+				      unionfs_lower_inode(dentry->d_inode));
+		fsstack_copy_inode_size(dentry->d_inode,
+					unionfs_lower_inode(dentry->d_inode));
+	}
+
+out:
+	return valid;
+}
+
+/*
+ * Determine if the lower inode objects have changed from below the unionfs
+ * inode.  Return true if changed, false otherwise.
+ *
+ * We check if the mtime or ctime have changed.  However, the inode times
+ * can be changed by anyone without much protection, including
+ * asynchronously.  This can sometimes cause unionfs to find that the lower
+ * file system doesn't change its inode times quickly enough, resulting in a
+ * false positive indication (which is harmless, it just makes unionfs do
+ * extra work in re-validating the objects).  To minimize the chances of
+ * these situations, we still consider such small time changes valid, but we
+ * don't print debugging messages unless the time changes are greater than
+ * UNIONFS_MIN_CC_TIME (which defaults to 3 seconds, as with NFS's acregmin)
+ * because significant changes are more likely due to users manually
+ * touching lower files.
+ */
+bool is_newer_lower(const struct dentry *dentry)
+{
+	int bindex;
+	struct inode *inode;
+	struct inode *lower_inode;
+
+	/* ignore if we're called on semi-initialized dentries/inodes */
+	if (!dentry || !UNIONFS_D(dentry))
+		return false;
+	inode = dentry->d_inode;
+	if (!inode || !UNIONFS_I(inode) ||
+	    ibstart(inode) < 0 || ibend(inode) < 0)
+		return false;
+
+	for (bindex = ibstart(inode); bindex <= ibend(inode); bindex++) {
+		lower_inode = unionfs_lower_inode_idx(inode, bindex);
+		if (!lower_inode)
+			continue;
+
+		/* check if mtime/ctime have changed */
+		if (unlikely(timespec_compare(&inode->i_mtime,
+					      &lower_inode->i_mtime) < 0)) {
+			if ((lower_inode->i_mtime.tv_sec -
+			     inode->i_mtime.tv_sec) > UNIONFS_MIN_CC_TIME) {
+				pr_info("unionfs: new lower inode mtime "
+					"(bindex=%d, name=%s)\n", bindex,
+					dentry->d_name.name);
+				show_dinode_times(dentry);
+			}
+			return true;
+		}
+		if (unlikely(timespec_compare(&inode->i_ctime,
+					      &lower_inode->i_ctime) < 0)) {
+			if ((lower_inode->i_ctime.tv_sec -
+			     inode->i_ctime.tv_sec) > UNIONFS_MIN_CC_TIME) {
+				pr_info("unionfs: new lower inode ctime "
+					"(bindex=%d, name=%s)\n", bindex,
+					dentry->d_name.name);
+				show_dinode_times(dentry);
+			}
+			return true;
+		}
+	}
+	return false;		/* default: lower is not newer */
+}
+
+/*
+ * Purge/remove/unmap all data pages of a unionfs inode.  This is called
+ * when the lower inode has changed, and we have to force processes to get
+ * the new data.
+ *
+ * XXX: Our implementation works as long as a user process causes Unionfs
+ * to be called, directly or indirectly, even if only for ->d_revalidate:
+ * by then we will have purged the current Unionfs data and the process
+ * will see the new data.  For example, a process that continually
+ * re-reads the same file's data will see the NEW data as soon as the lower
+ * file had changed, upon the next read(2) syscall (even if the file is
+ * still open!)  However, this doesn't work when the process re-reads the
+ * open file's data via mmap(2) (unless the user unmaps/closes the file and
+ * remaps/reopens it).  Once we respond to ->readpage(s), then the kernel
+ * maps the page into the process's address space and there doesn't appear
+ * to be a way to force the kernel to invalidate those pages/mappings, and
+ * force the process to re-issue ->readpage.  If there's a way to invalidate
+ * active mappings and force a ->readpage, let us know please
+ * (invalidate_inode_pages2 doesn't do the trick).
+ */
+static inline void purge_inode_data(struct inode *inode)
+{
+	/* remove all non-private mappings */
+	unmap_mapping_range(inode->i_mapping, 0, 0, 0);
+
+	if (inode->i_data.nrpages)
+		truncate_inode_pages(&inode->i_data, 0);
+}
+
+/*
+ * Revalidate a parent chain of dentries, then the actual node.
+ * Assumes that dentry is locked, but will lock all parents if/when needed.
+ *
+ * If 'willwrite' is true, and the lower inode times are not in sync, then
+ * *don't* purge_inode_data, as it could deadlock if ->write calls us and we
+ * try to truncate a locked page.  Besides, if unionfs is about to write
+ * data to a file, then there's the data unionfs is about to write is more
+ * authoritative than what's below, therefore we can safely overwrite the
+ * lower inode times and data.
+ */
+bool __unionfs_d_revalidate_chain(struct dentry *dentry, struct nameidata *nd,
+				  bool willwrite)
+{
+	bool valid = false;	/* default is invalid */
+	struct dentry **chain = NULL; /* chain of dentries to reval */
+	int chain_len = 0;
+	struct dentry *dtmp;
+	int sbgen, dgen, i;
+	int saved_bstart, saved_bend, bindex;
+
+	/* find length of chain needed to revalidate */
+	/* XXX: should I grab some global (dcache?) lock? */
+	chain_len = 0;
+	sbgen = atomic_read(&UNIONFS_SB(dentry->d_sb)->generation);
+	dtmp = dentry->d_parent;
+	dgen = atomic_read(&UNIONFS_D(dtmp)->generation);
+	/* XXX: should we check if is_newer_lower all the way up? */
+	if (unlikely(is_newer_lower(dtmp))) {
+		/*
+		 * Special case: the root dentry's generation number must
+		 * always be valid, but its lower inode times don't have to
+		 * be, so sync up the times only.
+		 */
+		if (IS_ROOT(dtmp)) {
+			unionfs_copy_attr_times(dtmp->d_inode);
+		} else {
+			/*
+			 * reset generation number to zero, guaranteed to be
+			 * "old"
+			 */
+			dgen = 0;
+			atomic_set(&UNIONFS_D(dtmp)->generation, dgen);
+		}
+		purge_inode_data(dtmp->d_inode);
+	}
+	while (sbgen != dgen) {
+		/* The root entry should always be valid */
+		BUG_ON(IS_ROOT(dtmp));
+		chain_len++;
+		dtmp = dtmp->d_parent;
+		dgen = atomic_read(&UNIONFS_D(dtmp)->generation);
+	}
+	if (chain_len == 0)
+		goto out_this;	/* shortcut if parents are OK */
+
+	/*
+	 * Allocate array of dentries to reval.  We could use linked lists,
+	 * but the number of entries we need to alloc here is often small,
+	 * and short lived, so locality will be better.
+	 */
+	chain = kzalloc(chain_len * sizeof(struct dentry *), GFP_KERNEL);
+	if (unlikely(!chain)) {
+		printk(KERN_CRIT "unionfs: no more memory in %s\n",
+		       __FUNCTION__);
+		goto out;
+	}
+
+	/*
+	 * Take a reference on each dentry in the chain, walking from child
+	 * to parent; locking happens in the revalidation loop below.
+	 */
+	dtmp = dentry->d_parent;
+	for (i = chain_len-1; i >= 0; i--) {
+		chain[i] = dget(dtmp);
+		dtmp = dtmp->d_parent;
+	}
+
+	/*
+	 * call __unionfs_d_revalidate_one() on each dentry, but in parent
+	 * to child order.
+	 */
+	for (i = 0; i < chain_len; i++) {
+		unionfs_lock_dentry(chain[i]);
+		saved_bstart = dbstart(chain[i]);
+		saved_bend = dbend(chain[i]);
+		sbgen = atomic_read(&UNIONFS_SB(dentry->d_sb)->generation);
+		dgen = atomic_read(&UNIONFS_D(chain[i])->generation);
+
+		valid = __unionfs_d_revalidate_one(chain[i], nd);
+		/* XXX: is this the correct mntput condition?! */
+		if (valid && chain_len > 0 &&
+		    sbgen != dgen && chain[i]->d_inode &&
+		    S_ISDIR(chain[i]->d_inode->i_mode)) {
+			for (bindex = saved_bstart; bindex <= saved_bend;
+			     bindex++)
+				unionfs_mntput(chain[i], bindex);
+		}
+		unionfs_unlock_dentry(chain[i]);
+
+		if (unlikely(!valid))
+			goto out_free;
+	}
+
+
+out_this:
+	/* finally, lock this dentry and revalidate it */
+	verify_locked(dentry);
+	dgen = atomic_read(&UNIONFS_D(dentry)->generation);
+
+	if (unlikely(is_newer_lower(dentry))) {
+		/* root dentry special case as aforementioned */
+		if (IS_ROOT(dentry)) {
+			unionfs_copy_attr_times(dentry->d_inode);
+		} else {
+			/*
+			 * reset generation number to zero, guaranteed to be
+			 * "old"
+			 */
+			dgen = 0;
+			atomic_set(&UNIONFS_D(dentry)->generation, dgen);
+		}
+		if (!willwrite)
+			purge_inode_data(dentry->d_inode);
+	}
+	valid = __unionfs_d_revalidate_one(dentry, nd);
+
+	/*
+	 * If __unionfs_d_revalidate_one() succeeded above, then it will
+	 * have incremented the refcnt of the mnt's, but also the branch
+	 * indices of the dentry will have been updated (to take into
+	 * account any branch insertions/deletions).  So the current
+	 * dbstart/dbend match the current, and new, indices of the mnts
+	 * which __unionfs_d_revalidate_one has incremented.  Note: the "if"
+	 * test below does not depend on whether chain_len was 0 or greater.
+	 */
+	if (valid && sbgen != dgen)
+		for (bindex = dbstart(dentry);
+		     bindex <= dbend(dentry);
+		     bindex++)
+			unionfs_mntput(dentry, bindex);
+
+out_free:
+	/* dput all dentries in chain and return status */
+	if (chain_len > 0) {
+		for (i = 0; i < chain_len; i++)
+			dput(chain[i]);
+		kfree(chain);
+	}
+out:
+	return valid;
+}
+
+static int unionfs_d_revalidate(struct dentry *dentry, struct nameidata *nd)
+{
+	int err;
+
+	unionfs_read_lock(dentry->d_sb);
+
+	unionfs_lock_dentry(dentry);
+	err = __unionfs_d_revalidate_chain(dentry, nd, false);
+	if (likely(err > 0)) { /* true==1: dentry is valid */
+		unionfs_check_dentry(dentry);
+		unionfs_check_nd(nd);
+	}
+	unionfs_unlock_dentry(dentry);
+
+	unionfs_read_unlock(dentry->d_sb);
+
+	return err;
+}
+
+/*
+ * At this point no one can reference this dentry, so we don't have to be
+ * careful about concurrent access.
+ */
+static void unionfs_d_release(struct dentry *dentry)
+{
+	int bindex, bstart, bend;
+
+	unionfs_read_lock(dentry->d_sb);
+
+	unionfs_check_dentry(dentry);
+	/* this could be a negative dentry, so check first */
+	if (unlikely(!UNIONFS_D(dentry))) {
+		printk(KERN_ERR "unionfs: dentry without private data: %.*s\n",
+		       dentry->d_name.len, dentry->d_name.name);
+		goto out;
+	} else if (dbstart(dentry) < 0)
+		goto out_free;  /* due to a (normal) failed lookup */
+
+	/* Release all the lower dentries */
+	bstart = dbstart(dentry);
+	bend = dbend(dentry);
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		dput(unionfs_lower_dentry_idx(dentry, bindex));
+		unionfs_set_lower_dentry_idx(dentry, bindex, NULL);
+		/* NULL lower mnt is ok if this is a negative dentry */
+		if (!dentry->d_inode && !unionfs_lower_mnt_idx(dentry, bindex))
+			continue;
+		unionfs_mntput(dentry, bindex);
+		unionfs_set_lower_mnt_idx(dentry, bindex, NULL);
+	}
+	/* free private data (unionfs_dentry_info) here */
+	kfree(UNIONFS_D(dentry)->lower_paths);
+	UNIONFS_D(dentry)->lower_paths = NULL;
+
+out_free:
+	/* No need to unlock it, because it is going away. */
+	free_dentry_private_data(dentry);
+
+out:
+	unionfs_read_unlock(dentry->d_sb);
+	return;
+}
+
+struct dentry_operations unionfs_dops = {
+	.d_revalidate	= unionfs_d_revalidate,
+	.d_release	= unionfs_d_release,
+};
-- 
1.5.2.2
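
The time check described in the is_newer_lower() comment above can be
modelled in user space roughly as follows.  This is only a sketch: the
helper names and the MIN_CC_TIME constant are hypothetical stand-ins
(the patch uses timespec_compare() and UNIONFS_MIN_CC_TIME), and it
handles a single pair of timestamps rather than looping over branches.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical threshold, mirroring the 3-second default cited above. */
#define MIN_CC_TIME 3

/* Order two timespecs: <0 if a is earlier, 0 if equal, >0 if a is later. */
static int ts_compare(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec ? -1 : 1;
	if (a->tv_nsec != b->tv_nsec)
		return a->tv_nsec < b->tv_nsec ? -1 : 1;
	return 0;
}

/* True if the lower time is strictly newer than the upper (cached) time. */
static bool lower_is_newer(const struct timespec *upper,
			   const struct timespec *lower)
{
	return ts_compare(upper, lower) < 0;
}

/* True if the gap exceeds the threshold, i.e. is worth a log message. */
static bool worth_logging(const struct timespec *upper,
			  const struct timespec *lower)
{
	return (lower->tv_sec - upper->tv_sec) > MIN_CC_TIME;
}
```

Any strictly newer lower time triggers revalidation; the threshold only
decides whether the change gets reported, which matches the split between
"return true" and the pr_info() calls in the patch.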

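The generation-number walk in __unionfs_d_revalidate_chain() can be
pictured with the small model below: count how many consecutive ancestors
carry a generation older than the superblock's, collect them into an array,
and bring them up to date in parent-to-child order.  Everything here
(struct node, stale_chain_len, revalidate_chain) is a hypothetical
illustration, not code from the patch; real revalidation redoes the lookup
rather than just copying a number, and the kernel code BUG()s on a stale
root where this sketch simply stops.

```c
#include <stddef.h>

/* Hypothetical stand-in for a unionfs dentry and its generation. */
struct node {
	struct node *parent;	/* the root is its own parent */
	int generation;
};

/* Count consecutive stale ancestors of n, starting from its parent. */
static int stale_chain_len(const struct node *n, int sbgen)
{
	const struct node *p = n->parent;
	int len = 0;

	while (p->generation != sbgen) {
		if (p->parent == p)
			break;	/* stale root: the kernel code BUG()s here */
		len++;
		p = p->parent;
	}
	return len;
}

/*
 * Fill chain[0..len-1] so that chain[0] is the topmost stale ancestor,
 * then update each entry from parent to child.
 */
static void revalidate_chain(struct node *n, int sbgen,
			     struct node **chain, int len)
{
	struct node *p = n->parent;
	int i;

	for (i = len - 1; i >= 0; i--) {	/* collect child to parent */
		chain[i] = p;
		p = p->parent;
	}
	for (i = 0; i < len; i++)		/* process parent to child */
		chain[i]->generation = sbgen;
}
```

Filling the array from the end is what lets the second loop run top-down,
so each entry is processed only after its parent is already valid.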

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 16/42] Unionfs: lower-level lookup routines
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (14 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 15/42] Unionfs: dentry revalidation Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 17/42] Unionfs: rename method and helpers Erez Zadok
                   ` (26 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Includes lower nameidata support routines.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/lookup.c |  652 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 652 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/lookup.c

diff --git a/fs/unionfs/lookup.c b/fs/unionfs/lookup.c
new file mode 100644
index 0000000..a1904c9
--- /dev/null
+++ b/fs/unionfs/lookup.c
@@ -0,0 +1,652 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+static int realloc_dentry_private_data(struct dentry *dentry);
+
+/* is the filename valid == !(whiteout for a file or opaque dir marker) */
+static int is_validname(const char *name)
+{
+	if (!strncmp(name, UNIONFS_WHPFX, UNIONFS_WHLEN))
+		return 0;
+	if (!strncmp(name, UNIONFS_DIR_OPAQUE_NAME,
+		     sizeof(UNIONFS_DIR_OPAQUE_NAME) - 1))
+		return 0;
+	return 1;
+}
+
+/* The rest of these are utility functions for lookup. */
+static noinline int is_opaque_dir(struct dentry *dentry, int bindex)
+{
+	int err = 0;
+	struct dentry *lower_dentry;
+	struct dentry *wh_lower_dentry;
+	struct inode *lower_inode;
+	struct sioq_args args;
+
+	lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+	lower_inode = lower_dentry->d_inode;
+
+	BUG_ON(!S_ISDIR(lower_inode->i_mode));
+
+	mutex_lock(&lower_inode->i_mutex);
+
+	if (!permission(lower_inode, MAY_EXEC, NULL)) {
+		wh_lower_dentry =
+			lookup_one_len(UNIONFS_DIR_OPAQUE, lower_dentry,
+				       sizeof(UNIONFS_DIR_OPAQUE) - 1);
+	} else {
+		args.is_opaque.dentry = lower_dentry;
+		run_sioq(__is_opaque_dir, &args);
+		wh_lower_dentry = args.ret;
+	}
+
+	mutex_unlock(&lower_inode->i_mutex);
+
+	if (IS_ERR(wh_lower_dentry)) {
+		err = PTR_ERR(wh_lower_dentry);
+		goto out;
+	}
+
+	/* This is an opaque dir iff wh_lower_dentry is positive */
+	err = !!wh_lower_dentry->d_inode;
+
+	dput(wh_lower_dentry);
+out:
+	return err;
+}
+
+/*
+ * Main (and complex) driver function for Unionfs's lookup
+ *
+ * Returns: NULL (ok), ERR_PTR if an error occurred, or a non-null non-error
+ * PTR if d_splice returned a different dentry.
+ *
+ * If lookupmode is INTERPOSE_PARTIAL/REVAL/REVAL_NEG, the passed dentry's
+ * inode info must be locked.  If lookupmode is INTERPOSE_LOOKUP (i.e., a
+ * newly looked-up dentry), then unionfs_lookup_backend will return a locked
+ * dentry's info, which the caller must unlock.
+ */
+struct dentry *unionfs_lookup_backend(struct dentry *dentry,
+				      struct nameidata *nd, int lookupmode)
+{
+	int err = 0;
+	struct dentry *lower_dentry = NULL;
+	struct dentry *wh_lower_dentry = NULL;
+	struct dentry *lower_dir_dentry = NULL;
+	struct dentry *parent_dentry = NULL;
+	struct dentry *d_interposed = NULL;
+	int bindex, bstart = -1, bend, bopaque;
+	int dentry_count = 0;	/* Number of positive dentries. */
+	int first_dentry_offset = -1; /* -1 is uninitialized */
+	struct dentry *first_dentry = NULL;
+	struct dentry *first_lower_dentry = NULL;
+	struct vfsmount *first_lower_mnt = NULL;
+	int locked_parent = 0;
+	int opaque;
+	char *whname = NULL;
+	const char *name;
+	int namelen;
+
+	/*
+	 * We should already have a lock on this dentry in the case of a
+	 * partial lookup, or a revalidation. Otherwise it is returned from
+	 * new_dentry_private_data already locked.
+	 */
+	if (lookupmode == INTERPOSE_PARTIAL || lookupmode == INTERPOSE_REVAL ||
+	    lookupmode == INTERPOSE_REVAL_NEG)
+		verify_locked(dentry);
+	else			/* this could only be INTERPOSE_LOOKUP */
+		BUG_ON(UNIONFS_D(dentry) != NULL);
+
+	switch (lookupmode) {
+	case INTERPOSE_PARTIAL:
+		break;
+	case INTERPOSE_LOOKUP:
+		err = new_dentry_private_data(dentry);
+		if (unlikely(err))
+			goto out;
+		break;
+	default:
+		/* default: can only be INTERPOSE_REVAL/REVAL_NEG */
+		err = realloc_dentry_private_data(dentry);
+		if (unlikely(err))
+			goto out;
+		break;
+	}
+
+	/* must initialize dentry operations */
+	dentry->d_op = &unionfs_dops;
+
+	parent_dentry = dget_parent(dentry);
+	/* We never partial lookup the root directory. */
+	if (parent_dentry != dentry) {
+		unionfs_lock_dentry(parent_dentry);
+		locked_parent = 1;
+	} else {
+		dput(parent_dentry);
+		parent_dentry = NULL;
+		goto out;
+	}
+
+	name = dentry->d_name.name;
+	namelen = dentry->d_name.len;
+
+	/* No dentries should get created for possible whiteout names. */
+	if (!is_validname(name)) {
+		err = -EPERM;
+		goto out_free;
+	}
+
+	/* Now start the actual lookup procedure. */
+	bstart = dbstart(parent_dentry);
+	bend = dbend(parent_dentry);
+	bopaque = dbopaque(parent_dentry);
+	BUG_ON(bstart < 0);
+
+	/*
+	 * It would be ideal if we could convert partial lookups to only have
+	 * to do this work when they really need to.  It could probably improve
+	 * performance quite a bit, and maybe simplify the rest of the code.
+	 */
+	if (lookupmode == INTERPOSE_PARTIAL) {
+		bstart++;
+		if ((bopaque != -1) && (bopaque < bend))
+			bend = bopaque;
+	}
+
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+		if (lookupmode == INTERPOSE_PARTIAL && lower_dentry)
+			continue;
+		BUG_ON(lower_dentry != NULL);
+
+		lower_dir_dentry =
+			unionfs_lower_dentry_idx(parent_dentry, bindex);
+
+		/* if the parent lower dentry does not exist skip this */
+		if (!(lower_dir_dentry && lower_dir_dentry->d_inode))
+			continue;
+
+		/* also skip it if the parent isn't a directory. */
+		if (!S_ISDIR(lower_dir_dentry->d_inode->i_mode))
+			continue;
+
+		/* Reuse the whiteout name because its value doesn't change. */
+		if (!whname) {
+			whname = alloc_whname(name, namelen);
+			if (unlikely(IS_ERR(whname))) {
+				err = PTR_ERR(whname);
+				goto out_free;
+			}
+		}
+
+		/* check if whiteout exists in this branch: lookup .wh.foo */
+		wh_lower_dentry = lookup_one_len(whname, lower_dir_dentry,
+						 namelen + UNIONFS_WHLEN);
+		if (IS_ERR(wh_lower_dentry)) {
+			dput(first_lower_dentry);
+			unionfs_mntput(first_dentry, first_dentry_offset);
+			err = PTR_ERR(wh_lower_dentry);
+			goto out_free;
+		}
+
+		if (wh_lower_dentry->d_inode) {
+			/* We found a whiteout so let's give up. */
+			if (S_ISREG(wh_lower_dentry->d_inode->i_mode)) {
+				set_dbend(dentry, bindex);
+				set_dbopaque(dentry, bindex);
+				dput(wh_lower_dentry);
+				break;
+			}
+			err = -EIO;
+			printk(KERN_ERR "unionfs: EIO: invalid whiteout "
+			       "entry type %d\n",
+			       wh_lower_dentry->d_inode->i_mode);
+			dput(wh_lower_dentry);
+			dput(first_lower_dentry);
+			unionfs_mntput(first_dentry, first_dentry_offset);
+			goto out_free;
+		}
+
+		dput(wh_lower_dentry);
+		wh_lower_dentry = NULL;
+
+		/* Now do regular lookup; lookup foo */
+		lower_dentry = lookup_one_len(name, lower_dir_dentry, namelen);
+		if (IS_ERR(lower_dentry)) {
+			dput(first_lower_dentry);
+			unionfs_mntput(first_dentry, first_dentry_offset);
+			err = PTR_ERR(lower_dentry);
+			goto out_free;
+		}
+
+		/*
+		 * Store the first negative dentry specially, because if they
+		 * are all negative we need this for future creates.
+		 */
+		if (!lower_dentry->d_inode) {
+			if (!first_lower_dentry && (dbstart(dentry) == -1)) {
+				first_lower_dentry = lower_dentry;
+				/*
+				 * FIXME: following line needs to be changed
+				 * to allow mount-point crossing
+				 */
+				first_dentry = parent_dentry;
+				first_lower_mnt =
+					unionfs_mntget(parent_dentry, bindex);
+				first_dentry_offset = bindex;
+			} else {
+				dput(lower_dentry);
+			}
+
+			continue;
+		}
+
+		/* number of positive dentries */
+		dentry_count++;
+
+		/* store underlying dentry */
+		if (dbstart(dentry) == -1)
+			set_dbstart(dentry, bindex);
+		unionfs_set_lower_dentry_idx(dentry, bindex, lower_dentry);
+		/*
+		 * FIXME: the following line needs to get fixed to allow
+		 * mount-point crossing
+		 */
+		unionfs_set_lower_mnt_idx(dentry, bindex,
+					  unionfs_mntget(parent_dentry,
+							 bindex));
+		set_dbend(dentry, bindex);
+
+		/* update parent directory's atime with the bindex */
+		fsstack_copy_attr_atime(parent_dentry->d_inode,
+					lower_dir_dentry->d_inode);
+
+		/* We terminate file lookups here. */
+		if (!S_ISDIR(lower_dentry->d_inode->i_mode)) {
+			if (lookupmode == INTERPOSE_PARTIAL)
+				continue;
+			if (dentry_count == 1)
+				goto out_positive;
+			/* This can only happen with mixed D-*-F-* */
+			BUG_ON(!S_ISDIR(unionfs_lower_dentry(dentry)->
+					d_inode->i_mode));
+			continue;
+		}
+
+		opaque = is_opaque_dir(dentry, bindex);
+		if (opaque < 0) {
+			dput(first_lower_dentry);
+			unionfs_mntput(first_dentry, first_dentry_offset);
+			err = opaque;
+			goto out_free;
+		} else if (opaque) {
+			set_dbend(dentry, bindex);
+			set_dbopaque(dentry, bindex);
+			break;
+		}
+	}
+
+	if (dentry_count)
+		goto out_positive;
+	else
+		goto out_negative;
+
+out_negative:
+	if (lookupmode == INTERPOSE_PARTIAL)
+		goto out;
+
+	/* If we've only got negative dentries, then use the leftmost one. */
+	if (lookupmode == INTERPOSE_REVAL) {
+		if (dentry->d_inode)
+			UNIONFS_I(dentry->d_inode)->stale = 1;
+		goto out;
+	}
+	/* This should only happen if we found a whiteout. */
+	if (first_dentry_offset == -1) {
+		first_lower_dentry = lookup_one_len(name, lower_dir_dentry,
+						    namelen);
+		first_dentry_offset = bindex;
+		if (IS_ERR(first_lower_dentry)) {
+			err = PTR_ERR(first_lower_dentry);
+			goto out;
+		}
+
+		/*
+		 * FIXME: the following line needs to be changed to allow
+		 * mount-point crossing
+		 */
+		first_dentry = dentry;
+		first_lower_mnt = unionfs_mntget(dentry->d_sb->s_root,
+						 bindex);
+	}
+	unionfs_set_lower_dentry_idx(dentry, first_dentry_offset,
+				     first_lower_dentry);
+	unionfs_set_lower_mnt_idx(dentry, first_dentry_offset,
+				  first_lower_mnt);
+	set_dbstart(dentry, first_dentry_offset);
+	set_dbend(dentry, first_dentry_offset);
+
+	if (lookupmode == INTERPOSE_REVAL_NEG)
+		BUG_ON(dentry->d_inode != NULL);
+	else
+		d_add(dentry, NULL);
+	goto out;
+
+/* This part of the code is for positive dentries. */
+out_positive:
+	BUG_ON(dentry_count <= 0);
+
+	/*
+	 * If we're holding onto the first negative dentry & corresponding
+	 * vfsmount - throw it out.
+	 */
+	dput(first_lower_dentry);
+	unionfs_mntput(first_dentry, first_dentry_offset);
+
+	/* Partial lookups need to re-interpose, or throw away older negs. */
+	if (lookupmode == INTERPOSE_PARTIAL) {
+		if (dentry->d_inode) {
+			unionfs_reinterpose(dentry);
+			goto out;
+		}
+
+		/*
+		 * This somehow turned positive, so it is as if we had a
+		 * negative revalidation.
+		 */
+		lookupmode = INTERPOSE_REVAL_NEG;
+
+		update_bstart(dentry);
+		bstart = dbstart(dentry);
+		bend = dbend(dentry);
+	}
+
+	/*
+	 * Interpose can return a dentry if d_splice returned a different
+	 * dentry.
+	 */
+	d_interposed = unionfs_interpose(dentry, dentry->d_sb, lookupmode);
+	if (IS_ERR(d_interposed))
+		err = PTR_ERR(d_interposed);
+	else if (d_interposed)
+		dentry = d_interposed;
+
+	if (err)
+		goto out_drop;
+
+	goto out;
+
+out_drop:
+	d_drop(dentry);
+
+out_free:
+	/* should dput all the underlying dentries on error condition */
+	bstart = dbstart(dentry);
+	if (bstart >= 0) {
+		bend = dbend(dentry);
+		for (bindex = bstart; bindex <= bend; bindex++) {
+			dput(unionfs_lower_dentry_idx(dentry, bindex));
+			unionfs_mntput(dentry, bindex);
+		}
+	}
+	kfree(UNIONFS_D(dentry)->lower_paths);
+	UNIONFS_D(dentry)->lower_paths = NULL;
+	set_dbstart(dentry, -1);
+	set_dbend(dentry, -1);
+
+out:
+	if (!err && UNIONFS_D(dentry)) {
+		BUG_ON(dbend(dentry) > UNIONFS_D(dentry)->bcount);
+		BUG_ON(dbend(dentry) > sbmax(dentry->d_sb));
+		if (dbstart(dentry) < 0 &&
+		    dentry->d_inode && bstart >= 0 &&
+		    (!UNIONFS_I(dentry->d_inode) ||
+		     !UNIONFS_I(dentry->d_inode)->lower_inodes)) {
+			unionfs_mntput(dentry->d_sb->s_root, bstart);
+			dput(first_lower_dentry);
+			UNIONFS_I(dentry->d_inode)->stale = 1;
+		}
+	}
+	kfree(whname);
+	if (locked_parent)
+		unionfs_unlock_dentry(parent_dentry);
+	dput(parent_dentry);
+	if (err && (lookupmode == INTERPOSE_LOOKUP))
+		unionfs_unlock_dentry(dentry);
+	if (!err && d_interposed)
+		return d_interposed;
+	if (dentry->d_inode && UNIONFS_I(dentry->d_inode)->stale &&
+	    first_dentry_offset >= 0)
+		unionfs_mntput(dentry->d_sb->s_root, first_dentry_offset);
+	return ERR_PTR(err);
+}
+
+/*
+ * This is a utility function that fills in a unionfs dentry.
+ *
+ * Returns: 0 (ok), or -ERRNO if an error occurred.
+ */
+int unionfs_partial_lookup(struct dentry *dentry)
+{
+	struct dentry *tmp;
+	struct nameidata nd = { .flags = 0 };
+	int err = -ENOSYS;
+
+	tmp = unionfs_lookup_backend(dentry, &nd, INTERPOSE_PARTIAL);
+	if (!tmp) {
+		err = 0;
+		goto out;
+	}
+	if (IS_ERR(tmp)) {
+		err = PTR_ERR(tmp);
+		goto out;
+	}
+	/* need to change the interface */
+	BUG_ON(tmp != dentry);
+out:
+	return err;
+}
+
+/* The dentry cache is just so we have properly sized dentries. */
+static struct kmem_cache *unionfs_dentry_cachep;
+int unionfs_init_dentry_cache(void)
+{
+	unionfs_dentry_cachep =
+		kmem_cache_create("unionfs_dentry",
+				  sizeof(struct unionfs_dentry_info),
+				  0, SLAB_RECLAIM_ACCOUNT, NULL);
+
+	return (unionfs_dentry_cachep ? 0 : -ENOMEM);
+}
+
+void unionfs_destroy_dentry_cache(void)
+{
+	if (unionfs_dentry_cachep)
+		kmem_cache_destroy(unionfs_dentry_cachep);
+}
+
+void free_dentry_private_data(struct dentry *dentry)
+{
+	if (!dentry || !dentry->d_fsdata)
+		return;
+	kmem_cache_free(unionfs_dentry_cachep, dentry->d_fsdata);
+	dentry->d_fsdata = NULL;
+}
+
+static inline int __realloc_dentry_private_data(struct dentry *dentry)
+{
+	struct unionfs_dentry_info *info = UNIONFS_D(dentry);
+	void *p;
+	int size;
+
+	BUG_ON(!info);
+
+	size = sizeof(struct path) * sbmax(dentry->d_sb);
+	p = krealloc(info->lower_paths, size, GFP_ATOMIC);
+	if (unlikely(!p))
+		return -ENOMEM;
+
+	info->lower_paths = p;
+
+	info->bstart = -1;
+	info->bend = -1;
+	info->bopaque = -1;
+	info->bcount = sbmax(dentry->d_sb);
+	atomic_set(&info->generation,
+			atomic_read(&UNIONFS_SB(dentry->d_sb)->generation));
+
+	memset(info->lower_paths, 0, size);
+
+	return 0;
+}
+
+/* UNIONFS_D(dentry)->lock must be locked */
+static int realloc_dentry_private_data(struct dentry *dentry)
+{
+	if (!__realloc_dentry_private_data(dentry))
+		return 0;
+
+	kfree(UNIONFS_D(dentry)->lower_paths);
+	free_dentry_private_data(dentry);
+	return -ENOMEM;
+}
+
+/* allocate new dentry private data */
+int new_dentry_private_data(struct dentry *dentry)
+{
+	struct unionfs_dentry_info *info = UNIONFS_D(dentry);
+
+	BUG_ON(info);
+
+	info = kmem_cache_alloc(unionfs_dentry_cachep, GFP_ATOMIC);
+	if (unlikely(!info))
+		return -ENOMEM;
+
+	mutex_init(&info->lock);
+	mutex_lock(&info->lock);
+
+	info->lower_paths = NULL;
+
+	dentry->d_fsdata = info;
+
+	if (!__realloc_dentry_private_data(dentry))
+		return 0;
+
+	mutex_unlock(&info->lock);
+	free_dentry_private_data(dentry);
+	return -ENOMEM;
+}
+
+/*
+ * scan through the lower dentry objects, and set bstart to reflect the
+ * starting branch
+ */
+void update_bstart(struct dentry *dentry)
+{
+	int bindex;
+	int bstart = dbstart(dentry);
+	int bend = dbend(dentry);
+	struct dentry *lower_dentry;
+
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+		if (!lower_dentry)
+			continue;
+		if (lower_dentry->d_inode) {
+			set_dbstart(dentry, bindex);
+			break;
+		}
+		dput(lower_dentry);
+		unionfs_set_lower_dentry_idx(dentry, bindex, NULL);
+	}
+}
+
+
+/*
+ * Initialize a nameidata structure (the intent part) we can pass to a lower
+ * file system.  Returns 0 on success or -error (only -ENOMEM possible).
+ * Inside that nd structure, this function may also return an allocated
+ * struct file (for open intents).  The caller, when done with this nd, must
+ * kfree the intent file (using release_lower_nd).
+ *
+ * XXX: this code, and the callers of this code, should be redone using
+ * vfs_path_lookup() when (1) the nameidata structure is refactored into a
+ * separate intent-structure, and (2) open_namei() is broken into a VFS-only
+ * function and a method that other file systems can call.
+ */
+int init_lower_nd(struct nameidata *nd, unsigned int flags)
+{
+	int err = 0;
+#ifdef ALLOC_LOWER_ND_FILE
+	/*
+	 * XXX: one day we may need to have the lower return an open file
+	 * for us.  It is not needed in 2.6.23-rc1 for nfs2/nfs3, but may
+	 * very well be needed for nfs4.
+	 */
+	struct file *file;
+#endif /* ALLOC_LOWER_ND_FILE */
+
+	memset(nd, 0, sizeof(struct nameidata));
+	if (!flags)
+		return err;
+
+	switch (flags) {
+	case LOOKUP_CREATE:
+		nd->intent.open.flags |= O_CREAT;
+		/* fall through: shared code for create/open cases */
+	case LOOKUP_OPEN:
+		nd->flags = flags;
+		nd->intent.open.flags |= (FMODE_READ | FMODE_WRITE);
+#ifdef ALLOC_LOWER_ND_FILE
+		file = kzalloc(sizeof(struct file), GFP_KERNEL);
+		if (unlikely(!file)) {
+			err = -ENOMEM;
+			break; /* exit switch statement and thus return */
+		}
+		nd->intent.open.file = file;
+#endif /* ALLOC_LOWER_ND_FILE */
+		break;
+	case LOOKUP_ACCESS:
+		nd->flags = flags;
+		break;
+	default:
+		/*
+		 * We should never get here, for now.
+		 * We can add new cases here later on.
+		 */
+		pr_debug("unionfs: unknown nameidata flag 0x%x\n", flags);
+		BUG();
+		break;
+	}
+
+	return err;
+}
+
+void release_lower_nd(struct nameidata *nd, int err)
+{
+	if (!nd->intent.open.file)
+		return;
+	else if (!err)
+		release_open_intent(nd);
+#ifdef ALLOC_LOWER_ND_FILE
+	kfree(nd->intent.open.file);
+#endif /* ALLOC_LOWER_ND_FILE */
+}
-- 
1.5.2.2
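
Illustration only, not part of the patch: a minimal userspace sketch of
the scan that update_bstart() performs above.  It walks the branches from
bstart to bend, clears negative lower entries on the way, and stops at the
first positive one.  The struct and function names here (model_dentry,
model_update_bstart) are hypothetical stand-ins, and the kernel's dput()
reference drop is only noted in a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a lower dentry: only "has an inode" matters. */
struct model_dentry {
	bool positive;	/* true if the lower object exists (d_inode != NULL) */
};

/*
 * Model of update_bstart(): walk branches bstart..bend, clear negative
 * entries, and return the index of the first positive one.  If no
 * positive entry is found, the original bstart is returned unchanged.
 */
static int model_update_bstart(struct model_dentry **lower,
			       int bstart, int bend)
{
	int bindex;

	assert(lower != NULL && bstart >= 0);
	for (bindex = bstart; bindex <= bend; bindex++) {
		if (!lower[bindex])
			continue;		/* empty slot, keep scanning */
		if (lower[bindex]->positive)
			return bindex;		/* new starting branch */
		lower[bindex] = NULL;		/* kernel code also dput()s it */
	}
	return bstart;
}
```

With lower entries { negative, NULL, positive, positive } and a scan over
0..3, the sketch returns 2 and leaves slot 0 cleared.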



* [PATCH 17/42] Unionfs: rename method and helpers
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (15 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 16/42] Unionfs: lower-level lookup routines Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 18/42] Unionfs: directory reading file operations Erez Zadok
                   ` (25 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/rename.c |  533 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 533 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/rename.c

diff --git a/fs/unionfs/rename.c b/fs/unionfs/rename.c
new file mode 100644
index 0000000..452d1e7
--- /dev/null
+++ b/fs/unionfs/rename.c
@@ -0,0 +1,533 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+static int __unionfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+			    struct inode *new_dir, struct dentry *new_dentry,
+			    int bindex, struct dentry **wh_old)
+{
+	int err = 0;
+	struct dentry *lower_old_dentry;
+	struct dentry *lower_new_dentry;
+	struct dentry *lower_old_dir_dentry;
+	struct dentry *lower_new_dir_dentry;
+	struct dentry *lower_wh_dentry;
+	struct dentry *lower_wh_dir_dentry;
+	char *wh_name = NULL;
+
+	lower_new_dentry = unionfs_lower_dentry_idx(new_dentry, bindex);
+	lower_old_dentry = unionfs_lower_dentry_idx(old_dentry, bindex);
+
+	if (!lower_new_dentry) {
+		lower_new_dentry =
+			create_parents(new_dentry->d_parent->d_inode,
+				       new_dentry, new_dentry->d_name.name,
+				       bindex);
+		if (IS_ERR(lower_new_dentry)) {
+			err = PTR_ERR(lower_new_dentry);
+			if (IS_COPYUP_ERR(err))
+				goto out;
+			printk(KERN_ERR "unionfs: error creating directory "
+			       "tree for rename, bindex=%d err=%d\n",
+			       bindex, err);
+			goto out;
+		}
+	}
+
+	wh_name = alloc_whname(new_dentry->d_name.name,
+			       new_dentry->d_name.len);
+	if (unlikely(IS_ERR(wh_name))) {
+		err = PTR_ERR(wh_name);
+		goto out;
+	}
+
+	lower_wh_dentry = lookup_one_len(wh_name, lower_new_dentry->d_parent,
+					 new_dentry->d_name.len +
+					 UNIONFS_WHLEN);
+	if (IS_ERR(lower_wh_dentry)) {
+		err = PTR_ERR(lower_wh_dentry);
+		goto out;
+	}
+
+	if (lower_wh_dentry->d_inode) {
+		/* get rid of the existing whiteout */
+		if (lower_new_dentry->d_inode) {
+			printk(KERN_ERR "unionfs: both a whiteout and a "
+			       "dentry exist when doing a rename!\n");
+			err = -EIO;
+
+			dput(lower_wh_dentry);
+			goto out;
+		}
+
+		lower_wh_dir_dentry = lock_parent(lower_wh_dentry);
+		err = is_robranch_super(old_dentry->d_sb, bindex);
+		if (!err)
+			err = vfs_unlink(lower_wh_dir_dentry->d_inode,
+					 lower_wh_dentry);
+
+		dput(lower_wh_dentry);
+		unlock_dir(lower_wh_dir_dentry);
+		if (err)
+			goto out;
+	} else {
+		dput(lower_wh_dentry);
+	}
+
+	dget(lower_old_dentry);
+	lower_old_dir_dentry = dget_parent(lower_old_dentry);
+	lower_new_dir_dentry = dget_parent(lower_new_dentry);
+
+	lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
+
+	err = is_robranch_super(old_dentry->d_sb, bindex);
+	if (err)
+		goto out_unlock;
+
+	/*
+	 * Look up the whiteout dentry for old_dentry.  The caller will create
+	 * the actual whiteout, and must dput(*wh_old).
+	 */
+	if (wh_old) {
+		char *whname;
+		whname = alloc_whname(old_dentry->d_name.name,
+				      old_dentry->d_name.len);
+		err = PTR_ERR(whname);
+		if (unlikely(IS_ERR(whname)))
+			goto out_unlock;
+		*wh_old = lookup_one_len(whname, lower_old_dir_dentry,
+					 old_dentry->d_name.len +
+					 UNIONFS_WHLEN);
+		kfree(whname);
+		err = PTR_ERR(*wh_old);
+		if (IS_ERR(*wh_old)) {
+			*wh_old = NULL;
+			goto out_unlock;
+		}
+	}
+
+	err = vfs_rename(lower_old_dir_dentry->d_inode, lower_old_dentry,
+			 lower_new_dir_dentry->d_inode, lower_new_dentry);
+
+out_unlock:
+	unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
+
+	dput(lower_old_dir_dentry);
+	dput(lower_new_dir_dentry);
+	dput(lower_old_dentry);
+
+out:
+	if (!err) {
+		/* Fixup the new_dentry. */
+		if (bindex < dbstart(new_dentry))
+			set_dbstart(new_dentry, bindex);
+		else if (bindex > dbend(new_dentry))
+			set_dbend(new_dentry, bindex);
+	}
+
+	kfree(wh_name);
+
+	return err;
+}
+
+/*
+ * Main rename code.  This is sufficiently complex that it is documented in
+ * Documentation/filesystems/unionfs/rename.txt.  This routine calls
+ * __unionfs_rename() above to perform some of the work.
+ */
+static int do_unionfs_rename(struct inode *old_dir,
+			     struct dentry *old_dentry,
+			     struct inode *new_dir,
+			     struct dentry *new_dentry)
+{
+	int err = 0;
+	int bindex, bwh_old;
+	int old_bstart, old_bend;
+	int new_bstart, new_bend;
+	int do_copyup = -1;
+	struct dentry *parent_dentry;
+	int local_err = 0;
+	int eio = 0;
+	int revert = 0;
+	struct dentry *wh_old = NULL;
+
+	old_bstart = dbstart(old_dentry);
+	bwh_old = old_bstart;
+	old_bend = dbend(old_dentry);
+	parent_dentry = old_dentry->d_parent;
+
+	new_bstart = dbstart(new_dentry);
+	new_bend = dbend(new_dentry);
+
+	/* Rename source to destination. */
+	err = __unionfs_rename(old_dir, old_dentry, new_dir, new_dentry,
+			       old_bstart, &wh_old);
+	if (err) {
+		if (!IS_COPYUP_ERR(err))
+			goto out;
+		do_copyup = old_bstart - 1;
+	} else {
+		revert = 1;
+	}
+
+	/*
+	 * Unlink all instances of destination that exist to the left of
+	 * bstart of source. On error, revert back, goto out.
+	 */
+	for (bindex = old_bstart - 1; bindex >= new_bstart; bindex--) {
+		struct dentry *unlink_dentry;
+		struct dentry *unlink_dir_dentry;
+
+		unlink_dentry = unionfs_lower_dentry_idx(new_dentry, bindex);
+		if (!unlink_dentry)
+			continue;
+
+		unlink_dir_dentry = lock_parent(unlink_dentry);
+		err = is_robranch_super(old_dir->i_sb, bindex);
+		if (!err)
+			err = vfs_unlink(unlink_dir_dentry->d_inode,
+					 unlink_dentry);
+
+		fsstack_copy_attr_times(new_dentry->d_parent->d_inode,
+					unlink_dir_dentry->d_inode);
+		/* propagate number of hard-links */
+		new_dentry->d_parent->d_inode->i_nlink =
+			unionfs_get_nlinks(new_dentry->d_parent->d_inode);
+
+		unlock_dir(unlink_dir_dentry);
+		if (!err) {
+			if (bindex != new_bstart) {
+				dput(unlink_dentry);
+				unionfs_set_lower_dentry_idx(new_dentry,
+							     bindex, NULL);
+			}
+		} else if (IS_COPYUP_ERR(err)) {
+			do_copyup = bindex - 1;
+		} else if (revert) {
+			dput(wh_old);
+			goto revert;
+		}
+	}
+
+	if (do_copyup != -1) {
+		for (bindex = do_copyup; bindex >= 0; bindex--) {
+			/*
+			 * copyup the file into some left directory, so that
+			 * you can rename it
+			 */
+			err = copyup_dentry(old_dentry->d_parent->d_inode,
+					    old_dentry, old_bstart, bindex,
+					    old_dentry->d_name.name,
+					    old_dentry->d_name.len, NULL,
+					    i_size_read(old_dentry->d_inode));
+			/* if copyup failed, try next branch to the left */
+			if (err)
+				continue;
+			dput(wh_old);
+			bwh_old = bindex;
+			err = __unionfs_rename(old_dir, old_dentry,
+					       new_dir, new_dentry,
+					       bindex, &wh_old);
+			break;
+		}
+	}
+
+	/* make it opaque */
+	if (S_ISDIR(old_dentry->d_inode->i_mode)) {
+		err = make_dir_opaque(old_dentry, dbstart(old_dentry));
+		if (err)
+			goto revert;
+	}
+
+	/*
+	 * Create whiteout for source, only if:
+	 * (1) There is more than one underlying instance of source.
+	 * (2) We did a copy_up
+	 */
+	if ((old_bstart != old_bend) || (do_copyup != -1)) {
+		struct dentry *lower_parent;
+		struct nameidata nd;
+		if (!wh_old || wh_old->d_inode || bwh_old < 0) {
+			printk(KERN_ERR "unionfs: rename error "
+			       "(wh_old=%p/%p bwh_old=%d)\n", wh_old,
+			       (wh_old ? wh_old->d_inode : NULL), bwh_old);
+			err = -EIO;
+			goto out;
+		}
+		err = init_lower_nd(&nd, LOOKUP_CREATE);
+		if (unlikely(err < 0))
+			goto out;
+		lower_parent = lock_parent(wh_old);
+		local_err = vfs_create(lower_parent->d_inode, wh_old, S_IRUGO,
+				       &nd);
+		unlock_dir(lower_parent);
+		if (!local_err) {
+			set_dbopaque(old_dentry, bwh_old);
+		} else {
+			/*
+			 * we can't fix anything now, so we cop out and use
+			 * -EIO.
+			 */
+			printk(KERN_ERR "unionfs: can't create a whiteout for "
+			       "the source in rename!\n");
+			err = -EIO;
+		}
+		release_lower_nd(&nd, local_err);
+	}
+
+out:
+	dput(wh_old);
+	return err;
+
+revert:
+	/* Do revert here. */
+	local_err = unionfs_refresh_lower_dentry(new_dentry, old_bstart);
+	if (local_err) {
+		printk(KERN_ERR "unionfs: revert failed in rename: "
+		       "the new refresh failed\n");
+		eio = -EIO;
+	}
+
+	local_err = unionfs_refresh_lower_dentry(old_dentry, old_bstart);
+	if (local_err) {
+		printk(KERN_ERR "unionfs: revert failed in rename: "
+		       "the old refresh failed\n");
+		eio = -EIO;
+		goto revert_out;
+	}
+
+	if (!unionfs_lower_dentry_idx(new_dentry, bindex) ||
+	    !unionfs_lower_dentry_idx(new_dentry, bindex)->d_inode) {
+		printk(KERN_ERR "unionfs: revert failed in rename: "
+		       "the object disappeared from under us!\n");
+		eio = -EIO;
+		goto revert_out;
+	}
+
+	if (unionfs_lower_dentry_idx(old_dentry, bindex) &&
+	    unionfs_lower_dentry_idx(old_dentry, bindex)->d_inode) {
+		printk(KERN_ERR "unionfs: revert failed in rename: "
+		       "the object was created underneath us!\n");
+		eio = -EIO;
+		goto revert_out;
+	}
+
+	local_err = __unionfs_rename(new_dir, new_dentry,
+				     old_dir, old_dentry, old_bstart, NULL);
+
+	/* If we can't fix it, then we cop out with -EIO. */
+	if (local_err) {
+		printk(KERN_ERR "unionfs: revert failed in rename!\n");
+		eio = -EIO;
+	}
+
+	local_err = unionfs_refresh_lower_dentry(new_dentry, bindex);
+	if (local_err)
+		eio = -EIO;
+	local_err = unionfs_refresh_lower_dentry(old_dentry, bindex);
+	if (local_err)
+		eio = -EIO;
+
+revert_out:
+	if (eio)
+		err = eio;
+	return err;
+}
+
+static struct dentry *lookup_whiteout(struct dentry *dentry)
+{
+	char *whname;
+	int bindex = -1, bstart = -1, bend = -1;
+	struct dentry *parent, *lower_parent, *wh_dentry;
+
+	whname = alloc_whname(dentry->d_name.name, dentry->d_name.len);
+	if (unlikely(IS_ERR(whname)))
+		return (void *)whname;
+
+	parent = dget_parent(dentry);
+	unionfs_lock_dentry(parent);
+	bstart = dbstart(parent);
+	bend = dbend(parent);
+	wh_dentry = ERR_PTR(-ENOENT);
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_parent = unionfs_lower_dentry_idx(parent, bindex);
+		if (!lower_parent)
+			continue;
+		wh_dentry = lookup_one_len(whname, lower_parent,
+					   dentry->d_name.len + UNIONFS_WHLEN);
+		if (IS_ERR(wh_dentry))
+			continue;
+		if (wh_dentry->d_inode)
+			break;
+		dput(wh_dentry);
+		wh_dentry = ERR_PTR(-ENOENT);
+	}
+	unionfs_unlock_dentry(parent);
+	dput(parent);
+	kfree(whname);
+	return wh_dentry;
+}
+
+/*
+ * We can't copyup a directory, because it may involve huge numbers of
+ * children, etc.  Doing that in the kernel would be bad, so instead we
+ * return EXDEV to the user-space utility that caused this, and let the
+ * user-space recurse and ask us to copy up each file separately.
+ */
+static int may_rename_dir(struct dentry *dentry)
+{
+	int err, bstart;
+
+	err = check_empty(dentry, NULL);
+	if (err == -ENOTEMPTY) {
+		if (is_robranch(dentry))
+			return -EXDEV;
+	} else if (err) {
+		return err;
+	}
+
+	bstart = dbstart(dentry);
+	if (dbend(dentry) == bstart || dbopaque(dentry) == bstart)
+		return 0;
+
+	set_dbstart(dentry, bstart + 1);
+	err = check_empty(dentry, NULL);
+	set_dbstart(dentry, bstart);
+	if (err == -ENOTEMPTY)
+		err = -EXDEV;
+	return err;
+}
+
+int unionfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+		   struct inode *new_dir, struct dentry *new_dentry)
+{
+	int err = 0;
+	struct dentry *wh_dentry;
+
+	unionfs_read_lock(old_dentry->d_sb);
+	unionfs_double_lock_dentry(old_dentry, new_dentry);
+
+	if (unlikely(!__unionfs_d_revalidate_chain(old_dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+	if (unlikely(!d_deleted(new_dentry) && new_dentry->d_inode &&
+		     !__unionfs_d_revalidate_chain(new_dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+
+	if (!S_ISDIR(old_dentry->d_inode->i_mode))
+		err = unionfs_partial_lookup(old_dentry);
+	else
+		err = may_rename_dir(old_dentry);
+
+	if (err)
+		goto out;
+
+	err = unionfs_partial_lookup(new_dentry);
+	if (err)
+		goto out;
+
+	/*
+	 * if new_dentry is already lower because of whiteout,
+	 * simply override it even if the whited-out dir is not empty.
+	 */
+	wh_dentry = lookup_whiteout(new_dentry);
+	if (!IS_ERR(wh_dentry)) {
+		dput(wh_dentry);
+	} else if (new_dentry->d_inode) {
+		if (S_ISDIR(old_dentry->d_inode->i_mode) !=
+		    S_ISDIR(new_dentry->d_inode->i_mode)) {
+			err = S_ISDIR(old_dentry->d_inode->i_mode) ?
+				-ENOTDIR : -EISDIR;
+			goto out;
+		}
+
+		if (S_ISDIR(new_dentry->d_inode->i_mode)) {
+			struct unionfs_dir_state *namelist;
+			/* check if this unionfs directory is empty or not */
+			err = check_empty(new_dentry, &namelist);
+			if (err)
+				goto out;
+
+			if (!is_robranch(new_dentry))
+				err = delete_whiteouts(new_dentry,
+						       dbstart(new_dentry),
+						       namelist);
+
+			free_rdstate(namelist);
+
+			if (err)
+				goto out;
+		}
+	}
+	err = do_unionfs_rename(old_dir, old_dentry, new_dir, new_dentry);
+out:
+	if (err) {
+		/* clear the new_dentry stuff created */
+		d_drop(new_dentry);
+	} else {
+		/*
+		 * force re-lookup since the dir on ro branch is not renamed,
+		 * and lower dentries still indicate the un-renamed ones.
+		 */
+		if (S_ISDIR(old_dentry->d_inode->i_mode))
+			atomic_dec(&UNIONFS_D(old_dentry)->generation);
+		else
+			unionfs_postcopyup_release(old_dentry);
+		if (new_dentry->d_inode &&
+		    !S_ISDIR(new_dentry->d_inode->i_mode)) {
+			unionfs_postcopyup_release(new_dentry);
+			unionfs_postcopyup_setmnt(new_dentry);
+			if (!unionfs_lower_inode(new_dentry->d_inode)) {
+				/*
+				 * If we get here, it means that no copyup
+				 * was needed, and that a file by the old
+				 * name already existed on the destination
+				 * branch; that file got renamed earlier in
+				 * this function, so all we need to do here
+				 * is set the lower inode.
+				 */
+				struct inode *inode;
+				inode = unionfs_lower_inode(
+					old_dentry->d_inode);
+				igrab(inode);
+				unionfs_set_lower_inode_idx(
+					new_dentry->d_inode,
+					dbstart(new_dentry), inode);
+			}
+
+		}
+		/* if all of this renaming succeeded, update our times */
+		unionfs_copy_attr_times(old_dir);
+		unionfs_copy_attr_times(new_dir);
+		unionfs_copy_attr_times(old_dentry->d_inode);
+		unionfs_copy_attr_times(new_dentry->d_inode);
+		unionfs_check_inode(old_dir);
+		unionfs_check_inode(new_dir);
+		unionfs_check_dentry(old_dentry);
+		unionfs_check_dentry(new_dentry);
+	}
+
+	unionfs_unlock_dentry(new_dentry);
+	unionfs_unlock_dentry(old_dentry);
+	unionfs_read_unlock(old_dentry->d_sb);
+	return err;
+}
-- 
1.5.2.2
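
Illustration only, not part of the patch: a small userspace sketch of two
decisions made in do_unionfs_rename() above.  The first is the condition
under which a whiteout is created for the source (more than one lower
instance, or a copyup was requested).  The second is the upper bound on
how many destination branches the unlink loop can visit, since it runs
from old_bstart - 1 down to new_bstart.  The helper names and NO_COPYUP
are hypothetical; only the -1 sentinel and the two expressions come from
the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* -1 mirrors the patch's "no copyup requested" value of do_copyup. */
#define NO_COPYUP (-1)

/*
 * Mirrors: (old_bstart != old_bend) || (do_copyup != -1).
 * A whiteout hides the lower instances of the source that remain in
 * branches to the right once the leftmost instance has been renamed.
 */
static bool source_needs_whiteout(int old_bstart, int old_bend,
				  int do_copyup)
{
	assert(old_bstart <= old_bend);
	return old_bstart != old_bend || do_copyup != NO_COPYUP;
}

/*
 * Upper bound on the destination instances unlinked before the rename:
 * the loop covers branches old_bstart - 1 down to new_bstart, and the
 * real code skips any branch that has no lower dentry.
 */
static int max_dest_branches_to_unlink(int old_bstart, int new_bstart)
{
	int n = 0;

	for (int b = old_bstart - 1; b >= new_bstart; b--)
		n++;
	return n;
}
```

For example, a source that exists only in branch 0 and needs no copyup
gets no whiteout, while a source spanning branches 0..2 does.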



* [PATCH 18/42] Unionfs: directory reading file operations
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (16 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 17/42] Unionfs: rename method and helpers Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 19/42] Unionfs: readdir helper functions Erez Zadok
                   ` (24 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/dirfops.c |  290 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 290 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/dirfops.c

diff --git a/fs/unionfs/dirfops.c b/fs/unionfs/dirfops.c
new file mode 100644
index 0000000..88df635
--- /dev/null
+++ b/fs/unionfs/dirfops.c
@@ -0,0 +1,290 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+/* Make sure our rdstate is playing by the rules. */
+static void verify_rdstate_offset(struct unionfs_dir_state *rdstate)
+{
+	BUG_ON(rdstate->offset >= DIREOF);
+	BUG_ON(rdstate->cookie >= MAXRDCOOKIE);
+}
+
+struct unionfs_getdents_callback {
+	struct unionfs_dir_state *rdstate;
+	void *dirent;
+	int entries_written;
+	int filldir_called;
+	int filldir_error;
+	filldir_t filldir;
+	struct super_block *sb;
+};
+
+/* based on generic filldir in fs/readdir.c */
+static int unionfs_filldir(void *dirent, const char *name, int namelen,
+			   loff_t offset, u64 ino, unsigned int d_type)
+{
+	struct unionfs_getdents_callback *buf = dirent;
+	struct filldir_node *found = NULL;
+	int err = 0;
+	int is_wh_entry = 0;
+
+	buf->filldir_called++;
+
+	if ((namelen > UNIONFS_WHLEN) &&
+	    !strncmp(name, UNIONFS_WHPFX, UNIONFS_WHLEN)) {
+		name += UNIONFS_WHLEN;
+		namelen -= UNIONFS_WHLEN;
+		is_wh_entry = 1;
+	}
+
+	found = find_filldir_node(buf->rdstate, name, namelen, is_wh_entry);
+
+	if (found) {
+		/*
+		 * If we had a non-whiteout entry in the dir cache, then mark
+		 * it as a whiteout but leave it in the dir cache.
+		 */
+		if (is_wh_entry && !found->whiteout)
+			found->whiteout = is_wh_entry;
+		goto out;
+	}
+
+	/* if 'name' isn't a whiteout, filldir it. */
+	if (!is_wh_entry) {
+		off_t pos = rdstate2offset(buf->rdstate);
+		u64 unionfs_ino = ino;
+
+		err = buf->filldir(buf->dirent, name, namelen, pos,
+				   unionfs_ino, d_type);
+		buf->rdstate->offset++;
+		verify_rdstate_offset(buf->rdstate);
+	}
+	/*
+	 * If we did fill it, stuff it in our hash, otherwise return an
+	 * error.
+	 */
+	if (err) {
+		buf->filldir_error = err;
+		goto out;
+	}
+	buf->entries_written++;
+	err = add_filldir_node(buf->rdstate, name, namelen,
+			       buf->rdstate->bindex, is_wh_entry);
+	if (err)
+		buf->filldir_error = err;
+
+out:
+	return err;
+}
+
+static int unionfs_readdir(struct file *file, void *dirent, filldir_t filldir)
+{
+	int err = 0;
+	struct file *lower_file = NULL;
+	struct inode *inode = NULL;
+	struct unionfs_getdents_callback buf;
+	struct unionfs_dir_state *uds;
+	int bend;
+	loff_t offset;
+
+	unionfs_read_lock(file->f_path.dentry->d_sb);
+
+	err = unionfs_file_revalidate(file, false);
+	if (unlikely(err))
+		goto out;
+
+	inode = file->f_path.dentry->d_inode;
+
+	uds = UNIONFS_F(file)->rdstate;
+	if (!uds) {
+		if (file->f_pos == DIREOF) {
+			goto out;
+		} else if (file->f_pos > 0) {
+			uds = find_rdstate(inode, file->f_pos);
+			if (unlikely(!uds)) {
+				err = -ESTALE;
+				goto out;
+			}
+			UNIONFS_F(file)->rdstate = uds;
+		} else {
+			init_rdstate(file);
+			uds = UNIONFS_F(file)->rdstate;
+		}
+	}
+	bend = fbend(file);
+
+	while (uds->bindex <= bend) {
+		lower_file = unionfs_lower_file_idx(file, uds->bindex);
+		if (!lower_file) {
+			uds->bindex++;
+			uds->dirpos = 0;
+			continue;
+		}
+
+		/* prepare callback buffer */
+		buf.filldir_called = 0;
+		buf.filldir_error = 0;
+		buf.entries_written = 0;
+		buf.dirent = dirent;
+		buf.filldir = filldir;
+		buf.rdstate = uds;
+		buf.sb = inode->i_sb;
+
+		/* Read starting from where we last left off. */
+		offset = vfs_llseek(lower_file, uds->dirpos, SEEK_SET);
+		if (offset < 0) {
+			err = offset;
+			goto out;
+		}
+		err = vfs_readdir(lower_file, unionfs_filldir, &buf);
+
+		/* Save the position for when we continue. */
+		offset = vfs_llseek(lower_file, 0, SEEK_CUR);
+		if (offset < 0) {
+			err = offset;
+			goto out;
+		}
+		uds->dirpos = offset;
+
+		/* Copy the atime. */
+		fsstack_copy_attr_atime(inode,
+					lower_file->f_path.dentry->d_inode);
+
+		if (err < 0)
+			goto out;
+
+		if (buf.filldir_error)
+			break;
+
+		if (!buf.entries_written) {
+			uds->bindex++;
+			uds->dirpos = 0;
+		}
+	}
+
+	if (!buf.filldir_error && uds->bindex >= bend) {
+		/* Save the number of hash entries for next time. */
+		UNIONFS_I(inode)->hashsize = uds->hashentries;
+		free_rdstate(uds);
+		UNIONFS_F(file)->rdstate = NULL;
+		file->f_pos = DIREOF;
+	} else {
+		file->f_pos = rdstate2offset(uds);
+	}
+
+out:
+	unionfs_read_unlock(file->f_path.dentry->d_sb);
+	return err;
+}
+
+/*
+ * This is not meant to be a generic repositioning function.  If you do
+ * things that aren't supported, then we return EINVAL.
+ *
+ * What is allowed:
+ *  (1) seeking to the same position that you are currently at
+ *	This really has no effect, but returns where you are.
+ *  (2) seeking to the beginning of the file
+ *	This throws out all state, and lets you begin again.
+ */
+static loff_t unionfs_dir_llseek(struct file *file, loff_t offset, int origin)
+{
+	struct unionfs_dir_state *rdstate;
+	loff_t err;
+
+	unionfs_read_lock(file->f_path.dentry->d_sb);
+
+	err = unionfs_file_revalidate(file, false);
+	if (unlikely(err))
+		goto out;
+
+	rdstate = UNIONFS_F(file)->rdstate;
+
+	/*
+	 * we let users seek to their current position, but not anywhere
+	 * else.
+	 */
+	if (!offset) {
+		switch (origin) {
+		case SEEK_SET:
+			if (rdstate) {
+				free_rdstate(rdstate);
+				UNIONFS_F(file)->rdstate = NULL;
+			}
+			init_rdstate(file);
+			err = 0;
+			break;
+		case SEEK_CUR:
+			err = file->f_pos;
+			break;
+		case SEEK_END:
+			/* Unsupported, because we would break everything.  */
+			err = -EINVAL;
+			break;
+		}
+	} else {
+		switch (origin) {
+		case SEEK_SET:
+			if (rdstate) {
+				if (offset == rdstate2offset(rdstate))
+					err = offset;
+				else if (file->f_pos == DIREOF)
+					err = DIREOF;
+				else
+					err = -EINVAL;
+			} else {
+				struct inode *inode;
+				inode = file->f_path.dentry->d_inode;
+				rdstate = find_rdstate(inode, offset);
+				if (rdstate) {
+					UNIONFS_F(file)->rdstate = rdstate;
+					err = rdstate->offset;
+				} else {
+					err = -EINVAL;
+				}
+			}
+			break;
+		case SEEK_CUR:
+		case SEEK_END:
+			/* Unsupported, because we would break everything.  */
+			err = -EINVAL;
+			break;
+		}
+	}
+
+out:
+	unionfs_read_unlock(file->f_path.dentry->d_sb);
+	return err;
+}
+
+/*
+ * Trimmed directory operations; we shouldn't pass everything down since
+ * we don't want to operate on partial directories.
+ */
+struct file_operations unionfs_dir_fops = {
+	.llseek		= unionfs_dir_llseek,
+	.read		= generic_read_dir,
+	.readdir	= unionfs_readdir,
+	.unlocked_ioctl	= unionfs_ioctl,
+	.open		= unionfs_open,
+	.release	= unionfs_file_release,
+	.flush		= unionfs_flush,
+	.fsync		= unionfs_fsync,
+	.fasync		= unionfs_fasync,
+};
-- 
1.5.2.2
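
Illustration only, not part of the patch: a userspace sketch of the
whiteout-prefix test at the top of unionfs_filldir() above.  A name counts
as a whiteout only if it is strictly longer than the prefix and starts
with it; the prefix is then stripped so the remainder can be looked up in
the readdir hash.  The value ".wh." for the prefix is an assumption here
(UNIONFS_WHPFX and UNIONFS_WHLEN are defined elsewhere in the series), and
MODEL_WHPFX, MODEL_WHLEN and strip_whiteout_prefix are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed values; the real definitions live elsewhere in the series. */
#define MODEL_WHPFX ".wh."
#define MODEL_WHLEN (sizeof(MODEL_WHPFX) - 1)

/*
 * If the name (of length *namelen) starts with the whiteout prefix and
 * has at least one character after it, advance *name past the prefix,
 * shrink *namelen accordingly, and return true.  Otherwise leave both
 * untouched and return false.
 */
static bool strip_whiteout_prefix(const char **name, size_t *namelen)
{
	assert(name != NULL && *name != NULL && namelen != NULL);
	if (*namelen > MODEL_WHLEN &&
	    !strncmp(*name, MODEL_WHPFX, MODEL_WHLEN)) {
		*name += MODEL_WHLEN;
		*namelen -= MODEL_WHLEN;
		return true;
	}
	return false;
}
```

Note the strict "greater than" comparison: a bare prefix with nothing
after it is treated as an ordinary name, matching the namelen check in
the patch.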



* [PATCH 19/42] Unionfs: readdir helper functions
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (17 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 18/42] Unionfs: directory reading file operations Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 20/42] Unionfs: readdir state helpers Erez Zadok
                   ` (23 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Includes whiteout handling for directories.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/dirhelper.c |  272 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 272 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/dirhelper.c

diff --git a/fs/unionfs/dirhelper.c b/fs/unionfs/dirhelper.c
new file mode 100644
index 0000000..2e52fc3
--- /dev/null
+++ b/fs/unionfs/dirhelper.c
@@ -0,0 +1,272 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+/*
+ * Delete all of the whiteouts in a given directory for rmdir.
+ *
+ * lower directory inode should be locked
+ */
+int do_delete_whiteouts(struct dentry *dentry, int bindex,
+			struct unionfs_dir_state *namelist)
+{
+	int err = 0;
+	struct dentry *lower_dir_dentry = NULL;
+	struct dentry *lower_dentry;
+	char *name = NULL, *p;
+	struct inode *lower_dir;
+	int i;
+	struct list_head *pos;
+	struct filldir_node *cursor;
+
+	/* Find out lower parent dentry */
+	lower_dir_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+	BUG_ON(!S_ISDIR(lower_dir_dentry->d_inode->i_mode));
+	lower_dir = lower_dir_dentry->d_inode;
+	BUG_ON(!S_ISDIR(lower_dir->i_mode));
+
+	err = -ENOMEM;
+	name = __getname();
+	if (unlikely(!name))
+		goto out;
+	strcpy(name, UNIONFS_WHPFX);
+	p = name + UNIONFS_WHLEN;
+
+	err = 0;
+	for (i = 0; !err && i < namelist->size; i++) {
+		list_for_each(pos, &namelist->list[i]) {
+			cursor =
+				list_entry(pos, struct filldir_node,
+					   file_list);
+			/* Only operate on whiteouts in this branch. */
+			if (cursor->bindex != bindex)
+				continue;
+			if (!cursor->whiteout)
+				continue;
+
+			strcpy(p, cursor->name);
+			lower_dentry =
+				lookup_one_len(name, lower_dir_dentry,
+					       cursor->namelen +
+					       UNIONFS_WHLEN);
+			if (IS_ERR(lower_dentry)) {
+				err = PTR_ERR(lower_dentry);
+				break;
+			}
+			if (lower_dentry->d_inode)
+				err = vfs_unlink(lower_dir, lower_dentry);
+			dput(lower_dentry);
+			if (err)
+				break;
+		}
+	}
+
+	__putname(name);
+
+	/* After all of the removals, we should copy the attributes once. */
+	fsstack_copy_attr_times(dentry->d_inode, lower_dir_dentry->d_inode);
+
+out:
+	return err;
+}
+
+/* delete whiteouts in a dir (for rmdir operation) using sioq if necessary */
+int delete_whiteouts(struct dentry *dentry, int bindex,
+		     struct unionfs_dir_state *namelist)
+{
+	int err;
+	struct super_block *sb;
+	struct dentry *lower_dir_dentry;
+	struct inode *lower_dir;
+	struct sioq_args args;
+
+	sb = dentry->d_sb;
+
+	BUG_ON(!S_ISDIR(dentry->d_inode->i_mode));
+	BUG_ON(bindex < dbstart(dentry));
+	BUG_ON(bindex > dbend(dentry));
+	err = is_robranch_super(sb, bindex);
+	if (err)
+		goto out;
+
+	lower_dir_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+	BUG_ON(!S_ISDIR(lower_dir_dentry->d_inode->i_mode));
+	lower_dir = lower_dir_dentry->d_inode;
+	BUG_ON(!S_ISDIR(lower_dir->i_mode));
+
+	mutex_lock(&lower_dir->i_mutex);
+	if (!permission(lower_dir, MAY_WRITE | MAY_EXEC, NULL)) {
+		err = do_delete_whiteouts(dentry, bindex, namelist);
+	} else {
+		args.deletewh.namelist = namelist;
+		args.deletewh.dentry = dentry;
+		args.deletewh.bindex = bindex;
+		run_sioq(__delete_whiteouts, &args);
+		err = args.err;
+	}
+	mutex_unlock(&lower_dir->i_mutex);
+
+out:
+	return err;
+}
+
+#define RD_NONE 0
+#define RD_CHECK_EMPTY 1
+/* The callback structure for check_empty. */
+struct unionfs_rdutil_callback {
+	int err;
+	int filldir_called;
+	struct unionfs_dir_state *rdstate;
+	int mode;
+};
+
+/* This filldir function makes sure only whiteouts exist within a directory. */
+static int readdir_util_callback(void *dirent, const char *name, int namelen,
+				 loff_t offset, u64 ino, unsigned int d_type)
+{
+	int err = 0;
+	struct unionfs_rdutil_callback *buf = dirent;
+	int whiteout = 0;
+	struct filldir_node *found;
+
+	buf->filldir_called = 1;
+
+	if (name[0] == '.' && (namelen == 1 ||
+			       (name[1] == '.' && namelen == 2)))
+		goto out;
+
+	if (namelen > UNIONFS_WHLEN &&
+	    !strncmp(name, UNIONFS_WHPFX, UNIONFS_WHLEN)) {
+		namelen -= UNIONFS_WHLEN;
+		name += UNIONFS_WHLEN;
+		whiteout = 1;
+	}
+
+	found = find_filldir_node(buf->rdstate, name, namelen, whiteout);
+	/* If it was found in the table there was a previous whiteout. */
+	if (found)
+		goto out;
+
+	/*
+	 * if it wasn't found and isn't a whiteout, the directory isn't
+	 * empty.
+	 */
+	err = -ENOTEMPTY;
+	if ((buf->mode == RD_CHECK_EMPTY) && !whiteout)
+		goto out;
+
+	err = add_filldir_node(buf->rdstate, name, namelen,
+			       buf->rdstate->bindex, whiteout);
+
+out:
+	buf->err = err;
+	return err;
+}
+
+/* Is a directory logically empty? */
+int check_empty(struct dentry *dentry, struct unionfs_dir_state **namelist)
+{
+	int err = 0;
+	struct dentry *lower_dentry = NULL;
+	struct super_block *sb;
+	struct file *lower_file;
+	struct unionfs_rdutil_callback *buf = NULL;
+	int bindex, bstart, bend, bopaque;
+
+	sb = dentry->d_sb;
+
+
+	BUG_ON(!S_ISDIR(dentry->d_inode->i_mode));
+
+	err = unionfs_partial_lookup(dentry);
+	if (err)
+		goto out;
+
+	bstart = dbstart(dentry);
+	bend = dbend(dentry);
+	bopaque = dbopaque(dentry);
+	if (0 <= bopaque && bopaque < bend)
+		bend = bopaque;
+
+	buf = kmalloc(sizeof(struct unionfs_rdutil_callback), GFP_KERNEL);
+	if (unlikely(!buf)) {
+		err = -ENOMEM;
+		goto out;
+	}
+	buf->err = 0;
+	buf->mode = RD_CHECK_EMPTY;
+	buf->rdstate = alloc_rdstate(dentry->d_inode, bstart);
+	if (unlikely(!buf->rdstate)) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	/* Process the lower directories with rdutil_callback as a filldir. */
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+		if (!lower_dentry)
+			continue;
+		if (!lower_dentry->d_inode)
+			continue;
+		if (!S_ISDIR(lower_dentry->d_inode->i_mode))
+			continue;
+
+		dget(lower_dentry);
+		unionfs_mntget(dentry, bindex);
+		branchget(sb, bindex);
+		lower_file =
+			dentry_open(lower_dentry,
+				    unionfs_lower_mnt_idx(dentry, bindex),
+				    O_RDONLY);
+		if (IS_ERR(lower_file)) {
+			err = PTR_ERR(lower_file);
+			dput(lower_dentry);
+			branchput(sb, bindex);
+			goto out;
+		}
+
+		do {
+			buf->filldir_called = 0;
+			buf->rdstate->bindex = bindex;
+			err = vfs_readdir(lower_file,
+					  readdir_util_callback, buf);
+			if (buf->err)
+				err = buf->err;
+		} while ((err >= 0) && buf->filldir_called);
+
+		/* fput calls dput for lower_dentry */
+		fput(lower_file);
+		branchput(sb, bindex);
+
+		if (err < 0)
+			goto out;
+	}
+
+out:
+	if (buf) {
+		if (namelist && !err)
+			*namelist = buf->rdstate;
+		else if (buf->rdstate)
+			free_rdstate(buf->rdstate);
+		kfree(buf);
+	}
+
+
+	return err;
+}
-- 
1.5.2.2
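
A note for readers following the emptiness check in this patch: the logic
in readdir_util_callback() and check_empty() can be sketched in plain
userspace C.  This is only an illustration, not the kernel code.  It
assumes the whiteout prefix is ".wh." (as the ".wh.foo" comments in the
series suggest), replaces the rdstate hash table with a small fixed array,
and uses hypothetical names (struct branch, union_dir_is_empty, MAX_SEEN)
that do not exist in unionfs.  Branches are scanned from highest priority
to lowest; a name counts against emptiness only if no earlier entry
(whiteout or not) has already recorded it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WH_PREFIX ".wh."		/* assumed whiteout prefix */
#define WH_LEN (sizeof(WH_PREFIX) - 1)
#define MAX_SEEN 64			/* hypothetical fixed table size */

/* Hypothetical stand-in for one lower directory's readdir output. */
struct branch {
	const char *const *names;
	size_t count;
};

/* Linear search; the real code uses a hash table of filldir nodes. */
static int seen_before(const char **seen, size_t nseen, const char *name)
{
	for (size_t i = 0; i < nseen; i++)
		if (strcmp(seen[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Returns 1 if the union of the branches is logically empty, 0 if a
 * visible (non-whited-out) name exists, and -1 if the table is full.
 */
static int union_dir_is_empty(const struct branch *br, size_t nbranches)
{
	const char *seen[MAX_SEEN];
	size_t nseen = 0;

	assert(br != NULL);

	for (size_t b = 0; b < nbranches; b++) {
		for (size_t i = 0; i < br[b].count; i++) {
			const char *name = br[b].names[i];
			int whiteout = 0;

			/* "." and ".." never make a directory non-empty */
			if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
				continue;

			/* strip the prefix so ".wh.foo" is recorded as "foo" */
			if (strlen(name) > WH_LEN &&
			    strncmp(name, WH_PREFIX, WH_LEN) == 0) {
				name += WH_LEN;
				whiteout = 1;
			}

			/* already recorded by an earlier entry: masked */
			if (seen_before(seen, nseen, name))
				continue;

			/* a new, visible name: the directory is not empty */
			if (!whiteout)
				return 0;

			if (nseen == MAX_SEEN)
				return -1;
			seen[nseen++] = name;
		}
	}
	return 1;
}
```

With a top branch holding only ".wh.foo" and a lower branch holding "foo",
the sketch reports the union as empty; adding an unmasked "bar" to the
lower branch makes it non-empty.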


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 20/42] Unionfs: readdir state helpers
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (18 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 19/42] Unionfs: readdir helper functions Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 21/42] Unionfs: inode operations Erez Zadok
                   ` (22 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Includes duplicate name elimination and whiteout-handling code.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/rdstate.c |  285 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 285 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/rdstate.c

diff --git a/fs/unionfs/rdstate.c b/fs/unionfs/rdstate.c
new file mode 100644
index 0000000..7ba1e1a
--- /dev/null
+++ b/fs/unionfs/rdstate.c
@@ -0,0 +1,285 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+/* This file contains the routines for maintaining readdir state. */
+
+/*
+ * There are two structures here: rdstate, which is a hash table, and
+ * filldir_node, which is the entry type stored in that table.
+ */
+
+/*
+ * This is a struct kmem_cache for filldir nodes, because we allocate a lot
+ * of them and they shouldn't waste memory.  If the node has a small name
+ * (as defined by the dentry structure), then we use an inline name to
+ * preserve kmalloc space.
+ */
+static struct kmem_cache *unionfs_filldir_cachep;
+
+int unionfs_init_filldir_cache(void)
+{
+	unionfs_filldir_cachep =
+		kmem_cache_create("unionfs_filldir",
+				  sizeof(struct filldir_node), 0,
+				  SLAB_RECLAIM_ACCOUNT, NULL);
+
+	return (unionfs_filldir_cachep ? 0 : -ENOMEM);
+}
+
+void unionfs_destroy_filldir_cache(void)
+{
+	if (unionfs_filldir_cachep)
+		kmem_cache_destroy(unionfs_filldir_cachep);
+}
+
+/*
+ * This is a tuning parameter that tells us roughly how big to make the
+ * hash table in directory entries per page.  This isn't perfect, but
+ * at least we get a hash table size that shouldn't be too overloaded.
+ * The following averages are based on my home directory.
+ * 14.44693	Overall
+ * 12.29	Single Page Directories
+ * 117.93	Multi-page directories
+ */
+#define DENTPAGE 4096
+#define DENTPERONEPAGE 12
+#define DENTPERPAGE 118
+#define MINHASHSIZE 1
+static int guesstimate_hash_size(struct inode *inode)
+{
+	struct inode *lower_inode;
+	int bindex;
+	int hashsize = MINHASHSIZE;
+
+	if (UNIONFS_I(inode)->hashsize > 0)
+		return UNIONFS_I(inode)->hashsize;
+
+	for (bindex = ibstart(inode); bindex <= ibend(inode); bindex++) {
+		lower_inode = unionfs_lower_inode_idx(inode, bindex);
+		if (!lower_inode)
+			continue;
+
+		if (i_size_read(lower_inode) == DENTPAGE)
+			hashsize += DENTPERONEPAGE;
+		else
+			hashsize += (i_size_read(lower_inode) / DENTPAGE) *
+				DENTPERPAGE;
+	}
+
+	return hashsize;
+}
+
+int init_rdstate(struct file *file)
+{
+	BUG_ON(sizeof(loff_t) !=
+	       (sizeof(unsigned int) + sizeof(unsigned int)));
+	BUG_ON(UNIONFS_F(file)->rdstate != NULL);
+
+	UNIONFS_F(file)->rdstate = alloc_rdstate(file->f_path.dentry->d_inode,
+						 fbstart(file));
+
+	return (UNIONFS_F(file)->rdstate ? 0 : -ENOMEM);
+}
+
+struct unionfs_dir_state *find_rdstate(struct inode *inode, loff_t fpos)
+{
+	struct unionfs_dir_state *rdstate = NULL;
+	struct list_head *pos;
+
+	spin_lock(&UNIONFS_I(inode)->rdlock);
+	list_for_each(pos, &UNIONFS_I(inode)->readdircache) {
+		struct unionfs_dir_state *r =
+			list_entry(pos, struct unionfs_dir_state, cache);
+		if (fpos == rdstate2offset(r)) {
+			UNIONFS_I(inode)->rdcount--;
+			list_del(&r->cache);
+			rdstate = r;
+			break;
+		}
+	}
+	spin_unlock(&UNIONFS_I(inode)->rdlock);
+	return rdstate;
+}
+
+struct unionfs_dir_state *alloc_rdstate(struct inode *inode, int bindex)
+{
+	int i = 0;
+	int hashsize;
+	unsigned long mallocsize = sizeof(struct unionfs_dir_state);
+	struct unionfs_dir_state *rdstate;
+
+	hashsize = guesstimate_hash_size(inode);
+	mallocsize += hashsize * sizeof(struct list_head);
+	mallocsize = __roundup_pow_of_two(mallocsize);
+
+	/* This should give us about 500 entries anyway. */
+	if (mallocsize > PAGE_SIZE)
+		mallocsize = PAGE_SIZE;
+
+	hashsize = (mallocsize - sizeof(struct unionfs_dir_state)) /
+		sizeof(struct list_head);
+
+	rdstate = kmalloc(mallocsize, GFP_KERNEL);
+	if (unlikely(!rdstate))
+		return NULL;
+
+	spin_lock(&UNIONFS_I(inode)->rdlock);
+	if (UNIONFS_I(inode)->cookie >= (MAXRDCOOKIE - 1))
+		UNIONFS_I(inode)->cookie = 1;
+	else
+		UNIONFS_I(inode)->cookie++;
+
+	rdstate->cookie = UNIONFS_I(inode)->cookie;
+	spin_unlock(&UNIONFS_I(inode)->rdlock);
+	rdstate->offset = 1;
+	rdstate->access = jiffies;
+	rdstate->bindex = bindex;
+	rdstate->dirpos = 0;
+	rdstate->hashentries = 0;
+	rdstate->size = hashsize;
+	for (i = 0; i < rdstate->size; i++)
+		INIT_LIST_HEAD(&rdstate->list[i]);
+
+	return rdstate;
+}
+
+static void free_filldir_node(struct filldir_node *node)
+{
+	if (node->namelen >= DNAME_INLINE_LEN_MIN)
+		kfree(node->name);
+	kmem_cache_free(unionfs_filldir_cachep, node);
+}
+
+void free_rdstate(struct unionfs_dir_state *state)
+{
+	struct filldir_node *tmp;
+	int i;
+
+	for (i = 0; i < state->size; i++) {
+		struct list_head *head = &(state->list[i]);
+		struct list_head *pos, *n;
+
+		/* traverse the list and deallocate space */
+		list_for_each_safe(pos, n, head) {
+			tmp = list_entry(pos, struct filldir_node, file_list);
+			list_del(&tmp->file_list);
+			free_filldir_node(tmp);
+		}
+	}
+
+	kfree(state);
+}
+
+struct filldir_node *find_filldir_node(struct unionfs_dir_state *rdstate,
+				       const char *name, int namelen,
+				       int is_whiteout)
+{
+	int index;
+	unsigned int hash;
+	struct list_head *head;
+	struct list_head *pos;
+	struct filldir_node *cursor = NULL;
+	int found = 0;
+
+	BUG_ON(namelen <= 0);
+
+	hash = full_name_hash(name, namelen);
+	index = hash % rdstate->size;
+
+	head = &(rdstate->list[index]);
+	list_for_each(pos, head) {
+		cursor = list_entry(pos, struct filldir_node, file_list);
+
+		if (cursor->namelen == namelen && cursor->hash == hash &&
+		    !strncmp(cursor->name, name, namelen)) {
+			/*
+			 * a duplicate exists, so there is no need to add
+			 * another entry to the list
+			 */
+			found = 1;
+
+			/*
+			 * if a duplicate is found in this branch, and is
+			 * not due to the caller looking for an entry to
+			 * whiteout, then the file system may be corrupted.
+			 */
+			if (unlikely(!is_whiteout &&
+				     cursor->bindex == rdstate->bindex))
+				printk(KERN_ERR "unionfs: filldir: possible "
+				       "I/O error: a file is duplicated "
+				       "in the same branch %d: %s\n",
+				       rdstate->bindex, cursor->name);
+			break;
+		}
+	}
+
+	if (!found)
+		cursor = NULL;
+
+	return cursor;
+}
+
+int add_filldir_node(struct unionfs_dir_state *rdstate, const char *name,
+		     int namelen, int bindex, int whiteout)
+{
+	struct filldir_node *new;
+	unsigned int hash;
+	int index;
+	int err = 0;
+	struct list_head *head;
+
+	BUG_ON(namelen <= 0);
+
+	hash = full_name_hash(name, namelen);
+	index = hash % rdstate->size;
+	head = &(rdstate->list[index]);
+
+	new = kmem_cache_alloc(unionfs_filldir_cachep, GFP_KERNEL);
+	if (unlikely(!new)) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	INIT_LIST_HEAD(&new->file_list);
+	new->namelen = namelen;
+	new->hash = hash;
+	new->bindex = bindex;
+	new->whiteout = whiteout;
+
+	if (namelen < DNAME_INLINE_LEN_MIN) {
+		new->name = new->iname;
+	} else {
+		new->name = kmalloc(namelen + 1, GFP_KERNEL);
+		if (unlikely(!new->name)) {
+			kmem_cache_free(unionfs_filldir_cachep, new);
+			err = -ENOMEM;
+			goto out;
+		}
+	}
+
+	memcpy(new->name, name, namelen);
+	new->name[namelen] = '\0';
+
+	rdstate->hashentries++;
+
+	list_add(&(new->file_list), head);
+out:
+	return err;
+}
-- 
1.5.2.2
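
For readers who want to experiment with the duplicate-elimination scheme
outside the kernel, here is a minimal userspace sketch of the lookup and
insert paths in find_filldir_node() and add_filldir_node().  Several things
are assumptions for illustration only: the hash function is a simple
multiplicative hash standing in for full_name_hash(), INLINE_LEN stands in
for DNAME_INLINE_LEN_MIN, buckets are singly linked rather than list_head
based, and the struct and function names are hypothetical.  As in the
patch, short names are stored inline in the node and longer names get a
separate allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_LEN 16	/* hypothetical stand-in for DNAME_INLINE_LEN_MIN */
#define NBUCKETS 8	/* hypothetical fixed bucket count */

struct node {
	struct node *next;
	unsigned int hash;
	size_t namelen;
	int branch;
	int whiteout;
	char *name;		/* points at iname or at heap memory */
	char iname[INLINE_LEN];
};

struct table {
	struct node *bucket[NBUCKETS];
};

/* Simple stand-in hash; the kernel code uses full_name_hash(). */
static unsigned int name_hash(const char *name, size_t len)
{
	unsigned int h = 5381;

	for (size_t i = 0; i < len; i++)
		h = h * 33 + (unsigned char)name[i];
	return h;
}

/* Return the node for this name, or NULL if it was never recorded. */
static struct node *table_find(const struct table *t, const char *name,
			       size_t len)
{
	unsigned int h = name_hash(name, len);

	for (struct node *n = t->bucket[h % NBUCKETS]; n; n = n->next)
		if (n->namelen == len && n->hash == h &&
		    memcmp(n->name, name, len) == 0)
			return n;
	return NULL;
}

/* Record a name; returns 0 on success, -1 on allocation failure. */
static int table_add(struct table *t, const char *name, size_t len,
		     int branch, int whiteout)
{
	struct node *n;

	assert(len > 0);
	n = malloc(sizeof(*n));
	if (!n)
		return -1;

	if (len < INLINE_LEN) {
		n->name = n->iname;	/* short name: no extra allocation */
	} else {
		n->name = malloc(len + 1);
		if (!n->name) {
			free(n);
			return -1;	/* report the failure to the caller */
		}
	}
	memcpy(n->name, name, len);
	n->name[len] = '\0';

	n->hash = name_hash(name, len);
	n->namelen = len;
	n->branch = branch;
	n->whiteout = whiteout;
	n->next = t->bucket[n->hash % NBUCKETS];
	t->bucket[n->hash % NBUCKETS] = n;
	return 0;
}

/* Free every node, releasing heap-allocated names as well. */
static void table_free(struct table *t)
{
	for (size_t i = 0; i < NBUCKETS; i++) {
		struct node *n = t->bucket[i];

		while (n) {
			struct node *next = n->next;

			if (n->name != n->iname)
				free(n->name);
			free(n);
			n = next;
		}
		t->bucket[i] = NULL;
	}
}
```

A readdir-style caller would call table_find() for each name coming from a
lower branch and only emit the name (and table_add() it) when the lookup
returns NULL.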


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 21/42] Unionfs: inode operations
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (19 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 20/42] Unionfs: readdir state helpers Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 22/42] Unionfs: unlink/rmdir operations Erez Zadok
                   ` (21 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Includes create, lookup, link, symlink, mkdir, mknod, readlink, follow_link,
put_link, permission, and setattr.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/inode.c | 1154 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1154 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/inode.c

diff --git a/fs/unionfs/inode.c b/fs/unionfs/inode.c
new file mode 100644
index 0000000..63ff3d3
--- /dev/null
+++ b/fs/unionfs/inode.c
@@ -0,0 +1,1154 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+static int unionfs_create(struct inode *parent, struct dentry *dentry,
+			  int mode, struct nameidata *nd)
+{
+	int err = 0;
+	struct dentry *lower_dentry = NULL;
+	struct dentry *wh_dentry = NULL;
+	struct dentry *lower_parent_dentry = NULL;
+	char *name = NULL;
+	int valid = 0;
+	struct nameidata lower_nd;
+
+	unionfs_read_lock(dentry->d_sb);
+	unionfs_lock_dentry(dentry);
+
+	unionfs_lock_dentry(dentry->d_parent);
+	valid = __unionfs_d_revalidate_chain(dentry->d_parent, nd, false);
+	unionfs_unlock_dentry(dentry->d_parent);
+	if (unlikely(!valid)) {
+		err = -ESTALE;	/* same as what real_lookup does */
+		goto out;
+	}
+	valid = __unionfs_d_revalidate_chain(dentry, nd, false);
+	/*
+	 * It's only a bug if this dentry was not negative and couldn't be
+	 * revalidated (shouldn't happen).
+	 */
+	BUG_ON(!valid && dentry->d_inode);
+
+	/*
+	 * We shouldn't create things in a read-only branch; this check is a
+	 * bit redundant as we don't allow branch 0 to be read-only at the
+	 * moment
+	 */
+	err = is_robranch_super(dentry->d_sb, 0);
+	if (err) {
+		err = -EROFS;
+		goto out;
+	}
+
+	/*
+	 * We _always_ create on branch 0
+	 */
+	lower_dentry = unionfs_lower_dentry_idx(dentry, 0);
+	if (lower_dentry) {
+		/*
+		 * check if whiteout exists in this branch, i.e. lookup .wh.foo
+		 * first.
+		 */
+		name = alloc_whname(dentry->d_name.name, dentry->d_name.len);
+		if (unlikely(IS_ERR(name))) {
+			err = PTR_ERR(name);
+			goto out;
+		}
+
+		wh_dentry = lookup_one_len(name, lower_dentry->d_parent,
+					   dentry->d_name.len + UNIONFS_WHLEN);
+		if (IS_ERR(wh_dentry)) {
+			err = PTR_ERR(wh_dentry);
+			wh_dentry = NULL;
+			goto out;
+		}
+
+		if (wh_dentry->d_inode) {
+			/*
+			 * .wh.foo has been found, so let's unlink it
+			 */
+			struct dentry *lower_dir_dentry;
+
+			lower_dir_dentry = lock_parent(wh_dentry);
+			err = vfs_unlink(lower_dir_dentry->d_inode, wh_dentry);
+			unlock_dir(lower_dir_dentry);
+
+			/*
+			 * Whiteouts are special files and should be deleted
+			 * no matter what (as if they never existed), in
+			 * order to allow this create operation to succeed.
+			 * This is especially important in sticky
+			 * directories: a whiteout may have been created by
+			 * one user, but the newly created file may be
+			 * created by another user.  Therefore, in order to
+			 * maintain Unix semantics, if the vfs_unlink above
+			 * failed, then we have to try to directly unlink the
+			 * whiteout.  Note: in the ODF version of unionfs,
+			 * whiteouts are handled much more cleanly.
+			 */
+			if (err == -EPERM) {
+				struct inode *inode = lower_dir_dentry->d_inode;
+				err = inode->i_op->unlink(inode, wh_dentry);
+			}
+			if (err) {
+				printk(KERN_ERR "unionfs: create: could not "
+				       "unlink whiteout, err = %d\n", err);
+				goto out;
+			}
+		}
+	} else {
+		/*
+		 * if lower_dentry is NULL, create the entire
+		 * dentry directory structure in branch 0.
+		 */
+		lower_dentry = create_parents(parent, dentry,
+					      dentry->d_name.name, 0);
+		if (IS_ERR(lower_dentry)) {
+			err = PTR_ERR(lower_dentry);
+			goto out;
+		}
+	}
+
+	lower_parent_dentry = lock_parent(lower_dentry);
+	if (IS_ERR(lower_parent_dentry)) {
+		err = PTR_ERR(lower_parent_dentry);
+		goto out;
+	}
+
+	err = init_lower_nd(&lower_nd, LOOKUP_CREATE);
+	if (unlikely(err < 0))
+		goto out;
+	err = vfs_create(lower_parent_dentry->d_inode, lower_dentry, mode,
+			 &lower_nd);
+	release_lower_nd(&lower_nd, err);
+
+	if (!err) {
+		err = PTR_ERR(unionfs_interpose(dentry, parent->i_sb, 0));
+		if (!err) {
+			unionfs_copy_attr_times(parent);
+			fsstack_copy_inode_size(parent,
+						lower_parent_dentry->d_inode);
+			/* update no. of links on parent directory */
+			parent->i_nlink = unionfs_get_nlinks(parent);
+		}
+	}
+
+	unlock_dir(lower_parent_dentry);
+
+out:
+	dput(wh_dentry);
+	kfree(name);
+
+	if (!err)
+		unionfs_postcopyup_setmnt(dentry);
+
+	unionfs_check_inode(parent);
+	if (!err) {
+		unionfs_check_dentry(dentry->d_parent);
+		unionfs_check_nd(nd);
+	}
+	unionfs_check_dentry(dentry);
+	unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(dentry->d_sb);
+	return err;
+}
+
+/*
+ * unionfs_lookup is the only special function which takes a dentry, yet we
+ * do NOT want to call __unionfs_d_revalidate_chain because by definition,
+ * we don't have a valid dentry here yet.
+ */
+static struct dentry *unionfs_lookup(struct inode *parent,
+				     struct dentry *dentry,
+				     struct nameidata *nd)
+{
+	struct path path_save;
+	struct dentry *ret;
+
+	unionfs_read_lock(dentry->d_sb);
+
+	/* save the dentry & vfsmnt from namei */
+	if (nd) {
+		path_save.dentry = nd->dentry;
+		path_save.mnt = nd->mnt;
+	}
+
+	/*
+	 * unionfs_lookup_backend returns a locked dentry upon success,
+	 * so we'll have to unlock it below.
+	 */
+	ret = unionfs_lookup_backend(dentry, nd, INTERPOSE_LOOKUP);
+
+	/* restore the dentry & vfsmnt in namei */
+	if (nd) {
+		nd->dentry = path_save.dentry;
+		nd->mnt = path_save.mnt;
+	}
+	if (!IS_ERR(ret)) {
+		if (ret)
+			dentry = ret;
+		/* parent times may have changed */
+		unionfs_copy_attr_times(dentry->d_parent->d_inode);
+	}
+
+	unionfs_check_inode(parent);
+	unionfs_check_dentry(dentry);
+	unionfs_check_dentry(dentry->d_parent);
+	unionfs_check_nd(nd);
+	if (!IS_ERR(ret))
+		unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(dentry->d_sb);
+
+	return ret;
+}
+
+static int unionfs_link(struct dentry *old_dentry, struct inode *dir,
+			struct dentry *new_dentry)
+{
+	int err = 0;
+	struct dentry *lower_old_dentry = NULL;
+	struct dentry *lower_new_dentry = NULL;
+	struct dentry *lower_dir_dentry = NULL;
+	struct dentry *whiteout_dentry;
+	char *name = NULL;
+
+	unionfs_read_lock(old_dentry->d_sb);
+	unionfs_double_lock_dentry(new_dentry, old_dentry);
+
+	if (unlikely(!__unionfs_d_revalidate_chain(old_dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+	if (unlikely(new_dentry->d_inode &&
+		     !__unionfs_d_revalidate_chain(new_dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+
+	lower_new_dentry = unionfs_lower_dentry(new_dentry);
+
+	/*
+	 * check if whiteout exists in the branch of new dentry, i.e. lookup
+	 * .wh.foo first. If present, delete it
+	 */
+	name = alloc_whname(new_dentry->d_name.name, new_dentry->d_name.len);
+	if (unlikely(IS_ERR(name))) {
+		err = PTR_ERR(name);
+		goto out;
+	}
+
+	whiteout_dentry = lookup_one_len(name, lower_new_dentry->d_parent,
+					 new_dentry->d_name.len +
+					 UNIONFS_WHLEN);
+	if (IS_ERR(whiteout_dentry)) {
+		err = PTR_ERR(whiteout_dentry);
+		goto out;
+	}
+
+	if (!whiteout_dentry->d_inode) {
+		dput(whiteout_dentry);
+		whiteout_dentry = NULL;
+	} else {
+		/* found a .wh.foo entry, unlink it and then call vfs_link() */
+		lower_dir_dentry = lock_parent(whiteout_dentry);
+		err = is_robranch_super(new_dentry->d_sb, dbstart(new_dentry));
+		if (!err)
+			err = vfs_unlink(lower_dir_dentry->d_inode,
+					 whiteout_dentry);
+
+		fsstack_copy_attr_times(dir, lower_dir_dentry->d_inode);
+		dir->i_nlink = unionfs_get_nlinks(dir);
+		unlock_dir(lower_dir_dentry);
+		lower_dir_dentry = NULL;
+		dput(whiteout_dentry);
+		if (err)
+			goto out;
+	}
+
+	if (dbstart(old_dentry) != dbstart(new_dentry)) {
+		lower_new_dentry = create_parents(dir, new_dentry,
+						  new_dentry->d_name.name,
+						  dbstart(old_dentry));
+		err = PTR_ERR(lower_new_dentry);
+		if (IS_COPYUP_ERR(err))
+			goto docopyup;
+		if (!lower_new_dentry || IS_ERR(lower_new_dentry))
+			goto out;
+	}
+	lower_new_dentry = unionfs_lower_dentry(new_dentry);
+	lower_old_dentry = unionfs_lower_dentry(old_dentry);
+
+	BUG_ON(dbstart(old_dentry) != dbstart(new_dentry));
+	lower_dir_dentry = lock_parent(lower_new_dentry);
+	err = is_robranch(old_dentry);
+	if (!err)
+		err = vfs_link(lower_old_dentry, lower_dir_dentry->d_inode,
+			       lower_new_dentry);
+	unlock_dir(lower_dir_dentry);
+
+docopyup:
+	if (IS_COPYUP_ERR(err)) {
+		int old_bstart = dbstart(old_dentry);
+		int bindex;
+
+		for (bindex = old_bstart - 1; bindex >= 0; bindex--) {
+			err = copyup_dentry(old_dentry->d_parent->d_inode,
+					    old_dentry, old_bstart,
+					    bindex, old_dentry->d_name.name,
+					    old_dentry->d_name.len, NULL,
+					    i_size_read(old_dentry->d_inode));
+			if (!err) {
+				lower_new_dentry =
+					create_parents(dir, new_dentry,
+						       new_dentry->d_name.name,
+						       bindex);
+				lower_old_dentry =
+					unionfs_lower_dentry(old_dentry);
+				lower_dir_dentry =
+					lock_parent(lower_new_dentry);
+				/* do vfs_link */
+				err = vfs_link(lower_old_dentry,
+					       lower_dir_dentry->d_inode,
+					       lower_new_dentry);
+				unlock_dir(lower_dir_dentry);
+				goto check_link;
+			}
+		}
+		goto out;
+	}
+
+check_link:
+	if (err || !lower_new_dentry->d_inode)
+		goto out;
+
+	/* It's a hard link, so use the same inode */
+	new_dentry->d_inode = igrab(old_dentry->d_inode);
+	d_instantiate(new_dentry, new_dentry->d_inode);
+	unionfs_copy_attr_all(dir, lower_new_dentry->d_parent->d_inode);
+	fsstack_copy_inode_size(dir, lower_new_dentry->d_parent->d_inode);
+
+	/* propagate number of hard-links */
+	old_dentry->d_inode->i_nlink = unionfs_get_nlinks(old_dentry->d_inode);
+	/* new dentry's ctime may have changed due to hard-link counts */
+	unionfs_copy_attr_times(new_dentry->d_inode);
+
+out:
+	if (!new_dentry->d_inode)
+		d_drop(new_dentry);
+
+	kfree(name);
+	if (!err)
+		unionfs_postcopyup_setmnt(new_dentry);
+
+	unionfs_check_inode(dir);
+	unionfs_check_dentry(new_dentry);
+	unionfs_check_dentry(old_dentry);
+
+	unionfs_unlock_dentry(new_dentry);
+	unionfs_unlock_dentry(old_dentry);
+	unionfs_read_unlock(old_dentry->d_sb);
+
+	return err;
+}
+
+static int unionfs_symlink(struct inode *dir, struct dentry *dentry,
+			   const char *symname)
+{
+	int err = 0;
+	struct dentry *lower_dentry = NULL;
+	struct dentry *whiteout_dentry = NULL;
+	struct dentry *lower_dir_dentry = NULL;
+	umode_t mode;
+	int bindex = 0, bstart;
+	char *name = NULL;
+
+	unionfs_read_lock(dentry->d_sb);
+	unionfs_lock_dentry(dentry);
+
+	if (unlikely(dentry->d_inode &&
+		     !__unionfs_d_revalidate_chain(dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+
+	/* We start out in the leftmost branch. */
+	bstart = dbstart(dentry);
+
+	lower_dentry = unionfs_lower_dentry(dentry);
+
+	/*
+	 * check if whiteout exists in this branch, i.e. lookup .wh.foo
+	 * first. If present, delete it
+	 */
+	name = alloc_whname(dentry->d_name.name, dentry->d_name.len);
+	if (unlikely(IS_ERR(name))) {
+		err = PTR_ERR(name);
+		goto out;
+	}
+
+	whiteout_dentry =
+		lookup_one_len(name, lower_dentry->d_parent,
+			       dentry->d_name.len + UNIONFS_WHLEN);
+	if (IS_ERR(whiteout_dentry)) {
+		err = PTR_ERR(whiteout_dentry);
+		goto out;
+	}
+
+	if (!whiteout_dentry->d_inode) {
+		dput(whiteout_dentry);
+		whiteout_dentry = NULL;
+	} else {
+		/*
+		 * found a .wh.foo entry, unlink it and then call
+		 * vfs_symlink().
+		 */
+		lower_dir_dentry = lock_parent(whiteout_dentry);
+
+		err = is_robranch_super(dentry->d_sb, bstart);
+		if (!err)
+			err = vfs_unlink(lower_dir_dentry->d_inode,
+					 whiteout_dentry);
+		dput(whiteout_dentry);
+
+		fsstack_copy_attr_times(dir, lower_dir_dentry->d_inode);
+		/* propagate number of hard-links */
+		dir->i_nlink = unionfs_get_nlinks(dir);
+
+		unlock_dir(lower_dir_dentry);
+
+		if (err) {
+			/* exit if the error returned was NOT -EROFS */
+			if (!IS_COPYUP_ERR(err))
+				goto out;
+			/*
+			 * should now try to create the symlink in another
+			 * branch.
+			 */
+			bstart--;
+		}
+	}
+
+	/*
+	 * deleted whiteout if it was present, now do a normal vfs_symlink()
+	 * with possible recursive directory creation
+	 */
+	for (bindex = bstart; bindex >= 0; bindex--) {
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+		if (!lower_dentry) {
+			/*
+			 * if lower_dentry is NULL, create the entire
+			 * dentry directory structure in branch 'bindex'.
+			 * lower_dentry will NOT be null when bindex ==
+			 * bstart because lookup passed as a negative
+			 * unionfs dentry pointing to a lone negative
+			 * underlying dentry
+			 */
+			lower_dentry = create_parents(dir, dentry,
+						      dentry->d_name.name,
+						      bindex);
+			if (!lower_dentry || IS_ERR(lower_dentry)) {
+				if (IS_ERR(lower_dentry))
+					err = PTR_ERR(lower_dentry);
+				if (!IS_COPYUP_ERR(err))
+					printk(KERN_ERR
+					       "unionfs: create_parents for "
+					       "symlink failed: bindex=%d "
+					       "err=%d\n", bindex, err);
+				continue;
+			}
+		}
+
+		lower_dir_dentry = lock_parent(lower_dentry);
+
+		err = is_robranch_super(dentry->d_sb, bindex);
+		if (!err) {
+			mode = S_IALLUGO;
+			err = vfs_symlink(lower_dir_dentry->d_inode,
+					  lower_dentry, symname, mode);
+		}
+		unlock_dir(lower_dir_dentry);
+
+		if (err || !lower_dentry->d_inode) {
+			/*
+			 * break out of for loop if error returned was NOT
+			 * -EROFS.
+			 */
+			if (!IS_COPYUP_ERR(err))
+				break;
+		} else {
+			/*
+			 * Only INTERPOSE_LOOKUP can return a value other
+			 * than 0 on err.
+			 */
+			err = PTR_ERR(unionfs_interpose(dentry,
+							dir->i_sb, 0));
+			if (!err) {
+				fsstack_copy_attr_times(dir,
+							lower_dir_dentry->
+							d_inode);
+				fsstack_copy_inode_size(dir,
+							lower_dir_dentry->
+							d_inode);
+				/*
+				 * update number of links on parent
+				 * directory.
+				 */
+				dir->i_nlink = unionfs_get_nlinks(dir);
+			}
+			break;
+		}
+	}
+
+out:
+	if (!dentry->d_inode)
+		d_drop(dentry);
+
+	kfree(name);
+	if (!err)
+		unionfs_postcopyup_setmnt(dentry);
+
+	unionfs_check_inode(dir);
+	unionfs_check_dentry(dentry);
+	unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(dentry->d_sb);
+
+	return err;
+}
+
+static int unionfs_mkdir(struct inode *parent, struct dentry *dentry, int mode)
+{
+	int err = 0;
+	struct dentry *lower_dentry = NULL, *whiteout_dentry = NULL;
+	struct dentry *lower_parent_dentry = NULL;
+	int bindex = 0, bstart;
+	char *name = NULL;
+	int whiteout_unlinked = 0;
+	struct sioq_args args;
+
+	unionfs_read_lock(dentry->d_sb);
+	unionfs_lock_dentry(dentry);
+
+	if (unlikely(dentry->d_inode &&
+		     !__unionfs_d_revalidate_chain(dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+
+	bstart = dbstart(dentry);
+
+	lower_dentry = unionfs_lower_dentry(dentry);
+
+	/*
+	 * check if whiteout exists in this branch, i.e. lookup .wh.foo
+	 * first.
+	 */
+	name = alloc_whname(dentry->d_name.name, dentry->d_name.len);
+	if (unlikely(IS_ERR(name))) {
+		err = PTR_ERR(name);
+		goto out;
+	}
+
+	whiteout_dentry = lookup_one_len(name, lower_dentry->d_parent,
+					 dentry->d_name.len + UNIONFS_WHLEN);
+	if (IS_ERR(whiteout_dentry)) {
+		err = PTR_ERR(whiteout_dentry);
+		goto out;
+	}
+
+	if (!whiteout_dentry->d_inode) {
+		dput(whiteout_dentry);
+		whiteout_dentry = NULL;
+	} else {
+		lower_parent_dentry = lock_parent(whiteout_dentry);
+
+		/* found a .wh.foo entry, remove it then do vfs_mkdir */
+		err = is_robranch_super(dentry->d_sb, bstart);
+		if (!err) {
+			args.unlink.parent = lower_parent_dentry->d_inode;
+			args.unlink.dentry = whiteout_dentry;
+			run_sioq(__unionfs_unlink, &args);
+			err = args.err;
+		}
+		dput(whiteout_dentry);
+
+		unlock_dir(lower_parent_dentry);
+
+		if (err) {
+			/* exit if the error returned was NOT -EROFS */
+			if (!IS_COPYUP_ERR(err))
+				goto out;
+			bstart--;
+		} else {
+			whiteout_unlinked = 1;
+		}
+	}
+
+	for (bindex = bstart; bindex >= 0; bindex--) {
+		int i;
+		int bend = dbend(dentry);
+
+		if (is_robranch_super(dentry->d_sb, bindex))
+			continue;
+
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+		if (!lower_dentry) {
+			lower_dentry = create_parents(parent, dentry,
+						      dentry->d_name.name,
+						      bindex);
+			if (!lower_dentry || IS_ERR(lower_dentry)) {
+				printk(KERN_ERR "unionfs: lower dentry "
+				       "NULL for bindex = %d\n", bindex);
+				continue;
+			}
+		}
+
+		lower_parent_dentry = lock_parent(lower_dentry);
+
+		if (IS_ERR(lower_parent_dentry)) {
+			err = PTR_ERR(lower_parent_dentry);
+			goto out;
+		}
+
+		err = vfs_mkdir(lower_parent_dentry->d_inode, lower_dentry,
+				mode);
+
+		unlock_dir(lower_parent_dentry);
+
+		/* did the mkdir succeed? */
+		if (err)
+			break;
+
+		for (i = bindex + 1; i < bend; i++) {
+			if (unionfs_lower_dentry_idx(dentry, i)) {
+				dput(unionfs_lower_dentry_idx(dentry, i));
+				unionfs_set_lower_dentry_idx(dentry, i, NULL);
+			}
+		}
+		set_dbend(dentry, bindex);
+
+		/*
+		 * Only INTERPOSE_LOOKUP can return a value other than 0 on
+		 * err.
+		 */
+		err = PTR_ERR(unionfs_interpose(dentry, parent->i_sb, 0));
+		if (!err) {
+			unionfs_copy_attr_times(parent);
+			fsstack_copy_inode_size(parent,
+						lower_parent_dentry->d_inode);
+
+			/* update number of links on parent directory */
+			parent->i_nlink = unionfs_get_nlinks(parent);
+		}
+
+		err = make_dir_opaque(dentry, dbstart(dentry));
+		if (err) {
+			printk(KERN_ERR "unionfs: mkdir: error creating "
+			       ".wh.__dir_opaque: %d\n", err);
+			goto out;
+		}
+
+		/* we are done! */
+		break;
+	}
+
+out:
+	if (!dentry->d_inode)
+		d_drop(dentry);
+
+	kfree(name);
+
+	if (!err) {
+		unionfs_copy_attr_times(dentry->d_inode);
+		unionfs_postcopyup_setmnt(dentry);
+	}
+	unionfs_check_inode(parent);
+	unionfs_check_dentry(dentry);
+	unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(dentry->d_sb);
+
+	return err;
+}
+
+static int unionfs_mknod(struct inode *dir, struct dentry *dentry, int mode,
+			 dev_t dev)
+{
+	int err = 0;
+	struct dentry *lower_dentry = NULL, *whiteout_dentry = NULL;
+	struct dentry *lower_parent_dentry = NULL;
+	int bindex = 0, bstart;
+	char *name = NULL;
+	int whiteout_unlinked = 0;
+
+	unionfs_read_lock(dentry->d_sb);
+	unionfs_lock_dentry(dentry);
+
+	if (unlikely(dentry->d_inode &&
+		     !__unionfs_d_revalidate_chain(dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+
+	bstart = dbstart(dentry);
+
+	lower_dentry = unionfs_lower_dentry(dentry);
+
+	/*
+	 * check if whiteout exists in this branch, i.e. lookup .wh.foo
+	 * first.
+	 */
+	name = alloc_whname(dentry->d_name.name, dentry->d_name.len);
+	if (unlikely(IS_ERR(name))) {
+		err = PTR_ERR(name);
+		goto out;
+	}
+
+	whiteout_dentry = lookup_one_len(name, lower_dentry->d_parent,
+					 dentry->d_name.len + UNIONFS_WHLEN);
+	if (IS_ERR(whiteout_dentry)) {
+		err = PTR_ERR(whiteout_dentry);
+		goto out;
+	}
+
+	if (!whiteout_dentry->d_inode) {
+		dput(whiteout_dentry);
+		whiteout_dentry = NULL;
+	} else {
+		/* found .wh.foo, unlink it */
+		lower_parent_dentry = lock_parent(whiteout_dentry);
+
+		/* found a .wh.foo entry, remove it then do vfs_mknod */
+		err = is_robranch_super(dentry->d_sb, bstart);
+		if (!err)
+			err = vfs_unlink(lower_parent_dentry->d_inode,
+					 whiteout_dentry);
+		dput(whiteout_dentry);
+
+		unlock_dir(lower_parent_dentry);
+
+		if (err) {
+			if (!IS_COPYUP_ERR(err))
+				goto out;
+			bstart--;
+		} else {
+			whiteout_unlinked = 1;
+		}
+	}
+
+	for (bindex = bstart; bindex >= 0; bindex--) {
+		if (is_robranch_super(dentry->d_sb, bindex))
+			continue;
+
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+		if (!lower_dentry) {
+			lower_dentry = create_parents(dir, dentry,
+						      dentry->d_name.name,
+						      bindex);
+			if (IS_ERR(lower_dentry)) {
+				printk(KERN_ERR "unionfs: failed to create "
+				       "parents on %d, err = %ld\n",
+				       bindex, PTR_ERR(lower_dentry));
+				continue;
+			}
+		}
+
+		lower_parent_dentry = lock_parent(lower_dentry);
+		if (IS_ERR(lower_parent_dentry)) {
+			err = PTR_ERR(lower_parent_dentry);
+			goto out;
+		}
+
+		err = vfs_mknod(lower_parent_dentry->d_inode,
+				lower_dentry, mode, dev);
+
+		if (err) {
+			unlock_dir(lower_parent_dentry);
+			break;
+		}
+
+		/*
+		 * Only INTERPOSE_LOOKUP can return a value other than 0 on
+		 * err.
+		 */
+		err = PTR_ERR(unionfs_interpose(dentry, dir->i_sb, 0));
+		if (!err) {
+			fsstack_copy_attr_times(dir,
+						lower_parent_dentry->d_inode);
+			fsstack_copy_inode_size(dir,
+						lower_parent_dentry->d_inode);
+			/* update number of links on parent directory */
+			dir->i_nlink = unionfs_get_nlinks(dir);
+		}
+		unlock_dir(lower_parent_dentry);
+
+		break;
+	}
+
+out:
+	if (!dentry->d_inode)
+		d_drop(dentry);
+
+	kfree(name);
+
+	if (!err)
+		unionfs_postcopyup_setmnt(dentry);
+
+	unionfs_check_inode(dir);
+	unionfs_check_dentry(dentry);
+	unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(dentry->d_sb);
+
+	return err;
+}
+
+static int unionfs_readlink(struct dentry *dentry, char __user *buf,
+			    int bufsiz)
+{
+	int err;
+	struct dentry *lower_dentry;
+
+	unionfs_read_lock(dentry->d_sb);
+	unionfs_lock_dentry(dentry);
+
+	if (unlikely(!__unionfs_d_revalidate_chain(dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+
+	lower_dentry = unionfs_lower_dentry(dentry);
+
+	if (!lower_dentry->d_inode->i_op ||
+	    !lower_dentry->d_inode->i_op->readlink) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	err = lower_dentry->d_inode->i_op->readlink(lower_dentry,
+						    buf, bufsiz);
+	if (err > 0)
+		fsstack_copy_attr_atime(dentry->d_inode,
+					lower_dentry->d_inode);
+
+out:
+	unionfs_check_dentry(dentry);
+	unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(dentry->d_sb);
+
+	return err;
+}
+
+/*
+ * unionfs_follow_link takes a dentry, but it is simple.  It only needs to
+ * allocate some memory and then call our ->readlink method.  Our
+ * unionfs_readlink *does* lock our dentry and revalidate the dentry.
+ * Therefore, we do not have to lock our dentry here, to prevent a deadlock;
+ * nor do we need to revalidate it either.  It is safe to not lock our
+ * dentry here, nor revalidate it, because unionfs_follow_link does not do
+ * anything (prior to calling ->readlink) which could become inconsistent
+ * due to branch management.
+ */
+static void *unionfs_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+	char *buf;
+	int len = PAGE_SIZE, err;
+	mm_segment_t old_fs;
+
+	unionfs_read_lock(dentry->d_sb);
+
+	/* This is freed by the put_link method assuming a successful call. */
+	buf = kmalloc(len, GFP_KERNEL);
+	if (unlikely(!buf)) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	/* read the symlink, and then we will follow it */
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+	err = dentry->d_inode->i_op->readlink(dentry, (char __user *)buf, len);
+	set_fs(old_fs);
+	if (err < 0) {
+		kfree(buf);
+		buf = NULL;
+		goto out;
+	}
+	buf[err] = 0;
+	nd_set_link(nd, buf);
+	err = 0;
+
+out:
+	unionfs_check_dentry(dentry);
+	unionfs_check_nd(nd);
+	unionfs_read_unlock(dentry->d_sb);
+	return ERR_PTR(err);
+}
+
+/* FIXME: We may not have to lock here */
+static void unionfs_put_link(struct dentry *dentry, struct nameidata *nd,
+			     void *cookie)
+{
+	unionfs_read_lock(dentry->d_sb);
+
+	unionfs_lock_dentry(dentry);
+	if (unlikely(!__unionfs_d_revalidate_chain(dentry, nd, false)))
+		printk(KERN_ERR
+		       "unionfs: put_link failed to revalidate dentry\n");
+
+	unionfs_check_dentry(dentry);
+	unionfs_check_nd(nd);
+	kfree(nd_get_link(nd));
+	unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(dentry->d_sb);
+}
+
+/*
+ * Don't grab the superblock read-lock in unionfs_permission, which prevents
+ * a deadlock with the branch-management "add branch" code (which grabbed
+ * the write lock).  It is safe to not grab the read lock here, because even
+ * with branch management taking place, there is no chance that
+ * unionfs_permission, or anything it calls, will use stale branch
+ * information.
+ */
+static int unionfs_permission(struct inode *inode, int mask,
+			      struct nameidata *nd)
+{
+	struct inode *lower_inode = NULL;
+	int err = 0;
+	int bindex, bstart, bend;
+	const int is_file = !S_ISDIR(inode->i_mode);
+	const int write_mask = (mask & MAY_WRITE) && !(mask & MAY_READ);
+
+	bstart = ibstart(inode);
+	bend = ibend(inode);
+	if (unlikely(bstart < 0 || bend < 0)) {
+		/*
+		 * With branch-management, we can get a stale inode here.
+		 * If so, we return ESTALE back to link_path_walk, which
+		 * would discard the dcache entry and re-lookup the
+		 * dentry+inode.  This should be equivalent to issuing
+		 * __unionfs_d_revalidate_chain on nd.dentry here.
+		 */
+		if (is_file)	/* dirs can be unlinked but chdir'ed to */
+			err = -ESTALE;	/* force revalidate */
+		goto out;
+	}
+
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_inode = unionfs_lower_inode_idx(inode, bindex);
+		if (!lower_inode)
+			continue;
+
+		/*
+		 * check the condition for D-F-D underlying files/directories,
+		 * we don't have to check for files, if we are checking for
+		 * directories.
+		 */
+		if (!is_file && !S_ISDIR(lower_inode->i_mode))
+			continue;
+
+		/*
+		 * We check basic permissions, but we ignore any conditions
+		 * such as readonly file systems or branches marked as
+		 * readonly, because those conditions should lead to a
+		 * copyup taking place later on.
+		 */
+		err = permission(lower_inode, mask, nd);
+		if (err && bindex > 0) {
+			umode_t mode = lower_inode->i_mode;
+			if (is_robranch_super(inode->i_sb, bindex) &&
+			    (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
+				err = 0;
+			if (IS_COPYUP_ERR(err))
+				err = 0;
+		}
+
+		/*
+		 * The permissions are an intersection of the overall directory
+		 * permissions, so we fail if one fails.
+		 */
+		if (err)
+			goto out;
+
+		/* only the leftmost file matters. */
+		if (is_file || write_mask) {
+			if (is_file && write_mask) {
+				err = get_write_access(lower_inode);
+				if (!err)
+					put_write_access(lower_inode);
+			}
+			break;
+		}
+	}
+	/* sync times which may have changed (asynchronously) below */
+	unionfs_copy_attr_times(inode);
+
+out:
+	unionfs_check_inode(inode);
+	unionfs_check_nd(nd);
+	return err;
+}
+
+static int unionfs_setattr(struct dentry *dentry, struct iattr *ia)
+{
+	int err = 0;
+	struct dentry *lower_dentry;
+	struct inode *inode = NULL;
+	struct inode *lower_inode = NULL;
+	int bstart, bend, bindex;
+	int i;
+	int copyup = 0;
+
+	unionfs_read_lock(dentry->d_sb);
+	unionfs_lock_dentry(dentry);
+
+	if (unlikely(!__unionfs_d_revalidate_chain(dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+
+	bstart = dbstart(dentry);
+	bend = dbend(dentry);
+	inode = dentry->d_inode;
+
+	/*
+	 * mode change is for clearing setuid/setgid. Allow lower filesystem
+	 * to reinterpret it in its own way.
+	 */
+	if (ia->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID))
+		ia->ia_valid &= ~ATTR_MODE;
+
+	for (bindex = bstart; (bindex <= bend) || (bindex == bstart);
+	     bindex++) {
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+		if (!lower_dentry)
+			continue;
+		BUG_ON(lower_dentry->d_inode == NULL);
+
+		/* If the file is on a read only branch */
+		if (is_robranch_super(dentry->d_sb, bindex)
+		    || IS_RDONLY(lower_dentry->d_inode)) {
+			if (copyup || (bindex != bstart))
+				continue;
+			/* Only if it's the leftmost file, copy up the file */
+			for (i = bstart - 1; i >= 0; i--) {
+				loff_t size = i_size_read(dentry->d_inode);
+				if (ia->ia_valid & ATTR_SIZE)
+					size = ia->ia_size;
+				err = copyup_dentry(dentry->d_parent->d_inode,
+						    dentry, bstart, i,
+						    dentry->d_name.name,
+						    dentry->d_name.len,
+						    NULL, size);
+
+				if (!err) {
+					copyup = 1;
+					lower_dentry =
+						unionfs_lower_dentry(dentry);
+					break;
+				}
+				/*
+				 * if error is in the leftmost branch, pass
+				 * it up.
+				 */
+				if (i == 0)
+					goto out;
+			}
+
+		}
+		err = notify_change(lower_dentry, ia);
+		if (err)
+			goto out;
+		break;
+	}
+
+	/* for mmap */
+	if (ia->ia_valid & ATTR_SIZE) {
+		if (ia->ia_size != i_size_read(inode)) {
+			err = vmtruncate(inode, ia->ia_size);
+			if (err)
+				printk(KERN_ERR
+				       "unionfs: setattr: vmtruncate failed\n");
+		}
+	}
+
+	/* get the size from the first lower inode */
+	lower_inode = unionfs_lower_inode(inode);
+	unionfs_copy_attr_all(inode, lower_inode);
+	/*
+	 * unionfs_copy_attr_all will copy the lower times to our inode if
+	 * the lower ones are newer (useful for cache coherency).  However,
+	 * ->setattr is the only place in which we may have to copy the
+	 * lower inode times absolutely, to support utimes(2).
+	 */
+	if (ia->ia_valid & ATTR_MTIME_SET)
+		inode->i_mtime = lower_inode->i_mtime;
+	if (ia->ia_valid & ATTR_CTIME)
+		inode->i_ctime = lower_inode->i_ctime;
+	if (ia->ia_valid & ATTR_ATIME_SET)
+		inode->i_atime = lower_inode->i_atime;
+	fsstack_copy_inode_size(inode, lower_inode);
+	/* if setattr succeeded, then parent dir may have changed */
+	unionfs_copy_attr_times(dentry->d_parent->d_inode);
+out:
+	unionfs_check_dentry(dentry);
+	unionfs_check_dentry(dentry->d_parent);
+	unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(dentry->d_sb);
+
+	return err;
+}
+
+struct inode_operations unionfs_symlink_iops = {
+	.readlink	= unionfs_readlink,
+	.permission	= unionfs_permission,
+	.follow_link	= unionfs_follow_link,
+	.setattr	= unionfs_setattr,
+	.put_link	= unionfs_put_link,
+};
+
+struct inode_operations unionfs_dir_iops = {
+	.create		= unionfs_create,
+	.lookup		= unionfs_lookup,
+	.link		= unionfs_link,
+	.unlink		= unionfs_unlink,
+	.symlink	= unionfs_symlink,
+	.mkdir		= unionfs_mkdir,
+	.rmdir		= unionfs_rmdir,
+	.mknod		= unionfs_mknod,
+	.rename		= unionfs_rename,
+	.permission	= unionfs_permission,
+	.setattr	= unionfs_setattr,
+#ifdef CONFIG_UNION_FS_XATTR
+	.setxattr	= unionfs_setxattr,
+	.getxattr	= unionfs_getxattr,
+	.removexattr	= unionfs_removexattr,
+	.listxattr	= unionfs_listxattr,
+#endif /* CONFIG_UNION_FS_XATTR */
+};
+
+struct inode_operations unionfs_main_iops = {
+	.permission	= unionfs_permission,
+	.setattr	= unionfs_setattr,
+#ifdef CONFIG_UNION_FS_XATTR
+	.setxattr	= unionfs_setxattr,
+	.getxattr	= unionfs_getxattr,
+	.removexattr	= unionfs_removexattr,
+	.listxattr	= unionfs_listxattr,
+#endif /* CONFIG_UNION_FS_XATTR */
+};
-- 
1.5.2.2
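
For reviewers following the create paths in this patch (unionfs_mkdir and
unionfs_mknod), here is a minimal user-space sketch of two ideas they share:
building the whiteout name for an entry, and walking leftward from the
starting branch to the first writable one.  This is not kernel code and is
not taken from the patch.  The helper names wh_name and
first_writable_branch are hypothetical, the int array merely stands in for
is_robranch_super(), and the ".wh." prefix is an assumption inferred from
the ".wh.foo" comments in the code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed whiteout prefix, inferred from the ".wh.foo" comments. */
#define WH_PREFIX	".wh."
#define WH_PREFIX_LEN	(sizeof(WH_PREFIX) - 1)

/*
 * Hypothetical helper: return a newly allocated ".wh.<name>" string, or
 * NULL on allocation failure.  The caller frees the result.
 */
static char *wh_name(const char *name, size_t len)
{
	char *buf;

	assert(name != NULL);
	buf = malloc(WH_PREFIX_LEN + len + 1);
	if (!buf)
		return NULL;
	memcpy(buf, WH_PREFIX, WH_PREFIX_LEN);
	memcpy(buf + WH_PREFIX_LEN, name, len);
	buf[WH_PREFIX_LEN + len] = '\0';
	return buf;
}

/*
 * Hypothetical helper: starting at branch 'bstart' and moving left
 * (towards index 0, the highest-priority branch), return the index of
 * the first branch that is not read-only, or -1 if none is writable.
 * readonly[i] != 0 stands in for a failing is_robranch_super(sb, i).
 */
static int first_writable_branch(const int *readonly, int bstart)
{
	int bindex;

	assert(readonly != NULL);
	for (bindex = bstart; bindex >= 0; bindex--) {
		if (!readonly[bindex])
			continue_search: ;
		if (!readonly[bindex])
			return bindex;
	}
	return -1;
}
```

With branches {rw, ro, ro} and bstart = 2, the walk skips the two read-only
branches and settles on branch 0, which mirrors the "continue" on
is_robranch_super() in the loops of this patch.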



* [PATCH 22/42] Unionfs: unlink/rmdir operations
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (20 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 21/42] Unionfs: inode operations Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 23/42] Unionfs: address-space operations Erez Zadok
                   ` (20 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/unlink.c |  236 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 236 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/unlink.c

diff --git a/fs/unionfs/unlink.c b/fs/unionfs/unlink.c
new file mode 100644
index 0000000..423ff36
--- /dev/null
+++ b/fs/unionfs/unlink.c
@@ -0,0 +1,236 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+/* unlink a file by creating a whiteout */
+static int unionfs_unlink_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	struct dentry *lower_dentry;
+	struct dentry *lower_dir_dentry;
+	int bindex;
+	int err = 0;
+
+	err = unionfs_partial_lookup(dentry);
+	if (err)
+		goto out;
+
+	bindex = dbstart(dentry);
+
+	lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+	if (!lower_dentry)
+		goto out;
+
+	lower_dir_dentry = lock_parent(lower_dentry);
+
+	/* avoid destroying the lower inode if the file is in use */
+	dget(lower_dentry);
+	err = is_robranch_super(dentry->d_sb, bindex);
+	if (!err)
+		err = vfs_unlink(lower_dir_dentry->d_inode, lower_dentry);
+	/* if vfs_unlink succeeded, update our inode's times */
+	if (!err)
+		unionfs_copy_attr_times(dentry->d_inode);
+	dput(lower_dentry);
+	fsstack_copy_attr_times(dir, lower_dir_dentry->d_inode);
+	unlock_dir(lower_dir_dentry);
+
+	if (err && !IS_COPYUP_ERR(err))
+		goto out;
+
+	/*
+	 * We create whiteouts if (1) there was an error unlinking the main
+	 * file; (2) there is a lower priority file with the same name
+	 * (dbopaque); (3) the branch in which the file resides is not the
+	 * last (rightmost) branch.  The last rule is an optimization to
+	 * avoid creating whiteouts that could not mask anything in a
+	 * lower-priority branch, including when unionfs is used with only
+	 * one branch (using only one branch, while odd, is still
+	 * possible).
+	 */
+	if (err) {
+		if (dbstart(dentry) == 0)
+			goto out;
+		err = create_whiteout(dentry, dbstart(dentry) - 1);
+	} else if (dbopaque(dentry) != -1) {
+		err = create_whiteout(dentry, dbopaque(dentry));
+	} else if (dbstart(dentry) < sbend(dentry->d_sb)) {
+		err = create_whiteout(dentry, dbstart(dentry));
+	}
+
+out:
+	if (!err)
+		dentry->d_inode->i_nlink--;
+
+	/* We don't want to leave negative leftover dentries for revalidate. */
+	if (!err && (dbopaque(dentry) != -1))
+		update_bstart(dentry);
+
+	return err;
+}
+
+int unionfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	int err = 0;
+
+	unionfs_read_lock(dentry->d_sb);
+	unionfs_lock_dentry(dentry);
+
+	if (unlikely(!__unionfs_d_revalidate_chain(dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+	unionfs_check_dentry(dentry);
+
+	err = unionfs_unlink_whiteout(dir, dentry);
+	/* call d_drop so the system "forgets" about us */
+	if (!err) {
+		if (!S_ISDIR(dentry->d_inode->i_mode))
+			unionfs_postcopyup_release(dentry);
+		d_drop(dentry);
+		/*
+		 * if unlink/whiteout succeeded, parent dir mtime has
+		 * changed
+		 */
+		unionfs_copy_attr_times(dir);
+	}
+
+out:
+	if (!err) {
+		unionfs_check_dentry(dentry);
+		unionfs_check_inode(dir);
+	}
+	unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(dentry->d_sb);
+	return err;
+}
+
+static int unionfs_rmdir_first(struct inode *dir, struct dentry *dentry,
+			       struct unionfs_dir_state *namelist)
+{
+	int err;
+	struct dentry *lower_dentry;
+	struct dentry *lower_dir_dentry = NULL;
+
+	/* Here we need to remove whiteout entries. */
+	err = delete_whiteouts(dentry, dbstart(dentry), namelist);
+	if (err)
+		goto out;
+
+	lower_dentry = unionfs_lower_dentry(dentry);
+
+	lower_dir_dentry = lock_parent(lower_dentry);
+
+	/* avoid destroying the lower inode if the file is in use */
+	dget(lower_dentry);
+	err = is_robranch(dentry);
+	if (!err)
+		err = vfs_rmdir(lower_dir_dentry->d_inode, lower_dentry);
+	dput(lower_dentry);
+
+	fsstack_copy_attr_times(dir, lower_dir_dentry->d_inode);
+	/* propagate number of hard-links */
+	dentry->d_inode->i_nlink = unionfs_get_nlinks(dentry->d_inode);
+
+out:
+	if (lower_dir_dentry)
+		unlock_dir(lower_dir_dentry);
+	return err;
+}
+
+int unionfs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	int err = 0;
+	struct unionfs_dir_state *namelist = NULL;
+	int dstart, dend;
+
+	unionfs_read_lock(dentry->d_sb);
+	unionfs_lock_dentry(dentry);
+
+	if (unlikely(!__unionfs_d_revalidate_chain(dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+	unionfs_check_dentry(dentry);
+
+	/* check if this unionfs directory is empty or not */
+	err = check_empty(dentry, &namelist);
+	if (err)
+		goto out;
+
+	err = unionfs_rmdir_first(dir, dentry, namelist);
+	dstart = dbstart(dentry);
+	dend = dbend(dentry);
+	/*
+	 * We create a whiteout for the directory if there was an error
+	 * removing the first directory entry in the union.  Otherwise, we
+	 * create a whiteout only if there is a chance that a lower
+	 * priority branch might also have the same named directory.  IOW,
+	 * if there is not another same-named directory at a lower priority
+	 * branch, then we don't need to create a whiteout for it.
+	 */
+	if (!err) {
+		if (dstart < dend)
+			err = create_whiteout(dentry, dstart);
+	} else {
+		int new_err;
+
+		if (dstart == 0)
+			goto out;
+
+		/* exit if the error returned was NOT -EROFS */
+		if (!IS_COPYUP_ERR(err))
+			goto out;
+
+		new_err = create_whiteout(dentry, dstart - 1);
+		if (new_err != -EEXIST)
+			err = new_err;
+	}
+
+out:
+	/*
+	 * Drop references to lower dentry/inode so storage space for them
+	 * can be reclaimed.  Then, call d_drop so the system "forgets"
+	 * about us.
+	 */
+	if (!err) {
+		struct inode *inode = dentry->d_inode;
+		BUG_ON(!inode);
+		iput(unionfs_lower_inode_idx(inode, dstart));
+		unionfs_set_lower_inode_idx(inode, dstart, NULL);
+		dput(unionfs_lower_dentry_idx(dentry, dstart));
+		unionfs_set_lower_dentry_idx(dentry, dstart, NULL);
+		/*
+		 * If the last directory is unlinked, then mark istart/end
+		 * as -1, (to maintain the invariant that if there are no
+		 * lower objects, then branch index start and end are set to
+		 * -1).
+		 */
+		if (!unionfs_lower_inode_idx(inode, dstart) &&
+		    !unionfs_lower_inode_idx(inode, dend))
+			ibstart(inode) = ibend(inode) = -1;
+		d_drop(dentry);
+	}
+
+	if (namelist)
+		free_rdstate(namelist);
+
+	unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(dentry->d_sb);
+	return err;
+}
-- 
1.5.2.2
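
The whiteout rules in unionfs_unlink_whiteout can be hard to follow from
the if/else chain alone, so here is a small user-space model of the
decision.  It is only a sketch under stated assumptions: the function name
whiteout_branch and the WH_NONE/WH_FAIL constants are hypothetical, and
"unlink_failed" means the lower unlink failed with a copyup-class error
(e.g. a read-only branch); any other error makes the real code bail out
before reaching this point.

```c
#include <assert.h>

/* Hypothetical result codes for the model below. */
enum {
	WH_NONE = -1,	/* no whiteout is needed */
	WH_FAIL = -2,	/* no branch to the left can hold a whiteout */
};

/*
 * Return the branch index in which a whiteout would be created, or
 * WH_NONE / WH_FAIL.  The parameters mirror dbstart(dentry),
 * dbopaque(dentry) and sbend(dentry->d_sb) from the patch.
 */
static int whiteout_branch(int unlink_failed, int dbstart, int dbopaque,
			   int sbend)
{
	assert(dbstart >= 0 && dbstart <= sbend);

	if (unlink_failed) {
		/* Rule 1: mask the file from the branch to its left. */
		if (dbstart == 0)
			return WH_FAIL;
		return dbstart - 1;
	}
	/* Rule 2: an opaque directory marks where masking must happen. */
	if (dbopaque != -1)
		return dbopaque;
	/* Rule 3: only needed if some branch lies to the right. */
	if (dbstart < sbend)
		return dbstart;
	return WH_NONE;
}
```

For example, a successful unlink in the rightmost branch of a three-branch
union (dbstart = 2, sbend = 2) needs no whiteout, while the same unlink in
branch 0 creates one in branch 0.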



* [PATCH 23/42] Unionfs: address-space operations
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (21 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 22/42] Unionfs: unlink/rmdir operations Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 24/42] Unionfs: mount-time and stacking-interposition functions Erez Zadok
                   ` (19 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Includes writepage, writepages, readpage, prepare_write, and commit_write.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/mmap.c |  338 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 338 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/mmap.c

diff --git a/fs/unionfs/mmap.c b/fs/unionfs/mmap.c
new file mode 100644
index 0000000..4d05352
--- /dev/null
+++ b/fs/unionfs/mmap.c
@@ -0,0 +1,338 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2006      Shaya Potter
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
+{
+	int err = -EIO;
+	struct inode *inode;
+	struct inode *lower_inode;
+	struct page *lower_page;
+	struct address_space *lower_mapping; /* lower inode mapping */
+	gfp_t mask;
+
+	BUG_ON(!PageUptodate(page));
+	inode = page->mapping->host;
+	lower_inode = unionfs_lower_inode(inode);
+	lower_mapping = lower_inode->i_mapping;
+
+	/*
+	 * find lower page (returns a locked page)
+	 *
+	 * We turn off __GFP_FS while we look for or create a new lower
+	 * page.  This prevents a recursion into the file system code, which
+	 * under memory pressure conditions could lead to a deadlock.  This
+	 * is similar to how the loop driver behaves (see loop_set_fd in
+	 * drivers/block/loop.c).  If we can't find the lower page, we
+	 * redirty our page and return "success" so that the VM will call us
+	 * again in the (hopefully near) future.
+	 */
+	mask = mapping_gfp_mask(lower_mapping) & ~(__GFP_FS);
+	lower_page = find_or_create_page(lower_mapping, page->index, mask);
+	if (!lower_page) {
+		err = 0;
+		set_page_dirty(page);
+		goto out;
+	}
+
+	/* copy page data from our upper page to the lower page */
+	copy_highpage(lower_page, page);
+	flush_dcache_page(lower_page);
+	SetPageUptodate(lower_page);
+	set_page_dirty(lower_page);
+
+	/*
+	 * Call lower writepage (expects locked page).  However, if we are
+	 * called with wbc->for_reclaim, then the VFS/VM just wants to
+	 * reclaim our page.  Therefore, we don't need to call the lower
+	 * ->writepage: just copy our data to the lower page (already done
+	 * above), then mark the lower page dirty and unlock it, and return
+	 * success.
+	 */
+	if (wbc->for_reclaim) {
+		unlock_page(lower_page);
+		goto out_release;
+	}
+
+	BUG_ON(!lower_mapping->a_ops->writepage);
+	wait_on_page_writeback(lower_page); /* prevent multiple writers */
+	clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
+	err = lower_mapping->a_ops->writepage(lower_page, wbc);
+	if (err < 0)
+		goto out_release;
+
+	/*
+	 * Lower file systems such as ramfs and tmpfs, may return
+	 * AOP_WRITEPAGE_ACTIVATE so that the VM won't try to (pointlessly)
+	 * write the page again for a while.  But those lower file systems
+	 * also set the page dirty bit back again.  Since we successfully
+	 * copied our page data to the lower page, then the VM will come
+	 * back to the lower page (directly) and try to flush it.  So we can
+	 * save the VM the hassle of coming back to our page and trying to
+	 * flush too.  Therefore, we don't re-dirty our own page, and we
+	 * never return AOP_WRITEPAGE_ACTIVATE back to the VM (we consider
+	 * this a success).
+	 *
+	 * We also unlock the lower page if the lower ->writepage returned
+	 * AOP_WRITEPAGE_ACTIVATE.  (This "anomalous" behaviour may be
+	 * addressed in future shmem/VM code.)
+	 */
+	if (err == AOP_WRITEPAGE_ACTIVATE) {
+		err = 0;
+		unlock_page(lower_page);
+	}
+
+	/* all is well */
+
+	/* lower mtimes have changed: update ours */
+	unionfs_copy_attr_times(inode);
+
+out_release:
+	/* b/c find_or_create_page increased refcnt */
+	page_cache_release(lower_page);
+out:
+	/*
+	 * We unlock our page unconditionally, because we never return
+	 * AOP_WRITEPAGE_ACTIVATE.
+	 */
+	unlock_page(page);
+	return err;
+}
+
+static int unionfs_writepages(struct address_space *mapping,
+			      struct writeback_control *wbc)
+{
+	int err = 0;
+	struct inode *lower_inode;
+	struct inode *inode;
+
+	inode = mapping->host;
+	if (ibstart(inode) < 0 && ibend(inode) < 0)
+		goto out;
+	lower_inode = unionfs_lower_inode(inode);
+	if (!lower_inode)
+		goto out;
+
+	if (!mapping_cap_writeback_dirty(lower_inode->i_mapping))
+		goto out;
+
+	err = generic_writepages(mapping, wbc);
+	if (!err)
+		unionfs_copy_attr_times(inode);
+out:
+	return err;
+}
+
+/* Readpage expects a locked page, and must unlock it */
+static int unionfs_readpage(struct file *file, struct page *page)
+{
+	int err;
+	struct file *lower_file;
+	struct inode *inode;
+	mm_segment_t old_fs;
+	char *page_data = NULL;
+	mode_t orig_mode;
+
+	unionfs_read_lock(file->f_path.dentry->d_sb);
+	err = unionfs_file_revalidate(file, false);
+	if (unlikely(err))
+		goto out;
+	unionfs_check_file(file);
+
+	if (!UNIONFS_F(file)) {
+		err = -ENOENT;
+		goto out;
+	}
+
+	lower_file = unionfs_lower_file(file);
+	/* FIXME: is this assertion right here? */
+	BUG_ON(lower_file == NULL);
+
+	inode = file->f_path.dentry->d_inode;
+
+	page_data = (char *) kmap(page);
+	/*
+	 * Use vfs_read because some lower file systems don't have a
+	 * readpage method, and some file systems (esp. distributed ones)
+	 * don't like their pages to be accessed directly.  Using vfs_read
+	 * may be a little slower, but a lot safer, as the VFS does a lot of
+	 * the necessary magic for us.
+	 */
+	lower_file->f_pos = page_offset(page);
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+	/*
+	 * generic_file_splice_write may call us on a file not opened for
+	 * reading, so temporarily allow reading.
+	 */
+	orig_mode = lower_file->f_mode;
+	lower_file->f_mode |= FMODE_READ;
+	err = vfs_read(lower_file, page_data, PAGE_CACHE_SIZE,
+		       &lower_file->f_pos);
+	lower_file->f_mode = orig_mode;
+	set_fs(old_fs);
+	if (err >= 0 && err < PAGE_CACHE_SIZE)
+		memset(page_data + err, 0, PAGE_CACHE_SIZE - err);
+	kunmap(page);
+
+	if (err < 0)
+		goto out;
+	err = 0;
+
+	/* if vfs_read succeeded above, sync up our times */
+	unionfs_copy_attr_times(inode);
+
+	flush_dcache_page(page);
+
+	/*
+	 * We have to unlock our page, b/c we _might_ have gotten a locked
+	 * page.  But we no longer have to wake up waiters on our page
+	 * here, b/c unlock_page does it.
+	 */
+out:
+	if (err == 0)
+		SetPageUptodate(page);
+	else
+		ClearPageUptodate(page);
+
+	unlock_page(page);
+	unionfs_check_file(file);
+	unionfs_read_unlock(file->f_path.dentry->d_sb);
+
+	return err;
+}
+
+static int unionfs_prepare_write(struct file *file, struct page *page,
+				 unsigned from, unsigned to)
+{
+	int err;
+
+	unionfs_read_lock(file->f_path.dentry->d_sb);
+	/*
+	 * This is the only place where we unconditionally copy the lower
+	 * attribute times before calling unionfs_file_revalidate.  The
+	 * reason is that our ->write calls do_sync_write which in turn will
+	 * call our ->prepare_write and then ->commit_write.  Before our
+	 * ->write is called, the lower mtimes are in sync, but by the time
+	 * the VFS calls our ->commit_write, the lower mtimes have changed.
+	 * Therefore, the only reasonable time for us to sync up from the
+	 * changed lower mtimes, and avoid an invariant violation warning,
+	 * is here, in ->prepare_write.
+	 */
+	unionfs_copy_attr_times(file->f_path.dentry->d_inode);
+	err = unionfs_file_revalidate(file, true);
+	unionfs_check_file(file);
+	unionfs_read_unlock(file->f_path.dentry->d_sb);
+
+	return err;
+}
+
+static int unionfs_commit_write(struct file *file, struct page *page,
+				unsigned from, unsigned to)
+{
+	int err = -ENOMEM;
+	struct inode *inode, *lower_inode;
+	struct file *lower_file = NULL;
+	unsigned bytes = to - from;
+	char *page_data = NULL;
+	mm_segment_t old_fs;
+
+	BUG_ON(file == NULL);
+
+	unionfs_read_lock(file->f_path.dentry->d_sb);
+	err = unionfs_file_revalidate(file, true);
+	if (unlikely(err))
+		goto out;
+	unionfs_check_file(file);
+
+	inode = page->mapping->host;
+	lower_inode = unionfs_lower_inode(inode);
+
+	if (UNIONFS_F(file) != NULL)
+		lower_file = unionfs_lower_file(file);
+
+	/* FIXME: is this assertion right here? */
+	BUG_ON(lower_file == NULL);
+
+	page_data = (char *)kmap(page);
+	lower_file->f_pos = page_offset(page) + from;
+
+	/*
+	 * We use vfs_write instead of copying page data and the
+	 * prepare_write/commit_write combo because file systems like
+	 * GFS/OCFS2 don't like things touching those directly.
+	 * Calling the underlying write op, while a little bit slower, will
+	 * call all the FS-specific code as well.
+	 */
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+	err = vfs_write(lower_file, page_data + from, bytes,
+			&lower_file->f_pos);
+	set_fs(old_fs);
+
+	kunmap(page);
+
+	if (err < 0)
+		goto out;
+
+	/* if vfs_write succeeded above, sync up our times/sizes */
+	fsstack_copy_inode_size(inode, lower_inode);
+	unionfs_copy_attr_times(inode);
+	mark_inode_dirty_sync(inode);
+
+out:
+	if (err < 0)
+		ClearPageUptodate(page);
+
+	unionfs_check_file(file);
+	unionfs_read_unlock(file->f_path.dentry->d_sb);
+	return err;		/* assume all is ok */
+}
+
+/*
+ * Although unionfs isn't a block-based file system, it may stack on one.
+ * ->bmap is needed, for example, to swapon(2) files.
+ */
+sector_t unionfs_bmap(struct address_space *mapping, sector_t block)
+{
+	int err = -EINVAL;
+	struct inode *inode, *lower_inode;
+	sector_t (*bmap)(struct address_space *, sector_t);
+
+	inode = (struct inode *)mapping->host;
+	lower_inode = unionfs_lower_inode(inode);
+	if (!lower_inode)
+		goto out;
+	bmap = lower_inode->i_mapping->a_ops->bmap;
+	if (bmap)
+		err = bmap(lower_inode->i_mapping, block);
+out:
+	return err;
+}
+
+
+struct address_space_operations unionfs_aops = {
+	.writepage	= unionfs_writepage,
+	.writepages	= unionfs_writepages,
+	.readpage	= unionfs_readpage,
+	.prepare_write	= unionfs_prepare_write,
+	.commit_write	= unionfs_commit_write,
+	.bmap		= unionfs_bmap,
+};
-- 
1.5.2.2
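
One detail of unionfs_readpage that is easy to miss is the handling of a
short read: when the lower file returns fewer bytes than a full page, the
rest of the page is zeroed before it is marked up to date.  The user-space
sketch below models just that step.  It is an illustration, not the
patch's code: read_lower is a hypothetical stand-in for vfs_read on an
in-memory "file", fill_page is a hypothetical name, and PAGE_BYTES (4096)
is an assumed stand-in for PAGE_CACHE_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096	/* assumed page size for this model */

/*
 * Hypothetical stand-in for vfs_read: copy up to 'count' bytes starting
 * at 'pos' from an in-memory file and return how many were copied.
 */
static size_t read_lower(const char *file, size_t file_len, size_t pos,
			 char *dst, size_t count)
{
	size_t n;

	if (pos >= file_len)
		return 0;
	n = file_len - pos;
	if (n > count)
		n = count;
	memcpy(dst, file + pos, n);
	return n;
}

/*
 * Fill one page from the lower file and zero whatever the read did not
 * cover, so no stale data is exposed past end-of-file.
 */
static size_t fill_page(char *page, const char *file, size_t file_len,
			size_t page_index)
{
	size_t got;

	assert(page != NULL && file != NULL);
	got = read_lower(file, file_len, page_index * PAGE_BYTES,
			 page, PAGE_BYTES);
	if (got < PAGE_BYTES)
		memset(page + got, 0, PAGE_BYTES - got);
	return got;
}
```

Reading page 0 of a 5-byte file yields 5 data bytes followed by zeroes;
reading page 1 of the same file yields a fully zeroed page.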



* [PATCH 24/42] Unionfs: mount-time and stacking-interposition functions
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (22 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 23/42] Unionfs: address-space operations Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 25/42] Unionfs: super_block operations Erez Zadok
                   ` (18 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Includes read_super and module-linkage routines.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/main.c |  783 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 783 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/main.c
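
For readers who want to try the dirs= syntax outside the kernel, here is a
minimal user-space sketch of the splitting rules that parse_dirs_option()
and parse_branch_mode() apply in this patch.  It is an illustration only,
under these assumptions: sk_parse_dirs(), sk_parse_mode(), struct sk_branch
and the SK_MAY_* constants are hypothetical names, strchr() stands in for
the kernel's strsep(), and no path lookup, branch validation, or overlap
check is done.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical user-space stand-ins for the kernel's MAY_* bits. */
#define SK_MAY_WRITE	2
#define SK_MAY_READ	4

struct sk_branch {		/* hypothetical: one parsed branch */
	const char *path;
	int perms;
};

/* Same rule as parse_branch_mode(): NULL or "rw" -> rw, "ro" -> ro. */
static int sk_parse_mode(const char *name, int *perms)
{
	if (!name || !strcmp(name, "rw")) {
		*perms = SK_MAY_READ | SK_MAY_WRITE;
		return 0;
	}
	if (!strcmp(name, "ro")) {
		*perms = SK_MAY_READ;
		return 0;
	}
	return -EINVAL;
}

/*
 * Split a string such as "b1:b2=rw:b3=ro" in place.  Returns the number
 * of branches stored in out[], or -EINVAL on an empty string, an empty
 * field (extra colons), an unknown mode, a read-only leftmost branch, or
 * more than max branches.
 */
static int sk_parse_dirs(char *options, struct sk_branch *out, int max)
{
	char *name = options;
	int n = 0;

	assert(options && out);
	if (options[0] == '\0')
		return -EINVAL;

	while (name) {
		char *next = strchr(name, ':');
		char *mode;
		int perms;

		if (next)
			*next++ = '\0';	/* terminate this field */
		if (!*name || n == max)
			return -EINVAL;

		mode = strchr(name, '=');
		if (mode)
			*mode++ = '\0';	/* strip off "=mode" if any */
		if (sk_parse_mode(mode, &perms))
			return -EINVAL;
		/* the leftmost branch must be writeable */
		if (n == 0 && !(perms & SK_MAY_WRITE))
			return -EINVAL;

		out[n].path = name;
		out[n].perms = perms;
		n++;
		name = next;
	}
	return n;
}
```

Calling sk_parse_dirs() on a writable copy of "/a:/b=rw:/c=ro" fills three
entries, with the last one read-only; "/a=ro:/b" and "/a::/b" are both
rejected with -EINVAL.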

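The branch-overlap test in is_branch_overlap() walks d_parent links from
each dentry toward the root and reports an overlap if it meets the other
dentry on the way.  A minimal user-space sketch of that walk follows,
assuming a hypothetical struct sk_node in place of struct dentry (a root
node is its own parent, as with dentries); the sk_* names are not part of
unionfs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dentry: a root is its own parent. */
struct sk_node {
	struct sk_node *parent;
};

/* Return 1 if anc is start itself or one of its ancestors, else 0. */
static int sk_is_ancestor(const struct sk_node *anc,
			  const struct sk_node *start)
{
	const struct sk_node *n = start;

	assert(anc && start);
	/* climb until we hit anc or reach the root */
	while (n != anc && n->parent != n)
		n = n->parent;
	return n == anc;
}

/* Two branches overlap if either one contains the other. */
static int sk_is_overlap(const struct sk_node *a, const struct sk_node *b)
{
	return sk_is_ancestor(b, a) || sk_is_ancestor(a, b);
}
```

With nodes standing for /, /usr, /usr/lib and /var, the pair
(/usr, /usr/lib) overlaps in either argument order, (/usr, /var) does not,
and the root overlaps with everything beneath it.
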
diff --git a/fs/unionfs/main.c b/fs/unionfs/main.c
new file mode 100644
index 0000000..22aa6e6
--- /dev/null
+++ b/fs/unionfs/main.c
@@ -0,0 +1,783 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+
+static void unionfs_fill_inode(struct dentry *dentry,
+			       struct inode *inode)
+{
+	struct inode *lower_inode;
+	struct dentry *lower_dentry;
+	int bindex, bstart, bend;
+
+	bstart = dbstart(dentry);
+	bend = dbend(dentry);
+
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+		if (!lower_dentry) {
+			unionfs_set_lower_inode_idx(inode, bindex, NULL);
+			continue;
+		}
+
+		/* Initialize the lower inode to the new lower inode. */
+		if (!lower_dentry->d_inode)
+			continue;
+
+		unionfs_set_lower_inode_idx(inode, bindex,
+					    igrab(lower_dentry->d_inode));
+	}
+
+	ibstart(inode) = dbstart(dentry);
+	ibend(inode) = dbend(dentry);
+
+	/* Use attributes from the first branch. */
+	lower_inode = unionfs_lower_inode(inode);
+
+	/* Use different set of inode ops for symlinks & directories */
+	if (S_ISLNK(lower_inode->i_mode))
+		inode->i_op = &unionfs_symlink_iops;
+	else if (S_ISDIR(lower_inode->i_mode))
+		inode->i_op = &unionfs_dir_iops;
+
+	/* Use different set of file ops for directories */
+	if (S_ISDIR(lower_inode->i_mode))
+		inode->i_fop = &unionfs_dir_fops;
+
+	/* properly initialize special inodes */
+	if (S_ISBLK(lower_inode->i_mode) || S_ISCHR(lower_inode->i_mode) ||
+	    S_ISFIFO(lower_inode->i_mode) || S_ISSOCK(lower_inode->i_mode))
+		init_special_inode(inode, lower_inode->i_mode,
+				   lower_inode->i_rdev);
+
+	/* all well, copy inode attributes */
+	unionfs_copy_attr_all(inode, lower_inode);
+	fsstack_copy_inode_size(inode, lower_inode);
+}
+
+/*
+ * Connect a unionfs dentry/inode with several lower ones.  This is
+ * the classic stackable file system "vnode interposition" action.
+ *
+ * @sb: unionfs's super_block
+ */
+struct dentry *unionfs_interpose(struct dentry *dentry, struct super_block *sb,
+				 int flag)
+{
+	int err = 0;
+	struct inode *inode;
+	int is_negative_dentry = 1;
+	int bindex, bstart, bend;
+	int need_fill_inode = 1;
+	struct dentry *spliced = NULL;
+
+	verify_locked(dentry);
+
+	bstart = dbstart(dentry);
+	bend = dbend(dentry);
+
+	/* Make sure that we didn't get a negative dentry. */
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		if (unionfs_lower_dentry_idx(dentry, bindex) &&
+		    unionfs_lower_dentry_idx(dentry, bindex)->d_inode) {
+			is_negative_dentry = 0;
+			break;
+		}
+	}
+	BUG_ON(is_negative_dentry);
+
+	/*
+	 * We allocate our new inode below, by calling iget.
+	 * iget will call our read_inode which will initialize some
+	 * of the new inode's fields
+	 */
+
+	/*
+	 * On revalidate we've already got our own inode and just need
+	 * to fix it up.
+	 */
+	if (flag == INTERPOSE_REVAL) {
+		inode = dentry->d_inode;
+		UNIONFS_I(inode)->bstart = -1;
+		UNIONFS_I(inode)->bend = -1;
+		atomic_set(&UNIONFS_I(inode)->generation,
+			   atomic_read(&UNIONFS_SB(sb)->generation));
+
+		UNIONFS_I(inode)->lower_inodes =
+			kcalloc(sbmax(sb), sizeof(struct inode *), GFP_KERNEL);
+		if (unlikely(!UNIONFS_I(inode)->lower_inodes)) {
+			err = -ENOMEM;
+			goto out;
+		}
+	} else {
+		/* get unique inode number for unionfs */
+		inode = iget(sb, iunique(sb, UNIONFS_ROOT_INO));
+		if (!inode) {
+			err = -EACCES;
+			goto out;
+		}
+		if (atomic_read(&inode->i_count) > 1)
+			goto skip;
+	}
+
+	need_fill_inode = 0;
+	unionfs_fill_inode(dentry, inode);
+
+skip:
+	/* only (our) lookup wants to do a d_add */
+	switch (flag) {
+	case INTERPOSE_DEFAULT:
+	case INTERPOSE_REVAL_NEG:
+		d_instantiate(dentry, inode);
+		break;
+	case INTERPOSE_LOOKUP:
+		spliced = d_splice_alias(inode, dentry);
+		if (IS_ERR(spliced)) {
+			err = PTR_ERR(spliced);
+		} else if (spliced && spliced != dentry) {
+			/*
+			 * d_splice can return a dentry if it was
+			 * disconnected and had to be moved.  We must ensure
+			 * that the private data of the new dentry is
+			 * correct and that the inode info was filled
+			 * properly.  Finally we must return this new
+			 * dentry.
+			 */
+			spliced->d_op = &unionfs_dops;
+			spliced->d_fsdata = dentry->d_fsdata;
+			dentry->d_fsdata = NULL;
+			dentry = spliced;
+			if (need_fill_inode) {
+				need_fill_inode = 0;
+				unionfs_fill_inode(dentry, inode);
+			}
+			goto out_spliced;
+		}
+		break;
+	case INTERPOSE_REVAL:
+		/* Do nothing. */
+		break;
+	default:
+		printk(KERN_CRIT "unionfs: invalid interpose flag passed!\n");
+		BUG();
+	}
+	goto out;
+
+out_spliced:
+	if (!err)
+		return spliced;
+out:
+	return ERR_PTR(err);
+}
+
+/* like interpose above, but for an already existing dentry */
+void unionfs_reinterpose(struct dentry *dentry)
+{
+	struct dentry *lower_dentry;
+	struct inode *inode;
+	int bindex, bstart, bend;
+
+	verify_locked(dentry);
+
+	/* This is pre-allocated inode */
+	inode = dentry->d_inode;
+
+	bstart = dbstart(dentry);
+	bend = dbend(dentry);
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+		if (!lower_dentry)
+			continue;
+
+		if (!lower_dentry->d_inode)
+			continue;
+		if (unionfs_lower_inode_idx(inode, bindex))
+			continue;
+		unionfs_set_lower_inode_idx(inode, bindex,
+					    igrab(lower_dentry->d_inode));
+	}
+	ibstart(inode) = dbstart(dentry);
+	ibend(inode) = dbend(dentry);
+}
+
+/*
+ * make sure the branch we just looked up (nd) makes sense:
+ *
+ * 1) we're not trying to stack unionfs on top of unionfs
+ * 2) it exists
+ * 3) is a directory
+ */
+int check_branch(struct nameidata *nd)
+{
+	/* XXX: remove in ODF code -- stacking unions allowed there */
+	if (!strcmp(nd->dentry->d_sb->s_type->name, UNIONFS_NAME))
+		return -EINVAL;
+	if (!nd->dentry->d_inode)
+		return -ENOENT;
+	if (!S_ISDIR(nd->dentry->d_inode->i_mode))
+		return -ENOTDIR;
+	return 0;
+}
+
+/* checks if two lower_dentries have overlapping branches */
+static int is_branch_overlap(struct dentry *dent1, struct dentry *dent2)
+{
+	struct dentry *dent = NULL;
+
+	dent = dent1;
+	while ((dent != dent2) && (dent->d_parent != dent))
+		dent = dent->d_parent;
+
+	if (dent == dent2)
+		return 1;
+
+	dent = dent2;
+	while ((dent != dent1) && (dent->d_parent != dent))
+		dent = dent->d_parent;
+
+	return (dent == dent1);
+}
+
+/*
+ * Parse "ro" or "rw" options, but default to "rw" if no mode option was
+ * specified.  Fill the mode bits in @perms.  If we encounter an unknown
+ * string, return -EINVAL.  Otherwise return 0.
+ */
+int parse_branch_mode(const char *name, int *perms)
+{
+	if (!name || !strcmp(name, "rw")) {
+		*perms = MAY_READ | MAY_WRITE;
+		return 0;
+	}
+	if (!strcmp(name, "ro")) {
+		*perms = MAY_READ;
+		return 0;
+	}
+	return -EINVAL;
+}
+
+/*
+ * parse the dirs= mount argument
+ *
+ * We don't need to lock the superblock private data's rwsem, as we get
+ * called only by unionfs_read_super - it is still a long time before anyone
+ * can even get a reference to us.
+ */
+static int parse_dirs_option(struct super_block *sb, struct unionfs_dentry_info
+			     *lower_root_info, char *options)
+{
+	struct nameidata nd;
+	char *name;
+	int err = 0;
+	int branches = 1;
+	int bindex = 0;
+	int i = 0;
+	int j = 0;
+	struct dentry *dent1;
+	struct dentry *dent2;
+
+	if (options[0] == '\0') {
+		printk(KERN_ERR "unionfs: no branches specified\n");
+		err = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * Each colon is a separator.  This count is really just a rough
+	 * guess, since strsep will handle empty fields for us.
+	 */
+	for (i = 0; options[i]; i++)
+		if (options[i] == ':')
+			branches++;
+
+	/* allocate space for underlying pointers to lower dentry */
+	UNIONFS_SB(sb)->data =
+		kcalloc(branches, sizeof(struct unionfs_data), GFP_KERNEL);
+	if (unlikely(!UNIONFS_SB(sb)->data)) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	lower_root_info->lower_paths =
+		kcalloc(branches, sizeof(struct path), GFP_KERNEL);
+	if (unlikely(!lower_root_info->lower_paths)) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	/* now parsing a string such as "b1:b2=rw:b3=ro:b4" */
+	branches = 0;
+	while ((name = strsep(&options, ":")) != NULL) {
+		int perms;
+		char *mode = strchr(name, '=');
+
+		if (!name)
+			continue;
+		if (!*name) {	/* bad use of ':' (extra colons) */
+			err = -EINVAL;
+			goto out;
+		}
+
+		branches++;
+
+		/* strip off '=' if any */
+		if (mode)
+			*mode++ = '\0';
+
+		err = parse_branch_mode(mode, &perms);
+		if (err) {
+			printk(KERN_ERR "unionfs: invalid mode \"%s\" for "
+			       "branch %d\n", mode, bindex);
+			goto out;
+		}
+		/* ensure that leftmost branch is writeable */
+		if (!bindex && !(perms & MAY_WRITE)) {
+			printk(KERN_ERR "unionfs: leftmost branch cannot be "
+			       "read-only (use \"-o ro\" to create a "
+			       "read-only union)\n");
+			err = -EINVAL;
+			goto out;
+		}
+
+		err = path_lookup(name, LOOKUP_FOLLOW, &nd);
+		if (err) {
+			printk(KERN_ERR "unionfs: error accessing "
+			       "lower directory '%s' (error %d)\n",
+			       name, err);
+			goto out;
+		}
+
+		err = check_branch(&nd);
+		if (err) {
+			printk(KERN_ERR "unionfs: lower directory "
+			       "'%s' is not a valid branch\n", name);
+			path_release(&nd);
+			goto out;
+		}
+
+		lower_root_info->lower_paths[bindex].dentry = nd.dentry;
+		lower_root_info->lower_paths[bindex].mnt = nd.mnt;
+
+		set_branchperms(sb, bindex, perms);
+		set_branch_count(sb, bindex, 0);
+		new_branch_id(sb, bindex);
+
+		if (lower_root_info->bstart < 0)
+			lower_root_info->bstart = bindex;
+		lower_root_info->bend = bindex;
+		bindex++;
+	}
+
+	if (branches == 0) {
+		printk(KERN_ERR "unionfs: no branches specified\n");
+		err = -EINVAL;
+		goto out;
+	}
+
+	BUG_ON(branches != (lower_root_info->bend + 1));
+
+	/*
+	 * Ensure that no overlaps exist in the branches.
+	 *
+	 * This test is required because the Linux kernel has no support
+	 * currently for ensuring coherency between stackable layers and
+	 * branches.  If we were to allow overlapping branches, it would be
+	 * possible, for example, to delete a file via one branch, which
+	 * would not be reflected in another branch.  Such incoherency could
+	 * lead to inconsistencies and even kernel oopses.  Rather than
+	 * implement hacks to work around some of these cache-coherency
+	 * problems, we prevent branch overlapping, for now.  A complete
+	 * solution will involve proper kernel/VFS support for cache
+	 * coherency, at which time we could safely remove this
+	 * branch-overlapping test.
+	 */
+	for (i = 0; i < branches; i++) {
+		dent1 = lower_root_info->lower_paths[i].dentry;
+		for (j = i + 1; j < branches; j++) {
+			dent2 = lower_root_info->lower_paths[j].dentry;
+			if (is_branch_overlap(dent1, dent2)) {
+				printk(KERN_ERR "unionfs: branches %d and "
+				       "%d overlap\n", i, j);
+				err = -EINVAL;
+				goto out;
+			}
+		}
+	}
+
+out:
+	if (err) {
+		for (i = 0; lower_root_info->lower_paths && i < branches; i++)
+			if (lower_root_info->lower_paths[i].dentry) {
+				dput(lower_root_info->lower_paths[i].dentry);
+				/* initialize: can't use unionfs_mntput here */
+				mntput(lower_root_info->lower_paths[i].mnt);
+			}
+
+		kfree(lower_root_info->lower_paths);
+		kfree(UNIONFS_SB(sb)->data);
+
+		/*
+		 * MUST clear the pointers to prevent potential double free if
+		 * the caller dies later on
+		 */
+		lower_root_info->lower_paths = NULL;
+		UNIONFS_SB(sb)->data = NULL;
+	}
+	return err;
+}
+
+/*
+ * Parse mount options.  See the manual page for usage instructions.
+ *
+ * Returns the unionfs_dentry_info describing the lower directories
+ * (branches) on which to stack, or an ERR_PTR() on failure.
+ */
+static struct unionfs_dentry_info *unionfs_parse_options(
+					 struct super_block *sb,
+					 char *options)
+{
+	struct unionfs_dentry_info *lower_root_info;
+	char *optname;
+	int err = 0;
+	int bindex;
+	int dirsfound = 0;
+
+	/* allocate private data area */
+	err = -ENOMEM;
+	lower_root_info =
+		kzalloc(sizeof(struct unionfs_dentry_info), GFP_KERNEL);
+	if (unlikely(!lower_root_info))
+		goto out_error;
+	lower_root_info->bstart = -1;
+	lower_root_info->bend = -1;
+	lower_root_info->bopaque = -1;
+
+	while ((optname = strsep(&options, ",")) != NULL) {
+		char *optarg;
+		char *endptr;
+		int intval;
+
+		if (!optname || !*optname)
+			continue;
+
+		optarg = strchr(optname, '=');
+		if (optarg)
+			*optarg++ = '\0';
+
+		/*
+		 * All of our options take an argument now. Insert ones that
+		 * don't, above this check.
+		 */
+		if (!optarg) {
+			printk(KERN_ERR "unionfs: %s requires an argument\n",
+			       optname);
+			err = -EINVAL;
+			goto out_error;
+		}
+
+		if (!strcmp("dirs", optname)) {
+			if (++dirsfound > 1) {
+				printk(KERN_ERR
+				       "unionfs: multiple dirs specified\n");
+				err = -EINVAL;
+				goto out_error;
+			}
+			err = parse_dirs_option(sb, lower_root_info, optarg);
+			if (err)
+				goto out_error;
+			continue;
+		}
+
+		/* All of these options require an integer argument. */
+		intval = simple_strtoul(optarg, &endptr, 0);
+		if (*endptr) {
+			printk(KERN_ERR
+			       "unionfs: invalid %s option '%s'\n",
+			       optname, optarg);
+			err = -EINVAL;
+			goto out_error;
+		}
+
+		err = -EINVAL;
+		printk(KERN_ERR
+		       "unionfs: unrecognized option '%s'\n", optname);
+		goto out_error;
+	}
+	if (dirsfound != 1) {
+		printk(KERN_ERR "unionfs: dirs option required\n");
+		err = -EINVAL;
+		goto out_error;
+	}
+	goto out;
+
+out_error:
+	if (lower_root_info && lower_root_info->lower_paths) {
+		for (bindex = lower_root_info->bstart;
+		     bindex >= 0 && bindex <= lower_root_info->bend;
+		     bindex++) {
+			struct dentry *d;
+			struct vfsmount *m;
+
+			d = lower_root_info->lower_paths[bindex].dentry;
+			m = lower_root_info->lower_paths[bindex].mnt;
+
+			dput(d);
+			/* initializing: can't use unionfs_mntput here */
+			mntput(m);
+		}
+	}
+
+	kfree(lower_root_info ? lower_root_info->lower_paths : NULL);
+	kfree(lower_root_info);
+
+	kfree(UNIONFS_SB(sb)->data);
+	UNIONFS_SB(sb)->data = NULL;
+
+	lower_root_info = ERR_PTR(err);
+out:
+	return lower_root_info;
+}
+
+/*
+ * our custom d_alloc_root work-alike
+ *
+ * we can't use d_alloc_root if we want to use our own interpose function
+ * unchanged, so we simply call our own "fake" d_alloc_root
+ */
+static struct dentry *unionfs_d_alloc_root(struct super_block *sb)
+{
+	struct dentry *ret = NULL;
+
+	if (sb) {
+		static const struct qstr name = {
+			.name = "/",
+			.len = 1
+		};
+
+		ret = d_alloc(NULL, &name);
+		if (likely(ret)) {
+			ret->d_op = &unionfs_dops;
+			ret->d_sb = sb;
+			ret->d_parent = ret;
+		}
+	}
+	return ret;
+}
+
+/*
+ * There is no need to lock the unionfs_super_info's rwsem as there is no
+ * way anyone can have a reference to the superblock at this point in time.
+ */
+static int unionfs_read_super(struct super_block *sb, void *raw_data,
+			      int silent)
+{
+	int err = 0;
+	struct unionfs_dentry_info *lower_root_info = NULL;
+	int bindex, bstart, bend;
+
+	if (!raw_data) {
+		printk(KERN_ERR
+		       "unionfs: read_super: missing data argument\n");
+		err = -EINVAL;
+		goto out;
+	}
+
+	/* Allocate superblock private data */
+	sb->s_fs_info = kzalloc(sizeof(struct unionfs_sb_info), GFP_KERNEL);
+	if (unlikely(!UNIONFS_SB(sb))) {
+		printk(KERN_CRIT "unionfs: read_super: out of memory\n");
+		err = -ENOMEM;
+		goto out;
+	}
+
+	UNIONFS_SB(sb)->bend = -1;
+	atomic_set(&UNIONFS_SB(sb)->generation, 1);
+	init_rwsem(&UNIONFS_SB(sb)->rwsem);
+	UNIONFS_SB(sb)->high_branch_id = -1; /* -1 == invalid branch ID */
+
+	lower_root_info = unionfs_parse_options(sb, raw_data);
+	if (IS_ERR(lower_root_info)) {
+		printk(KERN_ERR
+		       "unionfs: read_super: error while parsing options "
+		       "(err = %ld)\n", PTR_ERR(lower_root_info));
+		err = PTR_ERR(lower_root_info);
+		lower_root_info = NULL;
+		goto out_free;
+	}
+	if (lower_root_info->bstart == -1) {
+		err = -ENOENT;
+		goto out_free;
+	}
+
+	/* set the lower superblock field of upper superblock */
+	bstart = lower_root_info->bstart;
+	BUG_ON(bstart != 0);
+	sbend(sb) = bend = lower_root_info->bend;
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		struct dentry *d = lower_root_info->lower_paths[bindex].dentry;
+		unionfs_set_lower_super_idx(sb, bindex, d->d_sb);
+	}
+
+	/* s_maxbytes comes from the highest-priority (leftmost) branch */
+	sb->s_maxbytes = unionfs_lower_super_idx(sb, 0)->s_maxbytes;
+
+	sb->s_op = &unionfs_sops;
+
+	/* See comment next to the definition of unionfs_d_alloc_root */
+	sb->s_root = unionfs_d_alloc_root(sb);
+	if (unlikely(!sb->s_root)) {
+		err = -ENOMEM;
+		goto out_dput;
+	}
+
+	/* link the upper and lower dentries */
+	sb->s_root->d_fsdata = NULL;
+	err = new_dentry_private_data(sb->s_root);
+	if (unlikely(err))
+		goto out_freedpd;
+
+	/* Set the lower dentries for s_root */
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		struct dentry *d;
+		struct vfsmount *m;
+
+		d = lower_root_info->lower_paths[bindex].dentry;
+		m = lower_root_info->lower_paths[bindex].mnt;
+
+		unionfs_set_lower_dentry_idx(sb->s_root, bindex, d);
+		unionfs_set_lower_mnt_idx(sb->s_root, bindex, m);
+	}
+	set_dbstart(sb->s_root, bstart);
+	set_dbend(sb->s_root, bend);
+
+	/* Set the generation number to one, since this is for the mount. */
+	atomic_set(&UNIONFS_D(sb->s_root)->generation, 1);
+
+	/*
+	 * Call interpose to create the upper level inode.  Only
+	 * INTERPOSE_LOOKUP can return a value other than 0 on err.
+	 */
+	err = PTR_ERR(unionfs_interpose(sb->s_root, sb, 0));
+	unionfs_unlock_dentry(sb->s_root);
+	if (!err)
+		goto out;
+	/* else fall through */
+
+out_freedpd:
+	if (UNIONFS_D(sb->s_root)) {
+		kfree(UNIONFS_D(sb->s_root)->lower_paths);
+		free_dentry_private_data(sb->s_root);
+	}
+	dput(sb->s_root);
+
+out_dput:
+	if (lower_root_info && !IS_ERR(lower_root_info)) {
+		for (bindex = lower_root_info->bstart;
+		     bindex <= lower_root_info->bend; bindex++) {
+			struct dentry *d;
+			struct vfsmount *m;
+
+			d = lower_root_info->lower_paths[bindex].dentry;
+			m = lower_root_info->lower_paths[bindex].mnt;
+
+			dput(d);
+			/* initializing: can't use unionfs_mntput here */
+			mntput(m);
+		}
+		kfree(lower_root_info->lower_paths);
+		kfree(lower_root_info);
+		lower_root_info = NULL;
+	}
+
+out_free:
+	kfree(UNIONFS_SB(sb)->data);
+	kfree(UNIONFS_SB(sb));
+	sb->s_fs_info = NULL;
+
+out:
+	if (lower_root_info && !IS_ERR(lower_root_info)) {
+		kfree(lower_root_info->lower_paths);
+		kfree(lower_root_info);
+	}
+	return err;
+}
+
+static int unionfs_get_sb(struct file_system_type *fs_type,
+			  int flags, const char *dev_name,
+			  void *raw_data, struct vfsmount *mnt)
+{
+	return get_sb_nodev(fs_type, flags, raw_data, unionfs_read_super, mnt);
+}
+
+static struct file_system_type unionfs_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= UNIONFS_NAME,
+	.get_sb		= unionfs_get_sb,
+	.kill_sb	= generic_shutdown_super,
+	.fs_flags	= FS_REVAL_DOT,
+};
+
+static int __init init_unionfs_fs(void)
+{
+	int err;
+
+	pr_info("Registering unionfs " UNIONFS_VERSION "\n");
+
+	err = unionfs_init_filldir_cache();
+	if (unlikely(err))
+		goto out;
+	err = unionfs_init_inode_cache();
+	if (unlikely(err))
+		goto out;
+	err = unionfs_init_dentry_cache();
+	if (unlikely(err))
+		goto out;
+	err = init_sioq();
+	if (unlikely(err))
+		goto out;
+	err = register_filesystem(&unionfs_fs_type);
+out:
+	if (unlikely(err)) {
+		stop_sioq();
+		unionfs_destroy_filldir_cache();
+		unionfs_destroy_inode_cache();
+		unionfs_destroy_dentry_cache();
+	}
+	return err;
+}
+
+static void __exit exit_unionfs_fs(void)
+{
+	stop_sioq();
+	unionfs_destroy_filldir_cache();
+	unionfs_destroy_inode_cache();
+	unionfs_destroy_dentry_cache();
+	unregister_filesystem(&unionfs_fs_type);
+	pr_info("Completed unionfs module unload\n");
+}
+
+MODULE_AUTHOR("Erez Zadok, Filesystems and Storage Lab, Stony Brook University"
+	      " (http://www.fsl.cs.sunysb.edu)");
+MODULE_DESCRIPTION("Unionfs " UNIONFS_VERSION
+		   " (http://unionfs.filesystems.org)");
+MODULE_LICENSE("GPL");
+
+module_init(init_unionfs_fs);
+module_exit(exit_unionfs_fs);
-- 
1.5.2.2


* [PATCH 25/42] Unionfs: super_block operations
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (23 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 24/42] Unionfs: mount-time and stacking-interposition functions Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:41 ` [PATCH 26/42] Unionfs: extended attributes operations Erez Zadok
                   ` (17 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Includes read_inode, delete_inode, put_super, statfs, remount_fs (which
supports branch-management ops), clear_inode, alloc_inode, destroy_inode,
write_inode, umount_begin, and show_options.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/super.c | 1020 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1020 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/super.c
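
When a branch is deleted on remount, do_remount_del_option() closes the gap
in new_data and new_lower_paths by moving the entries past idx one slot to
the left with memmove().  The sketch below shows that array shift in
isolation, as a generic user-space helper.  sk_delete_slot() is a
hypothetical name, and the sketch leaves out the reference dropping
(pathput) and the open-files check that the kernel code does first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Remove element idx from an array of count elements, each elem_size
 * bytes, by shifting the tail one slot to the left.  Returns the new
 * element count.  Deleting the last element needs no move at all.
 */
static int sk_delete_slot(void *array, size_t elem_size, int count, int idx)
{
	char *base = array;

	assert(array && idx >= 0 && idx < count);
	if (idx < count - 1)
		memmove(base + idx * elem_size,
			base + (idx + 1) * elem_size,
			(count - 1 - idx) * elem_size);
	return count - 1;
}
```

Deleting index 1 from { 10, 20, 30, 40 } leaves { 10, 30, 40 } in the
first three slots.  memmove() rather than memcpy() is needed because the
source and destination ranges overlap.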

diff --git a/fs/unionfs/super.c b/fs/unionfs/super.c
new file mode 100644
index 0000000..d9cf2a7
--- /dev/null
+++ b/fs/unionfs/super.c
@@ -0,0 +1,1020 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+/*
+ * The inode cache is used with alloc_inode for both our inode info and the
+ * vfs inode.
+ */
+static struct kmem_cache *unionfs_inode_cachep;
+
+static void unionfs_read_inode(struct inode *inode)
+{
+	int size;
+	struct unionfs_inode_info *info = UNIONFS_I(inode);
+
+	unionfs_read_lock(inode->i_sb);
+
+	memset(info, 0, offsetof(struct unionfs_inode_info, vfs_inode));
+	info->bstart = -1;
+	info->bend = -1;
+	atomic_set(&info->generation,
+		   atomic_read(&UNIONFS_SB(inode->i_sb)->generation));
+	spin_lock_init(&info->rdlock);
+	info->rdcount = 1;
+	info->hashsize = -1;
+	INIT_LIST_HEAD(&info->readdircache);
+
+	size = sbmax(inode->i_sb) * sizeof(struct inode *);
+	info->lower_inodes = kzalloc(size, GFP_KERNEL);
+	if (unlikely(!info->lower_inodes)) {
+		printk(KERN_CRIT "unionfs: no kernel memory when allocating "
+		       "lower-pointer array!\n");
+		BUG();
+	}
+
+	inode->i_version++;
+	inode->i_op = &unionfs_main_iops;
+	inode->i_fop = &unionfs_main_fops;
+
+	inode->i_mapping->a_ops = &unionfs_aops;
+
+	unionfs_read_unlock(inode->i_sb);
+}
+
+/*
+ * we now define delete_inode, because there are two VFS paths that may
+ * destroy an inode: one of them calls clear inode before doing everything
+ * else that's needed, and the other is fine.  This way we truncate the inode
+ * size (and its pages) and then clear our own inode, which will do an iput
+ * on our and the lower inode.
+ *
+ * No need to lock sb info's rwsem.
+ */
+static void unionfs_delete_inode(struct inode *inode)
+{
+	i_size_write(inode, 0);	/* every f/s seems to do that */
+
+	if (inode->i_data.nrpages)
+		truncate_inode_pages(&inode->i_data, 0);
+
+	clear_inode(inode);
+}
+
+/*
+ * final actions when unmounting a file system
+ *
+ * No need to lock rwsem.
+ */
+static void unionfs_put_super(struct super_block *sb)
+{
+	int bindex, bstart, bend;
+	struct unionfs_sb_info *spd;
+	int leaks = 0;
+
+	spd = UNIONFS_SB(sb);
+	if (!spd)
+		return;
+
+	bstart = sbstart(sb);
+	bend = sbend(sb);
+
+	/* Make sure we have no leaks of branchget/branchput. */
+	for (bindex = bstart; bindex <= bend; bindex++)
+		if (unlikely(branch_count(sb, bindex) != 0)) {
+			printk(KERN_CRIT
+			       "unionfs: branch %d has %d references left!\n",
+			       bindex, branch_count(sb, bindex));
+			leaks = 1;
+		}
+	BUG_ON(leaks != 0);
+
+	kfree(spd->data);
+	kfree(spd);
+	sb->s_fs_info = NULL;
+}
+
+/*
+ * Since people use this to answer the "How big of a file can I write?"
+ * question, we report the size of the highest priority branch as the size of
+ * the union.
+ */
+static int unionfs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+	int err	= 0;
+	struct super_block *sb;
+	struct dentry *lower_dentry;
+
+	sb = dentry->d_sb;
+
+	unionfs_read_lock(sb);
+	unionfs_lock_dentry(dentry);
+
+	if (unlikely(!__unionfs_d_revalidate_chain(dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+	unionfs_check_dentry(dentry);
+
+	lower_dentry = unionfs_lower_dentry(sb->s_root);
+	err = vfs_statfs(lower_dentry, buf);
+
+	/* set return buf to our f/s to avoid confusing user-level utils */
+	buf->f_type = UNIONFS_SUPER_MAGIC;
+	/*
+	 * Our maximum file name length is shorter by a few bytes because
+	 * every file name could potentially be whited-out.
+	 *
+	 * XXX: this restriction goes away with ODF.
+	 */
+	buf->f_namelen -= UNIONFS_WHLEN;
+
+	/*
+	 * reset two fields to avoid confusing user-land.
+	 * XXX: is this still necessary?
+	 */
+	memset(&buf->f_fsid, 0, sizeof(__kernel_fsid_t));
+	memset(&buf->f_spare, 0, sizeof(buf->f_spare));
+
+out:
+	unionfs_check_dentry(dentry);
+	unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(sb);
+	return err;
+}
+
+/* handle mode changing during remount */
+static noinline int do_remount_mode_option(char *optarg, int cur_branches,
+					   struct unionfs_data *new_data,
+					   struct path *new_lower_paths)
+{
+	int err = -EINVAL;
+	int perms, idx;
+	char *modename = strchr(optarg, '=');
+	struct nameidata nd;
+
+	/* by now, optarg contains the branch name */
+	if (!*optarg) {
+		printk(KERN_ERR
+		       "unionfs: no branch specified for mode change\n");
+		goto out;
+	}
+	if (!modename) {
+		printk(KERN_ERR "unionfs: branch \"%s\" requires a mode\n",
+		       optarg);
+		goto out;
+	}
+	*modename++ = '\0';
+	err = parse_branch_mode(modename, &perms);
+	if (err) {
+		printk(KERN_ERR "unionfs: invalid mode \"%s\" for \"%s\"\n",
+		       modename, optarg);
+		goto out;
+	}
+
+	/*
+	 * Find matching branch index.  For now, this assumes that nothing
+	 * has been mounted on top of this Unionfs stack.  Once we have /odf
+	 * and cache-coherency resolved, we'll address the branch-path
+	 * uniqueness.
+	 */
+	err = path_lookup(optarg, LOOKUP_FOLLOW, &nd);
+	if (err) {
+		printk(KERN_ERR "unionfs: error accessing "
+		       "lower directory \"%s\" (error %d)\n",
+		       optarg, err);
+		goto out;
+	}
+	for (idx = 0; idx < cur_branches; idx++)
+		if (nd.mnt == new_lower_paths[idx].mnt &&
+		    nd.dentry == new_lower_paths[idx].dentry)
+			break;
+	path_release(&nd);	/* no longer needed */
+	if (idx == cur_branches) {
+		err = -ENOENT;	/* err may have been reset above */
+		printk(KERN_ERR "unionfs: branch \"%s\" "
+		       "not found\n", optarg);
+		goto out;
+	}
+	/* check/change mode for existing branch */
+	/* we don't warn if perms==branchperms */
+	new_data[idx].branchperms = perms;
+	err = 0;
+out:
+	return err;
+}
+
+/* handle branch deletion during remount */
+static noinline int do_remount_del_option(char *optarg, int cur_branches,
+					  struct unionfs_data *new_data,
+					  struct path *new_lower_paths)
+{
+	int err = -EINVAL;
+	int idx;
+	struct nameidata nd;
+
+	/* optarg contains the branch name to delete */
+
+	/*
+	 * Find matching branch index.  For now, this assumes that nothing
+	 * has been mounted on top of this Unionfs stack.  Once we have /odf
+	 * and cache-coherency resolved, we'll address the branch-path
+	 * uniqueness.
+	 */
+	err = path_lookup(optarg, LOOKUP_FOLLOW, &nd);
+	if (err) {
+		printk(KERN_ERR "unionfs: error accessing "
+		       "lower directory \"%s\" (error %d)\n",
+		       optarg, err);
+		goto out;
+	}
+	for (idx = 0; idx < cur_branches; idx++)
+		if (nd.mnt == new_lower_paths[idx].mnt &&
+		    nd.dentry == new_lower_paths[idx].dentry)
+			break;
+	path_release(&nd);	/* no longer needed */
+	if (idx == cur_branches) {
+		printk(KERN_ERR "unionfs: branch \"%s\" "
+		       "not found\n", optarg);
+		err = -ENOENT;
+		goto out;
+	}
+	/* check if there are any open files on the branch to be deleted */
+	if (atomic_read(&new_data[idx].open_files) > 0) {
+		err = -EBUSY;
+		goto out;
+	}
+
+	/*
+	 * Now we have to delete the branch.  First, release any handles it
+	 * has.  Then, move the remaining array indexes past "idx" in
+	 * new_data and new_lower_paths one to the left.  Finally, adjust
+	 * cur_branches.
+	 */
+	pathput(&new_lower_paths[idx]);
+
+	if (idx < cur_branches - 1) {
+		/* if idx==cur_branches-1, we delete last branch: easy */
+		memmove(&new_data[idx], &new_data[idx+1],
+			(cur_branches - 1 - idx) *
+			sizeof(struct unionfs_data));
+		memmove(&new_lower_paths[idx], &new_lower_paths[idx+1],
+			(cur_branches - 1 - idx) * sizeof(struct path));
+	}
+
+	err = 0;
+out:
+	return err;
+}
+
+/* handle branch insertion during remount */
+static noinline int do_remount_add_option(char *optarg, int cur_branches,
+					  struct unionfs_data *new_data,
+					  struct path *new_lower_paths,
+					  int *high_branch_id)
+{
+	int err = -EINVAL;
+	int perms;
+	int idx = 0;		/* default: insert at beginning */
+	char *new_branch, *modename = NULL;
+	struct nameidata nd;
+
+	/*
+	 * optarg can be of several forms:
+	 *
+	 * /bar:/foo		insert /foo before /bar
+	 * /bar:/foo=ro		insert /foo in ro mode before /bar
+	 * /foo			insert /foo in the beginning (prepend)
+	 * :/foo		insert /foo at the end (append)
+	 */
+	if (*optarg == ':') {	/* append? */
+		new_branch = optarg + 1; /* skip ':' */
+		idx = cur_branches;
+		goto found_insertion_point;
+	}
+	new_branch = strchr(optarg, ':');
+	if (!new_branch) {	/* prepend? */
+		new_branch = optarg;
+		goto found_insertion_point;
+	}
+	*new_branch++ = '\0';	/* holds path+mode of new branch */
+
+	/*
+	 * Find matching branch index.  For now, this assumes that nothing
+	 * has been mounted on top of this Unionfs stack.  Once we have /odf
+	 * and cache-coherency resolved, we'll address the branch-path
+	 * uniqueness.
+	 */
+	err = path_lookup(optarg, LOOKUP_FOLLOW, &nd);
+	if (err) {
+		printk(KERN_ERR "unionfs: error accessing "
+		       "lower directory \"%s\" (error %d)\n",
+		       optarg, err);
+		goto out;
+	}
+	for (idx = 0; idx < cur_branches; idx++)
+		if (nd.mnt == new_lower_paths[idx].mnt &&
+		    nd.dentry == new_lower_paths[idx].dentry)
+			break;
+	path_release(&nd);	/* no longer needed */
+	if (idx == cur_branches) {
+		printk(KERN_ERR "unionfs: branch \"%s\" "
+		       "not found\n", optarg);
+		err = -ENOENT;
+		goto out;
+	}
+
+	/*
+	 * At this point idx holds the index before which the new branch
+	 * should be inserted.
+	 */
+found_insertion_point:
+	/* find the mode for the new branch */
+	if (new_branch)
+		modename = strchr(new_branch, '=');
+	if (modename)
+		*modename++ = '\0';
+	if (!new_branch || !*new_branch) {
+		printk(KERN_ERR "unionfs: null new branch\n");
+		err = -EINVAL;
+		goto out;
+	}
+	err = parse_branch_mode(modename, &perms);
+	if (err) {
+		printk(KERN_ERR "unionfs: invalid mode \"%s\" for "
+		       "branch \"%s\"\n", modename, new_branch);
+		goto out;
+	}
+	err = path_lookup(new_branch, LOOKUP_FOLLOW, &nd);
+	if (err) {
+		printk(KERN_ERR "unionfs: error accessing "
+		       "lower directory \"%s\" (error %d)\n",
+		       new_branch, err);
+		goto out;
+	}
+	/*
+	 * It's probably safe to check_mode the new branch to insert.  Note:
+	 * we don't allow inserting branches which are themselves unionfs
+	 * mounts (check_branch returns EINVAL in that case).  This is
+	 * because this code base doesn't support stacking unionfs; the ODF
+	 * code base supports that correctly.
+	 */
+	err = check_branch(&nd);
+	if (err) {
+		printk(KERN_ERR "unionfs: lower directory "
+		       "\"%s\" is not a valid branch\n", new_branch);
+		path_release(&nd);
+		goto out;
+	}
+
+	/*
+	 * Now we have to insert the new branch.  But first, move the bits
+	 * to make space for the new branch, if needed.  Finally, adjust
+	 * cur_branches.
+	 * We don't release nd here; it's kept until umount/remount.
+	 */
+	if (idx < cur_branches) {
+		/* if idx==cur_branches, we append: easy */
+		memmove(&new_data[idx+1], &new_data[idx],
+			(cur_branches - idx) * sizeof(struct unionfs_data));
+		memmove(&new_lower_paths[idx+1], &new_lower_paths[idx],
+			(cur_branches - idx) * sizeof(struct path));
+	}
+	new_lower_paths[idx].dentry = nd.dentry;
+	new_lower_paths[idx].mnt = nd.mnt;
+
+	new_data[idx].sb = nd.dentry->d_sb;
+	atomic_set(&new_data[idx].open_files, 0);
+	new_data[idx].branchperms = perms;
+	new_data[idx].branch_id = ++*high_branch_id; /* assign new branch ID */
+
+	err = 0;
+out:
+	return err;
+}
+
+
+/*
+ * Support branch management options on remount.
+ *
+ * See Documentation/filesystems/unionfs/ for details.
+ *
+ * @flags: numeric mount options
+ * @options: mount options string
+ *
+ * This function can rearrange a mounted union dynamically, adding and
+ * removing branches, including changing branch modes.  Clearly this has to
+ * be done safely and atomically.  Luckily, the VFS already calls this
+ * function with lock_super(sb) and lock_kernel() held, preventing
+ * concurrent mixing of new mounts, remounts, and unmounts.  Moreover,
+ * do_remount_sb(), our caller function, already called shrink_dcache_sb(sb)
+ * to purge dentries/inodes from our superblock, and also called
+ * fsync_super(sb) to purge any dirty pages.  So we're good.
+ *
+ * XXX: however, our remount code may also need to invalidate mapped pages
+ * so as to force them to be re-gotten from the (newly reconfigured) lower
+ * branches.  This has to wait for proper mmap and cache coherency support
+ * in the VFS.
+ *
+ */
+static int unionfs_remount_fs(struct super_block *sb, int *flags,
+			      char *options)
+{
+	int err = 0;
+	int i;
+	char *optionstmp, *tmp_to_free;	/* kstrdup'ed of "options" */
+	char *optname;
+	int cur_branches = 0;	/* no. of current branches */
+	int new_branches = 0;	/* no. of branches actually left in the end */
+	int add_branches;	/* est. no. of branches to add */
+	int del_branches;	/* est. no. of branches to del */
+	int max_branches;	/* max possible no. of branches */
+	struct unionfs_data *new_data = NULL, *tmp_data = NULL;
+	struct path *new_lower_paths = NULL, *tmp_lower_paths = NULL;
+	struct inode **new_lower_inodes = NULL;
+	int new_high_branch_id;	/* new high branch ID */
+	int size;		/* memory allocation size, temp var */
+	int old_ibstart, old_ibend;
+
+	unionfs_write_lock(sb);
+
+	/*
+	 * The VFS will take care of "ro" and "rw" flags, and we can safely
+	 * ignore MS_SILENT, but anything else left over is an error.  So we
+	 * need to check if any other flags may have been passed (none are
+	 * allowed/supported as of now).
+	 */
+	if ((*flags & ~(MS_RDONLY | MS_SILENT)) != 0) {
+		printk(KERN_ERR
+		       "unionfs: remount flags 0x%x unsupported\n", *flags);
+		err = -EINVAL;
+		goto out_error;
+	}
+
+	/*
+	 * If 'options' is NULL, it's probably because the user just changed
+	 * the union to a "ro" or "rw" and the VFS took care of it.  So
+	 * nothing to do and we're done.
+	 */
+	if (!options || options[0] == '\0')
+		goto out_error;
+
+	/*
+	 * Find out how many branches we will have in the end, counting
+	 * "add" and "del" commands.  Copy the "options" string because
+	 * strsep modifies the string and we need it later.
+	 */
+	tmp_to_free = kstrdup(options, GFP_KERNEL);
+	optionstmp = tmp_to_free;
+	if (unlikely(!optionstmp)) {
+		err = -ENOMEM;
+		goto out_free;
+	}
+	cur_branches = sbmax(sb); /* current no. branches */
+	new_branches = sbmax(sb);
+	del_branches = 0;
+	add_branches = 0;
+	new_high_branch_id = sbhbid(sb); /* save current high_branch_id */
+	while ((optname = strsep(&optionstmp, ",")) != NULL) {
+		char *optarg;
+
+		if (!optname || !*optname)
+			continue;
+
+		optarg = strchr(optname, '=');
+		if (optarg)
+			*optarg++ = '\0';
+
+		if (!strcmp("add", optname))
+			add_branches++;
+		else if (!strcmp("del", optname))
+			del_branches++;
+	}
+	kfree(tmp_to_free);
+	/* after all changes, will we have at least one branch left? */
+	if ((new_branches + add_branches - del_branches) < 1) {
+		printk(KERN_ERR
+		       "unionfs: no branches left after remount\n");
+		err = -EINVAL;
+		goto out_free;
+	}
+
+	/*
+	 * Since we haven't actually parsed all the add/del options, nor
+	 * have we checked them for errors, we don't know for sure how many
+	 * branches we will have after all changes have taken place.  In
+	 * fact, the total number of branches left could be less than what
+	 * we have now.  So we need to allocate space for a temporary
+	 * placeholder that is at least as large as the maximum number of
+	 * branches we *could* have, which is the current number plus all
+	 * the additions.  Once we're done with these temp placeholders, we
+	 * may have to re-allocate the final size, copy over from the temp,
+	 * and then free the temps (done near the end of this function).
+	 */
+	max_branches = cur_branches + add_branches;
+	/* allocate space for the new per-branch data array */
+	tmp_data = kcalloc(max_branches,
+			   sizeof(struct unionfs_data), GFP_KERNEL);
+	if (unlikely(!tmp_data)) {
+		err = -ENOMEM;
+		goto out_free;
+	}
+	/* allocate space for new pointers to lower paths */
+	tmp_lower_paths = kcalloc(max_branches,
+				  sizeof(struct path), GFP_KERNEL);
+	if (unlikely(!tmp_lower_paths)) {
+		err = -ENOMEM;
+		goto out_free;
+	}
+	/* copy current info into new placeholders, incrementing refcnts */
+	memcpy(tmp_data, UNIONFS_SB(sb)->data,
+	       cur_branches * sizeof(struct unionfs_data));
+	memcpy(tmp_lower_paths, UNIONFS_D(sb->s_root)->lower_paths,
+	       cur_branches * sizeof(struct path));
+	for (i = 0; i < cur_branches; i++)
+		pathget(&tmp_lower_paths[i]); /* drop refs at end of fxn */
+
+	/*******************************************************************
+	 * For each branch command, do path_lookup on the requested branch,
+	 * and apply the change to a temp branch list.  To handle errors, we
+	 * already dup'ed the old arrays (above), and increased the refcnts
+	 * on various f/s objects.  So now we can do all the path_lookups
+	 * and branch-management commands on the new arrays.  If it fails
+	 * midway, we free the tmp arrays and *put all objects.  If we succeed,
+	 * then we free old arrays and *put their objects, and then replace
+	 * the arrays with the new tmp list (we may have to re-allocate the
+	 * memory because the temp lists could have been larger than what we
+	 * actually needed).
+	 *******************************************************************/
+
+	while ((optname = strsep(&options, ",")) != NULL) {
+		char *optarg;
+
+		if (!optname || !*optname)
+			continue;
+		/*
+		 * At this stage optname holds a comma-delimited option, but
+		 * without the commas.  Next, we need to break the string on
+		 * the '=' symbol to separate CMD=ARG, where ARG itself can
+		 * be KEY=VAL.  For example, in mode=/foo=rw, CMD is "mode",
+		 * KEY is "/foo", and VAL is "rw".
+		 */
+		optarg = strchr(optname, '=');
+		if (optarg)
+			*optarg++ = '\0';
+		/* incgen remount option (instead of old ioctl) */
+		if (!strcmp("incgen", optname)) {
+			err = 0;
+			goto out_no_change;
+		}
+
+		/*
+		 * All of our options take an argument now.  (Insert ones
+		 * that don't above this check.)  So at this stage optname
+		 * contains the CMD part and optarg contains the ARG part.
+		 */
+		if (!optarg || !*optarg) {
+			printk(KERN_ERR "unionfs: all remount options require "
+			       "an argument (%s)\n", optname);
+			err = -EINVAL;
+			goto out_release;
+		}
+
+		if (!strcmp("add", optname)) {
+			err = do_remount_add_option(optarg, new_branches,
+						    tmp_data,
+						    tmp_lower_paths,
+						    &new_high_branch_id);
+			if (err)
+				goto out_release;
+			new_branches++;
+			if (new_branches > UNIONFS_MAX_BRANCHES) {
+				printk(KERN_ERR "unionfs: command exceeds "
+				       "%d branches\n", UNIONFS_MAX_BRANCHES);
+				err = -E2BIG;
+				goto out_release;
+			}
+			continue;
+		}
+		if (!strcmp("del", optname)) {
+			err = do_remount_del_option(optarg, new_branches,
+						    tmp_data,
+						    tmp_lower_paths);
+			if (err)
+				goto out_release;
+			new_branches--;
+			continue;
+		}
+		if (!strcmp("mode", optname)) {
+			err = do_remount_mode_option(optarg, new_branches,
+						     tmp_data,
+						     tmp_lower_paths);
+			if (err)
+				goto out_release;
+			continue;
+		}
+
+		/*
+		 * When you use "mount -o remount,ro", mount(8) will
+		 * reportedly pass the original dirs= string from
+		 * /proc/mounts.  So for now, we have to ignore dirs= and
+		 * not consider it an error, unless we want to allow users
+		 * to pass dirs= in remount.  Note that to allow the VFS to
+		 * actually process the ro/rw remount options, we have to
+		 * return 0 from this function.
+		 */
+		if (!strcmp("dirs", optname)) {
+			printk(KERN_WARNING
+			       "unionfs: remount ignoring option \"%s\"\n",
+			       optname);
+			continue;
+		}
+
+		err = -EINVAL;
+		printk(KERN_ERR
+		       "unionfs: unrecognized option \"%s\"\n", optname);
+		goto out_release;
+	}
+
+out_no_change:
+
+	/******************************************************************
+	 * WE'RE ALMOST DONE: check if leftmost branch might be read-only,
+	 * see if we need to allocate a small-sized new vector, copy the
+	 * vectors to their correct place, release the refcnt of the older
+	 * ones, and return.  Also handle invalidating any pages that will
+	 * have to be re-read.
+	 *******************************************************************/
+
+	if (!(tmp_data[0].branchperms & MAY_WRITE)) {
+		printk(KERN_ERR "unionfs: leftmost branch cannot be read-only "
+		       "(use \"remount,ro\" to create a read-only union)\n");
+		err = -EINVAL;
+		goto out_release;
+	}
+
+	/* (re)allocate space for the new per-branch data array */
+	size = new_branches * sizeof(struct unionfs_data);
+	new_data = krealloc(tmp_data, size, GFP_KERNEL);
+	if (unlikely(!new_data)) {
+		err = -ENOMEM;
+		goto out_release;
+	}
+
+	/* (re)allocate space for new pointers to lower paths */
+	size = new_branches * sizeof(struct path);
+	new_lower_paths = krealloc(tmp_lower_paths, size, GFP_KERNEL);
+	if (unlikely(!new_lower_paths)) {
+		err = -ENOMEM;
+		goto out_release;
+	}
+
+	/* allocate space for new pointers to lower inodes */
+	new_lower_inodes = kcalloc(new_branches,
+				   sizeof(struct inode *), GFP_KERNEL);
+	if (unlikely(!new_lower_inodes)) {
+		err = -ENOMEM;
+		goto out_release;
+	}
+
+	/*
+	 * OK, just before we actually put the new set of branches in place,
+	 * we need to ensure that our own f/s has no dirty objects left.
+	 * Luckily, do_remount_sb() already calls shrink_dcache_sb(sb) and
+	 * fsync_super(sb), taking care of dentries, inodes, and dirty
+	 * pages.  So all that's left is for us to invalidate any leftover
+	 * (non-dirty) pages to ensure that they will be re-read from the
+	 * new lower branches (and to support mmap).
+	 */
+
+	/*
+	 * Now we call drop_pagecache_sb() to invalidate all pages in this
+	 * super.  This function calls invalidate_inode_pages(mapping),
+	 * which calls invalidate_mapping_pages(): the latter, however, will
+	 * not invalidate pages which are dirty, locked, under writeback, or
+	 * mapped into page tables.  We shouldn't have to worry about dirty
+	 * or under-writeback pages, because do_remount_sb() called
+	 * fsync_super() which would not have returned until all dirty pages
+	 * were flushed.
+	 *
+	 * But do we have to worry about locked pages?  Is there any chance
+	 * that in here we'll get locked pages?
+	 *
+	 * XXX: what about pages mapped into pagetables?  Are these pages
+	 * which user processes may have mmap(2)'ed?  If so, then we need to
+	 * invalidate those too, no?  Maybe we'll have to write our own
+	 * version of invalidate_mapping_pages() which also handles mapped
+	 * pages.
+	 *
+	 * XXX: Alternatively, maybe we should call truncate_inode_pages(),
+	 * which uses two passes over the pages list, and will truncate all
+	 * pages.
+	 */
+	drop_pagecache_sb(sb);
+
+	/* copy new vectors into their correct place */
+	tmp_data = UNIONFS_SB(sb)->data;
+	UNIONFS_SB(sb)->data = new_data;
+	new_data = NULL;	/* so don't free good pointers below */
+	tmp_lower_paths = UNIONFS_D(sb->s_root)->lower_paths;
+	UNIONFS_D(sb->s_root)->lower_paths = new_lower_paths;
+	new_lower_paths = NULL;	/* so don't free good pointers below */
+
+	/* update our unionfs_sb_info and root dentry index of last branch */
+	i = sbmax(sb);		/* save no. of branches to release at end */
+	sbend(sb) = new_branches - 1;
+	set_dbend(sb->s_root, new_branches - 1);
+	old_ibstart = ibstart(sb->s_root->d_inode);
+	old_ibend = ibend(sb->s_root->d_inode);
+	ibend(sb->s_root->d_inode) = new_branches - 1;
+	UNIONFS_D(sb->s_root)->bcount = new_branches;
+	new_branches = i; /* no. of branches to release below */
+
+	/*
+	 * Update lower inodes: 3 steps
+	 * 1. grab ref on all new lower inodes
+	 */
+	for (i = dbstart(sb->s_root); i <= dbend(sb->s_root); i++) {
+		struct dentry *lower_dentry =
+			unionfs_lower_dentry_idx(sb->s_root, i);
+		igrab(lower_dentry->d_inode);
+		new_lower_inodes[i] = lower_dentry->d_inode;
+	}
+	/* 2. release reference on all older lower inodes */
+	for (i = old_ibstart; i <= old_ibend; i++) {
+		iput(unionfs_lower_inode_idx(sb->s_root->d_inode, i));
+		unionfs_set_lower_inode_idx(sb->s_root->d_inode, i, NULL);
+	}
+	kfree(UNIONFS_I(sb->s_root->d_inode)->lower_inodes);
+	/* 3. update root dentry's inode to new lower_inodes array */
+	UNIONFS_I(sb->s_root->d_inode)->lower_inodes = new_lower_inodes;
+	new_lower_inodes = NULL;
+
+	/* maxbytes may have changed */
+	sb->s_maxbytes = unionfs_lower_super_idx(sb, 0)->s_maxbytes;
+	/* update high branch ID */
+	sbhbid(sb) = new_high_branch_id;
+
+	/* update our sb->generation for revalidating objects */
+	i = atomic_inc_return(&UNIONFS_SB(sb)->generation);
+	atomic_set(&UNIONFS_D(sb->s_root)->generation, i);
+	atomic_set(&UNIONFS_I(sb->s_root->d_inode)->generation, i);
+	if (!(*flags & MS_SILENT))
+		pr_info("unionfs: new generation number %d\n", i);
+	/* finally, update the root dentry's times */
+	unionfs_copy_attr_times(sb->s_root->d_inode);
+	err = 0;		/* reset to success */
+
+	/*
+	 * The code above falls through to the next label, and releases the
+	 * refcnts of the older ones (stored in tmp_*): if we fell through
+	 * here, it means success.  However, if we jump directly to this
+	 * label from any error above, then an error occurred after we
+	 * grabbed various refcnts, and so we have to release the
+	 * temporarily constructed structures.
+	 */
+out_release:
+	/* no need to cleanup/release anything in tmp_data */
+	if (tmp_lower_paths)
+		for (i = 0; i < new_branches; i++)
+			pathput(&tmp_lower_paths[i]);
+out_free:
+	kfree(tmp_lower_paths);
+	kfree(tmp_data);
+	kfree(new_lower_paths);
+	kfree(new_data);
+	kfree(new_lower_inodes);
+out_error:
+	unionfs_check_dentry(sb->s_root);
+	unionfs_write_unlock(sb);
+	return err;
+}
+
+/*
+ * Called by iput() when the inode reference count reaches zero
+ * and the inode is not hashed anywhere.  Used to clear anything
+ * that needs to be, before the inode is completely destroyed and put
+ * on the inode free list.
+ *
+ * No need to lock sb info's rwsem.
+ */
+static void unionfs_clear_inode(struct inode *inode)
+{
+	int bindex, bstart, bend;
+	struct inode *lower_inode;
+	struct list_head *pos, *n;
+	struct unionfs_dir_state *rdstate;
+
+	list_for_each_safe(pos, n, &UNIONFS_I(inode)->readdircache) {
+		rdstate = list_entry(pos, struct unionfs_dir_state, cache);
+		list_del(&rdstate->cache);
+		free_rdstate(rdstate);
+	}
+
+	/*
+	 * Decrement a reference to a lower_inode, which was incremented
+	 * by our read_inode when it was created initially.
+	 */
+	bstart = ibstart(inode);
+	bend = ibend(inode);
+	if (bstart >= 0) {
+		for (bindex = bstart; bindex <= bend; bindex++) {
+			lower_inode = unionfs_lower_inode_idx(inode, bindex);
+			if (!lower_inode)
+				continue;
+			iput(lower_inode);
+		}
+	}
+
+	kfree(UNIONFS_I(inode)->lower_inodes);
+	UNIONFS_I(inode)->lower_inodes = NULL;
+}
+
+static struct inode *unionfs_alloc_inode(struct super_block *sb)
+{
+	struct unionfs_inode_info *i;
+
+	i = kmem_cache_alloc(unionfs_inode_cachep, GFP_KERNEL);
+	if (unlikely(!i))
+		return NULL;
+
+	/* memset everything up to the inode to 0 */
+	memset(i, 0, offsetof(struct unionfs_inode_info, vfs_inode));
+
+	i->vfs_inode.i_version = 1;
+	return &i->vfs_inode;
+}
+
+static void unionfs_destroy_inode(struct inode *inode)
+{
+	kmem_cache_free(unionfs_inode_cachep, UNIONFS_I(inode));
+}
+
+/* unionfs inode cache constructor */
+static void init_once(struct kmem_cache *cachep, void *obj)
+{
+	struct unionfs_inode_info *i = obj;
+
+	inode_init_once(&i->vfs_inode);
+}
+
+int unionfs_init_inode_cache(void)
+{
+	int err = 0;
+
+	unionfs_inode_cachep =
+		kmem_cache_create("unionfs_inode_cache",
+				  sizeof(struct unionfs_inode_info), 0,
+				  SLAB_RECLAIM_ACCOUNT, init_once);
+	if (unlikely(!unionfs_inode_cachep))
+		err = -ENOMEM;
+	return err;
+}
+
+/* unionfs inode cache destructor */
+void unionfs_destroy_inode_cache(void)
+{
+	if (unionfs_inode_cachep)
+		kmem_cache_destroy(unionfs_inode_cachep);
+}
+
+/*
+ * Called when we have a dirty inode; right now we only throw out
+ * parts of our readdir list that are too old.
+ *
+ * No need to grab sb info's rwsem.
+ */
+static int unionfs_write_inode(struct inode *inode, int sync)
+{
+	struct list_head *pos, *n;
+	struct unionfs_dir_state *rdstate;
+
+	spin_lock(&UNIONFS_I(inode)->rdlock);
+	list_for_each_safe(pos, n, &UNIONFS_I(inode)->readdircache) {
+		rdstate = list_entry(pos, struct unionfs_dir_state, cache);
+		/* We keep this list in LRU order. */
+		if ((rdstate->access + RDCACHE_JIFFIES) > jiffies)
+			break;
+		UNIONFS_I(inode)->rdcount--;
+		list_del(&rdstate->cache);
+		free_rdstate(rdstate);
+	}
+	spin_unlock(&UNIONFS_I(inode)->rdlock);
+
+	return 0;
+}
+
+/*
+ * Used only in nfs, to kill any pending RPC tasks, so that subsequent
+ * code can actually succeed and won't leave tasks that need handling.
+ */
+static void unionfs_umount_begin(struct vfsmount *mnt, int flags)
+{
+	struct super_block *sb, *lower_sb;
+	struct vfsmount *lower_mnt;
+	int bindex, bstart, bend;
+
+	if (!(flags & MNT_FORCE))
+		/*
+		 * we are not being MNT_FORCE'd, therefore we should emulate
+		 * old behavior
+		 */
+		return;
+
+	sb = mnt->mnt_sb;
+
+	unionfs_read_lock(sb);
+
+	bstart = sbstart(sb);
+	bend = sbend(sb);
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		lower_mnt = unionfs_lower_mnt_idx(sb->s_root, bindex);
+		lower_sb = unionfs_lower_super_idx(sb, bindex);
+
+		if (lower_mnt && lower_sb && lower_sb->s_op &&
+		    lower_sb->s_op->umount_begin)
+			lower_sb->s_op->umount_begin(lower_mnt, flags);
+	}
+
+	unionfs_read_unlock(sb);
+}
+
+static int unionfs_show_options(struct seq_file *m, struct vfsmount *mnt)
+{
+	struct super_block *sb = mnt->mnt_sb;
+	int ret = 0;
+	char *tmp_page;
+	char *path;
+	int bindex, bstart, bend;
+	int perms;
+
+	unionfs_read_lock(sb);
+
+	unionfs_lock_dentry(sb->s_root);
+
+	tmp_page = (char *) __get_free_page(GFP_KERNEL);
+	if (unlikely(!tmp_page)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	bstart = sbstart(sb);
+	bend = sbend(sb);
+
+	seq_printf(m, ",dirs=");
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		path = d_path(unionfs_lower_dentry_idx(sb->s_root, bindex),
+			      unionfs_lower_mnt_idx(sb->s_root, bindex),
+			      tmp_page, PAGE_SIZE);
+		if (IS_ERR(path)) {
+			ret = PTR_ERR(path);
+			goto out;
+		}
+
+		perms = branchperms(sb, bindex);
+
+		seq_printf(m, "%s=%s", path,
+			   perms & MAY_WRITE ? "rw" : "ro");
+		if (bindex != bend)
+			seq_printf(m, ":");
+	}
+
+out:
+	free_page((unsigned long) tmp_page);
+
+	unionfs_unlock_dentry(sb->s_root);
+
+	unionfs_read_unlock(sb);
+
+	return ret;
+}
+
+struct super_operations unionfs_sops = {
+	.read_inode	= unionfs_read_inode,
+	.delete_inode	= unionfs_delete_inode,
+	.put_super	= unionfs_put_super,
+	.statfs		= unionfs_statfs,
+	.remount_fs	= unionfs_remount_fs,
+	.clear_inode	= unionfs_clear_inode,
+	.umount_begin	= unionfs_umount_begin,
+	.show_options	= unionfs_show_options,
+	.write_inode	= unionfs_write_inode,
+	.alloc_inode	= unionfs_alloc_inode,
+	.destroy_inode	= unionfs_destroy_inode,
+};
-- 
1.5.2.2
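
Editorial note on do_remount_add_option() and do_remount_del_option() above:
both treat the branch list as a flat array and open or close a slot with
memmove().  The following is only a userspace sketch of that array handling,
not code from the patch; struct branch, branch_insert() and branch_delete()
are hypothetical names standing in for the struct unionfs_data and
struct path arrays.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for one per-branch record. */
struct branch {
	int id;
};

/*
 * Insert "item" before index idx in arr[0..count).  The caller must have
 * room for count + 1 elements.  Returns the new element count.
 */
static int branch_insert(struct branch *arr, int count, int idx,
			 struct branch item)
{
	assert(idx >= 0 && idx <= count);
	if (idx < count)	/* idx == count is a plain append */
		memmove(&arr[idx + 1], &arr[idx],
			(size_t)(count - idx) * sizeof(*arr));
	arr[idx] = item;
	return count + 1;
}

/*
 * Remove the element at idx by shifting the tail one slot to the left.
 * Returns the new element count.
 */
static int branch_delete(struct branch *arr, int count, int idx)
{
	assert(idx >= 0 && idx < count);
	if (idx < count - 1)	/* deleting the last element needs no move */
		memmove(&arr[idx], &arr[idx + 1],
			(size_t)(count - 1 - idx) * sizeof(*arr));
	return count - 1;
}
```

memmove() rather than memcpy() matters here because the source and
destination ranges overlap.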

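Editorial note on the "add=" argument forms documented in
do_remount_add_option() above (/bar:/foo, /bar:/foo=ro, /foo, :/foo): the
in-place splitting can be sketched in userspace roughly as follows.  This is
an illustration under assumptions, not the patch's code; struct add_spec and
parse_add_arg() are hypothetical names, and the sketch does no path lookup
or mode validation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch; not part of the patch. */
struct add_spec {
	const char *before;	/* existing branch to insert before, or NULL */
	const char *branch;	/* new branch path */
	const char *mode;	/* text after '=', or NULL if not given */
	int append;		/* nonzero for the ":/foo" form */
};

/*
 * Split the ARG of "add=ARG" in place (arg is modified).  Returns 0 on
 * success, or -1 if the new branch name is empty.
 */
static int parse_add_arg(char *arg, struct add_spec *out)
{
	char *branch, *mode, *sep;

	out->before = NULL;
	out->mode = NULL;
	out->append = 0;

	if (*arg == ':') {			/* ":/foo" appends */
		out->append = 1;
		branch = arg + 1;
	} else if ((sep = strchr(arg, ':')) != NULL) {
		*sep++ = '\0';			/* "/bar:/foo" inserts before /bar */
		out->before = arg;
		branch = sep;
	} else {				/* "/foo" prepends */
		branch = arg;
	}

	mode = strchr(branch, '=');		/* optional "=ro" / "=rw" suffix */
	if (mode)
		*mode++ = '\0';
	if (!*branch)
		return -1;

	out->branch = branch;
	out->mode = mode;
	return 0;
}
```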


* [PATCH 26/42] Unionfs: extended attributes operations
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (24 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 25/42] Unionfs: super_block operations Erez Zadok
@ 2007-12-10  2:41 ` Erez Zadok
  2007-12-10  2:42 ` [PATCH 27/42] Unionfs: async I/O queue headers Erez Zadok
                   ` (16 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:41 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/xattr.c |  153 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 153 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/xattr.c

diff --git a/fs/unionfs/xattr.c b/fs/unionfs/xattr.c
new file mode 100644
index 0000000..00c6d0d
--- /dev/null
+++ b/fs/unionfs/xattr.c
@@ -0,0 +1,153 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+/* This is lifted from fs/xattr.c */
+void *unionfs_xattr_alloc(size_t size, size_t limit)
+{
+	void *ptr;
+
+	if (size > limit)
+		return ERR_PTR(-E2BIG);
+
+	if (!size)		/* size request, no buffer is needed */
+		return NULL;
+
+	ptr = kmalloc(size, GFP_KERNEL);
+	if (unlikely(!ptr))
+		return ERR_PTR(-ENOMEM);
+	return ptr;
+}
+
+/*
+ * BKL held by caller.
+ * dentry->d_inode->i_mutex locked
+ */
+ssize_t unionfs_getxattr(struct dentry *dentry, const char *name, void *value,
+			 size_t size)
+{
+	struct dentry *lower_dentry = NULL;
+	int err = -EOPNOTSUPP;
+
+	unionfs_read_lock(dentry->d_sb);
+	unionfs_lock_dentry(dentry);
+
+	if (unlikely(!__unionfs_d_revalidate_chain(dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+
+	lower_dentry = unionfs_lower_dentry(dentry);
+
+	err = vfs_getxattr(lower_dentry, (char *) name, value, size);
+
+out:
+	unionfs_check_dentry(dentry);
+	unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(dentry->d_sb);
+	return err;
+}
+
+/*
+ * BKL held by caller.
+ * dentry->d_inode->i_mutex locked
+ */
+int unionfs_setxattr(struct dentry *dentry, const char *name,
+		     const void *value, size_t size, int flags)
+{
+	struct dentry *lower_dentry = NULL;
+	int err = -EOPNOTSUPP;
+
+	unionfs_read_lock(dentry->d_sb);
+	unionfs_lock_dentry(dentry);
+
+	if (unlikely(!__unionfs_d_revalidate_chain(dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+
+	lower_dentry = unionfs_lower_dentry(dentry);
+
+	err = vfs_setxattr(lower_dentry, (char *) name, (void *) value,
+			   size, flags);
+
+out:
+	unionfs_check_dentry(dentry);
+	unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(dentry->d_sb);
+	return err;
+}
+
+/*
+ * BKL held by caller.
+ * dentry->d_inode->i_mutex locked
+ */
+int unionfs_removexattr(struct dentry *dentry, const char *name)
+{
+	struct dentry *lower_dentry = NULL;
+	int err = -EOPNOTSUPP;
+
+	unionfs_read_lock(dentry->d_sb);
+	unionfs_lock_dentry(dentry);
+
+	if (unlikely(!__unionfs_d_revalidate_chain(dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+
+	lower_dentry = unionfs_lower_dentry(dentry);
+
+	err = vfs_removexattr(lower_dentry, (char *) name);
+
+out:
+	unionfs_check_dentry(dentry);
+	unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(dentry->d_sb);
+	return err;
+}
+
+/*
+ * BKL held by caller.
+ * dentry->d_inode->i_mutex locked
+ */
+ssize_t unionfs_listxattr(struct dentry *dentry, char *list, size_t size)
+{
+	struct dentry *lower_dentry = NULL;
+	int err = -EOPNOTSUPP;
+	char *encoded_list = NULL;
+
+	unionfs_read_lock(dentry->d_sb);
+	unionfs_lock_dentry(dentry);
+
+	if (unlikely(!__unionfs_d_revalidate_chain(dentry, NULL, false))) {
+		err = -ESTALE;
+		goto out;
+	}
+
+	lower_dentry = unionfs_lower_dentry(dentry);
+
+	encoded_list = list;
+	err = vfs_listxattr(lower_dentry, encoded_list, size);
+
+out:
+	unionfs_check_dentry(dentry);
+	unionfs_unlock_dentry(dentry);
+	unionfs_read_unlock(dentry->d_sb);
+	return err;
+}
-- 
1.5.2.2
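
Editorial note on unionfs_xattr_alloc() above: it returns either a valid
buffer, NULL for a zero-size request, or an errno value encoded in the
pointer via ERR_PTR().  The sketch below mimics that convention in userspace
so the three outcomes can be told apart.  The sketch_* names are hypothetical
re-implementations for illustration only; in the kernel the real helpers come
from include/linux/err.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *sketch_err_ptr(long err)
{
	assert(err < 0 && err >= -SKETCH_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

/* Recover the errno value from an encoded pointer. */
static long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if p falls in the top SKETCH_MAX_ERRNO addresses, i.e. is an error. */
static int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Same control flow as unionfs_xattr_alloc(), with malloc() for kmalloc(). */
static void *sketch_xattr_alloc(size_t size, size_t limit)
{
	void *ptr;

	if (size > limit)
		return sketch_err_ptr(-E2BIG);
	if (!size)		/* size request, no buffer is needed */
		return NULL;
	ptr = malloc(size);
	if (!ptr)
		return sketch_err_ptr(-ENOMEM);
	return ptr;
}
```

A caller therefore has to check for an encoded error before checking for
NULL, since NULL is a legitimate "no buffer needed" result here.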



* [PATCH 27/42] Unionfs: async I/O queue headers
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (25 preceding siblings ...)
  2007-12-10  2:41 ` [PATCH 26/42] Unionfs: extended attributes operations Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10  2:42 ` [PATCH 28/42] Unionfs: async I/O queue operations Erez Zadok
                   ` (15 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/sioq.h |   92 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 92 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/sioq.h

diff --git a/fs/unionfs/sioq.h b/fs/unionfs/sioq.h
new file mode 100644
index 0000000..afb71ee
--- /dev/null
+++ b/fs/unionfs/sioq.h
@@ -0,0 +1,92 @@
+/*
+ * Copyright (c) 2006-2007 Erez Zadok
+ * Copyright (c) 2006      Charles P. Wright
+ * Copyright (c) 2006-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2006      Junjiro Okajima
+ * Copyright (c) 2006      David P. Quigley
+ * Copyright (c) 2006-2007 Stony Brook University
+ * Copyright (c) 2006-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef _SIOQ_H
+#define _SIOQ_H
+
+struct deletewh_args {
+	struct unionfs_dir_state *namelist;
+	struct dentry *dentry;
+	int bindex;
+};
+
+struct is_opaque_args {
+	struct dentry *dentry;
+};
+
+struct create_args {
+	struct inode *parent;
+	struct dentry *dentry;
+	umode_t mode;
+	struct nameidata *nd;
+};
+
+struct mkdir_args {
+	struct inode *parent;
+	struct dentry *dentry;
+	umode_t mode;
+};
+
+struct mknod_args {
+	struct inode *parent;
+	struct dentry *dentry;
+	umode_t mode;
+	dev_t dev;
+};
+
+struct symlink_args {
+	struct inode *parent;
+	struct dentry *dentry;
+	char *symbuf;
+	umode_t mode;
+};
+
+struct unlink_args {
+	struct inode *parent;
+	struct dentry *dentry;
+};
+
+
+struct sioq_args {
+	struct completion comp;
+	struct work_struct work;
+	int err;
+	void *ret;
+
+	union {
+		struct deletewh_args deletewh;
+		struct is_opaque_args is_opaque;
+		struct create_args create;
+		struct mkdir_args mkdir;
+		struct mknod_args mknod;
+		struct symlink_args symlink;
+		struct unlink_args unlink;
+	};
+};
+
+/* Extern declarations for SIOQ functions */
+extern int __init init_sioq(void);
+extern void stop_sioq(void);
+extern void run_sioq(work_func_t func, struct sioq_args *args);
+
+/* Extern declarations for our privilege escalation helpers */
+extern void __unionfs_create(struct work_struct *work);
+extern void __unionfs_mkdir(struct work_struct *work);
+extern void __unionfs_mknod(struct work_struct *work);
+extern void __unionfs_symlink(struct work_struct *work);
+extern void __unionfs_unlink(struct work_struct *work);
+extern void __delete_whiteouts(struct work_struct *work);
+extern void __is_opaque_dir(struct work_struct *work);
+
+#endif /* not _SIOQ_H */
-- 
1.5.2.2
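
Editorial note on struct sioq_args above: the work item is embedded in the
argument block, so a handler that only receives the work pointer can recover
its arguments with container_of() and pick the matching union member.  The
following userspace sketch shows that pattern under simplifying assumptions:
all sketch_* names are hypothetical, the "queue" simply calls the handler
synchronously, and a plain flag stands in for struct completion.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_work {
	void (*func)(struct sketch_work *work);
};

struct sketch_mkdir_args {
	int mode;
};

struct sketch_unlink_args {
	int unused;
};

struct sketch_args {
	struct sketch_work work;	/* embedded, like work_struct */
	int done;			/* stands in for struct completion */
	int err;
	union {				/* one member per operation */
		struct sketch_mkdir_args mkdir;
		struct sketch_unlink_args unlink;
	};
};

/* Handler: gets only the work pointer, recovers the enclosing args. */
static void sketch_mkdir(struct sketch_work *work)
{
	struct sketch_args *args =
		sketch_container_of(work, struct sketch_args, work);

	/* Pretend operation: reject modes with bits outside 07777. */
	args->err = (args->mkdir.mode & ~07777) ? -EINVAL : 0;
	args->done = 1;		/* the patch calls complete() here */
}

/* Submit and wait; a real workqueue would run func in a worker thread. */
static void sketch_run(void (*func)(struct sketch_work *),
		       struct sketch_args *args)
{
	args->work.func = func;
	args->done = 0;
	args->work.func(&args->work);
	assert(args->done);
}
```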



* [PATCH 28/42] Unionfs: async I/O queue operations
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (26 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 27/42] Unionfs: async I/O queue headers Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10  2:42 ` [PATCH 29/42] Unionfs: miscellaneous helper routines Erez Zadok
                   ` (14 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/sioq.c |  119 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 119 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/sioq.c

diff --git a/fs/unionfs/sioq.c b/fs/unionfs/sioq.c
new file mode 100644
index 0000000..2a8c88e
--- /dev/null
+++ b/fs/unionfs/sioq.c
@@ -0,0 +1,119 @@
+/*
+ * Copyright (c) 2006-2007 Erez Zadok
+ * Copyright (c) 2006      Charles P. Wright
+ * Copyright (c) 2006-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2006      Junjiro Okajima
+ * Copyright (c) 2006      David P. Quigley
+ * Copyright (c) 2006-2007 Stony Brook University
+ * Copyright (c) 2006-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+/*
+ * Super-user IO work Queue - sometimes we need to perform actions which
+ * would fail due to the unix permissions on the parent directory (e.g.,
+ * rmdir a directory which appears empty, but in reality contains
+ * whiteouts).
+ */
+
+static struct workqueue_struct *superio_workqueue;
+
+int __init init_sioq(void)
+{
+	int err;
+
+	superio_workqueue = create_workqueue("unionfs_siod");
+	if (superio_workqueue)
+		return 0;
+
+	err = -ENOMEM;	/* create_workqueue() returns NULL on failure */
+	printk(KERN_ERR "unionfs: create_workqueue failed %d\n", err);
+	superio_workqueue = NULL;
+	return err;
+}
+
+void stop_sioq(void)
+{
+	if (superio_workqueue)
+		destroy_workqueue(superio_workqueue);
+}
+
+void run_sioq(work_func_t func, struct sioq_args *args)
+{
+	INIT_WORK(&args->work, func);
+
+	init_completion(&args->comp);
+	while (!queue_work(superio_workqueue, &args->work)) {
+		/* TODO: do accounting if needed */
+		schedule();
+	}
+	wait_for_completion(&args->comp);
+}
+
+void __unionfs_create(struct work_struct *work)
+{
+	struct sioq_args *args = container_of(work, struct sioq_args, work);
+	struct create_args *c = &args->create;
+
+	args->err = vfs_create(c->parent, c->dentry, c->mode, c->nd);
+	complete(&args->comp);
+}
+
+void __unionfs_mkdir(struct work_struct *work)
+{
+	struct sioq_args *args = container_of(work, struct sioq_args, work);
+	struct mkdir_args *m = &args->mkdir;
+
+	args->err = vfs_mkdir(m->parent, m->dentry, m->mode);
+	complete(&args->comp);
+}
+
+void __unionfs_mknod(struct work_struct *work)
+{
+	struct sioq_args *args = container_of(work, struct sioq_args, work);
+	struct mknod_args *m = &args->mknod;
+
+	args->err = vfs_mknod(m->parent, m->dentry, m->mode, m->dev);
+	complete(&args->comp);
+}
+
+void __unionfs_symlink(struct work_struct *work)
+{
+	struct sioq_args *args = container_of(work, struct sioq_args, work);
+	struct symlink_args *s = &args->symlink;
+
+	args->err = vfs_symlink(s->parent, s->dentry, s->symbuf, s->mode);
+	complete(&args->comp);
+}
+
+void __unionfs_unlink(struct work_struct *work)
+{
+	struct sioq_args *args = container_of(work, struct sioq_args, work);
+	struct unlink_args *u = &args->unlink;
+
+	args->err = vfs_unlink(u->parent, u->dentry);
+	complete(&args->comp);
+}
+
+void __delete_whiteouts(struct work_struct *work)
+{
+	struct sioq_args *args = container_of(work, struct sioq_args, work);
+	struct deletewh_args *d = &args->deletewh;
+
+	args->err = do_delete_whiteouts(d->dentry, d->bindex, d->namelist);
+	complete(&args->comp);
+}
+
+void __is_opaque_dir(struct work_struct *work)
+{
+	struct sioq_args *args = container_of(work, struct sioq_args, work);
+
+	args->ret = lookup_one_len(UNIONFS_DIR_OPAQUE, args->is_opaque.dentry,
+				   sizeof(UNIONFS_DIR_OPAQUE) - 1);
+	complete(&args->comp);
+}
-- 
1.5.2.2
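
Each handler in the patch above follows the same pattern: recover the enclosing argument block from the embedded work item with container_of(), perform the operation, store the result, and signal completion. The following is a minimal userspace sketch of that pattern, not the kernel code itself. The names work_item, demo_args, demo_add and demo_run are hypothetical stand-ins, the "queue" runs the handler synchronously, and a plain flag stands in for the completion.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct work_struct. */
struct work_item {
	void (*func)(struct work_item *work);
};

/* Hypothetical stand-in for struct sioq_args. */
struct demo_args {
	int a, b;		/* operation inputs */
	int err;		/* operation result */
	int done;		/* stands in for the completion */
	struct work_item work;	/* embedded work item */
};

/* Handler: recover the argument block, do the work, signal completion. */
void demo_add(struct work_item *work)
{
	struct demo_args *args = container_of(work, struct demo_args, work);

	args->err = args->a + args->b;
	args->done = 1;
}

/*
 * Runs the handler immediately.  The real run_sioq() queues the work
 * on a workqueue thread and then blocks in wait_for_completion().
 */
int demo_run(void (*func)(struct work_item *), struct demo_args *args)
{
	assert(func != NULL);
	args->done = 0;
	args->work.func = func;
	args->work.func(&args->work);
	return args->done ? args->err : -1;
}
```

Embedding the work item inside the argument block means the handler needs only the work pointer to find its inputs and the place to store its result.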



* [PATCH 29/42] Unionfs: miscellaneous helper routines
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (27 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 28/42] Unionfs: async I/O queue operations Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10  2:42 ` [PATCH 30/42] Unionfs: debugging infrastructure Erez Zadok
                   ` (13 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Mostly related to whiteouts.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/subr.c |  242 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 242 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/subr.c

diff --git a/fs/unionfs/subr.c b/fs/unionfs/subr.c
new file mode 100644
index 0000000..1a26c57
--- /dev/null
+++ b/fs/unionfs/subr.c
@@ -0,0 +1,242 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2003-2006 Charles P. Wright
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2005-2006 Junjiro Okajima
+ * Copyright (c) 2005      Arun M. Krishnakumar
+ * Copyright (c) 2004-2006 David P. Quigley
+ * Copyright (c) 2003-2004 Mohammad Nayyer Zubair
+ * Copyright (c) 2003      Puja Gupta
+ * Copyright (c) 2003      Harikesavan Krishnan
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+/*
+ * Pass a unionfs dentry and a starting branch index.  It will try to create
+ * a whiteout for the filename in dentry in branch 'start'.  On error, it
+ * will proceed to the next branch to the left.
+ */
+int create_whiteout(struct dentry *dentry, int start)
+{
+	int bstart, bend, bindex;
+	struct dentry *lower_dir_dentry;
+	struct dentry *lower_dentry;
+	struct dentry *lower_wh_dentry;
+	struct nameidata nd;
+	char *name = NULL;
+	int err = -EINVAL;
+
+	verify_locked(dentry);
+
+	bstart = dbstart(dentry);
+	bend = dbend(dentry);
+
+	/* create dentry's whiteout equivalent */
+	name = alloc_whname(dentry->d_name.name, dentry->d_name.len);
+	if (unlikely(IS_ERR(name))) {
+		err = PTR_ERR(name);
+		goto out;
+	}
+
+	for (bindex = start; bindex >= 0; bindex--) {
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+
+		if (!lower_dentry) {
+			/*
+			 * if lower dentry is not present, create the
+			 * entire lower dentry directory structure and go
+			 * ahead.  Since we want to just create whiteout, we
+			 * only want the parent dentry, and hence get rid of
+			 * this dentry.
+			 */
+			lower_dentry = create_parents(dentry->d_inode,
+						      dentry,
+						      dentry->d_name.name,
+						      bindex);
+			if (!lower_dentry || IS_ERR(lower_dentry)) {
+				int ret = PTR_ERR(lower_dentry);
+				if (!IS_COPYUP_ERR(ret))
+					printk(KERN_ERR
+					       "unionfs: create_parents for "
+					       "whiteout failed: bindex=%d "
+					       "err=%d\n", bindex, ret);
+				continue;
+			}
+		}
+
+		lower_wh_dentry =
+			lookup_one_len(name, lower_dentry->d_parent,
+				       dentry->d_name.len + UNIONFS_WHLEN);
+		if (IS_ERR(lower_wh_dentry))
+			continue;
+
+		/*
+		 * The whiteout already exists. This used to be impossible,
+		 * but now is possible because of opaqueness.
+		 */
+		if (lower_wh_dentry->d_inode) {
+			dput(lower_wh_dentry);
+			err = 0;
+			goto out;
+		}
+
+		err = init_lower_nd(&nd, LOOKUP_CREATE);
+		if (unlikely(err < 0))
+			goto out;
+		lower_dir_dentry = lock_parent(lower_wh_dentry);
+		err = is_robranch_super(dentry->d_sb, bindex);
+		if (!err)
+			err = vfs_create(lower_dir_dentry->d_inode,
+					 lower_wh_dentry,
+					 ~current->fs->umask & S_IRWXUGO,
+					 &nd);
+		unlock_dir(lower_dir_dentry);
+		dput(lower_wh_dentry);
+		release_lower_nd(&nd, err);
+
+		if (!err || !IS_COPYUP_ERR(err))
+			break;
+	}
+
+	/* set dbopaque so that lookup will not proceed after this branch */
+	if (!err)
+		set_dbopaque(dentry, bindex);
+
+out:
+	kfree(name);
+	return err;
+}
+
+/*
+ * This is a helper function for rename, which ends up with hosed over
+ * dentries when it needs to revert.
+ */
+int unionfs_refresh_lower_dentry(struct dentry *dentry, int bindex)
+{
+	struct dentry *lower_dentry;
+	struct dentry *lower_parent;
+	int err = 0;
+
+	verify_locked(dentry);
+
+	unionfs_lock_dentry(dentry->d_parent);
+	lower_parent = unionfs_lower_dentry_idx(dentry->d_parent, bindex);
+	unionfs_unlock_dentry(dentry->d_parent);
+
+	BUG_ON(!S_ISDIR(lower_parent->d_inode->i_mode));
+
+	lower_dentry = lookup_one_len(dentry->d_name.name, lower_parent,
+				      dentry->d_name.len);
+	if (IS_ERR(lower_dentry)) {
+		err = PTR_ERR(lower_dentry);
+		goto out;
+	}
+
+	dput(unionfs_lower_dentry_idx(dentry, bindex));
+	iput(unionfs_lower_inode_idx(dentry->d_inode, bindex));
+	unionfs_set_lower_inode_idx(dentry->d_inode, bindex, NULL);
+
+	if (!lower_dentry->d_inode) {
+		dput(lower_dentry);
+		unionfs_set_lower_dentry_idx(dentry, bindex, NULL);
+	} else {
+		unionfs_set_lower_dentry_idx(dentry, bindex, lower_dentry);
+		unionfs_set_lower_inode_idx(dentry->d_inode, bindex,
+					    igrab(lower_dentry->d_inode));
+	}
+
+out:
+	return err;
+}
+
+int make_dir_opaque(struct dentry *dentry, int bindex)
+{
+	int err = 0;
+	struct dentry *lower_dentry, *diropq;
+	struct inode *lower_dir;
+	struct nameidata nd;
+	kernel_cap_t orig_cap;
+
+	/*
+	 * Opaque directory whiteout markers are special files (like regular
+	 * whiteouts), and should appear to the users as if they don't
+	 * exist.  They should be created/deleted regardless of directory
+	 * search/create permissions, but only for the duration of this
+	 * creation of the .wh.__dir_opaque file.  Note that this does not
+	 * circumvent the normal ->permission check.
+	 */
+	orig_cap = current->cap_effective;
+	cap_raise(current->cap_effective, CAP_DAC_READ_SEARCH);
+	cap_raise(current->cap_effective, CAP_DAC_OVERRIDE);
+
+	lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+	lower_dir = lower_dentry->d_inode;
+	BUG_ON(!S_ISDIR(dentry->d_inode->i_mode) ||
+	       !S_ISDIR(lower_dir->i_mode));
+
+	mutex_lock(&lower_dir->i_mutex);
+	diropq = lookup_one_len(UNIONFS_DIR_OPAQUE, lower_dentry,
+				sizeof(UNIONFS_DIR_OPAQUE) - 1);
+	if (IS_ERR(diropq)) {
+		err = PTR_ERR(diropq);
+		goto out;
+	}
+
+	err = init_lower_nd(&nd, LOOKUP_CREATE);
+	if (unlikely(err < 0))
+		goto out;
+	if (!diropq->d_inode)
+		err = vfs_create(lower_dir, diropq, S_IRUGO, &nd);
+	if (!err)
+		set_dbopaque(dentry, bindex);
+	release_lower_nd(&nd, err);
+
+	dput(diropq);
+
+out:
+	mutex_unlock(&lower_dir->i_mutex);
+	current->cap_effective = orig_cap;
+	return err;
+}
+
+/*
+ * returns the right nlink value based on the inode type
+ */
+int unionfs_get_nlinks(const struct inode *inode)
+{
+	/* don't bother to do all the work since we're unlinked */
+	if (inode->i_nlink == 0)
+		return 0;
+
+	if (!S_ISDIR(inode->i_mode))
+		return unionfs_lower_inode(inode)->i_nlink;
+
+	/*
+	 * For directories, we return 1. The only place that could care
+	 * about links is readdir, and there's d_type there so even that
+	 * doesn't matter.
+	 */
+	return 1;
+}
+
+/* construct whiteout filename */
+char *alloc_whname(const char *name, int len)
+{
+	char *buf;
+
+	buf = kmalloc(len + UNIONFS_WHLEN + 1, GFP_KERNEL);
+	if (unlikely(!buf))
+		return ERR_PTR(-ENOMEM);
+
+	strcpy(buf, UNIONFS_WHPFX);
+	strlcat(buf, name, len + UNIONFS_WHLEN + 1);
+
+	return buf;
+}
-- 
1.5.2.2
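
The alloc_whname() helper above builds a whiteout name by prepending the whiteout prefix to the first len bytes of the original name. A minimal userspace sketch of the same construction follows. It assumes the prefix is ".wh." (the definition of UNIONFS_WHPFX is not part of this patch), and the names WH_PREFIX and wh_name are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed value of UNIONFS_WHPFX; its definition is not shown here. */
#define WH_PREFIX ".wh."

/*
 * Returns a newly allocated "<prefix><name>" string built from the
 * first len bytes of name, or NULL on allocation failure.  The caller
 * frees the result.
 */
char *wh_name(const char *name, size_t len)
{
	size_t pfx = sizeof(WH_PREFIX) - 1;
	char *buf;

	assert(name != NULL);
	buf = malloc(pfx + len + 1);
	if (!buf)
		return NULL;

	memcpy(buf, WH_PREFIX, pfx);
	memcpy(buf + pfx, name, len);
	buf[pfx + len] = '\0';
	return buf;
}
```

Under that assumption, wh_name("foo", 3) yields ".wh.foo". Copying exactly len bytes keeps the sketch correct even when the source name is not NUL-terminated at len, which is the usual situation for dentry names.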



* [PATCH 30/42] Unionfs: debugging infrastructure
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (28 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 29/42] Unionfs: miscellaneous helper routines Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10  2:42 ` [PATCH 31/42] VFS: fs_stack header cleanups Erez Zadok
                   ` (12 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/unionfs/debug.c |  532 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 532 insertions(+), 0 deletions(-)
 create mode 100644 fs/unionfs/debug.c

diff --git a/fs/unionfs/debug.c b/fs/unionfs/debug.c
new file mode 100644
index 0000000..c2b8b58
--- /dev/null
+++ b/fs/unionfs/debug.c
@@ -0,0 +1,532 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include "union.h"
+
+/*
+ * Helper debugging functions for maintainers (and for users to report
+ * useful information back to maintainers)
+ */
+
+/* it's always useful to know what part of the code called us */
+#define PRINT_CALLER(fname, fxn, line)					\
+	do {								\
+		if (!printed_caller) {					\
+			pr_debug("PC:%s:%s:%d\n", (fname), (fxn), (line)); \
+			printed_caller = 1;				\
+		}							\
+	} while (0)
+
+/*
+ * __unionfs_check_{inode,dentry,file} perform exhaustive sanity checking on
+ * the fan-out of various Unionfs objects.  We check that no lower objects
+ * exist outside the start/end branch range; that all objects within are
+ * non-NULL (with some allowed exceptions); that for every lower file
+ * there's a lower dentry+inode; that the start/end ranges match for all
+ * corresponding lower objects; that open files/symlinks have only one lower
+ * object, but directories can have several; and more.
+ */
+void __unionfs_check_inode(const struct inode *inode,
+			   const char *fname, const char *fxn, int line)
+{
+	int bindex;
+	int istart, iend;
+	struct inode *lower_inode;
+	struct super_block *sb;
+	int printed_caller = 0;
+	void *poison_ptr;
+
+	/* for inodes now */
+	BUG_ON(!inode);
+	sb = inode->i_sb;
+	istart = ibstart(inode);
+	iend = ibend(inode);
+	/* don't check inode if no lower branches */
+	if (istart < 0 && iend < 0)
+		return;
+	if (unlikely(istart > iend)) {
+		PRINT_CALLER(fname, fxn, line);
+		pr_debug(" Ci0: inode=%p istart/end=%d:%d\n",
+			 inode, istart, iend);
+	}
+	if (unlikely((istart == -1 && iend != -1) ||
+		     (istart != -1 && iend == -1))) {
+		PRINT_CALLER(fname, fxn, line);
+		pr_debug(" Ci1: inode=%p istart/end=%d:%d\n",
+			 inode, istart, iend);
+	}
+	if (!S_ISDIR(inode->i_mode)) {
+		if (unlikely(iend != istart)) {
+			PRINT_CALLER(fname, fxn, line);
+			pr_debug(" Ci2: inode=%p istart=%d iend=%d\n",
+				 inode, istart, iend);
+		}
+	}
+
+	for (bindex = sbstart(sb); bindex < sbmax(sb); bindex++) {
+		if (unlikely(!UNIONFS_I(inode))) {
+			PRINT_CALLER(fname, fxn, line);
+			pr_debug(" Ci3: no inode_info %p\n", inode);
+			return;
+		}
+		if (unlikely(!UNIONFS_I(inode)->lower_inodes)) {
+			PRINT_CALLER(fname, fxn, line);
+			pr_debug(" Ci4: no lower_inodes %p\n", inode);
+			return;
+		}
+		lower_inode = unionfs_lower_inode_idx(inode, bindex);
+		if (lower_inode) {
+			memset(&poison_ptr, POISON_INUSE, sizeof(void *));
+			if (unlikely(bindex < istart || bindex > iend)) {
+				PRINT_CALLER(fname, fxn, line);
+				pr_debug(" Ci5: inode/linode=%p:%p bindex=%d "
+					 "istart/end=%d:%d\n", inode,
+					 lower_inode, bindex, istart, iend);
+			} else if (unlikely(lower_inode == poison_ptr)) {
+				/* freed inode! */
+				PRINT_CALLER(fname, fxn, line);
+				pr_debug(" Ci6: inode/linode=%p:%p bindex=%d "
+					 "istart/end=%d:%d\n", inode,
+					 lower_inode, bindex, istart, iend);
+			}
+			continue;
+		}
+		/* if we get here, then lower_inode == NULL */
+		if (bindex < istart || bindex > iend)
+			continue;
+		/*
+		 * directories can have NULL lower inodes in b/t start/end,
+		 * but NOT if at the start/end range.
+		 */
+		if (unlikely(S_ISDIR(inode->i_mode) &&
+			     bindex > istart && bindex < iend))
+			continue;
+		PRINT_CALLER(fname, fxn, line);
+		pr_debug(" Ci7: inode/linode=%p:%p "
+			 "bindex=%d istart/end=%d:%d\n",
+			 inode, lower_inode, bindex, istart, iend);
+	}
+}
+
+void __unionfs_check_dentry(const struct dentry *dentry,
+			    const char *fname, const char *fxn, int line)
+{
+	int bindex;
+	int dstart, dend, istart, iend;
+	struct dentry *lower_dentry;
+	struct inode *inode, *lower_inode;
+	struct super_block *sb;
+	struct vfsmount *lower_mnt;
+	int printed_caller = 0;
+	void *poison_ptr;
+
+	BUG_ON(!dentry);
+	sb = dentry->d_sb;
+	inode = dentry->d_inode;
+	dstart = dbstart(dentry);
+	dend = dbend(dentry);
+	/* don't check dentry/mnt if no lower branches */
+	if (dstart < 0 && dend < 0)
+		goto check_inode;
+	BUG_ON(dstart > dend);
+
+	if (unlikely((dstart == -1 && dend != -1) ||
+		     (dstart != -1 && dend == -1))) {
+		PRINT_CALLER(fname, fxn, line);
+		pr_debug(" CD0: dentry=%p dstart/end=%d:%d\n",
+			 dentry, dstart, dend);
+	}
+	/*
+	 * check for NULL dentries inside the start/end range, or
+	 * non-NULL dentries outside the start/end range.
+	 */
+	for (bindex = sbstart(sb); bindex < sbmax(sb); bindex++) {
+		lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+		if (lower_dentry) {
+			if (unlikely(bindex < dstart || bindex > dend)) {
+				PRINT_CALLER(fname, fxn, line);
+				pr_debug(" CD1: dentry/lower=%p:%p(%p) "
+					 "bindex=%d dstart/end=%d:%d\n",
+					 dentry, lower_dentry,
+					 (lower_dentry ? lower_dentry->d_inode :
+					  (void *) -1L),
+					 bindex, dstart, dend);
+			}
+		} else {	/* lower_dentry == NULL */
+			if (bindex < dstart || bindex > dend)
+				continue;
+			/*
+			 * Directories can have NULL lower inodes in b/t
+			 * start/end, but NOT if at the start/end range.
+			 * Ignore this rule, however, if this is a NULL
+			 * dentry or a deleted dentry.
+			 */
+			if (unlikely(!d_deleted((struct dentry *) dentry) &&
+				     inode &&
+				     !(inode && S_ISDIR(inode->i_mode) &&
+				       bindex > dstart && bindex < dend))) {
+				PRINT_CALLER(fname, fxn, line);
+				pr_debug(" CD2: dentry/lower=%p:%p(%p) "
+					 "bindex=%d dstart/end=%d:%d\n",
+					 dentry, lower_dentry,
+					 (lower_dentry ?
+					  lower_dentry->d_inode :
+					  (void *) -1L),
+					 bindex, dstart, dend);
+			}
+		}
+	}
+
+	/* check for vfsmounts same as for dentries */
+	for (bindex = sbstart(sb); bindex < sbmax(sb); bindex++) {
+		lower_mnt = unionfs_lower_mnt_idx(dentry, bindex);
+		if (lower_mnt) {
+			if (unlikely(bindex < dstart || bindex > dend)) {
+				PRINT_CALLER(fname, fxn, line);
+				pr_debug(" CM0: dentry/lmnt=%p:%p bindex=%d "
+					 "dstart/end=%d:%d\n", dentry,
+					 lower_mnt, bindex, dstart, dend);
+			}
+		} else {	/* lower_mnt == NULL */
+			if (bindex < dstart || bindex > dend)
+				continue;
+			/*
+			 * Directories can have NULL lower inodes in b/t
+			 * start/end, but NOT if at the start/end range.
+			 * Ignore this rule, however, if this is a NULL
+			 * dentry.
+			 */
+			if (unlikely(inode &&
+				     !(inode && S_ISDIR(inode->i_mode) &&
+				       bindex > dstart && bindex < dend))) {
+				PRINT_CALLER(fname, fxn, line);
+				pr_debug(" CM1: dentry/lmnt=%p:%p "
+					 "bindex=%d dstart/end=%d:%d\n",
+					 dentry, lower_mnt, bindex,
+					 dstart, dend);
+			}
+		}
+	}
+
+check_inode:
+	/* for inodes now */
+	if (!inode)
+		return;
+	istart = ibstart(inode);
+	iend = ibend(inode);
+	/* don't check inode if no lower branches */
+	if (istart < 0 && iend < 0)
+		return;
+	BUG_ON(istart > iend);
+	if (unlikely((istart == -1 && iend != -1) ||
+		     (istart != -1 && iend == -1))) {
+		PRINT_CALLER(fname, fxn, line);
+		pr_debug(" CI0: dentry/inode=%p:%p istart/end=%d:%d\n",
+			 dentry, inode, istart, iend);
+	}
+	if (unlikely(istart != dstart)) {
+		PRINT_CALLER(fname, fxn, line);
+		pr_debug(" CI1: dentry/inode=%p:%p istart=%d dstart=%d\n",
+			 dentry, inode, istart, dstart);
+	}
+	if (unlikely(iend != dend)) {
+		PRINT_CALLER(fname, fxn, line);
+		pr_debug(" CI2: dentry/inode=%p:%p iend=%d dend=%d\n",
+			 dentry, inode, iend, dend);
+	}
+
+	if (!S_ISDIR(inode->i_mode)) {
+		if (unlikely(dend != dstart)) {
+			PRINT_CALLER(fname, fxn, line);
+			pr_debug(" CI3: dentry/inode=%p:%p dstart=%d dend=%d\n",
+				 dentry, inode, dstart, dend);
+		}
+		if (unlikely(iend != istart)) {
+			PRINT_CALLER(fname, fxn, line);
+			pr_debug(" CI4: dentry/inode=%p:%p istart=%d iend=%d\n",
+				 dentry, inode, istart, iend);
+		}
+	}
+
+	for (bindex = sbstart(sb); bindex < sbmax(sb); bindex++) {
+		lower_inode = unionfs_lower_inode_idx(inode, bindex);
+		if (lower_inode) {
+			memset(&poison_ptr, POISON_INUSE, sizeof(void *));
+			if (unlikely(bindex < istart || bindex > iend)) {
+				PRINT_CALLER(fname, fxn, line);
+				pr_debug(" CI5: dentry/linode=%p:%p bindex=%d "
+					 "istart/end=%d:%d\n", dentry,
+					 lower_inode, bindex, istart, iend);
+			} else if (unlikely(lower_inode == poison_ptr)) {
+				/* freed inode! */
+				PRINT_CALLER(fname, fxn, line);
+				pr_debug(" CI6: dentry/linode=%p:%p bindex=%d "
+					 "istart/end=%d:%d\n", dentry,
+					 lower_inode, bindex, istart, iend);
+			}
+			continue;
+		}
+		/* if we get here, then lower_inode == NULL */
+		if (bindex < istart || bindex > iend)
+			continue;
+		/*
+		 * directories can have NULL lower inodes in b/t start/end,
+		 * but NOT if at the start/end range.
+		 */
+		if (unlikely(S_ISDIR(inode->i_mode) &&
+			     bindex > istart && bindex < iend))
+			continue;
+		PRINT_CALLER(fname, fxn, line);
+		pr_debug(" CI7: dentry/linode=%p:%p "
+			 "bindex=%d istart/end=%d:%d\n",
+			 dentry, lower_inode, bindex, istart, iend);
+	}
+
+	/*
+	 * If it's a directory, then intermediate objects b/t start/end can
+	 * be NULL.  But, check that all three are NULL: lower dentry, mnt,
+	 * and inode.
+	 */
+	if (dstart >= 0 && dend >= 0 && S_ISDIR(inode->i_mode))
+		for (bindex = dstart+1; bindex < dend; bindex++) {
+			lower_inode = unionfs_lower_inode_idx(inode, bindex);
+			lower_dentry = unionfs_lower_dentry_idx(dentry,
+								bindex);
+			lower_mnt = unionfs_lower_mnt_idx(dentry, bindex);
+			if (unlikely(!((lower_inode && lower_dentry &&
+					lower_mnt) ||
+				       (!lower_inode &&
+					!lower_dentry && !lower_mnt)))) {
+				PRINT_CALLER(fname, fxn, line);
+				pr_debug(" Cx: lmnt/ldentry/linode=%p:%p:%p "
+					 "bindex=%d dstart/end=%d:%d\n",
+					 lower_mnt, lower_dentry, lower_inode,
+					 bindex, dstart, dend);
+			}
+		}
+	/* check if lower inode is newer than upper one (it shouldn't be) */
+	if (unlikely(is_newer_lower(dentry))) {
+		PRINT_CALLER(fname, fxn, line);
+		for (bindex = ibstart(inode); bindex <= ibend(inode);
+		     bindex++) {
+			lower_inode = unionfs_lower_inode_idx(inode, bindex);
+			if (unlikely(!lower_inode))
+				continue;
+			pr_debug(" CI8: bindex=%d mtime/lmtime=%lu.%lu/%lu.%lu "
+				 "ctime/lctime=%lu.%lu/%lu.%lu\n",
+				 bindex,
+				 inode->i_mtime.tv_sec,
+				 inode->i_mtime.tv_nsec,
+				 lower_inode->i_mtime.tv_sec,
+				 lower_inode->i_mtime.tv_nsec,
+				 inode->i_ctime.tv_sec,
+				 inode->i_ctime.tv_nsec,
+				 lower_inode->i_ctime.tv_sec,
+				 lower_inode->i_ctime.tv_nsec);
+		}
+	}
+}
+
+void __unionfs_check_file(const struct file *file,
+			  const char *fname, const char *fxn, int line)
+{
+	int bindex;
+	int dstart, dend, fstart, fend;
+	struct dentry *dentry;
+	struct file *lower_file;
+	struct inode *inode;
+	struct super_block *sb;
+	int printed_caller = 0;
+
+	BUG_ON(!file);
+	dentry = file->f_path.dentry;
+	sb = dentry->d_sb;
+	dstart = dbstart(dentry);
+	dend = dbend(dentry);
+	BUG_ON(dstart > dend);
+	fstart = fbstart(file);
+	fend = fbend(file);
+	BUG_ON(fstart > fend);
+
+	if (unlikely((fstart == -1 && fend != -1) ||
+		     (fstart != -1 && fend == -1))) {
+		PRINT_CALLER(fname, fxn, line);
+		pr_debug(" CF0: file/dentry=%p:%p fstart/end=%d:%d\n",
+			 file, dentry, fstart, fend);
+	}
+	if (unlikely(fstart != dstart)) {
+		PRINT_CALLER(fname, fxn, line);
+		pr_debug(" CF1: file/dentry=%p:%p fstart=%d dstart=%d\n",
+			 file, dentry, fstart, dstart);
+	}
+	if (unlikely(fend != dend)) {
+		PRINT_CALLER(fname, fxn, line);
+		pr_debug(" CF2: file/dentry=%p:%p fend=%d dend=%d\n",
+			 file, dentry, fend, dend);
+	}
+	inode = dentry->d_inode;
+	if (!S_ISDIR(inode->i_mode)) {
+		if (unlikely(fend != fstart)) {
+			PRINT_CALLER(fname, fxn, line);
+			pr_debug(" CF3: file/inode=%p:%p fstart=%d fend=%d\n",
+				 file, inode, fstart, fend);
+		}
+		if (unlikely(dend != dstart)) {
+			PRINT_CALLER(fname, fxn, line);
+			pr_debug(" CF4: file/dentry=%p:%p dstart=%d dend=%d\n",
+				 file, dentry, dstart, dend);
+		}
+	}
+
+	/*
+	 * check for NULL lower files inside the start/end range, or
+	 * non-NULL lower files outside the start/end range.
+	 */
+	for (bindex = sbstart(sb); bindex < sbmax(sb); bindex++) {
+		lower_file = unionfs_lower_file_idx(file, bindex);
+		if (lower_file) {
+			if (unlikely(bindex < fstart || bindex > fend)) {
+				PRINT_CALLER(fname, fxn, line);
+				pr_debug(" CF5: file/lower=%p:%p bindex=%d "
+					 "fstart/end=%d:%d\n", file,
+					 lower_file, bindex, fstart, fend);
+			}
+		} else {	/* lower_file == NULL */
+			if (bindex >= fstart && bindex <= fend) {
+				/*
+				 * directories can have NULL lower files in
+				 * b/t start/end, but NOT if at the
+				 * start/end range.
+				 */
+				if (unlikely(!(S_ISDIR(inode->i_mode) &&
+					       bindex > fstart &&
+					       bindex < fend))) {
+					PRINT_CALLER(fname, fxn, line);
+					pr_debug(" CF6: file/lower=%p:%p "
+						 "bindex=%d fstart/end=%d:%d\n",
+						 file, lower_file, bindex,
+						 fstart, fend);
+				}
+			}
+		}
+	}
+
+	__unionfs_check_dentry(dentry, fname, fxn, line);
+}
+
+void __unionfs_check_nd(const struct nameidata *nd,
+			const char *fname, const char *fxn, int line)
+{
+	struct file *file;
+	int printed_caller = 0;
+
+	if (unlikely(!nd))
+		return;
+	if (nd->flags & LOOKUP_OPEN) {
+		file = nd->intent.open.file;
+		if (unlikely(file->f_path.dentry &&
+			     strcmp(file->f_path.dentry->d_sb->s_type->name,
+				    UNIONFS_NAME))) {
+			PRINT_CALLER(fname, fxn, line);
+			pr_debug(" CND1: lower_file of type %s\n",
+				 file->f_path.dentry->d_sb->s_type->name);
+			BUG();
+		}
+	}
+}
+
+/* useful to track vfsmount leaks that could cause EBUSY on unmount */
+void __show_branch_counts(const struct super_block *sb,
+			  const char *file, const char *fxn, int line)
+{
+	int i;
+	struct vfsmount *mnt;
+
+	pr_debug("BC:");
+	for (i = 0; i < sbmax(sb); i++) {
+		if (likely(sb->s_root))
+			mnt = UNIONFS_D(sb->s_root)->lower_paths[i].mnt;
+		else
+			mnt = NULL;
+		pr_debug("%d:", (mnt ? atomic_read(&mnt->mnt_count) : -99));
+	}
+	pr_debug("%s:%s:%d\n", file, fxn, line);
+}
+
+void __show_inode_times(const struct inode *inode,
+			const char *file, const char *fxn, int line)
+{
+	struct inode *lower_inode;
+	int bindex;
+
+	for (bindex = ibstart(inode); bindex <= ibend(inode); bindex++) {
+		lower_inode = unionfs_lower_inode_idx(inode, bindex);
+		if (unlikely(!lower_inode))
+			continue;
+		pr_debug("IT(%lu:%d): ", inode->i_ino, bindex);
+		pr_debug("%s:%s:%d ", file, fxn, line);
+		pr_debug("um=%lu/%lu lm=%lu/%lu ",
+			 inode->i_mtime.tv_sec, inode->i_mtime.tv_nsec,
+			 lower_inode->i_mtime.tv_sec,
+			 lower_inode->i_mtime.tv_nsec);
+		pr_debug("uc=%lu/%lu lc=%lu/%lu\n",
+			 inode->i_ctime.tv_sec, inode->i_ctime.tv_nsec,
+			 lower_inode->i_ctime.tv_sec,
+			 lower_inode->i_ctime.tv_nsec);
+	}
+}
+
+void __show_dinode_times(const struct dentry *dentry,
+			const char *file, const char *fxn, int line)
+{
+	struct inode *inode = dentry->d_inode;
+	struct inode *lower_inode;
+	int bindex;
+
+	for (bindex = ibstart(inode); bindex <= ibend(inode); bindex++) {
+		lower_inode = unionfs_lower_inode_idx(inode, bindex);
+		if (!lower_inode)
+			continue;
+		pr_debug("DT(%s:%lu:%d): ", dentry->d_name.name, inode->i_ino,
+			 bindex);
+		pr_debug("%s:%s:%d ", file, fxn, line);
+		pr_debug("um=%lu/%lu lm=%lu/%lu ",
+			 inode->i_mtime.tv_sec, inode->i_mtime.tv_nsec,
+			 lower_inode->i_mtime.tv_sec,
+			 lower_inode->i_mtime.tv_nsec);
+		pr_debug("uc=%lu/%lu lc=%lu/%lu\n",
+			 inode->i_ctime.tv_sec, inode->i_ctime.tv_nsec,
+			 lower_inode->i_ctime.tv_sec,
+			 lower_inode->i_ctime.tv_nsec);
+	}
+}
+
+void __show_inode_counts(const struct inode *inode,
+			const char *file, const char *fxn, int line)
+{
+	struct inode *lower_inode;
+	int bindex;
+
+	if (unlikely(!inode)) {
+		pr_debug("SiC: Null inode\n");
+		return;
+	}
+	for (bindex = sbstart(inode->i_sb); bindex <= sbend(inode->i_sb);
+	     bindex++) {
+		lower_inode = unionfs_lower_inode_idx(inode, bindex);
+		if (unlikely(!lower_inode))
+			continue;
+		pr_debug("SIC(%lu:%d:%d): ", inode->i_ino, bindex,
+			 atomic_read(&(inode)->i_count));
+		pr_debug("lc=%d ", atomic_read(&(lower_inode)->i_count));
+		pr_debug("%s:%s:%d\n", file, fxn, line);
+	}
+}
-- 
1.5.2.2
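
The per-branch loops in the checking functions above all enforce the same fan-out rule: a non-NULL lower object must lie inside the start/end range; a NULL lower object is acceptable outside the range, and inside it only for directories at positions strictly between start and end. A minimal userspace sketch of that rule follows, under the assumption that the lower objects can be modelled as a plain pointer array. The function name count_fanout_violations is hypothetical, and it counts violations rather than printing them.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Counts branch slots that break the fan-out rule:
 *  - a non-NULL lower object outside [start, end], or
 *  - a NULL lower object inside [start, end], unless the upper object
 *    is a directory and the slot is strictly between start and end.
 */
int count_fanout_violations(void *const *lower, int nbranches,
			    int start, int end, int is_dir)
{
	int bad = 0;
	int i;

	assert(nbranches >= 0);
	for (i = 0; i < nbranches; i++) {
		int inside = (i >= start && i <= end);

		if (lower[i]) {
			if (!inside)
				bad++;
			continue;
		}
		/* from here on, lower[i] == NULL */
		if (!inside)
			continue;
		if (is_dir && i > start && i < end)
			continue;
		bad++;
	}
	return bad;
}
```

For example, with four branches where only slots 1 and 3 are populated and the range is 1..3, a directory passes the check, while a regular file is flagged once for the empty slot 2.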



* [PATCH 31/42] VFS: fs_stack header cleanups
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (29 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 30/42] Unionfs: debugging infrastructure Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10  2:42 ` [PATCH 32/42] Unionfs file system magic number Erez Zadok
                   ` (11 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 include/linux/fs_stack.h |   21 ++++++++++++++++-----
 1 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/fs_stack.h b/include/linux/fs_stack.h
index bb516ce..6b52faf 100644
--- a/include/linux/fs_stack.h
+++ b/include/linux/fs_stack.h
@@ -1,17 +1,28 @@
+/*
+ * Copyright (c) 2006-2007 Erez Zadok
+ * Copyright (c) 2006-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2006-2007 Stony Brook University
+ * Copyright (c) 2006-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
 #ifndef _LINUX_FS_STACK_H
 #define _LINUX_FS_STACK_H
 
-/* This file defines generic functions used primarily by stackable
+/*
+ * This file defines generic functions used primarily by stackable
  * filesystems; none of these functions require i_mutex to be held.
  */
 
 #include <linux/fs.h>
 
 /* externs for fs/stack.c */
-extern void fsstack_copy_attr_all(struct inode *dest, const struct inode *src,
-				int (*get_nlinks)(struct inode *));
-
-extern void fsstack_copy_inode_size(struct inode *dst, const struct inode *src);
+extern void fsstack_copy_attr_all(struct inode *dest, const struct inode *src);
+extern void fsstack_copy_inode_size(struct inode *dst,
+				    const struct inode *src);
 
 /* inlines */
 static inline void fsstack_copy_attr_atime(struct inode *dest,
-- 
1.5.2.2



* [PATCH 32/42] Unionfs file system magic number
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (30 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 31/42] VFS: fs_stack header cleanups Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10  2:42 ` [PATCH 33/42] MM: extern for drop_pagecache_sb Erez Zadok
                   ` (10 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 include/linux/magic.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/magic.h b/include/linux/magic.h
index 1fa0c2c..67043ed 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -35,6 +35,8 @@
 #define REISER2FS_SUPER_MAGIC_STRING	"ReIsEr2Fs"
 #define REISER2FS_JR_SUPER_MAGIC_STRING	"ReIsEr3Fs"
 
+#define UNIONFS_SUPER_MAGIC 0xf15f083d
+
 #define SMB_SUPER_MAGIC		0x517B
 #define USBDEVICE_SUPER_MAGIC	0x9fa2
 #define CGROUP_SUPER_MAGIC	0x27e0eb
-- 
1.5.2.2



* [PATCH 33/42] MM: extern for drop_pagecache_sb
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (31 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 32/42] Unionfs file system magic number Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10 14:04   ` Adrian Bunk
  2007-12-10  2:42 ` [PATCH 34/42] VFS path get/put ops used by Unionfs Erez Zadok
                   ` (9 subsequent siblings)
  42 siblings, 1 reply; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 include/linux/mm.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1b7b95c..fc61bd3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -19,6 +19,7 @@ struct anon_vma;
 struct file_ra_state;
 struct user_struct;
 struct writeback_control;
+struct super_block;
 
 #ifndef CONFIG_DISCONTIGMEM          /* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
@@ -1135,6 +1136,7 @@ int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *,
 					void __user *, size_t *, loff_t *);
 unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 			unsigned long lru_pages);
+extern void drop_pagecache_sb(struct super_block *);
 void drop_pagecache(void);
 void drop_slab(void);
 
-- 
1.5.2.2



* [PATCH 34/42] VFS path get/put ops used by Unionfs
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (32 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 33/42] MM: extern for drop_pagecache_sb Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10  2:42 ` [PATCH 35/42] Unionfs: common header file for user-land utilities and kernel Erez Zadok
                   ` (8 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Note: this will become obsolete once similar patches, now in -mm, make it to
mainline.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 include/linux/namei.h |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/include/linux/namei.h b/include/linux/namei.h
index 4cb4f8d..63f16d9 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -3,6 +3,7 @@
 
 #include <linux/dcache.h>
 #include <linux/linkage.h>
+#include <linux/mount.h>
 
 struct vfsmount;
 
@@ -100,4 +101,16 @@ static inline char *nd_get_link(struct nameidata *nd)
 	return nd->saved_names[nd->depth];
 }
 
+static inline void pathget(struct path *path)
+{
+	mntget(path->mnt);
+	dget(path->dentry);
+}
+
+static inline void pathput(struct path *path)
+{
+	dput(path->dentry);
+	mntput(path->mnt);
+}
+
 #endif /* _LINUX_NAMEI_H */
-- 
1.5.2.2



* [PATCH 35/42] Unionfs: common header file for user-land utilities and kernel
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (33 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 34/42] VFS path get/put ops used by Unionfs Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10  2:42 ` [PATCH 36/42] VFS: export drop_pagecache_sb Erez Zadok
                   ` (7 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 include/linux/union_fs.h |   24 ++++++++++++++++++++++++
 1 files changed, 24 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/union_fs.h

diff --git a/include/linux/union_fs.h b/include/linux/union_fs.h
new file mode 100644
index 0000000..d29318f
--- /dev/null
+++ b/include/linux/union_fs.h
@@ -0,0 +1,24 @@
+/*
+ * Copyright (c) 2003-2007 Erez Zadok
+ * Copyright (c) 2005-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2003-2007 Stony Brook University
+ * Copyright (c) 2003-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef _LINUX_UNION_FS_H
+#define _LINUX_UNION_FS_H
+
+#define UNIONFS_VERSION  "2.1-mm"
+
+/*
+ * DEFINITIONS FOR USER AND KERNEL CODE:
+ */
+# define UNIONFS_IOCTL_INCGEN		_IOR(0x15, 11, int)
+# define UNIONFS_IOCTL_QUERYFILE	_IOR(0x15, 15, int)
+
+#endif /* _LINUX_UNIONFS_H */
+
-- 
1.5.2.2



* [PATCH 36/42] VFS: export drop_pagecache_sb
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (34 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 35/42] Unionfs: common header file for user-land utilities and kernel Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-12  5:38   ` Nick Piggin
  2007-12-10  2:42 ` [PATCH 37/42] VFS: export release_open_intent symbol Erez Zadok
                   ` (6 subsequent siblings)
  42 siblings, 1 reply; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Needed to maintain cache coherency after branch management.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/drop_caches.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 59375ef..90410ac 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -3,6 +3,7 @@
  */
 
 #include <linux/kernel.h>
+#include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/fs.h>
 #include <linux/writeback.h>
@@ -12,7 +13,7 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb)
+void drop_pagecache_sb(struct super_block *sb)
 {
 	struct inode *inode;
 
@@ -24,6 +25,7 @@ static void drop_pagecache_sb(struct super_block *sb)
 	}
 	spin_unlock(&inode_lock);
 }
+EXPORT_SYMBOL(drop_pagecache_sb);
 
 void drop_pagecache(void)
 {
-- 
1.5.2.2



* [PATCH 37/42] VFS: export release_open_intent symbol
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (35 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 36/42] VFS: export drop_pagecache_sb Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10  2:42 ` [PATCH 38/42] VFS: simplified fsstack_copy_attr_all Erez Zadok
                   ` (5 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Needed to release the resources of the lower nameidata structures that we
create and pass to lower file systems (e.g., when calling vfs_create).

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/namei.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 3b993db..14f9861 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -389,6 +389,7 @@ void release_open_intent(struct nameidata *nd)
 	else
 		fput(nd->intent.open.file);
 }
+EXPORT_SYMBOL(release_open_intent);
 
 static inline struct dentry *
 do_revalidate(struct dentry *dentry, struct nameidata *nd)
-- 
1.5.2.2



* [PATCH 38/42] VFS: simplified fsstack_copy_attr_all
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (36 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 37/42] VFS: export release_open_intent symbol Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10  2:42 ` [PATCH 39/42] Put Unionfs and eCryptfs under one layered filesystems menu Erez Zadok
                   ` (4 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/stack.c |   30 +++++++++++++++++-------------
 1 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/stack.c b/fs/stack.c
index 67716f6..a548aac 100644
--- a/fs/stack.c
+++ b/fs/stack.c
@@ -1,8 +1,20 @@
+/*
+ * Copyright (c) 2006-2007 Erez Zadok
+ * Copyright (c) 2006-2007 Josef 'Jeff' Sipek
+ * Copyright (c) 2006-2007 Stony Brook University
+ * Copyright (c) 2006-2007 The Research Foundation of SUNY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
 #include <linux/module.h>
 #include <linux/fs.h>
 #include <linux/fs_stack.h>
 
-/* does _NOT_ require i_mutex to be held.
+/*
+ * does _NOT_ require i_mutex to be held.
  *
  * This function cannot be inlined since i_size_{read,write} is rather
  * heavy-weight on 32-bit systems
@@ -14,11 +26,11 @@ void fsstack_copy_inode_size(struct inode *dst, const struct inode *src)
 }
 EXPORT_SYMBOL_GPL(fsstack_copy_inode_size);
 
-/* copy all attributes; get_nlinks is optional way to override the i_nlink
+/*
+ * copy all attributes; get_nlinks is optional way to override the i_nlink
  * copying
  */
-void fsstack_copy_attr_all(struct inode *dest, const struct inode *src,
-				int (*get_nlinks)(struct inode *))
+void fsstack_copy_attr_all(struct inode *dest, const struct inode *src)
 {
 	dest->i_mode = src->i_mode;
 	dest->i_uid = src->i_uid;
@@ -29,14 +41,6 @@ void fsstack_copy_attr_all(struct inode *dest, const struct inode *src,
 	dest->i_ctime = src->i_ctime;
 	dest->i_blkbits = src->i_blkbits;
 	dest->i_flags = src->i_flags;
-
-	/*
-	 * Update the nlinks AFTER updating the above fields, because the
-	 * get_links callback may depend on them.
-	 */
-	if (!get_nlinks)
-		dest->i_nlink = src->i_nlink;
-	else
-		dest->i_nlink = (*get_nlinks)(dest);
+	dest->i_nlink = src->i_nlink;
 }
 EXPORT_SYMBOL_GPL(fsstack_copy_attr_all);
-- 
1.5.2.2



* [PATCH 39/42] Put Unionfs and eCryptfs under one layered filesystems menu
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (37 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 38/42] VFS: simplified fsstack_copy_attr_all Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10  2:42 ` [PATCH 40/42] eCryptfs: use simplified fs_stack API for dentry operations Erez Zadok
                   ` (3 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/Kconfig |   53 +++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 41 insertions(+), 12 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 635f3e2..cbcbbee 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1041,6 +1041,47 @@ config CONFIGFS_FS
 
 endmenu
 
+menu "Layered filesystems"
+
+config ECRYPT_FS
+	tristate "eCrypt filesystem layer support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL && KEYS && CRYPTO && NET
+	help
+	  Encrypted filesystem that operates on the VFS layer.  See
+	  <file:Documentation/filesystems/ecryptfs.txt> to learn more about
+	  eCryptfs.  Userspace components are required and can be
+	  obtained from <http://ecryptfs.sf.net>.
+
+	  To compile this file system support as a module, choose M here: the
+	  module will be called ecryptfs.
+
+config UNION_FS
+	tristate "Union file system (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	help
+	  Unionfs is a stackable unification file system, which appears to
+	  merge the contents of several directories (branches), while keeping
+	  their physical content separate.
+
+	  See <http://unionfs.filesystems.org> for details
+
+config UNION_FS_XATTR
+	bool "Unionfs extended attributes"
+	depends on UNION_FS
+	help
+	  Extended attributes are name:value pairs associated with inodes by
+	  the kernel or by users (see the attr(5) manual page).
+
+	  If unsure, say N.
+
+config UNION_FS_DEBUG
+	bool "Debug Unionfs"
+	depends on UNION_FS
+	help
+	  If you say Y here, you can turn on debugging output from Unionfs.
+
+endmenu
+
 menu "Miscellaneous filesystems"
 
 config ADFS_FS
@@ -1093,18 +1134,6 @@ config AFFS_FS
 	  To compile this file system support as a module, choose M here: the
 	  module will be called affs.  If unsure, say N.
 
-config ECRYPT_FS
-	tristate "eCrypt filesystem layer support (EXPERIMENTAL)"
-	depends on EXPERIMENTAL && KEYS && CRYPTO && NET
-	help
-	  Encrypted filesystem that operates on the VFS layer.  See
-	  <file:Documentation/filesystems/ecryptfs.txt> to learn more about
-	  eCryptfs.  Userspace components are required and can be
-	  obtained from <http://ecryptfs.sf.net>.
-
-	  To compile this file system support as a module, choose M here: the
-	  module will be called ecryptfs.
-
 config HFS_FS
 	tristate "Apple Macintosh file system support (EXPERIMENTAL)"
 	depends on BLOCK && EXPERIMENTAL
-- 
1.5.2.2



* [PATCH 40/42] eCryptfs: use simplified fs_stack API for dentry operations
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (38 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 39/42] Put Unionfs and eCryptfs under one layered filesystems menu Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10  2:42 ` [PATCH 41/42] eCryptfs: use simplified fs_stack API for inode operations Erez Zadok
                   ` (2 subsequent siblings)
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok, Mike Halcrow

CC: Mike Halcrow <mhalcrow@us.ibm.com>

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/ecryptfs/dentry.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ecryptfs/dentry.c b/fs/ecryptfs/dentry.c
index cb20b96..a8c1686 100644
--- a/fs/ecryptfs/dentry.c
+++ b/fs/ecryptfs/dentry.c
@@ -62,7 +62,7 @@ static int ecryptfs_d_revalidate(struct dentry *dentry, struct nameidata *nd)
 		struct inode *lower_inode =
 			ecryptfs_inode_to_lower(dentry->d_inode);
 
-		fsstack_copy_attr_all(dentry->d_inode, lower_inode, NULL);
+		fsstack_copy_attr_all(dentry->d_inode, lower_inode);
 	}
 out:
 	return rc;
-- 
1.5.2.2



* [PATCH 41/42] eCryptfs: use simplified fs_stack API for inode operations
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (39 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 40/42] eCryptfs: use simplified fs_stack API for dentry operations Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10  2:42 ` [PATCH 42/42] eCryptfs: use simplified fs_stack API for main operations Erez Zadok
  2007-12-10  3:20 ` [UNIONFS] 00/42 Unionfs and related patches review hooanon05
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok, Mike Halcrow

CC: Mike Halcrow <mhalcrow@us.ibm.com>

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/ecryptfs/inode.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 0b1ab01..a846557 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -588,9 +588,9 @@ ecryptfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 			lower_new_dir_dentry->d_inode, lower_new_dentry);
 	if (rc)
 		goto out_lock;
-	fsstack_copy_attr_all(new_dir, lower_new_dir_dentry->d_inode, NULL);
+	fsstack_copy_attr_all(new_dir, lower_new_dir_dentry->d_inode);
 	if (new_dir != old_dir)
-		fsstack_copy_attr_all(old_dir, lower_old_dir_dentry->d_inode, NULL);
+		fsstack_copy_attr_all(old_dir, lower_old_dir_dentry->d_inode);
 out_lock:
 	unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
 	dput(lower_new_dentry->d_parent);
@@ -924,7 +924,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
 
 	rc = notify_change(lower_dentry, ia);
 out:
-	fsstack_copy_attr_all(inode, lower_inode, NULL);
+	fsstack_copy_attr_all(inode, lower_inode);
 	return rc;
 }
 
-- 
1.5.2.2



* [PATCH 42/42] eCryptfs: use simplified fs_stack API for main operations
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (40 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 41/42] eCryptfs: use simplified fs_stack API for inode operations Erez Zadok
@ 2007-12-10  2:42 ` Erez Zadok
  2007-12-10  3:20 ` [UNIONFS] 00/42 Unionfs and related patches review hooanon05
  42 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-10  2:42 UTC (permalink / raw)
  To: hch, viro, akpm; +Cc: linux-kernel, linux-fsdevel, Erez Zadok, Mike Halcrow

CC: Mike Halcrow <mhalcrow@us.ibm.com>

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
---
 fs/ecryptfs/main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index b83a512..cebe7dc 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -208,7 +208,7 @@ int ecryptfs_interpose(struct dentry *lower_dentry, struct dentry *dentry,
 		d_add(dentry, inode);
 	else
 		d_instantiate(dentry, inode);
-	fsstack_copy_attr_all(inode, lower_inode, NULL);
+	fsstack_copy_attr_all(inode, lower_inode);
 	/* This size will be overwritten for real files w/ headers and
 	 * other metadata */
 	fsstack_copy_inode_size(inode, lower_inode);
-- 
1.5.2.2



* Re: [UNIONFS] 00/42 Unionfs and related patches review
  2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
                   ` (41 preceding siblings ...)
  2007-12-10  2:42 ` [PATCH 42/42] eCryptfs: use simplified fs_stack API for main operations Erez Zadok
@ 2007-12-10  3:20 ` hooanon05
  2007-12-13 15:29   ` Erez Zadok
  42 siblings, 1 reply; 54+ messages in thread
From: hooanon05 @ 2007-12-10  3:20 UTC (permalink / raw)
  To: Erez Zadok; +Cc: hch, viro, akpm, linux-kernel, linux-fsdevel


Erez Zadok:
> (1) Cache coherency: by far, the biggest concern had been around cache
	:::
> unionfs.  The solution we have implemented is to compare the mtime/ctime of
> upper/lower objects during revalidation (esp. of dentries); and if the lower
> times are newer, we reconstruct the union object (drop the older objects,
> and re-lookup them).  This time-based cache-coherency works well and is
	:::

The resolution of mtime/ctime may be too low, since some filesystems set
them in units of one second, which means you cannot detect changes made
within a second.
I think it is better to use inotify for every directory, although it
consumes a little more resources.
Additionally, if you implement vm_operations instead of
struggling with address_space_operations or VFS patches, in order to
share the mmap-ed memory pages between the lower inode and the unionfs inode,
then most of the issues will be gone.
You can see this approach and how it works at http://aufs.sf.net
(and get the source files from CVS).

But I am afraid the approach of sharing memory pages will not be available
for ecryptfs.


Junjiro Okajima


* Re: [PATCH 33/42] MM: extern for drop_pagecache_sb
  2007-12-10  2:42 ` [PATCH 33/42] MM: extern for drop_pagecache_sb Erez Zadok
@ 2007-12-10 14:04   ` Adrian Bunk
  2007-12-13 15:05     ` Erez Zadok
  0 siblings, 1 reply; 54+ messages in thread
From: Adrian Bunk @ 2007-12-10 14:04 UTC (permalink / raw)
  To: Erez Zadok; +Cc: hch, viro, akpm, linux-kernel, linux-fsdevel

On Sun, Dec 09, 2007 at 09:42:06PM -0500, Erez Zadok wrote:
> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
> ---
>  include/linux/mm.h |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1b7b95c..fc61bd3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
>...
> @@ -1135,6 +1136,7 @@ int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *,
>  					void __user *, size_t *, loff_t *);
>  unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
>  			unsigned long lru_pages);
> +extern void drop_pagecache_sb(struct super_block *);
>  void drop_pagecache(void);
>  void drop_slab(void);

This breaks compilation (and therefore bisecting) until the function 
becomes non-static in patch #36.

Please merge patches #33 and #36 into one patch.

That's not terribly important now that you've sent the patches for 
review, but keep in mind that the tree should compile at all stages for 
making bisecting possible when preparing the final submission.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: [PATCH 01/42] Unionfs: filesystems documentation index
  2007-12-10  2:41 ` [PATCH 01/42] Unionfs: filesystems documentation index Erez Zadok
@ 2007-12-10 14:47   ` Jan Engelhardt
  2007-12-13 15:06     ` Erez Zadok
  0 siblings, 1 reply; 54+ messages in thread
From: Jan Engelhardt @ 2007-12-10 14:47 UTC (permalink / raw)
  To: Erez Zadok; +Cc: hch, viro, akpm, linux-kernel, linux-fsdevel


On Dec 9 2007 21:41, Erez Zadok wrote:
>
>Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
>---
> Documentation/filesystems/00-INDEX |    2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>

This is not CVS, you are allowed to have multiple files in a changeset :-)
Perhaps fold 1-7.


* Re: [PATCH 36/42] VFS: export drop_pagecache_sb
  2007-12-10  2:42 ` [PATCH 36/42] VFS: export drop_pagecache_sb Erez Zadok
@ 2007-12-12  5:38   ` Nick Piggin
  2007-12-13 15:24     ` Erez Zadok
  0 siblings, 1 reply; 54+ messages in thread
From: Nick Piggin @ 2007-12-12  5:38 UTC (permalink / raw)
  To: Erez Zadok; +Cc: hch, viro, akpm, linux-kernel, linux-fsdevel

On Monday 10 December 2007 13:42, Erez Zadok wrote:
> Needed to maintain cache coherency after branch management.
>

Hmm, I'd much prefer to be able to sleep in invalidate_mapping_pages
before this function gets exported.

As it is, it can cause massive latencies on preemption and the inode_lock
so it is pretty much debug-only IMO. I'd rather it didn't escape into the
wild as is.

Either that or rework your cache coherency somehow.


> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
> ---
>  fs/drop_caches.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index 59375ef..90410ac 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -3,6 +3,7 @@
>   */
>
>  #include <linux/kernel.h>
> +#include <linux/module.h>
>  #include <linux/mm.h>
>  #include <linux/fs.h>
>  #include <linux/writeback.h>
> @@ -12,7 +13,7 @@
>  /* A global variable is a bit ugly, but it keeps the code simple */
>  int sysctl_drop_caches;
>
> -static void drop_pagecache_sb(struct super_block *sb)
> +void drop_pagecache_sb(struct super_block *sb)
>  {
>  	struct inode *inode;
>
> @@ -24,6 +25,7 @@ static void drop_pagecache_sb(struct super_block *sb)
>  	}
>  	spin_unlock(&inode_lock);
>  }
> +EXPORT_SYMBOL(drop_pagecache_sb);
>
>  void drop_pagecache(void)
>  {


* Re: [PATCH 33/42] MM: extern for drop_pagecache_sb
  2007-12-10 14:04   ` Adrian Bunk
@ 2007-12-13 15:05     ` Erez Zadok
  0 siblings, 0 replies; 54+ messages in thread
From: Erez Zadok @ 2007-12-13 15:05 UTC (permalink / raw)
  To: Adrian Bunk; +Cc: Erez Zadok, hch, viro, akpm, linux-kernel, linux-fsdevel

In message <20071210140426.GL20441@stusta.de>, Adrian Bunk writes:
> On Sun, Dec 09, 2007 at 09:42:06PM -0500, Erez Zadok wrote:
> > Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
> > ---
> >  include/linux/mm.h |    2 ++
> >  1 files changed, 2 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 1b7b95c..fc61bd3 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> >...
> > @@ -1135,6 +1136,7 @@ int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *,
> >  					void __user *, size_t *, loff_t *);
> >  unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
> >  			unsigned long lru_pages);
> > +extern void drop_pagecache_sb(struct super_block *);
> >  void drop_pagecache(void);
> >  void drop_slab(void);
> 
> This breaks compilation (and therefore bisecting) until the function 
> becomes non-static in patch #36.
> 
> Please merge patches #33 and #36 into one patch.
> 
> That's not terribly important now that you've sent the patches for 
> review, but keep in mind that the tree should compile at all stages for 
> making bisecting possible when preparing the final submission.
> 
> cu
> Adrian

Adrian, having used git-bisect successfully before, I agree with you.  I've
reorganized my patches now, and merged a few small ones, so they all build
incrementally.

Cheers,
Erez.


* Re: [PATCH 01/42] Unionfs: filesystems documentation index
  2007-12-10 14:47   ` Jan Engelhardt
@ 2007-12-13 15:06     ` Erez Zadok
  2007-12-13 21:25       ` Jan Engelhardt
  0 siblings, 1 reply; 54+ messages in thread
From: Erez Zadok @ 2007-12-13 15:06 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Erez Zadok, hch, viro, akpm, linux-kernel, linux-fsdevel

In message <Pine.LNX.4.64.0712101546460.25446@fbirervta.pbzchgretzou.qr>, Jan Engelhardt writes:
> 
> On Dec 9 2007 21:41, Erez Zadok wrote:
> >
> >Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
> >---
> > Documentation/filesystems/00-INDEX |    2 ++
> > 1 files changed, 2 insertions(+), 0 deletions(-)
> >
> 
> This is not CVS, you are allowed to have multiple files in a changeset :-)
> Perhaps fold 1-7.

Agreed.  I've merged the documentation patches together, as well as several other
related ones (esp. small ones); this also helps for bisecting.

Erez.


* Re: [PATCH 36/42] VFS: export drop_pagecache_sb
  2007-12-12  5:38   ` Nick Piggin
@ 2007-12-13 15:24     ` Erez Zadok
  2007-12-13 22:47       ` Nick Piggin
  0 siblings, 1 reply; 54+ messages in thread
From: Erez Zadok @ 2007-12-13 15:24 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Erez Zadok, hch, viro, akpm, linux-kernel, linux-fsdevel

In message <200712121638.35167.nickpiggin@yahoo.com.au>, Nick Piggin writes:
> On Monday 10 December 2007 13:42, Erez Zadok wrote:
> > Needed to maintain cache coherency after branch management.
> >
> 
> Hmm, I'd much prefer to be able to sleep in invalidate_mapping_pages
> before this function gets exported.
> 
> As it is, it can cause massive latencies on preemption and the inode_lock
> so it is pretty much debug-only IMO. I'd rather it didn't escape into the
> wild as is.
> 
> Either that or rework your cache coherency somehow.

Nick, thanks for the advice.

We use a generation number after each successful branch configuration
command, so that ->d_revalidate later on can discover that change, and
rebuild the union of objects.  At ->remount time, I figured it'd be nice to
"encourage" that revalidation to happen sooner, by invalidating as many
upper pages as possible, thus causing ->d_revalidate/->readpage to take
place sooner.  So we used to call drop_pagecache_sb from our remount code:
it was the only caller of drop_pagecache_sb.  It wasn't too much of a
latency issue to call drop_pagecache_sb there: the VFS remount code path is
already pretty slow (dropping temporarily to readonly mode, and dropping
other caches), and remount isn't an operation used often, so a little bit
more latency would probably not have been noticed by users.
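
To make the generation-number idea above concrete, here is a minimal
userspace sketch.  It is not the Unionfs code: the struct and function names
(sketch_sb, sketch_dentry, sketch_revalidate, and so on) are hypothetical
stand-ins, and the actual re-lookup of lower objects is elided.  It only
shows the bookkeeping: a branch change bumps a counter, and revalidation
rebuilds any object whose recorded counter no longer matches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-superblock state. */
struct sketch_sb {
	unsigned int generation;	/* bumped after each successful branch change */
};

/* Hypothetical stand-in for per-dentry state. */
struct sketch_dentry {
	unsigned int generation;	/* generation seen when the union was last built */
	int rebuilds;			/* how many times this object was rebuilt */
};

/* Called after a successful branch-configuration command (e.g., at remount). */
static void sketch_branch_config_changed(struct sketch_sb *sb)
{
	assert(sb);
	sb->generation++;
}

/*
 * Revalidation: returns true if the cached object was still current,
 * false if it was stale and had to be rebuilt.
 */
static bool sketch_revalidate(const struct sketch_sb *sb,
			      struct sketch_dentry *d)
{
	assert(sb && d);
	if (d->generation == sb->generation)
		return true;
	/* Stale: re-lookup of lower objects would go here (elided). */
	d->rebuilds++;
	d->generation = sb->generation;
	return false;
}
```

In this sketch nothing needs to be invalidated eagerly: an object that is
never touched again is never rebuilt, and one that is touched gets rebuilt
on its next revalidation, which is why dropping the page cache at remount
would only be an optimization.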

Nevertheless, it was not strictly necessary to call drop_pagecache_sb in
unionfs_remount, because the objects in question will have gotten
revalidated sooner or later anyway; the call to drop_pagecache_sb was just
an optimization (one which I wasn't 100% sure about anyway, as per my long
"XXX" comment above that call in unionfs_remount).

So I agree with you: if this symbol can be abused by modules and cause
problems, then exporting it to modules is too risky.  I've reworked my code
to avoid calling drop_pagecache_sb and I'll drop that patch.

Cheers,
Erez.


* Re: [UNIONFS] 00/42 Unionfs and related patches review
  2007-12-10  3:20 ` [UNIONFS] 00/42 Unionfs and related patches review hooanon05
@ 2007-12-13 15:29   ` Erez Zadok
  2007-12-14 21:15     ` hooanon05
  0 siblings, 1 reply; 54+ messages in thread
From: Erez Zadok @ 2007-12-13 15:29 UTC (permalink / raw)
  To: hooanon05; +Cc: Erez Zadok, hch, viro, akpm, linux-kernel, linux-fsdevel


Dear Junjiro, thanks for your comments.  I am familiar with the issues and
techniques you mention below, and more.  To properly address your comments,
I had to explain some background before addressing each of your points.  So
my apologies in advance to everyone for the length of this reply.

When stackable file systems were first proposed in the early 90s, they were
nothing more than a research curiosity.  In 1994 I first came across them
and saw their premise as easing file system development.  In order to help
as many people actually use stackable file systems, I developed a set of
stackable f/s templates for several OSs (linux, bsd, and solaris), and
provided several working examples of how to use them (e.g., the older
"cryptfs").  One of my key criteria then was to make it as easy as possible
for people to use stackable file systems.  Therefore I avoided VFS changes
at all costs, even if it meant I had to copy-n-paste VFS code into the
stackable template, and hack it there, among other workarounds.  This was
difficult to do but necessary to promote greater use of stacking.  There
were many instances I wished I could have changed the VFS to better suit my
stacking needs (to be fair, several small patches made it into the 2.3.x
kernel thanks to Al).

Nearly 14 years later, I still maintain those "FiST" stackable templates,
but things are very different nowadays.  There's a greater awareness of
stackable file systems, Linux has one in mainline, and lots of others are in
use in commercial and research settings.  Therefore, this is a good time to
seriously consider VFS-level changes that can help to better support all
types of stackable file systems.

I believe that small VFS changes to help stackable file systems are
perfectly reasonable, and a good thing; and I'm working on such patches.
Conversely, I am very mindful of the VFS's complexity, so I also believe
that massive VFS changes are a bad thing; I want to avoid bloating the VFS
or de-stabilizing it just for the sake of stacking or any single stackable
f/s.  I am also concerned about not changing existing "lower" file systems
whatsoever, because they are well tested and stable.


In message <E1J1ZBs-0006M0-VL@jroun>, hooanon05@yahoo.co.jp writes:
> 
> Erez Zadok:
> > (1) Cache coherency: by far, the biggest concern had been around cache
> 	:::
> > unionfs.  The solution we have implemented is to compare the mtime/ctime of
> > upper/lower objects during revalidation (esp. of dentries); and if the lower
> > times are newer, we reconstruct the union object (drop the older objects,
> > and re-lookup them).  This time-based cache-coherency works well and is
> 	:::
> 
> The resolution of mtime/ctime may be too low since some filesystems set
> them in units of a second, which means you cannot detect the changes made
> within a second.
> I think it is better to use inotify for every directory while it
> consumes a little more resources.

Let's analyze the chance of this problem happening.  First, the vast
majority of time, users access files via the union.  Of the small fraction
of instances where users perform some operations on the lower branches
directly, the most common usage scenario is a large burst of activity
(people like to untar a package, copy a whole dir, etc.); such a large burst
of activity is very likely to cross the 1-second boundary or to at least
cause a change in the mtime/ctime.  Moreover, several file systems already
support sub-second granularity (so users have several options to choose
from).  So in my opinion, the chances are very slim that a large amount of
data changes will happen on a lower inode all within one second and not be
detected by our mtime/ctime cache-coherency algorithms.

The 1-second granularity is a well known problem, which has bitten "make"
more and more over time, as CPUs and disks have been getting faster.
Consequently, there's been a push from many file system developers to
support higher-resolution timestamps.  This is happening gradually, and as
it happens, unionfs will automatically benefit from it.
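
To make the granularity argument concrete, here is a rough user-space
sketch (not the unionfs code; the helper names and the idea of passing the
granularity in nanoseconds are assumptions for illustration only).  It
rounds timestamps down the way a coarse-grained lower file system would
store them, then applies the "is the lower time newer?" test:

```c
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper: round a timestamp down to a file system's
 * granularity, given in nanoseconds (1000000000 means one second,
 * 1 means full nanosecond resolution).
 */
static struct timespec truncate_to_granularity(struct timespec ts, long gran_ns)
{
	if (gran_ns >= 1000000000L)
		ts.tv_nsec = 0;
	else if (gran_ns > 1)
		ts.tv_nsec -= ts.tv_nsec % gran_ns;
	return ts;
}

/*
 * True if the time stored on the lower object is newer than the time
 * the upper layer cached when it last built its object.
 */
static bool lower_is_newer(struct timespec cached, struct timespec lower)
{
	if (lower.tv_sec != cached.tv_sec)
		return lower.tv_sec > cached.tv_sec;
	return lower.tv_nsec > cached.tv_nsec;
}
```

With a one-second granularity, two modifications that fall in the same
second truncate to the same value, so the second one goes unnoticed; with
nanosecond granularity, or once the activity crosses a second boundary,
the comparison fires.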

Also, time-based cache coherency is a time-honored technique in NFS.
Users have gotten used to the fact that if they change something on the
server (i.e., the "layer" below the client), those changes may not be
immediately visible on the client (esp. with heavy caching on the client).
So if it's been good enough for NFS for over two decades, I don't see a
compelling reason to complicate unionfs for a slim chance of detecting
changes that occur within one second.

Now, you propose inotify.  Inotify is a neat system, but I don't believe
it was designed with stackable file system inter-layer cache coherency in
mind.  It takes a fair amount of effort (e.g., amount of code) to register,
monitor, and de-register for notifications.  And I'm also not sure how well
it scales when one has to register for notifications on many thousands of
objects.  It seems to me that inotify is more suitable for user-level apps
which want to monitor a small number of objects.  Given the overheads and
complexity associated with using inotify, and the slim chance that
mtime/ctime won't be enough, I don't feel at this stage that the benefits of
using inotify are worth the complexity.

Right now my code to detect when a lower object has changed is very simple
and short: just one "if" statement to compare the corresponding inode
mtimes.  I'd like to keep things simple if at all possible.  Fundamentally,
all I need is ONE simple bit of information that will tell me that the upper
and lower inodes are no longer in sync.  Just one bit, not a whole complex
data structure with callbacks and bit-maps and such.

We're exploring a few ideas to see how to implement this cleanly and simply
(some of which are based on our presentation at OLS'07).  One way we're
trying is to set a "stale" bit in the upper inode when the lower one
changes; another is to set a "dirty" bit in the lower inode that the upper
can check; another method we're testing is setting a "modification
generation counter" in the stacked inode that can be compared by a stacked
f/s.  As I've not looked at inotify recently, I'll also take a fresh look
at it and see if something lightweight and generic can be made available for
all stackable file systems.
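
As a rough illustration of the generation-counter variant, here is a
minimal user-space sketch.  The struct and function names are hypothetical
stand-ins, not kernel structures, and the sketch ignores locking and
atomicity entirely:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a lower inode. */
struct lower_obj {
	uint64_t generation;	/* bumped on every modification */
};

/* Hypothetical stand-in for the stacked (upper) inode. */
struct upper_obj {
	struct lower_obj *lower;
	uint64_t seen_generation;	/* lower generation at last rebuild */
};

/* Called whenever the lower object changes, by whatever path. */
static void lower_modified(struct lower_obj *lo)
{
	lo->generation++;
}

/* The single comparison the upper layer needs at revalidation time. */
static bool upper_is_stale(const struct upper_obj *up)
{
	return up->seen_generation != up->lower->generation;
}

/* After rebuilding the upper object, remember what was seen. */
static void upper_resync(struct upper_obj *up)
{
	up->seen_generation = up->lower->generation;
}
```

Unlike a timestamp, the counter does not depend on clock granularity: two
changes in the same second still leave the counters unequal.  The open
question is who bumps the counter in the lower layer.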

> Additionally, if you implement vm_operations instead of
> struggling along address_space_operations or VFS patches, in order to
> share the mmap-ed memory pages between lower inode and unionfs inode,
> then most of issues will be gone.

What you propose violates the clean layer separation in a way that I'm not
too comfortable with (even if it works for you :-) I believe stackable file
system layers should exhibit as clear a separation of layers as possible
(esp. once people begin stacking more than two layers together).  This layer
separation tends to be the safer approach and also makes it easier to
maintain and debug multiple layers.

Over the years we've explored various layering methodologies to try and
trade off code size/complexity vs. performance.  In a normal stackable file
system, each struct file/dentry/inode has a corresponding lower object.  Our
FiST templates for Linux include an extra mode---called "fist lite"---which
saves on inodes and pages by having BOTH the upper and lower dentry point to
the lower inode.  This saves memory (shared pages) and reduces layering
overhead (but you can't intercept mmap ops, which some stackable f/s like to
do).  The cost of such a trick is violating the clean layering separation: a
dentry of the upper file system now points to an inode (via dentry->d_inode)
of the lower file system!  To me, this is dangerous in the long run because
objects from one layer can be "leaked" to another layer, with potentially
disastrous results.

What you propose above with vm_operations requires a sequence of operations
in the vm->fault operation which looks like:

	saved_file = vma->vm_file;
	vma->vm_file = hidden_file;
	call the lower ->fault op passing it the modified vma
	vma->vm_file = saved_file;

We've done something different to achieve a similar goal, in an experimental
version of wrapfs (our "pass-through" layer in FiST).  In the address_space
ops, we wanted to share the upper/lower pages, but since the
page->mapping->host points to our inode, we temporarily overrode it:

	saved_inode = page->mapping->host;
	page->mapping->host = lower_inode;
	call lower f/s ->writepage with upper "fixed up" page
	page->mapping->host = saved_inode;

Even if both of these techniques work (and they do, at least in limited
testing I've done), there is something very unpleasant about having to
temporarily override a field's value, then fix it again, after coming back
from calling the lower op.  Aside from the uncleanliness of this kind of
technique, it can seriously lead to races and other data corruptions when/if
the temporarily-fixed fields "leak" outside the current code.  (I have a
strong feeling that several kernel developers might vomit in disgust if I
dared to submit such hacky patches to unionfs... :-)
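
A toy user-space sketch of why the override is fragile (the types and
function names below are hypothetical stand-ins, not the kernel's): while
the lower op runs, anything that looks at the shared field sees the lower
owner rather than the real one.

```c
#include <stddef.h>

/* Hypothetical stand-ins for an inode and an address_space. */
struct inode_like { int id; };
struct mapping_like { struct inode_like *host; };

/* Records which owner an observer saw while the lower op was running. */
static struct inode_like *observed_host;

/* Stand-in for the lower file system's operation. */
static void lower_writepage(struct mapping_like *m)
{
	/*
	 * Any code reached from here, or any concurrent reader of the
	 * same mapping, sees the temporarily overridden owner.
	 */
	observed_host = m->host;
}

/* The save / override / call / restore sequence. */
static void call_lower_with_override(struct mapping_like *m,
				     struct inode_like *lower)
{
	struct inode_like *saved = m->host;

	m->host = lower;	/* window opens */
	lower_writepage(m);
	m->host = saved;	/* window closes */
}
```

The field is correct again afterwards, but during the window it was wrong
for every other observer, and nothing in the sequence itself prevents an
observer from holding on to the wrong value.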

To me, any time such a hack has to be employed, it tells me that there's
something wrong with the API in question.  And so I'd much rather see the
API fixed The Right Way[tm], than promote such potentially unsafe practices.
Consequently, in unionfs I take a safer, more conservative approach and
avoid any such cute but potentially dangerous hacks.

> But I am afraid the approach of sharing memory pages will not be available
> for ecryptfs.

Agreed.  Although my focus in recent times has been on unionfs, I'm also
looking ahead to a day when linux might have several stackable file systems.
As such, I'd like to consider developing and contributing VFS techniques
that will work for many/all stackable file systems, in a clean manner that
the VFS+MM maintainers will like.

> Junjiro Okajima

Sincerely,
Erez.

* Re: [PATCH 01/42] Unionfs: filesystems documentation index
  2007-12-13 15:06     ` Erez Zadok
@ 2007-12-13 21:25       ` Jan Engelhardt
  0 siblings, 0 replies; 54+ messages in thread
From: Jan Engelhardt @ 2007-12-13 21:25 UTC (permalink / raw)
  To: Erez Zadok; +Cc: hch, viro, akpm, linux-kernel, linux-fsdevel


On Dec 13 2007 10:06, Erez Zadok wrote:
>In message <Pine.LNX.4.64.0712101546460.25446@fbirervta.pbzchgretzou.qr>, Jan Engelhardt writes:
>> 
>> On Dec 9 2007 21:41, Erez Zadok wrote:
>> >
>> >Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
>> >---
>> > Documentation/filesystems/00-INDEX |    2 ++
>> > 1 files changed, 2 insertions(+), 0 deletions(-)
>> >
>> 
>> This is not CVS, you are allowed to have multiple files in a changeset
>> Perhaps fold 1-7.
>
>Agreed.  I've merged the documentation patches together, as well as several other
>related ones (esp. small ones); this also helps for bisecting.
>
Speaking of bisect, there is not really anything to bisect in documentation
since it is not compiled :)

* Re: [PATCH 36/42] VFS: export drop_pagecache_sb
  2007-12-13 15:24     ` Erez Zadok
@ 2007-12-13 22:47       ` Nick Piggin
  0 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2007-12-13 22:47 UTC (permalink / raw)
  To: Erez Zadok; +Cc: hch, viro, akpm, linux-kernel, linux-fsdevel

On Friday 14 December 2007 02:24, Erez Zadok wrote:
> In message <200712121638.35167.nickpiggin@yahoo.com.au>, Nick Piggin writes:
> > On Monday 10 December 2007 13:42, Erez Zadok wrote:
> > > Needed to maintain cache coherency after branch management.
> >
> > Hmm, I'd much prefer to be able to sleep in invalidate_mapping_pages
> > before this function gets exported.
> >
> > As it is, it can cause massive latencies on preemption and the inode_lock
> > so it is pretty much debug-only IMO. I'd rather it didn't escape into the
> > wild as is.
> >
> > Either that or rework your cache coherency somehow.
>
> Nick, thanks for the advice.
>
> We use a generation number after each successful branch configuration
> command, so that ->d_revalidate later on can discover that change, and
> rebuild the union of objects.  At ->remount time, I figured it'd be nice to
> "encourage" that revalidation to happen sooner, by invalidating as many
> upper pages as possible, thus causing ->d_revalidate/->readpage to take
> place sooner.  So we used to call drop_pagecache_sb from our remount code:
> it was the only caller of drop_pagecache_sb.  It wasn't too much of a
> latency issue to call drop_pagecache_sb there: the VFS remount code path is
> already pretty slow (dropping temporarily to readonly mode, and dropping
> other caches), and remount isn't an operation used often, so a little bit
> more latency would probably not have been noticed by users.

Well, a large, infrequent spike is the most damaging to latency-sensitive
users. And anyway, I guess the infrequency of remount means it doesn't
have to be really efficient with invalidating pagecache either.


> Nevertheless, it was not strictly necessary to call drop_pagecache_sb in
> unionfs_remount, because the objects in question will have gotten
> revalidated sooner or later anyway; the call to drop_pagecache_sb was just
> an optimization (one which I wasn't 100% sure about anyway, as per my long
> "XXX" comment above that call in unionfs_remount).
>
> So I agree with you: if this symbol can be abused by modules and cause
> problems, then exporting it to modules is too risky.  I've reworked my code
> to avoid calling drop_pagecache_sb and I'll drop that patch.

Thanks, I'd be much happier with that.

* Re: [UNIONFS] 00/42 Unionfs and related patches review
  2007-12-13 15:29   ` Erez Zadok
@ 2007-12-14 21:15     ` hooanon05
  0 siblings, 0 replies; 54+ messages in thread
From: hooanon05 @ 2007-12-14 21:15 UTC (permalink / raw)
  To: Erez Zadok; +Cc: hch, viro, akpm, linux-kernel, linux-fsdevel


Hello Professor Zadok,

Erez Zadok:
> I believe that small VFS changes to help stackable file systems are
> perfectly reasonable, and a good thing; and I'm working on such patches.
> Conversely, I am very mindful of the VFS's complexity, so I also believe
> that massive VFS changes are a bad thing; I want to avoid bloating the VFS
> or de-stabilizing it just for the sake of stacking or any single stackable
> f/s.  I am also concerned about not changing existing "lower" file systems
> whatsoever, because they are well tested and stable.

I have no objection to your opinion about massive VFS changes or
existing "lower" filesystems.


> from).  So in my opinion, the chances are very slim that a large amount of
> data changes will happen on a lower inode all within one second and not be
> detected by our mtime/ctime cache-coherency algorithms.

I agree that time-based checking works in many cases.
But there will be some operations which complete within one
second, and it may not work when a user changes the clock/time
of his system.


> Also, time-based cache coherency is a time-honored technique in NFS.
> Users have gotten used to the fact that if they change something on the
> server (i.e., the "layer" below the client), those changes may not be
> immediately visible on the client (esp. with heavy caching on the client).
> So if it's been good enough for NFS for over two decades, I don't see a
> compelling reason to complicate unionfs for a slim chance of detecting
> changes that occur within one second.

Since NFS is a remote filesystem, I don't think it is a good idea to
compare its behaviour with that of a stackable filesystem, since all
lower (branch) filesystems can be local filesystems.


> Right now my code to detect when a lower object has changed is very simple
> and short: just one "if" statement to compare the corresponding inode
> mtimes.  I'd like to keep things simple if at all possible.  Fundamentally,
> all I need is ONE simple bit of information that will tell me that the upper
> and lower inodes are no longer in sync.  Just one bit, not a whole complex
> data structure with callbacks and bit-maps and such.

Agreed, so the inotify handler should set a flag or atomic_inc/dec
the internal generation, or enqueue such a job and handle it
later (shortly). Of course, when the dentry/inode of the stackable
filesystem corresponding to the modified file is not cached, the handler
has nothing to do.
Additionally, only directories need an inotify watch for monitoring,
instead of all files. The inotify handler for a directory receives all
necessary (for a stackable fs) events for its children.
(but there are a few limitations or exceptions)


> What you propose violates the clean layer separation in a way that I'm not
> too comfortable with (even if it works for you :-) I believe stackable file
	:::
> system, each struct file/dentry/inode has a corresponding lower object.  Our
> FiST templates for Linux include an extra mode---called "fist lite"---which
> saves on inodes and pages by having BOTH the upper and lower dentry point to
> the lower inode.  This saves memory (shared pages) and reduces layering
> overhead (but you can't intercept mmap ops, which some stackable f/s like to
> do).  The cost of such a trick is violating the clean layering separation: a
> dentry of the upper file system now points to an inode (via dentry->d_inode)
> of the lower file system!  To me, this is dangerous in the long run because
> objects from one layer can be "leaked" to another layer, with potentially
> disastrous results.

Currently, I don't think sharing pages is any kind of
violation. Additionally the dentry of the upper file system does NOT
point to the inode of the lower file system. Of course it can implement
the ->mmap operation.


> What you propose above with vm_operations requires a sequence of operations
> in the vm->fault operation which looks like:
> 
> 	saved_file = vma->vm_file;
> 	vma->vm_file = hidden_file;
> 	call the lower ->fault op passing it the modified vma
> 	vma->vm_file = saved_file;

Basically, yes.
But there are several things to do such as locking.


> Even if both of these techniques work (and they do, at least in limited
> testing I've done), there is something very unpleasant about having to
> temporarily override a field's value, then fix it again, after coming back
> from calling the lower op.  Aside from the uncleanliness of this kind of
> technique, it can seriously lead to races and other data corruptions when/if
> the temporarily-fixed fields "leak" outside the current code.  (I have a
> strong feeling that several kernel developers might vomit in disgust if I
> dared to submit such hacky patches to unionfs... :-)

I guess you probably forgot some locking.


> To me, any time such a hack has to be employed, it tells me that there's
> something wrong with the API in question.  And so I'd much rather see the
> API fixed The Right Way[tm], than promote such potentially unsafe practices.

If you changed some important members of internal structures without
locking, it would be unsafe and violate something.

Finally, I think the approach of sharing pages (you may call it
zero-copy, in contrast to your approach) is safe. At least, this approach
has been working for over a year while several people are using it.
Of course, I would never say it is bug-free. There may exist a problem
which I simply don't know about yet.


Sincerely,
Junjiro Okajima

Thread overview: 54+ messages
2007-12-10  2:41 [UNIONFS] 00/42 Unionfs and related patches review Erez Zadok
2007-12-10  2:41 ` [PATCH 01/42] Unionfs: filesystems documentation index Erez Zadok
2007-12-10 14:47   ` Jan Engelhardt
2007-12-13 15:06     ` Erez Zadok
2007-12-13 21:25       ` Jan Engelhardt
2007-12-10  2:41 ` [PATCH 02/42] Unionfs: unionfs " Erez Zadok
2007-12-10  2:41 ` [PATCH 03/42] Unionfs: documentation for general concepts Erez Zadok
2007-12-10  2:41 ` [PATCH 04/42] Unionfs: usage documentation for users Erez Zadok
2007-12-10  2:41 ` [PATCH 05/42] Unionfs: documentation for any known issues Erez Zadok
2007-12-10  2:41 ` [PATCH 06/42] Unionfs: documentation about renaming operations Erez Zadok
2007-12-10  2:41 ` [PATCH 07/42] Unionfs maintainers Erez Zadok
2007-12-10  2:41 ` [PATCH 08/42] Makefile: hook to compile unionfs Erez Zadok
2007-12-10  2:41 ` [PATCH 09/42] Unionfs: main Makefile Erez Zadok
2007-12-10  2:41 ` [PATCH 10/42] Unionfs: fanout header definitions Erez Zadok
2007-12-10  2:41 ` [PATCH 11/42] Unionfs: main header file Erez Zadok
2007-12-10  2:41 ` [PATCH 12/42] Unionfs: common file copyup/revalidation operations Erez Zadok
2007-12-10  2:41 ` [PATCH 13/42] Unionfs: basic file operations Erez Zadok
2007-12-10  2:41 ` [PATCH 14/42] Unionfs: lower-level copyup routines Erez Zadok
2007-12-10  2:41 ` [PATCH 15/42] Unionfs: dentry revalidation Erez Zadok
2007-12-10  2:41 ` [PATCH 16/42] Unionfs: lower-level lookup routines Erez Zadok
2007-12-10  2:41 ` [PATCH 17/42] Unionfs: rename method and helpers Erez Zadok
2007-12-10  2:41 ` [PATCH 18/42] Unionfs: directory reading file operations Erez Zadok
2007-12-10  2:41 ` [PATCH 19/42] Unionfs: readdir helper functions Erez Zadok
2007-12-10  2:41 ` [PATCH 20/42] Unionfs: readdir state helpers Erez Zadok
2007-12-10  2:41 ` [PATCH 21/42] Unionfs: inode operations Erez Zadok
2007-12-10  2:41 ` [PATCH 22/42] Unionfs: unlink/rmdir operations Erez Zadok
2007-12-10  2:41 ` [PATCH 23/42] Unionfs: address-space operations Erez Zadok
2007-12-10  2:41 ` [PATCH 24/42] Unionfs: mount-time and stacking-interposition functions Erez Zadok
2007-12-10  2:41 ` [PATCH 25/42] Unionfs: super_block operations Erez Zadok
2007-12-10  2:41 ` [PATCH 26/42] Unionfs: extended attributes operations Erez Zadok
2007-12-10  2:42 ` [PATCH 27/42] Unionfs: async I/O queue headers Erez Zadok
2007-12-10  2:42 ` [PATCH 28/42] Unionfs: async I/O queue operations Erez Zadok
2007-12-10  2:42 ` [PATCH 29/42] Unionfs: miscellaneous helper routines Erez Zadok
2007-12-10  2:42 ` [PATCH 30/42] Unionfs: debugging infrastructure Erez Zadok
2007-12-10  2:42 ` [PATCH 31/42] VFS: fs_stack header cleanups Erez Zadok
2007-12-10  2:42 ` [PATCH 32/42] Unionfs file system magic number Erez Zadok
2007-12-10  2:42 ` [PATCH 33/42] MM: extern for drop_pagecache_sb Erez Zadok
2007-12-10 14:04   ` Adrian Bunk
2007-12-13 15:05     ` Erez Zadok
2007-12-10  2:42 ` [PATCH 34/42] VFS path get/put ops used by Unionfs Erez Zadok
2007-12-10  2:42 ` [PATCH 35/42] Unionfs: common header file for user-land utilities and kernel Erez Zadok
2007-12-10  2:42 ` [PATCH 36/42] VFS: export drop_pagecache_sb Erez Zadok
2007-12-12  5:38   ` Nick Piggin
2007-12-13 15:24     ` Erez Zadok
2007-12-13 22:47       ` Nick Piggin
2007-12-10  2:42 ` [PATCH 37/42] VFS: export release_open_intent symbol Erez Zadok
2007-12-10  2:42 ` [PATCH 38/42] VFS: simplified fsstack_copy_attr_all Erez Zadok
2007-12-10  2:42 ` [PATCH 39/42] Put Unionfs and eCryptfs under one layered filesystems menu Erez Zadok
2007-12-10  2:42 ` [PATCH 40/42] eCryptfs: use simplified fs_stack API for dentry operations Erez Zadok
2007-12-10  2:42 ` [PATCH 41/42] eCryptfs: use simplified fs_stack API for inode operations Erez Zadok
2007-12-10  2:42 ` [PATCH 42/42] eCryptfs: use simplified fs_stack API for main operations Erez Zadok
2007-12-10  3:20 ` [UNIONFS] 00/42 Unionfs and related patches review hooanon05
2007-12-13 15:29   ` Erez Zadok
2007-12-14 21:15     ` hooanon05
