All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC Aufs2 #2 00/28] source files
@ 2009-03-16  7:20 J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 01/28] aufs documents J. R. Okajima
                   ` (27 more replies)
  0 siblings, 28 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

This is my second try to post aufs2 source files.
These are full aufs2 version against the mainline (linux-2.6.git).

Kindly review and let me know your comments.
And please consider incorporating into mainline.

Changes since last post,
(comments from Sam Ravnborg)
- correct my GIT work and split several patches against VFS. the number
  of patches became 28.
- add aufs_type.h to include/linux/Kbuild.
- remove priv.mk from fs/aufs/Makefile.
- and several tiny issues.


J. R. Okajima (28):
  aufs documents
  aufs public header file
  aufs module global
  aufs super_block
  aufs branch directory/filesystem
  aufs xino
  aufs object lifetime management via sysfs
  aufs mount options/flags
  aufs workqueue
  aufs sub-VFS
  aufs sub-dcache
  aufs copy-up
  aufs whiteout
  aufs pseudo-link
  aufs policies to select one among multiple writable branches
  aufs dentry and lookup
  aufs file
  aufs direcotry
  aufs inode
  aufs ioctl
  aufs sysfs entries
  aufs branch for loopback block device
  aufs internal inotify
  aufs test for fstype
  aufs debug
  export splice functions
  export lookup functions
  kbuild aufs

 Documentation/filesystems/aufs/README              |  253 ++++
 Documentation/filesystems/aufs/aufs.5              | 1514 ++++++++++++++++++++
 Documentation/filesystems/aufs/design/01intro.txt  |  128 ++
 Documentation/filesystems/aufs/design/02struct.txt |  205 +++
 Documentation/filesystems/aufs/design/03lookup.txt |   95 ++
 Documentation/filesystems/aufs/design/04branch.txt |   67 +
 .../filesystems/aufs/design/05wbr_policy.txt       |   57 +
 .../filesystems/aufs/design/06fmode_exec.txt       |   24 +
 Documentation/filesystems/aufs/design/07mmap.txt   |   44 +
 Documentation/filesystems/aufs/design/08plan.txt   |  169 +++
 fs/Kconfig                                         |    1 +
 fs/Makefile                                        |    1 +
 fs/aufs/Kconfig                                    |   78 +
 fs/aufs/Makefile                                   |   19 +
 fs/aufs/aufs.h                                     |   43 +
 fs/aufs/branch.c                                   |  944 ++++++++++++
 fs/aufs/branch.h                                   |  200 +++
 fs/aufs/cpup.c                                     | 1028 +++++++++++++
 fs/aufs/cpup.h                                     |   68 +
 fs/aufs/dcsub.c                                    |  214 +++
 fs/aufs/dcsub.h                                    |   43 +
 fs/aufs/debug.c                                    |  406 ++++++
 fs/aufs/debug.h                                    |  245 ++++
 fs/aufs/dentry.c                                   |  858 +++++++++++
 fs/aufs/dentry.h                                   |  213 +++
 fs/aufs/dinfo.c                                    |  351 +++++
 fs/aufs/dir.c                                      |  493 +++++++
 fs/aufs/dir.h                                      |  104 ++
 fs/aufs/f_op.c                                     |  551 +++++++
 fs/aufs/file.c                                     |  552 +++++++
 fs/aufs/file.h                                     |  148 ++
 fs/aufs/finfo.c                                    |  126 ++
 fs/aufs/fstype.h                                   |  434 ++++++
 fs/aufs/hinotify.c                                 |  746 ++++++++++
 fs/aufs/i_op.c                                     |  857 +++++++++++
 fs/aufs/i_op_add.c                                 |  625 ++++++++
 fs/aufs/i_op_del.c                                 |  471 ++++++
 fs/aufs/i_op_ren.c                                 |  929 ++++++++++++
 fs/aufs/iinfo.c                                    |  257 ++++
 fs/aufs/inode.c                                    |  356 +++++
 fs/aufs/inode.h                                    |  471 ++++++
 fs/aufs/ioctl.c                                    |   54 +
 fs/aufs/loop.c                                     |   46 +
 fs/aufs/loop.h                                     |   41 +
 fs/aufs/magic.mk                                   |   52 +
 fs/aufs/module.c                                   |  164 +++
 fs/aufs/module.h                                   |   66 +
 fs/aufs/opts.c                                     | 1438 +++++++++++++++++++
 fs/aufs/opts.h                                     |  180 +++
 fs/aufs/plink.c                                    |  335 +++++
 fs/aufs/rwsem.h                                    |   52 +
 fs/aufs/sbinfo.c                                   |  192 +++
 fs/aufs/spl.h                                      |   47 +
 fs/aufs/super.c                                    |  846 +++++++++++
 fs/aufs/super.h                                    |  266 ++++
 fs/aufs/sysaufs.c                                  |   79 +
 fs/aufs/sysaufs.h                                  |  102 ++
 fs/aufs/sysfs.c                                    |  308 ++++
 fs/aufs/sysrq.c                                    |  106 ++
 fs/aufs/vdir.c                                     |  776 ++++++++++
 fs/aufs/vfsub.c                                    |  716 +++++++++
 fs/aufs/vfsub.h                                    |  137 ++
 fs/aufs/wbr_policy.c                               |  628 ++++++++
 fs/aufs/whout.c                                    | 1010 +++++++++++++
 fs/aufs/whout.h                                    |   78 +
 fs/aufs/wkq.c                                      |  249 ++++
 fs/aufs/wkq.h                                      |   72 +
 fs/aufs/xino.c                                     | 1181 +++++++++++++++
 fs/namei.c                                         |    4 +-
 fs/splice.c                                        |   10 +-
 include/linux/Kbuild                               |    1 +
 include/linux/aufs_type.h                          |   98 ++
 include/linux/namei.h                              |    3 +
 include/linux/splice.h                             |    6 +
 74 files changed, 23724 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/filesystems/aufs/README
 create mode 100644 Documentation/filesystems/aufs/aufs.5
 create mode 100644 Documentation/filesystems/aufs/design/01intro.txt
 create mode 100644 Documentation/filesystems/aufs/design/02struct.txt
 create mode 100644 Documentation/filesystems/aufs/design/03lookup.txt
 create mode 100644 Documentation/filesystems/aufs/design/04branch.txt
 create mode 100644 Documentation/filesystems/aufs/design/05wbr_policy.txt
 create mode 100644 Documentation/filesystems/aufs/design/06fmode_exec.txt
 create mode 100644 Documentation/filesystems/aufs/design/07mmap.txt
 create mode 100644 Documentation/filesystems/aufs/design/08plan.txt
 create mode 100644 fs/aufs/Kconfig
 create mode 100644 fs/aufs/Makefile
 create mode 100644 fs/aufs/aufs.h
 create mode 100644 fs/aufs/branch.c
 create mode 100644 fs/aufs/branch.h
 create mode 100644 fs/aufs/cpup.c
 create mode 100644 fs/aufs/cpup.h
 create mode 100644 fs/aufs/dcsub.c
 create mode 100644 fs/aufs/dcsub.h
 create mode 100644 fs/aufs/debug.c
 create mode 100644 fs/aufs/debug.h
 create mode 100644 fs/aufs/dentry.c
 create mode 100644 fs/aufs/dentry.h
 create mode 100644 fs/aufs/dinfo.c
 create mode 100644 fs/aufs/dir.c
 create mode 100644 fs/aufs/dir.h
 create mode 100644 fs/aufs/f_op.c
 create mode 100644 fs/aufs/file.c
 create mode 100644 fs/aufs/file.h
 create mode 100644 fs/aufs/finfo.c
 create mode 100644 fs/aufs/fstype.h
 create mode 100644 fs/aufs/hinotify.c
 create mode 100644 fs/aufs/i_op.c
 create mode 100644 fs/aufs/i_op_add.c
 create mode 100644 fs/aufs/i_op_del.c
 create mode 100644 fs/aufs/i_op_ren.c
 create mode 100644 fs/aufs/iinfo.c
 create mode 100644 fs/aufs/inode.c
 create mode 100644 fs/aufs/inode.h
 create mode 100644 fs/aufs/ioctl.c
 create mode 100644 fs/aufs/loop.c
 create mode 100644 fs/aufs/loop.h
 create mode 100644 fs/aufs/magic.mk
 create mode 100644 fs/aufs/module.c
 create mode 100644 fs/aufs/module.h
 create mode 100644 fs/aufs/opts.c
 create mode 100644 fs/aufs/opts.h
 create mode 100644 fs/aufs/plink.c
 create mode 100644 fs/aufs/rwsem.h
 create mode 100644 fs/aufs/sbinfo.c
 create mode 100644 fs/aufs/spl.h
 create mode 100644 fs/aufs/super.c
 create mode 100644 fs/aufs/super.h
 create mode 100644 fs/aufs/sysaufs.c
 create mode 100644 fs/aufs/sysaufs.h
 create mode 100644 fs/aufs/sysfs.c
 create mode 100644 fs/aufs/sysrq.c
 create mode 100644 fs/aufs/vdir.c
 create mode 100644 fs/aufs/vfsub.c
 create mode 100644 fs/aufs/vfsub.h
 create mode 100644 fs/aufs/wbr_policy.c
 create mode 100644 fs/aufs/whout.c
 create mode 100644 fs/aufs/whout.h
 create mode 100644 fs/aufs/wkq.c
 create mode 100644 fs/aufs/wkq.h
 create mode 100644 fs/aufs/xino.c
 create mode 100644 include/linux/aufs_type.h


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 01/28] aufs documents
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 02/28] aufs public header file J. R. Okajima
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
design and manual

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 Documentation/filesystems/aufs/README              |  253 ++++
 Documentation/filesystems/aufs/aufs.5              | 1514 ++++++++++++++++++++
 Documentation/filesystems/aufs/design/01intro.txt  |  128 ++
 Documentation/filesystems/aufs/design/02struct.txt |  205 +++
 Documentation/filesystems/aufs/design/03lookup.txt |   95 ++
 Documentation/filesystems/aufs/design/04branch.txt |   67 +
 .../filesystems/aufs/design/05wbr_policy.txt       |   57 +
 .../filesystems/aufs/design/06fmode_exec.txt       |   24 +
 Documentation/filesystems/aufs/design/07mmap.txt   |   44 +
 Documentation/filesystems/aufs/design/08plan.txt   |  169 +++
 10 files changed, 2556 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/aufs/README
 create mode 100644 Documentation/filesystems/aufs/aufs.5
 create mode 100644 Documentation/filesystems/aufs/design/01intro.txt
 create mode 100644 Documentation/filesystems/aufs/design/02struct.txt
 create mode 100644 Documentation/filesystems/aufs/design/03lookup.txt
 create mode 100644 Documentation/filesystems/aufs/design/04branch.txt
 create mode 100644 Documentation/filesystems/aufs/design/05wbr_policy.txt
 create mode 100644 Documentation/filesystems/aufs/design/06fmode_exec.txt
 create mode 100644 Documentation/filesystems/aufs/design/07mmap.txt
 create mode 100644 Documentation/filesystems/aufs/design/08plan.txt

diff --git a/Documentation/filesystems/aufs/README b/Documentation/filesystems/aufs/README
new file mode 100644
index 0000000..8d4ba1f
--- /dev/null
+++ b/Documentation/filesystems/aufs/README
@@ -0,0 +1,253 @@
+
+Aufs2 -- advanced multi layered unification filesystem version 2
+http://aufs.sf.net
+Junjiro R. Okajima
+
+
+0. Introduction
+----------------------------------------
+In the early days, aufs was entirely re-designed and re-implemented
+Unionfs Version 1.x series. After many original ideas, approaches,
+improvements and implementations, it becomes totally different from
+Unionfs while keeping the basic features.
+Recently, Unionfs Version 2.x series begin taking some of the same
+approaches to aufs1's.
+Unionfs is being developed by Professor Erez Zadok at Stony Brook
+University and his team.
+
+This version of AUFS, aufs2 has several purposes.
+- to be reviewed easily and widely.
+- to make the source files simpler and smaller by dropping several
+  original features.
+
+Through this work, I found some bad things in aufs1 source code and
+fixed them. Some of the dropped features will be reverted in the future,
+but not all I'm afraid.
+Aufs2 supports linux-2.6.27 and later. If you want older kernel version
+support, try aufs1 from CVS on SourceForge.
+
+
+1. Features
+----------------------------------------
+- unite several directories into a single virtual filesystem. The member
+  directory is called as a branch.
+- you can specify the permission flags to the branch, which are 'readonly',
+  'readwrite' and 'whiteout-able.'
+- by upper writable branch, internal copyup and whiteout, files/dirs on
+  readonly branch are modifiable logically.
+- dynamic branch manipulation, add, del.
+- etc...
+
+Also there are many enhancements in aufs1, such as:
+- keep inode number by external inode number table
+- keep the timestamps of file/dir in internal copyup operation
+- seekable directory, supporting NFS readdir.
+- support mmap(2) including /proc/PID/exe symlink, without page-copy
+- whiteout is hardlinked in order to reduce the consumption of inodes
+  on branch
+- do not copyup, nor create a whiteout when it is unnecessary
+- revert a single systemcall when an error occurs in aufs
+- remount interface instead of ioctl
+- maintain /etc/mtab by an external shell script, /sbin/mount.aufs.
+- loopback mounted filesystem as a branch
+- kernel thread for removing the dir who has a plenty of whiteouts
+- support copyup sparse file (a file which has a 'hole' in it)
+- default permission flags for branches
+- selectable permission flags for ro branch, whether whiteout can
+  exist or not
+- export via NFS.
+- support <sysfs>/fs/aufs.
+- support multiple writable branches, some policies to select one
+  among multiple writable branches.
+- a new semantics for link(2) and rename(2) to support multiple
+  writable branches.
+- a delegation of the internal branch access to support task I/O
+  accounting, which also supports Linux Security Modules (LSM) mainly
+  for Suse AppArmor.
+- nested mount, i.e. aufs as readonly no-whiteout branch of another aufs.
+- copyup-on-open or copyup-on-write
+- show-whiteout mode
+- show configuration even out of kernel tree
+- no glibc changes are required.
+- pseudo hardlink (hardlink over branches)
+- allow a direct access manually to a file on branch, e.g. bypassing aufs.
+  including NFS or remote filesystem branch.
+- and more...
+
+Currently these features are dropped temporary from this version, aufs2.
+See design/08plan.txt in detail.
+- exporting via NFS
+- test only the highest one for the directory permission (dirperm1)
+- show whiteout mode (shwh)
+- copyup on open (coo=)
+- being another aufs's readonly branch (robr)
+- statistics of aufs thread (/sys/fs/aufs/stat)
+- delegation mode (dlgt)
+- intent.open/create (file open in a single lookup)
+
+Features or just an idea in the future (see also design/*.txt),
+- reorder the branch index without del/re-add.
+- permanent xino files
+- an option for refreshing the opened files after add/del branches
+- 'move' policy for copy-up between two writable branches, after
+  checking free space.
+- remount option copy/move between two branches. (almost done)
+- O_DIRECT
+- light version, without branch manipulation. (unnecessary?)
+- copyup in userspace
+- inotify in userspace
+- xattr, acl
+
+
+2. Download
+----------------------------------------
+Kindly one of aufs user, the Center for Scientific Computing and Free
+Software (C3SL), Federal University of Parana offered me a public GIT
+tree space.
+
+There are three GIT trees, aufs2-2.6, aufs2-standalone and aufs2-util.
+While the aufs2-util is always necessary, you need either of aufs2-2.6
+or aufs2-standalone.
+
+The aufs2-2.6 tree includes the whole linux-2.6 GIT tree,
+git://git.kernel.org/.../torvalds/linux-2.6.git.
+And you cannot select CONFIG_AUFS_FS=m for this version, eg. you cannot
+build aufs2 as an externel kernel module.
+If you already have linux-2.6 GIT tree, you may want to pull and merge
+the "aufs2" branch from this tree.
+
+On the other hand, the aufs2-standalone tree has only aufs2 source files
+and a necessary patch, and you can select CONFIG_AUFS_FS=m. In other
+words, the aufs2-standalone tree is generated from aufs2-2.6 tree by,
+- extract new files and modifications.
+- generate a single patch file from modifications.
+- generate a ChangeLog file from git-log.
+- commit the files newly and no log messages. this is not git-pull.
+
+Both of aufs2-2.6 and aufs2-standalone trees have a branch whose name is
+in form of "aufs2-xx" where "xx" represents the linux kernel version,
+"linux-2.6.xx".
+
+o aufs2-2.6 tree
+$ git clone --reference /your/linux-2.6/git/tree \
+	http://git.c3sl.ufpr.br/pub/scm/aufs/aufs2-2.6.git \
+	aufs2-2.6.git
+- if you don't have linux-2.6 GIT tree, then remove "--reference ..."
+$ cd aufs2-2.6.git
+$ git checkout origin/aufs2-xx	# for instance, aufs2-27 for linux-2.6.27
+				# aufs2 (no -xx) for the latest -rc version.
+
+o aufs2-standalone tree
+$ git clone http://git.c3sl.ufpr.br/pub/scm/aufs/aufs2-standalone.git \
+	aufs2-standalone.git
+$ cd aufs2-standalone.git
+$ git checkout origin/aufs2-xx	# for instance, aufs2-27 for linux-2.6.27
+- apply "aufs2-standalone.patch" to your kernel source files.
+
+o aufs2-util tree
+$ git clone http://git.c3sl.ufpr.br/pub/scm/aufs/aufs2-util.git \
+	aufs2-util.git
+$ cd aufs2-util.git
+- no particular tag/branch currently.
+
+
+3. Configuration and Compilation
+----------------------------------------
+For aufs2-2.6 tree,
+- enable CONFIG_EXPERIMENTAL and CONFIG_AUFS_FS.
+- set other aufs configurations if necessary.
+
+For aufs2-standalone tree,
+- enable CONFIG_EXPERIMENTAL and CONFIG_AUFS_FS, you can select =m.
+- edit fs/aufs/config.mk and set other aufs configurations if necessary.
+
+And then,
+- build your kernel (or a module) by "make"
+- install it and reboot your system
+- read README in aufs2-util, build and install it
+
+
+4. Usage
+----------------------------------------
+At first, make sure aufs2-util are installed, and please read the aufs
+manual, ./Documentation/filesystems/aufs/aufs.5.
+$ man -l aufs.5
+
+And then,
+$ mkdir /tmp/rw /tmp/aufs
+# mount -t aufs -o br=/tmp/rw:${HOME}=ro none /tmp/aufs
+
+Here is another example. The result is equivalent.
+# mount -t aufs -o br=/tmp/rw:${HOME} none /tmp/aufs
+  Or
+# mount -t aufs -o br:/tmp/rw none /tmp/aufs
+# mount -o remount,append:${HOME} /tmp/aufs
+
+Then, you can see whole tree of your home dir through /tmp/aufs. If
+you modify a file under /tmp/aufs, the one on your home directory is
+not affected, instead the same named file will be newly created under
+/tmp/rw. And all of your modification to a file will be applied to
+the one under /tmp/rw. This is called the file based Copy on Write
+(COW) method.
+Aufs mount options are described in aufs.5.
+
+Additionally, there are some sample usages of aufs which are a
+diskless system with network booting, and LiveCD over NFS.
+See sample dir in CVS tree on SourceForge.
+
+
+5. Contact
+----------------------------------------
+When you have any problems or strange behaviour in aufs, please let me
+know with:
+- /proc/mounts (instead of the output of mount(8))
+- /sys/fs/aufs/* (if you have them)
+- /sys/module/aufs/*
+- linux kernel version
+  if your kernel is not plain, for example modified by distributor,
+  the url where i can download its source is necessary too.
+- aufs version which was printed at loading the module or booting the
+  system, instead of the date you downloaded.
+- configuration (define/undefine CONFIG_AUFS_xxx)
+- kernel configuration or /proc/config.gz (if you have it)
+- behaviour which you think to be incorrect
+- actual operation, reproducible one is better
+- mailto: aufs-users at lists.sourceforge.net
+
+Usually, I don't watch the Public Areas(Bugs, Support Requests, Patches,
+and Feature Requests) on SourceForge. Please join and write to
+aufs-users ML.
+
+
+6. Acknowledgements
+----------------------------------------
+Thanks to everyone who have tried and are using aufs, whoever
+have reported a bug or any feedback.
+
+Especially donors:
+Tomas Matejicek(slax.org) made a donation (much more than once).
+Dai Itasaka made a donation (2007/8).
+Chuck Smith made a donation (2008/4, 10 and 12).
+Henk Schoneveld made a donation (2008/9).
+Chih-Wei Huang, ASUS, CTC donated Eee PC 4G (2008/10).
+Francois Dupoux made a donation (2008/11).
+Bruno Cesar Ribas and Luis Carlos Erpen de Bona, C3SL serves public GIT
+tree (2009/2).
+William Grant made a donation (2009/3).
+
+Thank you very much.
+Donations are always, including future donations, very important and
+helpful for me to keep on developing aufs.
+
+
+7.
+----------------------------------------
+If you are an experienced user, no explanation is needed. Aufs is
+just a linux filesystem.
+
+
+Enjoy!
+
+# Local variables: ;
+# mode: text;
+# End: ;
diff --git a/Documentation/filesystems/aufs/aufs.5 b/Documentation/filesystems/aufs/aufs.5
new file mode 100644
index 0000000..0b485ea
--- /dev/null
+++ b/Documentation/filesystems/aufs/aufs.5
@@ -0,0 +1,1514 @@
+.ds AUFS_VERSION aufs2-base6
+.ds AUFS_XINO_FNAME .aufs.xino
+.ds AUFS_XINO_DEFPATH /tmp/.aufs.xino
+.ds AUFS_DIRWH_DEF 3
+.ds AUFS_WH_PFX .wh.
+.ds AUFS_WH_PFX_LEN 4
+.ds AUFS_WKQ_NAME aufsd
+.ds AUFS_NWKQ_DEF 4
+.ds AUFS_WH_DIROPQ .wh..wh..opq
+.ds AUFS_WH_BASE .wh..wh.aufs
+.ds AUFS_WH_PLINKDIR .wh..wh.plnk
+.ds AUFS_BRANCH_MAX 127
+.ds AUFS_MFS_SECOND_DEF 30
+.\".so aufs.tmac
+.
+.eo
+.de TQ
+.br
+.ns
+.TP \$1
+..
+.de Bu
+.IP \(bu 4
+..
+.ec
+.\" end of macro definitions
+.
+.\" ----------------------------------------------------------------------
+.TH aufs 5 \*[AUFS_VERSION] Linux "Linux Aufs User\[aq]s Manual"
+.SH NAME
+aufs \- another unionfs. version \*[AUFS_VERSION]
+
+.\" ----------------------------------------------------------------------
+.SH DESCRIPTION
+Aufs is a stackable unification filesystem such as Unionfs, which unifies
+several directories and provides a merged single directory.
+In the early days, aufs was entirely re-designed and re-implemented
+Unionfs Version 1.x series. After
+many original ideas, approaches and improvements, it
+becomes totally different from Unionfs while keeping the basic features.
+See Unionfs Version 1.x series for the basic features.
+Recently, Unionfs Version 2.x series begin taking some of same
+approaches to aufs\[aq]s.
+
+.\" ----------------------------------------------------------------------
+.SH MOUNT OPTIONS
+At mount-time, the order of interpreting options is,
+.RS
+.Bu
+simple flags, except xino/noxino and udba=inotify
+.Bu
+branches
+.Bu
+xino/noxino
+.Bu
+udba=inotify
+.RE
+
+At remount-time,
+the options are interpreted in the given order,
+e.g. left to right.
+.RS
+.Bu
+create or remove
+whiteout-base(\*[AUFS_WH_BASE]) and
+whplink-dir(\*[AUFS_WH_PLINKDIR]) if necessary
+.RE
+.
+.TP
+.B br:BRANCH[:BRANCH ...] (dirs=BRANCH[:BRANCH ...])
+Adds new branches.
+(cf. Branch Syntax).
+
+Aufs rejects the branch which is an ancestor or a descendant of anther
+branch. It is called overlapped. When the branch is loopback-mounted
+directory, aufs also checks the source fs-image file of loopback
+device. If the source file is a descendant of another branch, it will
+be rejected too.
+
+After mounting aufs or adding a branch, if you move a branch under
+another branch and make it descendant of anther branch, aufs will not
+work correctly.
+.
+.TP
+.B [ add | ins ]:index:BRANCH
+Adds a new branch.
+The index begins with 0.
+Aufs creates
+whiteout-base(\*[AUFS_WH_BASE]) and
+whplink-dir(\*[AUFS_WH_PLINKDIR]) if necessary.
+
+If there is the same named file on the lower branch (larger index),
+aufs will hide the lower file.
+You can only see the highest file.
+You will be confused if the added branch has whiteouts (including
+diropq), they may or may not hide the lower entries.
+.\" It is recommended to make sure that the added branch has no whiteout.
+
+If a process have once mapped a file by mmap(2) with MAP_SHARED
+and the same named file exists on the lower branch,
+the process still refers the file on the lower(hidden)
+branch after adding the branch.
+If you want to update the contents of a process address space after
+adding, you need to restart your process or open/mmap the file again.
+.\" Usually, such files are executables or shared libraries.
+(cf. Branch Syntax).
+.
+.TP
+.B del:dir
+Removes a branch.
+Aufs does not remove
+whiteout-base(\*[AUFS_WH_BASE]) and
+whplink-dir(\*[AUFS_WH_PLINKDIR]) automatically.
+For example, when you add a RO branch which was unified as RW, you
+will see whiteout-base or whplink-dir on the added RO branch.
+
+If a process is referencing the file/directory on the deleting branch
+(by open, mmap, current working directory, etc.), aufs will return an
+error EBUSY.
+.
+.TP
+.B mod:BRANCH
+Modifies the permission flags of the branch.
+Aufs creates or removes
+whiteout-base(\*[AUFS_WH_BASE]) and/or
+whplink-dir(\*[AUFS_WH_PLINKDIR]) if necessary.
+
+If the branch permission is been changing \[oq]rw\[cq] to \[oq]ro\[cq], and a process
+is mapping a file by mmap(2)
+.\" with MAP_SHARED
+on the branch, the process may or may not
+be able to modify its mapped memory region after modifying branch
+permission flags.
+(cf. Branch Syntax).
+.
+.TP
+.B append:BRANCH
+equivalent to \[oq]add:(last index + 1):BRANCH\[cq].
+(cf. Branch Syntax).
+.
+.TP
+.B prepend:BRANCH
+equivalent to \[oq]add:0:BRANCH.\[cq]
+(cf. Branch Syntax).
+.
+.TP
+.B xino=filename
+Use external inode number bitmap and translation table.
+It is set to
+<FirstWritableBranch>/\*[AUFS_XINO_FNAME] by default, or
+\*[AUFS_XINO_DEFPATH].
+Comma character in filename is not allowed.
+
+The files are created per an aufs and per a branch filesystem, and
+unlinked. So you
+cannot find this file, but it exists and is read/written frequently by
+aufs.
+(cf. External Inode Number Bitmap, Translation Table).
+.
+.TP
+.B noxino
+Stop using external inode number bitmap and translation table.
+
+If you use this option,
+Some applications will not work correctly.
+.\" And pseudo link feature will not work after the inode cache is
+.\" shrunk.
+(cf. External Inode Number Bitmap, Translation Table).
+.
+.TP
+.B trunc_xib
+Truncate the external inode number bitmap file. The truncation is done
+automatically when you delete a branch unless you do not specify
+\[oq]notrunc_xib\[cq] option.
+(cf. External Inode Number Bitmap, Translation Table).
+.
+.TP
+.B notrunc_xib
+Stop truncating the external inode number bitmap file when you delete
+a branch.
+(cf. External Inode Number Bitmap, Translation Table).
+.
+.TP
+.B create_policy | create=CREATE_POLICY
+.TQ
+.B copyup_policy | copyup | cpup=COPYUP_POLICY
+Policies to select one among multiple writable branches. The default
+values are \[oq]create=tdp\[cq] and \[oq]cpup=tdp\[cq].
+link(2) and rename(2) systemcalls have an exception. In aufs, they
+try keeping their operations in the branch where the source exists.
+(cf. Policies to Select One among Multiple Writable Branches).
+.
+.TP
+.B verbose | v
+Print some information.
+Currently, it is only busy file (or inode) at deleting a branch.
+.
+.TP
+.B noverbose | quiet | q | silent
+Disable \[oq]verbose\[cq] option.
+This is default value.
+.
+.TP
+.B sum
+df(1)/statfs(2) returns the total number of blocks and inodes of
+all branches.
+Note that there are cases that systemcalls may return ENOSPC, even if
+df(1)/statfs(2) shows that aufs has some free space/inode.
+.
+.TP
+.B nosum
+Disable \[oq]sum\[cq] option.
+This is default value.
+.
+.TP
+.B dirwh=N
+Watermark to remove a dir actually at rmdir(2) and rename(2).
+
+If the target dir which is being removed or renamed (destination dir)
+has a huge number of whiteouts, i.e. the dir is empty logically but
+physically, the cost to remove/rename the single
+dir may be very high.
+It is
+required to unlink all of whiteouts internally before issuing
+rmdir/rename to the branch.
+To reduce the cost of single systemcall,
+aufs renames the target dir to a whiteout-ed temporary name and
+invokes a pre-created
+kernel thread to remove whiteout-ed children and the target dir.
+The rmdir/rename systemcall returns just after kicking the thread.
+
+When the number of whiteout-ed children is less than the value of
+dirwh, aufs remove them in a single systemcall instead of passing
+another thread.
+This value is ignored when the branch is NFS.
+The default value is \*[AUFS_DIRWH_DEF].
+.
+.TP
+.B plink
+.TQ
+.B noplink
+Specifies to use \[oq]pseudo link\[cq] feature or not.
+The default is \[oq]plink\[cq] which means use this feature.
+(cf. Pseudo Link)
+.
+.TP
+.B clean_plink
+Removes all pseudo-links in memory.
+In order to make pseudo-link permanent, use
+\[oq]auplink\[cq] utility just before one of these operations,
+unmounting aufs,
+using \[oq]ro\[cq] or \[oq]noplink\[cq] mount option,
+deleting a branch from aufs,
+adding a branch into aufs,
+or changing your writable branch as readonly.
+If you installed both of /sbin/mount.aufs and /sbin/umount.aufs, and your
+mount(8) and umount(8) support them,
+\[oq]auplink\[cq] utility will be executed automatically and flush pseudo-links.
+(cf. Pseudo Link)
+.
+.TP
+.B udba=none | reval | inotify
+Specifies the level of UDBA (User\[aq]s Direct Branch Access) test.
+(cf. User\[aq]s Direct Branch Access and Inotify Limitation).
+.
+.TP
+.B diropq=whiteouted | w | always | a
+Specifies whether mkdir(2) and rename(2) dir case make the created directory
+\[oq]opaque\[cq] or not.
+In other words, to create \[oq]\*[AUFS_WH_DIROPQ]\[cq] under the created or renamed
+directory, or not to create.
+When you specify diropq=w or diropq=whiteouted, aufs will not create
+it if the
+directory was not whiteouted or opaqued. If the directory was whiteouted
+or opaqued, the created or renamed directory will be opaque.
+When you specify diropq=a or diropq==always, aufs will always create
+it regardless
+the directory was whiteouted/opaqued or not.
+The default value is diropq=w, it means not to create when it is unnecessary.
+If you define CONFIG_AUFS_COMPAT at aufs compiling time, the default will be
+diropq=a.
+You need to consider this option if you are planning to add a branch later
+since \[oq]diropq\[cq] affects the same named directory on the added branch.
+.
+.TP
+.B warn_perm
+.TQ
+.B nowarn_perm
+Adding a branch, aufs will issue a warning about uid/gid/permission of
+the adding branch directory,
+when they differ from the existing branch\[aq]s. This difference may or
+may not impose a security risk.
+If you are sure that there is no problem and want to stop the warning,
+use \[oq]nowarn_perm\[cq] option.
+The default is \[oq]warn_perm\[cq] (cf. DIAGNOSTICS).
+
+.\" ----------------------------------------------------------------------
+.SH Module Parameters
+.TP
+.B nwkq=N
+The number of kernel thread named \*[AUFS_WKQ_NAME].
+
+Those threads stay in the system while the aufs module is loaded,
+and handle the special I/O requests from aufs.
+The default value is \*[AUFS_NWKQ_DEF].
+
+The special I/O requests from aufs include a part of copy-up, lookup,
+directory handling, pseudo-link, xino file operations and the
+delegated access to branches.
+For example, Unix filesystems allow you to rmdir(2) which has no write
+permission bit, if its parent directory has write permission bit. In aufs, the
+removing directory may or may not have whiteout or \[oq]dir opaque\[cq] mark as its
+child. And aufs needs to unlink(2) them before rmdir(2).
+Therefore aufs delegates the actual unlink(2) and rmdir(2) to another kernel
+thread which has been created already and has a superuser privilege.
+
+If you enable CONFIG_SYSFS, you can check this value through
+<sysfs>/module/aufs/parameters/nwkq.
+
+.
+.TP
+.B brs=1 | 0
+Specifies to use the branch path data file under sysfs or not.
+
+If the number of your branches is large or their path is long
+and you meet the limitation of mount(8) ro /etc/mtab, you need to
+enable CONFIG_SYSFS and set aufs module parameter brs=1.
+If your linux version is linux\-2.6.24 and earlier, you need to enable
+CONFIG_AUFS_SYSAUFS too.
+
+When this parameter is set as 1, aufs does not show \[oq]br:\[cq] (or dirs=)
+mount option through /proc/mounts, and /sbin/mount.aufs does not put it
+to /etc/mtab. So you can keep yourself from the page limitation of
+mount(8) or /etc/mtab.
+Aufs shows branch paths through <sysfs>/fs/aufs/si_XXX/brNNN.
+Actually the file under sysfs has also a size limitation, but I don\[aq]t
+think it is harmful.
+
+The default is brs=0, which means <sysfs>/fs/aufs/si_XXX/brNNN does not exist
+and \[oq]br:\[cq] option will appear in /proc/mounts, and /etc/mtab if you
+install /sbin/mount.aufs.
+If you did not enable CONFIG_AUFS_SYSAUFS (for
+linux\-2.6.24 and earlier), this parameter will be
+ignored.
+
+There is one more side effect in setting 1 to this parameter.
+If you rename your branch, the branch path written in /etc/mtab will be
+obsoleted and the future remount will meet some error due to the
+unmatched parameters (Remember that mount(8) may take the options from
+/etc/mtab and pass them to the systemcall).
+If you set 1, /etc/mtab will not hold the branch path and you will not
+meet such trouble. On the other hand, the entires for the
+branch path under sysfs are generated dynamically. So it must not be obsoleted.
+But I don\[aq]t think users want to rename branches so often.
+.
+.TP
+.B sysrq=key
+Specifies MagicSysRq key for debugging aufs.
+You need to enable both of CONFIG_MAGIC_SYSRQ and CONFIG_AUFS_DEBUG.
+If your linux version is linux\-2.6.24 and earlier, you need to enable
+CONFIG_AUFS_SYSAUFS too.
+Currently this is for developers only.
+The default is \[oq]a\[cq].
+
+.\" ----------------------------------------------------------------------
+.SH Branch Syntax
+.TP
+.B dir_path[ =permission [ + attribute ] ]
+.TQ
+.B permission := rw | ro | rr
+.TQ
+.B attribute := wh | nolwh
+dir_path is a directory path.
+The keyword after \[oq]dir_path=\[cq] is a
+permission flags for that branch.
+Comma, colon and the permission flags string (including \[oq]=\[cq])in the path
+are not allowed.
+
+Any filesystem can be a branch, except aufs, sysfs, procfs and unionfs.
+If you specify such filesystems as an aufs branch, aufs will return an error
+saying it is unsupported.
+
+Cramfs in linux stable release has strange inodes and it makes aufs
+confused. For example,
+.nf
+$ mkdir -p w/d1 w/d2
+$ > w/z1
+$ > w/z2
+$ mkcramfs w cramfs
+$ sudo mount -t cramfs -o ro,loop cramfs /mnt
+$ find /mnt -ls
+    76    1 drwxr-xr-x   1 jro      232            64 Jan  1  1970 /mnt
+     1    1 drwxr-xr-x   1 jro      232             0 Jan  1  1970 /mnt/d1
+     1    1 drwxr-xr-x   1 jro      232             0 Jan  1  1970 /mnt/d2
+     1    1 -rw-r--r--   1 jro      232             0 Jan  1  1970 /mnt/z1
+     1    1 -rw-r--r--   1 jro      232             0 Jan  1  1970 /mnt/z2
+.fi
+
+All these two directories and two files have the same inode with one
+as their link count. Aufs cannot handle such inode correctly.
+Currently, aufs involves a tiny workaround for such inodes. But some
+applications may not work correctly since aufs inode number for such
+inode will change silently.
+If you do not have any empty files, empty directories or special files,
+inodes on cramfs will be all fine.
+
+A branch should not be shared as the writable branch between multiple
+aufs. A readonly branch can be shared.
+
+The maximum number of branches is configurable at compile time.
+The current value is \*[AUFS_BRANCH_MAX] which depends upon
+configuration.
+
+When an unknown permission or attribute is given, aufs sets ro to that
+branch silently.
+
+.SS Permission
+.
+.TP
+.B rw
+Readable and writable branch. Set as default for the first branch.
+If the branch filesystem is mounted as readonly, you cannot set it \[oq]rw.\[cq]
+.\" A filesystem which does not support link(2) and i_op\->setattr(), for
+.\" example FAT, will not be used as the writable branch.
+.
+.TP
+.B ro
+Readonly branch and it has no whiteouts on it.
+Set as default for all branches except the first one. Aufs never issue
+both of write operation and lookup operation for whiteout to this branch.
+.
+.TP
+.B rr
+Real readonly branch, special case of \[oq]ro\[cq], for natively readonly
+branch. Assuming the branch is natively readonly, aufs can optimize
+some internal operation. For example, if you specify \[oq]udba=inotify\[cq]
+option, aufs does not set inotify for the things on rr branch.
+Set by default for a branch whose fs-type is either \[oq]iso9660\[cq],
+\[oq]cramfs\[cq] or \[oq]romfs\[cq].
+
+When your branch exists on slower device and you have some
+capacity on your hdd, you may want to try ulobdev tool in ULOOP sample.
+It can cache the contents of the real devices on another faster device,
+so you will be able to get the better access performance.
+The ulobdev tool is for a generic block device, and the ulohttp is for a
+filesystem image on http server.
+If you want to spin down your hdd to save the
+battery life or something, then you may want to use ulobdev to save the
+access to the hdd, too.
+See $AufsCVS/sample/uloop in detail.
+
+.SS Attribute
+.
+.TP
+.B wh
+Readonly branch and it has/might have whiteouts on it.
+Aufs never issue write operation to this branch, but lookup for whiteout.
+Use this as \[oq]<branch_dir>=ro+wh\[cq].
+.
+.TP
+.B nolwh
+Usually, aufs creates a whiteout as a hardlink on a writable
+branch. This attributes prohibits aufs to create the hardlinked
+whiteout, including the source file of all hardlinked whiteout
+(\*[AUFS_WH_BASE].)
+If you do not like a hardlink, or your writable branch does not support
+link(2), then use this attribute.
+But I am afraid a filesystem which does not support link(2) natively
+will fail in other place such as copy-up.
+Use this as \[oq]<branch_dir>=rw+nolwh\[cq].
+Also you may want to try \[oq]noplink\[cq] mount option, while it is not recommended.
+
+.\" .SS FUSE as a branch
+.\" A FUSE branch needs special attention.
+.\" The struct fuse_operations has a statfs operation. It is OK, but the
+.\" parameter is struct statvfs* instead of struct statfs*. So almost
+.\" all user\-space implementaion will call statvfs(3)/fstatvfs(3) instead of
+.\" statfs(2)/fstatfs(2).
+.\" In glibc, [f]statvfs(3) issues [f]statfs(2), open(2)/read(2) for
+.\" /proc/mounts,
+.\" and stat(2) for the mountpoint. With this situation, a FUSE branch will
+.\" cause a deadlock in creating something in aufs. Here is a sample
+.\" scenario,
+.\" .\" .RS
+.\" .\" .IN -10
+.\" .Bu
+.\" create/modify a file just under the aufs root dir.
+.\" .Bu
+.\" aufs aquires a write\-lock for the parent directory, ie. the root dir.
+.\" .Bu
+.\" A library function or fuse internal may call statfs for a fuse branch.
+.\" The create=mfs mode in aufs will surely call statfs for each writable
+.\" branches.
+.\" .Bu
+.\" FUSE in kernel\-space converts and redirects the statfs request to the
+.\" user\-space.
+.\" .Bu
+.\" the user\-space statfs handler will call [f]statvfs(3).
+.\" .Bu
+.\" the [f]statvfs(3) in glibc will access /proc/mounts and issue
+.\" stat(2) for the mountpoint. But those require a read\-lock for the aufs
+.\" root directory.
+.\" .Bu
+.\" Then a deadlock occurs.
+.\" .\" .RE 1
+.\" .\" .IN
+.\" 
+.\" In order to avoid this deadlock, I would suggest not to call
+.\" [f]statvfs(3) from fuse. Here is a sample code to do this.
+.\" .nf
+.\" struct statvfs stvfs;
+.\" 
+.\" main()
+.\" {
+.\" 	statvfs(..., &stvfs)
+.\" 	or
+.\" 	fstatvfs(..., &stvfs)
+.\" 	stvfs.f_fsid = 0
+.\" }
+.\" 
+.\" statfs_handler(const char *path, struct statvfs *arg)
+.\" {
+.\" 	struct statfs stfs
+.\" 
+.\" 	memcpy(arg, &stvfs, sizeof(stvfs))
+.\" 
+.\" 	statfs(..., &stfs)
+.\" 	or
+.\" 	fstatfs(..., &stfs)
+.\" 
+.\" 	arg->f_bfree = stfs.f_bfree
+.\" 	arg->f_bavail = stfs.f_bavail
+.\" 	arg->f_ffree = stfs.f_ffree
+.\" 	arg->f_favail = /* any value */
+.\" }
+.\" .fi
+
+.\" ----------------------------------------------------------------------
+.SH External Inode Number Bitmap, Translation Table (xino)
+Aufs uses one external bitmap file and one external inode number
+translation table files per an aufs and per a branch
+filesystem by default.
+The bitmap is for recycling aufs inode number
+and the others
+are a table for converting an inode number on a branch to
+an aufs inode number. The default path
+is \[oq]first writable branch\[cq]/\*[AUFS_XINO_FNAME].
+If there is no writable branch, the
+default path
+will be \*[AUFS_XINO_DEFPATH].
+.\" A user who executes mount(8) needs the privilege to create xino
+.\" file.
+
+Those files are always opened and read/write by aufs frequently.
+If your writable branch is on flash memory device, it is recommended
+to put xino files on other than flash memory by specifying \[oq]xino=\[cq]
+mount option.
+
+The
+maximum file size of the bitmap is, basically, the amount of the
+number of all the files on all branches divided by 8 (the number of
+bits in a byte).
+For example, on a 4KB page size system, if you have 32,768 (or
+2,599,968) files in aufs world,
+then the maximum file size of the bitmap is 4KB (or 320KB).
+
+The
+maximum file size of the table will
+be \[oq]max inode number on the branch x size of an inode number\[cq].
+For example in 32bit environment,
+
+.nf
+$ df -i /branch_fs
+/dev/hda14           2599968  203127 2396841    8% /branch_fs
+.fi
+
+and /branch_fs is an branch of the aufs. When the inode number is
+assigned contiguously (without \[oq]hole\[cq]), the maximum xino file size for
+/branch_fs will be 2,599,968 x 4 bytes = about 10 MB. But it might not be
+allocated all of disk blocks.
+When the inode number is assigned discontinuously, the maximum size of
+xino file will be the largest inode number on a branch x 4 bytes.
+Additionally, the file size is limited to LLONG_MAX or the s_maxbytes
+in filesystem\[aq]s superblock (s_maxbytes may be smaller than
+LLONG_MAX). So the
+support-able largest inode number on a branch is less than
+2305843009213693950 (LLONG_MAX/4\-1).
+This is the current limitation of aufs.
+On 64bit environment, this limitation becomes more strict and the
+supported largest inode number is less than LLONG_MAX/8\-1.
+
+The xino files are always hidden, i.e. removed. So you cannot
+do \[oq]ls \-l xino_file\[cq].
+If you enable CONFIG_SYSFS, you can check these information through
+<sysfs>/fs/aufs/<si_id>/xino (for linux\-2.6.24 and earlier, you
+need to enable CONFIG_AUFS_SYSAUFS too).
+The first line in <sysfs>/fs/aufs/<si_id>/xino (and xigen) shows the
+information of the bitmap file, in the format of,
+
+.nf
+<blocks>x<block size> <file size>
+.fi
+
+Note that a filesystem usually has a
+feature called pre-allocation, which means a number of
+blocks are allocated automatically, and then deallocated
+silently when the filesystem thinks they are unnecessary.
+You do not have to be surprised the sudden changes of the number of
+blocks, when your filesystem which xino files are placed supports the
+pre-allocation feature.
+
+The rests are hidden xino file information in the format of,
+
+.nf
+<branch index>: <file count>, <blocks>x<block size> <file size>
+.fi
+
+If the file count is larger than 1, it means some of your branches are
+on the same filesystem and the xino file is shared by them.
+Note that the file size may not be equal to the actual consuming blocks
+since xino file is a sparse file, i.e. a hole in a file which does not
+consume any disk blocks.
+
+Once you unmount aufs, the xino files for that aufs are totally gone.
+It means that the inode number is not permanent.
+
+The xino files should be created on the filesystem except NFS.
+If your first writable branch is NFS, you will need to specify xino
+file path other than NFS.
+Also if you are going to remove the branch where xino files exist or
+change the branch permission to readonly, you need to use xino option
+before del/mod the branch.
+
+The bitmap file can be truncated.
+For example, if you delete a branch which has huge number of files,
+many inode numbers will be recycled and the bitmap will be truncated
+to smaller size. Aufs does this automatically when a branch is
+deleted.
+You can truncate it anytime you like if you specify \[oq]trunc_xib\[cq] mount
+option. But when the accessed inode number was not deleted, nothing
+will be truncated.
+If you do not want to truncate it (it may be slow) when you delete a
+branch, specify \[oq]notrunc_xib\[cq] after \[oq]del\[cq] mount option.
+
+If you do not want to use xino, use noxino mount option. Use this
+option with care, since the inode number may be changed silently and
+unexpectedly anytime.
+For example,
+rmdir failure, recursive chmod/chown/etc to a large and deep directory
+or anything else.
+And some applications will not work correctly.
+.\" When the inode number has been changed, your system
+.\" can be crazy.
+If you want to change the xino default path, use xino mount option.
+
+After you add branches, the persistence of inode number may not be
+guaranteed.
+At remount time, cached but unused inodes are discarded.
+And the newly appeared inode may have different inode number at the
+next access time. The inodes in use have the persistent inode number.
+
+When aufs assigned an inode number to a file, and if you create the
+same named file on the upper branch directly, then the next time you
+access the file, aufs may assign another inode number to the file even
+if you use xino option.
+Some applications may treat the file whose inode number has been
+changed as totally different file.
+
+.\" ----------------------------------------------------------------------
+.SH Pseudo Link (hardlink over branches)
+Aufs supports \[oq]pseudo link\[cq] which is a logical hard-link over
+branches (cf. ln(1) and link(2)).
+In other words, a copied-up file by link(2) and a copied-up file which was
+hard-linked on a readonly branch filesystem.
+
+When you have files named fileA and fileB which are
+hardlinked on a readonly branch, if you write something into fileA,
+aufs copies-up fileA to a writable branch, and write(2) the originally
+requested thing to the copied-up fileA. On the writable branch,
+fileA is not hardlinked.
+But aufs remembers it was hardlinked, and handles fileB as if it existed
+on the writable branch, by referencing  fileA\[aq]s inode on the writable
+branch as fileB\[aq]s inode.
+
+Once you unmount aufs, the plink info for that aufs kept in memory are totally
+gone.
+It means that the pseudo-link is not permanent.
+If you want to make plink permanent, try \[oq]auplink\[cq] utility just before
+one of these operations,
+unmounting your aufs,
+using \[oq]ro\[cq] or \[oq]noplink\[cq] mount option,
+deleting a branch from aufs,
+adding a branch into aufs,
+or changing your writable branch to readonly.
+
+This utility will reproduces all real hardlinks on a writable branch by linking
+them, and removes pseudo-link info in memory and temporary link on the
+writable branch.
+Since this utility access your branches directly, you cannot hide them by
+\[oq]mount \-\-bind /tmp /branch\[cq] or something.
+
+If you are willing to rebuild your aufs with the same branches later, you
+should use auplink utility before you umount your aufs.
+If you installed both of /sbin/mount.aufs and /sbin/umount.aufs, and your
+mount(8) and umount(8) support them,
+\[oq]auplink\[cq] utility will be executed automatically and flush pseudo-links.
+
+.nf
+# auplink /your/aufs/root flush
+# umount /your/aufs/root
+or
+# auplink /your/aufs/root flush
+# mount -o remount,mod:/your/writable/branch=ro /your/aufs/root
+or
+# auplink /your/aufs/root flush
+# mount -o remount,noplink /your/aufs/root
+or
+# auplink /your/aufs/root flush
+# mount -o remount,del:/your/aufs/branch /your/aufs/root
+or
+# auplink /your/aufs/root flush
+# mount -o remount,append:/your/aufs/branch /your/aufs/root
+.fi
+
+The plinks are kept both in memory and on disk. When they consumes too much
+resources on your system, you can use the \[oq]auplink\[cq] utility at anytime and
+throw away the unnecessary pseudo-links in safe.
+
+Additionally, the \[oq]auplink\[cq] utility is very useful for some security reasons.
+For example, when you have a directory whose permission flags
+are 0700, and a file who is 0644 under the 0700 directory. Usually,
+all files under the 0700 directory are private and no one else can see
+the file. But when the directory is 0711 and someone else knows the 0644
+filename, he can read the file.
+
+Basically, aufs pseudo-link feature creates a temporary link under the
+directory whose owner is root and the permission flags are 0700.
+But when the writable branch is NFS, aufs sets 0711 to the directory.
+When the 0644 file is pseudo-linked, the temporary link, of course the
+contents of the file is totally equivalent, will be created under the
+0711 directory. The filename will be generated by its inode number.
+While it is hard to know the generated filename, someone else may try peeping
+the temporary pseudo-linked file by his software tool which may try the name
+from one to MAX_INT or something.
+In this case, the 0644 file will be read unexpectedly.
+I am afraid that leaving the temporary pseudo-links can be a security hole.
+It makes sense to execute \[oq]auplink /your/aufs/root flush\[cq]
+periodically, when your writable branch is NFS.
+
+When your writable branch is not NFS, or all users are careful enough to set 0600
+to their private files, you do not have to worry about this issue.
+
+If you do not want this feature, use \[oq]noplink\[cq] mount option.
+
+.SS The behaviours of plink and noplink
+This sample shows that the \[oq]f_src_linked2\[cq] with \[oq]noplink\[cq] option cannot follow
+the link.
+
+.nf
+none on /dev/shm/u type aufs (rw,xino=/dev/shm/rw/.aufs.xino,br:/dev/shm/rw=rw:/dev/shm/ro=ro)
+$ ls -li ../r?/f_src_linked* ./f_src_linked* ./copied
+ls: ./copied: No such file or directory
+15 -rw-r--r--  2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked
+15 -rw-r--r--  2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked2
+22 -rw-r--r--  2 jro jro 2 Dec 22 11:03 ./f_src_linked
+22 -rw-r--r--  2 jro jro 2 Dec 22 11:03 ./f_src_linked2
+$ echo abc >> f_src_linked
+$ cp f_src_linked copied
+$ ls -li ../r?/f_src_linked* ./f_src_linked* ./copied
+15 -rw-r--r--  2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked
+15 -rw-r--r--  2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked2
+36 -rw-r--r--  2 jro jro 6 Dec 22 11:03 ../rw/f_src_linked
+53 -rw-r--r--  1 jro jro 6 Dec 22 11:03 ./copied
+22 -rw-r--r--  2 jro jro 6 Dec 22 11:03 ./f_src_linked
+22 -rw-r--r--  2 jro jro 6 Dec 22 11:03 ./f_src_linked2
+$ cmp copied f_src_linked2
+$
+
+none on /dev/shm/u type aufs (rw,xino=/dev/shm/rw/.aufs.xino,noplink,br:/dev/shm/rw=rw:/dev/shm/ro=ro)
+$ ls -li ../r?/f_src_linked* ./f_src_linked* ./copied
+ls: ./copied: No such file or directory
+17 -rw-r--r--  2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked
+17 -rw-r--r--  2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked2
+23 -rw-r--r--  2 jro jro 2 Dec 22 11:03 ./f_src_linked
+23 -rw-r--r--  2 jro jro 2 Dec 22 11:03 ./f_src_linked2
+$ echo abc >> f_src_linked
+$ cp f_src_linked copied
+$ ls -li ../r?/f_src_linked* ./f_src_linked* ./copied
+17 -rw-r--r--  2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked
+17 -rw-r--r--  2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked2
+36 -rw-r--r--  1 jro jro 6 Dec 22 11:03 ../rw/f_src_linked
+53 -rw-r--r--  1 jro jro 6 Dec 22 11:03 ./copied
+23 -rw-r--r--  2 jro jro 6 Dec 22 11:03 ./f_src_linked
+23 -rw-r--r--  2 jro jro 6 Dec 22 11:03 ./f_src_linked2
+$ cmp copied f_src_linked2
+cmp: EOF on f_src_linked2
+$
+.fi
+
+.\"
+.\" If you add/del a branch, or link/unlink the pseudo-linked
+.\" file on a branch
+.\" directly, aufs cannot keep the correct link count, but the status of
+.\" \[oq]pseudo-linked.\[cq]
+.\" Those files may or may not keep the file data after you unlink the
+.\" file on the branch directly, especially the case of your branch is
+.\" NFS.
+
+If you add a branch which has fileA or fileB, aufs does not follow the
+pseudo link. The file on the added branch has no relation to the same
+named file(s) on the lower branch(es).
+If you use noxino mount option, pseudo link will not work after the
+kernel shrinks the inode cache.
+
+This feature will not work for squashfs before version 3.2 since its
+inode is tricky.
+When the inode is hardlinked, squashfs inodes has the same inode
+number and correct link count, but the inode memory object is
+different. Squashfs inodes (before v3.2) are generated for each, even
+they are hardlinked.
+
+.\" ----------------------------------------------------------------------
+.SH User\[aq]s Direct Branch Access (UDBA)
+UDBA means a modification to a branch filesystem manually or directly,
+e.g. bypassing aufs.
+While aufs is designed and implemented to be safe after UDBA,
+it can make yourself and your aufs confused. And some information like
+aufs inode will be incorrect.
+For example, if you rename a file on a branch directly, the file on
+aufs may
+or may not be accessible through both of old and new name.
+Because aufs caches various information about the files on
+branches. And the cache still remains after UDBA.
+
+Aufs has a mount option named \[oq]udba\[cq] which specifies the test level at
+access time whether UDBA was happened or not.
+.
+.TP
+.B udba=none
+Aufs trusts the dentry and the inode cache on the system, and never
+test about UDBA. With this option, aufs runs fastest, but it may show
+you incorrect data.
+Additionally, if you often modify a branch
+directly, aufs will not be able to trace the changes of inodes on the
+branch. It can be a cause of wrong behaviour, deadlock or anything else.
+
+It is recommended to use this option only when you are sure that
+nobody access a file on a branch.
+It might be difficult for you to achieve real \[oq]no UDBA\[cq] world when you
+cannot stop your users doing \[oq]find / \-ls\[cq] or something.
+If you really want to forbid all of your users to UDBA, here is a trick
+for it.
+With this trick, users cannot see the
+branches directly and aufs runs with no problem, except \[oq]auplink\[cq] utility.
+But if you are not familiar with aufs, this trick may make
+yourself confused.
+
+.nf
+# d=/tmp/.aufs.hide
+# mkdir $d
+# for i in $branches_you_want_to_hide
+> do
+>	mount -n --bind $d $i
+> done
+.fi
+
+When you unmount the aufs, delete/modify the branch by remount, or you
+want to show the hidden branches again, unmount the bound
+/tmp/.aufs.hide.
+
+.nf
+# umount -n $branches_you_want_to_unbound
+.fi
+
+If you use FUSE filesystem as an aufs branch which supports hardlink,
+you should not set this option, since FUSE makes inode objects for
+each hardlinks (at least in linux\-2.6.23). When your FUSE filesystem
+maintains them at link/unlinking, it is equivalent
+to \[oq]direct branch access\[cq] for aufs.
+
+.
+.TP
+.B udba=reval
+Aufs tests only the existence of the file which existed. If
+the existed file was removed on the branch directly, aufs
+discard the cache about the file and
+re-lookup it. So the data will be updated.
+This test is at minimum level to keep the performance and ensure the
+existence of a file.
+This is default and aufs runs still fast.
+
+This rule leads to some unexpected situation, but I hope it is
+harmless. Those are totally depends upon cache. Here are just a few
+examples.
+.
+.RS
+.Bu
+If the file is cached as negative or
+not-existed, aufs does not test it. And the file is still handled as
+negative after a user created the file on a branch directly. If the
+file is not cached, aufs will lookup normally and find the file.
+.
+.Bu
+When the file is cached as positive or existed, and a user created the
+same named file directly on the upper branch. Aufs detects the cached
+inode of the file is still existing and will show you the old (cached)
+file which is on the lower branch.
+.
+.Bu
+When the file is cached as positive or existed, and a user renamed the
+file by rename(2) directly. Aufs detects the inode of the file is
+still existing. You may or may not see both of the old and new files.
+Todo: If aufs also tests the name, we can detect this case.
+.RE
+
+If your outer modification (UDBA) is rare and you can ignore the
+temporary and minor differences between virtual aufs world and real
+branch filesystem, then try this mount option.
+.
+.TP
+.B udba=inotify
+Aufs sets `inotify' to all the accessed directories on its branches
+and receives the event about the dir and its children. It consumes
+resources, cpu and memory. And I am afraid that the performance will be
+hurt, but it is most strict test level.
+There are some limitations of linux inotify, see also Inotify
+Limitation.
+So it is recommended to leave udba default option usually, and set it
+to inotify by remount when you need it.
+
+When a user accesses the file which was notified UDBA before, the cached data
+about the file will be discarded and aufs re-lookup it. So the data will
+be updated.
+When an error condition occurs between UDBA and aufs operation, aufs
+will return an error, including EIO.
+To use this option, you need linux\-2.6.18 and later, and need to
+enable CONFIG_INOTIFY and CONFIG_AUFS_UDBA_INOTIFY.
+
+To rename/rmdir a directory on a branch directory may reveal the same named
+directory on the lower branch. Aufs tries re-lookuping the renamed
+directory and the revealed directory and assigning different inode
+number to them. But the inode number including their children can be a
+problem. The inode numbers will be changed silently, and
+aufs may produce a warning. If you rename a directory repeatedly and
+reveal/hide the lower directory, then aufs may confuse their inode
+numbers too. It depends upon the system cache.
+
+When you make a directory in aufs and mount other filesystem on it,
+the directory in aufs cannot be removed expectedly because it is a
+mount point. But the same named directory on the writable branch can
+be removed, if someone wants. It is just an empty directory, instead
+of a mount point.
+Aufs cannot stop such direct rmdir, but produces a warning about it.
+
+If the pseudo-linked file is hardlinked or unlinked on the branch
+directly, its inode link count in aufs may be incorrect. It is
+recommended to flush the psuedo-links by auplink script.
+
+.\" ----------------------------------------------------------------------
+.SH Linux Inotify Limitation
+Unfortunately, current inotify (linux\-2.6.18) has some limitations,
+and aufs must derive it.
+
+.SS IN_ATTRIB, updating atime
+When a file/dir on a branch is accessed directly, the inode atime (access
+time, cf. stat(2)) may or may not be updated. In some cases, inotify
+does not fire this event. So the aufs inode atime may remain old.
+
+.SS IN_ATTRIB, updating nlink
+When the link count of a file on a branch is incremented by link(2)
+directly,
+inotify fires IN_CREATE to the parent
+directory, but IN_ATTRIB to the file. So the aufs inode nlink may
+remain old.
+
+.SS IN_DELETE, removing file on NFS
+When a file on a NFS branch is deleted directly, inotify may or may
+not fire
+IN_DELETE event. It depends upon the status of dentry
+(DCACHE_NFSFS_RENAMED flag).
+In this case, the file on aufs seems still exists. Aufs and any user can see
+the file.
+
+.SS IN_IGNORED, deleted rename target
+When a file/dir on a branch is unlinked by rename(2) directly, inotify
+fires IN_IGNORED which means the inode is deleted. Actually, in some
+cases, the inode survives. For example, the rename target is linked or
+opened. In this case, inotify watch set by aufs is removed by VFS and
+inotify.
+And aufs cannot receive the events anymore. So aufs may show you
+incorrect data about the file/dir.
+
+.\" ----------------------------------------------------------------------
+.SH Copy On Write, or aufs internal copyup and copydown
+Every stackable filesystem which implements copy\-on\-write supports the
+copyup feature. The feature is to copy a file/dir from the lower branch
+to the upper internally. When you have one readonly branch and one
+upper writable branch, and you append a string to a file which exists on
+the readonly branch, then aufs will copy the file from the readonly
+branch to the writable branch with its directory hierarchy. It means one
+write(2) involves several logical/internal mkdir(2), creat(2), read(2),
+write(2) and close(2) systemcalls
+before the actual expected write(2) is performed. Sometimes it may take
+a long time, particulary when the file is very large.
+If CONFIG_AUFS_DEBUG is enabled, aufs produces a message saying `copying
+a large file.\[aq]
+
+You may see the message when you change the xino file path or
+truncate the xino/xib files. Sometimes those files can be large and may
+take a long time to handle them.
+
+.\" ----------------------------------------------------------------------
+.SH Policies to Select One among Multiple Writable Branches
+Aufs has some policies to select one among multiple writable branches
+when you are going to write/modify something. There are two kinds of
+policies, one is for newly create something and the other is for
+internal copy-up.
+You can select them by specifying mount option \[oq]create=CREATE_POLICY\[cq]
+or \[oq]cpup=COPYUP_POLICY.\[cq]
+These policies have no meaning when you have only one writable
+branch. If there is some meaning, it must hurt the performance.
+
+.SS Exceptions for Policies
+In every cases below, even if the policy says that the branch where a
+new file should be created is /rw2, the file will be created on /rw1.
+.
+.Bu
+If there is a readonly branch with \[oq]wh\[cq] attribute above the
+policy-selected branch and the parent dir is marked as opaque,
+or the target (creating) file is whiteouted on the ro+wh branch, then
+the policy will be ignored and the target file will be created on the
+nearest upper writable branch than the ro+wh branch.
+.RS
+.nf
+/aufs = /rw1 + /ro+wh/diropq + /rw2
+/aufs = /rw1 + /ro+wh/wh.tgt + /rw2
+.fi
+.RE
+.
+.Bu
+If there is a writable branch above the policy-selected branch and the
+parent dir is marked as opaque or the target file is whiteouted on the
+branch, then the policy will be ignored and the target file will be
+created on the highest one among the upper writable branches who has
+diropq or whiteout. In case of whiteout, aufs removes it as usual.
+.RS
+.nf
+/aufs = /rw1/diropq + /rw2
+/aufs = /rw1/wh.tgt + /rw2
+.fi
+.RE
+.
+.Bu
+link(2) and rename(2) systemcalls are exceptions in every policy.
+They try selecting the branch where the source exists as possible since
+copyup a large file will take long time. If it can\[aq]t be, ie. the
+branch where the source exists is readonly, then they will follow the
+copyup policy.
+.
+.Bu
+There is an exception for rename(2) when the target exists.
+If the rename target exists, aufs compares the index of the branches
+where the source and the target are existing and selects the higher
+one. If the selected branch is readonly, then aufs follows the copyup
+policy.
+
+.SS Policies for Creating
+.
+.TP
+.B create=tdp | top\-down\-parent
+Selects the highest writable branch where the parent dir exists. If
+the parent dir does not exist on a writable branch, then the internal
+copyup will happen. The policy for this copyup is always \[oq]bottom-up.\[cq]
+This is the default policy.
+.
+.TP
+.B create=rr | round\-robin
+Selects a writable branch in round robin. When you have two writable
+branches and creates 10 new files, 5 files will be created for each
+branch.
+mkdir(2) systemcall is an exception. When you create 10 new directories,
+all are created on the same branch.
+.
+.TP
+.B create=mfs[:second] | most\-free\-space[:second]
+Selects a writable branch which has most free space. In order to keep
+the performance, you can specify the duration (\[oq]second\[cq]) which makes
+aufs hold the index of last selected writable branch until the
+specified seconds expires. The first time you create something in aufs
+after the specified seconds expired, aufs checks the amount of free
+space of all writable branches by internal statfs call
+and the held branch index will be updated.
+The default value is \*[AUFS_MFS_SECOND_DEF] seconds.
+.
+.TP
+.B create=mfsrr:low[:second]
+Selects a writable branch in most-free-space mode first, and then
+round-robin mode. If the selected branch has less free space than the
+specified value \[oq]low\[cq] in bytes, then aufs re-tries in round-robin mode.
+.\" \[oq]G\[cq], \[oq]M\[cq] and \[oq]K\[cq] (case insensitive) can be followed after \[oq]low.\[cq] Or
+Try an arithmetic expansion of shell which is defined by POSIX.
+For example, $((10 * 1024 * 1024)) for 10M.
+You can also specify the duration (\[oq]second\[cq]) which is equivalent to
+the \[oq]mfs\[cq] mode.
+.
+.TP
+.B create=pmfs[:second]
+Selects a writable branch where the parent dir exists, such as tdp
+mode. When the parent dir exists on multiple writable branches, aufs
+selects the one which has most free space, such as mfs mode.
+
+.SS Policies for Copy-Up
+.
+.TP
+.B cpup=tdp | top\-down\-parent
+Equivalent to the same named policy for create.
+This is the default policy.
+.
+.TP
+.B cpup=bup | bottom\-up\-parent
+Selects the writable branch where the parent dir exists and the branch
+is nearest upper one from the copyup-source.
+.
+.TP
+.B cpup=bu | bottom\-up
+Selects the nearest upper writable branch from the copyup-source,
+regardless the existence of the parent dir.
+
+.\" ----------------------------------------------------------------------
+.SH Dentry and Inode Caches
+If you want to clear caches on your system, there are several tricks
+for that. If your system ram is low,
+try \[oq]find /large/dir \-ls > /dev/null\[cq].
+It will read many inodes and dentries and cache them. Then old caches will be
+discarded.
+But when you have large ram or you do not have such large
+directory, it is not effective.
+
+If you want to discard cache within a certain filesystem,
+try \[oq]mount \-o remount /your/mntpnt\[cq]. Some filesystem may return an error of
+EINVAL or something, but VFS discards the unused dentry/inode caches on the
+specified filesystem.
+
+.\" ----------------------------------------------------------------------
+.SH Compatible/Incompatible with Unionfs Version 1.x Series
+If you compile aufs with \-DCONFIG_AUFS_COMPAT, dirs= option and =nfsro
+branch permission flag are available. They are interpreted as
+br: option and =ro flags respectively.
+ \[oq]debug\[cq], \[oq]delete\[cq], \[oq]imap\[cq] options are ignored silently. When you
+compile aufs without \-DCONFIG_AUFS_COMPAT, these three options are
+also ignored, but a warning message is issued.
+
+Ignoring \[oq]delete\[cq] option, and to keep filesystem consistency, aufs tries
+writing something to only one branch in a single systemcall. It means
+aufs may copyup even if the copyup-src branch is specified as writable.
+For example, you have two writable branches and a large regular file
+on the lower writable branch. When you issue rename(2) to the file on aufs,
+aufs may copyup it to the upper writable branch.
+If this behaviour is not what you want, then you should rename(2) it
+on the lower branch directly.
+
+And there is a simple shell
+script \[oq]unionctl\[cq] under sample subdirectory, which is compatible with
+unionctl(8) in
+Unionfs Version 1.x series, except \-\-query action.
+This script executes mount(8) with \[oq]remount\[cq] option and uses
+add/del/mod aufs mount options.
+If you are familiar with Unionfs Version 1.x series and want to use unionctl(8), you can
+try this script instead of using mount \-o remount,... directly.
+Aufs does not support ioctl(2) interface.
+This script is highly depending upon mount(8) in
+util\-linux\-2.12p package, and you need to mount /proc to use this script.
+If your mount(8) version differs, you can try modifying this
+script. It is very easy.
+The unionctl script is just for a sample usage of aufs remount
+interface.
+
+Aufs uses the external inode number bitmap and translation table by
+default.
+
+The default branch permission for the first branch is \[oq]rw\[cq], and the
+rest is \[oq]ro.\[cq]
+
+The whiteout is for hiding files on lower branches. Also it is applied
+to stop readdir going lower branches.
+The latter case is called \[oq]opaque directory.\[cq] Any
+whiteout is an empty file, it means whiteout is just an mark.
+In the case of hiding lower files, the name of whiteout is
+\[oq]\*[AUFS_WH_PFX]<filename>.\[cq]
+And in the case of stopping readdir, the name is
+\[oq]\*[AUFS_WH_PFX]\*[AUFS_WH_PFX].opq\[cq] or
+\[oq]\*[AUFS_WH_PFX]__dir_opaque.\[cq] The name depends upon your compile
+configuration
+CONFIG_AUFS_COMPAT.
+.\" All of newly created or renamed directory will be opaque.
+All whiteouts are hardlinked,
+including \[oq]<writable branch top dir>/\*[AUFS_WH_BASE].\[cq]
+
+The hardlink on an ordinary (disk based) filesystem does not
+consume inode resource newly. But in linux tmpfs, the number of free
+inodes will be decremented by link(2). It is recommended to specify
+nr_inodes option to your tmpfs if you meet ENOSPC. Use this option
+after checking by \[oq]df \-i.\[cq]
+
+When you rmdir or rename-to the dir who has a number of whiteouts,
+aufs rename the dir to the temporary whiteouted-name like
+\[oq]\*[AUFS_WH_PFX]<dir>.<random hex>.\[cq] Then remove it after actual operation.
+cf. mount option \[oq]dirwh.\[cq]
+
+.\" ----------------------------------------------------------------------
+.SH Incompatible with an Ordinary Filesystem
+stat(2) returns the inode info from the first existence inode among
+the branches, except the directory link count.
+Aufs computes the directory link count larger than the exact value usually, in
+order to keep UNIX filesystem semantics, or in order to shut find(1) mouth up.
+The size of a directory may be wrong too, but it has to do no harm.
+The timestamp of a directory will not be updated when a file is
+created or removed under it, and it was done on a lower branch.
+
+The test for permission bits has two cases. One is for a directory,
+and the other is for a non-directory. In the case of a directory, aufs
+checks the permission bits of all existing directories. It means you
+need the correct privilege for the directories including the lower
+branches.
+The test for a non-directory is more simple. It checks only the
+topmost inode.
+
+statfs(2) returns the information of the first branch info except
+namelen when \[oq]nosum\[cq] is specified (the default). The namelen is
+decreased by the whiteout prefix length. And the block size may differ
+from st_blksize which is obtained by stat(2).
+
+Remember, seekdir(3) and telldir(3) are not defined in POSIX. They may
+not work as you expect. Try rewinddir(3) or re-open the dir.
+
+The whiteout prefix (\*[AUFS_WH_PFX]) is reserved on all branches. Users should
+not handle the filename begins with this prefix.
+In order to future whiteout, the maxmum filename length is limited by
+the longest value \- \*[AUFS_WH_PFX_LEN]. It may be a violation of POSIX.
+
+If you dislike the difference between the aufs entries in /etc/mtab
+and /proc/mounts, and if you are using mount(8) in util\-linux package,
+then try ./mount.aufs utility. Copy the script to /sbin/mount.aufs.
+This simple utility tries updating
+/etc/mtab. If you do not care about /etc/mtab, you can ignore this
+utility.
+Remember this utility is highly depending upon mount(8) in
+util\-linux\-2.12p package, and you need to mount /proc.
+
+Since aufs uses its own inode and dentry, your system may cache huge
+number of inodes and dentries. It can be as twice as all of the files
+in your union.
+It means that unmounting or remounting readonly at shutdown time may
+take a long time, since mount(2) in VFS tries freeing all of the cache
+on the target filesystem.
+
+When you open a directory, aufs will open several directories
+internally.
+It means you may reach the limit of the number of file descriptor.
+And when the lower directory cannot be opened, aufs will close all the
+opened upper directories and return an error.
+
+The sub-mount under the branch
+of local filesystem
+is ignored.
+For example, if you have mount another filesystem on
+/branch/another/mntpnt, the files under \[oq]mntpnt\[cq] will be ignored by aufs.
+It is recommended to mount the sub-mount under the mounted aufs.
+For example,
+
+.nf
+# sudo mount /dev/sdaXX /ro_branch
+# d=another/mntpnt
+# sudo mount /dev/sdbXX /ro_branch/$d
+# mkdir -p /rw_branch/$d
+# sudo mount -t aufs -o br:/rw_branch:/ro_branch none /aufs
+# sudo mount -t aufs -o br:/rw_branch/${d}:/ro_branch/${d} none /aufs/another/$d
+.fi
+
+There are several characters which are not allowed to use in a branch
+directory path and xino filename. See detail in Branch Syntax and Mount
+Option.
+
+The file-lock which means fcntl(2) with F_SETLK, F_SETLKW or F_GETLK, flock(2)
+and lockf(3), is applied to virtual aufs file only, not to the file on a
+branch. It means you can break the lock by accessing a branch directly.
+TODO: check \[oq]security\[cq] to hook locks, as inotify does.
+
+The I/O to the named pipe or local socket are not handled by aufs, even
+if it exists in aufs. After the reader and the writer established their
+connection if the pipe/socket are copied-up, they keep using the old one
+instead of the copied-up one.
+
+The fsync(2) and fdatasync(2) systemcalls return 0 which means success, even
+if the given file descriptor is not opened for writing.
+I am afraid this behaviour may violate some standards. Checking the
+behaviour of fsync(2) on ext2, aufs decided to return success.
+
+If you want to use disk-quota, you should set it up to your writable
+branch since aufs does not have its own block device.
+
+When your aufs is the root directory of your system, and your system
+tells you some of the filesystem were not unmounted cleanly, try these
+procedure when you shutdown your system.
+.nf
+# mount -no remount,ro /
+# for i in $writable_branches
+# do mount -no remount,ro $i
+# done
+.fi
+If your xino file is on a hard drive, you also need to specify
+\[oq]noxino\[cq] option or \[oq]xino=/your/tmpfs/xino\[cq] at remounting root
+directory.
+
+To rename(2) directory may return EXDEV even if both of src and tgt
+are on the same aufs. When the rename-src dir exists on multiple
+branches and the lower dir has child(ren), aufs has to copyup all his
+children. It can be recursive copyup. Current aufs does not support
+such huge copyup operation at one time in kernel space, instead
+produces a warning and returns EXDEV.
+Generally, mv(1) detects this error and tries mkdir(2) and
+rename(2) or copy/unlink recursively. So the result is harmless.
+If your application which issues rename(2) for a directory does not
+support EXDEV, it will not work on aufs.
+Also this specification is applied to the case when the src directroy
+exists on the lower readonly branch and it has child(ren).
+
+If a sudden accident such like a power failure happens during aufs is
+performing, and regular fsck for branch filesystems is completed after
+the disaster, you need to extra fsck for aufs writable branches. It is
+necessary to check whether the whiteout remains incorrectly or not,
+eg. the real filename and the whiteout for it under the same parent
+directory. If such whiteout remains, aufs cannot handle the file
+correctly.
+To check the consistency from the aufs\[aq] point of view, you can use a
+simple shell script called /sbin/auchk. Its purpose is a fsck tool for
+aufs, and it checks the illegal whiteout, the remained
+pseudo-links and the remained aufs-temp files. If they are found, the
+utility reports you and asks whether to delete or not.
+It is recommended to execute /sbin/auchk for every writable branch
+filesystem before mouting aufs if the system experienced crash.
+
+
+.\" ----------------------------------------------------------------------
+.SH EXAMPLES
+The mount options are interpreted from left to right at remount-time.
+These examples
+shows how the options are handled. (assuming /sbin/mount.aufs was
+installed)
+
+.nf
+# mount -v -t aufs br:/day0:/base none /u
+none on /u type aufs (rw,xino=/day0/.aufs.xino,br:/day0=rw:/base=ro)
+# mount -v -o remount,\\
+	prepend:/day1,\\
+	xino=/day1/xino,\\
+	mod:/day0=ro,\\
+	del:/day0 \\
+	/u
+none on /u type aufs (rw,xino=/day1/xino,br:/day1=rw:/base=ro)
+.fi
+
+.nf
+# mount -t aufs br:/rw none /u
+# mount -o remount,append:/ro /u
+different uid/gid/permission, /ro
+# mount -o remount,del:/ro /u
+# mount -o remount,nowarn_perm,append:/ro /u
+#
+(there is no warning)
+.fi
+
+.\" If you want to expand your filesystem size, aufs may help you by
+.\" adding an writable branch. Since aufs supports multiple writable
+.\" branches, the old writable branch can be being writable, if you want.
+.\" In this example, any modifications to the files under /ro branch will
+.\" be copied-up to /new, but modifications to the files under /rw branch
+.\" will not.
+.\" And the next example shows the modifications to the files under /rw branch
+.\" will be copied-up to /new/a.
+.\"
+.\" Todo: test multiple writable branches policy. cpup=nearest, cpup=exist_parent.
+.\"
+.\" .nf
+.\" # mount -v -t aufs br:/rw:/ro none /u
+.\" none on /u type aufs (rw,xino=/rw/.aufs.xino,br:/rw=rw:/ro=ro)
+.\" # mkfs /new
+.\" # mount -v -o remount,add:1:/new=rw /u
+.\" none on /u type aufs (rw,xino=/rw/.aufs.xino,br:/rw=rw:/new=rw:/ro=ro)
+.\" .fi
+.\"
+.\" .nf
+.\" # mount -v -t aufs br:/rw:/ro none /u
+.\" none on /u type aufs (rw,xino=/rw/.aufs.xino,br:/rw=rw:/ro=ro)
+.\" # mkfs /new
+.\" # mkdir /new/a new/b
+.\" # mount -v -o remount,add:1:/new/b=rw,prepend:/new/a,mod:/rw=ro /u
+.\" none on /u type aufs (rw,xino=/rw/.aufs.xino,br:/new/a=rw:/rw=ro:/new/b=rw:/ro=ro)
+.\" .fi
+
+When you use aufs as root filesystem, it is recommended to consider to
+exclude some directories. For example, /tmp and /var/log are not need
+to stack in many cases. They do not usually need to copyup or to whiteout.
+Also the swapfile on aufs (a regular file, not a block device) is not
+supported.
+In order to exclude the specific dir from aufs, try bind mounting.
+
+And there is a good sample which is for network booted diskless machines. See
+sample/ in detail.
+
+.\" ----------------------------------------------------------------------
+.SH DIAGNOSTICS
+When you add a branch to your union, aufs may warn you about the
+privilege or security of the branch, which is the permission bits,
+owner and group of the top directory of the branch.
+For example, when your upper writable branch has a world writable top
+directory,
+a malicious user can create any files on the writable branch directly,
+like copyup and modify manually. I am afraid it can be a security
+issue.
+
+When you mount or remount your union without \-o ro common mount option
+and without writable branch, aufs will warn you that the first branch
+should be writable.
+
+.\" It is discouraged to set both of \[oq]udba\[cq] and \[oq]noxino\[cq] mount options. In
+.\" this case the inode number under aufs will always be changed and may
+.\" reach the end of inode number which is a maximum of unsigned long. If
+.\" the inode number reaches the end, aufs will return EIO repeatedly.
+
+When you set udba other than inotify and change something on your
+branch filesystem directly, later aufs may detect some mismatches to
+its cache. If it is a critical mismatch, aufs returns EIO.
+
+When an error occurs in aufs, aufs prints the kernel message with
+\[oq]errno.\[cq] The priority of the message (log level) is ERR or WARNING which
+depends upon the message itself.
+You can convert the \[oq]errno\[cq] into the error message by perror(3),
+strerror(3) or something.
+For example, the \[oq]errno\[cq] in the message \[oq]I/O Error, write failed (\-28)\[cq]
+is 28 which means ENOSPC or \[oq]No space left on device.\[cq]
+
+.\" .SH Current Limitation
+.
+.\" ----------------------------------------------------------------------
+.\" SYNOPSIS
+.\" briefly describes the command or function\[aq]s interface.  For  commands,  this
+.\" shows the syntax of the command and its arguments (including options); bold-
+.\" face is used for as-is text and italics are  used  to  indicate  replaceable
+.\" arguments. Brackets ([]) surround optional arguments, vertical bars (|) sep-
+.\" arate choices, and ellipses (...) can be repeated.  For functions, it  shows
+.\" any required data declarations or #include directives, followed by the func-
+.\" tion declaration.
+.
+.\" DESCRIPTION
+.\" gives an explanation of what the command, function, or format does.  Discuss
+.\" how  it  interacts  with  files  and standard input, and what it produces on
+.\" standard output  or  standard  error.   Omit  internals  and  implementation
+.\" details  unless  they\[aq]re critical for understanding the interface.  Describe
+.\" the usual case; for information on options  use  the  OPTIONS  section.   If
+.\" there  is some kind of input grammar or complex set of subcommands, consider
+.\" describing them in a separate USAGE section (and just place an  overview  in
+.\" the DESCRIPTION section).
+.
+.\" RETURN VALUE
+.\" gives a list of the values the library routine will return to the caller and
+.\" the conditions that cause these values to be returned.
+.
+.\" EXIT STATUS
+.\" lists the possible exit status values or a program and the  conditions  that
+.\" cause these values to be returned.
+.
+.\" USAGE
+.\" describes the grammar of any sublanguage this implements.
+.
+.\" FILES
+.\" lists  the  files the program or function uses, such as configuration files,
+.\" startup files, and files the program directly operates on.   Give  the  full
+.\" pathname  of  these  files,  and  use the installation process to modify the
+.\" directory part to match user preferences.  For many  programs,  the  default
+.\" installation  location is in /usr/local, so your base manual page should use
+.\" /usr/local as the base.
+.
+.\" ENVIRONMENT
+.\" lists all environment variables that affect your program or function and how
+.\" they affect it.
+.
+.\" SECURITY
+.\" discusses security issues and implications.  Warn  about  configurations  or
+.\" environments  that should be avoided, commands that may have security impli-
+.\" cations, and so on, especially if they aren\[aq]t obvious.  Discussing  security
+.\" in  a  separate section isn\[aq]t necessary; if it\[aq]s easier to understand, place
+.\" security information in the other sections (such as the DESCRIPTION or USAGE
+.\" section).  However, please include security information somewhere!
+.
+.\" CONFORMING TO
+.\" describes any standards or conventions this implements.
+.
+.\" NOTES
+.\" provides miscellaneous notes.
+.
+.\" BUGS
+.\" lists  limitations,  known defects or inconveniences, and other questionable
+.\" activities.
+
+.SH COPYRIGHT
+Copyright \(co 2005\-2009 Junjiro R. Okajima
+
+.SH AUTHOR
+Junjiro R. Okajima
+
+.\" SEE ALSO
+.\" lists  related  man  pages in alphabetical order, possibly followed by other
+.\" related pages or documents.  Conventionally this is the last section.
diff --git a/Documentation/filesystems/aufs/design/01intro.txt b/Documentation/filesystems/aufs/design/01intro.txt
new file mode 100644
index 0000000..4955fdc
--- /dev/null
+++ b/Documentation/filesystems/aufs/design/01intro.txt
@@ -0,0 +1,128 @@
+
+# Copyright (C) 2005-2009 Junjiro R. Okajima
+# 
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+Introduction
+----------------------------------------
+
+aufs [ei ju: ef es] | [a u f s]
+1. abbrev. for "advanced multi-layered unification filesystem".
+2. abbrev. for "another unionfs".
+3. abbrev. for "auf das" in German which means "on the" in English.
+   Ex. "Butter aufs Brot"(G) means "butter onto bread"(E).
+   But "Filesystem aufs Filesystem" is hard to understand.
+
+AUFS is a filesystem with features:
+- multi layered stackable unification filesystem, the member directory
+  is called as a branch.
+- branch permission and attribute, 'readonly', 'real-readonly',
+  'readwrite', 'whiteout-able', 'link-able whiteout' and their
+  combination.
+- internal "file copy-on-write".
+- logical deletion, whiteout.
+- dynamic branch manipulation, adding, deleting and changing permission.
+- allow bypassing aufs, user's direct branch access.
+- external inode number translation table and bitmap which maintains the
+  persistent aufs inode number.
+- seekable directory, including NFS readdir.
+- file mapping, mmap and sharing pages.
+- pseudo-link, hardlink over branches.
+- loopback mounted filesystem as a branch.
+- several policies to select one among multiple writable branches.
+- revert a single systemcall when an error occurs in aufs.
+- and more...
+
+
+Multi Layered Stackable Unification Filesystem
+----------------------------------------------------------------------
+Most people already knows what it is.
+It is a filesystem which unifies several directories and provides a
+merged single directory. When users access a file, the access will be
+passed/re-directed/converted (sorry, I am not sure which English word is
+correct) to the real file on the member filesystem. The member
+filesystem is called 'lower filesystem' or 'branch' and has a mode
+'readonly' and 'readwrite.' And the deletion for a file on the lower
+readonly branch is handled by creating 'whiteout' on the upper writable
+branch.
+
+On LKML, there have been discussions about UnionMount (Jan Blunck and
+Bharata B Rao) and Unionfs (Erez Zadok). They took different approaches
+to implement the merged-view.
+The former tries putting it into VFS, and the latter implements as a
+separate filesystem.
+(If I misunderstand about these implementations, please let me know and
+I shall correct it. Because it is a long time ago when I read their
+source files last time).
+UnionMount's approach will be able to small, but may be hard to share
+branches between several UnionMount since the whiteout in it is
+implemented in the inode on branch filesystem and always
+shared. According to Bharata's post, readdir does not seems to be
+finished yet.
+Unionfs has a longer history. When I started implementing a stacking filesystem
+(Aug 2005), it already existed. It has virtual super_block, inode,
+dentry and file objects and they have an array pointing lower same kind
+objects. After contributing many patches for Unionfs, I re-started my
+project AUFS (Jun 2006).
+
+In AUFS, the structure of filesystem resembles to Unionfs, but I
+implemented my own ideas, approaches and enhancements and it became
+totally different one.
+
+
+Several characters/aspects of aufs
+----------------------------------------------------------------------
+
+Aufs has several characters or aspects.
+1. a filesystem, callee of VFS helper
+2. sub-VFS, caller of VFS helper for branches
+3. a virtual filesystem which maintains persistent inode number
+4. reader/writer of files on branches such like an application
+
+1. Caller of VFS Helper
+As an ordinary linux filesystem, aufs is a callee of VFS. For instance,
+unlink(2) from an application reaches sys_unlink() kernel function and
+then vfs_unlink() is called. vfs_unlink() is one of VFS helper and it
+calls filesystem specific unlink operation. Actually aufs implements the
+unlink operation but it behaves like a redirector.
+
+2. Caller of VFS Helper for Branches
+aufs_unlink() passes the unlink request to the branch filesystem as if
+it were called from VFS. So the called unlink operation of the branch
+filesystem acts as usual. As a caller of VFS helper, aufs should handle
+every necessary pre/post operation for the branch filesystem.
+- acquire the lock for the parent dir on a branch
+- lookup in a branch
+- revalidate dentry on a branch
+- mnt_want_write() for a branch
+- vfs_unlink() for a branch
+- mnt_drop_write() for a branch
+- release the lock on a branch
+
+3. Persistent Inode Number
+One of the most important issue for a filesystem is to maintain inode
+numbers. This is particularly important to support exporting a
+filesystem via NFS. Aufs is a virtual filesystem which doesn't have a
+backend block device for its own. But some storage is necessary to
+maintain inode number. It may be a large space and may not suit to keep
+in memory. Aufs rents some space from its first writable branch
+filesystem (by default) and creates file(s) on it. These files are
+created by aufs internally and removed soon (currently) keeping opened.
+Note: Because these files are removed, they are totally gone after
+      unmounting aufs. It means the inode numbers are not persistent
+      across unmount or reboot. I have a plan to make them really
+      persistent which will be important for aufs on NFS server.
+
+4. Read/Write Files Internally (copy-on-write)
+Because a branch can be readonly, when you write a file on it, aufs will
+"copy-up" it to the upper writable branch internally. And then write the
+originally requested thing to the file. Generally kernel doesn't
+open/read/write file actively. In aufs, even a single write may cause a
+internal "file copy". This behaviour is very similar to cp(1) command.
+
+Some people may think it is better to pass such work to user space
+helper, instead of doing in kernel space. Actually I am still thinking
+about it. But currently I have implemented it in kernel space.
diff --git a/Documentation/filesystems/aufs/design/02struct.txt b/Documentation/filesystems/aufs/design/02struct.txt
new file mode 100644
index 0000000..6db666b
--- /dev/null
+++ b/Documentation/filesystems/aufs/design/02struct.txt
@@ -0,0 +1,205 @@
+
+# Copyright (C) 2005-2009 Junjiro R. Okajima
+# 
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+Basic Aufs Internal Structure
+
+Superblock/Inode/Dentry/File Objects
+----------------------------------------------------------------------
+As like an ordinary filesystem, aufs has its own
+superblock/inode/dentry/file objects. All these objects have a
+dynamically allocated array and store the same kind of pointers to the
+lower filesystem, branch.
+For example, when you build a union with one readwrite branch and one
+readonly, mounted /au, /rw and /ro respectively.
+- /au = /rw + /ro
+- /ro/fileA exists but /rw/fileA
+
+Aufs lookup operation finds /ro/fileA and gets dentry for that. These
+pointers are stored in a aufs dentry. The array in aufs dentry will be,
+- [0] = NULL
+- [1] = /ro/fileA
+
+This style of an array is essentially same to the aufs
+superblock/inode/dentry/file objects.
+
+Because aufs supports manipulating branches, ie. add/delete/change
+dynamically, these objects has its own generation. When branches are
+changed, the generation in aufs superblock is incremented. And a
+generation in other object are compared when it is accessed.
+When a generation in other objects are obsoleted, aufs refreshes the
+internal array.
+
+
+Superblock
+----------------------------------------------------------------------
+Additionally aufs superblock has some data for policies to select one
+among multiple writable branches, XIB files, pseudo-links and kobject.
+See below in detail.
+About the policies which supports copy-down a directory, see policy.txt
+too.
+
+
+Branch and XINO(External Inode Number Translation Table)
+----------------------------------------------------------------------
+Every branch has its own xino (external inode number translation table)
+file. The xino file is created and unlinked by aufs internally. When two
+members of a union exist on the same filesystem, they share the single
+xino file.
+The struct of a xino file is simple, just a sequence of aufs inode
+numbers which is indexed by the lower inode number.
+In the above sample, assume the inode number of /ro/fileA is i111 and
+aufs assigns the inode number i999 for fileA. Then aufs writes 999 as
+4(8) bytes at 111 * 4(8) bytes offset in the xino file.
+
+When the inode numbers are not contiguous, the xino file will be sparse
+which has a hole in it and doesn't consume as much disk space as it
+might appear. If your branch filesystem consumes disk space for such
+holes, then you should specify 'xino=' option at mounting aufs.
+
+Also a writable branch has three kinds of "whiteout bases". All these
+are existed when the branch is joined to aufs and the names are
+whiteout-ed doubly, so that users will never see their names in aufs
+hierarchy.
+1. a regular file which will be linked to all whiteouts.
+2. a directory to store a pseudo-link.
+3. a directory to store an "orphan-ed" file temporary.
+
+1. Whiteout Base
+   When you remove a file on a readonly branch, aufs handles it as a
+   logical deletion and creates a whiteout on the upper writable branch
+   as a hardlink of this file in order not to consume inode on the
+   writable branch.
+2. Pseudo-link Dir
+   See below, Pseudo-link.
+3. Step-Parent Dir
+   When "fileC" exists on the lower readonly branch only and it is
+   opened and removed with its parent dir, and then user writes
+   something into it, then aufs copies-up fileC to this
+   directory. Because there is no other dir to store fileC. After
+   creating a file under this dir, the file is unlinked.
+
+Because aufs supports manipulating branches, ie. add/delete/change
+dynamically, a branch has its own id. When the branch order changes, aufs
+finds the new index by searching the branch id.
+
+
+Pseudo-link
+----------------------------------------------------------------------
+Assume "fileA" exists on the lower readonly branch only and it is
+hardlinked to "fileB" on the branch. When you write something to fileA,
+aufs copies-up it to the upper writable branch. Additionally aufs
+creates a hardlink under the Pseudo-link Directory of the writable
+branch. The inode of a pseudo-link is kept in aufs super_block as a
+simple list. If fileB is read after unlinking fileA, aufs returns
+filedata from the pseudo-link instead of the lower readonly
+branch. Because the pseudo-link is based upon the inode, to keep the
+inode number by xino (see above) is important.
+
+All the hardlinks under the Pseudo-link Directory of the writable branch
+should be restored in a proper location later. Aufs provides a utility
+to do this. The userspace helpers executed at remounting and unmounting
+aufs by default.
+
+
+XIB(external inode number bitmap)
+----------------------------------------------------------------------
+Addition to the xino file per a branch, aufs has an external inode number
+bitmap in a superblock object. It is also a file such like a xino file.
+It is a simple bitmap to mark whether the aufs inode number is in-use or
+not.
+To reduce the file I/O, aufs prepares a single memory page to cache xib.
+
+Aufs implements a feature to truncate/refresh both of xino and xib to
+reduce the number of consumed disk blocks for these files.
+
+
+Virtual or Vertical Dir
+----------------------------------------------------------------------
+In order to support multiple layers (branches), aufs readdir operation
+constructs a virtual dir block on memory. For readdir, aufs calls
+vfs_readdir() internally for each dir on branches, merges their entries
+with eliminating the whiteout-ed ones, and sets it to file (dir)
+object. So the file object has its entry list until it is closed. The
+entry list will be updated when the file position is zero and becomes
+old. This decision is made in aufs automatically.
+
+The dynamically allocated memory block for the name of entries has a
+unit of 512 bytes (by default) and stores the names contiguously (no
+padding). Another block for each entry is handled by kmem_cache too.
+During building dir blocks, aufs creates hash list and judging whether
+the entry is whiteouted by its upper branch or already listed.
+
+Some people may call it can be a security hole or invite DoS attack
+since the opened and once readdir-ed dir (file object) holds its entry
+list and becomes a pressure for system memory. But I'd say it is similar
+to files under /proc or /sys. The virtual files in them also holds a
+memory page (generally) while they are opened. When an idea to reduce
+memory for them is introduced, it will be applied to aufs too.
+
+
+Workqueue
+----------------------------------------------------------------------
+Aufs sometimes requires privilege access to a branch. For instance,
+in copy-up/down operation. When a user process is going to make changes
+to a file which exists in the lower readonly branch only, and the mode
+of one of ancestor directories may not be writable by a user
+process. Here aufs copy-up the file with its ancestors and they may
+require privilege to set its owner/group/mode/etc.
+This is a typical case of a application character of aufs (see
+Introduction).
+
+Aufs uses workqueue synchronously for this case. It creates its own
+workqueue. The workqueue is a kernel thread and has privilege. Aufs
+passes the request to call mkdir or write (for example), and wait for
+its completion. This approach solves a problem of a signal handler
+simply.
+If aufs didn't adopt the workqueue and changed the privilege of the
+process, and if the mkdir/write call arises SIGXFSZ or other signal,
+then the user process might gain a privilege or the generated core file
+was owned by a superuser. But I have a plan to switch to a new
+credential approach which will be introduced in linux-2.6.29.
+
+Also aufs uses the system global workqueue ("events" kernel thread) too
+for asynchronous tasks, such like handling inotify, re-creating a
+whiteout base and etc. This is unrelated to a privilege.
+Most of aufs operation tries acquiring a rw_semaphore for aufs
+superblock at the beginning, at the same time waits for the completion
+of all queued asynchronous tasks.
+
+
+Whiteout
+----------------------------------------------------------------------
+The whiteout in aufs is very similar to Unionfs's. That is represented
+by its filename. UnionMount takes an approach of a file mode, but I am
+afraid several utilities (find(1) or something) will have to support it.
+
+Basically the whiteout represents "logical deletion" which stops aufs to
+lookup further, but also it represents "dir is opaque" which also stop
+lookup.
+
+In aufs, rmdir(2) and rename(2) for dir uses whiteout alternatively.
+In order to make several functions in a single systemcall to be
+revertible, aufs adopts an approach to rename a directory to a temporary
+unique whiteouted name.
+For example, in rename(2) dir where the target dir already existed, aufs
+renames the target dir to a temporary unique whiteouted name before the
+actual rename on a branch and then handles other actions (make it opaque,
+update the attributes, etc). If an error happens in these actions, aufs
+simply renames the whiteouted name back and returns an error. If all are
+succeeded, aufs registers a function to remove the whiteouted unique
+temporary name completely and asynchronously to the system global
+workqueue.
+
+
+Copy-up
+----------------------------------------------------------------------
+It is a well-known feature or concept.
+When user modifies a file on a readonly branch, aufs operate "copy-up"
+internally and makes change to the new file on the upper writable branch.
+When the trigger systemcall does not update the timestamps of the parent
+dir, aufs reverts it after copy-up.
diff --git a/Documentation/filesystems/aufs/design/03lookup.txt b/Documentation/filesystems/aufs/design/03lookup.txt
new file mode 100644
index 0000000..36ddac7
--- /dev/null
+++ b/Documentation/filesystems/aufs/design/03lookup.txt
@@ -0,0 +1,95 @@
+
+# Copyright (C) 2005-2009 Junjiro R. Okajima
+# 
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+Lookup in a Branch
+----------------------------------------------------------------------
+Since aufs has a character of sub-VFS (see Introduction), it operates
+lookup for branches as VFS does. It may be a heavy work. Generally
+speaking struct nameidata is a bigger structure and includes many
+information. But almost all lookup operation in aufs is the simplest
+case, ie. lookup only an entry directly connected to its parent. Digging
+down the directory hierarchy is unnecessary.
+
+VFS has a function lookup_one_len() for that use, but it is not usable
+for a branch filesystem which requires struct nameidata. So aufs
+implements a simple lookup wrapper function. When a branch filesystem
+allows NULL as nameidata, it calls lookup_one_len(). Otherwise it builds
+a simplest nameidata and calls lookup_hash().
+Here aufs applies "a principle in NFSD", ie. if the filesystem supports
+NFS-export, then it has to support NULL as a nameidata parameter for
+->create(), ->lookup() and ->d_revalidate(). So the lookup wrapper in
+aufs tests if ->s_export_op in the branch is NULL or not.
+
+When a branch is a remote filesystem, aufs trusts its ->d_revalidate().
+For d_revalidate, aufs implements three levels of revalidate tests. See
+"Revalidate Dentry and UDBA" in detail.
+
+
+Loopback Mount
+----------------------------------------------------------------------
+Basically aufs supports any type of filesystem and block device for a
+branch (actually there are some exceptions). But it is prohibited to add
+a loopback mounted one whose backend file exists in a filesystem which is
+already added to aufs. The reason is to protect aufs from a recursive
+lookup. If it was allowed, the aufs lookup operation might re-enter a
+lookup for the loopback mounted branch in the same context, and will
+cause a deadlock.
+
+
+Revalidate Dentry and UDBA (User's Direct Branch Access)
+----------------------------------------------------------------------
+Generally VFS helpers re-validate a dentry as a part of lookup.
+0. digging down the directory hierarchy.
+1. lock the parent dir by its i_mutex.
+2. lookup the final (child) entry.
+3. revalidate it.
+4. call the actual operation (create, unlink, etc.)
+5. unlock the parent dir
+
+If the filesystem implements its ->d_revalidate() (step 3), then it is
+called. Actually aufs implements it and checks the dentry on a branch is
+still valid.
+But it is not enough. Because aufs has to release the lock for the
+parent dir on a branch at the end of ->lookup() (step 2) and
+->d_revalidate() (step 3) while the i_mutex of the aufs dir is still
+held by VFS.
+If the file on a branch is changed directly, eg. bypassing aufs, after
+aufs released the lock, then the subsequent operation may cause
+something unpleasant result.
+
+This situation is a result of VFS architecture, ->lookup() and
+->d_revalidate() is separated. But I never say it is wrong. It is a good
+design from VFS's point of view. It is just not suitable for sub-VFS
+character in aufs.
+
+Aufs supports such case by three level of revalidation which is
+selectable by user.
+1. Simple Revalidate
+   Addition to the native flow in VFS's, confirm the child-parent
+   relationship on the branch just after locking the parent dir on the
+   branch in the "actual operation" (step 4). When this validation
+   fails, aufs returns EBUSY. ->d_revalidate() (step 3) in aufs still
+   checks the validation of the dentry on branches.
+2. Monitor Changes Internally by Inotify
+   Addition to above, in the "actual operation" (step 4) aufs re-lookup
+   the dentry on the branch, and returns EBUSY if it finds different
+   dentry.
+   Additionally, aufs sets the inotify watch for every dir on branches
+   during it is in cache. When the event is notified, aufs registers a
+   function to kernel 'events' thread by schedule_work(). And the
+   function sets some special status to the cached aufs dentry and inode
+   private data. If they are not cached, then aufs has nothing to
+   do. When the same file is accessed through aufs (step 0-3) later,
+   aufs will detect the status and refresh all necessary data.
+   In this mode, aufs has to ignore the event which is fired by aufs
+   itself.
+3. No Extra Validation
+   This is the simplest test and doesn't add any additional revalidation
+   test, and skip therevalidatin in step 4. It is useful and improves
+   aufs performance when system surely hide the aufs branches from user,
+   by over-mounting something (or another method).
diff --git a/Documentation/filesystems/aufs/design/04branch.txt b/Documentation/filesystems/aufs/design/04branch.txt
new file mode 100644
index 0000000..78432c5
--- /dev/null
+++ b/Documentation/filesystems/aufs/design/04branch.txt
@@ -0,0 +1,67 @@
+
+# Copyright (C) 2005-2009 Junjiro R. Okajima
+# 
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+Branch Manipulation
+
+Since aufs supports dynamic branch manipulation, ie. add/remove a branch
+and changing its permission/attribute, there are a lot of works to do.
+
+
+Add a Branch
+----------------------------------------------------------------------
+o Confirm the adding dir exists outside of aufs, including loopback
+  mount.
+- and other various attributes...
+o Initialize the xino file and whiteout bases if necessary.
+  See struct.txt.
+
+o Check the owner/group/mode of the directory
+  When the owner/group/mode of the adding directory differs from the
+  existing branch, aufs issues a warning because it may impose a
+  security risk.
+  For example, when a upper writable branch has a world writable empty
+  top directory, a malicious user can create any files on the writable
+  branch directly, like copy-up and modify manually. If something like
+  /etc/{passwd,shadow} exists on the lower readonly branch but the upper
+  writable branch, and the writable branch is world-writable, then a
+  malicious guy may create /etc/passwd on the writable branch directly
+  and the infected file will be valid in aufs.
+  I am afraid it can be a security issue, but nothing to do except
+  producing a warning.
+
+
+Delete a Branch
+----------------------------------------------------------------------
+o Confirm the deleting branch is not busy
+  To be general, there is one merit to adopt "remount" interface to
+  manipulate branches. It is to discard caches. At deleting a branch,
+  aufs checks the still cached (and connected) dentries and inodes. If
+  there are any, then they are all in-use. An inode without its
+  corresponding dentry can be alive alone (for example, inotify case).
+
+  For the cached one, aufs checks whether the same named entry exists on
+  other branches.
+  If the cached one is a directory, because aufs provides a merged view
+  to users, as long as one dir is left on any branch aufs can show the
+  dir to users. In this case, the branch can be removed from aufs.
+  Otherwise aufs rejects deleting the branch.
+
+  If any file on the deleting branch is opened by aufs, then aufs
+  rejects deleting.
+
+
+Modify the Permission of a Branch
+----------------------------------------------------------------------
+o Re-initialize or remove the xino file and whiteout bases if necessary.
+  See struct.txt.
+
+o rw --> ro: Confirm the modifying branch is not busy
+  Aufs rejects the request if any of these conditions are true.
+  - a file on the branch is mmap-ed.
+  - a regular file on the branch is opened for write and there is no
+    same named entry on the upper branch.
diff --git a/Documentation/filesystems/aufs/design/05wbr_policy.txt b/Documentation/filesystems/aufs/design/05wbr_policy.txt
new file mode 100644
index 0000000..d9720ca
--- /dev/null
+++ b/Documentation/filesystems/aufs/design/05wbr_policy.txt
@@ -0,0 +1,57 @@
+
+# Copyright (C) 2005-2009 Junjiro R. Okajima
+# 
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+
+Policies to Select One among Multiple Writable Branches
+----------------------------------------------------------------------
+When the number of writable branch is more than one, aufs has to decide
+the target branch for file creation or copy-up. By default, the highest
+writable branch which has the parent (or ancestor) dir of the target
+file is chosen (top-down-parent policy).
+By user's request, aufs implements some other policies to select the
+writable branch, for file creation two policies, round-robin and
+most-free-space policies. For copy-up three policies, top-down-parent,
+bottom-up-parent and bottom-up policies.
+
+As expected, the round-robin policy selects the branch in circular. When
+you have two writable branches and creates 10 new files, 5 files will be
+created for each branch. mkdir(2) systemcall is an exception. When you
+create 10 new directories, all will be created on the same branch.
+And the most-free-space policy selects the one which has most free
+space among the writable branches. The amount of free space will be
+checked by aufs internally, and users can specify its time interval.
+
+The policies for copy-up is more simple,
+top-down-parent is equivalent to the same named on in create policy,
+bottom-up-parent selects the writable branch where the parent dir
+exists and the nearest upper one from the copyup-source,
+bottom-up selects the nearest upper writable branch from the
+copyup-source, regardless the existence of the parent dir.
+
+There are some rules or exceptions to apply these policies.
+- If there is a readonly branch above the policy-selected branch and
+  the parent dir is marked as opaque (a variation of whiteout), or the
+  target (creating) file is whiteout-ed on the upper readonly branch,
+  then the result of the policy is ignored and the target file will be
+  created on the nearest upper writable branch than the readonly branch.
+- If there is a writable branch above the policy-selected branch and
+  the parent dir is marked as opaque or the target file is whiteouted
+  on the branch, then the result of the policy is ignored and the target
+  file will be created on the highest one among the upper writable
+  branches who has diropq or whiteout. In case of whiteout, aufs removes
+  it as usual.
+- link(2) and rename(2) systemcalls are exceptions in every policy.
+  They try selecting the branch where the source exists as possible
+  since copyup a large file will take long time. If it can't be,
+  ie. the branch where the source exists is readonly, then they will
+  follow the copyup policy.
+- There is an exception for rename(2) when the target exists.
+  If the rename target exists, aufs compares the index of the branches
+  where the source and the target exists and selects the higher
+  one. If the selected branch is readonly, then aufs follows the
+  copyup policy.
diff --git a/Documentation/filesystems/aufs/design/06fmode_exec.txt b/Documentation/filesystems/aufs/design/06fmode_exec.txt
new file mode 100644
index 0000000..b172cd1
--- /dev/null
+++ b/Documentation/filesystems/aufs/design/06fmode_exec.txt
@@ -0,0 +1,24 @@
+
+# Copyright (C) 2005-2009 Junjiro R. Okajima
+# 
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+FMODE_EXEC and deny_write()
+----------------------------------------------------------------------
+Generally Unix prevents an executing file from writing its filedata.
+In linux it is implemented by deny_write() and allow_write().
+When a file is executed by exec() family, open_exec() (and sys_uselib())
+they opens the file and calls deny_write(). If the file is aufs's virtual
+one, it has no meaning. The file which deny_write() is really necessary
+is the file on a branch. But the FMODE_EXEC flag is not passed to
+->open() operation. So aufs adopt a dirty trick.
+
+- in order to get FMODE_EXEC, aufs ->lookup() and ->d_revalidate() set
+  nd->intent.open.file->private_data to nd->intent.open.flags temporary.
+- in aufs ->open(), when FMODE_EXEC is set in file->private_data, it
+  calls deny_write() for the file on a branch.
+- when the aufs file is released, allow_write() for the file on a branch
+  is called.
diff --git a/Documentation/filesystems/aufs/design/07mmap.txt b/Documentation/filesystems/aufs/design/07mmap.txt
new file mode 100644
index 0000000..d751c42
--- /dev/null
+++ b/Documentation/filesystems/aufs/design/07mmap.txt
@@ -0,0 +1,44 @@
+
+# Copyright (C) 2005-2009 Junjiro R. Okajima
+# 
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+mmap(2) -- File Memory Mapping
+----------------------------------------------------------------------
+In aufs, the file-mapped pages are shared between the file on a branch
+and the virtual one in aufs by overriding vm_operation, particularly
+->fault().
+
+In aufs_mmap(),
+- get and store vm_ops of the real file on a branch.
+- map the file of aufs by generic_file_mmap() and set aufs's vm
+  operations.
+
+In aufs_fault(),
+- get the file of aufs from the passed vma, sleep if needed.
+- get the real file on a branch from the aufs file.
+- a race may happen. for instance a multithreaded library. so some lock
+  is implemented.
+- call ->fault() in the previously stored vm_ops with setting the
+  real file on a branch to vm_file.
+- restore vm_file and wake_up if someone else got sleep.
+
+When a branch is added to or deleted from aufs, the same-named file may
+unveil and its contents will be replaced by the new one when a process
+read(2) through previously opened file.
+(Some users may not want to refresh the filedata. For such users, I
+have a plan to implement a mount option 'refrof' which decides to
+refresh the opened files or not. See plan.txt too.)
+In this case, an already mapped file will not be updated since the
+contents are a part of a process already and it should not be changed by
+aufs branch manipulation. (Even if MAP_SHARED is specified, currently).
+Of course, in case of the deleting branch has a busy file, it cannot be
+deleted from the union.
+
+In Unionfs, it took an approach which the memory pages mapped to
+filedata are copied from the lower (real) file into the Unionfs's
+virtual one and handles it by address_space operations. Recently Unionfs
+changed it to this approach which aufs adopted since Jul 2006.
diff --git a/Documentation/filesystems/aufs/design/08plan.txt b/Documentation/filesystems/aufs/design/08plan.txt
new file mode 100644
index 0000000..d94bc5b
--- /dev/null
+++ b/Documentation/filesystems/aufs/design/08plan.txt
@@ -0,0 +1,169 @@
+
+# Copyright (C) 2005-2009 Junjiro R. Okajima
+# 
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+Plan
+
+Restoring some features which was implemented in aufs1.
+They were dropped in aufs2 in order to make source files simpler and
+easier to be reviewed.
+
+
+Export Aufs via NFS
+----------------------------------------------------------------------
+Here is an approach I adopt in aufs1.
+- like xino/xib, add a new file 'xigen' which stores aufs inode
+  generation.
+- iget_locked(): initialize aufs inode generation for a new inode, and
+  store it in xigen file.
+- destroy_inode(): increment aufs inode generation and store it in xigen
+  file. it is necessary even if it is not unlinked, because any data of
+  inode may be changed by UDBA.
+- encode_fh(): for a root dir, simply return FILEID_ROOT. otherwise
+  build file handle by
+  + branch id (4 bytes)
+  + superblock generation (4 bytes)
+  + inode number (4 or 8 bytes)
+  + parent dir inode number (4 or 8 bytes)
+  + inode generation (4 bytes))
+  + return value of exportfs_encode_fh() for the parent on a branch (4
+    bytes)
+  + file handle for a branch (by exportfs_encode_fh())
+- fh_to_dentry():
+  + find the index of a branch from its id in handle, and check it is
+    still exist in aufs.
+  + 1st level: get the inode number from handle and search it in cache.
+  + 2nd level: if not found, get the parent inode number from handle and
+    search it in cache. and then open the parent dir, find the matching
+    inode number by vfs_readdir() and get its name, and call
+    lookup_one_len() for the target dentry.
+  + 3rd level: if the parent dir is not cached, call
+    exportfs_decode_fh() for a branch and get the parent on a branch,
+    build a pathname of it, convert it a pathname in aufs, call
+    path_lookup(). now aufs gets a parent dir dentry, then handle it as
+    the 2nd level.
+  + to open the dir, aufs needs struct vfsmount. aufs keeps vfsmount
+    for every branch, but not itself. to get this, (currently) aufs
+    searches in current->nsproxy->mnt_ns list. it may not be a good
+    idea, but I didn't get other approach.
+  + test the generation of the gotten inode.
+- every inode operation: they may get EBUSY due to UDBA. in this case,
+  convert it into ESTALE for NFSD.
+- readdir(): call lockdep_on/off() because filldir in NFSD calls
+  lookup_one_len(), vfs_getattr(), encode_fh() and others.
+
+
+Test Only the Highest One for the Directory Permission (dirperm1 option)
+----------------------------------------------------------------------
+Let's try case study.
+- aufs has two branches, upper readwrite and lower readonly.
+  /au = /rw + /ro
+- "dirA" exists under /ro, but /rw. and its mode is 0700.
+- user invoked "chmod a+rx /au/dirA"
+- then "dirA" becomes world readable?
+
+In this case, /ro/dirA is still 0700 since it exists in readonly branch,
+or it may be a natively readonly filesystem. If aufs respects the lower
+branch, it should not respond readdir request from other users. But user
+allowed it by chmod. Should really aufs rejects showing the entries
+under /ro/dirA?
+
+To be honest, I don't have a best solution for this case. So I
+implemented 'dirperm1' and 'nodirperm1' option in aufs1, and leave it to
+users.
+When dirperm1 is specified, aufs checks only the highest one for the
+directory permission, and shows the entries. Otherwise, as usual, checks
+every dir existing on all branches and rejects the request.
+
+As a side effect, dirperm1 option improves the performance of aufs
+because the number of permission check is reduced.
+
+
+Show Whiteout Mode (shwh)
+----------------------------------------------------------------------
+Generally aufs hides the name of whiteouts. But in some cases, to show
+them is very useful for users. For instance, creating a new middle layer
+(branch) by merging existing layers.
+
+(borrowing aufs1 HOW-TO from a user, Michael Towers)
+When you have three branches,
+- Bottom: 'system', squashfs (underlying base system), read-only
+- Middle: 'mods', squashfs, read-only
+- Top: 'overlay', ram (tmpfs), read-write
+
+The top layer is loaded at boot time and saved at shutdown, to preserve
+the changes made to the system during the session.
+When larger changes have been made, or smaller changes have accumulated,
+the size of the saved top layer data grows. At this point, it would be
+nice to be able to merge the two overlay branches ('mods' and 'overlay')
+and rewrite the 'mods' squashfs, clearing the top layer and thus
+restoring save and load speed.
+
+This merging is simplified by the use of another aufs mount, of just the
+two overlay branches using the 'shwh' option.
+# mount -t aufs -o ro,shwh,br:/livesys/overlay=ro+wh:/livesys/mods=rr+wh \
+	aufs /livesys/merge_union
+
+A merged view of these two branches is then available at
+/livesys/merge_union, and the new feature is that the whiteouts are
+visible!
+Note that in 'shwh' mode the aufs mount must be 'ro', which will disable
+writing to all branches. Also the default mode for all branches is 'ro'.
+It is now possible to save the combined contents of the two overlay
+branches to a new squashfs, e.g.:
+# mksquashfs /livesys/merge_union /path/to/newmods.squash
+
+This new squashfs archive can be stored on the boot device and the
+initramfs will use it to replace the old one at the next boot.
+
+
+Being Another Aufs's Readonly Branch (robr)
+----------------------------------------------------------------------
+Aufs1 allows aufs to be another aufs's readonly branch.
+This feature was developed by a user's request. But it may not be used
+currecnly.
+
+
+Copy-up on Open (coo=)
+----------------------------------------------------------------------
+By default the internal copy-up is executed when it is really necessary.
+It is not done when a file is opened for writing, but when write(2) is
+done. Users who have many (over 100) branches want to know and analyse
+when and what file is copied-up. To insert a new upper branch which
+contains such files only may improve the performance of aufs.
+
+Aufs1 implemented "coo=none | leaf | all" option.
+
+
+Refresh the Opened File (refrof)
+----------------------------------------------------------------------
+This option is implemented in aufs1 but incomplete.
+
+When user reads from a file, he expects to get its latest filedata
+generally. If the file is removed and a new same named file is created,
+the content he gets is unchanged, ie. the unlinked filedata.
+
+Let's try case study again.
+- aufs has two branches.
+  /au = /rw + /ro
+- "fileA" exists under /ro, but /rw.
+- user opened "/au/fileA".
+- he or someone else inserts a branch (/new) between /rw and /ro.
+  /au = /rw + /new + /ro
+- the new branch has "fileA".
+- user reads from the opened "fileA"
+- which filedata should aufs return, from /ro or /new?
+
+Some people says it has to be "from /ro" and it is a semantics of Unix.
+The others say it should be "from /new" because the file is not removed
+and it is equivalent to the case of someone else modifies the file.
+
+Here again I don't have a best and final answer. I got an idea to
+implement 'refrof' and 'norefrof' option. When 'refrof' (REFResh the
+Opened File) is specified (by default), aufs returns the filedata from
+/new.
+Otherwise from /new.
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 02/28] aufs public header file
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 01/28] aufs documents J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 03/28] aufs module global J. R. Okajima
                   ` (25 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
public header file

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 include/linux/Kbuild      |    1 +
 include/linux/aufs_type.h |   98 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/aufs_type.h

diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 106c3ba..d0c7262 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -34,6 +34,7 @@ header-y += atmppp.h
 header-y += atmsap.h
 header-y += atmsvc.h
 header-y += atm_zatm.h
+header-y += aufs_type.h
 header-y += auto_fs4.h
 header-y += ax25.h
 header-y += b1lli.h
diff --git a/include/linux/aufs_type.h b/include/linux/aufs_type.h
new file mode 100644
index 0000000..419c853
--- /dev/null
+++ b/include/linux/aufs_type.h
@@ -0,0 +1,98 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef __AUFS_TYPE_H__
+#define __AUFS_TYPE_H__
+
+#include <linux/ioctl.h>
+
+#define AUFS_VERSION	"2"
+
+/* todo? move this to linux-2.6.19/include/magic.h */
+#define AUFS_SUPER_MAGIC	('a' << 24 | 'u' << 16 | 'f' << 8 | 's')
+
+/* ---------------------------------------------------------------------- */
+
+#ifdef CONFIG_AUFS_BRANCH_MAX_127
+/* some environments treat 'char' as 'unsigned char' by default */
+typedef signed char aufs_bindex_t;
+#define AUFS_BRANCH_MAX 127
+#else
+typedef short aufs_bindex_t;
+#ifdef CONFIG_AUFS_BRANCH_MAX_511
+#define AUFS_BRANCH_MAX 511
+#elif defined(CONFIG_AUFS_BRANCH_MAX_1023)
+#define AUFS_BRANCH_MAX 1023
+#elif defined(CONFIG_AUFS_BRANCH_MAX_32767)
+#define AUFS_BRANCH_MAX 32767
+#endif
+#endif
+
+#ifdef __KERNEL__
+#ifndef AUFS_BRANCH_MAX
+#error unknown CONFIG_AUFS_BRANCH_MAX value
+#endif
+#endif /* __KERNEL__ */
+
+/* ---------------------------------------------------------------------- */
+
+#define AUFS_NAME		"aufs"
+#define AUFS_FSTYPE		AUFS_NAME
+
+#define AUFS_ROOT_INO		2
+#define AUFS_FIRST_INO		11
+
+#define AUFS_WH_PFX		".wh."
+#define AUFS_WH_PFX_LEN		((int)sizeof(AUFS_WH_PFX) - 1)
+#define AUFS_XINO_FNAME		"." AUFS_NAME ".xino"
+#define AUFS_XINO_DEFPATH	"/tmp/" AUFS_XINO_FNAME
+#define AUFS_XINO_TRUNC_INIT	64 /* blocks */
+#define AUFS_XINO_TRUNC_STEP	4  /* blocks */
+#define AUFS_DIRWH_DEF		3
+#define AUFS_RDCACHE_DEF	10 /* seconds */
+#define AUFS_WKQ_NAME		AUFS_NAME "d"
+#define AUFS_NWKQ_DEF		4
+#define AUFS_MFS_SECOND_DEF	30 /* seconds */
+#define AUFS_PLINK_WARN		100 /* number of plinks */
+
+#define AUFS_DIROPQ_NAME	AUFS_WH_PFX ".opq" /* whiteouted doubly */
+#define AUFS_WH_DIROPQ		AUFS_WH_PFX AUFS_DIROPQ_NAME
+
+#define AUFS_BASE_NAME		AUFS_WH_PFX AUFS_NAME
+#define AUFS_PLINKDIR_NAME	AUFS_WH_PFX "plnk"
+#define AUFS_ORPHDIR_NAME	AUFS_WH_PFX "orph"
+
+/* doubly whiteouted */
+#define AUFS_WH_BASE		AUFS_WH_PFX AUFS_BASE_NAME
+#define AUFS_WH_PLINKDIR	AUFS_WH_PFX AUFS_PLINKDIR_NAME
+#define AUFS_WH_ORPHDIR		AUFS_WH_PFX AUFS_ORPHDIR_NAME
+
+/* branch permission */
+#define AUFS_BRPERM_RW		"rw"
+#define AUFS_BRPERM_RO		"ro"
+#define AUFS_BRPERM_RR		"rr"
+#define AUFS_BRPERM_WH		"wh"
+#define AUFS_BRPERM_NLWH	"nolwh"
+#define AUFS_BRPERM_ROWH	AUFS_BRPERM_RO "+" AUFS_BRPERM_WH
+#define AUFS_BRPERM_RRWH	AUFS_BRPERM_RR "+" AUFS_BRPERM_WH
+#define AUFS_BRPERM_RWNLWH	AUFS_BRPERM_RW "+" AUFS_BRPERM_NLWH
+
+/* ---------------------------------------------------------------------- */
+
+/* ioctl */
+enum {
+	AuCtl_PLINK_MAINT,
+	AuCtl_PLINK_CLEAN
+};
+
+#define AuCtlType		'A'
+#define AUFS_CTL_PLINK_MAINT	_IO(AuCtlType, AuCtl_PLINK_MAINT)
+#define AUFS_CTL_PLINK_CLEAN	_IO(AuCtlType, AuCtl_PLINK_CLEAN)
+
+#endif /* __AUFS_TYPE_H__ */
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 03/28] aufs module global
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 01/28] aufs documents J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 02/28] aufs public header file J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 04/28] aufs super_block J. R. Okajima
                   ` (24 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
common header file
module initialization and module global vars

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/Kconfig  |   78 ++++++++++++++++++++++++++
 fs/aufs/Makefile |   19 ++++++
 fs/aufs/aufs.h   |   43 ++++++++++++++
 fs/aufs/module.c |  164 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/module.h |   66 ++++++++++++++++++++++
 fs/aufs/rwsem.h  |   52 +++++++++++++++++
 fs/aufs/spl.h    |   47 +++++++++++++++
 7 files changed, 469 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/Kconfig
 create mode 100644 fs/aufs/Makefile
 create mode 100644 fs/aufs/aufs.h
 create mode 100644 fs/aufs/module.c
 create mode 100644 fs/aufs/module.h
 create mode 100644 fs/aufs/rwsem.h
 create mode 100644 fs/aufs/spl.h

diff --git a/fs/aufs/Kconfig b/fs/aufs/Kconfig
new file mode 100644
index 0000000..ebf766b
--- /dev/null
+++ b/fs/aufs/Kconfig
@@ -0,0 +1,78 @@
+config AUFS_FS
+	bool "Aufs (Advanced multi layered unification filesystem) support"
+	depends on EXPERIMENTAL
+	help
+	Aufs is a stackable unification filesystem such as Unionfs,
+	which unifies several directories and provides a merged single
+	directory.
+	In the early days, aufs was entirely re-designed and
+	re-implemented Unionfs Version 1.x series. Introducing many
+	original ideas, approaches and improvements, it becomes totally
+	different from Unionfs while keeping the basic features.
+
+if AUFS_FS
+choice
+	prompt "Maximum number of branches"
+	default AUFS_BRANCH_MAX_127
+	help
+	Specifies the maximum number of branches (or member directories)
+	in a single aufs. The larger value consumes more system
+	resources and has a minor impact to performance.
+config AUFS_BRANCH_MAX_127
+	bool "127"
+	help
+	Specifies the maximum number of branches (or member directories)
+	in a single aufs. The larger value consumes more system
+	resources and has a minor impact to performance.
+config AUFS_BRANCH_MAX_511
+	bool "511"
+	help
+	Specifies the maximum number of branches (or member directories)
+	in a single aufs. The larger value consumes more system
+	resources and has a minor impact to performance.
+config AUFS_BRANCH_MAX_1023
+	bool "1023"
+	help
+	Specifies the maximum number of branches (or member directories)
+	in a single aufs. The larger value consumes more system
+	resources and has a minor impact to performance.
+config AUFS_BRANCH_MAX_32767
+	bool "32767"
+	help
+	Specifies the maximum number of branches (or member directories)
+	in a single aufs. The larger value consumes more system
+	resources and has a minor impact to performance.
+endchoice
+
+config AUFS_HINOTIFY
+	bool "Use inotify to detect actions on a branch"
+	depends on INOTIFY
+	help
+	If you want to modify files on branches directly, eg. bypassing aufs,
+	and want aufs to detect the changes of them fully, then enable this
+	option and use 'udba=inotify' mount option.
+	It will have a negative impact to the performance.
+	See detail in aufs.5.
+
+config AUFS_DEBUG
+	bool "Debug aufs"
+	help
+	Enable this to compile aufs internal debug code.
+	It will have a negative impact to the performance.
+
+config AUFS_MAGIC_SYSRQ
+	bool
+	depends on AUFS_DEBUG && MAGIC_SYSRQ
+	default y
+	help
+	Automatic configuration for internal use.
+	When aufs supports Magic SysRq, enabled automatically.
+
+config AUFS_BDEV_LOOP
+	bool
+	depends on BLK_DEV_LOOP
+	default y
+	help
+	Automatic configuration for internal use.
+	Convert =[ym] into =y.
+endif
diff --git a/fs/aufs/Makefile b/fs/aufs/Makefile
new file mode 100644
index 0000000..7f9e9a3
--- /dev/null
+++ b/fs/aufs/Makefile
@@ -0,0 +1,19 @@
+
+include ${srctree}/${src}/magic.mk
+
+obj-$(CONFIG_AUFS_FS) += aufs.o
+aufs-y := module.o sbinfo.o super.o branch.o xino.o sysaufs.o opts.o \
+	wkq.o vfsub.o dcsub.o \
+	cpup.o whout.o plink.o wbr_policy.o \
+	dinfo.o dentry.o \
+	finfo.o file.o f_op.o \
+	dir.o vdir.o \
+	iinfo.o inode.o i_op.o i_op_add.o i_op_del.o i_op_ren.o \
+	ioctl.o
+
+# all are boolean
+aufs-$(CONFIG_SYSFS) += sysfs.o
+aufs-$(CONFIG_AUFS_BDEV_LOOP) += loop.o
+aufs-$(CONFIG_AUFS_HINOTIFY) += hinotify.o
+aufs-$(CONFIG_AUFS_DEBUG) += debug.o
+aufs-$(CONFIG_AUFS_MAGIC_SYSRQ) += sysrq.o
diff --git a/fs/aufs/aufs.h b/fs/aufs/aufs.h
new file mode 100644
index 0000000..a599aee
--- /dev/null
+++ b/fs/aufs/aufs.h
@@ -0,0 +1,43 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * all header files
+ */
+
+#ifndef __AUFS_H__
+#define __AUFS_H__
+
+#ifdef __KERNEL__
+
+#include <linux/bug.h>
+
+#include "debug.h"
+
+#include "branch.h"
+#include "cpup.h"
+#include "dcsub.h"
+#include "dentry.h"
+#include "dir.h"
+#include "file.h"
+#include "fstype.h"
+#include "inode.h"
+#include "loop.h"
+#include "module.h"
+#include "opts.h"
+#include "rwsem.h"
+#include "spl.h"
+#include "super.h"
+#include "sysaufs.h"
+#include "vfsub.h"
+#include "whout.h"
+#include "wkq.h"
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_H__ */
diff --git a/fs/aufs/module.c b/fs/aufs/module.c
new file mode 100644
index 0000000..c2219d3
--- /dev/null
+++ b/fs/aufs/module.c
@@ -0,0 +1,164 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * module global variables and operations
+ */
+
+#include <linux/module.h>
+#include <linux/seq_file.h>
+#include "aufs.h"
+
+void *au_kzrealloc(void *p, unsigned int nused, unsigned int new_sz, gfp_t gfp)
+{
+	if (new_sz <= nused)
+		return p;
+
+	p = krealloc(p, new_sz, gfp);
+	if (p)
+		memset(p + nused, 0, new_sz - nused);
+	return p;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * aufs caches
+ */
+struct kmem_cache *au_cachep[AuCache_Last];
+static int __init au_cache_init(void)
+{
+	au_cachep[AuCache_DINFO] = AuCache(au_dinfo);
+	if (au_cachep[AuCache_DINFO])
+		au_cachep[AuCache_ICNTNR] = AuCache(au_icntnr);
+	if (au_cachep[AuCache_ICNTNR])
+		au_cachep[AuCache_FINFO] = AuCache(au_finfo);
+	if (au_cachep[AuCache_FINFO])
+		au_cachep[AuCache_VDIR] = AuCache(au_vdir);
+	if (au_cachep[AuCache_VDIR])
+		au_cachep[AuCache_DEHSTR] = AuCache(au_vdir_dehstr);
+	if (au_cachep[AuCache_DEHSTR])
+		return 0;
+
+	return -ENOMEM;
+}
+
+static void au_cache_fin(void)
+{
+	int i;
+	for (i = 0; i < AuCache_Last; i++)
+		if (au_cachep[i]) {
+			kmem_cache_destroy(au_cachep[i]);
+			au_cachep[i] = NULL;
+		}
+}
+
+/* ---------------------------------------------------------------------- */
+
+int au_dir_roflags;
+
+/*
+ * functions for module interface.
+ */
+MODULE_LICENSE("GPL");
+/* MODULE_LICENSE("GPL v2"); */
+MODULE_AUTHOR("Junjiro R. Okajima");
+MODULE_DESCRIPTION(AUFS_NAME
+	" -- Advanced multi layered unification filesystem");
+MODULE_VERSION(AUFS_VERSION);
+
+/* it should be 'byte', but param_set_byte() prints it by "%c" */
+short aufs_nwkq = AUFS_NWKQ_DEF;
+MODULE_PARM_DESC(nwkq, "the number of workqueue thread, " AUFS_WKQ_NAME);
+module_param_named(nwkq, aufs_nwkq, short, S_IRUGO);
+
+/* this module parameter has no meaning when SYSFS is disabled */
+int sysaufs_brs = 1;
+MODULE_PARM_DESC(brs, "use <sysfs>/fs/aufs/si_*/brN");
+module_param_named(brs, sysaufs_brs, int, S_IRUGO);
+
+/* ---------------------------------------------------------------------- */
+
+static char au_esc_chars[0x20 + 3]; /* 0x01-0x20, backslash, del, and NULL */
+
+int au_seq_path(struct seq_file *seq, struct path *path)
+{
+	return seq_path(seq, path, au_esc_chars);
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int __init aufs_init(void)
+{
+	int err, i;
+	char *p;
+
+	p = au_esc_chars;
+	for (i = 1; i <= ' '; i++)
+		*p++ = i;
+	*p++ = '\\';
+	*p++ = '\x7f';
+	*p = 0;
+
+	au_dir_roflags = au_file_roflags(O_DIRECTORY | O_LARGEFILE);
+
+	sysaufs_brs_init();
+	au_debug_init();
+
+	err = -EINVAL;
+	if (unlikely(aufs_nwkq <= 0))
+		goto out;
+
+	err = sysaufs_init();
+	if (unlikely(err))
+		goto out;
+	err = au_wkq_init();
+	if (unlikely(err))
+		goto out_sysaufs;
+	err = au_hinotify_init();
+	if (unlikely(err))
+		goto out_wkq;
+	err = au_sysrq_init();
+	if (unlikely(err))
+		goto out_hin;
+	err = au_cache_init();
+	if (unlikely(err))
+		goto out_sysrq;
+	err = register_filesystem(&aufs_fs_type);
+	if (unlikely(err))
+		goto out_cache;
+	pr_info(AUFS_NAME " " AUFS_VERSION "\n");
+	goto out; /* success */
+
+ out_cache:
+	au_cache_fin();
+ out_sysrq:
+	au_sysrq_fin();
+ out_hin:
+	au_hinotify_fin();
+ out_wkq:
+	au_wkq_fin();
+ out_sysaufs:
+	sysaufs_fin();
+ out:
+	return err;
+}
+
+static void __exit aufs_exit(void)
+{
+	unregister_filesystem(&aufs_fs_type);
+	au_cache_fin();
+	au_sysrq_fin();
+	au_hinotify_fin();
+	au_wkq_fin();
+	sysaufs_fin();
+}
+
+module_init(aufs_init);
+module_exit(aufs_exit);
diff --git a/fs/aufs/module.h b/fs/aufs/module.h
new file mode 100644
index 0000000..85e397a
--- /dev/null
+++ b/fs/aufs/module.h
@@ -0,0 +1,66 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * module initialization and module-global
+ */
+
+#ifndef __AUFS_MODULE_H__
+#define __AUFS_MODULE_H__
+
+#ifdef __KERNEL__
+
+#include <linux/slab.h>
+
+/* module parameters */
+extern short aufs_nwkq;
+extern int sysaufs_brs;
+
+/* ---------------------------------------------------------------------- */
+
+extern int au_dir_roflags;
+
+void *au_kzrealloc(void *p, unsigned int nused, unsigned int new_sz, gfp_t gfp);
+int au_seq_path(struct seq_file *seq, struct path *path);
+
+/* ---------------------------------------------------------------------- */
+
+/* kmem cache */
+enum {
+	AuCache_DINFO,
+	AuCache_ICNTNR,
+	AuCache_FINFO,
+	AuCache_VDIR,
+	AuCache_DEHSTR,
+#ifdef CONFIG_AUFS_HINOTIFY
+	AuCache_HINOTIFY,
+#endif
+	AuCache_Last
+};
+
+#define AuCache(type)	KMEM_CACHE(type, SLAB_RECLAIM_ACCOUNT)
+
+extern struct kmem_cache *au_cachep[];
+
+#define AuCacheFuncs(name, index) \
+static inline void *au_cache_alloc_##name(void) \
+{ return kmem_cache_alloc(au_cachep[AuCache_##index], GFP_NOFS); } \
+static inline void au_cache_free_##name(void *p) \
+{ kmem_cache_free(au_cachep[AuCache_##index], p); }
+
+AuCacheFuncs(dinfo, DINFO);
+AuCacheFuncs(icntnr, ICNTNR);
+AuCacheFuncs(finfo, FINFO);
+AuCacheFuncs(vdir, VDIR);
+AuCacheFuncs(dehstr, DEHSTR);
+
+/*  ---------------------------------------------------------------------- */
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_MODULE_H__ */
diff --git a/fs/aufs/rwsem.h b/fs/aufs/rwsem.h
new file mode 100644
index 0000000..d569243
--- /dev/null
+++ b/fs/aufs/rwsem.h
@@ -0,0 +1,52 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * simple read-write semaphore wrappers
+ */
+
+#ifndef __AUFS_RWSEM_H__
+#define __AUFS_RWSEM_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+
+#define au_rwsem_destroy(rw)	AuDebugOn(rwsem_is_locked(rw))
+#define AuRwMustNoWaiters(rw)	AuDebugOn(!list_empty(&(rw)->wait_list))
+
+#define AuSimpleLockRwsemFuncs(prefix, param, rwsem) \
+static inline void prefix##_read_lock(param) \
+{ down_read(rwsem); } \
+static inline void prefix##_write_lock(param) \
+{ down_write(rwsem); } \
+static inline int prefix##_read_trylock(param) \
+{ return down_read_trylock(rwsem); } \
+static inline int prefix##_write_trylock(param) \
+{ return down_write_trylock(rwsem); }
+/* why is not _nested version defined */
+/* static inline void prefix##_read_trylock_nested(param, lsc)
+{ down_write_trylock_nested(rwsem, lsc)); }
+static inline void prefix##_write_trylock_nestd(param, lsc)
+{ down_write_trylock_nested(rwsem, lsc); } */
+
+#define AuSimpleUnlockRwsemFuncs(prefix, param, rwsem) \
+static inline void prefix##_read_unlock(param) \
+{ up_read(rwsem); } \
+static inline void prefix##_write_unlock(param) \
+{ up_write(rwsem); } \
+static inline void prefix##_downgrade_lock(param) \
+{ downgrade_write(rwsem); }
+
+#define AuSimpleRwsemFuncs(prefix, param, rwsem) \
+	AuSimpleLockRwsemFuncs(prefix, param, rwsem) \
+	AuSimpleUnlockRwsemFuncs(prefix, param, rwsem)
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_RWSEM_H__ */
diff --git a/fs/aufs/spl.h b/fs/aufs/spl.h
new file mode 100644
index 0000000..89ea7a2
--- /dev/null
+++ b/fs/aufs/spl.h
@@ -0,0 +1,47 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * simple list protected by a spinlock
+ */
+
+#ifndef __AUFS_SPL_H__
+#define __AUFS_SPL_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+
+struct au_splhead {
+	spinlock_t		spin;
+	struct list_head	head;
+};
+
+static inline void au_spl_init(struct au_splhead *spl)
+{
+	spin_lock_init(&spl->spin);
+	INIT_LIST_HEAD(&spl->head);
+}
+
+static inline void au_spl_add(struct list_head *list, struct au_splhead *spl)
+{
+	spin_lock(&spl->spin);
+	list_add(list, &spl->head);
+	spin_unlock(&spl->spin);
+}
+
+static inline void au_spl_del(struct list_head *list, struct au_splhead *spl)
+{
+	spin_lock(&spl->spin);
+	list_del(list);
+	spin_unlock(&spl->spin);
+}
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_SPL_H__ */
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 04/28] aufs super_block
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (2 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 03/28] aufs module global J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 05/28] aufs branch directory/filesystem J. R. Okajima
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
super_block operations and private data

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/sbinfo.c |  192 ++++++++++++
 fs/aufs/super.c  |  846 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/super.h  |  266 +++++++++++++++++
 3 files changed, 1304 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/sbinfo.c
 create mode 100644 fs/aufs/super.c
 create mode 100644 fs/aufs/super.h

diff --git a/fs/aufs/sbinfo.c b/fs/aufs/sbinfo.c
new file mode 100644
index 0000000..12d07f5
--- /dev/null
+++ b/fs/aufs/sbinfo.c
@@ -0,0 +1,192 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * superblock private data
+ */
+
+#include "aufs.h"
+
+/*
+ * they are necessary regardless sysfs is disabled.
+ */
+void au_si_free(struct kobject *kobj)
+{
+	struct au_sbinfo *sbinfo;
+	struct super_block *sb;
+
+	sbinfo = container_of(kobj, struct au_sbinfo, si_kobj);
+	AuDebugOn(!list_empty(&sbinfo->si_plink.head));
+
+	sb = sbinfo->si_sb;
+	si_write_lock(sb);
+	au_xino_clr(sb);
+	au_br_free(sbinfo);
+	kfree(sbinfo->si_branch);
+	mutex_destroy(&sbinfo->si_xib_mtx);
+	si_write_unlock(sb);
+	au_rwsem_destroy(&sbinfo->si_rwsem);
+
+	kfree(sbinfo);
+}
+
+int au_si_alloc(struct super_block *sb)
+{
+	int err;
+	struct au_sbinfo *sbinfo;
+
+	err = -ENOMEM;
+	sbinfo = kmalloc(sizeof(*sbinfo), GFP_NOFS);
+	if (unlikely(!sbinfo))
+		goto out;
+
+	/* will be reallocated separately */
+	sbinfo->si_branch = kzalloc(sizeof(*sbinfo->si_branch), GFP_NOFS);
+	if (unlikely(!sbinfo->si_branch))
+		goto out_sbinfo;
+
+	memset(&sbinfo->si_kobj, 0, sizeof(sbinfo->si_kobj));
+	err = sysaufs_si_init(sbinfo);
+	if (unlikely(err))
+		goto out_br;
+
+	au_nwt_init(&sbinfo->si_nowait);
+	init_rwsem(&sbinfo->si_rwsem);
+	down_write(&sbinfo->si_rwsem);
+	sbinfo->si_generation = 0;
+	sbinfo->au_si_status = 0;
+	sbinfo->si_bend = -1;
+	sbinfo->si_last_br_id = 0;
+
+	sbinfo->si_wbr_copyup = AuWbrCopyup_Def;
+	sbinfo->si_wbr_create = AuWbrCreate_Def;
+	sbinfo->si_wbr_copyup_ops = au_wbr_copyup_ops + AuWbrCopyup_Def;
+	sbinfo->si_wbr_create_ops = au_wbr_create_ops + AuWbrCreate_Def;
+
+	sbinfo->si_mntflags = AuOpt_Def;
+
+	sbinfo->si_xread = NULL;
+	sbinfo->si_xwrite = NULL;
+	sbinfo->si_xib = NULL;
+	mutex_init(&sbinfo->si_xib_mtx);
+	sbinfo->si_xib_buf = NULL;
+	sbinfo->si_xino_brid = -1;
+	/* leave si_xib_last_pindex and si_xib_next_bit */
+
+	sbinfo->si_rdcache = AUFS_RDCACHE_DEF * HZ;
+	sbinfo->si_dirwh = AUFS_DIRWH_DEF;
+
+	au_spl_init(&sbinfo->si_plink);
+	init_waitqueue_head(&sbinfo->si_plink_wq);
+
+	/* leave other members for sysaufs and si_mnt. */
+	sbinfo->si_sb = sb;
+	sb->s_fs_info = sbinfo;
+	au_debug_sbinfo_init(sbinfo);
+	return 0; /* success */
+
+ out_br:
+	kfree(sbinfo->si_branch);
+ out_sbinfo:
+	kfree(sbinfo);
+ out:
+	return err;
+}
+
+int au_sbr_realloc(struct au_sbinfo *sbinfo, int nbr)
+{
+	int err, sz;
+	struct au_branch **brp;
+
+	err = -ENOMEM;
+	sz = sizeof(*brp) * (sbinfo->si_bend + 1);
+	if (unlikely(!sz))
+		sz = sizeof(*brp);
+	brp = au_kzrealloc(sbinfo->si_branch, sz, sizeof(*brp) * nbr, GFP_NOFS);
+	if (brp) {
+		sbinfo->si_branch = brp;
+		err = 0;
+	}
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+unsigned int au_sigen_inc(struct super_block *sb)
+{
+	unsigned int gen;
+
+	gen = ++au_sbi(sb)->si_generation;
+	au_update_digen(sb->s_root);
+	au_update_iigen(sb->s_root->d_inode);
+	sb->s_root->d_inode->i_version++;
+	return gen;
+}
+
+aufs_bindex_t au_new_br_id(struct super_block *sb)
+{
+	aufs_bindex_t br_id;
+	int i;
+	struct au_sbinfo *sbinfo;
+
+	sbinfo = au_sbi(sb);
+	for (i = 0; i <= AUFS_BRANCH_MAX; i++) {
+		br_id = ++sbinfo->si_last_br_id;
+		if (br_id && au_br_index(sb, br_id) < 0)
+			return br_id;
+	}
+
+	return -1;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* dentry and super_block lock. call at entry point */
+void aufs_read_lock(struct dentry *dentry, int flags)
+{
+	si_read_lock(dentry->d_sb, flags);
+	if (au_ftest_lock(flags, DW))
+		di_write_lock_child(dentry);
+	else
+		di_read_lock_child(dentry, flags);
+}
+
+void aufs_read_unlock(struct dentry *dentry, int flags)
+{
+	if (au_ftest_lock(flags, DW))
+		di_write_unlock(dentry);
+	else
+		di_read_unlock(dentry, flags);
+	si_read_unlock(dentry->d_sb);
+}
+
+void aufs_write_lock(struct dentry *dentry)
+{
+	si_write_lock(dentry->d_sb);
+	di_write_lock_child(dentry);
+}
+
+void aufs_write_unlock(struct dentry *dentry)
+{
+	di_write_unlock(dentry);
+	si_write_unlock(dentry->d_sb);
+}
+
+void aufs_read_and_write_lock2(struct dentry *d1, struct dentry *d2, int flags)
+{
+	si_read_lock(d1->d_sb, flags);
+	di_write_lock2_child(d1, d2, au_ftest_lock(flags, DIR));
+}
+
+void aufs_read_and_write_unlock2(struct dentry *d1, struct dentry *d2)
+{
+	di_write_unlock2(d1, d2);
+	si_read_unlock(d1->d_sb);
+}
diff --git a/fs/aufs/super.c b/fs/aufs/super.c
new file mode 100644
index 0000000..700f8c8
--- /dev/null
+++ b/fs/aufs/super.c
@@ -0,0 +1,846 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * mount and super_block operations
+ */
+
+#include <linux/buffer_head.h>
+#include <linux/seq_file.h>
+#include <linux/statfs.h>
+#include "aufs.h"
+
+/*
+ * super_operations
+ */
+static struct inode *aufs_alloc_inode(struct super_block *sb __maybe_unused)
+{
+	struct au_icntnr *c;
+
+	c = au_cache_alloc_icntnr();
+	if (c) {
+		inode_init_once(&c->vfs_inode);
+		c->vfs_inode.i_version = 1; /* sigen(sb); */
+		c->iinfo.ii_hinode = NULL;
+		return &c->vfs_inode;
+	}
+	return NULL;
+}
+
+static void aufs_destroy_inode(struct inode *inode)
+{
+	au_iinfo_fin(inode);
+	au_cache_free_icntnr(container_of(inode, struct au_icntnr, vfs_inode));
+}
+
+struct inode *au_iget_locked(struct super_block *sb, ino_t ino)
+{
+	struct inode *inode;
+	int err;
+
+	inode = iget_locked(sb, ino);
+	if (unlikely(!inode)) {
+		inode = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+	if (!(inode->i_state & I_NEW))
+		goto out;
+
+	err = au_iinfo_init(inode);
+	if (!err)
+		inode->i_version++;
+	else {
+		iget_failed(inode);
+		inode = ERR_PTR(err);
+	}
+
+ out:
+	/* never return NULL */
+	AuDebugOn(!inode);
+	AuTraceErrPtr(inode);
+	return inode;
+}
+
+/* lock free root dinfo */
+static int au_show_brs(struct seq_file *seq, struct super_block *sb)
+{
+	int err;
+	aufs_bindex_t bindex, bend;
+	struct path path;
+	struct au_hdentry *hd;
+	struct au_branch *br;
+
+	err = 0;
+	bend = au_sbend(sb);
+	hd = au_di(sb->s_root)->di_hdentry;
+	for (bindex = 0; !err && bindex <= bend; bindex++) {
+		br = au_sbr(sb, bindex);
+		path.mnt = br->br_mnt;
+		path.dentry = hd[bindex].hd_dentry;
+		err = au_seq_path(seq, &path);
+		if (err > 0)
+			err = seq_printf(seq, "=%s",
+					 au_optstr_br_perm(br->br_perm));
+		if (!err && bindex != bend)
+			err = seq_putc(seq, ':');
+	}
+
+	return err;
+}
+
+static void au_show_wbr_create(struct seq_file *m, int v,
+			       struct au_sbinfo *sbinfo)
+{
+	const char *pat;
+
+	seq_printf(m, ",create=");
+	pat = au_optstr_wbr_create(v);
+	switch (v) {
+	case AuWbrCreate_TDP:
+	case AuWbrCreate_RR:
+	case AuWbrCreate_MFS:
+	case AuWbrCreate_PMFS:
+		seq_printf(m, pat);
+		break;
+	case AuWbrCreate_MFSV:
+		seq_printf(m, /*pat*/"mfs:%lu",
+			   sbinfo->si_wbr_mfs.mfs_expire / HZ);
+		break;
+	case AuWbrCreate_PMFSV:
+		seq_printf(m, /*pat*/"pmfs:%lu",
+			   sbinfo->si_wbr_mfs.mfs_expire / HZ);
+		break;
+	case AuWbrCreate_MFSRR:
+		seq_printf(m, /*pat*/"mfsrr:%llu",
+			   sbinfo->si_wbr_mfs.mfsrr_watermark);
+		break;
+	case AuWbrCreate_MFSRRV:
+		seq_printf(m, /*pat*/"mfsrr:%llu:%lu",
+			   sbinfo->si_wbr_mfs.mfsrr_watermark,
+			   sbinfo->si_wbr_mfs.mfs_expire / HZ);
+		break;
+	}
+}
+
+static int au_show_xino(struct seq_file *seq, struct vfsmount *mnt)
+{
+	int err;
+	const int len = sizeof(AUFS_XINO_FNAME) - 1;
+	aufs_bindex_t bindex, brid;
+	struct super_block *sb;
+	struct qstr *name;
+	struct file *f;
+	struct dentry *d, *h_root;
+
+	err = 0;
+	sb = mnt->mnt_sb;
+	f = au_sbi(sb)->si_xib;
+	if (!f)
+		goto out;
+
+	/* stop printing the default xino path on the first writable branch */
+	h_root = NULL;
+	brid = au_xino_brid(sb);
+	if (brid >= 0) {
+		bindex = au_br_index(sb, brid);
+		h_root = au_di(sb->s_root)->di_hdentry[0 + bindex].hd_dentry;
+	}
+	d = f->f_dentry;
+	name = &d->d_name;
+	/* safe ->d_parent because the file is unlinked */
+	if (d->d_parent == h_root
+	    && name->len == len
+	    && !memcmp(name->name, AUFS_XINO_FNAME, len))
+		goto out;
+
+	seq_puts(seq, ",xino=");
+	err = au_xino_path(seq, f);
+
+ out:
+	return err;
+}
+
+/* seq_file will re-call me in case of too long string */
+static int aufs_show_options(struct seq_file *m, struct vfsmount *mnt)
+{
+	int err, n;
+	unsigned int mnt_flags, v;
+	struct super_block *sb;
+	struct au_sbinfo *sbinfo;
+
+#define AuBool(name, str) do { \
+	v = au_opt_test(mnt_flags, name); \
+	if (v != au_opt_test(AuOpt_Def, name)) \
+		seq_printf(m, ",%s" #str, v ? "" : "no"); \
+} while (0)
+
+#define AuStr(name, str) do { \
+	v = mnt_flags & AuOptMask_##name; \
+	if (v != (AuOpt_Def & AuOptMask_##name)) \
+		seq_printf(m, "," #str "=%s", au_optstr_##str(v)); \
+} while (0)
+
+	/* lock free root dinfo */
+	sb = mnt->mnt_sb;
+	si_noflush_read_lock(sb);
+	sbinfo = au_sbi(sb);
+	seq_printf(m, ",si=%lx", sysaufs_si_id(sbinfo));
+
+	mnt_flags = au_mntflags(sb);
+	if (au_opt_test(mnt_flags, XINO)) {
+		err = au_show_xino(m, mnt);
+		if (unlikely(err))
+			goto out;
+	} else
+		seq_puts(m, ",noxino");
+
+	AuBool(TRUNC_XINO, trunc_xino);
+	AuStr(UDBA, udba);
+	AuBool(PLINK, plink);
+	/* AuBool(DIRPERM1, dirperm1); */
+	/* AuBool(REFROF, refrof); */
+
+	v = sbinfo->si_wbr_create;
+	if (v != AuWbrCreate_Def)
+		au_show_wbr_create(m, v, sbinfo);
+
+	v = sbinfo->si_wbr_copyup;
+	if (v != AuWbrCopyup_Def)
+		seq_printf(m, ",cpup=%s", au_optstr_wbr_copyup(v));
+
+	v = au_opt_test(mnt_flags, ALWAYS_DIROPQ);
+	if (v != au_opt_test(AuOpt_Def, ALWAYS_DIROPQ))
+		seq_printf(m, ",diropq=%c", v ? 'a' : 'w');
+
+	n = sbinfo->si_dirwh;
+	if (n != AUFS_DIRWH_DEF)
+		seq_printf(m, ",dirwh=%d", n);
+
+	n = sbinfo->si_rdcache / HZ;
+	if (n != AUFS_RDCACHE_DEF)
+		seq_printf(m, ",rdcache=%d", n);
+
+	AuBool(SUM, sum);
+	/* AuBool(SUM_W, wsum); */
+	AuBool(WARN_PERM, warn_perm);
+	AuBool(VERBOSE, verbose);
+
+ out:
+	/* be sure to print "br:" last */
+	if (!sysaufs_brs) {
+		seq_puts(m, ",br:");
+		au_show_brs(m, sb);
+	}
+	si_read_unlock(sb);
+	return 0;
+
+#undef Deleted
+#undef AuBool
+#undef AuStr
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* sum mode which returns the summation for statfs(2) */
+
+static u64 au_add_till_max(u64 a, u64 b)
+{
+	u64 old;
+
+	old = a;
+	a += b;
+	if (old < a)
+		return a;
+	return ULLONG_MAX;
+}
+
+static int au_statfs_sum(struct super_block *sb, struct kstatfs *buf)
+{
+	int err;
+	u64 blocks, bfree, bavail, files, ffree;
+	aufs_bindex_t bend, bindex, i;
+	unsigned char shared;
+	struct vfsmount *h_mnt;
+	struct super_block *h_sb;
+
+	blocks = 0;
+	bfree = 0;
+	bavail = 0;
+	files = 0;
+	ffree = 0;
+
+	err = 0;
+	bend = au_sbend(sb);
+	for (bindex = bend; bindex >= 0; bindex--) {
+		h_mnt = au_sbr_mnt(sb, bindex);
+		h_sb = h_mnt->mnt_sb;
+		shared = 0;
+		for (i = bindex + 1; !shared && i <= bend; i++)
+			shared = (au_sbr_sb(sb, i) == h_sb);
+		if (shared)
+			continue;
+
+		/* sb->s_root for NFS is unreliable */
+		err = vfs_statfs(h_mnt->mnt_root, buf);
+		if (unlikely(err))
+			goto out;
+
+		blocks = au_add_till_max(blocks, buf->f_blocks);
+		bfree = au_add_till_max(bfree, buf->f_bfree);
+		bavail = au_add_till_max(bavail, buf->f_bavail);
+		files = au_add_till_max(files, buf->f_files);
+		ffree = au_add_till_max(ffree, buf->f_ffree);
+	}
+
+	buf->f_blocks = blocks;
+	buf->f_bfree = bfree;
+	buf->f_bavail = bavail;
+	buf->f_files = files;
+	buf->f_ffree = ffree;
+
+ out:
+	return err;
+}
+
+static int aufs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+	int err;
+	struct super_block *sb;
+
+	/* lock free root dinfo */
+	sb = dentry->d_sb;
+	si_noflush_read_lock(sb);
+	if (!au_opt_test(au_mntflags(sb), SUM))
+		/* sb->s_root for NFS is unreliable */
+		err = vfs_statfs(au_sbr_mnt(sb, 0)->mnt_root, buf);
+	else
+		err = au_statfs_sum(sb, buf);
+	si_read_unlock(sb);
+
+	if (!err) {
+		buf->f_type = AUFS_SUPER_MAGIC;
+		buf->f_namelen -= AUFS_WH_PFX_LEN;
+		memset(&buf->f_fsid, 0, sizeof(buf->f_fsid));
+	}
+	/* buf->f_bsize = buf->f_blocks = buf->f_bfree = buf->f_bavail = -1; */
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* try flushing the lower fs at aufs remount/unmount time */
+
+static void au_fsync_br(struct super_block *sb)
+{
+	aufs_bindex_t bend, bindex;
+	int brperm;
+	struct au_branch *br;
+	struct super_block *h_sb;
+
+	bend = au_sbend(sb);
+	for (bindex = 0; bindex < bend; bindex++) {
+		br = au_sbr(sb, bindex);
+		brperm = br->br_perm;
+		if (brperm == AuBrPerm_RR || brperm == AuBrPerm_RRWH)
+			continue;
+		h_sb = br->br_mnt->mnt_sb;
+		if (bdev_read_only(h_sb->s_bdev))
+			continue;
+
+		lockdep_off();
+		down_write(&h_sb->s_umount);
+		shrink_dcache_sb(h_sb);
+		fsync_super(h_sb);
+		up_write(&h_sb->s_umount);
+		lockdep_on();
+	}
+}
+
+/*
+ * this IS NOT for super_operations.
+ * I guess it will be reverted someday.
+ */
+static void aufs_umount_begin(struct super_block *sb)
+{
+	struct au_sbinfo *sbinfo;
+
+	sbinfo = au_sbi(sb);
+	if (!sbinfo)
+		return;
+
+	si_write_lock(sb);
+	au_fsync_br(sb);
+	if (au_opt_test(au_mntflags(sb), PLINK))
+		au_plink_put(sb);
+	if (sbinfo->si_wbr_create_ops->fin)
+		sbinfo->si_wbr_create_ops->fin(sb);
+	si_write_unlock(sb);
+}
+
+/* final actions when unmounting a file system */
+static void aufs_put_super(struct super_block *sb)
+{
+	struct au_sbinfo *sbinfo;
+
+	sbinfo = au_sbi(sb);
+	if (!sbinfo)
+		return;
+
+	aufs_umount_begin(sb);
+	kobject_put(&sbinfo->si_kobj);
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * refresh dentry and inode at remount time.
+ */
+static int do_refresh(struct dentry *dentry, mode_t type,
+		      unsigned int dir_flags)
+{
+	int err;
+	struct dentry *parent;
+
+	di_write_lock_child(dentry);
+	parent = dget_parent(dentry);
+	di_read_lock_parent(parent, AuLock_IR);
+
+	/* returns the number of positive dentries */
+	err = au_refresh_hdentry(dentry, type);
+	if (err >= 0) {
+		struct inode *inode = dentry->d_inode;
+		err = au_refresh_hinode(inode, dentry);
+		if (!err && type == S_IFDIR)
+			au_reset_hinotify(inode, dir_flags);
+	}
+	if (unlikely(err))
+		AuErr("unrecoverable error %d, %.*s\n", err, AuDLNPair(dentry));
+
+	di_read_unlock(parent, AuLock_IR);
+	dput(parent);
+	di_write_unlock(dentry);
+
+	return err;
+}
+
+static int test_dir(struct dentry *dentry, void *arg __maybe_unused)
+{
+	return S_ISDIR(dentry->d_inode->i_mode);
+}
+
+/* gave up consolidating with refresh_nondir() */
+static int refresh_dir(struct dentry *root, unsigned int sigen)
+{
+	int err, i, j, ndentry, e;
+	struct au_dcsub_pages dpages;
+	struct au_dpage *dpage;
+	struct dentry **dentries;
+	struct inode *inode;
+	const unsigned int flags = au_hi_flags(root->d_inode, /*isdir*/1);
+
+	err = 0;
+	list_for_each_entry(inode, &root->d_sb->s_inodes, i_sb_list)
+		if (S_ISDIR(inode->i_mode) && au_iigen(inode) != sigen) {
+			ii_write_lock_child(inode);
+			e = au_refresh_hinode_self(inode, /*do_attr*/1);
+			ii_write_unlock(inode);
+			if (unlikely(e)) {
+				AuDbg("e %d, i%lu\n", e, inode->i_ino);
+				if (!err)
+					err = e;
+				/* go on even if err */
+			}
+		}
+
+	e = au_dpages_init(&dpages, GFP_NOFS);
+	if (unlikely(e)) {
+		if (!err)
+			err = e;
+		goto out;
+	}
+	e = au_dcsub_pages(&dpages, root, test_dir, NULL);
+	if (unlikely(e)) {
+		if (!err)
+			err = e;
+		goto out_dpages;
+	}
+
+	for (i = 0; !e && i < dpages.ndpage; i++) {
+		dpage = dpages.dpages + i;
+		dentries = dpage->dentries;
+		ndentry = dpage->ndentry;
+		for (j = 0; !e && j < ndentry; j++) {
+			struct dentry *d;
+
+			d = dentries[j];
+			au_dbg_verify_dir_parent(d, sigen);
+			if (au_digen(d) != sigen) {
+				e = do_refresh(d, S_IFDIR, flags);
+				if (unlikely(e && !err))
+					err = e;
+				/* break on err */
+			}
+		}
+	}
+
+ out_dpages:
+	au_dpages_free(&dpages);
+ out:
+	return err;
+}
+
+static int test_nondir(struct dentry *dentry, void *arg __maybe_unused)
+{
+	return !S_ISDIR(dentry->d_inode->i_mode);
+}
+
+static int refresh_nondir(struct dentry *root, unsigned int sigen,
+			  int do_dentry)
+{
+	int err, i, j, ndentry, e;
+	struct au_dcsub_pages dpages;
+	struct au_dpage *dpage;
+	struct dentry **dentries;
+	struct inode *inode;
+
+	err = 0;
+	list_for_each_entry(inode, &root->d_sb->s_inodes, i_sb_list)
+		if (!S_ISDIR(inode->i_mode) && au_iigen(inode) != sigen) {
+			ii_write_lock_child(inode);
+			e = au_refresh_hinode_self(inode, /*do_attr*/1);
+			ii_write_unlock(inode);
+			if (unlikely(e)) {
+				AuDbg("e %d, i%lu\n", e, inode->i_ino);
+				if (!err)
+					err = e;
+				/* go on even if err */
+			}
+		}
+
+	if (!do_dentry)
+		goto out;
+
+	e = au_dpages_init(&dpages, GFP_NOFS);
+	if (unlikely(e)) {
+		if (!err)
+			err = e;
+		goto out;
+	}
+	e = au_dcsub_pages(&dpages, root, test_nondir, NULL);
+	if (unlikely(e)) {
+		if (!err)
+			err = e;
+		goto out_dpages;
+	}
+
+	for (i = 0; i < dpages.ndpage; i++) {
+		dpage = dpages.dpages + i;
+		dentries = dpage->dentries;
+		ndentry = dpage->ndentry;
+		for (j = 0; j < ndentry; j++) {
+			struct dentry *d;
+
+			d = dentries[j];
+			au_dbg_verify_nondir_parent(d, sigen);
+			inode = d->d_inode;
+			if (inode && au_digen(d) != sigen) {
+				e = do_refresh(d, inode->i_mode & S_IFMT,
+					       /*dir_flags*/0);
+				if (unlikely(e && !err))
+					err = e;
+				/* go on even err */
+			}
+		}
+	}
+
+ out_dpages:
+	au_dpages_free(&dpages);
+ out:
+	return err;
+}
+
+static void au_remount_refresh(struct super_block *sb, unsigned int flags)
+{
+	int err;
+	unsigned int sigen;
+	struct au_sbinfo *sbinfo;
+	struct dentry *root;
+	struct inode *inode;
+
+	au_sigen_inc(sb);
+	sigen = au_sigen(sb);
+	sbinfo = au_sbi(sb);
+	au_fclr_si(sbinfo, FAILED_REFRESH_DIRS);
+
+	root = sb->s_root;
+	DiMustNoWaiters(root);
+	inode = root->d_inode;
+	IiMustNoWaiters(inode);
+	au_reset_hinotify(inode, au_hi_flags(inode, /*isdir*/1));
+	di_write_unlock(root);
+
+	err = refresh_dir(root, sigen);
+	if (unlikely(err)) {
+		au_fset_si(sbinfo, FAILED_REFRESH_DIRS);
+		AuWarn("Refreshing directories failed, ignored (%d)\n", err);
+	}
+
+	if (au_ftest_opts(flags, REFRESH_NONDIR)) {
+		err = refresh_nondir(root, sigen, !err);
+		if (unlikely(err))
+			AuWarn("Refreshing non-directories failed, ignored"
+			       "(%d)\n", err);
+	}
+
+	/* aufs_write_lock() calls ..._child() */
+	di_write_lock_child(root);
+	au_cpup_attr_all(root->d_inode, /*force*/1);
+}
+
+/* stop extra interpretation of errno in mount(8), and strange error messages */
+static int cvt_err(int err)
+{
+	AuTraceErr(err);
+
+	switch (err) {
+	case -ENOENT:
+	case -ENOTDIR:
+	case -EEXIST:
+	case -EIO:
+		err = -EINVAL;
+	}
+	return err;
+}
+
+static int aufs_remount_fs(struct super_block *sb, int *flags, char *data)
+{
+	int err;
+	struct au_opts opts;
+	struct dentry *root;
+	struct inode *inode;
+	struct au_sbinfo *sbinfo;
+
+	err = 0;
+	root = sb->s_root;
+	if (!data || !*data) {
+		aufs_write_lock(root);
+		err = au_opts_verify(sb, *flags, /*pending*/0);
+		if (!err)
+			au_fsync_br(sb);
+		aufs_write_unlock(root);
+		goto out;
+	}
+
+	err = -ENOMEM;
+	memset(&opts, 0, sizeof(opts));
+	opts.opt = (void *)__get_free_page(GFP_NOFS);
+	if (unlikely(!opts.opt))
+		goto out;
+	opts.max_opt = PAGE_SIZE / sizeof(*opts.opt);
+	opts.flags = AuOpts_REMOUNT;
+	opts.sb_flags = *flags;
+
+	/* parse it before aufs lock */
+	err = au_opts_parse(sb, data, &opts);
+	if (unlikely(err))
+		goto out_opts;
+
+	sbinfo = au_sbi(sb);
+	inode = root->d_inode;
+	mutex_lock(&inode->i_mutex);
+	aufs_write_lock(root);
+	au_fsync_br(sb);
+
+	/* au_opts_remount() may return an error */
+	err = au_opts_remount(sb, &opts);
+	au_opts_free(&opts);
+
+	if (au_ftest_opts(opts.flags, REFRESH_DIR)
+	    || au_ftest_opts(opts.flags, REFRESH_NONDIR))
+		au_remount_refresh(sb, opts.flags);
+
+	aufs_write_unlock(root);
+	mutex_unlock(&inode->i_mutex);
+
+ out_opts:
+	free_page((unsigned long)opts.opt);
+ out:
+	err = cvt_err(err);
+	AuTraceErr(err);
+	return err;
+}
+
+static struct super_operations aufs_sop = {
+	.alloc_inode	= aufs_alloc_inode,
+	.destroy_inode	= aufs_destroy_inode,
+	.drop_inode	= generic_delete_inode,
+	.show_options	= aufs_show_options,
+	.statfs		= aufs_statfs,
+	.put_super	= aufs_put_super,
+	.remount_fs	= aufs_remount_fs
+};
+
+/* ---------------------------------------------------------------------- */
+
+static int alloc_root(struct super_block *sb)
+{
+	int err;
+	struct inode *inode;
+	struct dentry *root;
+
+	err = -ENOMEM;
+	inode = au_iget_locked(sb, AUFS_ROOT_INO);
+	err = PTR_ERR(inode);
+	if (IS_ERR(inode))
+		goto out;
+
+	inode->i_op = &aufs_dir_iop;
+	inode->i_fop = &aufs_dir_fop;
+	inode->i_mode = S_IFDIR;
+	inode->i_nlink = 2;
+	unlock_new_inode(inode);
+
+	root = d_alloc_root(inode);
+	if (unlikely(!root))
+		goto out_iput;
+	err = PTR_ERR(root);
+	if (IS_ERR(root))
+		goto out_iput;
+
+	err = au_alloc_dinfo(root);
+	if (!err) {
+		sb->s_root = root;
+		return 0; /* success */
+	}
+	dput(root);
+	goto out; /* do not iput */
+
+ out_iput:
+	iget_failed(inode);
+	iput(inode);
+ out:
+	return err;
+
+}
+
+static int aufs_fill_super(struct super_block *sb, void *raw_data,
+			   int silent __maybe_unused)
+{
+	int err;
+	struct au_opts opts;
+	struct dentry *root;
+	struct inode *inode;
+	char *arg = raw_data;
+
+	if (unlikely(!arg || !*arg)) {
+		err = -EINVAL;
+		AuErr("no arg\n");
+		goto out;
+	}
+
+	err = -ENOMEM;
+	memset(&opts, 0, sizeof(opts));
+	opts.opt = (void *)__get_free_page(GFP_NOFS);
+	if (unlikely(!opts.opt))
+		goto out;
+	opts.max_opt = PAGE_SIZE / sizeof(*opts.opt);
+	opts.sb_flags = sb->s_flags;
+
+	err = au_si_alloc(sb);
+	if (unlikely(err))
+		goto out_opts;
+
+	/* all timestamps always follow the ones on the branch */
+	sb->s_flags |= MS_NOATIME | MS_NODIRATIME;
+	sb->s_op = &aufs_sop;
+	sb->s_magic = AUFS_SUPER_MAGIC;
+	sb->s_maxbytes = 0;
+
+	err = alloc_root(sb);
+	if (unlikely(err)) {
+		si_write_unlock(sb);
+		goto out_info;
+	}
+	root = sb->s_root;
+	inode = root->d_inode;
+
+	/*
+	 * actually we can parse options regardless aufs lock here.
+	 * but at remount time, parsing must be done before aufs lock.
+	 * so we follow the same rule.
+	 */
+	ii_write_lock_parent(inode);
+	aufs_write_unlock(root);
+	err = au_opts_parse(sb, arg, &opts);
+	if (unlikely(err))
+		goto out_root;
+
+	/* lock vfs_inode first, then aufs. */
+	mutex_lock(&inode->i_mutex);
+	inode->i_op = &aufs_dir_iop;
+	inode->i_fop = &aufs_dir_fop;
+	aufs_write_lock(root);
+	err = au_opts_mount(sb, &opts);
+	au_opts_free(&opts);
+	if (unlikely(err))
+		goto out_unlock;
+	aufs_write_unlock(root);
+	mutex_unlock(&inode->i_mutex);
+	goto out_opts; /* success */
+
+ out_unlock:
+	aufs_write_unlock(root);
+	mutex_unlock(&inode->i_mutex);
+ out_root:
+	dput(root);
+	sb->s_root = NULL;
+ out_info:
+	kobject_put(&au_sbi(sb)->si_kobj);
+	sb->s_fs_info = NULL;
+ out_opts:
+	free_page((unsigned long)opts.opt);
+ out:
+	AuTraceErr(err);
+	err = cvt_err(err);
+	AuTraceErr(err);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int aufs_get_sb(struct file_system_type *fs_type, int flags,
+		       const char *dev_name __maybe_unused, void *raw_data,
+		       struct vfsmount *mnt)
+{
+	int err;
+	struct super_block *sb;
+
+	/* all timestamps always follow the ones on the branch */
+	/* mnt->mnt_flags |= MNT_NOATIME | MNT_NODIRATIME; */
+	err = get_sb_nodev(fs_type, flags, raw_data, aufs_fill_super, mnt);
+	if (!err) {
+		sb = mnt->mnt_sb;
+		si_write_lock(sb);
+		sysaufs_brs_add(sb, 0);
+		si_write_unlock(sb);
+	}
+	return err;
+}
+
+struct file_system_type aufs_fs_type = {
+	.name		= AUFS_FSTYPE,
+	.fs_flags	=
+		FS_RENAME_DOES_D_MOVE	/* a race between rename and others */
+		| FS_REVAL_DOT,		/* for NFS branch and udba */
+	.get_sb		= aufs_get_sb,
+	.kill_sb	= generic_shutdown_super,
+	/* no need to __module_get() and module_put(). */
+	.owner		= THIS_MODULE,
+};
diff --git a/fs/aufs/super.h b/fs/aufs/super.h
new file mode 100644
index 0000000..83721ef
--- /dev/null
+++ b/fs/aufs/super.h
@@ -0,0 +1,266 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * super_block operations
+ */
+
+#ifndef __AUFS_SUPER_H__
+#define __AUFS_SUPER_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+#include <linux/kobject.h>
+#include <linux/aufs_type.h>
+#include "rwsem.h"
+#include "spl.h"
+#include "wkq.h"
+
+typedef ssize_t (*au_readf_t)(struct file *, char __user *, size_t, loff_t *);
+typedef ssize_t (*au_writef_t)(struct file *, const char __user *, size_t,
+			       loff_t *);
+
+/* policies to select one among multiple writable branches */
+struct au_wbr_copyup_operations {
+	int (*copyup)(struct dentry *dentry);
+};
+
+struct au_wbr_create_operations {
+	int (*create)(struct dentry *dentry, int isdir);
+	int (*init)(struct super_block *sb);
+	int (*fin)(struct super_block *sb);
+};
+
+struct au_wbr_mfs {
+	struct mutex	mfs_lock; /* protect this structure */
+	unsigned long	mfs_jiffy;
+	unsigned long	mfs_expire;
+	aufs_bindex_t	mfs_bindex;
+
+	unsigned long long	mfsrr_bytes;
+	unsigned long long	mfsrr_watermark;
+};
+
+/* sbinfo status flags */
+/*
+ * set true when refresh_dirs() failed at remount time.
+ * then try refreshing dirs at access time again.
+ * if it is false, refreshing dirs at access time is unnecesary
+ */
+#define AuSi_FAILED_REFRESH_DIRS	1
+#define AuSi_MAINTAIN_PLINK		(1 << 1)	/* ioctl */
+#define au_ftest_si(sbinfo, name)	((sbinfo)->au_si_status & AuSi_##name)
+#define au_fset_si(sbinfo, name) \
+	{ (sbinfo)->au_si_status |= AuSi_##name; }
+#define au_fclr_si(sbinfo, name) \
+	{ (sbinfo)->au_si_status &= ~AuSi_##name; }
+
+struct au_branch;
+struct au_sbinfo {
+	/* nowait tasks in the system-wide workqueue */
+	struct au_nowait_tasks	si_nowait;
+
+	struct rw_semaphore	si_rwsem;
+
+	/* branch management */
+	unsigned int		si_generation;
+
+	/* see above flags */
+	unsigned char		au_si_status;
+
+	aufs_bindex_t		si_bend;
+	aufs_bindex_t		si_last_br_id;
+	struct au_branch	**si_branch;
+
+	/* policy to select a writable branch */
+	unsigned char		si_wbr_copyup;
+	unsigned char		si_wbr_create;
+	struct au_wbr_copyup_operations *si_wbr_copyup_ops;
+	struct au_wbr_create_operations *si_wbr_create_ops;
+
+	/* round robin */
+	atomic_t		si_wbr_rr_next;
+
+	/* most free space */
+	struct au_wbr_mfs	si_wbr_mfs;
+
+	/* mount flags */
+	/* include/asm-ia64/siginfo.h defines a macro named si_flags */
+	unsigned int		si_mntflags;
+
+	/* external inode number (bitmap and translation table) */
+	au_readf_t		si_xread;
+	au_writef_t		si_xwrite;
+	struct file		*si_xib;
+	struct mutex		si_xib_mtx; /* protect xib members */
+	unsigned long		*si_xib_buf;
+	unsigned long		si_xib_last_pindex;
+	int			si_xib_next_bit;
+	aufs_bindex_t		si_xino_brid;
+	/* reserved for future use */
+	/* unsigned long long	si_xib_limit; */	/* Max xib file size */
+
+	/* readdir cache time, max, in HZ */
+	unsigned long		si_rdcache;
+
+	/*
+	 * If the number of whiteouts are larger than si_dirwh, leave all of
+	 * them after au_whtmp_ren to reduce the cost of rmdir(2).
+	 * future fsck.aufs or kernel thread will remove them later.
+	 * Otherwise, remove all whiteouts and the dir in rmdir(2).
+	 */
+	unsigned int		si_dirwh;
+
+	/*
+	 * rename(2) a directory with all children.
+	 */
+	/* reserved for future use */
+	/* int			si_rendir; */
+
+	/* pseudo_link list */
+	struct au_splhead	si_plink;
+	wait_queue_head_t	si_plink_wq;
+
+	/*
+	 * sysfs and lifetime management.
+	 * this is not a small structure and it may be a waste of memory in case
+	 * of sysfs is disabled, particulary when many aufs-es are mounted.
+	 * but using sysfs is majority.
+	 */
+	struct kobject		si_kobj;
+
+	/* dirty, necessary for unmounting, sysfs and sysrq */
+	struct super_block	*si_sb;
+};
+
+/* ---------------------------------------------------------------------- */
+
+/* policy to select one among writable branches */
+#define AuWbrCopyup(sbinfo, args...) \
+	((sbinfo)->si_wbr_copyup_ops->copyup(args))
+#define AuWbrCreate(sbinfo, args...) \
+	((sbinfo)->si_wbr_create_ops->create(args))
+
+/* flags for si_read_lock()/aufs_read_lock()/di_read_lock() */
+#define AuLock_DW		1		/* write-lock dentry */
+#define AuLock_IR		(1 << 1)	/* read-lock inode */
+#define AuLock_IW		(1 << 2)	/* write-lock inode */
+#define AuLock_FLUSH		(1 << 3)	/* wait for 'nowait' tasks */
+#define AuLock_DIR		(1 << 4)	/* target is a dir */
+#define au_ftest_lock(flags, name)	((flags) & AuLock_##name)
+#define au_fset_lock(flags, name)	{ (flags) |= AuLock_##name; }
+#define au_fclr_lock(flags, name)	{ (flags) &= ~AuLock_##name; }
+
+/* ---------------------------------------------------------------------- */
+
+/* super.c */
+extern struct file_system_type aufs_fs_type;
+struct inode *au_iget_locked(struct super_block *sb, ino_t ino);
+
+/* sbinfo.c */
+void au_si_free(struct kobject *kobj);
+int au_si_alloc(struct super_block *sb);
+int au_sbr_realloc(struct au_sbinfo *sbinfo, int nbr);
+
+unsigned int au_sigen_inc(struct super_block *sb);
+aufs_bindex_t au_new_br_id(struct super_block *sb);
+
+void aufs_read_lock(struct dentry *dentry, int flags);
+void aufs_read_unlock(struct dentry *dentry, int flags);
+void aufs_write_lock(struct dentry *dentry);
+void aufs_write_unlock(struct dentry *dentry);
+void aufs_read_and_write_lock2(struct dentry *d1, struct dentry *d2, int isdir);
+void aufs_read_and_write_unlock2(struct dentry *d1, struct dentry *d2);
+
+/* wbr_policy.c */
+extern struct au_wbr_copyup_operations au_wbr_copyup_ops[];
+extern struct au_wbr_create_operations au_wbr_create_ops[];
+int au_cpdown_dirs(struct dentry *dentry, aufs_bindex_t bdst);
+
+/* ---------------------------------------------------------------------- */
+
+static inline struct au_sbinfo *au_sbi(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* lock superblock. mainly for entry point functions */
+/*
+ * si_noflush_read_lock, si_noflush_write_lock,
+ * si_read_unlock, si_write_unlock, si_downgrade_lock
+ */
+AuSimpleLockRwsemFuncs(si_noflush, struct super_block *sb,
+		       &au_sbi(sb)->si_rwsem);
+AuSimpleUnlockRwsemFuncs(si, struct super_block *sb, &au_sbi(sb)->si_rwsem);
+
+static inline void si_read_lock(struct super_block *sb, int flags)
+{
+	if (au_ftest_lock(flags, FLUSH))
+		au_nwt_flush(&au_sbi(sb)->si_nowait);
+	si_noflush_read_lock(sb);
+}
+
+static inline void si_write_lock(struct super_block *sb)
+{
+	au_nwt_flush(&au_sbi(sb)->si_nowait);
+	si_noflush_write_lock(sb);
+}
+
+static inline int si_read_trylock(struct super_block *sb, int flags)
+{
+	if (au_ftest_lock(flags, FLUSH))
+		au_nwt_flush(&au_sbi(sb)->si_nowait);
+	return si_noflush_read_trylock(sb);
+}
+
+static inline int si_write_trylock(struct super_block *sb, int flags)
+{
+	if (au_ftest_lock(flags, FLUSH))
+		au_nwt_flush(&au_sbi(sb)->si_nowait);
+	return si_noflush_write_trylock(sb);
+}
+
+/* ---------------------------------------------------------------------- */
+
+static inline aufs_bindex_t au_sbend(struct super_block *sb)
+{
+	return au_sbi(sb)->si_bend;
+}
+
+static inline unsigned int au_mntflags(struct super_block *sb)
+{
+	return au_sbi(sb)->si_mntflags;
+}
+
+static inline unsigned int au_sigen(struct super_block *sb)
+{
+	return au_sbi(sb)->si_generation;
+}
+
+static inline struct au_branch *au_sbr(struct super_block *sb,
+				       aufs_bindex_t bindex)
+{
+	return au_sbi(sb)->si_branch[0 + bindex];
+}
+
+static inline void au_xino_brid_set(struct super_block *sb, aufs_bindex_t brid)
+{
+	au_sbi(sb)->si_xino_brid = brid;
+}
+
+static inline aufs_bindex_t au_xino_brid(struct super_block *sb)
+{
+	return au_sbi(sb)->si_xino_brid;
+}
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_SUPER_H__ */
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 05/28] aufs branch directory/filesystem
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (3 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 04/28] aufs super_block J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 06/28] aufs xino J. R. Okajima
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
branch management

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/branch.c |  944 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/branch.h |  200 ++++++++++++
 2 files changed, 1144 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/branch.c
 create mode 100644 fs/aufs/branch.h

diff --git a/fs/aufs/branch.c b/fs/aufs/branch.c
new file mode 100644
index 0000000..232f5b3
--- /dev/null
+++ b/fs/aufs/branch.c
@@ -0,0 +1,944 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * branch management
+ */
+
+#include "aufs.h"
+
+/*
+ * free a single branch
+ */
+static void au_br_do_free(struct au_branch *br)
+{
+	int i;
+	struct au_wbr *wbr;
+
+	if (br->br_xino.xi_file)
+		fput(br->br_xino.xi_file);
+	mutex_destroy(&br->br_xino.xi_nondir_mtx);
+
+	AuDebugOn(atomic_read(&br->br_count));
+
+	wbr = br->br_wbr;
+	if (wbr) {
+		for (i = 0; i < AuBrWh_Last; i++)
+			dput(wbr->wbr_wh[i]);
+		AuDebugOn(atomic_read(&wbr->wbr_wh_running));
+		au_rwsem_destroy(&wbr->wbr_wh_rwsem);
+	}
+
+	/* some filesystems acquire extra lock */
+	lockdep_off();
+	mntput(br->br_mnt);
+	lockdep_on();
+
+	kfree(wbr);
+	kfree(br);
+}
+
+/*
+ * frees all branches
+ */
+void au_br_free(struct au_sbinfo *sbinfo)
+{
+	aufs_bindex_t bmax;
+	struct au_branch **br;
+
+	bmax = sbinfo->si_bend + 1;
+	br = sbinfo->si_branch;
+	while (bmax--)
+		au_br_do_free(*br++);
+}
+
+/*
+ * find the index of a branch which is specified by @br_id.
+ */
+int au_br_index(struct super_block *sb, aufs_bindex_t br_id)
+{
+	aufs_bindex_t bindex, bend;
+
+	bend = au_sbend(sb);
+	for (bindex = 0; bindex <= bend; bindex++)
+		if (au_sbr_id(sb, bindex) == br_id)
+			return bindex;
+	return -1;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * add a branch
+ */
+
+static int test_overlap(struct super_block *sb, struct dentry *h_d1,
+			struct dentry *h_d2)
+{
+	if (unlikely(h_d1 == h_d2))
+		return 1;
+	return !!au_test_subdir(h_d1, h_d2)
+		|| !!au_test_subdir(h_d2, h_d1)
+		|| au_test_loopback_overlap(sb, h_d1, h_d2)
+		|| au_test_loopback_overlap(sb, h_d2, h_d1);
+}
+
+/*
+ * returns a newly allocated branch. @new_nbranch is a number of branches
+ * after adding a branch.
+ */
+static struct au_branch *au_br_alloc(struct super_block *sb, int new_nbranch,
+				     int perm)
+{
+	struct au_branch *add_branch;
+	struct dentry *root;
+
+	root = sb->s_root;
+	add_branch = kmalloc(sizeof(*add_branch), GFP_NOFS);
+	if (unlikely(!add_branch))
+		goto out;
+
+	add_branch->br_wbr = NULL;
+	if (au_br_writable(perm)) {
+		/* may be freed separately at changing the branch permission */
+		add_branch->br_wbr = kmalloc(sizeof(*add_branch->br_wbr),
+					     GFP_NOFS);
+		if (unlikely(!add_branch->br_wbr))
+			goto out_br;
+	}
+
+	if (unlikely(au_sbr_realloc(au_sbi(sb), new_nbranch)
+		     || au_di_realloc(au_di(root), new_nbranch)
+		     || au_ii_realloc(au_ii(root->d_inode), new_nbranch)))
+		goto out_wbr;
+	return add_branch; /* success */
+
+ out_wbr:
+	kfree(add_branch->br_wbr);
+ out_br:
+	kfree(add_branch);
+ out:
+	return ERR_PTR(-ENOMEM);
+}
+
+/*
+ * test if the branch permission is legal or not.
+ */
+static int test_br(struct inode *inode, int brperm, char *path)
+{
+	int err;
+
+	err = 0;
+	if (unlikely(au_br_writable(brperm) && IS_RDONLY(inode))) {
+		AuErr("write permission for readonly mount or inode, %s\n",
+		      path);
+		err = -EINVAL;
+	}
+
+	return err;
+}
+
+/*
+ * returns:
+ * 0: success, the caller will add it
+ * plus: success, it is already unified, the caller should ignore it
+ * minus: error
+ */
+static int test_add(struct super_block *sb, struct au_opt_add *add, int remount)
+{
+	int err;
+	aufs_bindex_t bend, bindex;
+	struct dentry *root;
+	struct inode *inode, *h_inode;
+
+	root = sb->s_root;
+	bend = au_sbend(sb);
+	if (unlikely(bend >= 0
+		     && au_find_dbindex(root, add->path.dentry) >= 0)) {
+		err = 1;
+		if (!remount) {
+			err = -EINVAL;
+			AuErr("%s duplicated\n", add->pathname);
+		}
+		goto out;
+	}
+
+	err = -ENOSPC; /* -E2BIG; */
+	if (unlikely(AUFS_BRANCH_MAX <= add->bindex
+		     || AUFS_BRANCH_MAX - 1 <= bend)) {
+		AuErr("number of branches exceeded %s\n", add->pathname);
+		goto out;
+	}
+
+	err = -EDOM;
+	if (unlikely(add->bindex < 0 || bend + 1 < add->bindex)) {
+		AuErr("bad index %d\n", add->bindex);
+		goto out;
+	}
+
+	inode = add->path.dentry->d_inode;
+	err = -ENOENT;
+	if (unlikely(!inode->i_nlink)) {
+		AuErr("no existence %s\n", add->pathname);
+		goto out;
+	}
+
+	err = -EINVAL;
+	if (unlikely(inode->i_sb == sb)) {
+		AuErr("%s must be outside\n", add->pathname);
+		goto out;
+	}
+
+	if (unlikely(au_test_fs_unsuppoted(inode->i_sb))) {
+		AuErr("unsupported filesystem, %s (%s)\n",
+		      add->pathname, au_sbtype(inode->i_sb));
+		goto out;
+	}
+
+	err = test_br(add->path.dentry->d_inode, add->perm, add->pathname);
+	if (unlikely(err))
+		goto out;
+
+	if (bend < 0)
+		return 0; /* success */
+
+	err = -EINVAL;
+	for (bindex = 0; bindex <= bend; bindex++)
+		if (unlikely(test_overlap(sb, add->path.dentry,
+					  au_h_dptr(root, bindex)))) {
+			AuErr("%s is overlapped\n", add->pathname);
+			goto out;
+		}
+
+	err = 0;
+	if (au_opt_test(au_mntflags(sb), WARN_PERM)) {
+		h_inode = au_h_dptr(root, 0)->d_inode;
+		if ((h_inode->i_mode & S_IALLUGO) != (inode->i_mode & S_IALLUGO)
+		    || h_inode->i_uid != inode->i_uid
+		    || h_inode->i_gid != inode->i_gid)
+			AuWarn("uid/gid/perm %s %u/%u/0%o, %u/%u/0%o\n",
+			       add->pathname,
+			       inode->i_uid, inode->i_gid,
+			       (inode->i_mode & S_IALLUGO),
+			       h_inode->i_uid, h_inode->i_gid,
+			       (h_inode->i_mode & S_IALLUGO));
+	}
+
+ out:
+	return err;
+}
+
+/*
+ * initialize or clean the whiteouts for an adding branch
+ */
+static int au_br_init_wh(struct super_block *sb, struct au_branch *br,
+			 int new_perm, struct dentry *h_root)
+{
+	int err, old_perm;
+	aufs_bindex_t bindex;
+	struct mutex *h_mtx;
+	struct au_wbr *wbr;
+	struct au_hinode *hdir;
+
+	wbr = br->br_wbr;
+	old_perm = br->br_perm;
+	br->br_perm = new_perm;
+	hdir = NULL;
+	h_mtx = NULL;
+	bindex = au_br_index(sb, br->br_id);
+	if (0 <= bindex) {
+		hdir = au_hi(sb->s_root->d_inode, bindex);
+		au_hin_imtx_lock_nested(hdir, AuLsc_I_PARENT);
+	} else {
+		h_mtx = &h_root->d_inode->i_mutex;
+		mutex_lock_nested(h_mtx, AuLsc_I_PARENT);
+	}
+	if (!wbr)
+		err = au_wh_init(h_root, br, sb);
+	else {
+		wbr_wh_write_lock(wbr);
+		err = au_wh_init(h_root, br, sb);
+		wbr_wh_write_unlock(wbr);
+	}
+	if (hdir)
+		au_hin_imtx_unlock(hdir);
+	else
+		mutex_unlock(h_mtx);
+	br->br_perm = old_perm;
+
+	if (!err && wbr && !au_br_writable(new_perm)) {
+		kfree(wbr);
+		br->br_wbr = NULL;
+	}
+
+	return err;
+}
+
+static int au_wbr_init(struct au_branch *br, struct super_block *sb,
+		       int perm, struct path *path)
+{
+	int err;
+	struct au_wbr *wbr;
+
+	wbr = br->br_wbr;
+	init_rwsem(&wbr->wbr_wh_rwsem);
+	memset(wbr->wbr_wh, 0, sizeof(wbr->wbr_wh));
+	atomic_set(&wbr->wbr_wh_running, 0);
+	wbr->wbr_bytes = 0;
+
+	err = au_br_init_wh(sb, br, perm, path->dentry);
+
+	return err;
+}
+
+/* intialize a new branch */
+static int au_br_init(struct au_branch *br, struct super_block *sb,
+		      struct au_opt_add *add)
+{
+	int err;
+
+	err = 0;
+	br->br_xino.xi_file = NULL;
+	mutex_init(&br->br_xino.xi_nondir_mtx);
+	br->br_perm = add->perm;
+	br->br_mnt = add->path.mnt; /* set first, mntget() later */
+	atomic_set(&br->br_count, 0);
+	br->br_xino_upper = AUFS_XINO_TRUNC_INIT;
+	atomic_set(&br->br_xino_running, 0);
+	br->br_id = au_new_br_id(sb);
+
+	if (au_br_writable(add->perm)) {
+		err = au_wbr_init(br, sb, add->perm, &add->path);
+		if (unlikely(err))
+			goto out;
+	}
+
+	if (au_opt_test(au_mntflags(sb), XINO)) {
+		err = au_xino_br(sb, br, add->path.dentry->d_inode->i_ino,
+				 au_sbr(sb, 0)->br_xino.xi_file, /*do_test*/1);
+		if (unlikely(err)) {
+			AuDebugOn(br->br_xino.xi_file);
+			goto out;
+		}
+	}
+
+	sysaufs_br_init(br);
+	mntget(add->path.mnt);
+
+ out:
+	return err;
+}
+
+static void au_br_do_add_brp(struct au_sbinfo *sbinfo, aufs_bindex_t bindex,
+			     struct au_branch *br, aufs_bindex_t bend,
+			     aufs_bindex_t amount)
+{
+	struct au_branch **brp;
+
+	brp = sbinfo->si_branch + bindex;
+	memmove(brp + 1, brp, sizeof(*brp) * amount);
+	*brp = br;
+	sbinfo->si_bend++;
+	if (unlikely(bend < 0))
+		sbinfo->si_bend = 0;
+}
+
+static void au_br_do_add_hdp(struct au_dinfo *dinfo, aufs_bindex_t bindex,
+			     aufs_bindex_t bend, aufs_bindex_t amount)
+{
+	struct au_hdentry *hdp;
+
+	hdp = dinfo->di_hdentry + bindex;
+	memmove(hdp + 1, hdp, sizeof(*hdp) * amount);
+	au_h_dentry_init(hdp);
+	dinfo->di_bend++;
+	if (unlikely(bend < 0))
+		dinfo->di_bstart = 0;
+}
+
+static void au_br_do_add_hip(struct au_iinfo *iinfo, aufs_bindex_t bindex,
+			     aufs_bindex_t bend, aufs_bindex_t amount)
+{
+	struct au_hinode *hip;
+
+	hip = iinfo->ii_hinode + bindex;
+	memmove(hip + 1, hip, sizeof(*hip) * amount);
+	hip->hi_inode = NULL;
+	au_hin_init(hip, NULL);
+	iinfo->ii_bend++;
+	if (unlikely(bend < 0))
+		iinfo->ii_bstart = 0;
+}
+
+static void au_br_do_add(struct super_block *sb, struct dentry *h_dentry,
+			 struct au_branch *br, aufs_bindex_t bindex)
+{
+	struct dentry *root;
+	struct inode *root_inode;
+	aufs_bindex_t bend, amount;
+
+	root = sb->s_root;
+	root_inode = root->d_inode;
+	au_plink_block_maintain(sb);
+	bend = au_sbend(sb);
+	amount = bend + 1 - bindex;
+	au_br_do_add_brp(au_sbi(sb), bindex, br, bend, amount);
+	au_br_do_add_hdp(au_di(root), bindex, bend, amount);
+	au_br_do_add_hip(au_ii(root_inode), bindex, bend, amount);
+	au_set_h_dptr(root, bindex, dget(h_dentry));
+	au_set_h_iptr(root_inode, bindex, au_igrab(h_dentry->d_inode),
+		      /*flags*/0);
+}
+
+int au_br_add(struct super_block *sb, struct au_opt_add *add, int remount)
+{
+	int err;
+	unsigned long long maxb;
+	aufs_bindex_t bend, add_bindex;
+	struct dentry *root, *h_dentry;
+	struct inode *root_inode;
+	struct au_branch *add_branch;
+
+	root = sb->s_root;
+	root_inode = root->d_inode;
+	IMustLock(root_inode);
+	err = test_add(sb, add, remount);
+	if (unlikely(err < 0))
+		goto out;
+	if (err) {
+		err = 0;
+		goto out; /* success */
+	}
+
+	bend = au_sbend(sb);
+	add_branch = au_br_alloc(sb, bend + 2, add->perm);
+	err = PTR_ERR(add_branch);
+	if (IS_ERR(add_branch))
+		goto out;
+
+	err = au_br_init(add_branch, sb, add);
+	if (unlikely(err)) {
+		au_br_do_free(add_branch);
+		goto out;
+	}
+
+	add_bindex = add->bindex;
+	h_dentry = add->path.dentry;
+	if (!remount)
+		au_br_do_add(sb, h_dentry, add_branch, add_bindex);
+	else {
+		sysaufs_brs_del(sb, add_bindex);
+		au_br_do_add(sb, h_dentry, add_branch, add_bindex);
+		sysaufs_brs_add(sb, add_bindex);
+	}
+
+	if (!add_bindex)
+		au_cpup_attr_all(root_inode, /*force*/1);
+	else
+		au_add_nlink(root_inode, h_dentry->d_inode);
+	maxb = h_dentry->d_sb->s_maxbytes;
+	if (sb->s_maxbytes < maxb)
+		sb->s_maxbytes = maxb;
+
+	/*
+	 * this test/set prevents aufs from handling unnecesary inotify events
+	 * of xino files, in a case of re-adding a writable branch which was
+	 * once detached from aufs.
+	 */
+	if (au_xino_brid(sb) < 0
+	    && au_br_writable(add_branch->br_perm)
+	    && !au_test_fs_bad_xino(h_dentry->d_sb)
+	    && add_branch->br_xino.xi_file
+	    && add_branch->br_xino.xi_file->f_dentry->d_parent == h_dentry)
+		au_xino_brid_set(sb, add_branch->br_id);
+
+ out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * delete a branch
+ */
+
+/* to show the line number, do not make it inlined function */
+#define AuVerbose(do_info, fmt, args...) do { \
+	if (do_info) \
+		AuInfo(fmt, ##args); \
+} while (0)
+
+/*
+ * test if the branch is deletable or not.
+ */
+static int test_dentry_busy(struct dentry *root, aufs_bindex_t bindex,
+			    unsigned int sigen)
+{
+	int err, i, j, ndentry;
+	aufs_bindex_t bstart, bend;
+	unsigned char verbose;
+	struct au_dcsub_pages dpages;
+	struct au_dpage *dpage;
+	struct dentry *d;
+	struct inode *inode;
+
+	err = au_dpages_init(&dpages, GFP_NOFS);
+	if (unlikely(err))
+		goto out;
+	err = au_dcsub_pages(&dpages, root, NULL, NULL);
+	if (unlikely(err))
+		goto out_dpages;
+
+	verbose = !!au_opt_test(au_mntflags(root->d_sb), VERBOSE);
+	for (i = 0; !err && i < dpages.ndpage; i++) {
+		dpage = dpages.dpages + i;
+		ndentry = dpage->ndentry;
+		for (j = 0; !err && j < ndentry; j++) {
+			d = dpage->dentries[j];
+			AuDebugOn(!atomic_read(&d->d_count));
+			inode = d->d_inode;
+			if (au_digen(d) == sigen && au_iigen(inode) == sigen)
+				di_read_lock_child(d, AuLock_IR);
+			else {
+				di_write_lock_child(d);
+				err = au_reval_dpath(d, sigen);
+				if (!err)
+					di_downgrade_lock(d, AuLock_IR);
+				else {
+					di_write_unlock(d);
+					break;
+				}
+			}
+
+			bstart = au_dbstart(d);
+			bend = au_dbend(d);
+			if (bstart <= bindex
+			    && bindex <= bend
+			    && au_h_dptr(d, bindex)
+			    && (!S_ISDIR(inode->i_mode) || bstart == bend)) {
+				err = -EBUSY;
+				AuVerbose(verbose, "busy %.*s\n", AuDLNPair(d));
+			}
+			di_read_unlock(d, AuLock_IR);
+		}
+	}
+
+ out_dpages:
+	au_dpages_free(&dpages);
+ out:
+	return err;
+}
+
+static int test_inode_busy(struct super_block *sb, aufs_bindex_t bindex,
+			   unsigned int sigen)
+{
+	int err;
+	struct inode *i;
+	aufs_bindex_t bstart, bend;
+	unsigned char verbose;
+
+	err = 0;
+	verbose = !!au_opt_test(au_mntflags(sb), VERBOSE);
+	list_for_each_entry(i, &sb->s_inodes, i_sb_list) {
+		AuDebugOn(!atomic_read(&i->i_count));
+		if (!list_empty(&i->i_dentry))
+			continue;
+
+		if (au_iigen(i) == sigen)
+			ii_read_lock_child(i);
+		else {
+			ii_write_lock_child(i);
+			err = au_refresh_hinode_self(i, /*do_attr*/1);
+			if (!err)
+				ii_downgrade_lock(i);
+			else {
+				ii_write_unlock(i);
+				break;
+			}
+		}
+
+		bstart = au_ibstart(i);
+		bend = au_ibend(i);
+		if (bstart <= bindex
+		    && bindex <= bend
+		    && au_h_iptr(i, bindex)
+		    && (!S_ISDIR(i->i_mode) || bstart == bend)) {
+			err = -EBUSY;
+			AuVerbose(verbose, "busy i%lu\n", i->i_ino);
+			ii_read_unlock(i);
+			break;
+		}
+		ii_read_unlock(i);
+	}
+
+	return err;
+}
+
+static int test_children_busy(struct dentry *root, aufs_bindex_t bindex)
+{
+	int err;
+	unsigned int sigen;
+
+	sigen = au_sigen(root->d_sb);
+	DiMustNoWaiters(root);
+	IiMustNoWaiters(root->d_inode);
+	di_write_unlock(root);
+	err = test_dentry_busy(root, bindex, sigen);
+	if (!err)
+		err = test_inode_busy(root->d_sb, bindex, sigen);
+	di_write_lock_child(root); /* aufs_write_lock() calls ..._child() */
+
+	return err;
+}
+
+static void au_br_do_del_brp(struct au_sbinfo *sbinfo,
+			     const aufs_bindex_t bindex,
+			     const aufs_bindex_t bend)
+{
+	struct au_branch **brp, **p;
+
+	brp = sbinfo->si_branch + bindex;
+	if (bindex < bend)
+		memmove(brp, brp + 1, sizeof(*brp) * (bend - bindex));
+	sbinfo->si_branch[0 + bend] = NULL;
+	sbinfo->si_bend--;
+
+	p = krealloc(sbinfo->si_branch, sizeof(*p) * bend, GFP_NOFS);
+	if (p)
+		sbinfo->si_branch = p;
+}
+
+static void au_br_do_del_hdp(struct au_dinfo *dinfo, const aufs_bindex_t bindex,
+			     const aufs_bindex_t bend)
+{
+	struct au_hdentry *hdp, *p;
+
+	hdp = dinfo->di_hdentry + bindex;
+	if (bindex < bend)
+		memmove(hdp, hdp + 1, sizeof(*hdp) * (bend - bindex));
+	dinfo->di_hdentry[0 + bend].hd_dentry = NULL;
+	dinfo->di_bend--;
+
+	p = krealloc(dinfo->di_hdentry, sizeof(*p) * bend, GFP_NOFS);
+	if (p)
+		dinfo->di_hdentry = p;
+}
+
+static void au_br_do_del_hip(struct au_iinfo *iinfo, const aufs_bindex_t bindex,
+			     const aufs_bindex_t bend)
+{
+	struct au_hinode *hip, *p;
+
+	hip = iinfo->ii_hinode + bindex;
+	if (bindex < bend)
+		memmove(hip, hip + 1, sizeof(*hip) * (bend - bindex));
+	iinfo->ii_hinode[0 + bend].hi_inode = NULL;
+	au_hin_init(iinfo->ii_hinode + bend, NULL);
+	iinfo->ii_bend--;
+
+	p = krealloc(iinfo->ii_hinode, sizeof(*p) * bend, GFP_NOFS);
+	if (p)
+		iinfo->ii_hinode = p;
+}
+
+static void au_br_do_del(struct super_block *sb, aufs_bindex_t bindex,
+			 struct au_branch *br)
+{
+	aufs_bindex_t bend;
+	struct au_sbinfo *sbinfo;
+	struct dentry *root;
+	struct inode *inode;
+
+	root = sb->s_root;
+	inode = root->d_inode;
+	au_plink_block_maintain(sb);
+	sbinfo = au_sbi(sb);
+	bend = sbinfo->si_bend;
+
+	dput(au_h_dptr(root, bindex));
+	au_hiput(au_hi(inode, bindex));
+	au_br_do_free(br);
+
+	au_br_do_del_brp(sbinfo, bindex, bend);
+	au_br_do_del_hdp(au_di(root), bindex, bend);
+	au_br_do_del_hip(au_ii(inode), bindex, bend);
+}
+
+int au_br_del(struct super_block *sb, struct au_opt_del *del, int remount)
+{
+	int err, rerr, i;
+	unsigned int mnt_flags;
+	aufs_bindex_t bindex, bend, br_id;
+	unsigned char do_wh, verbose;
+	struct au_branch *br;
+	struct au_wbr *wbr;
+
+	err = 0;
+	bindex = au_find_dbindex(sb->s_root, del->h_path.dentry);
+	if (bindex < 0) {
+		if (remount)
+			goto out; /* success */
+		err = -ENOENT;
+		AuErr("%s no such branch\n", del->pathname);
+		goto out;
+	}
+	AuDbg("bindex b%d\n", bindex);
+
+	err = -EBUSY;
+	mnt_flags = au_mntflags(sb);
+	verbose = !!au_opt_test(mnt_flags, VERBOSE);
+	bend = au_sbend(sb);
+	if (unlikely(!bend)) {
+		AuVerbose(verbose, "no more branches left\n");
+		goto out;
+	}
+	br = au_sbr(sb, bindex);
+	i = atomic_read(&br->br_count);
+	if (unlikely(i)) {
+		AuVerbose(verbose, "%d file(s) opened\n", i);
+		goto out;
+	}
+
+	wbr = br->br_wbr;
+	do_wh = wbr && (wbr->wbr_whbase || wbr->wbr_plink || wbr->wbr_orph);
+	if (do_wh) {
+		for (i = 0; i < AuBrWh_Last; i++) {
+			dput(wbr->wbr_wh[i]);
+			wbr->wbr_wh[i] = NULL;
+		}
+	}
+
+	err = test_children_busy(sb->s_root, bindex);
+	if (unlikely(err)) {
+		if (do_wh)
+			goto out_wh;
+		goto out;
+	}
+
+	err = 0;
+	br_id = br->br_id;
+	if (!remount)
+		au_br_do_del(sb, bindex, br);
+	else {
+		sysaufs_brs_del(sb, bindex);
+		au_br_do_del(sb, bindex, br);
+		sysaufs_brs_add(sb, bindex);
+	}
+
+	if (!bindex)
+		au_cpup_attr_all(sb->s_root->d_inode, /*force*/1);
+	else
+		au_sub_nlink(sb->s_root->d_inode, del->h_path.dentry->d_inode);
+	if (au_opt_test(mnt_flags, PLINK))
+		au_plink_half_refresh(sb, br_id);
+
+	if (sb->s_maxbytes == del->h_path.dentry->d_sb->s_maxbytes) {
+		bend--;
+		sb->s_maxbytes = 0;
+		for (bindex = 0; bindex <= bend; bindex++) {
+			unsigned long long maxb;
+
+			maxb = au_sbr_sb(sb, bindex)->s_maxbytes;
+			if (sb->s_maxbytes < maxb)
+				sb->s_maxbytes = maxb;
+		}
+	}
+
+	if (au_xino_brid(sb) == br->br_id)
+		au_xino_brid_set(sb, -1);
+	goto out; /* success */
+
+ out_wh:
+	/* revert */
+	rerr = au_br_init_wh(sb, br, br->br_perm, del->h_path.dentry);
+	if (rerr)
+		AuWarn("failed re-creating base whiteout, %s. (%d)\n",
+		       del->pathname, rerr);
+ out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * change a branch permission
+ */
+
+static int do_need_sigen_inc(int a, int b)
+{
+	return au_br_whable(a) && !au_br_whable(b);
+}
+
+static int need_sigen_inc(int old, int new)
+{
+	return do_need_sigen_inc(old, new)
+		|| do_need_sigen_inc(new, old);
+}
+
+static int au_br_mod_files_ro(struct super_block *sb, aufs_bindex_t bindex)
+{
+	int err;
+	unsigned long n, ul, bytes, files;
+	aufs_bindex_t bstart;
+	struct file *file, *hf, **a;
+	const int step_bytes = 1024, /* memory allocation unit */
+		step_files = step_bytes / sizeof(*a);
+
+	err = -ENOMEM;
+	n = 0;
+	bytes = step_bytes;
+	files = step_files;
+	a = kmalloc(bytes, GFP_NOFS);
+	if (unlikely(!a))
+		goto out;
+
+	/* no need file_list_lock() since sbinfo is locked? defered? */
+	list_for_each_entry(file, &sb->s_files, f_u.fu_list) {
+		if (special_file(file->f_dentry->d_inode->i_mode))
+			continue;
+
+		AuDbg("%.*s\n", AuDLNPair(file->f_dentry));
+		fi_read_lock(file);
+		if (unlikely(au_test_mmapped(file))) {
+			err = -EBUSY;
+			FiMustNoWaiters(file);
+			fi_read_unlock(file);
+			goto out_free;
+		}
+
+		bstart = au_fbstart(file);
+		if (!S_ISREG(file->f_dentry->d_inode->i_mode)
+		    || !(file->f_mode & FMODE_WRITE)
+		    || bstart != bindex) {
+			FiMustNoWaiters(file);
+			fi_read_unlock(file);
+			continue;
+		}
+
+		hf = au_h_fptr(file, bstart);
+		FiMustNoWaiters(file);
+		fi_read_unlock(file);
+
+		if (n < files)
+			a[n++] = hf;
+		else {
+			void *p;
+
+			err = -ENOMEM;
+			bytes += step_bytes;
+			files += step_files;
+			p = krealloc(a, bytes, GFP_NOFS);
+			if (p) {
+				a = p;
+				a[n++] = hf;
+			} else
+				goto out_free;
+		}
+	}
+
+	err = 0;
+	for (ul = 0; ul < n; ul++) {
+		/* todo: already flushed? */
+		hf = a[ul];
+		hf->f_flags = au_file_roflags(hf->f_flags);
+		hf->f_mode &= ~FMODE_WRITE;
+		file_release_write(hf);
+		mnt_drop_write(hf->f_vfsmnt);
+	}
+
+ out_free:
+	kfree(a);
+ out:
+	return err;
+}
+
+int au_br_mod(struct super_block *sb, struct au_opt_mod *mod, int remount,
+	      int *do_update)
+{
+	int err, rerr;
+	aufs_bindex_t bindex;
+	struct dentry *root;
+	struct au_branch *br;
+
+	root = sb->s_root;
+	au_plink_block_maintain(sb);
+	bindex = au_find_dbindex(root, mod->h_root);
+	if (bindex < 0) {
+		if (remount)
+			return 0; /* success */
+		err = -ENOENT;
+		AuErr("%s no such branch\n", mod->path);
+		goto out;
+	}
+	AuDbg("bindex b%d\n", bindex);
+
+	err = test_br(mod->h_root->d_inode, mod->perm, mod->path);
+	if (unlikely(err))
+		goto out;
+
+	br = au_sbr(sb, bindex);
+	if (br->br_perm == mod->perm)
+		return 0; /* success */
+
+	if (au_br_writable(br->br_perm)) {
+		/* remove whiteout base */
+		err = au_br_init_wh(sb, br, mod->perm, mod->h_root);
+		if (unlikely(err))
+			goto out;
+
+		if (!au_br_writable(mod->perm)) {
+			/* rw --> ro, file might be mmapped */
+			DiMustNoWaiters(root);
+			IiMustNoWaiters(root->d_inode);
+			di_write_unlock(root);
+			err = au_br_mod_files_ro(sb, bindex);
+			/* aufs_write_lock() calls ..._child() */
+			di_write_lock_child(root);
+
+			if (unlikely(err)) {
+				rerr = -ENOMEM;
+				br->br_wbr = kmalloc(sizeof(*br->br_wbr),
+						     GFP_NOFS);
+				if (br->br_wbr)
+					rerr = au_br_init_wh
+						(sb, br, br->br_perm,
+						 mod->h_root);
+				if (unlikely(rerr)) {
+					AuIOErr("nested error %d (%d)\n",
+						rerr, err);
+					br->br_perm = mod->perm;
+				}
+			}
+		}
+	} else if (au_br_writable(mod->perm)) {
+		/* ro --> rw */
+		err = -ENOMEM;
+		br->br_wbr = kmalloc(sizeof(*br->br_wbr), GFP_NOFS);
+		if (br->br_wbr) {
+			struct path path = {
+				.mnt	= br->br_mnt,
+				.dentry	= mod->h_root
+			};
+
+			err = au_wbr_init(br, sb, mod->perm, &path);
+			if (unlikely(err)) {
+				kfree(br->br_wbr);
+				br->br_wbr = NULL;
+			}
+		}
+	}
+
+	if (!err) {
+		*do_update |= need_sigen_inc(br->br_perm, mod->perm);
+		br->br_perm = mod->perm;
+	}
+
+ out:
+	return err;
+}
diff --git a/fs/aufs/branch.h b/fs/aufs/branch.h
new file mode 100644
index 0000000..5a7fed4
--- /dev/null
+++ b/fs/aufs/branch.h
@@ -0,0 +1,200 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * branch filesystems and xino for them
+ */
+
+#ifndef __AUFS_BRANCH_H__
+#define __AUFS_BRANCH_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/sysfs.h>
+#include <linux/aufs_type.h>
+#include "rwsem.h"
+#include "super.h"
+
+/* ---------------------------------------------------------------------- */
+
+/* a xino file */
+struct au_xino_file {
+	struct file		*xi_file;
+	struct mutex		xi_nondir_mtx;
+
+	/* todo: make xino files an array to support huge inode number */
+};
+
+/* members for writable branch only */
+enum {AuBrWh_BASE, AuBrWh_PLINK, AuBrWh_ORPH, AuBrWh_Last};
+struct au_wbr {
+	struct rw_semaphore	wbr_wh_rwsem;
+	struct dentry		*wbr_wh[AuBrWh_Last];
+	atomic_t 		wbr_wh_running;
+#define wbr_whbase		wbr_wh[AuBrWh_BASE]	/* whiteout base */
+#define wbr_plink		wbr_wh[AuBrWh_PLINK]	/* pseudo-link dir */
+#define wbr_orph		wbr_wh[AuBrWh_ORPH]	/* dir for orphans */
+
+	/* mfs mode */
+	unsigned long long	wbr_bytes;
+};
+
+/* protected by superblock rwsem */
+struct au_branch {
+	struct au_xino_file	br_xino;
+
+	aufs_bindex_t		br_id;
+
+	int			br_perm;
+	struct vfsmount		*br_mnt;
+	atomic_t		br_count;
+
+	struct au_wbr		*br_wbr;
+
+	/* xino truncation */
+	blkcnt_t		br_xino_upper;	/* watermark in blocks */
+	atomic_t		br_xino_running;
+
+#ifdef CONFIG_SYSFS
+	/* an entry under sysfs per mount-point */
+	char			br_name[8];
+	struct attribute	br_attr;
+#endif
+};
+
+/* ---------------------------------------------------------------------- */
+
+/* branch permission and attribute */
+enum {
+	AuBrPerm_RW,		/* writable, linkable wh */
+	AuBrPerm_RO,		/* readonly, no wh */
+	AuBrPerm_RR,		/* natively readonly, no wh */
+
+	AuBrPerm_RWNoLinkWH,	/* un-linkable whiteouts */
+
+	AuBrPerm_ROWH,		/* whiteout-able */
+	AuBrPerm_RRWH,		/* whiteout-able */
+
+	AuBrPerm_Last
+};
+
+static inline int au_br_writable(int brperm)
+{
+	return brperm == AuBrPerm_RW || brperm == AuBrPerm_RWNoLinkWH;
+}
+
+static inline int au_br_whable(int brperm)
+{
+	return brperm == AuBrPerm_RW
+		|| brperm == AuBrPerm_ROWH
+		|| brperm == AuBrPerm_RRWH;
+}
+
+static inline int au_br_rdonly(struct au_branch *br)
+{
+	return ((br->br_mnt->mnt_sb->s_flags & MS_RDONLY)
+		|| !au_br_writable(br->br_perm))
+		? -EROFS : 0;
+}
+
+static inline int au_br_hinotifyable(int brperm __maybe_unused)
+{
+#ifdef CONFIG_AUFS_HINOTIFY
+	return brperm != AuBrPerm_RR && brperm != AuBrPerm_RRWH;
+#else
+	return 0;
+#endif
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* branch.c */
+struct au_sbinfo;
+void au_br_free(struct au_sbinfo *sinfo);
+int au_br_index(struct super_block *sb, aufs_bindex_t br_id);
+struct au_opt_add;
+int au_br_add(struct super_block *sb, struct au_opt_add *add, int remount);
+struct au_opt_del;
+int au_br_del(struct super_block *sb, struct au_opt_del *del, int remount);
+struct au_opt_mod;
+int au_br_mod(struct super_block *sb, struct au_opt_mod *mod, int remount,
+	      int *do_update);
+
+/* xino.c */
+int au_xib_trunc(struct super_block *sb);
+ssize_t xino_fread(au_readf_t func, struct file *file, void *buf, size_t size,
+		   loff_t *pos);
+ssize_t xino_fwrite(au_writef_t func, struct file *file, void *buf, size_t size,
+		    loff_t *pos);
+struct file *au_xino_create(struct super_block *sb, char *fname, int silent);
+ino_t au_xino_new_ino(struct super_block *sb);
+int au_xino_write0(struct super_block *sb, aufs_bindex_t bindex, ino_t h_ino,
+		   ino_t ino);
+int au_xino_write(struct super_block *sb, aufs_bindex_t bindex, ino_t h_ino,
+		  ino_t ino);
+int au_xino_read(struct super_block *sb, aufs_bindex_t bindex, ino_t h_ino,
+		 ino_t *ino);
+int au_xino_br(struct super_block *sb, struct au_branch *br, ino_t hino,
+	       struct file *base_file, int do_test);
+int au_xino_trunc(struct super_block *sb, aufs_bindex_t bindex);
+
+struct au_opt_xino;
+int au_xino_set(struct super_block *sb, struct au_opt_xino *xino, int remount);
+void au_xino_clr(struct super_block *sb);
+struct file *au_xino_def(struct super_block *sb);
+int au_xino_path(struct seq_file *seq, struct file *file);
+
+/* ---------------------------------------------------------------------- */
+
+/* Superblock to branch */
+static inline
+aufs_bindex_t au_sbr_id(struct super_block *sb, aufs_bindex_t bindex)
+{
+	return au_sbr(sb, bindex)->br_id;
+}
+
+static inline
+struct vfsmount *au_sbr_mnt(struct super_block *sb, aufs_bindex_t bindex)
+{
+	return au_sbr(sb, bindex)->br_mnt;
+}
+
+static inline
+struct super_block *au_sbr_sb(struct super_block *sb, aufs_bindex_t bindex)
+{
+	return au_sbr_mnt(sb, bindex)->mnt_sb;
+}
+
+static inline void au_sbr_put(struct super_block *sb, aufs_bindex_t bindex)
+{
+	atomic_dec(&au_sbr(sb, bindex)->br_count);
+}
+
+static inline int au_sbr_perm(struct super_block *sb, aufs_bindex_t bindex)
+{
+	return au_sbr(sb, bindex)->br_perm;
+}
+
+static inline int au_sbr_whable(struct super_block *sb, aufs_bindex_t bindex)
+{
+	return au_br_whable(au_sbr_perm(sb, bindex));
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * wbr_wh_read_lock, wbr_wh_write_lock
+ * wbr_wh_read_unlock, wbr_wh_write_unlock, wbr_wh_downgrade_lock
+ */
+AuSimpleRwsemFuncs(wbr_wh, struct au_wbr *wbr, &wbr->wbr_wh_rwsem);
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_BRANCH_H__ */
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 06/28] aufs xino
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (4 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 05/28] aufs branch directory/filesystem J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 07/28] aufs object lifetime management via sysfs J. R. Okajima
                   ` (21 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
external inode number translation table and bitmap

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/xino.c | 1181 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1181 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/xino.c

diff --git a/fs/aufs/xino.c b/fs/aufs/xino.c
new file mode 100644
index 0000000..bc19b86
--- /dev/null
+++ b/fs/aufs/xino.c
@@ -0,0 +1,1181 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * external inode number translation table and bitmap
+ */
+
+#include <linux/seq_file.h>
+#include <linux/uaccess.h>
+#include "aufs.h"
+
+ssize_t xino_fread(au_readf_t func, struct file *file, void *buf, size_t size,
+		   loff_t *pos)
+{
+	ssize_t err;
+	mm_segment_t oldfs;
+
+	oldfs = get_fs();
+	set_fs(KERNEL_DS);
+	do {
+		/* todo: signal_pending? */
+		err = func(file, (char __user *)buf, size, pos);
+	} while (err == -EAGAIN || err == -EINTR);
+	set_fs(oldfs);
+
+#if 0 /* reserved for future use */
+	if (err > 0)
+		fsnotify_access(file->f_dentry);
+#endif
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static ssize_t do_xino_fwrite(au_writef_t func, struct file *file, void *buf,
+			      size_t size, loff_t *pos)
+{
+	ssize_t err;
+	mm_segment_t oldfs;
+
+	oldfs = get_fs();
+	set_fs(KERNEL_DS);
+	lockdep_off();
+	do {
+		/* todo: signal_pending? */
+		err = func(file, (const char __user *)buf, size, pos);
+	} while (err == -EAGAIN || err == -EINTR);
+	lockdep_on();
+	set_fs(oldfs);
+
+#if 0 /* reserved for future use */
+	if (err > 0)
+		fsnotify_modify(file->f_dentry);
+#endif
+
+	return err;
+}
+
+struct do_xino_fwrite_args {
+	ssize_t *errp;
+	au_writef_t func;
+	struct file *file;
+	void *buf;
+	size_t size;
+	loff_t *pos;
+};
+
+static void call_do_xino_fwrite(void *args)
+{
+	struct do_xino_fwrite_args *a = args;
+	*a->errp = do_xino_fwrite(a->func, a->file, a->buf, a->size, a->pos);
+}
+
+ssize_t xino_fwrite(au_writef_t func, struct file *file, void *buf, size_t size,
+		    loff_t *pos)
+{
+	ssize_t err;
+
+	/* todo: signal block and no wkq? */
+	/* todo: new credential scheme */
+	/*
+	 * it breaks RLIMIT_FSIZE and normal user's limit,
+	 * users should care about quota and real 'filesystem full.'
+	 */
+	if (!au_test_wkq(current)) {
+		int wkq_err;
+		struct do_xino_fwrite_args args = {
+			.errp	= &err,
+			.func	= func,
+			.file	= file,
+			.buf	= buf,
+			.size	= size,
+			.pos	= pos
+		};
+
+		wkq_err = au_wkq_wait(call_do_xino_fwrite, &args);
+		if (unlikely(wkq_err))
+			err = wkq_err;
+	} else
+		err = do_xino_fwrite(func, file, buf, size, pos);
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * create a new xinofile at the same place/path as @base_file.
+ */
+static struct file *au_xino_create2(struct file *base_file,
+				    struct file *copy_src)
+{
+	struct file *file;
+	struct dentry *base, *dentry, *parent;
+	struct inode *dir;
+	struct qstr *name;
+	int err;
+
+	base = base_file->f_dentry;
+	parent = base->d_parent; /* dir inode is locked */
+	dir = parent->d_inode;
+	IMustLock(dir);
+
+	file = ERR_PTR(-EINVAL);
+	name = &base->d_name;
+	dentry = vfsub_lookup_one_len(name->name, parent, name->len);
+	if (IS_ERR(dentry)) {
+		file = (void *)dentry;
+		AuErr("%.*s lookup err %ld\n", AuLNPair(name), PTR_ERR(dentry));
+		goto out;
+	}
+
+	/* no need to mnt_want_write() since we call dentry_open() later */
+	err = vfs_create(dir, dentry, S_IRUGO | S_IWUGO, NULL);
+	if (unlikely(err)) {
+		file = ERR_PTR(err);
+		AuErr("%.*s create err %d\n", AuLNPair(name), err);
+		goto out_dput;
+	}
+
+	file = dentry_open(dget(dentry), mntget(base_file->f_vfsmnt),
+			   O_RDWR | O_CREAT | O_EXCL | O_LARGEFILE,
+			   current_cred());
+	if (IS_ERR(file)) {
+		AuErr("%.*s open err %ld\n", AuLNPair(name), PTR_ERR(file));
+		goto out_dput;
+	}
+
+	err = vfsub_unlink(dir, &file->f_path, /*force*/0);
+	if (unlikely(err)) {
+		AuErr("%.*s unlink err %d\n", AuLNPair(name), err);
+		goto out_fput;
+	}
+
+	if (copy_src) {
+		/* no one can touch copy_src xino */
+		err = au_copy_file(file, copy_src,
+				   i_size_read(copy_src->f_dentry->d_inode));
+		if (unlikely(err)) {
+			AuErr("%.*s copy err %d\n", AuLNPair(name), err);
+			goto out_fput;
+		}
+	}
+	goto out_dput; /* success */
+
+ out_fput:
+	fput(file);
+	file = ERR_PTR(err);
+ out_dput:
+	dput(dentry);
+ out:
+	return file;
+}
+
+struct au_xino_lock_dir {
+	struct au_hinode *hdir;
+	struct dentry *parent;
+	struct mutex *mtx;
+};
+
+static void au_xino_lock_dir(struct super_block *sb, struct file *xino,
+			     struct au_xino_lock_dir *ldir)
+{
+	aufs_bindex_t brid, bindex;
+
+	ldir->hdir = NULL;
+	bindex = -1;
+	brid = au_xino_brid(sb);
+	if (brid >= 0)
+		bindex = au_br_index(sb, brid);
+	if (bindex >= 0) {
+		ldir->hdir = au_hi(sb->s_root->d_inode, bindex);
+		au_hin_imtx_lock_nested(ldir->hdir, AuLsc_I_PARENT);
+	} else {
+		ldir->parent = dget_parent(xino->f_dentry);
+		ldir->mtx = &ldir->parent->d_inode->i_mutex;
+		mutex_lock_nested(ldir->mtx, AuLsc_I_PARENT);
+	}
+}
+
+static void au_xino_unlock_dir(struct au_xino_lock_dir *ldir)
+{
+	if (ldir->hdir)
+		au_hin_imtx_unlock(ldir->hdir);
+	else {
+		mutex_unlock(ldir->mtx);
+		dput(ldir->parent);
+	}
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* trucate xino files asynchronously */
+
+int au_xino_trunc(struct super_block *sb, aufs_bindex_t bindex)
+{
+	int err;
+	aufs_bindex_t bi, bend;
+	struct au_branch *br;
+	struct file *new_xino, *file;
+	struct super_block *h_sb;
+	struct au_xino_lock_dir ldir;
+
+	err = -EINVAL;
+	bend = au_sbend(sb);
+	if (unlikely(bindex < 0 || bend < bindex))
+		goto out;
+	br = au_sbr(sb, bindex);
+	file = br->br_xino.xi_file;
+	if (!file)
+		goto out;
+
+	au_xino_lock_dir(sb, file, &ldir);
+	/* mnt_want_write() is unnecessary here */
+	new_xino = au_xino_create2(file, file);
+	au_xino_unlock_dir(&ldir);
+	err = PTR_ERR(new_xino);
+	if (IS_ERR(new_xino))
+		goto out;
+	err = 0;
+	fput(file);
+	br->br_xino.xi_file = new_xino;
+
+	h_sb = br->br_mnt->mnt_sb;
+	for (bi = 0; bi <= bend; bi++) {
+		if (unlikely(bi == bindex))
+			continue;
+		br = au_sbr(sb, bi);
+		if (br->br_mnt->mnt_sb != h_sb)
+			continue;
+
+		fput(br->br_xino.xi_file);
+		br->br_xino.xi_file = new_xino;
+		get_file(new_xino);
+	}
+
+ out:
+	return err;
+}
+
+struct xino_do_trunc_args {
+	struct super_block *sb;
+	struct au_branch *br;
+};
+
+static void xino_do_trunc(void *_args)
+{
+	struct xino_do_trunc_args *args = _args;
+	struct super_block *sb;
+	struct au_branch *br;
+	struct inode *dir;
+	int err;
+	aufs_bindex_t bindex;
+
+	err = 0;
+	sb = args->sb;
+	dir = sb->s_root->d_inode;
+	br = args->br;
+
+	si_noflush_write_lock(sb);
+	ii_read_lock_parent(dir);
+	bindex = au_br_index(sb, br->br_id);
+	err = au_xino_trunc(sb, bindex);
+	if (unlikely(err))
+		goto out;
+
+	if (br->br_xino.xi_file->f_dentry->d_inode->i_blocks
+	    >= br->br_xino_upper)
+		br->br_xino_upper += AUFS_XINO_TRUNC_STEP;
+
+ out:
+	ii_read_unlock(dir);
+	if (unlikely(err))
+		AuWarn("err b%d, (%d)\n", bindex, err);
+	atomic_dec(&br->br_xino_running);
+	atomic_dec(&br->br_count);
+	au_nwt_done(&au_sbi(sb)->si_nowait);
+	si_write_unlock(sb);
+	kfree(args);
+}
+
+static void xino_try_trunc(struct super_block *sb, struct au_branch *br)
+{
+	struct xino_do_trunc_args *args;
+	int wkq_err;
+
+	if (br->br_xino.xi_file->f_dentry->d_inode->i_blocks
+	    < br->br_xino_upper)
+		return;
+
+	if (atomic_inc_return(&br->br_xino_running) > 1)
+		goto out;
+
+	/* lock and kfree() will be called in trunc_xino() */
+	args = kmalloc(sizeof(*args), GFP_NOFS);
+	if (unlikely(!args)) {
+		AuErr1("no memory\n");
+		goto out_args;
+	}
+
+	atomic_inc(&br->br_count);
+	args->sb = sb;
+	args->br = br;
+	wkq_err = au_wkq_nowait(xino_do_trunc, args, sb);
+	if (!wkq_err)
+		return; /* success */
+
+	AuErr("wkq %d\n", wkq_err);
+	atomic_dec(&br->br_count);
+
+ out_args:
+	kfree(args);
+ out:
+	atomic_dec(&br->br_xino_running);
+}
+
+/* ---------------------------------------------------------------------- */
+
+static const loff_t au_loff_max = LLONG_MAX;
+
+static int au_xino_do_write(au_writef_t write, struct file *file,
+			    ino_t h_ino, ino_t ino)
+{
+	loff_t pos;
+	ssize_t sz;
+
+	pos = h_ino;
+	if (unlikely(au_loff_max / sizeof(ino) - 1 < pos)) {
+		AuIOErr1("too large hi%lu\n", (unsigned long)h_ino);
+		return -EFBIG;
+	}
+	pos *= sizeof(ino);
+	sz = xino_fwrite(write, file, &ino, sizeof(ino), &pos);
+	if (sz == sizeof(ino))
+		return 0; /* success */
+
+	AuIOErr("write failed (%zd)\n", sz);
+	return -EIO;
+}
+
+/*
+ * write @ino to the xinofile for the specified branch{@sb, @bindex}
+ * at the position of @h_ino.
+ * even if @ino is zero, it is written to the xinofile and means no entry.
+ * if the size of the xino file on a specific filesystem exceeds the watermark,
+ * try truncating it.
+ */
+int au_xino_write(struct super_block *sb, aufs_bindex_t bindex, ino_t h_ino,
+		  ino_t ino)
+{
+	int err;
+	unsigned int mnt_flags;
+	struct au_branch *br;
+
+	BUILD_BUG_ON(sizeof(long long) != sizeof(au_loff_max)
+		     || ((loff_t)-1) > 0);
+
+	mnt_flags = au_mntflags(sb);
+	if (!au_opt_test(mnt_flags, XINO))
+		return 0;
+
+	br = au_sbr(sb, bindex);
+	err = au_xino_do_write(au_sbi(sb)->si_xwrite, br->br_xino.xi_file,
+			       h_ino, ino);
+	if (!err) {
+		if (au_opt_test(mnt_flags, TRUNC_XINO)
+		    && au_test_fs_trunc_xino(br->br_mnt->mnt_sb))
+			xino_try_trunc(sb, br);
+		return 0; /* success */
+	}
+
+	AuIOErr("write failed (%d)\n", err);
+	return -EIO;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* aufs inode number bitmap */
+
+static const int page_bits = (int)PAGE_SIZE * BITS_PER_BYTE;
+static ino_t xib_calc_ino(unsigned long pindex, int bit)
+{
+	ino_t ino;
+
+	AuDebugOn(bit < 0 || page_bits <= bit);
+	ino = AUFS_FIRST_INO + pindex * page_bits + bit;
+	return ino;
+}
+
+static void xib_calc_bit(ino_t ino, unsigned long *pindex, int *bit)
+{
+	AuDebugOn(ino < AUFS_FIRST_INO);
+	ino -= AUFS_FIRST_INO;
+	*pindex = ino / page_bits;
+	*bit = ino % page_bits;
+}
+
+static int xib_pindex(struct super_block *sb, unsigned long pindex)
+{
+	int err;
+	loff_t pos;
+	ssize_t sz;
+	struct au_sbinfo *sbinfo;
+	struct file *xib;
+	unsigned long *p;
+
+	sbinfo = au_sbi(sb);
+	MtxMustLock(&sbinfo->si_xib_mtx);
+	AuDebugOn(pindex > ULONG_MAX / PAGE_SIZE
+		  || !au_opt_test(sbinfo->si_mntflags, XINO));
+
+	if (pindex == sbinfo->si_xib_last_pindex)
+		return 0;
+
+	xib = sbinfo->si_xib;
+	p = sbinfo->si_xib_buf;
+	pos = sbinfo->si_xib_last_pindex;
+	pos *= PAGE_SIZE;
+	sz = xino_fwrite(sbinfo->si_xwrite, xib, p, PAGE_SIZE, &pos);
+	if (unlikely(sz != PAGE_SIZE))
+		goto out;
+
+	pos = pindex;
+	pos *= PAGE_SIZE;
+	if (i_size_read(xib->f_dentry->d_inode) >= pos + PAGE_SIZE)
+		sz = xino_fread(sbinfo->si_xread, xib, p, PAGE_SIZE, &pos);
+	else {
+		memset(p, 0, PAGE_SIZE);
+		sz = xino_fwrite(sbinfo->si_xwrite, xib, p, PAGE_SIZE, &pos);
+	}
+	if (sz == PAGE_SIZE) {
+		sbinfo->si_xib_last_pindex = pindex;
+		return 0; /* success */
+	}
+
+ out:
+	AuIOErr1("write failed (%zd)\n", sz);
+	err = sz;
+	if (sz >= 0)
+		err = -EIO;
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+int au_xino_write0(struct super_block *sb, aufs_bindex_t bindex, ino_t h_ino,
+		   ino_t ino)
+{
+	int err, bit;
+	unsigned long pindex;
+	struct au_sbinfo *sbinfo;
+
+	if (!au_opt_test(au_mntflags(sb), XINO))
+		return 0;
+
+	err = 0;
+	if (ino) {
+		sbinfo = au_sbi(sb);
+		xib_calc_bit(ino, &pindex, &bit);
+		AuDebugOn(page_bits <= bit);
+		mutex_lock(&sbinfo->si_xib_mtx);
+		err = xib_pindex(sb, pindex);
+		if (!err) {
+			clear_bit(bit, sbinfo->si_xib_buf);
+			sbinfo->si_xib_next_bit = bit;
+		}
+		mutex_unlock(&sbinfo->si_xib_mtx);
+	}
+
+	if (!err)
+		err = au_xino_write(sb, bindex, h_ino, 0);
+	return err;
+}
+
+/* get an unused inode number from bitmap */
+ino_t au_xino_new_ino(struct super_block *sb)
+{
+	ino_t ino;
+	unsigned long *p, pindex, ul, pend;
+	struct au_sbinfo *sbinfo;
+	struct file *file;
+	int free_bit, err;
+
+	if (!au_opt_test(au_mntflags(sb), XINO))
+		return iunique(sb, AUFS_FIRST_INO);
+
+	sbinfo = au_sbi(sb);
+	mutex_lock(&sbinfo->si_xib_mtx);
+	p = sbinfo->si_xib_buf;
+	free_bit = sbinfo->si_xib_next_bit;
+	if (free_bit < page_bits && !test_bit(free_bit, p))
+		goto out; /* success */
+	free_bit = find_first_zero_bit(p, page_bits);
+	if (free_bit < page_bits)
+		goto out; /* success */
+
+	pindex = sbinfo->si_xib_last_pindex;
+	for (ul = pindex - 1; ul < ULONG_MAX; ul--) {
+		err = xib_pindex(sb, ul);
+		if (unlikely(err))
+			goto out_err;
+		free_bit = find_first_zero_bit(p, page_bits);
+		if (free_bit < page_bits)
+			goto out; /* success */
+	}
+
+	file = sbinfo->si_xib;
+	pend = i_size_read(file->f_dentry->d_inode) / PAGE_SIZE;
+	for (ul = pindex + 1; ul <= pend; ul++) {
+		err = xib_pindex(sb, ul);
+		if (unlikely(err))
+			goto out_err;
+		free_bit = find_first_zero_bit(p, page_bits);
+		if (free_bit < page_bits)
+			goto out; /* success */
+	}
+	BUG();
+
+ out:
+	set_bit(free_bit, p);
+	sbinfo->si_xib_next_bit++;
+	pindex = sbinfo->si_xib_last_pindex;
+	mutex_unlock(&sbinfo->si_xib_mtx);
+	ino = xib_calc_ino(pindex, free_bit);
+	AuDbg("i%lu\n", (unsigned long)ino);
+	return ino;
+ out_err:
+	mutex_unlock(&sbinfo->si_xib_mtx);
+	AuDbg("i0\n");
+	return 0;
+}
+
+/*
+ * read @ino from xinofile for the specified branch{@sb, @bindex}
+ * at the position of @h_ino.
+ * if @ino does not exist and @do_new is true, get new one.
+ */
+int au_xino_read(struct super_block *sb, aufs_bindex_t bindex, ino_t h_ino,
+		 ino_t *ino)
+{
+	int err;
+	ssize_t sz;
+	loff_t pos;
+	struct file *file;
+	struct au_sbinfo *sbinfo;
+
+	*ino = 0;
+	if (!au_opt_test(au_mntflags(sb), XINO))
+		return 0; /* no xino */
+
+	err = 0;
+	sbinfo = au_sbi(sb);
+	pos = h_ino;
+	if (unlikely(au_loff_max / sizeof(*ino) - 1 < pos)) {
+		AuIOErr1("too large hi%lu\n", (unsigned long)h_ino);
+		return -EFBIG;
+	}
+	pos *= sizeof(*ino);
+
+	file = au_sbr(sb, bindex)->br_xino.xi_file;
+	if (i_size_read(file->f_dentry->d_inode) < pos + sizeof(*ino))
+		return 0; /* no ino */
+
+	sz = xino_fread(sbinfo->si_xread, file, ino, sizeof(*ino), &pos);
+	if (sz == sizeof(*ino))
+		return 0; /* success */
+
+	err = sz;
+	if (unlikely(sz >= 0)) {
+		err = -EIO;
+		AuIOErr("xino read error (%zd)\n", sz);
+	}
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* create and set a new xino file */
+
+struct file *au_xino_create(struct super_block *sb, char *fname, int silent)
+{
+	struct file *file;
+	struct dentry *h_parent, *d;
+	struct inode *h_dir;
+	int err;
+
+	/*
+	 * at mount-time, and the xino file is the default path,
+	 * hinotify is disabled so we have no inotify events to ignore.
+	 * when a user specified the xino, we cannot get au_hdir to be ignored.
+	 */
+	file = vfsub_filp_open(fname, O_RDWR | O_CREAT | O_EXCL | O_LARGEFILE,
+			       S_IRUGO | S_IWUGO);
+	if (IS_ERR(file)) {
+		if (!silent)
+			AuErr("open %s(%ld)\n", fname, PTR_ERR(file));
+		return file;
+	}
+
+	/* keep file count */
+	h_parent = dget_parent(file->f_dentry);
+	h_dir = h_parent->d_inode;
+	mutex_lock_nested(&h_dir->i_mutex, AuLsc_I_PARENT);
+	/* mnt_want_write() is unnecessary here */
+	err = vfsub_unlink(h_dir, &file->f_path, /*force*/0);
+	mutex_unlock(&h_dir->i_mutex);
+	dput(h_parent);
+	if (unlikely(err)) {
+		if (!silent)
+			AuErr("unlink %s(%d)\n", fname, err);
+		goto out;
+	}
+
+	err = -EINVAL;
+	d = file->f_dentry;
+	if (unlikely(sb == d->d_sb)) {
+		if (!silent)
+			AuErr("%s must be outside\n", fname);
+		goto out;
+	}
+	if (unlikely(au_test_fs_bad_xino(d->d_sb))) {
+		if (!silent)
+			AuErr("xino doesn't support %s(%s)\n",
+			      fname, au_sbtype(d->d_sb));
+		goto out;
+	}
+	return file; /* success */
+
+ out:
+	fput(file);
+	file = ERR_PTR(err);
+	return file;
+}
+
+/*
+ * find another branch who is on the same filesystem of the specified
+ * branch{@btgt}. search until @bend.
+ */
+static int is_sb_shared(struct super_block *sb, aufs_bindex_t btgt,
+			aufs_bindex_t bend)
+{
+	aufs_bindex_t bindex;
+	struct super_block *tgt_sb = au_sbr_sb(sb, btgt);
+
+	for (bindex = 0; bindex < btgt; bindex++)
+		if (unlikely(tgt_sb == au_sbr_sb(sb, bindex)))
+			return bindex;
+	for (bindex++; bindex <= bend; bindex++)
+		if (unlikely(tgt_sb == au_sbr_sb(sb, bindex)))
+			return bindex;
+	return -1;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * initialize the xinofile for the specified branch @br
+ * at the place/path where @base_file indicates.
+ * test whether another branch is on the same filesystem or not,
+ * if @do_test is true.
+ */
+int au_xino_br(struct super_block *sb, struct au_branch *br, ino_t h_ino,
+	       struct file *base_file, int do_test)
+{
+	int err;
+	ino_t ino;
+	aufs_bindex_t bend, bindex;
+	struct au_branch *shared_br, *b;
+	struct file *file;
+	struct super_block *tgt_sb;
+
+	shared_br = NULL;
+	bend = au_sbend(sb);
+	if (do_test) {
+		tgt_sb = br->br_mnt->mnt_sb;
+		for (bindex = 0; bindex <= bend; bindex++) {
+			b = au_sbr(sb, bindex);
+			if (tgt_sb == b->br_mnt->mnt_sb) {
+				shared_br = b;
+				break;
+			}
+		}
+	}
+
+	if (!shared_br || !shared_br->br_xino.xi_file) {
+		struct au_xino_lock_dir ldir;
+
+		au_xino_lock_dir(sb, base_file, &ldir);
+		/* mnt_want_write() is unnecessary here */
+		file = au_xino_create2(base_file, NULL);
+		au_xino_unlock_dir(&ldir);
+		err = PTR_ERR(file);
+		if (IS_ERR(file))
+			goto out;
+		br->br_xino.xi_file = file;
+	} else {
+		br->br_xino.xi_file = shared_br->br_xino.xi_file;
+		get_file(br->br_xino.xi_file);
+	}
+
+	ino = AUFS_ROOT_INO;
+	err = au_xino_do_write(au_sbi(sb)->si_xwrite, br->br_xino.xi_file,
+			       h_ino, ino);
+	if (!err)
+		return 0; /* success */
+
+
+ out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* trucate a xino bitmap file */
+
+/* todo: slow */
+static int do_xib_restore(struct super_block *sb, struct file *file, void *page)
+{
+	int err, bit;
+	ssize_t sz;
+	unsigned long pindex;
+	loff_t pos, pend;
+	struct au_sbinfo *sbinfo;
+	au_readf_t func;
+	ino_t *ino;
+	unsigned long *p;
+
+	err = 0;
+	sbinfo = au_sbi(sb);
+	p = sbinfo->si_xib_buf;
+	func = sbinfo->si_xread;
+	pend = i_size_read(file->f_dentry->d_inode);
+	pos = 0;
+	while (pos < pend) {
+		sz = xino_fread(func, file, page, PAGE_SIZE, &pos);
+		err = sz;
+		if (unlikely(sz <= 0))
+			goto out;
+
+		err = 0;
+		for (ino = page; sz > 0; ino++, sz -= sizeof(ino)) {
+			if (unlikely(*ino < AUFS_FIRST_INO))
+				continue;
+
+			xib_calc_bit(*ino, &pindex, &bit);
+			AuDebugOn(page_bits <= bit);
+			err = xib_pindex(sb, pindex);
+			if (!err)
+				set_bit(bit, p);
+			else
+				goto out;
+		}
+	}
+
+ out:
+	return err;
+}
+
+static int xib_restore(struct super_block *sb)
+{
+	int err;
+	aufs_bindex_t bindex, bend;
+	void *page;
+
+	err = -ENOMEM;
+	page = (void *)__get_free_page(GFP_NOFS);
+	if (unlikely(!page))
+		goto out;
+
+	err = 0;
+	bend = au_sbend(sb);
+	for (bindex = 0; !err && bindex <= bend; bindex++)
+		if (!bindex || is_sb_shared(sb, bindex, bindex - 1) < 0)
+			err = do_xib_restore
+				(sb, au_sbr(sb, bindex)->br_xino.xi_file, page);
+		else
+			AuDbg("b%d\n", bindex);
+	free_page((unsigned long)page);
+
+ out:
+	return err;
+}
+
+int au_xib_trunc(struct super_block *sb)
+{
+	int err;
+	ssize_t sz;
+	loff_t pos;
+	struct au_xino_lock_dir ldir;
+	struct au_sbinfo *sbinfo;
+	unsigned long *p;
+	struct file *file;
+
+	err = 0;
+	sbinfo = au_sbi(sb);
+	if (!au_opt_test(sbinfo->si_mntflags, XINO))
+		goto out;
+
+	file = sbinfo->si_xib;
+	if (i_size_read(file->f_dentry->d_inode) <= PAGE_SIZE)
+		goto out;
+
+	au_xino_lock_dir(sb, file, &ldir);
+	/* mnt_want_write() is unnecessary here */
+	file = au_xino_create2(sbinfo->si_xib, NULL);
+	au_xino_unlock_dir(&ldir);
+	err = PTR_ERR(file);
+	if (IS_ERR(file))
+		goto out;
+	fput(sbinfo->si_xib);
+	sbinfo->si_xib = file;
+
+	p = sbinfo->si_xib_buf;
+	memset(p, 0, PAGE_SIZE);
+	pos = 0;
+	sz = xino_fwrite(sbinfo->si_xwrite, sbinfo->si_xib, p, PAGE_SIZE, &pos);
+	if (unlikely(sz != PAGE_SIZE)) {
+		err = sz;
+		AuIOErr("err %d\n", err);
+		if (sz >= 0)
+			err = -EIO;
+		goto out;
+	}
+
+	mutex_lock(&sbinfo->si_xib_mtx);
+	/* mnt_want_write() is unnecessary here */
+	err = xib_restore(sb);
+	mutex_unlock(&sbinfo->si_xib_mtx);
+
+out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * xino mount option handlers
+ */
+static au_readf_t find_readf(struct file *h_file)
+{
+	const struct file_operations *fop = h_file->f_op;
+
+	if (fop) {
+		if (fop->read)
+			return fop->read;
+		if (fop->aio_read)
+			return do_sync_read;
+	}
+	return ERR_PTR(-ENOSYS);
+}
+
+static au_writef_t find_writef(struct file *h_file)
+{
+	const struct file_operations *fop = h_file->f_op;
+
+	if (fop) {
+		if (fop->write)
+			return fop->write;
+		if (fop->aio_write)
+			return do_sync_write;
+	}
+	return ERR_PTR(-ENOSYS);
+}
+
+/* xino bitmap */
+static void xino_clear_xib(struct super_block *sb)
+{
+	struct au_sbinfo *sbinfo;
+
+	sbinfo = au_sbi(sb);
+	sbinfo->si_xread = NULL;
+	sbinfo->si_xwrite = NULL;
+	if (sbinfo->si_xib)
+		fput(sbinfo->si_xib);
+	sbinfo->si_xib = NULL;
+	free_page((unsigned long)sbinfo->si_xib_buf);
+	sbinfo->si_xib_buf = NULL;
+}
+
+static int au_xino_set_xib(struct super_block *sb, struct file *base)
+{
+	int err;
+	loff_t pos;
+	struct au_sbinfo *sbinfo;
+	struct file *file;
+
+	sbinfo = au_sbi(sb);
+	file = au_xino_create2(base, sbinfo->si_xib);
+	err = PTR_ERR(file);
+	if (IS_ERR(file))
+		goto out;
+	if (sbinfo->si_xib)
+		fput(sbinfo->si_xib);
+	sbinfo->si_xib = file;
+	sbinfo->si_xread = find_readf(file);
+	sbinfo->si_xwrite = find_writef(file);
+
+	err = -ENOMEM;
+	if (!sbinfo->si_xib_buf)
+		sbinfo->si_xib_buf = (void *)get_zeroed_page(GFP_NOFS);
+	if (unlikely(!sbinfo->si_xib_buf))
+		goto out_unset;
+
+	sbinfo->si_xib_last_pindex = 0;
+	sbinfo->si_xib_next_bit = 0;
+	if (i_size_read(file->f_dentry->d_inode) < PAGE_SIZE) {
+		pos = 0;
+		err = xino_fwrite(sbinfo->si_xwrite, file, sbinfo->si_xib_buf,
+				  PAGE_SIZE, &pos);
+		if (unlikely(err != PAGE_SIZE))
+			goto out_free;
+	}
+	err = 0;
+	goto out; /* success */
+
+ out_free:
+	free_page((unsigned long)sbinfo->si_xib_buf);
+	sbinfo->si_xib_buf = NULL;
+	if (err >= 0)
+		err = -EIO;
+ out_unset:
+	fput(sbinfo->si_xib);
+	sbinfo->si_xib = NULL;
+	sbinfo->si_xread = NULL;
+	sbinfo->si_xwrite = NULL;
+ out:
+	return err;
+}
+
+/* xino for each branch */
+static void xino_clear_br(struct super_block *sb)
+{
+	aufs_bindex_t bindex, bend;
+	struct au_branch *br;
+
+	bend = au_sbend(sb);
+	for (bindex = 0; bindex <= bend; bindex++) {
+		br = au_sbr(sb, bindex);
+		if (!br || !br->br_xino.xi_file)
+			continue;
+
+		fput(br->br_xino.xi_file);
+		br->br_xino.xi_file = NULL;
+	}
+}
+
+static int au_xino_set_br(struct super_block *sb, struct file *base)
+{
+	int err;
+	ino_t ino;
+	aufs_bindex_t bindex, bend, bshared;
+	struct {
+		struct file *old, *new;
+	} *fpair, *p;
+	struct au_branch *br;
+	struct inode *inode;
+	au_writef_t writef;
+
+	err = -ENOMEM;
+	bend = au_sbend(sb);
+	fpair = kcalloc(bend + 1, sizeof(*fpair), GFP_NOFS);
+	if (unlikely(!fpair))
+		goto out;
+
+	inode = sb->s_root->d_inode;
+	ino = AUFS_ROOT_INO;
+	writef = au_sbi(sb)->si_xwrite;
+	for (bindex = 0, p = fpair; bindex <= bend; bindex++, p++) {
+		br = au_sbr(sb, bindex);
+		bshared = is_sb_shared(sb, bindex, bindex - 1);
+		if (bshared >= 0) {
+			/* shared xino */
+			*p = fpair[bshared];
+			get_file(p->new);
+		}
+
+		if (!p->new) {
+			/* new xino */
+			p->old = br->br_xino.xi_file;
+			p->new = au_xino_create2(base, br->br_xino.xi_file);
+			err = PTR_ERR(p->new);
+			if (IS_ERR(p->new)) {
+				p->new = NULL;
+				goto out_pair;
+			}
+		}
+
+		err = au_xino_do_write(writef, p->new,
+				       au_h_iptr(inode, bindex)->i_ino, ino);
+		if (unlikely(err))
+			goto out_pair;
+	}
+
+	for (bindex = 0, p = fpair; bindex <= bend; bindex++, p++) {
+		br = au_sbr(sb, bindex);
+		if (br->br_xino.xi_file)
+			fput(br->br_xino.xi_file);
+		get_file(p->new);
+		br->br_xino.xi_file = p->new;
+	}
+
+ out_pair:
+	for (bindex = 0, p = fpair; bindex <= bend; bindex++, p++)
+		if (p->new)
+			fput(p->new);
+		else
+			break;
+	kfree(fpair);
+ out:
+	return err;
+}
+
+void au_xino_clr(struct super_block *sb)
+{
+	struct au_sbinfo *sbinfo;
+
+	xino_clear_xib(sb);
+	xino_clear_br(sb);
+	sbinfo = au_sbi(sb);
+	/* lvalue, do not call au_mntflags() */
+	au_opt_clr(sbinfo->si_mntflags, XINO);
+}
+
+int au_xino_set(struct super_block *sb, struct au_opt_xino *xino, int remount)
+{
+	int err, skip;
+	struct dentry *parent, *cur_parent;
+	struct qstr *dname, *cur_name;
+	struct file *cur_xino;
+	struct inode *dir;
+	struct au_sbinfo *sbinfo;
+
+	err = 0;
+	sbinfo = au_sbi(sb);
+	parent = dget_parent(xino->file->f_dentry);
+	if (remount) {
+		skip = 0;
+		dname = &xino->file->f_dentry->d_name;
+		cur_xino = sbinfo->si_xib;
+		if (cur_xino) {
+			cur_parent = dget_parent(cur_xino->f_dentry);
+			cur_name = &cur_xino->f_dentry->d_name;
+			skip = (cur_parent == parent
+				&& dname->len == cur_name->len
+				&& !memcmp(dname->name, cur_name->name,
+					   dname->len));
+			dput(cur_parent);
+		}
+		if (skip)
+			goto out;
+	}
+
+	au_opt_set(sbinfo->si_mntflags, XINO);
+	dir = parent->d_inode;
+	mutex_lock_nested(&dir->i_mutex, AuLsc_I_PARENT);
+	/* mnt_want_write() is unnecessary here */
+	err = au_xino_set_xib(sb, xino->file);
+	if (!err)
+		err = au_xino_set_br(sb, xino->file);
+	mutex_unlock(&dir->i_mutex);
+	if (!err)
+		goto out; /* success */
+
+	/* reset all */
+	AuIOErr("failed creating xino(%d).\n", err);
+
+ out:
+	dput(parent);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * create a xinofile at the default place/path.
+ */
+struct file *au_xino_def(struct super_block *sb)
+{
+	struct file *file;
+	char *page, *p;
+	struct au_branch *br;
+	struct super_block *h_sb;
+	struct path path;
+	aufs_bindex_t bend, bindex, bwr;
+
+	br = NULL;
+	bend = au_sbend(sb);
+	bwr = -1;
+	for (bindex = 0; bindex <= bend; bindex++) {
+		br = au_sbr(sb, bindex);
+		if (au_br_writable(br->br_perm)
+		    && !au_test_fs_bad_xino(br->br_mnt->mnt_sb)) {
+			bwr = bindex;
+			break;
+		}
+	}
+
+	if (bwr >= 0) {
+		file = ERR_PTR(-ENOMEM);
+		page = __getname();
+		if (unlikely(!page))
+			goto out;
+		path.mnt = br->br_mnt;
+		path.dentry = au_h_dptr(sb->s_root, bwr);
+		p = d_path(&path, page, PATH_MAX - sizeof(AUFS_XINO_FNAME));
+		file = (void *)p;
+		if (!IS_ERR(p)) {
+			strcat(p, "/" AUFS_XINO_FNAME);
+			AuDbg("%s\n", p);
+			file = au_xino_create(sb, p, /*silent*/0);
+			if (!IS_ERR(file))
+				au_xino_brid_set(sb, br->br_id);
+		}
+		__putname(page);
+	} else {
+		file = au_xino_create(sb, AUFS_XINO_DEFPATH, /*silent*/0);
+		if (IS_ERR(file))
+			goto out;
+		h_sb = file->f_dentry->d_sb;
+		if (unlikely(au_test_fs_bad_xino(h_sb))) {
+			AuErr("xino doesn't support %s(%s)\n",
+			      AUFS_XINO_DEFPATH, au_sbtype(h_sb));
+			fput(file);
+			file = ERR_PTR(-EINVAL);
+		}
+		if (!IS_ERR(file))
+			au_xino_brid_set(sb, -1);
+	}
+
+ out:
+	return file;
+}
+
+/* ---------------------------------------------------------------------- */
+
+int au_xino_path(struct seq_file *seq, struct file *file)
+{
+	int err;
+
+	err = au_seq_path(seq, &file->f_path);
+	if (unlikely(err < 0))
+		goto out;
+
+	err = 0;
+#define Deleted "\\040(deleted)"
+	seq->count -= sizeof(Deleted) - 1;
+	AuDebugOn(memcmp(seq->buf + seq->count, Deleted,
+			 sizeof(Deleted) - 1));
+#undef Deleted
+
+ out:
+	return err;
+}
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 07/28] aufs object lifetime management via sysfs
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (5 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 06/28] aufs xino J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 08/28] aufs mount options/flags J. R. Okajima
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
some aufs objects have a corresponding entry under sysfs, so the
lifetime will be managed by struct kref even if CONFIG_SYSFS is
disabled.
This file is compiled unconditionally.

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/sysaufs.c |   79 +++++++++++++++++++++++++++++++++++++++++
 fs/aufs/sysaufs.h |  102 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 181 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/sysaufs.c
 create mode 100644 fs/aufs/sysaufs.h

diff --git a/fs/aufs/sysaufs.c b/fs/aufs/sysaufs.c
new file mode 100644
index 0000000..47ffbfd
--- /dev/null
+++ b/fs/aufs/sysaufs.c
@@ -0,0 +1,79 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * sysfs interface and lifetime management
+ * they are necessary regardless sysfs is disabled.
+ */
+
+#include <linux/fs.h>
+#include <linux/random.h>
+#include <linux/sysfs.h>
+#include "aufs.h"
+
+unsigned long sysaufs_si_mask;
+struct kset *sysaufs_ket;
+
+#define AuSiAttr(_name) { \
+	.attr   = { .name = __stringify(_name), .mode = 0444 },	\
+	.show   = sysaufs_si_##_name,				\
+}
+
+static struct sysaufs_si_attr sysaufs_si_attr_xino = AuSiAttr(xino);
+struct attribute *sysaufs_si_attrs[] = {
+	&sysaufs_si_attr_xino.attr,
+	NULL,
+};
+
+static struct sysfs_ops au_sbi_ops = {
+	.show   = sysaufs_si_show
+};
+
+static struct kobj_type au_sbi_ktype = {
+	.release	= au_si_free,
+	.sysfs_ops	= &au_sbi_ops,
+	.default_attrs	= sysaufs_si_attrs
+};
+
+/* ---------------------------------------------------------------------- */
+
+int sysaufs_si_init(struct au_sbinfo *sbinfo)
+{
+	int err;
+
+	sbinfo->si_kobj.kset = sysaufs_ket;
+	err = kobject_init_and_add(&sbinfo->si_kobj, &au_sbi_ktype,
+				   /*&sysaufs_ket->kobj*/NULL,
+				   "si_%lx", sysaufs_si_id(sbinfo));
+	return err;
+}
+
+void sysaufs_fin(void)
+{
+	sysfs_remove_group(&sysaufs_ket->kobj, sysaufs_attr_group);
+	kset_unregister(sysaufs_ket);
+}
+
+int __init sysaufs_init(void)
+{
+	int err;
+
+	get_random_bytes(&sysaufs_si_mask, sizeof(sysaufs_si_mask));
+
+	sysaufs_ket = kset_create_and_add(AUFS_NAME, NULL, fs_kobj);
+	err = PTR_ERR(sysaufs_ket);
+	if (IS_ERR(sysaufs_ket))
+		goto out;
+	err = sysfs_create_group(&sysaufs_ket->kobj, sysaufs_attr_group);
+	if (unlikely(err))
+		kset_unregister(sysaufs_ket);
+
+ out:
+	return err;
+}
diff --git a/fs/aufs/sysaufs.h b/fs/aufs/sysaufs.h
new file mode 100644
index 0000000..04ffab2
--- /dev/null
+++ b/fs/aufs/sysaufs.h
@@ -0,0 +1,102 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * sysfs interface and mount lifetime management
+ */
+
+#ifndef __SYSAUFS_H__
+#define __SYSAUFS_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+#include <linux/sysfs.h>
+#include <linux/aufs_type.h>
+
+struct sysaufs_si_attr {
+	struct attribute attr;
+	int (*show)(struct seq_file *seq, struct super_block *sb);
+};
+
+/* ---------------------------------------------------------------------- */
+
+/* sysaufs.c */
+extern unsigned long sysaufs_si_mask;
+extern struct kset *sysaufs_ket;
+extern struct attribute *sysaufs_si_attrs[];
+int sysaufs_si_init(struct au_sbinfo *sbinfo);
+int __init sysaufs_init(void);
+void sysaufs_fin(void);
+
+/* ---------------------------------------------------------------------- */
+
+/* some people doesn't like to show a pointer in kernel */
+static inline unsigned long sysaufs_si_id(struct au_sbinfo *sbinfo)
+{
+	return sysaufs_si_mask ^ (unsigned long)sbinfo;
+}
+
+struct au_branch;
+#ifdef CONFIG_SYSFS
+/* sysfs.c */
+extern struct attribute_group *sysaufs_attr_group;
+extern struct kobj_type *sysaufs_ktype;
+
+int sysaufs_si_xino(struct seq_file *seq, struct super_block *sb);
+ssize_t sysaufs_si_show(struct kobject *kobj, struct attribute *attr,
+			 char *buf);
+
+void sysaufs_br_init(struct au_branch *br);
+void sysaufs_brs_add(struct super_block *sb, aufs_bindex_t bindex);
+void sysaufs_brs_del(struct super_block *sb, aufs_bindex_t bindex);
+
+#define sysaufs_brs_init()	do {} while (0)
+
+#else
+#define sysaufs_attr_group	NULL
+#define sysaufs_ktype		NULL
+
+static inline
+int sysaufs_si_xino(struct seq_file *seq, struct super_block *sb)
+{
+	return 0;
+}
+
+static inline
+ssize_t sysaufs_si_show(struct kobject *kobj, struct attribute *attr,
+			 char *buf)
+{
+	return 0;
+}
+
+static inline void sysaufs_br_init(struct au_branch *br)
+{
+	/* empty */
+}
+
+static inline void sysaufs_brs_add(struct super_block *sb, aufs_bindex_t bindex)
+{
+	/* nothing */
+}
+
+static inline void sysaufs_brs_del(struct super_block *sb, aufs_bindex_t bindex)
+{
+	/* nothing */
+}
+
+static inline void sysaufs_brs_init(void)
+{
+	sysaufs_brs = 0;
+}
+
+#endif /* CONFIG_SYSFS */
+
+#endif /* __KERNEL__ */
+#endif /* __SYSAUFS_H__ */
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 08/28] aufs mount options/flags
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (6 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 07/28] aufs object lifetime management via sysfs J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 09/28] aufs workqueue J. R. Okajima
                   ` (19 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
handling mount options/flags

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/opts.c | 1438 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/opts.h |  180 +++++++
 2 files changed, 1618 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/opts.c
 create mode 100644 fs/aufs/opts.h

diff --git a/fs/aufs/opts.c b/fs/aufs/opts.c
new file mode 100644
index 0000000..847b8e6
--- /dev/null
+++ b/fs/aufs/opts.c
@@ -0,0 +1,1438 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * mount options/flags
+ */
+
+#include <linux/types.h> /* a distribution requires */
+#include <linux/parser.h>
+#include "aufs.h"
+
+/* ---------------------------------------------------------------------- */
+
+enum {
+	Opt_br,
+	Opt_add, Opt_del, Opt_mod, Opt_reorder, Opt_append, Opt_prepend,
+	Opt_idel, Opt_imod, Opt_ireorder,
+	Opt_dirwh, Opt_rdcache, Opt_deblk, Opt_nhash, Opt_rendir,
+	Opt_xino, Opt_zxino, Opt_noxino,
+	Opt_trunc_xino, Opt_trunc_xino_v, Opt_notrunc_xino,
+	Opt_trunc_xino_path, Opt_itrunc_xino,
+	Opt_trunc_xib, Opt_notrunc_xib,
+	Opt_plink, Opt_noplink, Opt_list_plink,
+	Opt_udba,
+	/* Opt_lock, Opt_unlock, */
+	Opt_cmd, Opt_cmd_args,
+	Opt_diropq_a, Opt_diropq_w,
+	Opt_warn_perm, Opt_nowarn_perm,
+	Opt_wbr_copyup, Opt_wbr_create,
+	Opt_refrof, Opt_norefrof,
+	Opt_verbose, Opt_noverbose,
+	Opt_sum, Opt_nosum, Opt_wsum,
+	Opt_tail, Opt_ignore, Opt_ignore_silent, Opt_err
+};
+
+static match_table_t options = {
+	{Opt_br, "br=%s"},
+	{Opt_br, "br:%s"},
+
+	{Opt_add, "add=%d:%s"},
+	{Opt_add, "add:%d:%s"},
+	{Opt_add, "ins=%d:%s"},
+	{Opt_add, "ins:%d:%s"},
+	{Opt_append, "append=%s"},
+	{Opt_append, "append:%s"},
+	{Opt_prepend, "prepend=%s"},
+	{Opt_prepend, "prepend:%s"},
+
+	{Opt_del, "del=%s"},
+	{Opt_del, "del:%s"},
+	/* {Opt_idel, "idel:%d"}, */
+	{Opt_mod, "mod=%s"},
+	{Opt_mod, "mod:%s"},
+	/* {Opt_imod, "imod:%d:%s"}, */
+
+	{Opt_dirwh, "dirwh=%d"},
+
+	{Opt_xino, "xino=%s"},
+	{Opt_noxino, "noxino"},
+	{Opt_trunc_xino, "trunc_xino"},
+	{Opt_trunc_xino_v, "trunc_xino_v=%d:%d"},
+	{Opt_notrunc_xino, "notrunc_xino"},
+	{Opt_trunc_xino_path, "trunc_xino=%s"},
+	{Opt_itrunc_xino, "itrunc_xino=%d"},
+	/* {Opt_zxino, "zxino=%s"}, */
+	{Opt_trunc_xib, "trunc_xib"},
+	{Opt_notrunc_xib, "notrunc_xib"},
+
+	{Opt_plink, "plink"},
+	{Opt_noplink, "noplink"},
+#ifdef CONFIG_AUFS_DEBUG
+	{Opt_list_plink, "list_plink"},
+#endif
+
+	{Opt_udba, "udba=%s"},
+
+	{Opt_diropq_a, "diropq=always"},
+	{Opt_diropq_a, "diropq=a"},
+	{Opt_diropq_w, "diropq=whiteouted"},
+	{Opt_diropq_w, "diropq=w"},
+
+	{Opt_warn_perm, "warn_perm"},
+	{Opt_nowarn_perm, "nowarn_perm"},
+
+	/* keep them temporary */
+	{Opt_ignore_silent, "coo=%s"},
+	{Opt_ignore_silent, "nodlgt"},
+	{Opt_ignore_silent, "nodirperm1"},
+	{Opt_ignore_silent, "noshwh"},
+	{Opt_ignore_silent, "noshwh"},
+	{Opt_ignore_silent, "clean_plink"},
+
+	{Opt_rendir, "rendir=%d"},
+
+	{Opt_refrof, "refrof"},
+	{Opt_norefrof, "norefrof"},
+
+	{Opt_verbose, "verbose"},
+	{Opt_verbose, "v"},
+	{Opt_noverbose, "noverbose"},
+	{Opt_noverbose, "quiet"},
+	{Opt_noverbose, "q"},
+	{Opt_noverbose, "silent"},
+
+	{Opt_sum, "sum"},
+	{Opt_nosum, "nosum"},
+	{Opt_wsum, "wsum"},
+
+	{Opt_rdcache, "rdcache=%d"},
+	{Opt_rdcache, "rdcache:%d"},
+
+	{Opt_wbr_create, "create=%s"},
+	{Opt_wbr_create, "create_policy=%s"},
+	{Opt_wbr_copyup, "cpup=%s"},
+	{Opt_wbr_copyup, "copyup=%s"},
+	{Opt_wbr_copyup, "copyup_policy=%s"},
+
+	/* internal use for the scripts */
+	{Opt_ignore_silent, "si=%s"},
+
+	{Opt_br, "dirs=%s"},
+	{Opt_ignore, "debug=%d"},
+	{Opt_ignore, "delete=whiteout"},
+	{Opt_ignore, "delete=all"},
+	{Opt_ignore, "imap=%s"},
+
+	{Opt_err, NULL}
+};
+
+/* ---------------------------------------------------------------------- */
+
+static const char *au_parser_pattern(int val, struct match_token *token)
+{
+	while (token->pattern) {
+		if (token->token == val)
+			return token->pattern;
+		token++;
+	}
+	BUG();
+	return "??";
+}
+
+/* ---------------------------------------------------------------------- */
+
+static match_table_t brperms = {
+	{AuBrPerm_RO, AUFS_BRPERM_RO},
+	{AuBrPerm_RR, AUFS_BRPERM_RR},
+	{AuBrPerm_RW, AUFS_BRPERM_RW},
+
+	{AuBrPerm_ROWH, AUFS_BRPERM_ROWH},
+	{AuBrPerm_RRWH, AUFS_BRPERM_RRWH},
+	{AuBrPerm_RWNoLinkWH, AUFS_BRPERM_RWNLWH},
+
+	{AuBrPerm_ROWH, "nfsro"},
+	{AuBrPerm_RO, NULL}
+};
+
+static int br_perm_val(char *perm)
+{
+	int val;
+	substring_t args[MAX_OPT_ARGS];
+
+	val = match_token(perm, brperms, args);
+	return val;
+}
+
+const char *au_optstr_br_perm(int brperm)
+{
+	return au_parser_pattern(brperm, (void *)brperms);
+}
+
+/* ---------------------------------------------------------------------- */
+
+static match_table_t udbalevel = {
+	{AuOpt_UDBA_REVAL, "reval"},
+	{AuOpt_UDBA_NONE, "none"},
+#ifdef CONFIG_AUFS_HINOTIFY
+	{AuOpt_UDBA_HINOTIFY, "inotify"},
+#endif
+	{-1, NULL}
+};
+
+static int udba_val(char *str)
+{
+	substring_t args[MAX_OPT_ARGS];
+
+	return match_token(str, udbalevel, args);
+}
+
+const char *au_optstr_udba(int udba)
+{
+	return au_parser_pattern(udba, (void *)udbalevel);
+}
+
+/* ---------------------------------------------------------------------- */
+
+static match_table_t au_wbr_create_policy = {
+	{AuWbrCreate_TDP, "tdp"},
+	{AuWbrCreate_TDP, "top-down-parent"},
+	{AuWbrCreate_RR, "rr"},
+	{AuWbrCreate_RR, "round-robin"},
+	{AuWbrCreate_MFS, "mfs"},
+	{AuWbrCreate_MFS, "most-free-space"},
+	{AuWbrCreate_MFSV, "mfs:%d"},
+	{AuWbrCreate_MFSV, "most-free-space:%d"},
+
+	{AuWbrCreate_MFSRR, "mfsrr:%d"},
+	{AuWbrCreate_MFSRRV, "mfsrr:%d:%d"},
+	{AuWbrCreate_PMFS, "pmfs"},
+	{AuWbrCreate_PMFSV, "pmfs:%d"},
+
+	{-1, NULL}
+};
+
+/* cf. linux/lib/parser.c */
+static int au_match_ull(substring_t *s, unsigned long long *result)
+{
+	int err;
+	unsigned int len;
+	char a[32];
+
+	err = -ERANGE;
+	len = s->to - s->from;
+	if (len + 1 <= sizeof(a)) {
+		memcpy(a, s->from, len);
+		a[len] = '\0';
+		err = strict_strtoull(a, 0, result);
+	}
+	return err;
+}
+
+static int au_wbr_mfs_wmark(substring_t *arg, char *str,
+			    struct au_opt_wbr_create *create)
+{
+	int err;
+	unsigned long long ull;
+
+	err = 0;
+	if (!au_match_ull(arg, &ull))
+		create->mfsrr_watermark = ull;
+	else {
+		AuErr("bad integer in %s\n", str);
+		err = -EINVAL;
+	}
+
+	return err;
+}
+
+static int au_wbr_mfs_sec(substring_t *arg, char *str,
+			  struct au_opt_wbr_create *create)
+{
+	int n, err;
+
+	err = 0;
+	if (!match_int(arg, &n) && 0 <= n)
+		create->mfs_second = n;
+	else {
+		AuErr("bad integer in %s\n", str);
+		err = -EINVAL;
+	}
+
+	return err;
+}
+
+static int au_wbr_create_val(char *str, struct au_opt_wbr_create *create)
+{
+	int err, e;
+	substring_t args[MAX_OPT_ARGS];
+
+	err = match_token(str, au_wbr_create_policy, args);
+	create->wbr_create = err;
+	switch (err) {
+	case AuWbrCreate_MFSRRV:
+		e = au_wbr_mfs_wmark(&args[0], str, create);
+		if (!e)
+			e = au_wbr_mfs_sec(&args[1], str, create);
+		if (unlikely(e))
+			err = e;
+		break;
+	case AuWbrCreate_MFSRR:
+		e = au_wbr_mfs_wmark(&args[0], str, create);
+		if (unlikely(e)) {
+			err = e;
+			break;
+		}
+		/*FALLTHROUGH*/
+	case AuWbrCreate_MFS:
+	case AuWbrCreate_PMFS:
+		create->mfs_second = AUFS_MFS_SECOND_DEF;
+		break;
+	case AuWbrCreate_MFSV:
+	case AuWbrCreate_PMFSV:
+		e = au_wbr_mfs_sec(&args[0], str, create);
+		if (unlikely(e))
+			err = e;
+		break;
+	}
+
+	return err;
+}
+
+const char *au_optstr_wbr_create(int wbr_create)
+{
+	return au_parser_pattern(wbr_create, (void *)au_wbr_create_policy);
+}
+
+static match_table_t au_wbr_copyup_policy = {
+	{AuWbrCopyup_TDP, "tdp"},
+	{AuWbrCopyup_TDP, "top-down-parent"},
+	{AuWbrCopyup_BUP, "bup"},
+	{AuWbrCopyup_BUP, "bottom-up-parent"},
+	{AuWbrCopyup_BU, "bu"},
+	{AuWbrCopyup_BU, "bottom-up"},
+	{-1, NULL}
+};
+
+static int au_wbr_copyup_val(char *str)
+{
+	substring_t args[MAX_OPT_ARGS];
+
+	return match_token(str, au_wbr_copyup_policy, args);
+}
+
+const char *au_optstr_wbr_copyup(int wbr_copyup)
+{
+	return au_parser_pattern(wbr_copyup, (void *)au_wbr_copyup_policy);
+}
+
+/* ---------------------------------------------------------------------- */
+
+static const int lkup_dirflags = LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
+
+static void dump_opts(struct au_opts *opts)
+{
+#ifdef CONFIG_AUFS_DEBUG
+	/* reduce stack space */
+	union {
+		struct au_opt_add *add;
+		struct au_opt_del *del;
+		struct au_opt_mod *mod;
+		struct au_opt_xino *xino;
+		struct au_opt_xino_itrunc *xino_itrunc;
+		struct au_opt_wbr_create *create;
+	} u;
+	struct au_opt *opt;
+
+	opt = opts->opt;
+	while (opt->type != Opt_tail) {
+		switch (opt->type) {
+		case Opt_add:
+			u.add = &opt->add;
+			AuDbg("add {b%d, %s, 0x%x, %p}\n",
+				  u.add->bindex, u.add->pathname, u.add->perm,
+				  u.add->path.dentry);
+			break;
+		case Opt_del:
+		case Opt_idel:
+			u.del = &opt->del;
+			AuDbg("del {%s, %p}\n",
+			      u.del->pathname, u.del->h_path.dentry);
+			break;
+		case Opt_mod:
+		case Opt_imod:
+			u.mod = &opt->mod;
+			AuDbg("mod {%s, 0x%x, %p}\n",
+				  u.mod->path, u.mod->perm, u.mod->h_root);
+			break;
+		case Opt_append:
+			u.add = &opt->add;
+			AuDbg("append {b%d, %s, 0x%x, %p}\n",
+				  u.add->bindex, u.add->pathname, u.add->perm,
+				  u.add->path.dentry);
+			break;
+		case Opt_prepend:
+			u.add = &opt->add;
+			AuDbg("prepend {b%d, %s, 0x%x, %p}\n",
+				  u.add->bindex, u.add->pathname, u.add->perm,
+				  u.add->path.dentry);
+			break;
+		case Opt_dirwh:
+			AuDbg("dirwh %d\n", opt->dirwh);
+			break;
+		case Opt_rdcache:
+			AuDbg("rdcache %d\n", opt->rdcache);
+			break;
+		case Opt_xino:
+			u.xino = &opt->xino;
+			AuDbg("xino {%s %.*s}\n",
+				  u.xino->path,
+				  AuDLNPair(u.xino->file->f_dentry));
+			break;
+		case Opt_trunc_xino:
+			AuLabel(trunc_xino);
+			break;
+		case Opt_notrunc_xino:
+			AuLabel(notrunc_xino);
+			break;
+		case Opt_trunc_xino_path:
+		case Opt_itrunc_xino:
+			u.xino_itrunc = &opt->xino_itrunc;
+			AuDbg("trunc_xino %d\n", u.xino_itrunc->bindex);
+			break;
+
+		case Opt_noxino:
+			AuLabel(noxino);
+			break;
+		case Opt_trunc_xib:
+			AuLabel(trunc_xib);
+			break;
+		case Opt_notrunc_xib:
+			AuLabel(notrunc_xib);
+			break;
+		case Opt_plink:
+			AuLabel(plink);
+			break;
+		case Opt_noplink:
+			AuLabel(noplink);
+			break;
+		case Opt_list_plink:
+			AuLabel(list_plink);
+			break;
+		case Opt_udba:
+			AuDbg("udba %d, %s\n",
+				  opt->udba, au_optstr_udba(opt->udba));
+			break;
+		case Opt_diropq_a:
+			AuLabel(diropq_a);
+			break;
+		case Opt_diropq_w:
+			AuLabel(diropq_w);
+			break;
+		case Opt_warn_perm:
+			AuLabel(warn_perm);
+			break;
+		case Opt_nowarn_perm:
+			AuLabel(nowarn_perm);
+			break;
+		case Opt_refrof:
+			AuLabel(refrof);
+			break;
+		case Opt_norefrof:
+			AuLabel(norefrof);
+			break;
+		case Opt_verbose:
+			AuLabel(verbose);
+			break;
+		case Opt_noverbose:
+			AuLabel(noverbose);
+			break;
+		case Opt_sum:
+			AuLabel(sum);
+			break;
+		case Opt_nosum:
+			AuLabel(nosum);
+			break;
+		case Opt_wsum:
+			AuLabel(wsum);
+			break;
+		case Opt_wbr_create:
+			u.create = &opt->wbr_create;
+			AuDbg("create %d, %s\n", u.create->wbr_create,
+				  au_optstr_wbr_create(u.create->wbr_create));
+			switch (u.create->wbr_create) {
+			case AuWbrCreate_MFSV:
+			case AuWbrCreate_PMFSV:
+				AuDbg("%d sec\n", u.create->mfs_second);
+				break;
+			case AuWbrCreate_MFSRR:
+				AuDbg("%llu watermark\n",
+					  u.create->mfsrr_watermark);
+				break;
+			case AuWbrCreate_MFSRRV:
+				AuDbg("%llu watermark, %d sec\n",
+					  u.create->mfsrr_watermark,
+					  u.create->mfs_second);
+				break;
+			}
+			break;
+		case Opt_wbr_copyup:
+			AuDbg("copyup %d, %s\n", opt->wbr_copyup,
+				  au_optstr_wbr_copyup(opt->wbr_copyup));
+			break;
+		default:
+			BUG();
+		}
+		opt++;
+	}
+#endif
+}
+
+void au_opts_free(struct au_opts *opts)
+{
+	struct au_opt *opt;
+
+	opt = opts->opt;
+	while (opt->type != Opt_tail) {
+		switch (opt->type) {
+		case Opt_add:
+		case Opt_append:
+		case Opt_prepend:
+			path_put(&opt->add.path);
+			break;
+		case Opt_del:
+		case Opt_idel:
+			path_put(&opt->del.h_path);
+			break;
+		case Opt_mod:
+		case Opt_imod:
+			dput(opt->mod.h_root);
+			break;
+		case Opt_xino:
+			fput(opt->xino.file);
+			break;
+		}
+		opt++;
+	}
+}
+
+static int opt_add(struct au_opt *opt, char *opt_str, unsigned long sb_flags,
+		   aufs_bindex_t bindex)
+{
+	int err;
+	struct au_opt_add *add = &opt->add;
+	char *p;
+
+	add->bindex = bindex;
+	add->perm = AuBrPerm_Last;
+	add->pathname = opt_str;
+	p = strchr(opt_str, '=');
+	if (p) {
+		*p++ = 0;
+		if (*p)
+			add->perm = br_perm_val(p);
+	}
+
+	err = vfsub_kern_path(add->pathname, lkup_dirflags, &add->path);
+	if (!err) {
+		if (!p) {
+			add->perm = AuBrPerm_RO;
+			if (au_test_fs_rr(add->path.dentry->d_sb))
+				add->perm = AuBrPerm_RR;
+			else if (!bindex && !(sb_flags & MS_RDONLY))
+				add->perm = AuBrPerm_RW;
+		}
+		opt->type = Opt_add;
+		goto out;
+	}
+	AuErr("lookup failed %s (%d)\n", add->pathname, err);
+	err = -EINVAL;
+
+ out:
+	return err;
+}
+
+static int au_opts_parse_del(struct au_opt_del *del, substring_t args[])
+{
+	int err;
+
+	del->pathname = args[0].from;
+	AuDbg("del path %s\n", del->pathname);
+
+	err = vfsub_kern_path(del->pathname, lkup_dirflags, &del->h_path);
+	if (unlikely(err))
+		AuErr("lookup failed %s (%d)\n", del->pathname, err);
+
+	return err;
+}
+
+#if 0 /* reserved for future use */
+static int au_opts_parse_idel(struct super_block *sb, aufs_bindex_t bindex,
+			      struct au_opt_del *del, substring_t args[])
+{
+	int err;
+	struct dentry *root;
+
+	err = -EINVAL;
+	root = sb->s_root;
+	aufs_read_lock(root, AuLock_FLUSH);
+	if (bindex < 0 || au_sbend(sb) < bindex) {
+		AuErr("out of bounds, %d\n", bindex);
+		goto out;
+	}
+
+	err = 0;
+	del->h_path.dentry = dget(au_h_dptr(root, bindex));
+	del->h_path.mnt = mntget(au_sbr_mnt(sb, bindex));
+
+ out:
+	aufs_read_unlock(root, !AuLock_IR);
+	return err;
+}
+#endif
+
+static int au_opts_parse_mod(struct au_opt_mod *mod, substring_t args[])
+{
+	int err;
+	struct path path;
+	char *p;
+
+	err = -EINVAL;
+	mod->path = args[0].from;
+	p = strchr(mod->path, '=');
+	if (unlikely(!p)) {
+		AuErr("no permssion %s\n", args[0].from);
+		goto out;
+	}
+
+	*p++ = 0;
+	err = vfsub_kern_path(mod->path, lkup_dirflags, &path);
+	if (unlikely(err)) {
+		AuErr("lookup failed %s (%d)\n", mod->path, err);
+		goto out;
+	}
+
+	mod->perm = br_perm_val(p);
+	AuDbg("mod path %s, perm 0x%x, %s\n", mod->path, mod->perm, p);
+	mod->h_root = dget(path.dentry);
+	path_put(&path);
+
+ out:
+	return err;
+}
+
+#if 0 /* reserved for future use */
+static int au_opts_parse_imod(struct super_block *sb, aufs_bindex_t bindex,
+			      struct au_opt_mod *mod, substring_t args[])
+{
+	int err;
+	struct dentry *root;
+
+	err = -EINVAL;
+	root = sb->s_root;
+	aufs_read_lock(root, AuLock_FLUSH);
+	if (bindex < 0 || au_sbend(sb) < bindex) {
+		AuErr("out of bounds, %d\n", bindex);
+		goto out;
+	}
+
+	err = 0;
+	mod->perm = br_perm_val(args[1].from);
+	AuDbg("mod path %s, perm 0x%x, %s\n",
+	      mod->path, mod->perm, args[1].from);
+	mod->h_root = dget(au_h_dptr(root, bindex));
+
+ out:
+	aufs_read_unlock(root, !AuLock_IR);
+	return err;
+}
+#endif
+
+static int au_opts_parse_xino(struct super_block *sb, struct au_opt_xino *xino,
+			      substring_t args[])
+{
+	int err;
+	struct file *file;
+
+	file = au_xino_create(sb, args[0].from, /*silent*/0);
+	err = PTR_ERR(file);
+	if (IS_ERR(file))
+		goto out;
+
+	err = -EINVAL;
+	if (unlikely(file->f_dentry->d_sb == sb)) {
+		fput(file);
+		AuErr("%s must be outside\n", args[0].from);
+		goto out;
+	}
+
+	err = 0;
+	xino->file = file;
+	xino->path = args[0].from;
+
+ out:
+	return err;
+}
+
+static
+int au_opts_parse_xino_itrunc_path(struct super_block *sb,
+				   struct au_opt_xino_itrunc *xino_itrunc,
+				   substring_t args[])
+{
+	int err;
+	aufs_bindex_t bend, bindex;
+	struct path path;
+	struct dentry *root;
+
+	err = vfsub_kern_path(args[0].from, lkup_dirflags, &path);
+	if (unlikely(err)) {
+		AuErr("lookup failed %s (%d)\n", args[0].from, err);
+		goto out;
+	}
+
+	xino_itrunc->bindex = -1;
+	root = sb->s_root;
+	aufs_read_lock(root, AuLock_FLUSH);
+	bend = au_sbend(sb);
+	for (bindex = 0; bindex <= bend; bindex++) {
+		if (au_h_dptr(root, bindex) == path.dentry) {
+			xino_itrunc->bindex = bindex;
+			break;
+		}
+	}
+	aufs_read_unlock(root, !AuLock_IR);
+	path_put(&path);
+
+	if (unlikely(xino_itrunc->bindex < 0)) {
+		AuErr("no such branch %s\n", args[0].from);
+		err = -EINVAL;
+	}
+
+ out:
+	return err;
+}
+
+/* called without aufs lock */
+int au_opts_parse(struct super_block *sb, char *str, struct au_opts *opts)
+{
+	int err, n, token;
+	aufs_bindex_t bindex;
+	unsigned char skipped;
+	struct dentry *root;
+	struct au_opt *opt, *opt_tail;
+	char *opt_str;
+	/* reduce the stack space */
+	union {
+		struct au_opt_xino_itrunc *xino_itrunc;
+		struct au_opt_wbr_create *create;
+	} u;
+	struct {
+		substring_t args[MAX_OPT_ARGS];
+	} *a;
+
+	err = -ENOMEM;
+	a = kmalloc(sizeof(*a), GFP_NOFS);
+	if (unlikely(!a))
+		goto out;
+
+	root = sb->s_root;
+	err = 0;
+	bindex = 0;
+	opt = opts->opt;
+	opt_tail = opt + opts->max_opt - 1;
+	opt->type = Opt_tail;
+	while (!err && (opt_str = strsep(&str, ",")) && *opt_str) {
+		err = -EINVAL;
+		skipped = 0;
+		token = match_token(opt_str, options, a->args);
+		switch (token) {
+		case Opt_br:
+			err = 0;
+			while (!err && (opt_str = strsep(&a->args[0].from, ":"))
+			       && *opt_str) {
+				err = opt_add(opt, opt_str, opts->sb_flags,
+					      bindex++);
+				if (unlikely(!err && ++opt > opt_tail)) {
+					err = -E2BIG;
+					break;
+				}
+				opt->type = Opt_tail;
+				skipped = 1;
+			}
+			break;
+		case Opt_add:
+			if (unlikely(match_int(&a->args[0], &n))) {
+				AuErr("bad integer in %s\n", opt_str);
+				break;
+			}
+			bindex = n;
+			err = opt_add(opt, a->args[1].from, opts->sb_flags,
+				      bindex);
+			if (!err)
+				opt->type = token;
+			break;
+		case Opt_append:
+			err = opt_add(opt, a->args[0].from, opts->sb_flags,
+				      /*dummy bindex*/1);
+			if (!err)
+				opt->type = token;
+			break;
+		case Opt_prepend:
+			err = opt_add(opt, a->args[0].from, opts->sb_flags,
+				      /*bindex*/0);
+			if (!err)
+				opt->type = token;
+			break;
+		case Opt_del:
+			err = au_opts_parse_del(&opt->del, a->args);
+			if (!err)
+				opt->type = token;
+			break;
+#if 0 /* reserved for future use */
+		case Opt_idel:
+			del->pathname = "(indexed)";
+			if (unlikely(match_int(&args[0], &n))) {
+				AuErr("bad integer in %s\n", opt_str);
+				break;
+			}
+			err = au_opts_parse_idel(sb, n, &opt->del, a->args);
+			if (!err)
+				opt->type = token;
+			break;
+#endif
+		case Opt_mod:
+			err = au_opts_parse_mod(&opt->mod, a->args);
+			if (!err)
+				opt->type = token;
+			break;
+#ifdef IMOD /* reserved for future use */
+		case Opt_imod:
+			u.mod->path = "(indexed)";
+			if (unlikely(match_int(&a->args[0], &n))) {
+				AuErr("bad integer in %s\n", opt_str);
+				break;
+			}
+			err = au_opts_parse_imod(sb, n, &opt->mod, a->args);
+			if (!err)
+				opt->type = token;
+			break;
+#endif
+		case Opt_xino:
+			err = au_opts_parse_xino(sb, &opt->xino, a->args);
+			if (!err)
+				opt->type = token;
+			break;
+
+		case Opt_trunc_xino_path:
+			err = au_opts_parse_xino_itrunc_path
+				(sb, &opt->xino_itrunc, a->args);
+			if (!err)
+				opt->type = token;
+			break;
+
+		case Opt_itrunc_xino:
+			u.xino_itrunc = &opt->xino_itrunc;
+			if (unlikely(match_int(&a->args[0], &n))) {
+				AuErr("bad integer in %s\n", opt_str);
+				break;
+			}
+			u.xino_itrunc->bindex = n;
+			aufs_read_lock(root, AuLock_FLUSH);
+			if (n < 0 || au_sbend(sb) < n) {
+				AuErr("out of bounds, %d\n", n);
+				aufs_read_unlock(root, !AuLock_IR);
+				break;
+			}
+			aufs_read_unlock(root, !AuLock_IR);
+			err = 0;
+			opt->type = token;
+			break;
+
+		case Opt_dirwh:
+			if (unlikely(match_int(&a->args[0], &opt->dirwh)))
+				break;
+			err = 0;
+			opt->type = token;
+			break;
+
+		case Opt_rdcache:
+			if (unlikely(match_int(&a->args[0], &opt->rdcache)))
+				break;
+			err = 0;
+			opt->type = token;
+			break;
+
+		case Opt_trunc_xino:
+		case Opt_notrunc_xino:
+		case Opt_noxino:
+		case Opt_trunc_xib:
+		case Opt_notrunc_xib:
+		case Opt_plink:
+		case Opt_noplink:
+		case Opt_list_plink:
+		case Opt_diropq_a:
+		case Opt_diropq_w:
+		case Opt_warn_perm:
+		case Opt_nowarn_perm:
+		case Opt_refrof:
+		case Opt_norefrof:
+		case Opt_verbose:
+		case Opt_noverbose:
+		case Opt_sum:
+		case Opt_nosum:
+		case Opt_wsum:
+			err = 0;
+			opt->type = token;
+			break;
+
+		case Opt_udba:
+			opt->udba = udba_val(a->args[0].from);
+			if (opt->udba >= 0) {
+				err = 0;
+				opt->type = token;
+			} else
+				AuErr("wrong value, %s\n", opt_str);
+			break;
+
+		case Opt_wbr_create:
+			u.create = &opt->wbr_create;
+			u.create->wbr_create
+				= au_wbr_create_val(a->args[0].from, u.create);
+			if (u.create->wbr_create >= 0) {
+				err = 0;
+				opt->type = token;
+			} else
+				AuErr("wrong value, %s\n", opt_str);
+			break;
+		case Opt_wbr_copyup:
+			opt->wbr_copyup = au_wbr_copyup_val(a->args[0].from);
+			if (opt->wbr_copyup >= 0) {
+				err = 0;
+				opt->type = token;
+			} else
+				AuErr("wrong value, %s\n", opt_str);
+			break;
+
+		case Opt_ignore:
+			AuWarn("ignored %s\n", opt_str);
+			/*FALLTHROUGH*/
+		case Opt_ignore_silent:
+			skipped = 1;
+			err = 0;
+			break;
+		case Opt_err:
+			AuErr("unknown option %s\n", opt_str);
+			break;
+		}
+
+		if (!err && !skipped) {
+			if (unlikely(++opt > opt_tail)) {
+				err = -E2BIG;
+				opt--;
+				opt->type = Opt_tail;
+				break;
+			}
+			opt->type = Opt_tail;
+		}
+	}
+
+	kfree(a);
+	dump_opts(opts);
+	if (unlikely(err))
+		au_opts_free(opts);
+
+ out:
+	return err;
+}
+
+static int au_opt_wbr_create(struct super_block *sb,
+			     struct au_opt_wbr_create *create)
+{
+	int err;
+	struct au_sbinfo *sbinfo;
+
+	err = 1; /* handled */
+	sbinfo = au_sbi(sb);
+	if (sbinfo->si_wbr_create_ops->fin) {
+		err = sbinfo->si_wbr_create_ops->fin(sb);
+		if (!err)
+			err = 1;
+	}
+
+	sbinfo->si_wbr_create = create->wbr_create;
+	sbinfo->si_wbr_create_ops = au_wbr_create_ops + create->wbr_create;
+	switch (create->wbr_create) {
+	case AuWbrCreate_MFSRRV:
+	case AuWbrCreate_MFSRR:
+		sbinfo->si_wbr_mfs.mfsrr_watermark = create->mfsrr_watermark;
+		/*FALLTHROUGH*/
+	case AuWbrCreate_MFS:
+	case AuWbrCreate_MFSV:
+	case AuWbrCreate_PMFS:
+	case AuWbrCreate_PMFSV:
+		sbinfo->si_wbr_mfs.mfs_expire = create->mfs_second * HZ;
+		break;
+	}
+
+	if (sbinfo->si_wbr_create_ops->init)
+		sbinfo->si_wbr_create_ops->init(sb); /* ignore */
+
+	return err;
+}
+
+/*
+ * returns,
+ * plus: processed without an error
+ * zero: unprocessed
+ */
+static int au_opt_simple(struct super_block *sb, struct au_opt *opt,
+			 struct au_opts *opts)
+{
+	int err;
+	struct au_sbinfo *sbinfo;
+
+	err = 1; /* handled */
+	sbinfo = au_sbi(sb);
+	switch (opt->type) {
+	case Opt_udba:
+		sbinfo->si_mntflags &= ~AuOptMask_UDBA;
+		sbinfo->si_mntflags |= opt->udba;
+		opts->given_udba |= opt->udba;
+		break;
+
+	case Opt_plink:
+		au_opt_set(sbinfo->si_mntflags, PLINK);
+		break;
+	case Opt_noplink:
+		if (au_opt_test(sbinfo->si_mntflags, PLINK))
+			au_plink_put(sb);
+		au_opt_clr(sbinfo->si_mntflags, PLINK);
+		break;
+	case Opt_list_plink:
+		if (au_opt_test(sbinfo->si_mntflags, PLINK))
+			au_plink_list(sb);
+		break;
+
+	case Opt_diropq_a:
+		au_opt_set(sbinfo->si_mntflags, ALWAYS_DIROPQ);
+		break;
+	case Opt_diropq_w:
+		au_opt_clr(sbinfo->si_mntflags, ALWAYS_DIROPQ);
+		break;
+
+	case Opt_warn_perm:
+		au_opt_set(sbinfo->si_mntflags, WARN_PERM);
+		break;
+	case Opt_nowarn_perm:
+		au_opt_clr(sbinfo->si_mntflags, WARN_PERM);
+		break;
+
+	case Opt_refrof:
+		au_opt_set(sbinfo->si_mntflags, REFROF);
+		break;
+	case Opt_norefrof:
+		au_opt_clr(sbinfo->si_mntflags, REFROF);
+		break;
+
+	case Opt_verbose:
+		au_opt_set(sbinfo->si_mntflags, VERBOSE);
+		break;
+	case Opt_noverbose:
+		au_opt_clr(sbinfo->si_mntflags, VERBOSE);
+		break;
+
+	case Opt_sum:
+		au_opt_set(sbinfo->si_mntflags, SUM);
+		break;
+	case Opt_wsum:
+		au_opt_clr(sbinfo->si_mntflags, SUM);
+		au_opt_set(sbinfo->si_mntflags, SUM_W);
+	case Opt_nosum:
+		au_opt_clr(sbinfo->si_mntflags, SUM);
+		au_opt_clr(sbinfo->si_mntflags, SUM_W);
+		break;
+
+	case Opt_wbr_create:
+		err = au_opt_wbr_create(sb, &opt->wbr_create);
+		break;
+	case Opt_wbr_copyup:
+		sbinfo->si_wbr_copyup = opt->wbr_copyup;
+		sbinfo->si_wbr_copyup_ops = au_wbr_copyup_ops + opt->wbr_copyup;
+		break;
+
+	case Opt_dirwh:
+		sbinfo->si_dirwh = opt->dirwh;
+		break;
+
+	case Opt_rdcache:
+		sbinfo->si_rdcache = opt->rdcache * HZ;
+		break;
+
+	case Opt_trunc_xino:
+		au_opt_set(sbinfo->si_mntflags, TRUNC_XINO);
+		break;
+	case Opt_notrunc_xino:
+		au_opt_clr(sbinfo->si_mntflags, TRUNC_XINO);
+		break;
+
+	case Opt_trunc_xino_path:
+	case Opt_itrunc_xino:
+		err = au_xino_trunc(sb, opt->xino_itrunc.bindex);
+		if (!err)
+			err = 1;
+		break;
+
+	case Opt_trunc_xib:
+		au_fset_opts(opts->flags, TRUNC_XIB);
+		break;
+	case Opt_notrunc_xib:
+		au_fclr_opts(opts->flags, TRUNC_XIB);
+		break;
+
+	default:
+		err = 0;
+		break;
+	}
+
+	return err;
+}
+
+/*
+ * returns tri-state.
+ * plus: processed without an error
+ * zero: unprocessed
+ * minus: error
+ */
+static int au_opt_br(struct super_block *sb, struct au_opt *opt,
+		     struct au_opts *opts)
+{
+	int err, do_refresh;
+
+	err = 0;
+	switch (opt->type) {
+	case Opt_append:
+		opt->add.bindex = au_sbend(sb) + 1;
+		if (opt->add.bindex < 0)
+			opt->add.bindex = 0;
+		goto add;
+	case Opt_prepend:
+		opt->add.bindex = 0;
+	add:
+	case Opt_add:
+		err = au_br_add(sb, &opt->add,
+				au_ftest_opts(opts->flags, REMOUNT));
+		if (!err) {
+			err = 1;
+			au_fset_opts(opts->flags, REFRESH_DIR);
+			if (au_br_whable(opt->add.perm))
+				au_fset_opts(opts->flags, REFRESH_NONDIR);
+		}
+		break;
+
+	case Opt_del:
+	case Opt_idel:
+		err = au_br_del(sb, &opt->del,
+				au_ftest_opts(opts->flags, REMOUNT));
+		if (!err) {
+			err = 1;
+			au_fset_opts(opts->flags, TRUNC_XIB);
+			au_fset_opts(opts->flags, REFRESH_DIR);
+			au_fset_opts(opts->flags, REFRESH_NONDIR);
+		}
+		break;
+
+	case Opt_mod:
+	case Opt_imod:
+		err = au_br_mod(sb, &opt->mod,
+				au_ftest_opts(opts->flags, REMOUNT),
+				&do_refresh);
+		if (!err) {
+			err = 1;
+			if (do_refresh) {
+				au_fset_opts(opts->flags, REFRESH_DIR);
+				au_fset_opts(opts->flags, REFRESH_NONDIR);
+			}
+		}
+		break;
+	}
+
+	return err;
+}
+
+static int au_opt_xino(struct super_block *sb, struct au_opt *opt,
+		       struct au_opt_xino **opt_xino,
+		       struct au_opts *opts)
+{
+	int err;
+	aufs_bindex_t bend, bindex;
+	struct dentry *root, *parent, *h_root;
+
+	err = 0;
+	switch (opt->type) {
+	case Opt_xino:
+		err = au_xino_set(sb, &opt->xino,
+				  !!au_ftest_opts(opts->flags, REMOUNT));
+		if (unlikely(err))
+			break;
+
+		*opt_xino = &opt->xino;
+		au_xino_brid_set(sb, -1);
+
+		/* safe d_parent access */
+		parent = opt->xino.file->f_dentry->d_parent;
+		root = sb->s_root;
+		bend = au_sbend(sb);
+		for (bindex = 0; bindex <= bend; bindex++) {
+			h_root = au_h_dptr(root, bindex);
+			if (h_root == parent) {
+				au_xino_brid_set(sb, au_sbr_id(sb, bindex));
+				break;
+			}
+		}
+		break;
+
+	case Opt_noxino:
+		au_xino_clr(sb);
+		au_xino_brid_set(sb, -1);
+		*opt_xino = (void *)-1;
+		break;
+	}
+
+	return err;
+}
+
+int au_opts_verify(struct super_block *sb, unsigned long sb_flags,
+		   unsigned int pending)
+{
+	int err;
+	aufs_bindex_t bindex, bend;
+	unsigned char do_plink, skip, do_free;
+	struct au_branch *br;
+	struct au_wbr *wbr;
+	struct dentry *root;
+	struct inode *dir, *h_dir;
+	struct au_sbinfo *sbinfo;
+	struct au_hinode *hdir;
+
+	sbinfo = au_sbi(sb);
+	AuDebugOn(!(sbinfo->si_mntflags & AuOptMask_UDBA));
+
+	if (unlikely(!(sb_flags & MS_RDONLY)
+		     && !au_br_writable(au_sbr_perm(sb, 0))))
+		AuWarn("first branch should be rw\n");
+
+	if (au_opt_test((sbinfo->si_mntflags | pending), UDBA_HINOTIFY)
+	    && !au_opt_test(sbinfo->si_mntflags, XINO))
+		AuWarn("udba=inotify requires xino\n");
+
+	err = 0;
+	root = sb->s_root;
+	dir = sb->s_root->d_inode;
+	do_plink = !!au_opt_test(sbinfo->si_mntflags, PLINK);
+	bend = au_sbend(sb);
+	for (bindex = 0; !err && bindex <= bend; bindex++) {
+		skip = 0;
+		h_dir = au_h_iptr(dir, bindex);
+		br = au_sbr(sb, bindex);
+		do_free = 0;
+
+		wbr = br->br_wbr;
+		if (wbr)
+			wbr_wh_read_lock(wbr);
+
+		switch (br->br_perm) {
+		case AuBrPerm_RO:
+		case AuBrPerm_ROWH:
+		case AuBrPerm_RR:
+		case AuBrPerm_RRWH:
+			do_free = !!wbr;
+			skip = (!wbr
+				|| (!wbr->wbr_whbase
+				    && !wbr->wbr_plink
+				    && !wbr->wbr_orph));
+			break;
+
+		case AuBrPerm_RWNoLinkWH:
+			/* skip = (!br->br_whbase && !br->br_orph); */
+			skip = (!wbr || !wbr->wbr_whbase);
+			if (skip && wbr) {
+				if (do_plink)
+					skip = !!wbr->wbr_plink;
+				else
+					skip = !wbr->wbr_plink;
+			}
+			break;
+
+		case AuBrPerm_RW:
+			/* skip = (br->br_whbase && br->br_ohph); */
+			skip = (wbr && wbr->wbr_whbase);
+			if (skip) {
+				if (do_plink)
+					skip = !!wbr->wbr_plink;
+				else
+					skip = !wbr->wbr_plink;
+			}
+			break;
+
+		default:
+			BUG();
+		}
+		if (wbr)
+			wbr_wh_read_unlock(wbr);
+
+		if (skip)
+			continue;
+
+		hdir = au_hi(dir, bindex);
+		au_hin_imtx_lock_nested(hdir, AuLsc_I_PARENT);
+		if (wbr)
+			wbr_wh_write_lock(wbr);
+		err = au_wh_init(au_h_dptr(root, bindex), br, sb);
+		if (wbr)
+			wbr_wh_write_unlock(wbr);
+		au_hin_imtx_unlock(hdir);
+
+		if (!err && do_free) {
+			kfree(wbr);
+			br->br_wbr = NULL;
+		}
+	}
+
+	return err;
+}
+
+int au_opts_mount(struct super_block *sb, struct au_opts *opts)
+{
+	int err;
+	unsigned int tmp;
+	aufs_bindex_t bend;
+	struct au_opt *opt;
+	struct au_opt_xino *opt_xino, xino;
+	struct au_sbinfo *sbinfo;
+
+	err = 0;
+	opt_xino = NULL;
+	opt = opts->opt;
+	while (err >= 0 && opt->type != Opt_tail)
+		err = au_opt_simple(sb, opt++, opts);
+	if (err > 0)
+		err = 0;
+	else if (unlikely(err < 0))
+		goto out;
+
+	/* disable xino and udba temporary */
+	sbinfo = au_sbi(sb);
+	tmp = sbinfo->si_mntflags;
+	au_opt_clr(sbinfo->si_mntflags, XINO);
+	au_opt_set_udba(sbinfo->si_mntflags, UDBA_REVAL);
+
+	opt = opts->opt;
+	while (err >= 0 && opt->type != Opt_tail)
+		err = au_opt_br(sb, opt++, opts);
+	if (err > 0)
+		err = 0;
+	else if (unlikely(err < 0))
+		goto out;
+
+	bend = au_sbend(sb);
+	if (unlikely(bend < 0)) {
+		err = -EINVAL;
+		AuErr("no branches\n");
+		goto out;
+	}
+
+	if (au_opt_test(tmp, XINO))
+		au_opt_set(sbinfo->si_mntflags, XINO);
+	opt = opts->opt;
+	while (!err && opt->type != Opt_tail)
+		err = au_opt_xino(sb, opt++, &opt_xino, opts);
+	if (unlikely(err))
+		goto out;
+
+	err = au_opts_verify(sb, sb->s_flags, tmp);
+	if (unlikely(err))
+		goto out;
+
+	/* restore xino */
+	if (au_opt_test(tmp, XINO) && !opt_xino) {
+		xino.file = au_xino_def(sb);
+		err = PTR_ERR(xino.file);
+		if (IS_ERR(xino.file))
+			goto out;
+
+		err = au_xino_set(sb, &xino, /*remount*/0);
+		fput(xino.file);
+		if (unlikely(err))
+			goto out;
+	}
+
+	/* restore udba */
+	sbinfo->si_mntflags &= ~AuOptMask_UDBA;
+	sbinfo->si_mntflags |= (tmp & AuOptMask_UDBA);
+	if (au_opt_test(tmp, UDBA_HINOTIFY)) {
+		struct inode *dir = sb->s_root->d_inode;
+		au_reset_hinotify(dir,
+				  au_hi_flags(dir, /*isdir*/1) & ~AuHi_XINO);
+	}
+
+ out:
+	return err;
+}
+
+int au_opts_remount(struct super_block *sb, struct au_opts *opts)
+{
+	int err, rerr;
+	struct inode *dir;
+	struct au_opt_xino *opt_xino;
+	struct au_opt *opt;
+	struct au_sbinfo *sbinfo;
+
+	dir = sb->s_root->d_inode;
+	sbinfo = au_sbi(sb);
+	err = 0;
+	opt_xino = NULL;
+	opt = opts->opt;
+	while (err >= 0 && opt->type != Opt_tail) {
+		err = au_opt_simple(sb, opt, opts);
+		if (!err)
+			err = au_opt_br(sb, opt, opts);
+		if (!err)
+			err = au_opt_xino(sb, opt, &opt_xino, opts);
+		opt++;
+	}
+	if (err > 0)
+		err = 0;
+	AuTraceErr(err);
+	/* go on even err */
+
+	rerr = au_opts_verify(sb, opts->sb_flags, /*pending*/0);
+	if (unlikely(rerr && !err))
+		err = rerr;
+
+	if (au_ftest_opts(opts->flags, TRUNC_XIB)) {
+		rerr = au_xib_trunc(sb);
+		if (unlikely(rerr && !err))
+			err = rerr;
+	}
+
+	/* will be handled by the caller */
+	if (!au_ftest_opts(opts->flags, REFRESH_DIR)
+	    && (opts->given_udba || au_opt_test(sbinfo->si_mntflags, XINO)))
+		au_fset_opts(opts->flags, REFRESH_DIR);
+
+	AuDbg("status 0x%x\n", opts->flags);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+unsigned int au_opt_udba(struct super_block *sb)
+{
+	return au_mntflags(sb) & AuOptMask_UDBA;
+}
diff --git a/fs/aufs/opts.h b/fs/aufs/opts.h
new file mode 100644
index 0000000..89c3e3d
--- /dev/null
+++ b/fs/aufs/opts.h
@@ -0,0 +1,180 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * mount options/flags
+ */
+
+#ifndef __AUFS_OPTS_H__
+#define __AUFS_OPTS_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/aufs_type.h>
+
+/* ---------------------------------------------------------------------- */
+
+/* mount flags */
+#define AuOpt_XINO		1		/* external inode number bitmap
+						   and translation table */
+#define AuOpt_TRUNC_XINO	(1 << 1)	/* truncate xino files */
+#define AuOpt_UDBA_NONE		(1 << 2)	/* users direct branch access */
+#define AuOpt_UDBA_REVAL	(1 << 3)
+#define AuOpt_UDBA_HINOTIFY	(1 << 4)
+#define AuOpt_PLINK		(1 << 5)	/* pseudo-link */
+#define AuOpt_DIRPERM1		(1 << 6)	/* unimplemented */
+#define AuOpt_REFROF		(1 << 7)	/* unimplemented */
+#define AuOpt_ALWAYS_DIROPQ	(1 << 8)	/* policy to creating diropq */
+#define AuOpt_SUM		(1 << 9)	/* summation for statfs(2) */
+#define AuOpt_SUM_W		(1 << 10)	/* unimplemented */
+#define AuOpt_WARN_PERM		(1 << 11)	/* warn when add-branch */
+#define AuOpt_VERBOSE		(1 << 12)	/* busy inode when del-branch */
+
+#ifndef CONFIG_AUFS_HINOTIFY
+#undef AuOpt_UDBA_HINOTIFY
+#define AuOpt_UDBA_HINOTIFY	0
+#endif
+
+#define AuOpt_Def	(AuOpt_XINO \
+			 | AuOpt_UDBA_REVAL \
+			 | AuOpt_PLINK \
+			 /* | AuOpt_DIRPERM1 */ \
+			 | AuOpt_WARN_PERM)
+#define AuOptMask_UDBA	(AuOpt_UDBA_NONE \
+			 | AuOpt_UDBA_REVAL \
+			 | AuOpt_UDBA_HINOTIFY)
+
+#define au_opt_test(flags, name)	(flags & AuOpt_##name)
+#define au_opt_set(flags, name) do { \
+	BUILD_BUG_ON(AuOpt_##name & AuOptMask_UDBA); \
+	((flags) |= AuOpt_##name); \
+} while (0)
+#define au_opt_set_udba(flags, name) do { \
+	(flags) &= ~AuOptMask_UDBA; \
+	((flags) |= AuOpt_##name); \
+} while (0)
+#define au_opt_clr(flags, name)		{ ((flags) &= ~AuOpt_##name); }
+
+/* ---------------------------------------------------------------------- */
+
+/* policies to select one among multiple writable branches */
+enum {
+	AuWbrCreate_TDP,	/* top down parent */
+	AuWbrCreate_RR,		/* round robin */
+	AuWbrCreate_MFS,	/* most free space */
+	AuWbrCreate_MFSV,	/* mfs with seconds */
+	AuWbrCreate_MFSRR,	/* mfs then rr */
+	AuWbrCreate_MFSRRV,	/* mfs then rr with seconds */
+	AuWbrCreate_PMFS,	/* parent and mfs */
+	AuWbrCreate_PMFSV,	/* parent and mfs with seconds */
+
+	AuWbrCreate_Def = AuWbrCreate_TDP
+};
+
+enum {
+	AuWbrCopyup_TDP,	/* top down parent */
+	AuWbrCopyup_BUP,	/* bottom up parent */
+	AuWbrCopyup_BU,		/* bottom up */
+
+	AuWbrCopyup_Def = AuWbrCopyup_TDP
+};
+
+/* ---------------------------------------------------------------------- */
+
+struct au_opt_add {
+	aufs_bindex_t	bindex;
+	char		*pathname;
+	int		perm;
+	struct path	path;
+};
+
+struct au_opt_del {
+	char		*pathname;
+	struct path	h_path;
+};
+
+struct au_opt_mod {
+	char		*path;
+	int		perm;
+	struct dentry	*h_root;
+};
+
+struct au_opt_xino {
+	char		*path;
+	struct file	*file;
+};
+
+struct au_opt_xino_itrunc {
+	aufs_bindex_t	bindex;
+};
+
+struct au_opt_wbr_create {
+	int			wbr_create;
+	int			mfs_second;
+	unsigned long long	mfsrr_watermark;
+};
+
+struct au_opt {
+	int type;
+	union {
+		struct au_opt_xino	xino;
+		struct au_opt_xino_itrunc xino_itrunc;
+		struct au_opt_add	add;
+		struct au_opt_del	del;
+		struct au_opt_mod	mod;
+		int			dirwh;
+		int			rdcache;
+		int			deblk;
+		int			nhash;
+		int			udba;
+		struct au_opt_wbr_create wbr_create;
+		int			wbr_copyup;
+	};
+};
+
+/* opts flags */
+#define AuOpts_REMOUNT		1
+#define AuOpts_REFRESH_DIR	(1 << 1)
+#define AuOpts_REFRESH_NONDIR	(1 << 2)
+#define AuOpts_TRUNC_XIB	(1 << 3)
+#define au_ftest_opts(flags, name)	((flags) & AuOpts_##name)
+#define au_fset_opts(flags, name)	{ (flags) |= AuOpts_##name; }
+#define au_fclr_opts(flags, name)	{ (flags) &= ~AuOpts_##name; }
+
+struct au_opts {
+	struct au_opt	*opt;
+	int		max_opt;
+
+	unsigned int	given_udba;
+	unsigned int	flags;
+	unsigned long	sb_flags;
+};
+
+/* ---------------------------------------------------------------------- */
+
+const char *au_optstr_br_perm(int brperm);
+const char *au_optstr_udba(int udba);
+const char *au_optstr_wbr_copyup(int wbr_copyup);
+const char *au_optstr_wbr_create(int wbr_create);
+
+void au_opts_free(struct au_opts *opts);
+int au_opts_parse(struct super_block *sb, char *str, struct au_opts *opts);
+int au_opts_verify(struct super_block *sb, unsigned long sb_flags,
+		   unsigned int pending);
+int au_opts_mount(struct super_block *sb, struct au_opts *opts);
+int au_opts_remount(struct super_block *sb, struct au_opts *opts);
+
+unsigned int au_opt_udba(struct super_block *sb);
+
+/* ---------------------------------------------------------------------- */
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_OPTS_H__ */
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 09/28] aufs workqueue
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (7 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 08/28] aufs mount options/flags J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 10/28] aufs sub-VFS J. R. Okajima
                   ` (18 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
workqueue for asynchronous and super-io operations.
they are used for:
- handling 'opaque' directory and whiteout
- lookup and copy-up/down with credential
- internal xino file i/o

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/wkq.c |  249 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/wkq.h |   72 +++++++++++++++++
 2 files changed, 321 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/wkq.c
 create mode 100644 fs/aufs/wkq.h

diff --git a/fs/aufs/wkq.c b/fs/aufs/wkq.c
new file mode 100644
index 0000000..41e30be
--- /dev/null
+++ b/fs/aufs/wkq.c
@@ -0,0 +1,249 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * workqueue for asynchronous/super-io operations
+ * todo: try new dredential scheme
+ */
+
+#include "aufs.h"
+
+/* internal workqueue named AUFS_WKQ_NAME */
+static struct au_wkq {
+	struct workqueue_struct	*q;
+
+	/* balancing */
+	atomic_t		busy;
+} *au_wkq;
+
+struct au_wkinfo {
+	struct work_struct wk;
+	struct super_block *sb;
+
+	unsigned int flags; /* see wkq.h */
+
+	au_wkq_func_t func;
+	void *args;
+
+	atomic_t *busyp;
+	struct completion *comp;
+};
+
+/* ---------------------------------------------------------------------- */
+
+static int enqueue(struct au_wkq *wkq, struct au_wkinfo *wkinfo)
+{
+	wkinfo->busyp = &wkq->busy;
+	if (au_ftest_wkq(wkinfo->flags, WAIT))
+		return !queue_work(wkq->q, &wkinfo->wk);
+	else
+		return !schedule_work(&wkinfo->wk);
+}
+
+static void do_wkq(struct au_wkinfo *wkinfo)
+{
+	unsigned int idle, n;
+	int i, idle_idx;
+
+	while (1) {
+		if (au_ftest_wkq(wkinfo->flags, WAIT)) {
+			idle_idx = 0;
+			idle = UINT_MAX;
+			for (i = 0; i < aufs_nwkq; i++) {
+				n = atomic_inc_return(&au_wkq[i].busy);
+				if (n == 1 && !enqueue(au_wkq + i, wkinfo))
+					return; /* success */
+
+				if (n < idle) {
+					idle_idx = i;
+					idle = n;
+				}
+				atomic_dec(&au_wkq[i].busy);
+			}
+		} else
+			idle_idx = aufs_nwkq;
+
+		atomic_inc(&au_wkq[idle_idx].busy);
+		if (!enqueue(au_wkq + idle_idx, wkinfo))
+			return; /* success */
+
+		/* impossible? */
+		AuWarn1("failed to queue_work()\n");
+		yield();
+	}
+}
+
+static void wkq_func(struct work_struct *wk)
+{
+	struct au_wkinfo *wkinfo = container_of(wk, struct au_wkinfo, wk);
+
+	wkinfo->func(wkinfo->args);
+	atomic_dec(wkinfo->busyp);
+	if (au_ftest_wkq(wkinfo->flags, WAIT))
+		complete(wkinfo->comp);
+	else {
+		kobject_put(&au_sbi(wkinfo->sb)->si_kobj);
+		module_put(THIS_MODULE);
+		kfree(wkinfo);
+	}
+}
+
+/*
+ * Since struct completion is large, try allocating it dynamically.
+ */
+#if defined(CONFIG_4KSTACKS) || defined(AuTest4KSTACKS)
+#define AuWkqCompDeclare(name)	struct completion *comp = NULL
+
+static int au_wkq_comp_alloc(struct au_wkinfo *wkinfo, struct completion **comp)
+{
+	*comp = kmalloc(sizeof(**comp), GFP_NOFS);
+	if (*comp) {
+		init_completion(*comp);
+		wkinfo->comp = *comp;
+		return 0;
+	}
+	return -ENOMEM;
+}
+
+static void au_wkq_comp_free(struct completion *comp)
+{
+	kfree(comp);
+}
+
+#else
+
+/* no braces */
+#define AuWkqCompDeclare(name) \
+	DECLARE_COMPLETION_ONSTACK(_ ## name); \
+	struct completion *comp = &_ ## name
+
+static int au_wkq_comp_alloc(struct au_wkinfo *wkinfo, struct completion **comp)
+{
+	wkinfo->comp = *comp;
+	return 0;
+}
+
+static void au_wkq_comp_free(struct completion *comp __maybe_unused)
+{
+	/* empty */
+}
+#endif /* 4KSTACKS */
+
+static void au_wkq_run(struct au_wkinfo *wkinfo)
+{
+	au_dbg_verify_kthread();
+	INIT_WORK(&wkinfo->wk, wkq_func);
+	do_wkq(wkinfo);
+}
+
+int au_wkq_wait(au_wkq_func_t func, void *args)
+{
+	int err;
+	AuWkqCompDeclare(comp);
+	struct au_wkinfo wkinfo = {
+		.flags	= AuWkq_WAIT,
+		.func	= func,
+		.args	= args
+	};
+
+	err = au_wkq_comp_alloc(&wkinfo, &comp);
+	if (!err) {
+		au_wkq_run(&wkinfo);
+		/* no timeout, no interrupt */
+		wait_for_completion(wkinfo.comp);
+		au_wkq_comp_free(comp);
+	}
+
+	return err;
+
+}
+
+int au_wkq_nowait(au_wkq_func_t func, void *args, struct super_block *sb)
+{
+	int err;
+	struct au_wkinfo *wkinfo;
+
+	atomic_inc(&au_sbi(sb)->si_nowait.nw_len);
+
+	/*
+	 * wkq_func() must free this wkinfo.
+	 * it highly depends upon the implementation of workqueue.
+	 */
+	err = 0;
+	wkinfo = kmalloc(sizeof(*wkinfo), GFP_NOFS);
+	if (wkinfo) {
+		wkinfo->sb = sb;
+		wkinfo->flags = !AuWkq_WAIT;
+		wkinfo->func = func;
+		wkinfo->args = args;
+		wkinfo->comp = NULL;
+		kobject_get(&au_sbi(sb)->si_kobj);
+		__module_get(THIS_MODULE);
+
+		au_wkq_run(wkinfo);
+	} else {
+		err = -ENOMEM;
+		atomic_dec(&au_sbi(sb)->si_nowait.nw_len);
+	}
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+void au_nwt_init(struct au_nowait_tasks *nwt)
+{
+	atomic_set(&nwt->nw_len, 0);
+	/* smp_mb();*/ /* atomic_set */
+	init_waitqueue_head(&nwt->nw_wq);
+}
+
+void au_wkq_fin(void)
+{
+	int i;
+
+	for (i = 0; i < aufs_nwkq; i++)
+		if (au_wkq[i].q && !IS_ERR(au_wkq[i].q))
+			destroy_workqueue(au_wkq[i].q);
+	kfree(au_wkq);
+}
+
+int __init au_wkq_init(void)
+{
+	int err, i;
+	struct au_wkq *nowaitq;
+
+	/* '+1' is for accounting of nowait queue */
+	err = -ENOMEM;
+	au_wkq = kcalloc(aufs_nwkq + 1, sizeof(*au_wkq), GFP_NOFS);
+	if (unlikely(!au_wkq))
+		goto out;
+
+	err = 0;
+	for (i = 0; i < aufs_nwkq; i++) {
+		au_wkq[i].q = create_singlethread_workqueue(AUFS_WKQ_NAME);
+		if (au_wkq[i].q && !IS_ERR(au_wkq[i].q)) {
+			atomic_set(&au_wkq[i].busy, 0);
+			continue;
+		}
+
+		err = PTR_ERR(au_wkq[i].q);
+		au_wkq_fin();
+		goto out;
+	}
+
+	/* nowait accounting */
+	nowaitq = au_wkq + aufs_nwkq;
+	atomic_set(&nowaitq->busy, 0);
+	nowaitq->q = NULL;
+	/* smp_mb(); */ /* atomic_set */
+
+ out:
+	return err;
+}
diff --git a/fs/aufs/wkq.h b/fs/aufs/wkq.h
new file mode 100644
index 0000000..17d24d8
--- /dev/null
+++ b/fs/aufs/wkq.h
@@ -0,0 +1,72 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * workqueue for asynchronous/super-io operations
+ * todo: try new credentials management scheme
+ */
+
+#ifndef __AUFS_WKQ_H__
+#define __AUFS_WKQ_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/workqueue.h>
+#include <linux/aufs_type.h>
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * in the next operation, wait for the 'nowait' tasks in system-wide workqueue
+ */
+struct au_nowait_tasks {
+	atomic_t		nw_len;
+	wait_queue_head_t	nw_wq;
+};
+
+/* ---------------------------------------------------------------------- */
+
+typedef void (*au_wkq_func_t)(void *args);
+
+/* wkq flags */
+#define AuWkq_WAIT	1
+#define au_ftest_wkq(flags, name)	((flags) & AuWkq_##name)
+#define au_fset_wkq(flags, name)	{ (flags) |= AuWkq_##name; }
+#define au_fclr_wkq(flags, name)	{ (flags) &= ~AuWkq_##name; }
+
+/* wkq.c */
+int au_wkq_wait(au_wkq_func_t func, void *args);
+int au_wkq_nowait(au_wkq_func_t func, void *args, struct super_block *sb);
+void au_nwt_init(struct au_nowait_tasks *nwt);
+int __init au_wkq_init(void);
+void au_wkq_fin(void);
+
+/* ---------------------------------------------------------------------- */
+
+static inline int au_test_wkq(struct task_struct *tsk)
+{
+	return !tsk->mm && !strcmp(tsk->comm, AUFS_WKQ_NAME);
+}
+
+static inline void au_nwt_done(struct au_nowait_tasks *nwt)
+{
+	if (!atomic_dec_return(&nwt->nw_len))
+		wake_up_all(&nwt->nw_wq);
+}
+
+static inline int au_nwt_flush(struct au_nowait_tasks *nwt)
+{
+	wait_event(nwt->nw_wq, !atomic_read(&nwt->nw_len));
+	return 0;
+}
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_WKQ_H__ */
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 10/28] aufs sub-VFS
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (8 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 09/28] aufs workqueue J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 11/28] aufs sub-dcache J. R. Okajima
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
sub-routines or simple wrappers for VFS

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/vfsub.c |  716 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/vfsub.h |  137 +++++++++++
 2 files changed, 853 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/vfsub.c
 create mode 100644 fs/aufs/vfsub.h

diff --git a/fs/aufs/vfsub.c b/fs/aufs/vfsub.c
new file mode 100644
index 0000000..34359cd
--- /dev/null
+++ b/fs/aufs/vfsub.c
@@ -0,0 +1,716 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * sub-routines for VFS
+ */
+
+#include <linux/uaccess.h>
+#include "aufs.h"
+
+int vfsub_update_h_iattr(struct path *h_path, int *did)
+{
+	int err;
+	struct kstat st;
+	struct super_block *h_sb;
+
+	/* for remote fs, leave work for its getattr or d_revalidate */
+	/* for bad i_attr fs, handle them in aufs_getattr() */
+	/* still some fs may acquire i_mutex. we need to skip them */
+	err = 0;
+	if (!did)
+		did = &err;
+	h_sb = h_path->dentry->d_sb;
+	*did = (!au_test_fs_remote(h_sb) && au_test_fs_refresh_iattr(h_sb));
+	if (*did)
+		err = vfs_getattr(h_path->mnt, h_path->dentry, &st);
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+struct file *vfsub_filp_open(const char *path, int oflags, int mode)
+{
+	struct file *file;
+
+	lockdep_off();
+	file = filp_open(path, oflags, mode);
+	lockdep_on();
+	if (IS_ERR(file))
+		goto out;
+	vfsub_update_h_iattr(&file->f_path, /*did*/NULL); /*ignore*/
+
+ out:
+	return file;
+}
+
+int vfsub_kern_path(const char *name, unsigned int flags, struct path *path)
+{
+	int err;
+
+	/* lockdep_off(); */
+	err = kern_path(name, flags, path);
+	/* lockdep_on(); */
+	if (!err && path->dentry->d_inode)
+		vfsub_update_h_iattr(path, /*did*/NULL); /*ignore*/
+	return err;
+}
+
+struct dentry *vfsub_lookup_one_len(const char *name, struct dentry *parent,
+				    int len)
+{
+	struct path path = {
+		.mnt = NULL
+	};
+
+	IMustLock(parent->d_inode);
+
+	path.dentry = lookup_one_len(name, parent, len);
+	if (IS_ERR(path.dentry))
+		goto out;
+	if (path.dentry->d_inode)
+		vfsub_update_h_iattr(&path, /*did*/NULL); /*ignore*/
+
+ out:
+	return path.dentry;
+}
+
+struct dentry *vfsub_lookup_hash(struct nameidata *nd)
+{
+	struct path path = {
+		.mnt = nd->path.mnt
+	};
+
+	IMustLock(nd->path.dentry->d_inode);
+
+	path.dentry = lookup_hash(nd);
+	if (!IS_ERR(path.dentry) && path.dentry->d_inode)
+		vfsub_update_h_iattr(&path, /*did*/NULL); /*ignore*/
+
+	return path.dentry;
+}
+
+/* ---------------------------------------------------------------------- */
+
+struct dentry *vfsub_lock_rename(struct dentry *d1, struct au_hinode *hdir1,
+				 struct dentry *d2, struct au_hinode *hdir2)
+{
+	struct dentry *d;
+
+	lockdep_off();
+	d = lock_rename(d1, d2);
+	lockdep_on();
+	au_hin_suspend(hdir1);
+	if (hdir1 != hdir2)
+		au_hin_suspend(hdir2);
+
+	return d;
+}
+
+void vfsub_unlock_rename(struct dentry *d1, struct au_hinode *hdir1,
+			 struct dentry *d2, struct au_hinode *hdir2)
+{
+	au_hin_resume(hdir1);
+	if (hdir1 != hdir2)
+		au_hin_resume(hdir2);
+	lockdep_off();
+	unlock_rename(d1, d2);
+	lockdep_on();
+}
+
+/* ---------------------------------------------------------------------- */
+
+int vfsub_create(struct inode *dir, struct path *path, int mode)
+{
+	int err;
+	struct dentry *d;
+
+	IMustLock(dir);
+
+	d = path->dentry;
+	path->dentry = d->d_parent;
+	err = security_path_mknod(path, path->dentry, mode, 0);
+	path->dentry = d;
+	if (unlikely(err))
+		goto out;
+
+	if (au_test_fs_null_nd(dir->i_sb))
+		err = vfs_create(dir, path->dentry, mode, NULL);
+	else {
+		struct nameidata h_nd;
+
+		memset(&h_nd, 0, sizeof(h_nd));
+		h_nd.flags = LOOKUP_CREATE;
+		/* cf. fs/namei.c:open_to_namei_flags() */
+		h_nd.intent.open.flags = O_CREAT | O_RDONLY;
+		h_nd.intent.open.flags++;
+		h_nd.intent.open.create_mode = mode;
+		h_nd.path.dentry = path->dentry->d_parent;
+		h_nd.path.mnt = path->mnt;
+		path_get(&h_nd.path);
+		err = vfs_create(dir, path->dentry, mode, &h_nd);
+		path_put(&h_nd.path);
+	}
+
+	if (!err) {
+		struct path tmp = *path;
+		int did;
+
+		vfsub_update_h_iattr(&tmp, &did);
+		if (did) {
+			tmp.dentry = path->dentry->d_parent;
+			vfsub_update_h_iattr(&tmp, /*did*/NULL);
+		}
+		/*ignore*/
+	}
+
+ out:
+	return err;
+}
+
+int vfsub_symlink(struct inode *dir, struct path *path, const char *symname)
+{
+	int err;
+	struct dentry *d;
+
+	IMustLock(dir);
+
+	d = path->dentry;
+	path->dentry = d->d_parent;
+	err = security_path_symlink(path, path->dentry, symname);
+	path->dentry = d;
+	if (unlikely(err))
+		goto out;
+
+	err = vfs_symlink(dir, path->dentry, symname);
+	if (!err) {
+		struct path tmp = *path;
+		int did;
+
+		vfsub_update_h_iattr(&tmp, &did);
+		if (did) {
+			tmp.dentry = path->dentry->d_parent;
+			vfsub_update_h_iattr(&tmp, /*did*/NULL);
+		}
+		/*ignore*/
+	}
+
+ out:
+	return err;
+}
+
+int vfsub_mknod(struct inode *dir, struct path *path, int mode, dev_t dev)
+{
+	int err;
+	struct dentry *d;
+
+	IMustLock(dir);
+
+	d = path->dentry;
+	path->dentry = d->d_parent;
+	err = security_path_mknod(path, path->dentry, mode, dev);
+	path->dentry = d;
+	if (unlikely(err))
+		goto out;
+
+	err = vfs_mknod(dir, path->dentry, mode, dev);
+	if (!err) {
+		struct path tmp = *path;
+		int did;
+
+		vfsub_update_h_iattr(&tmp, &did);
+		if (did) {
+			tmp.dentry = path->dentry->d_parent;
+			vfsub_update_h_iattr(&tmp, /*did*/NULL);
+		}
+		/*ignore*/
+	}
+
+ out:
+	return err;
+}
+
+int vfsub_link(struct dentry *src_dentry, struct inode *dir, struct path *path)
+{
+	int err;
+	struct dentry *d;
+
+	IMustLock(dir);
+
+	d = path->dentry;
+	path->dentry = d->d_parent;
+	err = security_path_link(src_dentry, path, path->dentry);
+	path->dentry = d;
+	if (unlikely(err))
+		goto out;
+
+	lockdep_off();
+	err = vfs_link(src_dentry, dir, path->dentry);
+	lockdep_on();
+	if (!err) {
+		struct path tmp = *path;
+		int did;
+
+		/* fuse has different memory inode for the same inumber */
+		vfsub_update_h_iattr(&tmp, &did);
+		if (did) {
+			tmp.dentry = path->dentry->d_parent;
+			vfsub_update_h_iattr(&tmp, /*did*/NULL);
+			tmp.dentry = src_dentry;
+			vfsub_update_h_iattr(&tmp, /*did*/NULL);
+		}
+		/*ignore*/
+	}
+
+ out:
+	return err;
+}
+
+int vfsub_rename(struct inode *src_dir, struct dentry *src_dentry,
+		 struct inode *dir, struct path *path)
+{
+	int err;
+	struct path tmp = {
+		.mnt	= path->mnt
+	};
+	struct dentry *d;
+
+	IMustLock(dir);
+	IMustLock(src_dir);
+
+	d = path->dentry;
+	path->dentry = d->d_parent;
+	tmp.dentry = src_dentry->d_parent;
+	err = security_path_rename(&tmp, src_dentry, path, path->dentry);
+	path->dentry = d;
+	if (unlikely(err))
+		goto out;
+
+	lockdep_off();
+	err = vfs_rename(src_dir, src_dentry, dir, path->dentry);
+	lockdep_on();
+	if (!err) {
+		int did;
+
+		tmp.dentry = d->d_parent;
+		vfsub_update_h_iattr(&tmp, &did);
+		if (did) {
+			tmp.dentry = src_dentry;
+			vfsub_update_h_iattr(&tmp, /*did*/NULL);
+			tmp.dentry = src_dentry->d_parent;
+			vfsub_update_h_iattr(&tmp, /*did*/NULL);
+		}
+		/*ignore*/
+	}
+
+ out:
+	return err;
+}
+
+int vfsub_mkdir(struct inode *dir, struct path *path, int mode)
+{
+	int err;
+	struct dentry *d;
+
+	IMustLock(dir);
+
+	d = path->dentry;
+	path->dentry = d->d_parent;
+	err = security_path_mkdir(path, path->dentry, mode);
+	path->dentry = d;
+	if (unlikely(err))
+		goto out;
+
+	err = vfs_mkdir(dir, path->dentry, mode);
+	if (!err) {
+		struct path tmp = *path;
+		int did;
+
+		vfsub_update_h_iattr(&tmp, &did);
+		if (did) {
+			tmp.dentry = path->dentry->d_parent;
+			vfsub_update_h_iattr(&tmp, /*did*/NULL);
+		}
+		/*ignore*/
+	}
+
+ out:
+	return err;
+}
+
+int vfsub_rmdir(struct inode *dir, struct path *path)
+{
+	int err;
+	struct dentry *d;
+
+	IMustLock(dir);
+
+	d = path->dentry;
+	path->dentry = d->d_parent;
+	err = security_path_rmdir(path, path->dentry);
+	path->dentry = d;
+	if (unlikely(err))
+		goto out;
+
+	lockdep_off();
+	err = vfs_rmdir(dir, path->dentry);
+	lockdep_on();
+	if (!err) {
+		struct path tmp = {
+			.dentry	= path->dentry->d_parent,
+			.mnt	= path->mnt
+		};
+
+		vfsub_update_h_iattr(&tmp, /*did*/NULL); /*ignore*/
+	}
+
+ out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+ssize_t vfsub_read_u(struct file *file, char __user *ubuf, size_t count,
+		     loff_t *ppos)
+{
+	ssize_t err;
+
+	err = vfs_read(file, ubuf, count, ppos);
+	if (err >= 0)
+		vfsub_update_h_iattr(&file->f_path, /*did*/NULL); /*ignore*/
+	return err;
+}
+
+/* todo: kernel_read()? */
+ssize_t vfsub_read_k(struct file *file, void *kbuf, size_t count,
+		     loff_t *ppos)
+{
+	ssize_t err;
+	mm_segment_t oldfs;
+
+	oldfs = get_fs();
+	set_fs(KERNEL_DS);
+	err = vfsub_read_u(file, (char __user *)kbuf, count, ppos);
+	set_fs(oldfs);
+	return err;
+}
+
+ssize_t vfsub_write_u(struct file *file, const char __user *ubuf, size_t count,
+		      loff_t *ppos)
+{
+	ssize_t err;
+
+	lockdep_off();
+	err = vfs_write(file, ubuf, count, ppos);
+	lockdep_on();
+	if (err >= 0)
+		vfsub_update_h_iattr(&file->f_path, /*did*/NULL); /*ignore*/
+	return err;
+}
+
+ssize_t vfsub_write_k(struct file *file, void *kbuf, size_t count, loff_t *ppos)
+{
+	ssize_t err;
+	mm_segment_t oldfs;
+
+	oldfs = get_fs();
+	set_fs(KERNEL_DS);
+	err = vfsub_write_u(file, (const char __user *)kbuf, count, ppos);
+	set_fs(oldfs);
+	return err;
+}
+
+int vfsub_readdir(struct file *file, filldir_t filldir, void *arg)
+{
+	int err;
+
+	lockdep_off();
+	err = vfs_readdir(file, filldir, arg);
+	lockdep_on();
+	if (err >= 0)
+		vfsub_update_h_iattr(&file->f_path, /*did*/NULL); /*ignore*/
+	return err;
+}
+
+long vfsub_splice_to(struct file *in, loff_t *ppos,
+		     struct pipe_inode_info *pipe, size_t len,
+		     unsigned int flags)
+{
+	long err;
+
+	lockdep_off();
+	err = do_splice_to(in, ppos, pipe, len, flags);
+	lockdep_on();
+	if (err >= 0)
+		vfsub_update_h_iattr(&in->f_path, /*did*/NULL); /*ignore*/
+	return err;
+}
+
+long vfsub_splice_from(struct pipe_inode_info *pipe, struct file *out,
+		       loff_t *ppos, size_t len, unsigned int flags)
+{
+	long err;
+
+	lockdep_off();
+	err = do_splice_from(pipe, out, ppos, len, flags);
+	lockdep_on();
+	if (err >= 0)
+		vfsub_update_h_iattr(&out->f_path, /*did*/NULL); /*ignore*/
+	return err;
+}
+
+/* cf. open.c:do_sys_truncate() and do_sys_ftruncate() */
+int vfsub_trunc(struct path *h_path, loff_t length, unsigned int attr,
+		struct file *h_file)
+{
+	int err;
+	struct inode *h_inode;
+	union {
+		unsigned int u;
+		fmode_t m;
+	} u;
+
+	h_inode = h_path->dentry->d_inode;
+	if (!h_file) {
+		err = mnt_want_write(h_path->mnt);
+		if (err)
+			goto out;
+		err = inode_permission(h_inode, MAY_WRITE);
+		if (err)
+			goto out_mnt;
+		err = get_write_access(h_inode);
+		if (err)
+			goto out_mnt;
+		u.m = FMODE_WRITE;
+		err = break_lease(h_inode, u.u);
+		if (err)
+			goto out_inode;
+	}
+
+	err = locks_verify_truncate(h_inode, h_file, length);
+	if (!err)
+		err = security_path_truncate(h_path, length, attr);
+	if (!err) {
+		lockdep_off();
+		err = do_truncate(h_path->dentry, length, attr, h_file);
+		lockdep_on();
+	}
+
+ out_inode:
+	if (!h_file)
+		put_write_access(h_inode);
+ out_mnt:
+	if (!h_file)
+		mnt_drop_write(h_path->mnt);
+ out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+struct au_vfsub_mkdir_args {
+	int *errp;
+	struct inode *dir;
+	struct path *path;
+	int mode;
+};
+
+static void au_call_vfsub_mkdir(void *args)
+{
+	struct au_vfsub_mkdir_args *a = args;
+	*a->errp = vfsub_mkdir(a->dir, a->path, a->mode);
+}
+
+int vfsub_sio_mkdir(struct inode *dir, struct path *path, int mode)
+{
+	int err, do_sio, wkq_err;
+
+	do_sio = au_test_h_perm_sio(dir, MAY_EXEC | MAY_WRITE);
+	if (!do_sio)
+		err = vfsub_mkdir(dir, path, mode);
+	else {
+		struct au_vfsub_mkdir_args args = {
+			.errp	= &err,
+			.dir	= dir,
+			.path	= path,
+			.mode	= mode
+		};
+		wkq_err = au_wkq_wait(au_call_vfsub_mkdir, &args);
+		if (unlikely(wkq_err))
+			err = wkq_err;
+	}
+
+	return err;
+}
+
+struct au_vfsub_rmdir_args {
+	int *errp;
+	struct inode *dir;
+	struct path *path;
+};
+
+static void au_call_vfsub_rmdir(void *args)
+{
+	struct au_vfsub_rmdir_args *a = args;
+	*a->errp = vfsub_rmdir(a->dir, a->path);
+}
+
+int vfsub_sio_rmdir(struct inode *dir, struct path *path)
+{
+	int err, do_sio, wkq_err;
+
+	do_sio = au_test_h_perm_sio(dir, MAY_EXEC | MAY_WRITE);
+	if (!do_sio)
+		err = vfsub_rmdir(dir, path);
+	else {
+		struct au_vfsub_rmdir_args args = {
+			.errp	= &err,
+			.dir	= dir,
+			.path	= path
+		};
+		wkq_err = au_wkq_wait(au_call_vfsub_rmdir, &args);
+		if (unlikely(wkq_err))
+			err = wkq_err;
+	}
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+struct notify_change_args {
+	int *errp;
+	struct path *path;
+	struct iattr *ia;
+};
+
+static void call_notify_change(void *args)
+{
+	struct notify_change_args *a = args;
+	struct inode *h_inode;
+
+	h_inode = a->path->dentry->d_inode;
+	IMustLock(h_inode);
+
+	*a->errp = -EPERM;
+	if (!IS_IMMUTABLE(h_inode) && !IS_APPEND(h_inode)) {
+		lockdep_off();
+		*a->errp = notify_change(a->path->dentry, a->ia);
+		lockdep_on();
+		if (!*a->errp)
+			vfsub_update_h_iattr(a->path, /*did*/NULL); /*ignore*/
+	}
+	AuTraceErr(*a->errp);
+}
+
+int vfsub_notify_change(struct path *path, struct iattr *ia)
+{
+	int err;
+	struct notify_change_args args = {
+		.errp	= &err,
+		.path	= path,
+		.ia	= ia
+	};
+
+	call_notify_change(&args);
+
+	return err;
+}
+
+int vfsub_sio_notify_change(struct path *path, struct iattr *ia)
+{
+	int err, wkq_err;
+	struct notify_change_args args = {
+		.errp	= &err,
+		.path	= path,
+		.ia	= ia
+	};
+
+	wkq_err = au_wkq_wait(call_notify_change, &args);
+	if (unlikely(wkq_err))
+		err = wkq_err;
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+struct unlink_args {
+	int *errp;
+	struct inode *dir;
+	struct path *path;
+};
+
+static void call_unlink(void *args)
+{
+	struct unlink_args *a = args;
+	struct dentry *d = a->path->dentry;
+	struct inode *h_inode;
+	const int stop_sillyrename = (au_test_nfs(d->d_sb)
+				      && atomic_read(&d->d_count) == 1);
+
+	IMustLock(a->dir);
+
+	a->path->dentry = d->d_parent;
+	*a->errp = security_path_unlink(a->path, d);
+	a->path->dentry = d;
+	if (unlikely(*a->errp))
+		return;
+
+	if (!stop_sillyrename)
+		dget(d);
+	h_inode = d->d_inode;
+	if (h_inode)
+		atomic_inc(&h_inode->i_count);
+
+	lockdep_off();
+	*a->errp = vfs_unlink(a->dir, d);
+	lockdep_on();
+	if (!*a->errp) {
+		struct path tmp = {
+			.dentry = d->d_parent,
+			.mnt	= a->path->mnt
+		};
+		vfsub_update_h_iattr(&tmp, /*did*/NULL); /*ignore*/
+	}
+
+	if (!stop_sillyrename)
+		dput(d);
+	if (h_inode)
+		iput(h_inode);
+
+	AuTraceErr(*a->errp);
+}
+
+/*
+ * @dir: must be locked.
+ * @dentry: target dentry.
+ */
+int vfsub_unlink(struct inode *dir, struct path *path, int force)
+{
+	int err;
+	struct unlink_args args = {
+		.errp	= &err,
+		.dir	= dir,
+		.path	= path
+	};
+
+	if (!force)
+		call_unlink(&args);
+	else {
+		int wkq_err;
+
+		wkq_err = au_wkq_wait(call_unlink, &args);
+		if (unlikely(wkq_err))
+			err = wkq_err;
+	}
+
+	return err;
+}
diff --git a/fs/aufs/vfsub.h b/fs/aufs/vfsub.h
new file mode 100644
index 0000000..d0bf65e
--- /dev/null
+++ b/fs/aufs/vfsub.h
@@ -0,0 +1,137 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * sub-routines for VFS
+ */
+
+#ifndef __AUFS_VFSUB_H__
+#define __AUFS_VFSUB_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+#include <linux/fs_stack.h>
+#include <linux/namei.h>
+#include <linux/security.h>
+#include <linux/splice.h>
+
+/* ---------------------------------------------------------------------- */
+
+/* lock subclass for lower inode */
+/* default MAX_LOCKDEP_SUBCLASSES(8) is not enough */
+/* reduce? gave up. */
+enum {
+	AuLsc_I_Begin = I_MUTEX_QUOTA, /* 4 */
+	AuLsc_I_PARENT,		/* lower inode, parent first */
+	AuLsc_I_PARENT2,	/* copyup dirs */
+	AuLsc_I_CHILD,
+	AuLsc_I_CHILD2,
+	AuLsc_I_End
+};
+
+/* to debug easier, do not make them inlined functions */
+#define MtxMustLock(mtx)	AuDebugOn(!mutex_is_locked(mtx))
+#define IMustLock(i)		MtxMustLock(&(i)->i_mutex)
+
+/* ---------------------------------------------------------------------- */
+
+static inline void vfsub_copy_inode_size(struct inode *inode,
+					 struct inode *h_inode)
+{
+	spin_lock(&inode->i_lock);
+	fsstack_copy_inode_size(inode, h_inode);
+	spin_unlock(&inode->i_lock);
+}
+
+int vfsub_update_h_iattr(struct path *h_path, int *did);
+struct file *vfsub_filp_open(const char *path, int oflags, int mode);
+int vfsub_kern_path(const char *name, unsigned int flags, struct path *path);
+struct dentry *vfsub_lookup_one_len(const char *name, struct dentry *parent,
+				    int len);
+struct dentry *vfsub_lookup_hash(struct nameidata *nd);
+
+/* ---------------------------------------------------------------------- */
+
+struct au_hinode;
+struct dentry *vfsub_lock_rename(struct dentry *d1, struct au_hinode *hdir1,
+				 struct dentry *d2, struct au_hinode *hdir2);
+void vfsub_unlock_rename(struct dentry *d1, struct au_hinode *hdir1,
+			 struct dentry *d2, struct au_hinode *hdir2);
+
+int vfsub_create(struct inode *dir, struct path *path, int mode);
+int vfsub_symlink(struct inode *dir, struct path *path,
+		  const char *symname);
+int vfsub_mknod(struct inode *dir, struct path *path, int mode, dev_t dev);
+int vfsub_link(struct dentry *src_dentry, struct inode *dir,
+	       struct path *path);
+int vfsub_rename(struct inode *src_hdir, struct dentry *src_dentry,
+		 struct inode *hdir, struct path *path);
+int vfsub_mkdir(struct inode *dir, struct path *path, int mode);
+int vfsub_rmdir(struct inode *dir, struct path *path);
+
+/* ---------------------------------------------------------------------- */
+
+ssize_t vfsub_read_u(struct file *file, char __user *ubuf, size_t count,
+		     loff_t *ppos);
+ssize_t vfsub_read_k(struct file *file, void *kbuf, size_t count,
+			loff_t *ppos);
+ssize_t vfsub_write_u(struct file *file, const char __user *ubuf, size_t count,
+		      loff_t *ppos);
+ssize_t vfsub_write_k(struct file *file, void *kbuf, size_t count,
+		      loff_t *ppos);
+int vfsub_readdir(struct file *file, filldir_t filldir, void *arg);
+
+static inline void vfsub_file_accessed(struct file *h_file)
+{
+	file_accessed(h_file);
+	vfsub_update_h_iattr(&h_file->f_path, /*did*/NULL); /*ignore*/
+}
+
+static inline void vfsub_touch_atime(struct vfsmount *h_mnt,
+				     struct dentry *h_dentry)
+{
+	struct path h_path = {
+		.dentry	= h_dentry,
+		.mnt	= h_mnt
+	};
+	touch_atime(h_mnt, h_dentry);
+	vfsub_update_h_iattr(&h_path, /*did*/NULL); /*ignore*/
+}
+
+long vfsub_splice_to(struct file *in, loff_t *ppos,
+		     struct pipe_inode_info *pipe, size_t len,
+		     unsigned int flags);
+long vfsub_splice_from(struct pipe_inode_info *pipe, struct file *out,
+		       loff_t *ppos, size_t len, unsigned int flags);
+int vfsub_trunc(struct path *h_path, loff_t length, unsigned int attr,
+		struct file *h_file);
+
+/* ---------------------------------------------------------------------- */
+
+static inline loff_t vfsub_llseek(struct file *file, loff_t offset, int origin)
+{
+	loff_t err;
+
+	lockdep_off();
+	err = vfs_llseek(file, offset, origin);
+	lockdep_on();
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+int vfsub_sio_mkdir(struct inode *dir, struct path *path, int mode);
+int vfsub_sio_rmdir(struct inode *dir, struct path *path);
+int vfsub_sio_notify_change(struct path *path, struct iattr *ia);
+int vfsub_notify_change(struct path *path, struct iattr *ia);
+int vfsub_unlink(struct inode *dir, struct path *path, int force);
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_VFSUB_H__ */
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 11/28] aufs sub-dcache
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (9 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 10/28] aufs sub-VFS J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 12/28] aufs copy-up J. R. Okajima
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
sub-routines or simple wrappers for dentry cache

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/dcsub.c |  214 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/dcsub.h |   43 +++++++++++
 2 files changed, 257 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/dcsub.c
 create mode 100644 fs/aufs/dcsub.h

diff --git a/fs/aufs/dcsub.c b/fs/aufs/dcsub.c
new file mode 100644
index 0000000..e4597e4
--- /dev/null
+++ b/fs/aufs/dcsub.c
@@ -0,0 +1,214 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * sub-routines for dentry cache
+ */
+
+#include "aufs.h"
+
+static void au_dpage_free(struct au_dpage *dpage)
+{
+	int i;
+	struct dentry **p;
+
+	p = dpage->dentries;
+	for (i = 0; i < dpage->ndentry; i++)
+		dput(*p++);
+	free_page((unsigned long)dpage->dentries);
+}
+
+int au_dpages_init(struct au_dcsub_pages *dpages, gfp_t gfp)
+{
+	int err;
+	void *p;
+
+	err = -ENOMEM;
+	dpages->dpages = kmalloc(sizeof(*dpages->dpages), gfp);
+	if (unlikely(!dpages->dpages))
+		goto out;
+
+	p = (void *)__get_free_page(gfp);
+	if (unlikely(!p))
+		goto out_dpages;
+
+	dpages->dpages[0].ndentry = 0;
+	dpages->dpages[0].dentries = p;
+	dpages->ndpage = 1;
+	return 0; /* success */
+
+ out_dpages:
+	kfree(dpages->dpages);
+ out:
+	return err;
+}
+
+void au_dpages_free(struct au_dcsub_pages *dpages)
+{
+	int i;
+	struct au_dpage *p;
+
+	p = dpages->dpages;
+	for (i = 0; i < dpages->ndpage; i++)
+		au_dpage_free(p++);
+	kfree(dpages->dpages);
+}
+
+static int au_dpages_append(struct au_dcsub_pages *dpages,
+			    struct dentry *dentry, gfp_t gfp)
+{
+	int err, sz;
+	struct au_dpage *dpage;
+	void *p;
+
+	dpage = dpages->dpages + dpages->ndpage - 1;
+	sz = PAGE_SIZE / sizeof(dentry);
+	if (unlikely(dpage->ndentry >= sz)) {
+		AuLabel(new dpage);
+		err = -ENOMEM;
+		sz = dpages->ndpage * sizeof(*dpages->dpages);
+		p = au_kzrealloc(dpages->dpages, sz,
+				 sz + sizeof(*dpages->dpages), gfp);
+		if (unlikely(!p))
+			goto out;
+
+		dpages->dpages = p;
+		dpage = dpages->dpages + dpages->ndpage;
+		p = (void *)__get_free_page(gfp);
+		if (unlikely(!p))
+			goto out;
+
+		dpage->ndentry = 0;
+		dpage->dentries = p;
+		dpages->ndpage++;
+	}
+
+	dpage->dentries[dpage->ndentry++] = dget(dentry);
+	return 0; /* success */
+
+ out:
+	return err;
+}
+
+int au_dcsub_pages(struct au_dcsub_pages *dpages, struct dentry *root,
+		   au_dpages_test test, void *arg)
+{
+	int err;
+	struct dentry *this_parent = root;
+	struct list_head *next;
+	struct super_block *sb = root->d_sb;
+
+	err = 0;
+	spin_lock(&dcache_lock);
+ repeat:
+	next = this_parent->d_subdirs.next;
+ resume:
+	if (this_parent->d_sb == sb
+	    && !IS_ROOT(this_parent)
+	    && atomic_read(&this_parent->d_count)
+	    && this_parent->d_inode
+	    && (!test || test(this_parent, arg))) {
+		err = au_dpages_append(dpages, this_parent, GFP_ATOMIC);
+		if (unlikely(err))
+			goto out;
+	}
+
+	while (next != &this_parent->d_subdirs) {
+		struct list_head *tmp = next;
+		struct dentry *dentry = list_entry(tmp, struct dentry,
+						   d_u.d_child);
+		next = tmp->next;
+		if (/*d_unhashed(dentry) || */!dentry->d_inode)
+			continue;
+		if (!list_empty(&dentry->d_subdirs)) {
+			this_parent = dentry;
+			goto repeat;
+		}
+		if (dentry->d_sb == sb
+		    && atomic_read(&dentry->d_count)
+		    && (!test || test(dentry, arg))) {
+			err = au_dpages_append(dpages, dentry, GFP_ATOMIC);
+			if (unlikely(err))
+				goto out;
+		}
+	}
+
+	if (this_parent != root) {
+		next = this_parent->d_u.d_child.next;
+		this_parent = this_parent->d_parent; /* dcache_lock is locked */
+		goto resume;
+	}
+ out:
+	spin_unlock(&dcache_lock);
+	return err;
+}
+
+int au_dcsub_pages_rev(struct au_dcsub_pages *dpages, struct dentry *dentry,
+		       int do_include, au_dpages_test test, void *arg)
+{
+	int err;
+
+	err = 0;
+	spin_lock(&dcache_lock);
+	if (do_include && (!test || test(dentry, arg))) {
+		err = au_dpages_append(dpages, dentry, GFP_ATOMIC);
+		if (unlikely(err))
+			goto out;
+	}
+	while (!IS_ROOT(dentry)) {
+		dentry = dentry->d_parent; /* dcache_lock is locked */
+		if (!test || test(dentry, arg)) {
+			err = au_dpages_append(dpages, dentry, GFP_ATOMIC);
+			if (unlikely(err))
+				break;
+		}
+	}
+
+ out:
+	spin_unlock(&dcache_lock);
+
+	return err;
+}
+
+struct dentry *au_test_subdir(struct dentry *d1, struct dentry *d2)
+{
+	struct dentry *trap, **dentries;
+	int err, i, j;
+	struct au_dcsub_pages dpages;
+	struct au_dpage *dpage;
+
+	trap = ERR_PTR(-ENOMEM);
+	err = au_dpages_init(&dpages, GFP_NOFS);
+	if (unlikely(err))
+		goto out;
+	err = au_dcsub_pages_rev(&dpages, d1, /*do_include*/1, NULL, NULL);
+	if (unlikely(err))
+		goto out_dpages;
+
+	trap = d1;
+	for (i = 0; !err && i < dpages.ndpage; i++) {
+		dpage = dpages.dpages + i;
+		dentries = dpage->dentries;
+		for (j = 0; !err && j < dpage->ndentry; j++) {
+			struct dentry *d;
+
+			d = dentries[j];
+			err = (d == d2);
+			if (!err)
+				trap = d;
+		}
+	}
+	if (!err)
+		trap = NULL;
+
+ out_dpages:
+	au_dpages_free(&dpages);
+ out:
+	return trap;
+}
diff --git a/fs/aufs/dcsub.h b/fs/aufs/dcsub.h
new file mode 100644
index 0000000..89a7745
--- /dev/null
+++ b/fs/aufs/dcsub.h
@@ -0,0 +1,43 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * sub-routines for dentry cache
+ */
+
+#ifndef __AUFS_DCSUB_H__
+#define __AUFS_DCSUB_H__
+
+#ifdef __KERNEL__
+
+#include <linux/dcache.h>
+
+struct au_dpage {
+	int ndentry;
+	struct dentry **dentries;
+};
+
+struct au_dcsub_pages {
+	int ndpage;
+	struct au_dpage *dpages;
+};
+
+/* ---------------------------------------------------------------------- */
+
+int au_dpages_init(struct au_dcsub_pages *dpages, gfp_t gfp);
+void au_dpages_free(struct au_dcsub_pages *dpages);
+typedef int (*au_dpages_test)(struct dentry *dentry, void *arg);
+int au_dcsub_pages(struct au_dcsub_pages *dpages, struct dentry *root,
+		   au_dpages_test test, void *arg);
+int au_dcsub_pages_rev(struct au_dcsub_pages *dpages, struct dentry *dentry,
+		       int do_include, au_dpages_test test, void *arg);
+struct dentry *au_test_subdir(struct dentry *d1, struct dentry *d2);
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_DCSUB_H__ */
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 12/28] aufs copy-up
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (10 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 11/28] aufs sub-dcache J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 13/28] aufs whiteout J. R. Okajima
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
internal copy-up functions
see wbr_policy.c for copy-down

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/cpup.c | 1028 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/cpup.h |   68 ++++
 2 files changed, 1096 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/cpup.c
 create mode 100644 fs/aufs/cpup.h

diff --git a/fs/aufs/cpup.c b/fs/aufs/cpup.c
new file mode 100644
index 0000000..91808b1
--- /dev/null
+++ b/fs/aufs/cpup.c
@@ -0,0 +1,1028 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * copy-up functions, see wbr_policy.c for copy-down
+ */
+
+#include <linux/fs_stack.h>
+#include <linux/uaccess.h>
+#include "aufs.h"
+
+void au_cpup_attr_flags(struct inode *dst, struct inode *src)
+{
+	const unsigned int mask = S_DEAD | S_SWAPFILE | S_PRIVATE
+		| S_NOATIME | S_NOCMTIME;
+
+	dst->i_flags |= src->i_flags & ~mask;
+	if (au_test_fs_notime(dst->i_sb))
+		dst->i_flags |= S_NOATIME | S_NOCMTIME;
+}
+
+void au_cpup_attr_timesizes(struct inode *inode)
+{
+	struct inode *h_inode;
+
+	h_inode = au_h_iptr(inode, au_ibstart(inode));
+	fsstack_copy_attr_times(inode, h_inode);
+	vfsub_copy_inode_size(inode, h_inode);
+}
+
+void au_cpup_attr_nlink(struct inode *inode, int force)
+{
+	struct inode *h_inode;
+	struct super_block *sb;
+	aufs_bindex_t bindex, bend;
+
+	sb = inode->i_sb;
+	bindex = au_ibstart(inode);
+	h_inode = au_h_iptr(inode, bindex);
+	if (!force
+	    && !S_ISDIR(h_inode->i_mode)
+	    && au_opt_test(au_mntflags(sb), PLINK)
+	    && au_plink_test(inode))
+		return;
+
+	inode->i_nlink = h_inode->i_nlink;
+
+	/*
+	 * fewer nlink makes find(1) noisy, but larger nlink doesn't.
+	 * it may includes whplink directory.
+	 */
+	if (S_ISDIR(h_inode->i_mode)) {
+		bend = au_ibend(inode);
+		for (bindex++; bindex <= bend; bindex++) {
+			h_inode = au_h_iptr(inode, bindex);
+			if (h_inode)
+				au_add_nlink(inode, h_inode);
+		}
+	}
+}
+
+void au_cpup_attr_changeable(struct inode *inode)
+{
+	struct inode *h_inode;
+
+	h_inode = au_h_iptr(inode, au_ibstart(inode));
+	inode->i_mode = h_inode->i_mode;
+	inode->i_uid = h_inode->i_uid;
+	inode->i_gid = h_inode->i_gid;
+	au_cpup_attr_timesizes(inode);
+	au_cpup_attr_flags(inode, h_inode);
+}
+
+void au_cpup_igen(struct inode *inode, struct inode *h_inode)
+{
+	struct au_iinfo *iinfo = au_ii(inode);
+
+	iinfo->ii_higen = h_inode->i_generation;
+	iinfo->ii_hsb1 = h_inode->i_sb;
+}
+
+void au_cpup_attr_all(struct inode *inode, int force)
+{
+	struct inode *h_inode;
+
+	h_inode = au_h_iptr(inode, au_ibstart(inode));
+	au_cpup_attr_changeable(inode);
+	if (inode->i_nlink > 0)
+		au_cpup_attr_nlink(inode, force);
+	inode->i_rdev = h_inode->i_rdev;
+	inode->i_blkbits = h_inode->i_blkbits;
+	au_cpup_igen(inode, h_inode);
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* Note: dt_dentry and dt_h_dentry are not dget/dput-ed */
+
+/* keep the timestamps of the parent dir when cpup */
+void au_dtime_store(struct au_dtime *dt, struct dentry *dentry,
+		    struct path *h_path)
+{
+	struct inode *h_inode;
+
+	dt->dt_dentry = dentry;
+	dt->dt_h_path = *h_path;
+	h_inode = h_path->dentry->d_inode;
+	dt->dt_atime = h_inode->i_atime;
+	dt->dt_mtime = h_inode->i_mtime;
+	/* smp_mb(); */
+}
+
+void au_dtime_revert(struct au_dtime *dt)
+{
+	struct iattr attr;
+	int err;
+
+	attr.ia_atime = dt->dt_atime;
+	attr.ia_mtime = dt->dt_mtime;
+	attr.ia_valid = ATTR_FORCE | ATTR_MTIME | ATTR_MTIME_SET
+		| ATTR_ATIME | ATTR_ATIME_SET;
+
+	err = vfsub_notify_change(&dt->dt_h_path, &attr);
+	if (unlikely(err))
+		AuWarn("restoring timestamps failed(%d). ignored\n", err);
+}
+
+/* ---------------------------------------------------------------------- */
+
+static noinline_for_stack
+int cpup_iattr(struct dentry *dst, aufs_bindex_t bindex, struct dentry *h_src)
+{
+	int err, sbits;
+	struct iattr ia;
+	struct path h_path;
+	struct inode *h_isrc;
+
+	h_path.dentry = au_h_dptr(dst, bindex);
+	h_path.mnt = au_sbr_mnt(dst->d_sb, bindex);
+	h_isrc = h_src->d_inode;
+	ia.ia_valid = ATTR_FORCE | ATTR_MODE | ATTR_UID | ATTR_GID
+		| ATTR_ATIME | ATTR_MTIME
+		| ATTR_ATIME_SET | ATTR_MTIME_SET;
+	ia.ia_mode = h_isrc->i_mode;
+	ia.ia_uid = h_isrc->i_uid;
+	ia.ia_gid = h_isrc->i_gid;
+	ia.ia_atime = h_isrc->i_atime;
+	ia.ia_mtime = h_isrc->i_mtime;
+	sbits = !!(ia.ia_mode & (S_ISUID | S_ISGID));
+	au_cpup_attr_flags(h_path.dentry->d_inode, h_isrc);
+	err = vfsub_notify_change(&h_path, &ia);
+
+	/* is this nfs only? */
+	if (!err && sbits && au_test_nfs(h_path.dentry->d_sb)) {
+		ia.ia_valid = ATTR_FORCE | ATTR_MODE;
+		ia.ia_mode = h_isrc->i_mode;
+		err = vfsub_notify_change(&h_path, &ia);
+	}
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int au_do_copy_file(struct file *dst, struct file *src, loff_t len,
+			   char *buf, unsigned long blksize)
+{
+	int err;
+	size_t sz, rbytes, wbytes;
+	unsigned char all_zero;
+	char *p, *zp;
+	struct mutex *h_mtx;
+	/* reduce stack usage */
+	struct iattr *ia;
+
+	zp = page_address(ZERO_PAGE(0));
+	if (unlikely(!zp))
+		return -ENOMEM; /* possible? */
+
+	err = 0;
+	all_zero = 0;
+	while (len) {
+		AuDbg("len %lld\n", len);
+		sz = blksize;
+		if (len < blksize)
+			sz = len;
+
+		rbytes = 0;
+		/* todo: signal_pending? */
+		while (!rbytes || err == -EAGAIN || err == -EINTR) {
+			rbytes = vfsub_read_k(src, buf, sz, &src->f_pos);
+			err = rbytes;
+		}
+		if (unlikely(err < 0))
+			break;
+
+		all_zero = 0;
+		if (len >= rbytes && rbytes == blksize)
+			all_zero = !memcmp(buf, zp, rbytes);
+		if (!all_zero) {
+			wbytes = rbytes;
+			p = buf;
+			while (wbytes) {
+				size_t b;
+
+				b = vfsub_write_k(dst, p, wbytes, &dst->f_pos);
+				err = b;
+				/* todo: signal_pending? */
+				if (unlikely(err == -EAGAIN || err == -EINTR))
+					continue;
+				if (unlikely(err < 0))
+					break;
+				wbytes -= b;
+				p += b;
+			}
+		} else {
+			loff_t res;
+
+			AuLabel(hole);
+			res = vfsub_llseek(dst, rbytes, SEEK_CUR);
+			err = res;
+			if (unlikely(res < 0))
+				break;
+		}
+		len -= rbytes;
+		err = 0;
+	}
+
+	/* the last block may be a hole */
+	if (!err && all_zero) {
+		AuLabel(last hole);
+
+		err = 1;
+		if (au_test_nfs(dst->f_dentry->d_sb)) {
+			/* nfs requires this step to make last hole */
+			/* is this only nfs? */
+			do {
+				/* todo: signal_pending? */
+				err = vfsub_write_k(dst, "\0", 1, &dst->f_pos);
+			} while (err == -EAGAIN || err == -EINTR);
+			if (err == 1)
+				dst->f_pos--;
+		}
+
+		if (err == 1) {
+			ia = (void *)buf;
+			ia->ia_size = dst->f_pos;
+			ia->ia_valid = ATTR_SIZE | ATTR_FILE;
+			ia->ia_file = dst;
+			h_mtx = &dst->f_dentry->d_inode->i_mutex;
+			mutex_lock_nested(h_mtx, AuLsc_I_CHILD2);
+			err = vfsub_notify_change(&dst->f_path, ia);
+			mutex_unlock(h_mtx);
+		}
+	}
+
+	return err;
+}
+
+int au_copy_file(struct file *dst, struct file *src, loff_t len)
+{
+	int err;
+	unsigned long blksize;
+	unsigned char do_kfree;
+	char *buf;
+
+	err = -ENOMEM;
+	blksize = dst->f_dentry->d_sb->s_blocksize;
+	if (!blksize || PAGE_SIZE < blksize)
+		blksize = PAGE_SIZE;
+	AuDbg("blksize %lu\n", blksize);
+	do_kfree = (blksize != PAGE_SIZE && blksize >= sizeof(struct iattr *));
+	if (do_kfree)
+		buf = kmalloc(blksize, GFP_NOFS);
+	else
+		buf = (void *)__get_free_page(GFP_NOFS);
+	if (unlikely(!buf))
+		goto out;
+
+	if (len > (1 << 22))
+		AuDbg("copying a large file %lld\n", (long long)len);
+
+	src->f_pos = 0;
+	dst->f_pos = 0;
+	err = au_do_copy_file(dst, src, len, buf, blksize);
+	if (do_kfree)
+		kfree(buf);
+	else
+		free_page((unsigned long)buf);
+
+ out:
+	return err;
+}
+
+/*
+ * to support a sparse file which is opened with O_APPEND,
+ * we need to close the file.
+ */
+static int au_cp_regular(struct dentry *dentry, aufs_bindex_t bdst,
+			aufs_bindex_t bsrc, loff_t len)
+{
+	int err, i;
+	enum { SRC, DST };
+	struct {
+		aufs_bindex_t bindex;
+		unsigned int flags;
+		struct dentry *dentry;
+		struct file *file;
+		void *label, *label_file;
+	} *f, file[] = {
+		{
+			.bindex = bsrc,
+			.flags = O_RDONLY | O_NOATIME | O_LARGEFILE,
+			.file = NULL,
+			.label = &&out,
+			.label_file = &&out_src
+		},
+		{
+			.bindex = bdst,
+			.flags = O_WRONLY | O_NOATIME | O_LARGEFILE,
+			.file = NULL,
+			.label = &&out_src,
+			.label_file = &&out_dst
+		}
+	};
+	struct super_block *sb;
+
+	/* bsrc branch can be ro/rw. */
+	sb = dentry->d_sb;
+	f = file;
+	for (i = 0; i < 2; i++, f++) {
+		f->dentry = au_h_dptr(dentry, f->bindex);
+		f->file = au_h_open(dentry, f->bindex, f->flags, /*file*/NULL);
+		err = PTR_ERR(f->file);
+		if (IS_ERR(f->file))
+			goto *f->label;
+		err = -EINVAL;
+		if (unlikely(!f->file->f_op))
+			goto *f->label_file;
+	}
+
+	/* try stopping to update while we copyup */
+	IMustLock(file[SRC].dentry->d_inode);
+	err = au_copy_file(file[DST].file, file[SRC].file, len);
+
+ out_dst:
+	fput(file[DST].file);
+	au_sbr_put(sb, file[DST].bindex);
+ out_src:
+	fput(file[SRC].file);
+	au_sbr_put(sb, file[SRC].bindex);
+ out:
+	return err;
+}
+
+static int au_do_cpup_regular(struct dentry *dentry, aufs_bindex_t bdst,
+			      aufs_bindex_t bsrc, loff_t len,
+			      struct inode *h_dir, struct path *h_path)
+{
+	int err, rerr;
+	loff_t l;
+
+	err = 0;
+	l = i_size_read(au_h_iptr(dentry->d_inode, bsrc));
+	if (len == -1 || l < len)
+		len = l;
+	if (len)
+		err = au_cp_regular(dentry, bdst, bsrc, len);
+	if (!err)
+		goto out; /* success */
+
+	rerr = vfsub_unlink(h_dir, h_path, /*force*/0);
+	if (rerr) {
+		AuIOErr("failed unlinking cpup-ed %.*s(%d, %d)\n",
+			AuDLNPair(h_path->dentry), err, rerr);
+		err = -EIO;
+	}
+
+ out:
+	return err;
+}
+
+static int au_do_cpup_symlink(struct path *h_path, struct dentry *h_src,
+			      struct inode *h_dir)
+{
+	int err, symlen;
+	mm_segment_t old_fs;
+	char *sym;
+
+	err = -ENOSYS;
+	if (unlikely(!h_src->d_inode->i_op->readlink))
+		goto out;
+
+	err = -ENOMEM;
+	sym = __getname();
+	if (unlikely(!sym))
+		goto out;
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+	symlen = h_src->d_inode->i_op->readlink(h_src, (char __user *)sym,
+						PATH_MAX);
+	err = symlen;
+	set_fs(old_fs);
+
+	if (symlen > 0) {
+		sym[symlen] = 0;
+		err = vfsub_symlink(h_dir, h_path, sym);
+	}
+	__putname(sym);
+
+ out:
+	return err;
+}
+
+/* return with the lower dst inode is locked */
+static noinline_for_stack
+int cpup_entry(struct dentry *dentry, aufs_bindex_t bdst,
+	       aufs_bindex_t bsrc, loff_t len, unsigned int flags,
+	       struct dentry *dst_parent)
+{
+	int err;
+	umode_t mode;
+	unsigned int mnt_flags;
+	unsigned char isdir;
+	const unsigned char do_dt = !!au_ftest_cpup(flags, DTIME);
+	struct au_dtime dt;
+	struct path h_path;
+	struct dentry *h_src, *h_dst, *h_parent;
+	struct inode *h_inode, *h_dir;
+	struct super_block *sb;
+
+	/* bsrc branch can be ro/rw. */
+	h_src = au_h_dptr(dentry, bsrc);
+	h_inode = h_src->d_inode;
+	AuDebugOn(h_inode != au_h_iptr(dentry->d_inode, bsrc));
+
+	/* try stopping to be referenced while we are creating */
+	h_dst = au_h_dptr(dentry, bdst);
+	h_parent = h_dst->d_parent; /* dir inode is locked */
+	h_dir = h_parent->d_inode;
+	IMustLock(h_dir);
+	AuDebugOn(h_parent != h_dst->d_parent);
+
+	sb = dentry->d_sb;
+	h_path.mnt = au_sbr_mnt(sb, bdst);
+	if (do_dt) {
+		h_path.dentry = h_parent;
+		au_dtime_store(&dt, dst_parent, &h_path);
+	}
+	h_path.dentry = h_dst;
+
+	isdir = 0;
+	mode = h_inode->i_mode;
+	switch (mode & S_IFMT) {
+	case S_IFREG:
+		/* try stopping to update while we are referencing */
+		IMustLock(h_inode);
+		err = vfsub_create(h_dir, &h_path, mode | S_IWUSR);
+		if (!err)
+			err = au_do_cpup_regular
+				(dentry, bdst, bsrc, len,
+				 au_h_iptr(dst_parent->d_inode, bdst), &h_path);
+		break;
+	case S_IFDIR:
+		isdir = 1;
+		err = vfsub_mkdir(h_dir, &h_path, mode);
+		if (!err) {
+			/*
+			 * strange behaviour from the users view,
+			 * particularry setattr case
+			 */
+			if (au_ibstart(dst_parent->d_inode) == bdst)
+				au_cpup_attr_nlink(dst_parent->d_inode,
+						   /*force*/1);
+			au_cpup_attr_nlink(dentry->d_inode, /*force*/1);
+		}
+		break;
+	case S_IFLNK:
+		err = au_do_cpup_symlink(&h_path, h_src, h_dir);
+		break;
+	case S_IFCHR:
+	case S_IFBLK:
+		AuDebugOn(!capable(CAP_MKNOD));
+		/*FALLTHROUGH*/
+	case S_IFIFO:
+	case S_IFSOCK:
+		err = vfsub_mknod(h_dir, &h_path, mode, h_inode->i_rdev);
+		break;
+	default:
+		AuIOErr("Unknown inode type 0%o\n", mode);
+		err = -EIO;
+	}
+
+	mnt_flags = au_mntflags(sb);
+	if (!au_opt_test(mnt_flags, UDBA_NONE)
+	    && !isdir
+	    && au_opt_test(mnt_flags, XINO)
+	    && h_inode->i_nlink == 1
+	    /* todo: unnecessary? */
+	    /* && dentry->d_inode->i_nlink == 1 */
+	    && bdst < bsrc
+	    && !au_ftest_cpup(flags, KEEPLINO))
+		au_xino_write0(sb, bsrc, h_inode->i_ino, /*ino*/0);
+		/* ignore this error */
+
+	if (do_dt)
+		au_dtime_revert(&dt);
+	return err;
+}
+
+/*
+ * copyup the @dentry from @bsrc to @bdst.
+ * the caller must set the both of lower dentries.
+ * @len is for truncating when it is -1 copyup the entire file.
+ * in link/rename cases, @dst_parent may be different from the real one.
+ */
+static int au_cpup_single(struct dentry *dentry, aufs_bindex_t bdst,
+			  aufs_bindex_t bsrc, loff_t len, unsigned int flags,
+			  struct dentry *dst_parent)
+{
+	int err, rerr;
+	aufs_bindex_t old_ibstart;
+	unsigned char isdir, plink;
+	struct au_dtime dt;
+	struct path h_path;
+	struct dentry *h_src, *h_dst, *h_parent;
+	struct inode *dst_inode, *h_dir, *inode;
+	struct super_block *sb;
+
+	AuDebugOn(bsrc <= bdst);
+
+	sb = dentry->d_sb;
+	h_path.mnt = au_sbr_mnt(sb, bdst);
+	h_dst = au_h_dptr(dentry, bdst);
+	h_parent = h_dst->d_parent; /* dir inode is locked */
+	h_dir = h_parent->d_inode;
+	IMustLock(h_dir);
+
+	h_src = au_h_dptr(dentry, bsrc);
+	inode = dentry->d_inode;
+
+	if (!dst_parent)
+		dst_parent = dget_parent(dentry);
+	else
+		dget(dst_parent);
+
+	plink = !!au_opt_test(au_mntflags(sb), PLINK);
+	dst_inode = au_h_iptr(inode, bdst);
+	if (dst_inode) {
+		if (unlikely(!plink)) {
+			err = -EIO;
+			AuIOErr("i%lu exists on a upper branch "
+				"but plink is disabled\n", inode->i_ino);
+			goto out;
+		}
+
+		if (dst_inode->i_nlink) {
+			const int do_dt = au_ftest_cpup(flags, DTIME);
+
+			h_src = au_plink_lkup(inode, bdst);
+			err = PTR_ERR(h_src);
+			if (IS_ERR(h_src))
+				goto out;
+			if (unlikely(!h_src->d_inode)) {
+				err = -EIO;
+				AuIOErr("i%lu exists on a upper branch "
+					"but plink is broken\n", inode->i_ino);
+				dput(h_src);
+				goto out;
+			}
+
+			if (do_dt) {
+				h_path.dentry = h_parent;
+				au_dtime_store(&dt, dst_parent, &h_path);
+			}
+			h_path.dentry = h_dst;
+			err = vfsub_link(h_src, h_dir, &h_path);
+			if (do_dt)
+				au_dtime_revert(&dt);
+			dput(h_src);
+			goto out;
+		} else
+			/* todo: cpup_wh_file? */
+			/* udba work */
+			au_update_brange(inode, 1);
+	}
+
+	old_ibstart = au_ibstart(inode);
+	err = cpup_entry(dentry, bdst, bsrc, len, flags, dst_parent);
+	if (unlikely(err))
+		goto out;
+	dst_inode = h_dst->d_inode;
+	mutex_lock_nested(&dst_inode->i_mutex, AuLsc_I_CHILD2);
+
+	err = cpup_iattr(dentry, bdst, h_src);
+	isdir = S_ISDIR(dst_inode->i_mode);
+	if (!err) {
+		if (bdst < old_ibstart)
+			au_set_ibstart(inode, bdst);
+		au_set_h_iptr(inode, bdst, au_igrab(dst_inode),
+			      au_hi_flags(inode, isdir));
+		mutex_unlock(&dst_inode->i_mutex);
+		if (!isdir
+		    && h_src->d_inode->i_nlink > 1
+		    && plink)
+			au_plink_append(inode, bdst, h_dst);
+		goto out; /* success */
+	}
+
+	/* revert */
+	h_path.dentry = h_parent;
+	mutex_unlock(&dst_inode->i_mutex);
+	au_dtime_store(&dt, dst_parent, &h_path);
+	h_path.dentry = h_dst;
+	if (!isdir)
+		rerr = vfsub_unlink(h_dir, &h_path, /*force*/0);
+	else
+		rerr = vfsub_rmdir(h_dir, &h_path);
+	au_dtime_revert(&dt);
+	if (rerr) {
+		AuIOErr("failed removing broken entry(%d, %d)\n", err, rerr);
+		err = -EIO;
+	}
+
+ out:
+	dput(dst_parent);
+	return err;
+}
+
+struct au_cpup_single_args {
+	int *errp;
+	struct dentry *dentry;
+	aufs_bindex_t bdst, bsrc;
+	loff_t len;
+	unsigned int flags;
+	struct dentry *dst_parent;
+};
+
+static void au_call_cpup_single(void *args)
+{
+	struct au_cpup_single_args *a = args;
+	*a->errp = au_cpup_single(a->dentry, a->bdst, a->bsrc, a->len,
+				  a->flags, a->dst_parent);
+}
+
+int au_sio_cpup_single(struct dentry *dentry, aufs_bindex_t bdst,
+		       aufs_bindex_t bsrc, loff_t len, unsigned int flags,
+		       struct dentry *dst_parent)
+{
+	int err, wkq_err;
+	umode_t mode;
+	struct dentry *h_dentry;
+
+	h_dentry = au_h_dptr(dentry, bsrc);
+	mode = h_dentry->d_inode->i_mode & S_IFMT;
+	if ((mode != S_IFCHR && mode != S_IFBLK)
+	    || capable(CAP_MKNOD))
+		err = au_cpup_single(dentry, bdst, bsrc, len, flags,
+				     dst_parent);
+	else {
+		struct au_cpup_single_args args = {
+			.errp		= &err,
+			.dentry		= dentry,
+			.bdst		= bdst,
+			.bsrc		= bsrc,
+			.len		= len,
+			.flags		= flags,
+			.dst_parent	= dst_parent
+		};
+		wkq_err = au_wkq_wait(au_call_cpup_single, &args);
+		if (unlikely(wkq_err))
+			err = wkq_err;
+	}
+
+	return err;
+}
+
+/*
+ * copyup the @dentry from the first active lower branch to @bdst,
+ * using au_cpup_single().
+ */
+static int au_cpup_simple(struct dentry *dentry, aufs_bindex_t bdst, loff_t len,
+			  unsigned int flags)
+{
+	int err;
+	aufs_bindex_t bsrc, bend;
+
+	bend = au_dbend(dentry);
+	for (bsrc = bdst + 1; bsrc <= bend; bsrc++)
+		if (au_h_dptr(dentry, bsrc))
+			break;
+
+	err = au_lkup_neg(dentry, bdst);
+	if (!err) {
+		err = au_cpup_single(dentry, bdst, bsrc, len, flags, NULL);
+		if (!err)
+			return 0; /* success */
+
+		/* revert */
+		au_set_h_dptr(dentry, bdst, NULL);
+		au_set_dbstart(dentry, bsrc);
+	}
+
+	return err;
+}
+
+struct au_cpup_simple_args {
+	int *errp;
+	struct dentry *dentry;
+	aufs_bindex_t bdst;
+	loff_t len;
+	unsigned int flags;
+};
+
+static void au_call_cpup_simple(void *args)
+{
+	struct au_cpup_simple_args *a = args;
+	*a->errp = au_cpup_simple(a->dentry, a->bdst, a->len, a->flags);
+}
+
+int au_sio_cpup_simple(struct dentry *dentry, aufs_bindex_t bdst, loff_t len,
+		       unsigned int flags)
+{
+	int err, wkq_err;
+	unsigned char do_sio;
+	struct dentry *parent;
+	struct inode *h_dir;
+
+	parent = dget_parent(dentry);
+	h_dir = au_h_iptr(parent->d_inode, bdst);
+	do_sio = !!au_test_h_perm_sio(h_dir, MAY_EXEC | MAY_WRITE);
+	if (!do_sio) {
+		/*
+		 * testing CAP_MKNOD is for generic fs,
+		 * but CAP_FSETID is for xfs only, currently.
+		 */
+		umode_t mode = dentry->d_inode->i_mode;
+		do_sio = (((mode & (S_IFCHR | S_IFBLK))
+			   && !capable(CAP_MKNOD))
+			  || ((mode & (S_ISUID | S_ISGID))
+			      && !capable(CAP_FSETID)));
+	}
+	if (!do_sio)
+		err = au_cpup_simple(dentry, bdst, len, flags);
+	else {
+		struct au_cpup_simple_args args = {
+			.errp		= &err,
+			.dentry		= dentry,
+			.bdst		= bdst,
+			.len		= len,
+			.flags		= flags
+		};
+		wkq_err = au_wkq_wait(au_call_cpup_simple, &args);
+		if (unlikely(wkq_err))
+			err = wkq_err;
+	}
+
+	dput(parent);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * copyup the deleted file for writing.
+ */
+static int au_do_cpup_wh(struct dentry *dentry, aufs_bindex_t bdst,
+			 struct dentry *wh_dentry, struct file *file,
+			 loff_t len)
+{
+	int err;
+	aufs_bindex_t bstart;
+	struct au_dinfo *dinfo;
+	struct dentry *h_d_dst, *h_d_start;
+
+	dinfo = au_di(dentry);
+	bstart = dinfo->di_bstart;
+	h_d_dst = dinfo->di_hdentry[0 + bdst].hd_dentry;
+	dinfo->di_bstart = bdst;
+	dinfo->di_hdentry[0 + bdst].hd_dentry = wh_dentry;
+	h_d_start = dinfo->di_hdentry[0 + bstart].hd_dentry;
+	if (file)
+		dinfo->di_hdentry[0 + bstart].hd_dentry
+			= au_h_fptr(file, au_fbstart(file))->f_dentry;
+	err = au_cpup_single(dentry, bdst, bstart, len, !AuCpup_DTIME,
+			     /*h_parent*/NULL);
+	if (!err && file) {
+		err = au_reopen_nondir(file);
+		dinfo->di_hdentry[0 + bstart].hd_dentry = h_d_start;
+	}
+	dinfo->di_hdentry[0 + bdst].hd_dentry = h_d_dst;
+	dinfo->di_bstart = bstart;
+
+	return err;
+}
+
+static int au_cpup_wh(struct dentry *dentry, aufs_bindex_t bdst, loff_t len,
+		      struct file *file)
+{
+	int err;
+	struct au_dtime dt;
+	struct dentry *parent, *h_parent, *wh_dentry;
+	struct au_branch *br;
+	struct path h_path;
+
+	br = au_sbr(dentry->d_sb, bdst);
+	parent = dget_parent(dentry);
+	h_parent = au_h_dptr(parent, bdst);
+	wh_dentry = au_whtmp_lkup(h_parent, br, &dentry->d_name);
+	err = PTR_ERR(wh_dentry);
+	if (IS_ERR(wh_dentry))
+		goto out;
+
+	h_path.dentry = h_parent;
+	h_path.mnt = br->br_mnt;
+	au_dtime_store(&dt, parent, &h_path);
+	err = au_do_cpup_wh(dentry, bdst, wh_dentry, file, len);
+	if (unlikely(err))
+		goto out_wh;
+
+	dget(wh_dentry);
+	h_path.dentry = wh_dentry;
+	err = vfsub_unlink(h_parent->d_inode, &h_path, /*force*/0);
+	if (unlikely(err)) {
+		AuIOErr("failed remove copied-up tmp file %.*s(%d)\n",
+			AuDLNPair(wh_dentry), err);
+		err = -EIO;
+	}
+	au_dtime_revert(&dt);
+	au_set_hi_wh(dentry->d_inode, bdst, wh_dentry);
+
+ out_wh:
+	dput(wh_dentry);
+ out:
+	dput(parent);
+	return err;
+}
+
+struct au_cpup_wh_args {
+	int *errp;
+	struct dentry *dentry;
+	aufs_bindex_t bdst;
+	loff_t len;
+	struct file *file;
+};
+
+static void au_call_cpup_wh(void *args)
+{
+	struct au_cpup_wh_args *a = args;
+	*a->errp = au_cpup_wh(a->dentry, a->bdst, a->len, a->file);
+}
+
+int au_sio_cpup_wh(struct dentry *dentry, aufs_bindex_t bdst, loff_t len,
+		   struct file *file)
+{
+	int err, wkq_err;
+	struct dentry *parent, *h_orph, *h_parent, *h_dentry;
+	struct inode *dir, *h_dir, *h_tmpdir, *h_inode;
+	struct au_wbr *wbr;
+
+	parent = dget_parent(dentry);
+	dir = parent->d_inode;
+	h_orph = NULL;
+	h_parent = NULL;
+	h_dir = au_igrab(au_h_iptr(dir, bdst));
+	h_tmpdir = h_dir;
+	if (!h_dir->i_nlink) {
+		wbr = au_sbr(dentry->d_sb, bdst)->br_wbr;
+		h_orph = wbr->wbr_orph;
+
+		h_parent = dget(au_h_dptr(parent, bdst));
+		au_set_h_dptr(parent, bdst, NULL);
+		au_set_h_dptr(parent, bdst, dget(h_orph));
+		h_tmpdir = h_orph->d_inode;
+		au_set_h_iptr(dir, bdst, NULL, 0);
+		au_set_h_iptr(dir, bdst, au_igrab(h_tmpdir), /*flags*/0);
+
+		/* this temporary unlock is safe */
+		if (file)
+			h_dentry = au_h_fptr(file, au_fbstart(file))->f_dentry;
+		else
+			h_dentry = au_h_dptr(dentry, au_dbstart(dentry));
+		h_inode = h_dentry->d_inode;
+		IMustLock(h_inode);
+		mutex_unlock(&h_inode->i_mutex);
+		mutex_lock_nested(&h_tmpdir->i_mutex, AuLsc_I_PARENT2);
+		mutex_lock_nested(&h_inode->i_mutex, AuLsc_I_CHILD);
+	}
+
+	if (!au_test_h_perm_sio(h_tmpdir, MAY_EXEC | MAY_WRITE))
+		err = au_cpup_wh(dentry, bdst, len, file);
+	else {
+		struct au_cpup_wh_args args = {
+			.errp	= &err,
+			.dentry	= dentry,
+			.bdst	= bdst,
+			.len	= len,
+			.file	= file
+		};
+		wkq_err = au_wkq_wait(au_call_cpup_wh, &args);
+		if (unlikely(wkq_err))
+			err = wkq_err;
+	}
+
+	if (h_orph) {
+		mutex_unlock(&h_tmpdir->i_mutex);
+		au_set_h_iptr(dir, bdst, NULL, 0);
+		au_set_h_iptr(dir, bdst, au_igrab(h_dir), /*flags*/0);
+		au_set_h_dptr(parent, bdst, NULL);
+		au_set_h_dptr(parent, bdst, h_parent);
+	}
+	iput(h_dir);
+	dput(parent);
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * generic routine for both of copy-up and copy-down.
+ */
+/* cf. revalidate function in file.c */
+int au_cp_dirs(struct dentry *dentry, aufs_bindex_t bdst,
+	       int (*cp)(struct dentry *dentry, aufs_bindex_t bdst,
+			 struct dentry *h_parent, void *arg),
+	       void *arg)
+{
+	int err;
+	struct au_pin pin;
+	struct dentry *d, *parent, *h_parent, *real_parent;
+
+	err = 0;
+	parent = dget_parent(dentry);
+	if (IS_ROOT(parent))
+		goto out;
+
+	au_pin_init(&pin, dentry, bdst, AuLsc_DI_PARENT2, AuLsc_I_PARENT2,
+		    au_opt_udba(dentry->d_sb), AuPin_MNT_WRITE);
+
+	/* do not use au_dpage */
+	real_parent = parent;
+	while (1) {
+		dput(parent);
+		parent = dget_parent(dentry);
+		h_parent = au_h_dptr(parent, bdst);
+		if (h_parent)
+			goto out; /* success */
+
+		/* find top dir which is necessary to cpup */
+		do {
+			d = parent;
+			dput(parent);
+			parent = dget_parent(d);
+			di_read_lock_parent3(parent, !AuLock_IR);
+			h_parent = au_h_dptr(parent, bdst);
+			di_read_unlock(parent, !AuLock_IR);
+		} while (!h_parent);
+
+		if (d != real_parent)
+			di_write_lock_child3(d);
+
+		/* somebody else might create while we were sleeping */
+		if (!au_h_dptr(d, bdst) || !au_h_dptr(d, bdst)->d_inode) {
+			if (au_h_dptr(d, bdst))
+				au_update_dbstart(d);
+
+			au_pin_set_dentry(&pin, d);
+			err = au_do_pin(&pin);
+			if (!err) {
+				err = cp(d, bdst, h_parent, arg);
+				au_unpin(&pin);
+			}
+		}
+
+		if (d != real_parent)
+			di_write_unlock(d);
+		if (unlikely(err))
+			break;
+	}
+
+ out:
+	dput(parent);
+	return err;
+}
+
+static int au_cpup_dir(struct dentry *dentry, aufs_bindex_t bdst,
+		       struct dentry *h_parent __maybe_unused ,
+		       void *arg __maybe_unused)
+{
+	return au_sio_cpup_simple(dentry, bdst, -1, AuCpup_DTIME);
+}
+
+int au_cpup_dirs(struct dentry *dentry, aufs_bindex_t bdst)
+{
+	return au_cp_dirs(dentry, bdst, au_cpup_dir, NULL);
+}
+
+int au_test_and_cpup_dirs(struct dentry *dentry, aufs_bindex_t bdst)
+{
+	int err;
+	struct dentry *parent;
+	struct inode *dir;
+
+	parent = dget_parent(dentry);
+	dir = parent->d_inode;
+	err = 0;
+	if (au_h_iptr(dir, bdst))
+		goto out;
+
+	di_read_unlock(parent, AuLock_IR);
+	di_write_lock_parent(parent);
+	/* someone else might change our inode while we were sleeping */
+	if (!au_h_iptr(dir, bdst))
+		err = au_cpup_dirs(dentry, bdst);
+	di_downgrade_lock(parent, AuLock_IR);
+
+ out:
+	dput(parent);
+	return err;
+}
diff --git a/fs/aufs/cpup.h b/fs/aufs/cpup.h
new file mode 100644
index 0000000..240014a
--- /dev/null
+++ b/fs/aufs/cpup.h
@@ -0,0 +1,68 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * copy-up/down functions
+ */
+
+#ifndef __AUFS_CPUP_H__
+#define __AUFS_CPUP_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+#include <linux/aufs_type.h>
+
+void au_cpup_attr_flags(struct inode *dst, struct inode *src);
+void au_cpup_attr_timesizes(struct inode *inode);
+void au_cpup_attr_nlink(struct inode *inode, int force);
+void au_cpup_attr_changeable(struct inode *inode);
+void au_cpup_igen(struct inode *inode, struct inode *h_inode);
+void au_cpup_attr_all(struct inode *inode, int force);
+
+/* ---------------------------------------------------------------------- */
+
+/* cpup flags */
+#define AuCpup_DTIME	1		/* do dtime_store/revert */
+#define AuCpup_KEEPLINO	(1 << 1)	/* do not clear the lower xino,
+					   for link(2) */
+#define au_ftest_cpup(flags, name)	((flags) & AuCpup_##name)
+#define au_fset_cpup(flags, name)	{ (flags) |= AuCpup_##name; }
+#define au_fclr_cpup(flags, name)	{ (flags) &= ~AuCpup_##name; }
+
+int au_copy_file(struct file *dst, struct file *src, loff_t len);
+int au_sio_cpup_single(struct dentry *dentry, aufs_bindex_t bdst,
+		       aufs_bindex_t bsrc, loff_t len, unsigned int flags,
+		       struct dentry *dst_parent);
+int au_sio_cpup_simple(struct dentry *dentry, aufs_bindex_t bdst, loff_t len,
+		       unsigned int flags);
+int au_sio_cpup_wh(struct dentry *dentry, aufs_bindex_t bdst, loff_t len,
+		   struct file *file);
+
+int au_cp_dirs(struct dentry *dentry, aufs_bindex_t bdst,
+	       int (*cp)(struct dentry *dentry, aufs_bindex_t bdst,
+			 struct dentry *h_parent, void *arg),
+	       void *arg);
+int au_cpup_dirs(struct dentry *dentry, aufs_bindex_t bdst);
+int au_test_and_cpup_dirs(struct dentry *dentry, aufs_bindex_t bdst);
+
+/* ---------------------------------------------------------------------- */
+
+/* keep timestamps when copyup */
+struct au_dtime {
+	struct dentry *dt_dentry;
+	struct path dt_h_path;
+	struct timespec dt_atime, dt_mtime;
+};
+void au_dtime_store(struct au_dtime *dt, struct dentry *dentry,
+		    struct path *h_path);
+void au_dtime_revert(struct au_dtime *dt);
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_CPUP_H__ */
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 13/28] aufs whiteout
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (11 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 12/28] aufs copy-up J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 14/28] aufs pseudo-link J. R. Okajima
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
whiteout for logical deletion and opaque directory

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/whout.c | 1010 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/whout.h |   78 +++++
 2 files changed, 1088 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/whout.c
 create mode 100644 fs/aufs/whout.h

diff --git a/fs/aufs/whout.c b/fs/aufs/whout.c
new file mode 100644
index 0000000..f2343d2
--- /dev/null
+++ b/fs/aufs/whout.c
@@ -0,0 +1,1010 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * whiteout for logical deletion and opaque directory
+ */
+
+#include <linux/fs.h>
+#include "aufs.h"
+
+#define WH_MASK			S_IRUGO
+
+/*
+ * If a directory contains this file, then it is opaque.  We start with the
+ * .wh. flag so that it is blocked by lookup.
+ */
+static struct qstr diropq_name = {
+	.name = AUFS_WH_DIROPQ,
+	.len = sizeof(AUFS_WH_DIROPQ) - 1
+};
+
+/*
+ * generate whiteout name, which is NOT terminated by NULL.
+ * @name: original d_name.name
+ * @len: original d_name.len
+ * @wh: whiteout qstr
+ * returns zero when succeeds, otherwise error.
+ * succeeded value as wh->name should be freed by kfree().
+ */
+int au_wh_name_alloc(struct qstr *wh, const struct qstr *name)
+{
+	char *p;
+
+	if (unlikely(name->len > PATH_MAX - AUFS_WH_PFX_LEN))
+		return -ENAMETOOLONG;
+
+	wh->len = name->len + AUFS_WH_PFX_LEN;
+	p = kmalloc(wh->len, GFP_NOFS);
+	wh->name = p;
+	if (p) {
+		memcpy(p, AUFS_WH_PFX, AUFS_WH_PFX_LEN);
+		memcpy(p + AUFS_WH_PFX_LEN, name->name, name->len);
+		/* smp_mb(); */
+		return 0;
+	}
+	return -ENOMEM;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * test if the @wh_name exists under @h_parent.
+ * @try_sio specifies the necessary of super-io.
+ */
+int au_wh_test(struct dentry *h_parent, struct qstr *wh_name,
+	       struct au_branch *br, int try_sio)
+{
+	int err;
+	struct dentry *wh_dentry;
+	struct inode *h_dir;
+
+	h_dir = h_parent->d_inode;
+	if (!try_sio)
+		wh_dentry = au_lkup_one(wh_name, h_parent, br, /*nd*/NULL);
+	else
+		wh_dentry = au_sio_lkup_one(wh_name, h_parent, br);
+	err = PTR_ERR(wh_dentry);
+	if (IS_ERR(wh_dentry))
+		goto out;
+
+	err = 0;
+	if (!wh_dentry->d_inode)
+		goto out_wh; /* success */
+
+	err = 1;
+	if (S_ISREG(wh_dentry->d_inode->i_mode))
+		goto out_wh; /* success */
+
+	err = -EIO;
+	AuIOErr("%.*s Invalid whiteout entry type 0%o.\n",
+		AuDLNPair(wh_dentry), wh_dentry->d_inode->i_mode);
+
+ out_wh:
+	dput(wh_dentry);
+ out:
+	return err;
+}
+
+/*
+ * test if the @h_dentry sets opaque or not.
+ */
+int au_diropq_test(struct dentry *h_dentry, struct au_branch *br)
+{
+	int err;
+	struct inode *h_dir;
+
+	h_dir = h_dentry->d_inode;
+	err = au_wh_test(h_dentry, &diropq_name, br,
+			 au_test_h_perm_sio(h_dir, MAY_EXEC));
+	return err;
+}
+
+/*
+ * returns a negative dentry whose name is unique and temporary.
+ */
+struct dentry *au_whtmp_lkup(struct dentry *h_parent, struct au_branch *br,
+			     struct qstr *prefix)
+{
+#define HEX_LEN	4
+	struct dentry *dentry;
+	int i;
+	char defname[AUFS_WH_PFX_LEN * 2 + DNAME_INLINE_LEN_MIN + 1
+		     + HEX_LEN + 1], *name, *p;
+	static unsigned short cnt;
+	struct qstr qs;
+
+	name = defname;
+	qs.len = sizeof(defname) - DNAME_INLINE_LEN_MIN + prefix->len - 1;
+	if (unlikely(prefix->len > DNAME_INLINE_LEN_MIN)) {
+		dentry = ERR_PTR(-ENAMETOOLONG);
+		if (unlikely(qs.len >= PATH_MAX))
+			goto out;
+		dentry = ERR_PTR(-ENOMEM);
+		name = kmalloc(qs.len + 1, GFP_NOFS);
+		if (unlikely(!name))
+			goto out;
+	}
+
+	/* doubly whiteout-ed */
+	memcpy(name, AUFS_WH_PFX AUFS_WH_PFX, AUFS_WH_PFX_LEN * 2);
+	p = name + AUFS_WH_PFX_LEN * 2;
+	memcpy(p, prefix->name, prefix->len);
+	p += prefix->len;
+	*p++ = '.';
+	AuDebugOn(name + qs.len + 1 - p <= HEX_LEN);
+
+	qs.name = name;
+	for (i = 0; i < 3; i++) {
+		sprintf(p, "%.*d", HEX_LEN, cnt++);
+		dentry = au_sio_lkup_one(&qs, h_parent, br);
+		if (IS_ERR(dentry) || !dentry->d_inode)
+			goto out_name;
+		dput(dentry);
+	}
+	/* AuWarn("could not get random name\n"); */
+	dentry = ERR_PTR(-EEXIST);
+	AuDbg("%.*s\n", AuLNPair(&qs));
+	BUG();
+
+ out_name:
+	if (name != defname)
+		kfree(name);
+ out:
+	return dentry;
+#undef HEX_LEN
+}
+
+/*
+ * rename the @h_dentry on @br to the whiteouted temporary name.
+ */
+int au_whtmp_ren(struct dentry *h_dentry, struct au_branch *br)
+{
+	int err;
+	struct path h_path = {
+		.mnt = br->br_mnt
+	};
+	struct inode *h_dir;
+	struct dentry *h_parent;
+
+	h_parent = h_dentry->d_parent; /* dir inode is locked */
+	h_dir = h_parent->d_inode;
+	IMustLock(h_dir);
+
+	h_path.dentry = au_whtmp_lkup(h_parent, br, &h_dentry->d_name);
+	err = PTR_ERR(h_path.dentry);
+	if (IS_ERR(h_path.dentry))
+		goto out;
+
+	/* under the same dir, no need to lock_rename() */
+	err = vfsub_rename(h_dir, h_dentry, h_dir, &h_path);
+	AuTraceErr(err);
+	dput(h_path.dentry);
+
+ out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+/*
+ * functions for removing a whiteout
+ */
+
+static int do_unlink_wh(struct inode *h_dir, struct path *h_path)
+{
+	int force;
+
+	/*
+	 * forces superio when the dir has a sticky bit.
+	 * this may be a violation of unix fs semantics.
+	 */
+	force = (h_dir->i_mode & S_ISVTX)
+		&& h_path->dentry->d_inode->i_uid != current_fsuid();
+	return vfsub_unlink(h_dir, h_path, force);
+}
+
+int au_wh_unlink_dentry(struct inode *h_dir, struct path *h_path,
+			struct dentry *dentry)
+{
+	int err;
+
+	err = do_unlink_wh(h_dir, h_path);
+	if (!err && dentry)
+		au_set_dbwh(dentry, -1);
+
+	return err;
+}
+
+static int unlink_wh_name(struct dentry *h_parent, struct qstr *wh,
+			  struct au_branch *br)
+{
+	int err;
+	struct path h_path = {
+		.mnt = br->br_mnt
+	};
+
+	err = 0;
+	h_path.dentry = au_lkup_one(wh, h_parent, br, /*nd*/NULL);
+	if (IS_ERR(h_path.dentry))
+		err = PTR_ERR(h_path.dentry);
+	else {
+		if (h_path.dentry->d_inode
+		    && S_ISREG(h_path.dentry->d_inode->i_mode))
+			err = do_unlink_wh(h_parent->d_inode, &h_path);
+		dput(h_path.dentry);
+	}
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+/*
+ * initialize/clean whiteout for a branch
+ */
+
+static void au_wh_clean(struct inode *h_dir, struct path *whpath,
+			const int isdir)
+{
+	int err;
+
+	if (!whpath->dentry->d_inode)
+		return;
+
+	err = mnt_want_write(whpath->mnt);
+	if (!err) {
+		if (isdir)
+			err = vfsub_rmdir(h_dir, whpath);
+		else
+			err = vfsub_unlink(h_dir, whpath, /*force*/0);
+		mnt_drop_write(whpath->mnt);
+	}
+	if (unlikely(err))
+		AuWarn("failed removing %.*s (%d), ignored.\n",
+		       AuDLNPair(whpath->dentry), err);
+}
+
+static int test_linkable(struct dentry *h_root)
+{
+	struct inode *h_dir = h_root->d_inode;
+
+	if (h_dir->i_op->link)
+		return 0;
+
+	AuErr("%.*s (%s) doesn't support link(2), use noplink and rw+nolwh\n",
+	      AuDLNPair(h_root), au_sbtype(h_root->d_sb));
+	return -ENOSYS;
+}
+
+/* todo: should this mkdir be done in /sbin/mount.aufs helper? */
+static int au_whdir(struct inode *h_dir, struct path *path)
+{
+	int err;
+
+	err = -EEXIST;
+	if (!path->dentry->d_inode) {
+		int mode = S_IRWXU;
+
+		if (au_test_nfs(path->dentry->d_sb))
+			mode |= S_IXUGO;
+		err = mnt_want_write(path->mnt);
+		if (!err) {
+			err = vfsub_mkdir(h_dir, path, mode);
+			mnt_drop_write(path->mnt);
+		}
+	} else if (S_ISDIR(path->dentry->d_inode->i_mode))
+		err = 0;
+	else
+		AuErr("unknown %.*s exists\n", AuDLNPair(path->dentry));
+
+	return err;
+}
+
+struct au_wh_base {
+	const struct qstr *name;
+	struct dentry *dentry;
+};
+
+static void au_wh_init_ro(struct inode *h_dir, struct au_wh_base base[],
+			  struct path *h_path)
+{
+	h_path->dentry = base[AuBrWh_BASE].dentry;
+	au_wh_clean(h_dir, h_path, /*isdir*/0);
+	h_path->dentry = base[AuBrWh_PLINK].dentry;
+	au_wh_clean(h_dir, h_path, /*isdir*/1);
+	h_path->dentry = base[AuBrWh_ORPH].dentry;
+	au_wh_clean(h_dir, h_path, /*isdir*/1);
+}
+
+/*
+ * returns tri-state,
+ * minus: error, caller should print the mesage
+ * zero: succuess
+ * plus: error, caller should NOT print the mesage
+ */
+static int au_wh_init_rw_nolink(struct dentry *h_root, struct au_wbr *wbr,
+				int do_plink, struct au_wh_base base[],
+				struct path *h_path)
+{
+	int err;
+	struct inode *h_dir;
+
+	h_dir = h_root->d_inode;
+	h_path->dentry = base[AuBrWh_BASE].dentry;
+	au_wh_clean(h_dir, h_path, /*isdir*/0);
+	h_path->dentry = base[AuBrWh_PLINK].dentry;
+	if (do_plink) {
+		err = test_linkable(h_root);
+		if (unlikely(err)) {
+			err = 1;
+			goto out;
+		}
+
+		err = au_whdir(h_dir, h_path);
+		if (unlikely(err))
+			goto out;
+		wbr->wbr_plink = dget(base[AuBrWh_PLINK].dentry);
+	} else
+		au_wh_clean(h_dir, h_path, /*isdir*/1);
+	h_path->dentry = base[AuBrWh_ORPH].dentry;
+	err = au_whdir(h_dir, h_path);
+	if (unlikely(err))
+		goto out;
+	wbr->wbr_orph = dget(base[AuBrWh_ORPH].dentry);
+
+ out:
+	return err;
+}
+
+/*
+ * for the moment, aufs supports the branch filesystem which does not support
+ * link(2). testing on FAT which does not support i_op->setattr() fully either,
+ * copyup failed. finally, such filesystem will not be used as the writable
+ * branch.
+ *
+ * returns tri-state, see above.
+ */
+static int au_wh_init_rw(struct dentry *h_root, struct au_wbr *wbr,
+			 int do_plink, struct au_wh_base base[],
+			 struct path *h_path)
+{
+	int err;
+	struct inode *h_dir;
+
+	err = test_linkable(h_root);
+	if (unlikely(err)) {
+		err = 1;
+		goto out;
+	}
+
+	/*
+	 * todo: should this create be done in /sbin/mount.aufs helper?
+	 */
+	err = -EEXIST;
+	h_dir = h_root->d_inode;
+	if (!base[AuBrWh_BASE].dentry->d_inode) {
+		err = mnt_want_write(h_path->mnt);
+		if (!err) {
+			h_path->dentry = base[AuBrWh_BASE].dentry;
+			err = vfsub_create(h_dir, h_path, WH_MASK);
+			mnt_drop_write(h_path->mnt);
+		}
+	} else if (S_ISREG(base[AuBrWh_BASE].dentry->d_inode->i_mode))
+		err = 0;
+	else
+		AuErr("unknown %.*s/%.*s exists\n",
+		      AuDLNPair(h_root), AuDLNPair(base[AuBrWh_BASE].dentry));
+	if (unlikely(err))
+		goto out;
+
+	h_path->dentry = base[AuBrWh_PLINK].dentry;
+	if (do_plink) {
+		err = au_whdir(h_dir, h_path);
+		if (unlikely(err))
+			goto out;
+		wbr->wbr_plink = dget(base[AuBrWh_PLINK].dentry);
+	} else
+		au_wh_clean(h_dir, h_path, /*isdir*/1);
+	wbr->wbr_whbase = dget(base[AuBrWh_BASE].dentry);
+
+	h_path->dentry = base[AuBrWh_ORPH].dentry;
+	err = au_whdir(h_dir, h_path);
+	if (unlikely(err))
+		goto out;
+	wbr->wbr_orph = dget(base[AuBrWh_ORPH].dentry);
+
+ out:
+	return err;
+}
+
+/*
+ * initialize the whiteout base file/dir for @br.
+ */
+int au_wh_init(struct dentry *h_root, struct au_branch *br,
+	       struct super_block *sb)
+{
+	int err, i;
+	const unsigned char do_plink
+		= !!au_opt_test(au_mntflags(sb), PLINK);
+	struct path path = {
+		.mnt = br->br_mnt
+	};
+	struct inode *h_dir;
+	struct au_wbr *wbr = br->br_wbr;
+	static const struct qstr base_name[] = {
+		[AuBrWh_BASE] = {
+			.name	= AUFS_BASE_NAME,
+			.len	= sizeof(AUFS_BASE_NAME) - 1
+		},
+		[AuBrWh_PLINK] = {
+			.name	= AUFS_PLINKDIR_NAME,
+			.len	= sizeof(AUFS_PLINKDIR_NAME) - 1
+		},
+		[AuBrWh_ORPH] = {
+			.name	= AUFS_ORPHDIR_NAME,
+			.len	= sizeof(AUFS_ORPHDIR_NAME) - 1
+		}
+	};
+	struct au_wh_base base[] = {
+		[AuBrWh_BASE] = {
+			.name	= base_name + AuBrWh_BASE,
+			.dentry	= NULL
+		},
+		[AuBrWh_PLINK] = {
+			.name	= base_name + AuBrWh_PLINK,
+			.dentry	= NULL
+		},
+		[AuBrWh_ORPH] = {
+			.name	= base_name + AuBrWh_ORPH,
+			.dentry	= NULL
+		}
+	};
+
+
+	h_dir = h_root->d_inode;
+	for (i = 0; i < AuBrWh_Last; i++) {
+		/* doubly whiteouted */
+		struct dentry *d;
+
+		d = au_wh_lkup(h_root, (void *)base[i].name, br);
+		err = PTR_ERR(d);
+		if (IS_ERR(d))
+			goto out;
+
+		base[i].dentry = d;
+		AuDebugOn(wbr
+			  && wbr->wbr_wh[i]
+			  && wbr->wbr_wh[i] != base[i].dentry);
+	}
+
+	if (wbr)
+		for (i = 0; i < AuBrWh_Last; i++) {
+			dput(wbr->wbr_wh[i]);
+			wbr->wbr_wh[i] = NULL;
+		}
+
+	err = 0;
+
+	switch (br->br_perm) {
+	case AuBrPerm_RO:
+	case AuBrPerm_ROWH:
+	case AuBrPerm_RR:
+	case AuBrPerm_RRWH:
+		au_wh_init_ro(h_dir, base, &path);
+		break;
+
+	case AuBrPerm_RWNoLinkWH:
+		err = au_wh_init_rw_nolink(h_root, wbr, do_plink, base, &path);
+		if (err > 0)
+			goto out;
+		else if (err)
+			goto out_err;
+		break;
+
+	case AuBrPerm_RW:
+		err = au_wh_init_rw(h_root, wbr, do_plink, base, &path);
+		if (err > 0)
+			goto out;
+		else if (err)
+			goto out_err;
+		break;
+
+	default:
+		BUG();
+	}
+	goto out; /* success */
+
+ out_err:
+	AuErr("an error(%d) on the writable branch %.*s(%s)\n",
+	      err, AuDLNPair(h_root), au_sbtype(h_root->d_sb));
+ out:
+	for (i = 0; i < AuBrWh_Last; i++)
+		dput(base[i].dentry);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+/*
+ * whiteouts are all hard-linked usually.
+ * when its link count reaches a ceiling, we create a new whiteout base
+ * asynchronously.
+ */
+
+struct reinit_br_wh {
+	struct super_block *sb;
+	struct au_branch *br;
+};
+
+static void reinit_br_wh(void *arg)
+{
+	int err;
+	aufs_bindex_t bindex;
+	struct path h_path;
+	struct reinit_br_wh *a = arg;
+	struct au_wbr *wbr;
+	struct inode *dir;
+	struct dentry *h_root;
+	struct au_hinode *hdir;
+
+	err = 0;
+	wbr = a->br->br_wbr;
+	/* big aufs lock */
+	si_noflush_write_lock(a->sb);
+	if (!au_br_writable(a->br->br_perm))
+		goto out;
+	bindex = au_br_index(a->sb, a->br->br_id);
+	if (unlikely(bindex < 0))
+		goto out;
+
+	dir = a->sb->s_root->d_inode;
+	/* ii_read_lock_parent(dir); */
+	hdir = au_hi(dir, bindex);
+	h_root = au_h_dptr(a->sb->s_root, bindex);
+
+	au_hin_imtx_lock_nested(hdir, AuLsc_I_PARENT);
+	wbr_wh_write_lock(wbr);
+	err = au_h_verify(wbr->wbr_whbase, au_opt_udba(a->sb), hdir->hi_inode,
+			  h_root, a->br);
+	if (!err) {
+		err = mnt_want_write(a->br->br_mnt);
+		if (!err) {
+			h_path.dentry = wbr->wbr_whbase;
+			h_path.mnt = a->br->br_mnt;
+			err = vfsub_unlink(hdir->hi_inode, &h_path, /*force*/0);
+			mnt_drop_write(a->br->br_mnt);
+		}
+	} else {
+		AuWarn("%.*s is moved, ignored\n", AuDLNPair(wbr->wbr_whbase));
+		err = 0;
+	}
+	dput(wbr->wbr_whbase);
+	wbr->wbr_whbase = NULL;
+	if (!err)
+		err = au_wh_init(h_root, a->br, a->sb);
+	wbr_wh_write_unlock(wbr);
+	au_hin_imtx_unlock(hdir);
+	/* ii_read_unlock(dir); */
+
+ out:
+	if (wbr)
+		atomic_dec(&wbr->wbr_wh_running);
+	atomic_dec(&a->br->br_count);
+	au_nwt_done(&au_sbi(a->sb)->si_nowait);
+	si_write_unlock(a->sb);
+	kfree(arg);
+	if (unlikely(err))
+		AuIOErr("err %d\n", err);
+}
+
+static void kick_reinit_br_wh(struct super_block *sb, struct au_branch *br)
+{
+	int do_dec, wkq_err;
+	struct reinit_br_wh *arg;
+
+	do_dec = 1;
+	if (atomic_inc_return(&br->br_wbr->wbr_wh_running) != 1)
+		goto out;
+
+	/* ignore ENOMEM */
+	arg = kmalloc(sizeof(*arg), GFP_NOFS);
+	if (arg) {
+		/*
+		 * dec(wh_running), kfree(arg) and dec(br_count)
+		 * in reinit function
+		 */
+		arg->sb = sb;
+		arg->br = br;
+		atomic_inc(&br->br_count);
+		wkq_err = au_wkq_nowait(reinit_br_wh, arg, sb);
+		if (unlikely(wkq_err)) {
+			atomic_dec(&br->br_wbr->wbr_wh_running);
+			atomic_dec(&br->br_count);
+			kfree(arg);
+		}
+		do_dec = 0;
+	}
+
+ out:
+	if (do_dec)
+		atomic_dec(&br->br_wbr->wbr_wh_running);
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * create the whiteout @wh.
+ */
+static int link_or_create_wh(struct super_block *sb, aufs_bindex_t bindex,
+			     struct dentry *wh)
+{
+	int err;
+	struct path h_path = {
+		.dentry = wh
+	};
+	struct au_branch *br;
+	struct au_wbr *wbr;
+	struct dentry *h_parent;
+	struct inode *h_dir;
+
+	h_parent = wh->d_parent; /* dir inode is locked */
+	h_dir = h_parent->d_inode;
+	IMustLock(h_dir);
+
+	br = au_sbr(sb, bindex);
+	h_path.mnt = br->br_mnt;
+	wbr = br->br_wbr;
+	wbr_wh_read_lock(wbr);
+	if (wbr->wbr_whbase) {
+		err = vfsub_link(wbr->wbr_whbase, h_dir, &h_path);
+		if (!err || err != -EMLINK)
+			goto out;
+
+		/* link count full. re-initialize br_whbase. */
+		kick_reinit_br_wh(sb, br);
+	}
+
+	/* return this error in this context */
+	err = vfsub_create(h_dir, &h_path, WH_MASK);
+
+ out:
+	wbr_wh_read_unlock(wbr);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * create or remove the diropq.
+ */
+static struct dentry *do_diropq(struct dentry *dentry, aufs_bindex_t bindex,
+				unsigned int flags)
+{
+	struct dentry *opq_dentry, *h_dentry;
+	struct super_block *sb;
+	struct au_branch *br;
+	int err;
+
+	sb = dentry->d_sb;
+	br = au_sbr(sb, bindex);
+	h_dentry = au_h_dptr(dentry, bindex);
+	opq_dentry = au_lkup_one(&diropq_name, h_dentry, br, /*nd*/NULL);
+	if (IS_ERR(opq_dentry))
+		goto out;
+
+	if (au_ftest_diropq(flags, CREATE)) {
+		err = link_or_create_wh(sb, bindex, opq_dentry);
+		if (!err) {
+			au_set_dbdiropq(dentry, bindex);
+			goto out; /* success */
+		}
+	} else {
+		struct path tmp = {
+			.dentry = opq_dentry,
+			.mnt	= br->br_mnt
+		};
+		err = do_unlink_wh(au_h_iptr(dentry->d_inode, bindex), &tmp);
+		if (!err)
+			au_set_dbdiropq(dentry, -1);
+	}
+	dput(opq_dentry);
+	opq_dentry = ERR_PTR(err);
+
+ out:
+	return opq_dentry;
+}
+
+struct do_diropq_args {
+	struct dentry **errp;
+	struct dentry *dentry;
+	aufs_bindex_t bindex;
+	unsigned int flags;
+};
+
+static void call_do_diropq(void *args)
+{
+	struct do_diropq_args *a = args;
+	*a->errp = do_diropq(a->dentry, a->bindex, a->flags);
+}
+
+struct dentry *au_diropq_sio(struct dentry *dentry, aufs_bindex_t bindex,
+			     unsigned int flags)
+{
+	struct dentry *diropq, *h_dentry;
+
+	h_dentry = au_h_dptr(dentry, bindex);
+	if (!au_test_h_perm_sio(h_dentry->d_inode, MAY_EXEC | MAY_WRITE))
+		diropq = do_diropq(dentry, bindex, flags);
+	else {
+		int wkq_err;
+		struct do_diropq_args args = {
+			.errp		= &diropq,
+			.dentry		= dentry,
+			.bindex		= bindex,
+			.flags		= flags
+		};
+
+		wkq_err = au_wkq_wait(call_do_diropq, &args);
+		if (unlikely(wkq_err))
+			diropq = ERR_PTR(wkq_err);
+	}
+
+	return diropq;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * lookup whiteout dentry.
+ * @h_parent: lower parent dentry which must exist and be locked
+ * @base_name: name of dentry which will be whiteouted
+ * returns dentry for whiteout.
+ */
+struct dentry *au_wh_lkup(struct dentry *h_parent, struct qstr *base_name,
+			  struct au_branch *br)
+{
+	int err;
+	struct qstr wh_name;
+	struct dentry *wh_dentry;
+
+	err = au_wh_name_alloc(&wh_name, base_name);
+	wh_dentry = ERR_PTR(err);
+	if (!err) {
+		wh_dentry = au_lkup_one(&wh_name, h_parent, br, /*nd*/NULL);
+		kfree(wh_name.name);
+	}
+	return wh_dentry;
+}
+
+/*
+ * link/create a whiteout for @dentry on @bindex.
+ */
+struct dentry *au_wh_create(struct dentry *dentry, aufs_bindex_t bindex,
+			    struct dentry *h_parent)
+{
+	struct dentry *wh_dentry;
+	struct super_block *sb;
+	int err;
+
+	sb = dentry->d_sb;
+	wh_dentry = au_wh_lkup(h_parent, &dentry->d_name, au_sbr(sb, bindex));
+	if (!IS_ERR(wh_dentry) && !wh_dentry->d_inode) {
+		err = link_or_create_wh(sb, bindex, wh_dentry);
+		if (!err)
+			au_set_dbwh(dentry, bindex);
+		else {
+			dput(wh_dentry);
+			wh_dentry = ERR_PTR(err);
+		}
+	}
+
+	return wh_dentry;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* Delete all whiteouts in this directory on branch bindex. */
+static int del_wh_children(struct dentry *h_dentry, struct au_nhash *whlist,
+			   aufs_bindex_t bindex, struct au_branch *br)
+{
+	int err, i;
+	struct qstr wh_name;
+	char *p;
+	struct hlist_head *head;
+	struct au_vdir_wh *tpos;
+	struct hlist_node *pos;
+	struct au_vdir_destr *str;
+
+	err = -ENOMEM;
+	p = __getname();
+	wh_name.name = p;
+	if (unlikely(!wh_name.name))
+		goto out;
+
+	err = 0;
+	memcpy(p, AUFS_WH_PFX, AUFS_WH_PFX_LEN);
+	p += AUFS_WH_PFX_LEN;
+	head = whlist->heads;
+	for (i = 0; !err && i < AuSize_NHASH; i++, head++) {
+		hlist_for_each_entry(tpos, pos, head, wh_hash) {
+			if (tpos->wh_bindex != bindex)
+				continue;
+
+			str = &tpos->wh_str;
+			if (str->len + AUFS_WH_PFX_LEN <= PATH_MAX) {
+				memcpy(p, str->name, str->len);
+				wh_name.len = AUFS_WH_PFX_LEN + str->len;
+				err = unlink_wh_name(h_dentry, &wh_name, br);
+				if (!err)
+					continue;
+				break;
+			}
+			AuIOErr("whiteout name too long %.*s\n",
+				str->len, str->name);
+			err = -EIO;
+			break;
+		}
+	}
+	__putname(wh_name.name);
+
+ out:
+	return err;
+}
+
+struct del_wh_children_args {
+	int *errp;
+	struct dentry *h_dentry;
+	struct au_nhash *whlist;
+	aufs_bindex_t bindex;
+	struct au_branch *br;
+};
+
+static void call_del_wh_children(void *args)
+{
+	struct del_wh_children_args *a = args;
+	*a->errp = del_wh_children(a->h_dentry, a->whlist, a->bindex, a->br);
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * rmdir the whiteouted temporary named dir @h_dentry.
+ * @whlist: whiteouted children.
+ */
+int au_whtmp_rmdir(struct inode *dir, aufs_bindex_t bindex,
+		   struct dentry *wh_dentry, struct au_nhash *whlist)
+{
+	int err;
+	struct path h_tmp;
+	struct inode *wh_inode, *h_dir;
+	struct au_branch *br;
+
+	h_dir = wh_dentry->d_parent->d_inode; /* dir inode is locked */
+	IMustLock(h_dir);
+
+	br = au_sbr(dir->i_sb, bindex);
+	wh_inode = wh_dentry->d_inode;
+	mutex_lock_nested(&wh_inode->i_mutex, AuLsc_I_CHILD);
+
+	/*
+	 * someone else might change some whiteouts while we were sleeping.
+	 * it means this whlist may have an obsoleted entry.
+	 */
+	if (!au_test_h_perm_sio(wh_inode, MAY_EXEC | MAY_WRITE))
+		err = del_wh_children(wh_dentry, whlist, bindex, br);
+	else {
+		int wkq_err;
+		struct del_wh_children_args args = {
+			.errp		= &err,
+			.h_dentry	= wh_dentry,
+			.whlist		= whlist,
+			.bindex		= bindex,
+			.br		= br
+		};
+
+		wkq_err = au_wkq_wait(call_del_wh_children, &args);
+		if (unlikely(wkq_err))
+			err = wkq_err;
+	}
+	mutex_unlock(&wh_inode->i_mutex);
+
+	if (!err) {
+		h_tmp.dentry = wh_dentry;
+		h_tmp.mnt = br->br_mnt;
+		err = vfsub_rmdir(h_dir, &h_tmp);
+		/* d_drop(h_dentry); */
+	}
+
+	if (!err) {
+		if (au_ibstart(dir) == bindex) {
+			au_cpup_attr_timesizes(dir);
+			drop_nlink(dir);
+		}
+		return 0; /* success */
+	}
+
+	AuWarn("failed removing %.*s(%d), ignored\n",
+	       AuDLNPair(wh_dentry), err);
+	return err;
+}
+
+static void au_whtmp_rmdir_free_args(struct au_whtmp_rmdir_args *args)
+{
+	au_nhash_fin(&args->whlist);
+	dput(args->wh_dentry);
+	iput(args->dir);
+	kfree(args);
+}
+
+static void call_rmdir_whtmp(void *args)
+{
+	int err;
+	struct au_whtmp_rmdir_args *a = args;
+	struct super_block *sb;
+	struct dentry *h_parent;
+	struct inode *h_dir;
+	struct au_branch *br;
+	struct au_hinode *hdir;
+
+	/* rmdir by nfsd may cause deadlock with this i_mutex */
+	/* mutex_lock(&a->dir->i_mutex); */
+	sb = a->dir->i_sb;
+	si_noflush_read_lock(sb);
+	err = au_test_ro(sb, a->bindex, NULL);
+	if (unlikely(err))
+		goto out;
+
+	err = -EIO;
+	br = au_sbr(sb, a->bindex);
+	ii_write_lock_parent(a->dir);
+	h_parent = dget_parent(a->wh_dentry);
+	h_dir = h_parent->d_inode;
+	hdir = au_hi(a->dir, a->bindex);
+	au_hin_imtx_lock_nested(hdir, AuLsc_I_PARENT);
+	err = au_h_verify(a->wh_dentry, au_opt_udba(sb), h_dir, h_parent, br);
+	if (!err) {
+		err = mnt_want_write(br->br_mnt);
+		if (!err) {
+			err = au_whtmp_rmdir(a->dir, a->bindex, a->wh_dentry,
+					     &a->whlist);
+			mnt_drop_write(br->br_mnt);
+		}
+	}
+	au_hin_imtx_unlock(hdir);
+	dput(h_parent);
+	ii_write_unlock(a->dir);
+
+ out:
+	/* mutex_unlock(&a->dir->i_mutex); */
+	au_nwt_done(&au_sbi(sb)->si_nowait);
+	si_read_unlock(sb);
+	au_whtmp_rmdir_free_args(a);
+	if (unlikely(err))
+		AuIOErr("err %d\n", err);
+}
+
+void au_whtmp_kick_rmdir(struct inode *dir, aufs_bindex_t bindex,
+			 struct dentry *wh_dentry, struct au_nhash *whlist,
+			 struct au_whtmp_rmdir_args *args)
+{
+	int wkq_err;
+
+	IMustLock(dir);
+
+	/* all post-process will be done in do_rmdir_whtmp(). */
+	args->dir = au_igrab(dir);
+	args->bindex = bindex;
+	args->wh_dentry = dget(wh_dentry);
+	au_nhash_init(&args->whlist);
+	au_nhash_move(&args->whlist, whlist);
+	wkq_err = au_wkq_nowait(call_rmdir_whtmp, args, dir->i_sb);
+	if (unlikely(wkq_err)) {
+		AuWarn("rmdir error %.*s (%d), ignored\n",
+		       AuDLNPair(wh_dentry), wkq_err);
+		au_whtmp_rmdir_free_args(args);
+	}
+}
diff --git a/fs/aufs/whout.h b/fs/aufs/whout.h
new file mode 100644
index 0000000..084e34c
--- /dev/null
+++ b/fs/aufs/whout.h
@@ -0,0 +1,78 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * whiteout for logical deletion and opaque directory
+ */
+
+#ifndef __AUFS_WHOUT_H__
+#define __AUFS_WHOUT_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+#include <linux/aufs_type.h>
+#include "dir.h"
+
+/* whout.c */
+int au_wh_name_alloc(struct qstr *wh, const struct qstr *name);
+struct au_branch;
+int au_wh_test(struct dentry *h_parent, struct qstr *wh_name,
+	       struct au_branch *br, int try_sio);
+int au_diropq_test(struct dentry *h_dentry, struct au_branch *br);
+struct dentry *au_whtmp_lkup(struct dentry *h_parent, struct au_branch *br,
+			     struct qstr *prefix);
+int au_whtmp_ren(struct dentry *h_dentry, struct au_branch *br);
+int au_wh_unlink_dentry(struct inode *h_dir, struct path *h_path,
+			struct dentry *dentry);
+int au_wh_init(struct dentry *h_parent, struct au_branch *br,
+	       struct super_block *sb);
+
+/* diropq flags */
+#define AuDiropq_CREATE	1
+#define au_ftest_diropq(flags, name)	((flags) & AuDiropq_##name)
+#define au_fset_diropq(flags, name)	{ (flags) |= AuDiropq_##name; }
+#define au_fclr_diropq(flags, name)	{ (flags) &= ~AuDiropq_##name; }
+
+struct dentry *au_diropq_sio(struct dentry *dentry, aufs_bindex_t bindex,
+			     unsigned int flags);
+struct dentry *au_wh_lkup(struct dentry *h_parent, struct qstr *base_name,
+			  struct au_branch *br);
+struct dentry *au_wh_create(struct dentry *dentry, aufs_bindex_t bindex,
+			    struct dentry *h_parent);
+
+/* real rmdir for the whiteout-ed dir */
+struct au_whtmp_rmdir_args {
+	struct inode *dir;
+	aufs_bindex_t bindex;
+	struct dentry *wh_dentry;
+	struct au_nhash whlist;
+};
+
+int au_whtmp_rmdir(struct inode *dir, aufs_bindex_t bindex,
+		   struct dentry *wh_dentry, struct au_nhash *whlist);
+void au_whtmp_kick_rmdir(struct inode *dir, aufs_bindex_t bindex,
+			 struct dentry *wh_dentry, struct au_nhash *whlist,
+			 struct au_whtmp_rmdir_args *args);
+
+/* ---------------------------------------------------------------------- */
+
+static inline struct dentry *au_diropq_create(struct dentry *dentry,
+					      aufs_bindex_t bindex)
+{
+	return au_diropq_sio(dentry, bindex, AuDiropq_CREATE);
+}
+
+static inline int au_diropq_remove(struct dentry *dentry, aufs_bindex_t bindex)
+{
+	return PTR_ERR(au_diropq_sio(dentry, bindex, !AuDiropq_CREATE));
+}
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_WHOUT_H__ */
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 14/28] aufs pseudo-link
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (12 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 13/28] aufs whiteout J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 15/28] aufs policies to select one among multiple writable branches J. R. Okajima
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
pseudo-link is a logical hardlink over branches or different filesystems.

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/plink.c |  335 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 335 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/plink.c

diff --git a/fs/aufs/plink.c b/fs/aufs/plink.c
new file mode 100644
index 0000000..62d258f
--- /dev/null
+++ b/fs/aufs/plink.c
@@ -0,0 +1,335 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * pseudo-link
+ */
+
+#include "aufs.h"
+
+/*
+ * during a user process maintains the pseudo-links,
+ * prohibit adding a new plink and branch manipulation.
+ */
+void au_plink_block_maintain(struct super_block *sb)
+{
+	struct au_sbinfo *sbi = au_sbi(sb);
+	/* gave up wake_up_bit() */
+	wait_event(sbi->si_plink_wq, !au_ftest_si(sbi, MAINTAIN_PLINK));
+}
+
+/* ---------------------------------------------------------------------- */
+
+struct pseudo_link {
+	struct list_head list;
+	struct inode *inode;
+};
+
+#ifdef CONFIG_AUFS_DEBUG
+void au_plink_list(struct super_block *sb)
+{
+	struct au_sbinfo *sbinfo;
+	struct list_head *plink_list;
+	struct pseudo_link *plink;
+
+	sbinfo = au_sbi(sb);
+	AuDebugOn(!au_opt_test(au_mntflags(sb), PLINK));
+
+	plink_list = &sbinfo->si_plink.head;
+	spin_lock(&sbinfo->si_plink.spin);
+	list_for_each_entry(plink, plink_list, list)
+		AuDbg("%lu\n", plink->inode->i_ino);
+	spin_unlock(&sbinfo->si_plink.spin);
+}
+#endif
+
+/* is the inode pseudo-linked? */
+int au_plink_test(struct inode *inode)
+{
+	int found;
+	struct au_sbinfo *sbinfo;
+	struct list_head *plink_list;
+	struct pseudo_link *plink;
+
+	sbinfo = au_sbi(inode->i_sb);
+	AuDebugOn(!au_opt_test(au_mntflags(inode->i_sb), PLINK));
+
+	found = 0;
+	plink_list = &sbinfo->si_plink.head;
+	spin_lock(&sbinfo->si_plink.spin);
+	list_for_each_entry(plink, plink_list, list)
+		if (plink->inode == inode) {
+			found = 1;
+			break;
+		}
+	spin_unlock(&sbinfo->si_plink.spin);
+	return found;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * generate a name for plink.
+ * the file will be stored under AUFS_WH_PLINKDIR.
+ */
+/* 20 is max digits length of ulong 64 */
+#define PLINK_NAME_LEN	((20 + 1) * 2)
+
+static int plink_name(char *name, int len, struct inode *inode,
+		      aufs_bindex_t bindex)
+{
+	int rlen;
+	struct inode *h_inode;
+
+	h_inode = au_h_iptr(inode, bindex);
+	rlen = snprintf(name, len, "%lu.%lu", inode->i_ino, h_inode->i_ino);
+	return rlen;
+}
+
+/* lookup the plink-ed @inode under the branch at @bindex */
+struct dentry *au_plink_lkup(struct inode *inode, aufs_bindex_t bindex)
+{
+	struct dentry *h_dentry, *h_parent;
+	struct au_branch *br;
+	struct inode *h_dir;
+	char a[PLINK_NAME_LEN];
+	struct qstr tgtname = {
+		.name	= a
+	};
+
+	br = au_sbr(inode->i_sb, bindex);
+	h_parent = br->br_wbr->wbr_plink;
+	h_dir = h_parent->d_inode;
+	tgtname.len = plink_name(a, sizeof(a), inode, bindex);
+
+	/* always superio. */
+	mutex_lock_nested(&h_dir->i_mutex, AuLsc_I_CHILD2);
+	h_dentry = au_sio_lkup_one(&tgtname, h_parent, br);
+	mutex_unlock(&h_dir->i_mutex);
+	return h_dentry;
+}
+
+/* create a pseudo-link */
+static int do_whplink(struct qstr *tgt, struct dentry *h_parent,
+		      struct dentry *h_dentry, struct au_branch *br)
+{
+	int err;
+	struct path h_path = {
+		.mnt = br->br_mnt
+	};
+	struct inode *h_dir;
+
+	h_dir = h_parent->d_inode;
+ again:
+	h_path.dentry = au_lkup_one(tgt, h_parent, br, /*nd*/NULL);
+	err = PTR_ERR(h_path.dentry);
+	if (IS_ERR(h_path.dentry))
+		goto out;
+
+	err = 0;
+	/* wh.plink dir is not monitored */
+	if (h_path.dentry->d_inode
+	    && h_path.dentry->d_inode != h_dentry->d_inode) {
+		err = vfsub_unlink(h_dir, &h_path, /*force*/0);
+		dput(h_path.dentry);
+		h_path.dentry = NULL;
+		if (!err)
+			goto again;
+	}
+	if (!err && !h_path.dentry->d_inode)
+		err = vfsub_link(h_dentry, h_dir, &h_path);
+	dput(h_path.dentry);
+
+ out:
+	return err;
+}
+
+struct do_whplink_args {
+	int *errp;
+	struct qstr *tgt;
+	struct dentry *h_parent;
+	struct dentry *h_dentry;
+	struct au_branch *br;
+};
+
+static void call_do_whplink(void *args)
+{
+	struct do_whplink_args *a = args;
+	*a->errp = do_whplink(a->tgt, a->h_parent, a->h_dentry, a->br);
+}
+
+static int whplink(struct dentry *h_dentry, struct inode *inode,
+		   aufs_bindex_t bindex, struct au_branch *br)
+{
+	int err, wkq_err;
+	struct au_wbr *wbr;
+	struct dentry *h_parent;
+	struct inode *h_dir;
+	char a[PLINK_NAME_LEN];
+	struct qstr tgtname = {
+		.name = a
+	};
+
+	wbr = au_sbr(inode->i_sb, bindex)->br_wbr;
+	h_parent = wbr->wbr_plink;
+	h_dir = h_parent->d_inode;
+	tgtname.len = plink_name(a, sizeof(a), inode, bindex);
+
+	/* always superio. */
+	mutex_lock_nested(&h_dir->i_mutex, AuLsc_I_CHILD2);
+	if (!au_test_wkq(current)) {
+		struct do_whplink_args args = {
+			.errp		= &err,
+			.tgt		= &tgtname,
+			.h_parent	= h_parent,
+			.h_dentry	= h_dentry,
+			.br		= br
+		};
+		wkq_err = au_wkq_wait(call_do_whplink, &args);
+		if (unlikely(wkq_err))
+			err = wkq_err;
+	} else
+		err = do_whplink(&tgtname, h_parent, h_dentry, br);
+	mutex_unlock(&h_dir->i_mutex);
+
+	return err;
+}
+
+/* free a single plink */
+static void do_put_plink(struct pseudo_link *plink, int do_del)
+{
+	iput(plink->inode);
+	if (do_del)
+		list_del(&plink->list);
+	kfree(plink);
+}
+
+/*
+ * create a new pseudo-link for @h_dentry on @bindex.
+ * the linked inode is held in aufs @inode.
+ */
+void au_plink_append(struct inode *inode, aufs_bindex_t bindex,
+		     struct dentry *h_dentry)
+{
+	struct super_block *sb;
+	struct au_sbinfo *sbinfo;
+	struct list_head *plink_list;
+	struct pseudo_link *plink;
+	int found, err, cnt;
+
+	sb = inode->i_sb;
+	sbinfo = au_sbi(sb);
+	AuDebugOn(!au_opt_test(au_mntflags(sb), PLINK));
+
+	err = 0;
+	cnt = 0;
+	found = 0;
+	plink_list = &sbinfo->si_plink.head;
+	spin_lock(&sbinfo->si_plink.spin);
+	list_for_each_entry(plink, plink_list, list) {
+		cnt++;
+		if (plink->inode == inode) {
+			found = 1;
+			break;
+		}
+	}
+	if (found) {
+		spin_unlock(&sbinfo->si_plink.spin);
+		return;
+	}
+
+	plink = NULL;
+	if (!found) {
+		plink = kmalloc(sizeof(*plink), GFP_ATOMIC);
+		if (plink) {
+			plink->inode = au_igrab(inode);
+			list_add(&plink->list, plink_list);
+			cnt++;
+		} else
+			err = -ENOMEM;
+	}
+	spin_unlock(&sbinfo->si_plink.spin);
+
+	if (!err) {
+		au_plink_block_maintain(sb);
+		err = whplink(h_dentry, inode, bindex, au_sbr(sb, bindex));
+	}
+
+	if (unlikely(cnt > AUFS_PLINK_WARN))
+		AuWarn1("unexpectedly many pseudo links, %d\n", cnt);
+	if (unlikely(err)) {
+		AuWarn("err %d, damaged pseudo link.\n", err);
+		if (!found && plink)
+			do_put_plink(plink, /*do_del*/1);
+	}
+}
+
+/* free all plinks */
+void au_plink_put(struct super_block *sb)
+{
+	struct au_sbinfo *sbinfo;
+	struct list_head *plink_list;
+	struct pseudo_link *plink, *tmp;
+
+	sbinfo = au_sbi(sb);
+	AuDebugOn(!au_opt_test(au_mntflags(sb), PLINK));
+
+	plink_list = &sbinfo->si_plink.head;
+	/* no spin_lock since sbinfo is write-locked */
+	list_for_each_entry_safe(plink, tmp, plink_list, list)
+		do_put_plink(plink, 0);
+	INIT_LIST_HEAD(plink_list);
+}
+
+/* free the plinks on a branch specified by @br_id */
+void au_plink_half_refresh(struct super_block *sb, aufs_bindex_t br_id)
+{
+	struct au_sbinfo *sbinfo;
+	struct list_head *plink_list;
+	struct pseudo_link *plink, *tmp;
+	struct inode *inode;
+	aufs_bindex_t bstart, bend, bindex;
+	unsigned char do_put;
+
+	sbinfo = au_sbi(sb);
+	AuDebugOn(!au_opt_test(au_mntflags(sb), PLINK));
+
+	plink_list = &sbinfo->si_plink.head;
+	/* no spin_lock since sbinfo is write-locked */
+	list_for_each_entry_safe(plink, tmp, plink_list, list) {
+		do_put = 0;
+		inode = au_igrab(plink->inode);
+		ii_write_lock_child(inode);
+		bstart = au_ibstart(inode);
+		bend = au_ibend(inode);
+		if (bstart >= 0) {
+			for (bindex = bstart; bindex <= bend; bindex++) {
+				if (!au_h_iptr(inode, bindex)
+				    || au_ii_br_id(inode, bindex) != br_id)
+					continue;
+				au_set_h_iptr(inode, bindex, NULL, 0);
+				do_put = 1;
+				break;
+			}
+		} else
+			do_put_plink(plink, 1);
+
+		if (do_put) {
+			for (bindex = bstart; bindex <= bend; bindex++)
+				if (au_h_iptr(inode, bindex)) {
+					do_put = 0;
+					break;
+				}
+			if (do_put)
+				do_put_plink(plink, 1);
+		}
+		ii_write_unlock(inode);
+		iput(inode);
+	}
+}
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 15/28] aufs policies to select one among multiple writable branches
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (13 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 14/28] aufs pseudo-link J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 16/28] aufs dentry and lookup J. R. Okajima
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
when creating/removing an entry and there are multiple writable branches
specified in aufs, aufs has to decide which branch is the target.
aufs has several policies for that such as, round-robin, most-free-space
and so on.

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/wbr_policy.c |  628 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 628 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/wbr_policy.c

diff --git a/fs/aufs/wbr_policy.c b/fs/aufs/wbr_policy.c
new file mode 100644
index 0000000..436ebfc
--- /dev/null
+++ b/fs/aufs/wbr_policy.c
@@ -0,0 +1,628 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * policies for selecting one among multiple writable branches
+ */
+
+#include <linux/statfs.h>
+#include "aufs.h"
+
+/* subset of cpup_attr() */
+static noinline_for_stack
+int au_cpdown_attr(struct path *h_path, struct dentry *h_src)
+{
+	int err, sbits;
+	struct iattr ia;
+	struct inode *h_isrc;
+
+	h_isrc = h_src->d_inode;
+	ia.ia_valid = ATTR_FORCE | ATTR_MODE | ATTR_UID | ATTR_GID;
+	ia.ia_mode = h_isrc->i_mode;
+	ia.ia_uid = h_isrc->i_uid;
+	ia.ia_gid = h_isrc->i_gid;
+	sbits = !!(ia.ia_mode & (S_ISUID | S_ISGID));
+	au_cpup_attr_flags(h_path->dentry->d_inode, h_isrc);
+	err = vfsub_sio_notify_change(h_path, &ia);
+
+	/* is this nfs only? */
+	if (!err && sbits && au_test_nfs(h_path->dentry->d_sb)) {
+		ia.ia_valid = ATTR_FORCE | ATTR_MODE;
+		ia.ia_mode = h_isrc->i_mode;
+		err = vfsub_sio_notify_change(h_path, &ia);
+	}
+
+	return err;
+}
+
+#define AuCpdown_PARENT_OPQ	1
+#define AuCpdown_WHED		(1 << 1)
+#define AuCpdown_MADE_DIR	(1 << 2)
+#define AuCpdown_DIROPQ		(1 << 3)
+#define au_ftest_cpdown(flags, name)	((flags) & AuCpdown_##name)
+#define au_fset_cpdown(flags, name)	{ (flags) |= AuCpdown_##name; }
+#define au_fclr_cpdown(flags, name)	{ (flags) &= ~AuCpdown_##name; }
+
+struct au_cpdown_dir_args {
+	struct dentry *parent;
+	unsigned int flags;
+};
+
+static int au_cpdown_dir_opq(struct dentry *dentry, aufs_bindex_t bdst,
+			     struct au_cpdown_dir_args *a)
+{
+	int err;
+	struct dentry *opq_dentry;
+
+	opq_dentry = au_diropq_create(dentry, bdst);
+	err = PTR_ERR(opq_dentry);
+	if (IS_ERR(opq_dentry))
+		goto out;
+	dput(opq_dentry);
+	au_fset_cpdown(a->flags, DIROPQ);
+
+ out:
+	return err;
+}
+
+static int au_cpdown_dir_wh(struct dentry *dentry, struct dentry *h_parent,
+			    struct inode *dir, aufs_bindex_t bdst)
+{
+	int err;
+	struct path h_path;
+	struct au_branch *br;
+
+	br = au_sbr(dentry->d_sb, bdst);
+	h_path.dentry = au_wh_lkup(h_parent, &dentry->d_name, br);
+	err = PTR_ERR(h_path.dentry);
+	if (IS_ERR(h_path.dentry))
+		goto out;
+
+	err = 0;
+	if (h_path.dentry->d_inode) {
+		h_path.mnt = br->br_mnt;
+		err = au_wh_unlink_dentry(au_h_iptr(dir, bdst), &h_path,
+					  dentry);
+	}
+	dput(h_path.dentry);
+
+ out:
+	return err;
+}
+
+static int au_cpdown_dir(struct dentry *dentry, aufs_bindex_t bdst,
+			 struct dentry *h_parent, void *arg)
+{
+	int err, rerr;
+	aufs_bindex_t bend, bopq, bstart;
+	unsigned char parent_opq;
+	struct path h_path;
+	struct dentry *parent;
+	struct inode *h_dir, *h_inode, *inode, *dir;
+	struct au_cpdown_dir_args *args = arg;
+
+	bstart = au_dbstart(dentry);
+	/* dentry is di-locked */
+	parent = dget_parent(dentry);
+	dir = parent->d_inode;
+	h_dir = h_parent->d_inode;
+	AuDebugOn(h_dir != au_h_iptr(dir, bdst));
+	IMustLock(h_dir);
+
+	err = au_lkup_neg(dentry, bdst);
+	if (unlikely(err < 0))
+		goto out;
+	h_path.dentry = au_h_dptr(dentry, bdst);
+	h_path.mnt = au_sbr_mnt(dentry->d_sb, bdst);
+	err = vfsub_sio_mkdir(au_h_iptr(dir, bdst), &h_path,
+			      S_IRWXU | S_IRUGO | S_IXUGO);
+	if (unlikely(err))
+		goto out_put;
+	au_fset_cpdown(args->flags, MADE_DIR);
+
+	bend = au_dbend(dentry);
+	bopq = au_dbdiropq(dentry);
+	au_fclr_cpdown(args->flags, WHED);
+	au_fclr_cpdown(args->flags, DIROPQ);
+	if (au_dbwh(dentry) == bdst)
+		au_fset_cpdown(args->flags, WHED);
+	if (!au_ftest_cpdown(args->flags, PARENT_OPQ) && bopq <= bdst)
+		au_fset_cpdown(args->flags, PARENT_OPQ);
+	parent_opq = (au_ftest_cpdown(args->flags, PARENT_OPQ)
+		      && args->parent == dentry);
+	h_inode = h_path.dentry->d_inode;
+	mutex_lock_nested(&h_inode->i_mutex, AuLsc_I_CHILD);
+	if (au_ftest_cpdown(args->flags, WHED)) {
+		err = au_cpdown_dir_opq(dentry, bdst, args);
+		if (unlikely(err)) {
+			mutex_unlock(&h_inode->i_mutex);
+			goto out_dir;
+		}
+	}
+
+	err = au_cpdown_attr(&h_path, au_h_dptr(dentry, bstart));
+	mutex_unlock(&h_inode->i_mutex);
+	if (unlikely(err))
+		goto out_opq;
+
+	if (au_ftest_cpdown(args->flags, WHED)) {
+		err = au_cpdown_dir_wh(dentry, h_parent, dir, bdst);
+		if (unlikely(err))
+			goto out_opq;
+	}
+
+	inode = dentry->d_inode;
+	if (au_ibend(inode) < bdst)
+		au_set_ibend(inode, bdst);
+	au_set_h_iptr(inode, bdst, au_igrab(h_inode),
+		      au_hi_flags(inode, /*isdir*/1));
+	goto out; /* success */
+
+	/* revert */
+ out_opq:
+	if (au_ftest_cpdown(args->flags, DIROPQ)) {
+		mutex_lock_nested(&h_inode->i_mutex, AuLsc_I_CHILD);
+		rerr = au_diropq_remove(dentry, bdst);
+		mutex_unlock(&h_inode->i_mutex);
+		if (unlikely(rerr)) {
+			AuIOErr("failed removing diropq for %.*s b%d (%d)\n",
+				AuDLNPair(dentry), bdst, rerr);
+			err = -EIO;
+			goto out;
+		}
+	}
+ out_dir:
+	if (au_ftest_cpdown(args->flags, MADE_DIR)) {
+		rerr = vfsub_sio_rmdir(au_h_iptr(dir, bdst), &h_path);
+		if (unlikely(rerr)) {
+			AuIOErr("failed removing %.*s b%d (%d)\n",
+				AuDLNPair(dentry), bdst, rerr);
+			err = -EIO;
+		}
+	}
+ out_put:
+	au_set_h_dptr(dentry, bdst, NULL);
+	if (au_dbend(dentry) == bdst)
+		au_update_dbend(dentry);
+ out:
+	dput(parent);
+	return err;
+}
+
+int au_cpdown_dirs(struct dentry *dentry, aufs_bindex_t bdst)
+{
+	int err;
+	struct au_cpdown_dir_args args = {
+		.parent	= dget_parent(dentry),
+		.flags	= 0
+	};
+
+	err = au_cp_dirs(dentry, bdst, au_cpdown_dir, &args);
+	dput(args.parent);
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* policies for create */
+
+static int au_wbr_bu(struct super_block *sb, aufs_bindex_t bindex)
+{
+	for (; bindex >= 0; bindex--)
+		if (!au_br_rdonly(au_sbr(sb, bindex)))
+			return bindex;
+	return -EROFS;
+}
+
+/* top down parent */
+static int au_wbr_create_tdp(struct dentry *dentry, int isdir __maybe_unused)
+{
+	int err;
+	aufs_bindex_t bstart, bindex;
+	struct super_block *sb;
+	struct dentry *parent, *h_parent;
+
+	sb = dentry->d_sb;
+	bstart = au_dbstart(dentry);
+	err = bstart;
+	if (!au_br_rdonly(au_sbr(sb, bstart)))
+		goto out;
+
+	err = -EROFS;
+	parent = dget_parent(dentry);
+	for (bindex = au_dbstart(parent); bindex < bstart; bindex++) {
+		h_parent = au_h_dptr(parent, bindex);
+		if (!h_parent || !h_parent->d_inode)
+			continue;
+
+		if (!au_br_rdonly(au_sbr(sb, bindex))) {
+			err = bindex;
+			break;
+		}
+	}
+	dput(parent);
+
+	/* bottom up here */
+	if (unlikely(err < 0))
+		err = au_wbr_bu(sb, bstart - 1);
+
+ out:
+	AuDbg("b%d\n", err);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* an exception for the policy other than tdp */
+static int au_wbr_create_exp(struct dentry *dentry)
+{
+	int err;
+	aufs_bindex_t bwh, bdiropq;
+	struct dentry *parent;
+
+	err = -1;
+	bwh = au_dbwh(dentry);
+	parent = dget_parent(dentry);
+	bdiropq = au_dbdiropq(parent);
+	if (bwh >= 0) {
+		if (bdiropq >= 0)
+			err = min(bdiropq, bwh);
+		else
+			err = bwh;
+		AuDbg("%d\n", err);
+	} else if (bdiropq >= 0) {
+		err = bdiropq;
+		AuDbg("%d\n", err);
+	}
+	dput(parent);
+
+	if (err >= 0 && au_br_rdonly(au_sbr(dentry->d_sb, err)))
+		err = -1;
+
+	AuDbg("%d\n", err);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* round robin */
+static int au_wbr_create_init_rr(struct super_block *sb)
+{
+	int err;
+
+	err = au_wbr_bu(sb, au_sbend(sb));
+	atomic_set(&au_sbi(sb)->si_wbr_rr_next, -err); /* less important */
+
+	AuDbg("b%d\n", err);
+	return err;
+}
+
+static int au_wbr_create_rr(struct dentry *dentry, int isdir)
+{
+	int err, nbr;
+	unsigned int u;
+	aufs_bindex_t bindex, bend;
+	struct super_block *sb;
+	atomic_t *next;
+
+	err = au_wbr_create_exp(dentry);
+	if (err >= 0)
+		goto out;
+
+	sb = dentry->d_sb;
+	next = &au_sbi(sb)->si_wbr_rr_next;
+	bend = au_sbend(sb);
+	nbr = bend + 1;
+	for (bindex = 0; bindex <= bend; bindex++) {
+		if (!isdir) {
+			err = atomic_dec_return(next) + 1;
+			/* modulo for 0 is meaningless */
+			if (unlikely(!err))
+				err = atomic_dec_return(next) + 1;
+		} else
+			err = atomic_read(next);
+		AuDbg("%d\n", err);
+		u = err;
+		err = u % nbr;
+		AuDbg("%d\n", err);
+		if (!au_br_rdonly(au_sbr(sb, err)))
+			break;
+		err = -EROFS;
+	}
+
+ out:
+	AuDbg("%d\n", err);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* most free space */
+static void au_mfs(struct dentry *dentry)
+{
+	struct super_block *sb;
+	struct au_branch *br;
+	struct au_wbr_mfs *mfs;
+	aufs_bindex_t bindex, bend;
+	int err;
+	unsigned long long b, bavail;
+	/* reduce the stack usage */
+	struct kstatfs *st;
+
+	st = kmalloc(sizeof(*st), GFP_NOFS);
+	if (unlikely(!st)) {
+		AuWarn1("failed updating mfs(%d), ignored\n", -ENOMEM);
+		return;
+	}
+
+	bavail = 0;
+	sb = dentry->d_sb;
+	mfs = &au_sbi(sb)->si_wbr_mfs;
+	mfs->mfs_bindex = -EROFS;
+	mfs->mfsrr_bytes = 0;
+	bend = au_sbend(sb);
+	for (bindex = 0; bindex <= bend; bindex++) {
+		br = au_sbr(sb, bindex);
+		if (au_br_rdonly(br))
+			continue;
+
+		/* sb->s_root for NFS is unreliable */
+		err = vfs_statfs(br->br_mnt->mnt_root, st);
+		if (unlikely(err)) {
+			AuWarn1("failed statfs, b%d, %d\n", bindex, err);
+			continue;
+		}
+
+		/* when the available size is equal, select the lower one */
+		BUILD_BUG_ON(sizeof(b) < sizeof(st->f_bavail)
+			     || sizeof(b) < sizeof(st->f_bsize));
+		b = st->f_bavail * st->f_bsize;
+		br->br_wbr->wbr_bytes = b;
+		if (b >= bavail) {
+			bavail = b;
+			mfs->mfs_bindex = bindex;
+			mfs->mfs_jiffy = jiffies;
+		}
+	}
+
+	mfs->mfsrr_bytes = bavail;
+	AuDbg("b%d\n", mfs->mfs_bindex);
+	kfree(st);
+}
+
+static int au_wbr_create_mfs(struct dentry *dentry, int isdir __maybe_unused)
+{
+	int err;
+	struct super_block *sb;
+	struct au_wbr_mfs *mfs;
+
+	err = au_wbr_create_exp(dentry);
+	if (err >= 0)
+		goto out;
+
+	sb = dentry->d_sb;
+	mfs = &au_sbi(sb)->si_wbr_mfs;
+	mutex_lock(&mfs->mfs_lock);
+	if (time_after(jiffies, mfs->mfs_jiffy + mfs->mfs_expire)
+	    || mfs->mfs_bindex < 0
+	    || au_br_rdonly(au_sbr(sb, mfs->mfs_bindex)))
+		au_mfs(dentry);
+	mutex_unlock(&mfs->mfs_lock);
+	err = mfs->mfs_bindex;
+
+ out:
+	AuDbg("b%d\n", err);
+	return err;
+}
+
+static int au_wbr_create_init_mfs(struct super_block *sb)
+{
+	struct au_wbr_mfs *mfs;
+
+	mfs = &au_sbi(sb)->si_wbr_mfs;
+	mutex_init(&mfs->mfs_lock);
+	mfs->mfs_jiffy = 0;
+	mfs->mfs_bindex = -EROFS;
+
+	return 0;
+}
+
+static int au_wbr_create_fin_mfs(struct super_block *sb __maybe_unused)
+{
+	mutex_destroy(&au_sbi(sb)->si_wbr_mfs.mfs_lock);
+	return 0;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* most free space and then round robin */
+static int au_wbr_create_mfsrr(struct dentry *dentry, int isdir)
+{
+	int err;
+	struct au_wbr_mfs *mfs;
+
+	err = au_wbr_create_mfs(dentry, isdir);
+	if (err >= 0) {
+		mfs = &au_sbi(dentry->d_sb)->si_wbr_mfs;
+		if (mfs->mfsrr_bytes < mfs->mfsrr_watermark)
+			err = au_wbr_create_rr(dentry, isdir);
+	}
+
+	AuDbg("b%d\n", err);
+	return err;
+}
+
+static int au_wbr_create_init_mfsrr(struct super_block *sb)
+{
+	int err;
+
+	au_wbr_create_init_mfs(sb); /* ignore */
+	err = au_wbr_create_init_rr(sb);
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* top down parent and most free space */
+static int au_wbr_create_pmfs(struct dentry *dentry, int isdir)
+{
+	int err, e2;
+	unsigned long long b;
+	aufs_bindex_t bindex, bstart, bend;
+	struct super_block *sb;
+	struct dentry *parent, *h_parent;
+	struct au_branch *br;
+
+	err = au_wbr_create_tdp(dentry, isdir);
+	if (unlikely(err < 0))
+		goto out;
+	parent = dget_parent(dentry);
+	bstart = au_dbstart(parent);
+	bend = au_dbtaildir(parent);
+	if (bstart == bend)
+		goto out_parent; /* success */
+
+	e2 = au_wbr_create_mfs(dentry, isdir);
+	if (e2 < 0)
+		goto out_parent; /* success */
+
+	/* when the available size is equal, select upper one */
+	sb = dentry->d_sb;
+	br = au_sbr(sb, err);
+	b = br->br_wbr->wbr_bytes;
+	AuDbg("b%d, %llu\n", err, b);
+
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		h_parent = au_h_dptr(parent, bindex);
+		if (!h_parent || !h_parent->d_inode)
+			continue;
+
+		br = au_sbr(sb, bindex);
+		if (!au_br_rdonly(br) && br->br_wbr->wbr_bytes > b) {
+			b = br->br_wbr->wbr_bytes;
+			err = bindex;
+			AuDbg("b%d, %llu\n", err, b);
+		}
+	}
+
+ out_parent:
+	dput(parent);
+ out:
+	AuDbg("b%d\n", err);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* policies for copyup */
+
+/* top down parent */
+static int au_wbr_copyup_tdp(struct dentry *dentry)
+{
+	return au_wbr_create_tdp(dentry, /*isdir, anything is ok*/0);
+}
+
+/* bottom up parent */
+static int au_wbr_copyup_bup(struct dentry *dentry)
+{
+	int err;
+	aufs_bindex_t bindex, bstart;
+	struct dentry *parent, *h_parent;
+	struct super_block *sb;
+
+	err = -EROFS;
+	sb = dentry->d_sb;
+	parent = dget_parent(dentry);
+	bstart = au_dbstart(parent);
+	for (bindex = au_dbstart(dentry); bindex >= bstart; bindex--) {
+		h_parent = au_h_dptr(parent, bindex);
+		if (!h_parent || !h_parent->d_inode)
+			continue;
+
+		if (!au_br_rdonly(au_sbr(sb, bindex))) {
+			err = bindex;
+			break;
+		}
+	}
+	dput(parent);
+
+	/* bottom up here */
+	if (unlikely(err < 0))
+		err = au_wbr_bu(sb, bstart - 1);
+
+	AuDbg("b%d\n", err);
+	return err;
+}
+
+/* bottom up */
+static int au_wbr_copyup_bu(struct dentry *dentry)
+{
+	int err;
+
+	err = au_wbr_bu(dentry->d_sb, au_dbstart(dentry));
+
+	AuDbg("b%d\n", err);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+struct au_wbr_copyup_operations au_wbr_copyup_ops[] = {
+	[AuWbrCopyup_TDP] = {
+		.copyup	= au_wbr_copyup_tdp
+	},
+	[AuWbrCopyup_BUP] = {
+		.copyup	= au_wbr_copyup_bup
+	},
+	[AuWbrCopyup_BU] = {
+		.copyup	= au_wbr_copyup_bu
+	}
+};
+
+struct au_wbr_create_operations au_wbr_create_ops[] = {
+	[AuWbrCreate_TDP] = {
+		.create	= au_wbr_create_tdp
+	},
+	[AuWbrCreate_RR] = {
+		.create	= au_wbr_create_rr,
+		.init	= au_wbr_create_init_rr
+	},
+	[AuWbrCreate_MFS] = {
+		.create	= au_wbr_create_mfs,
+		.init	= au_wbr_create_init_mfs,
+		.fin	= au_wbr_create_fin_mfs
+	},
+	[AuWbrCreate_MFSV] = {
+		.create	= au_wbr_create_mfs,
+		.init	= au_wbr_create_init_mfs,
+		.fin	= au_wbr_create_fin_mfs
+	},
+	[AuWbrCreate_MFSRR] = {
+		.create	= au_wbr_create_mfsrr,
+		.init	= au_wbr_create_init_mfsrr,
+		.fin	= au_wbr_create_fin_mfs
+	},
+	[AuWbrCreate_MFSRRV] = {
+		.create	= au_wbr_create_mfsrr,
+		.init	= au_wbr_create_init_mfsrr,
+		.fin	= au_wbr_create_fin_mfs
+	},
+	[AuWbrCreate_PMFS] = {
+		.create	= au_wbr_create_pmfs,
+		.init	= au_wbr_create_init_mfs,
+		.fin	= au_wbr_create_fin_mfs
+	},
+	[AuWbrCreate_PMFSV] = {
+		.create	= au_wbr_create_pmfs,
+		.init	= au_wbr_create_init_mfs,
+		.fin	= au_wbr_create_fin_mfs
+	}
+};
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 16/28] aufs dentry and lookup
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (14 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 15/28] aufs policies to select one among multiple writable branches J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 17/28] aufs file J. R. Okajima
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
dentry operations and private data

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/dentry.c |  858 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/dentry.h |  213 ++++++++++++++
 fs/aufs/dinfo.c  |  351 ++++++++++++++++++++++
 3 files changed, 1422 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/dentry.c
 create mode 100644 fs/aufs/dentry.h
 create mode 100644 fs/aufs/dinfo.c

diff --git a/fs/aufs/dentry.c b/fs/aufs/dentry.c
new file mode 100644
index 0000000..da4fc58
--- /dev/null
+++ b/fs/aufs/dentry.c
@@ -0,0 +1,858 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * lookup and dentry operations
+ */
+
+#include "aufs.h"
+
+static void au_h_nd(struct nameidata *h_nd, struct nameidata *nd)
+{
+	if (nd) {
+		*h_nd = *nd;
+
+		/*
+		 * gave up supporting LOOKUP_CREATE/OPEN for lower fs,
+		 * due to whiteout and branch permission.
+		 */
+		h_nd->flags &= ~(/*LOOKUP_PARENT |*/ LOOKUP_OPEN | LOOKUP_CREATE
+				 | LOOKUP_FOLLOW);
+		/* unnecessary? */
+		h_nd->intent.open.file = NULL;
+	} else
+		memset(h_nd, 0, sizeof(*h_nd));
+}
+
+struct au_lkup_one_args {
+	struct dentry **errp;
+	struct qstr *name;
+	struct dentry *h_parent;
+	struct au_branch *br;
+	struct nameidata *nd;
+};
+
+struct dentry *au_lkup_one(struct qstr *name, struct dentry *h_parent,
+			   struct au_branch *br, struct nameidata *nd)
+{
+	struct dentry *h_dentry;
+	int err;
+	struct nameidata h_nd;
+
+	if (au_test_fs_null_nd(h_parent->d_sb))
+		return vfsub_lookup_one_len(name->name, h_parent, name->len);
+
+	au_h_nd(&h_nd, nd);
+	h_nd.path.dentry = h_parent;
+	h_nd.path.mnt = br->br_mnt;
+
+	err = __lookup_one_len(name->name, &h_nd.last, NULL, name->len);
+	h_dentry = ERR_PTR(err);
+	if (!err) {
+		path_get(&h_nd.path);
+		h_dentry = vfsub_lookup_hash(&h_nd);
+		path_put(&h_nd.path);
+	}
+
+	return h_dentry;
+}
+
+static void au_call_lkup_one(void *args)
+{
+	struct au_lkup_one_args *a = args;
+	*a->errp = au_lkup_one(a->name, a->h_parent, a->br, a->nd);
+}
+
+#define AuLkup_ALLOW_NEG	1
+#define au_ftest_lkup(flags, name)	((flags) & AuLkup_##name)
+#define au_fset_lkup(flags, name)	{ (flags) |= AuLkup_##name; }
+#define au_fclr_lkup(flags, name)	{ (flags) &= ~AuLkup_##name; }
+
+struct au_do_lookup_args {
+	unsigned int		flags;
+	mode_t			type;
+	struct nameidata	*nd;
+};
+
+/*
+ * returns positive/negative dentry, NULL or an error.
+ * NULL means whiteout-ed or not-found.
+ */
+static struct dentry*
+au_do_lookup(struct dentry *h_parent, struct dentry *dentry,
+	     aufs_bindex_t bindex, struct qstr *wh_name,
+	     struct au_do_lookup_args *args)
+{
+	struct dentry *h_dentry;
+	struct inode *h_inode, *inode;
+	struct qstr *name;
+	struct au_branch *br;
+	int wh_found, opq;
+	unsigned char wh_able;
+	const unsigned char allow_neg = !!au_ftest_lkup(args->flags, ALLOW_NEG);
+
+	name = &dentry->d_name;
+	wh_found = 0;
+	br = au_sbr(dentry->d_sb, bindex);
+	wh_able = !!au_br_whable(br->br_perm);
+	if (wh_able)
+		wh_found = au_wh_test(h_parent, wh_name, br, /*try_sio*/0);
+	h_dentry = ERR_PTR(wh_found);
+	if (!wh_found)
+		goto real_lookup;
+	if (unlikely(wh_found < 0))
+		goto out;
+
+	/* We found a whiteout */
+	/* au_set_dbend(dentry, bindex); */
+	au_set_dbwh(dentry, bindex);
+	if (!allow_neg)
+		return NULL; /* success */
+
+ real_lookup:
+	h_dentry = au_lkup_one(name, h_parent, br, args->nd);
+	if (IS_ERR(h_dentry))
+		goto out;
+
+	h_inode = h_dentry->d_inode;
+	if (!h_inode) {
+		if (!allow_neg)
+			goto out_neg;
+	} else if (wh_found
+		   || (args->type && args->type != (h_inode->i_mode & S_IFMT)))
+		goto out_neg;
+
+	if (au_dbend(dentry) <= bindex)
+		au_set_dbend(dentry, bindex);
+	if (au_dbstart(dentry) < 0 || bindex < au_dbstart(dentry))
+		au_set_dbstart(dentry, bindex);
+	au_set_h_dptr(dentry, bindex, h_dentry);
+
+	inode = dentry->d_inode;
+	if (!h_inode || !S_ISDIR(h_inode->i_mode) || !wh_able
+	    || (inode && !S_ISDIR(inode->i_mode)))
+		goto out; /* success */
+
+	mutex_lock_nested(&h_inode->i_mutex, AuLsc_I_CHILD);
+	opq = au_diropq_test(h_dentry, br);
+	mutex_unlock(&h_inode->i_mutex);
+	if (opq > 0)
+		au_set_dbdiropq(dentry, bindex);
+	else if (unlikely(opq < 0)) {
+		au_set_h_dptr(dentry, bindex, NULL);
+		h_dentry = ERR_PTR(opq);
+	}
+	goto out;
+
+ out_neg:
+	dput(h_dentry);
+	h_dentry = NULL;
+ out:
+	return h_dentry;
+}
+
+/*
+ * returns the number of lower positive dentries,
+ * otherwise an error.
+ * can be called at unlinking with @type is zero.
+ */
+int au_lkup_dentry(struct dentry *dentry, aufs_bindex_t bstart, mode_t type,
+		   struct nameidata *nd)
+{
+	int npositive, err;
+	aufs_bindex_t bindex, btail, bdiropq;
+	unsigned char isdir;
+	struct qstr whname;
+	struct au_do_lookup_args args = {
+		.flags	= 0,
+		.type	= type,
+		.nd	= nd
+	};
+	const struct qstr *name = &dentry->d_name;
+	struct dentry *parent;
+	struct inode *inode;
+
+	err = -EPERM;
+	parent = dget_parent(dentry);
+	if (unlikely(!strncmp(name->name, AUFS_WH_PFX, AUFS_WH_PFX_LEN)))
+		goto out;
+
+	err = au_wh_name_alloc(&whname, name);
+	if (unlikely(err))
+		goto out;
+
+	inode = dentry->d_inode;
+	isdir = !!(inode && S_ISDIR(inode->i_mode));
+	if (!type)
+		au_fset_lkup(args.flags, ALLOW_NEG);
+
+	npositive = 0;
+	btail = au_dbtaildir(parent);
+	for (bindex = bstart; bindex <= btail; bindex++) {
+		struct dentry *h_parent, *h_dentry;
+		struct inode *h_inode, *h_dir;
+
+		h_dentry = au_h_dptr(dentry, bindex);
+		if (h_dentry) {
+			if (h_dentry->d_inode)
+				npositive++;
+			if (type != S_IFDIR)
+				break;
+			continue;
+		}
+		h_parent = au_h_dptr(parent, bindex);
+		if (!h_parent)
+			continue;
+		h_dir = h_parent->d_inode;
+		if (!h_dir || !S_ISDIR(h_dir->i_mode))
+			continue;
+
+		mutex_lock_nested(&h_dir->i_mutex, AuLsc_I_PARENT);
+		h_dentry = au_do_lookup(h_parent, dentry, bindex, &whname,
+					&args);
+		mutex_unlock(&h_dir->i_mutex);
+		err = PTR_ERR(h_dentry);
+		if (IS_ERR(h_dentry))
+			goto out_wh;
+		au_fclr_lkup(args.flags, ALLOW_NEG);
+
+		if (au_dbwh(dentry) >= 0)
+			break;
+		if (!h_dentry)
+			continue;
+		h_inode = h_dentry->d_inode;
+		if (!h_inode)
+			continue;
+		npositive++;
+		if (!args.type)
+			args.type = h_inode->i_mode & S_IFMT;
+		if (args.type != S_IFDIR)
+			break;
+		else if (isdir) {
+			/* the type of lower may be different */
+			bdiropq = au_dbdiropq(dentry);
+			if (bdiropq >= 0 && bdiropq <= bindex)
+				break;
+		}
+	}
+
+	if (npositive) {
+		AuLabel(positive);
+		au_update_dbstart(dentry);
+	}
+	err = npositive;
+	if (unlikely(!au_opt_test(au_mntflags(dentry->d_sb), UDBA_NONE)
+		     && au_dbstart(dentry) < 0))
+		/* both of real entry and whiteout found */
+		err = -EIO;
+
+ out_wh:
+	kfree(whname.name);
+ out:
+	dput(parent);
+	return err;
+}
+
+struct dentry *au_sio_lkup_one(struct qstr *name, struct dentry *parent,
+			       struct au_branch *br)
+{
+	struct dentry *dentry;
+	int wkq_err;
+
+	if (!au_test_h_perm_sio(parent->d_inode, MAY_EXEC))
+		dentry = au_lkup_one(name, parent, br, /*nd*/NULL);
+	else {
+		struct au_lkup_one_args args = {
+			.errp		= &dentry,
+			.name		= name,
+			.h_parent	= parent,
+			.br		= br,
+			.nd		= NULL
+		};
+
+		wkq_err = au_wkq_wait(au_call_lkup_one, &args);
+		if (unlikely(wkq_err))
+			dentry = ERR_PTR(wkq_err);
+	}
+
+	return dentry;
+}
+
+/*
+ * lookup @dentry on @bindex which should be negative.
+ */
+int au_lkup_neg(struct dentry *dentry, aufs_bindex_t bindex)
+{
+	int err;
+	struct dentry *parent, *h_parent, *h_dentry;
+	struct qstr *name;
+
+	name = &dentry->d_name;
+	parent = dget_parent(dentry);
+	h_parent = au_h_dptr(parent, bindex);
+	h_dentry = au_sio_lkup_one(name, h_parent,
+				   au_sbr(dentry->d_sb, bindex));
+	err = PTR_ERR(h_dentry);
+	if (IS_ERR(h_dentry))
+		goto out;
+	if (unlikely(h_dentry->d_inode)) {
+		err = -EIO;
+		AuIOErr("b%d %.*s should be negative.\n",
+			bindex, AuDLNPair(h_dentry));
+		dput(h_dentry);
+		goto out;
+	}
+
+	if (bindex < au_dbstart(dentry))
+		au_set_dbstart(dentry, bindex);
+	if (au_dbend(dentry) < bindex)
+		au_set_dbend(dentry, bindex);
+	au_set_h_dptr(dentry, bindex, h_dentry);
+	err = 0;
+
+ out:
+	dput(parent);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* subset of struct inode */
+struct au_iattr {
+	unsigned long		i_ino;
+	/* unsigned int		i_nlink; */
+	uid_t			i_uid;
+	gid_t			i_gid;
+	u64			i_version;
+/*
+	loff_t			i_size;
+	blkcnt_t		i_blocks;
+*/
+	umode_t			i_mode;
+};
+
+static void au_iattr_save(struct au_iattr *ia, struct inode *h_inode)
+{
+	ia->i_ino = h_inode->i_ino;
+	/* ia->i_nlink = h_inode->i_nlink; */
+	ia->i_uid = h_inode->i_uid;
+	ia->i_gid = h_inode->i_gid;
+	ia->i_version = h_inode->i_version;
+/*
+	ia->i_size = h_inode->i_size;
+	ia->i_blocks = h_inode->i_blocks;
+*/
+	ia->i_mode = (h_inode->i_mode & S_IFMT);
+}
+
+static int au_iattr_test(struct au_iattr *ia, struct inode *h_inode)
+{
+	return ia->i_ino != h_inode->i_ino
+		/* || ia->i_nlink != h_inode->i_nlink */
+		|| ia->i_uid != h_inode->i_uid
+		|| ia->i_gid != h_inode->i_gid
+		|| ia->i_version != h_inode->i_version
+/*
+		|| ia->i_size != h_inode->i_size
+		|| ia->i_blocks != h_inode->i_blocks
+*/
+		|| ia->i_mode != (h_inode->i_mode & S_IFMT);
+}
+
+static int au_h_verify_dentry(struct dentry *h_dentry, struct dentry *h_parent,
+			      struct au_branch *br)
+{
+	int err;
+	struct au_iattr ia;
+	struct dentry *h_d;
+	struct inode *h_inode;
+	struct super_block *h_sb;
+
+	err = 0;
+	memset(&ia, -1, sizeof(ia));
+	h_sb = h_dentry->d_sb;
+	h_inode = h_dentry->d_inode;
+	if (h_inode)
+		au_iattr_save(&ia, h_dentry->d_inode);
+	else if (au_test_nfs(h_sb) || au_test_fuse(h_sb))
+		/* nfs d_revalidate may return 0 for negative dentry */
+		/* fuse d_revalidate always return 0 for negative dentry */
+		goto out;
+
+	/* main purpose is namei.c:cached_lookup() and d_revalidate */
+	h_d = au_lkup_one(&h_dentry->d_name, h_parent, br, /*nd*/NULL);
+	err = PTR_ERR(h_d);
+	if (IS_ERR(h_d))
+		goto out;
+
+	err = 0;
+	if (unlikely(h_d != h_dentry
+		     || h_d->d_inode != h_inode
+		     || (h_inode && au_iattr_test(&ia, h_inode))))
+		err = -EBUSY;
+	dput(h_d);
+
+ out:
+	AuTraceErr(err);
+	return err;
+}
+
+int au_h_verify(struct dentry *h_dentry, unsigned int udba, struct inode *h_dir,
+		struct dentry *h_parent, struct au_branch *br)
+{
+	int err;
+
+	err = 0;
+	if (udba == AuOpt_UDBA_REVAL) {
+		IMustLock(h_dir);
+		err = (h_dentry->d_parent->d_inode != h_dir);
+	} else if (udba == AuOpt_UDBA_HINOTIFY)
+		err = au_h_verify_dentry(h_dentry, h_parent, br);
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static void au_do_refresh_hdentry(struct au_hdentry *p, struct au_dinfo *dinfo,
+				  struct dentry *parent)
+{
+	struct dentry *h_d, *h_dp;
+	struct au_hdentry tmp, *q;
+	struct super_block *sb;
+	aufs_bindex_t new_bindex, bindex, bend, bwh, bdiropq;
+
+	bend = dinfo->di_bend;
+	bwh = dinfo->di_bwh;
+	bdiropq = dinfo->di_bdiropq;
+	for (bindex = dinfo->di_bstart; bindex <= bend; bindex++, p++) {
+		h_d = p->hd_dentry;
+		if (!h_d)
+			continue;
+
+		h_dp = dget_parent(h_d);
+		if (h_dp == au_h_dptr(parent, bindex)) {
+			dput(h_dp);
+			continue;
+		}
+
+		new_bindex = au_find_dbindex(parent, h_dp);
+		dput(h_dp);
+		if (dinfo->di_bwh == bindex)
+			bwh = new_bindex;
+		if (dinfo->di_bdiropq == bindex)
+			bdiropq = new_bindex;
+		if (new_bindex < 0) {
+			au_hdput(p);
+			p->hd_dentry = NULL;
+			continue;
+		}
+
+		/* swap two lower dentries, and loop again */
+		q = dinfo->di_hdentry + new_bindex;
+		tmp = *q;
+		*q = *p;
+		*p = tmp;
+		if (tmp.hd_dentry) {
+			bindex--;
+			p--;
+		}
+	}
+
+	sb = parent->d_sb;
+	dinfo->di_bwh = -1;
+	if (bwh >= 0 && bwh <= au_sbend(sb) && au_sbr_whable(sb, bwh))
+		dinfo->di_bwh = bwh;
+
+	dinfo->di_bdiropq = -1;
+	if (bdiropq >= 0
+	    && bdiropq <= au_sbend(sb)
+	    && au_sbr_whable(sb, bdiropq))
+		dinfo->di_bdiropq = bdiropq;
+
+	bend = au_dbend(parent);
+	p = dinfo->di_hdentry;
+	for (bindex = 0; bindex <= bend; bindex++, p++)
+		if (p->hd_dentry) {
+			dinfo->di_bstart = bindex;
+			break;
+		}
+
+	p = dinfo->di_hdentry + bend;
+	for (bindex = bend; bindex >= 0; bindex--, p--)
+		if (p->hd_dentry) {
+			dinfo->di_bend = bindex;
+			break;
+		}
+}
+
+/*
+ * returns the number of found lower positive dentries,
+ * otherwise an error.
+ */
+int au_refresh_hdentry(struct dentry *dentry, mode_t type)
+{
+	int npositive, err;
+	unsigned int sigen;
+	aufs_bindex_t bstart;
+	struct au_dinfo *dinfo;
+	struct super_block *sb;
+	struct dentry *parent;
+
+	sb = dentry->d_sb;
+	AuDebugOn(IS_ROOT(dentry));
+	sigen = au_sigen(sb);
+	parent = dget_parent(dentry);
+	AuDebugOn(au_digen(parent) != sigen
+		  || au_iigen(parent->d_inode) != sigen);
+
+	dinfo = au_di(dentry);
+	err = au_di_realloc(dinfo, au_sbend(sb) + 1);
+	npositive = err;
+	if (unlikely(err))
+		goto out;
+	au_do_refresh_hdentry(dinfo->di_hdentry + dinfo->di_bstart, dinfo,
+			      parent);
+
+	npositive = 0;
+	bstart = au_dbstart(parent);
+	if (type != S_IFDIR && dinfo->di_bstart == bstart)
+		goto out_dgen; /* success */
+
+	npositive = au_lkup_dentry(dentry, bstart, type, /*nd*/NULL);
+	if (npositive < 0)
+		goto out;
+	if (dinfo->di_bwh >= 0 && dinfo->di_bwh <= dinfo->di_bstart)
+		d_drop(dentry);
+
+ out_dgen:
+	au_update_digen(dentry);
+ out:
+	dput(parent);
+	AuTraceErr(npositive);
+	return npositive;
+}
+
+static noinline_for_stack
+int au_do_h_d_reval(struct dentry *h_dentry, struct nameidata *nd,
+		    struct dentry *dentry, aufs_bindex_t bindex)
+{
+	int err, valid;
+	int (*reval)(struct dentry *, struct nameidata *);
+
+	err = 0;
+	reval = NULL;
+	if (h_dentry->d_op)
+		reval = h_dentry->d_op->d_revalidate;
+	if (!reval)
+		goto out;
+
+	AuDbg("b%d\n", bindex);
+	if (au_test_fs_null_nd(h_dentry->d_sb))
+		/* it may return tri-state */
+		valid = reval(h_dentry, NULL);
+	else {
+		struct nameidata h_nd;
+		int locked;
+		struct dentry *parent;
+
+		au_h_nd(&h_nd, nd);
+		parent = nd->path.dentry;
+		locked = (nd && nd->path.dentry != dentry);
+		if (locked)
+			di_read_lock_parent(parent, AuLock_IR);
+		BUG_ON(bindex > au_dbend(parent));
+		h_nd.path.dentry = au_h_dptr(parent, bindex);
+		BUG_ON(!h_nd.path.dentry);
+		h_nd.path.mnt = au_sbr(parent->d_sb, bindex)->br_mnt;
+		path_get(&h_nd.path);
+		valid = reval(h_dentry, &h_nd);
+		path_put(&h_nd.path);
+		if (locked)
+			di_read_unlock(parent, AuLock_IR);
+	}
+
+	if (unlikely(valid < 0))
+		err = valid;
+	else if (!valid)
+		err = -EINVAL;
+
+ out:
+	AuTraceErr(err);
+	return err;
+}
+
+/* todo: remove this */
+static int h_d_revalidate(struct dentry *dentry, struct inode *inode,
+			  struct nameidata *nd, int do_udba)
+{
+	int err;
+	umode_t mode, h_mode;
+	aufs_bindex_t bindex, btail, bstart, ibs, ibe;
+	unsigned char plus, unhashed, is_root, h_plus;
+	struct inode *first, *h_inode, *h_cached_inode;
+	struct dentry *h_dentry;
+	struct qstr *name, *h_name;
+
+	err = 0;
+	plus = 0;
+	mode = 0;
+	first = NULL;
+	ibs = -1;
+	ibe = -1;
+	unhashed = !!d_unhashed(dentry);
+	is_root = !!IS_ROOT(dentry);
+	name = &dentry->d_name;
+
+	/*
+	 * Theoretically, REVAL test should be unnecessary in case of INOTIFY.
+	 * But inotify doesn't fire some necessary events,
+	 *	IN_ATTRIB for atime/nlink/pageio
+	 *	IN_DELETE for NFS dentry
+	 * Let's do REVAL test too.
+	 */
+	if (do_udba && inode) {
+		mode = (inode->i_mode & S_IFMT);
+		plus = (inode->i_nlink > 0);
+		first = au_h_iptr(inode, au_ibstart(inode));
+		ibs = au_ibstart(inode);
+		ibe = au_ibend(inode);
+	}
+
+	bstart = au_dbstart(dentry);
+	btail = bstart;
+	if (inode && S_ISDIR(inode->i_mode))
+		btail = au_dbtaildir(dentry);
+	for (bindex = bstart; bindex <= btail; bindex++) {
+		h_dentry = au_h_dptr(dentry, bindex);
+		if (!h_dentry)
+			continue;
+
+		AuDbg("b%d, %.*s\n", bindex, AuDLNPair(h_dentry));
+		h_name = &h_dentry->d_name;
+		if (unlikely(do_udba
+			     && !is_root
+			     && (unhashed != !!d_unhashed(h_dentry)
+				 || name->len != h_name->len
+				 || memcmp(name->name, h_name->name, name->len))
+			    )) {
+			AuDbg("unhash 0x%x 0x%x, %.*s %.*s\n",
+				  unhashed, d_unhashed(h_dentry),
+				  AuDLNPair(dentry), AuDLNPair(h_dentry));
+			goto err;
+		}
+
+		err = au_do_h_d_reval(h_dentry, nd, dentry, bindex);
+		if (unlikely(err))
+			/* do not goto err, to keep the errno */
+			break;
+
+		/* todo: plink too? */
+		if (!do_udba)
+			continue;
+
+		/* UDBA tests */
+		h_inode = h_dentry->d_inode;
+		if (unlikely(!!inode != !!h_inode))
+			goto err;
+
+		h_plus = plus;
+		h_mode = mode;
+		h_cached_inode = h_inode;
+		if (h_inode) {
+			h_mode = (h_inode->i_mode & S_IFMT);
+			h_plus = (h_inode->i_nlink > 0);
+		}
+		if (inode && ibs <= bindex && bindex <= ibe)
+			h_cached_inode = au_h_iptr(inode, bindex);
+
+		if (unlikely(plus != h_plus
+			     || mode != h_mode
+			     || h_cached_inode != h_inode))
+			goto err;
+		continue;
+
+	err:
+		err = -EINVAL;
+		break;
+	}
+
+	return err;
+}
+
+static int simple_reval_dpath(struct dentry *dentry, unsigned int sigen)
+{
+	int err;
+	struct dentry *parent;
+	struct inode *inode;
+
+	inode = dentry->d_inode;
+	if (au_digen(dentry) == sigen && au_iigen(inode) == sigen)
+		return 0;
+
+	parent = dget_parent(dentry);
+	di_read_lock_parent(parent, AuLock_IR);
+	AuDebugOn(au_digen(parent) != sigen
+		  || au_iigen(parent->d_inode) != sigen);
+	au_dbg_verify_gen(parent, sigen);
+
+	/* returns a number of positive dentries */
+	err = au_refresh_hdentry(dentry, inode->i_mode & S_IFMT);
+	if (err >= 0)
+		err = au_refresh_hinode(inode, dentry);
+
+	di_read_unlock(parent, AuLock_IR);
+	dput(parent);
+	return err;
+}
+
+int au_reval_dpath(struct dentry *dentry, unsigned int sigen)
+{
+	int err;
+	struct dentry *d, *parent;
+	struct inode *inode;
+
+	if (!au_ftest_si(au_sbi(dentry->d_sb), FAILED_REFRESH_DIRS))
+		return simple_reval_dpath(dentry, sigen);
+
+	/* slow loop, keep it simple and stupid */
+	/* cf: au_cpup_dirs() */
+	err = 0;
+	parent = NULL;
+	while (au_digen(dentry) != sigen
+	       || au_iigen(dentry->d_inode) != sigen) {
+		d = dentry;
+		while (1) {
+			dput(parent);
+			parent = dget_parent(d);
+			if (au_digen(parent) == sigen
+			    && au_iigen(parent->d_inode) == sigen)
+				break;
+			d = parent;
+		}
+
+		inode = d->d_inode;
+		if (d != dentry)
+			di_write_lock_child(d);
+
+		/* someone might update our dentry while we were sleeping */
+		if (au_digen(d) != sigen || au_iigen(d->d_inode) != sigen) {
+			di_read_lock_parent(parent, AuLock_IR);
+			/* returns a number of positive dentries */
+			err = au_refresh_hdentry(d, inode->i_mode & S_IFMT);
+			if (err >= 0)
+				err = au_refresh_hinode(inode, d);
+			di_read_unlock(parent, AuLock_IR);
+		}
+
+		if (d != dentry)
+			di_write_unlock(d);
+		dput(parent);
+		if (unlikely(err))
+			break;
+	}
+
+	return err;
+}
+
+/*
+ * if valid returns 1, otherwise 0.
+ */
+static int aufs_d_revalidate(struct dentry *dentry, struct nameidata *nd)
+{
+	int valid, err;
+	unsigned int sigen;
+	unsigned char do_udba;
+	struct super_block *sb;
+	struct inode *inode;
+
+	err = -EINVAL;
+	sb = dentry->d_sb;
+	inode = dentry->d_inode;
+	aufs_read_lock(dentry, AuLock_FLUSH | AuLock_DW);
+	sigen = au_sigen(sb);
+	if (au_digen(dentry) != sigen) {
+		AuDebugOn(IS_ROOT(dentry));
+		if (inode)
+			err = au_reval_dpath(dentry, sigen);
+		if (unlikely(err))
+			goto out_dgrade;
+		AuDebugOn(au_digen(dentry) != sigen);
+	}
+	if (inode && au_iigen(inode) != sigen) {
+		AuDebugOn(IS_ROOT(dentry));
+		err = au_refresh_hinode(inode, dentry);
+		if (unlikely(err))
+			goto out_dgrade;
+		AuDebugOn(au_iigen(inode) != sigen);
+	}
+	di_downgrade_lock(dentry, AuLock_IR);
+
+	AuDebugOn(au_digen(dentry) != sigen);
+	AuDebugOn(inode && au_iigen(inode) != sigen);
+	err = -EINVAL;
+	do_udba = !au_opt_test(au_mntflags(sb), UDBA_NONE);
+	if (do_udba && inode) {
+		aufs_bindex_t bstart = au_ibstart(inode);
+
+		if (bstart >= 0
+		    && au_test_higen(inode, au_h_iptr(inode, bstart)))
+			goto out;
+	}
+
+	err = h_d_revalidate(dentry, inode, nd, do_udba);
+	if (unlikely(!err && do_udba && au_dbstart(dentry) < 0))
+		/* both of real entry and whiteout found */
+		err = -EIO;
+	goto out;
+
+ out_dgrade:
+	di_downgrade_lock(dentry, AuLock_IR);
+ out:
+	au_store_oflag(nd);
+	aufs_read_unlock(dentry, AuLock_IR);
+	AuTraceErr(err);
+	valid = !err;
+	if (!valid)
+		AuDbg("%.*s invalid\n", AuDLNPair(dentry));
+	return valid;
+}
+
+static void aufs_d_release(struct dentry *dentry)
+{
+	struct au_dinfo *dinfo;
+	aufs_bindex_t bend, bindex;
+
+	dinfo = dentry->d_fsdata;
+	if (!dinfo)
+		return;
+
+	/* dentry may not be revalidated */
+	bindex = dinfo->di_bstart;
+	if (bindex >= 0) {
+		struct au_hdentry *p;
+
+		bend = dinfo->di_bend;
+		p = dinfo->di_hdentry + bindex;
+		while (bindex++ <= bend) {
+			if (p->hd_dentry)
+				au_hdput(p);
+			p++;
+		}
+	}
+	kfree(dinfo->di_hdentry);
+	au_rwsem_destroy(&dinfo->di_rwsem);
+	au_cache_free_dinfo(dinfo);
+	au_hin_di_reinit(dentry);
+}
+
+struct dentry_operations aufs_dop = {
+	.d_revalidate	= aufs_d_revalidate,
+	.d_release	= aufs_d_release
+};
diff --git a/fs/aufs/dentry.h b/fs/aufs/dentry.h
new file mode 100644
index 0000000..8708db8
--- /dev/null
+++ b/fs/aufs/dentry.h
@@ -0,0 +1,213 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * lookup and dentry operations
+ */
+
+#ifndef __AUFS_DENTRY_H__
+#define __AUFS_DENTRY_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/aufs_type.h>
+#include "rwsem.h"
+
+/* make a single member structure for future use */
+/* todo: remove this structure */
+struct au_hdentry {
+	struct dentry		*hd_dentry;
+};
+
+struct au_dinfo {
+	atomic_t		di_generation;
+
+	struct rw_semaphore	di_rwsem;
+	aufs_bindex_t		di_bstart, di_bend, di_bwh, di_bdiropq;
+	struct au_hdentry	*di_hdentry;
+};
+
+/* ---------------------------------------------------------------------- */
+
+/* dentry.c */
+extern struct dentry_operations aufs_dop;
+struct au_branch;
+struct dentry *au_lkup_one(struct qstr *name, struct dentry *h_parent,
+			   struct au_branch *br, struct nameidata *nd);
+struct dentry *au_sio_lkup_one(struct qstr *name, struct dentry *parent,
+			       struct au_branch *br);
+int au_h_verify(struct dentry *h_dentry, unsigned int udba, struct inode *h_dir,
+		struct dentry *h_parent, struct au_branch *br);
+
+int au_lkup_dentry(struct dentry *dentry, aufs_bindex_t bstart, mode_t type,
+		   struct nameidata *nd);
+int au_lkup_neg(struct dentry *dentry, aufs_bindex_t bindex);
+int au_refresh_hdentry(struct dentry *dentry, mode_t type);
+int au_reval_dpath(struct dentry *dentry, unsigned int sigen);
+
+/* dinfo.c */
+int au_alloc_dinfo(struct dentry *dentry);
+int au_di_realloc(struct au_dinfo *dinfo, int nbr);
+
+void di_read_lock(struct dentry *d, int flags, unsigned int lsc);
+void di_read_unlock(struct dentry *d, int flags);
+void di_downgrade_lock(struct dentry *d, int flags);
+void di_write_lock(struct dentry *d, unsigned int lsc);
+void di_write_unlock(struct dentry *d);
+void di_write_lock2_child(struct dentry *d1, struct dentry *d2, int isdir);
+void di_write_lock2_parent(struct dentry *d1, struct dentry *d2, int isdir);
+void di_write_unlock2(struct dentry *d1, struct dentry *d2);
+
+struct dentry *au_h_dptr(struct dentry *dentry, aufs_bindex_t bindex);
+aufs_bindex_t au_dbtail(struct dentry *dentry);
+aufs_bindex_t au_dbtaildir(struct dentry *dentry);
+
+void au_set_h_dptr(struct dentry *dentry, aufs_bindex_t bindex,
+		   struct dentry *h_dentry);
+void au_update_digen(struct dentry *dentry);
+void au_update_dbrange(struct dentry *dentry, int do_put_zero);
+void au_update_dbstart(struct dentry *dentry);
+void au_update_dbend(struct dentry *dentry);
+int au_find_dbindex(struct dentry *dentry, struct dentry *h_dentry);
+
+/* ---------------------------------------------------------------------- */
+
+static inline struct au_dinfo *au_di(struct dentry *dentry)
+{
+	return dentry->d_fsdata;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* lock subclass for dinfo */
+enum {
+	AuLsc_DI_CHILD,		/* child first */
+	AuLsc_DI_CHILD2,	/* rename(2), link(2), and cpup at hinotify */
+	AuLsc_DI_CHILD3,	/* copyup dirs */
+	AuLsc_DI_PARENT,
+	AuLsc_DI_PARENT2,
+	AuLsc_DI_PARENT3
+};
+
+/*
+ * di_read_lock_child, di_write_lock_child,
+ * di_read_lock_child2, di_write_lock_child2,
+ * di_read_lock_child3, di_write_lock_child3,
+ * di_read_lock_parent, di_write_lock_parent,
+ * di_read_lock_parent2, di_write_lock_parent2,
+ * di_read_lock_parent3, di_write_lock_parent3,
+ */
+#define AuReadLockFunc(name, lsc) \
+static inline void di_read_lock_##name(struct dentry *d, int flags) \
+{ di_read_lock(d, flags, AuLsc_DI_##lsc); }
+
+#define AuWriteLockFunc(name, lsc) \
+static inline void di_write_lock_##name(struct dentry *d) \
+{ di_write_lock(d, AuLsc_DI_##lsc); }
+
+#define AuRWLockFuncs(name, lsc) \
+	AuReadLockFunc(name, lsc) \
+	AuWriteLockFunc(name, lsc)
+
+AuRWLockFuncs(child, CHILD);
+AuRWLockFuncs(child2, CHILD2);
+AuRWLockFuncs(child3, CHILD3);
+AuRWLockFuncs(parent, PARENT);
+AuRWLockFuncs(parent2, PARENT2);
+AuRWLockFuncs(parent3, PARENT3);
+
+#undef AuReadLockFunc
+#undef AuWriteLockFunc
+#undef AuRWLockFuncs
+
+#define DiMustNoWaiters(d)	AuRwMustNoWaiters(&au_di(d)->di_rwsem)
+
+/* ---------------------------------------------------------------------- */
+
+/* todo: memory barrier? */
+static inline unsigned int au_digen(struct dentry *d)
+{
+	return atomic_read(&au_di(d)->di_generation);
+}
+
+static inline void au_h_dentry_init(struct au_hdentry *hdentry)
+{
+	hdentry->hd_dentry = NULL;
+}
+
+static inline void au_hdput(struct au_hdentry *hd)
+{
+	dput(hd->hd_dentry);
+}
+
+static inline aufs_bindex_t au_dbstart(struct dentry *dentry)
+{
+	return au_di(dentry)->di_bstart;
+}
+
+static inline aufs_bindex_t au_dbend(struct dentry *dentry)
+{
+	return au_di(dentry)->di_bend;
+}
+
+static inline aufs_bindex_t au_dbwh(struct dentry *dentry)
+{
+	return au_di(dentry)->di_bwh;
+}
+
+static inline aufs_bindex_t au_dbdiropq(struct dentry *dentry)
+{
+	return au_di(dentry)->di_bdiropq;
+}
+
+/* todo: hard/soft set? */
+static inline void au_set_dbstart(struct dentry *dentry, aufs_bindex_t bindex)
+{
+	au_di(dentry)->di_bstart = bindex;
+}
+
+static inline void au_set_dbend(struct dentry *dentry, aufs_bindex_t bindex)
+{
+	au_di(dentry)->di_bend = bindex;
+}
+
+static inline void au_set_dbwh(struct dentry *dentry, aufs_bindex_t bindex)
+{
+	/* dbwh can be outside of bstart - bend range */
+	au_di(dentry)->di_bwh = bindex;
+}
+
+static inline void au_set_dbdiropq(struct dentry *dentry, aufs_bindex_t bindex)
+{
+	au_di(dentry)->di_bdiropq = bindex;
+}
+
+/* ---------------------------------------------------------------------- */
+
+#ifdef CONFIG_AUFS_HINOTIFY
+static inline void au_digen_dec(struct dentry *d)
+{
+	atomic_dec(&au_di(d)->di_generation);
+}
+
+static inline void au_hin_di_reinit(struct dentry *dentry)
+{
+	dentry->d_fsdata = NULL;
+}
+#else
+static inline void au_hin_di_reinit(struct dentry *dentry __maybe_unused)
+{
+	/* empty */
+}
+#endif /* CONFIG_AUFS_HINOTIFY */
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_DENTRY_H__ */
diff --git a/fs/aufs/dinfo.c b/fs/aufs/dinfo.c
new file mode 100644
index 0000000..b0645aa
--- /dev/null
+++ b/fs/aufs/dinfo.c
@@ -0,0 +1,351 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * dentry private data
+ */
+
+#include "aufs.h"
+
+int au_alloc_dinfo(struct dentry *dentry)
+{
+	struct au_dinfo *dinfo;
+	struct super_block *sb;
+	int nbr;
+
+	dinfo = au_cache_alloc_dinfo();
+	if (unlikely(!dinfo))
+		goto out;
+
+	sb = dentry->d_sb;
+	nbr = au_sbend(sb) + 1;
+	if (nbr <= 0)
+		nbr = 1;
+	dinfo->di_hdentry = kcalloc(nbr, sizeof(*dinfo->di_hdentry), GFP_NOFS);
+	if (unlikely(!dinfo->di_hdentry))
+		goto out_dinfo;
+
+	atomic_set(&dinfo->di_generation, au_sigen(sb));
+	/* smp_mb(); */ /* atomic_set */
+	init_rwsem(&dinfo->di_rwsem);
+	down_write_nested(&dinfo->di_rwsem, AuLsc_DI_CHILD);
+	dinfo->di_bstart = -1;
+	dinfo->di_bend = -1;
+	dinfo->di_bwh = -1;
+	dinfo->di_bdiropq = -1;
+
+	dentry->d_fsdata = dinfo;
+	dentry->d_op = &aufs_dop;
+	return 0; /* success */
+
+ out_dinfo:
+	au_cache_free_dinfo(dinfo);
+ out:
+	return -ENOMEM;
+}
+
+int au_di_realloc(struct au_dinfo *dinfo, int nbr)
+{
+	int err, sz;
+	struct au_hdentry *hdp;
+
+	err = -ENOMEM;
+	sz = sizeof(*hdp) * (dinfo->di_bend + 1);
+	if (!sz)
+		sz = sizeof(*hdp);
+	hdp = au_kzrealloc(dinfo->di_hdentry, sz, sizeof(*hdp) * nbr, GFP_NOFS);
+	if (hdp) {
+		dinfo->di_hdentry = hdp;
+		err = 0;
+	}
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static void do_ii_write_lock(struct inode *inode, unsigned int lsc)
+{
+	switch (lsc) {
+	case AuLsc_DI_CHILD:
+		ii_write_lock_child(inode);
+		break;
+	case AuLsc_DI_CHILD2:
+		ii_write_lock_child2(inode);
+		break;
+	case AuLsc_DI_CHILD3:
+		ii_write_lock_child3(inode);
+		break;
+	case AuLsc_DI_PARENT:
+		ii_write_lock_parent(inode);
+		break;
+	case AuLsc_DI_PARENT2:
+		ii_write_lock_parent2(inode);
+		break;
+	case AuLsc_DI_PARENT3:
+		ii_write_lock_parent3(inode);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void do_ii_read_lock(struct inode *inode, unsigned int lsc)
+{
+	switch (lsc) {
+	case AuLsc_DI_CHILD:
+		ii_read_lock_child(inode);
+		break;
+	case AuLsc_DI_CHILD2:
+		ii_read_lock_child2(inode);
+		break;
+	case AuLsc_DI_CHILD3:
+		ii_read_lock_child3(inode);
+		break;
+	case AuLsc_DI_PARENT:
+		ii_read_lock_parent(inode);
+		break;
+	case AuLsc_DI_PARENT2:
+		ii_read_lock_parent2(inode);
+		break;
+	case AuLsc_DI_PARENT3:
+		ii_read_lock_parent3(inode);
+		break;
+	default:
+		BUG();
+	}
+}
+
+void di_read_lock(struct dentry *d, int flags, unsigned int lsc)
+{
+	down_read_nested(&au_di(d)->di_rwsem, lsc);
+	if (d->d_inode) {
+		if (au_ftest_lock(flags, IW))
+			do_ii_write_lock(d->d_inode, lsc);
+		else if (au_ftest_lock(flags, IR))
+			do_ii_read_lock(d->d_inode, lsc);
+	}
+}
+
+void di_read_unlock(struct dentry *d, int flags)
+{
+	if (d->d_inode) {
+		if (au_ftest_lock(flags, IW))
+			ii_write_unlock(d->d_inode);
+		else if (au_ftest_lock(flags, IR))
+			ii_read_unlock(d->d_inode);
+	}
+	up_read(&au_di(d)->di_rwsem);
+}
+
+void di_downgrade_lock(struct dentry *d, int flags)
+{
+	downgrade_write(&au_di(d)->di_rwsem);
+	if (d->d_inode && au_ftest_lock(flags, IR))
+		ii_downgrade_lock(d->d_inode);
+}
+
+void di_write_lock(struct dentry *d, unsigned int lsc)
+{
+	down_write_nested(&au_di(d)->di_rwsem, lsc);
+	if (d->d_inode)
+		do_ii_write_lock(d->d_inode, lsc);
+}
+
+void di_write_unlock(struct dentry *d)
+{
+	if (d->d_inode)
+		ii_write_unlock(d->d_inode);
+	up_write(&au_di(d)->di_rwsem);
+}
+
+void di_write_lock2_child(struct dentry *d1, struct dentry *d2, int isdir)
+{
+	AuDebugOn(d1 == d2
+		  || d1->d_inode == d2->d_inode
+		  || d1->d_sb != d2->d_sb);
+
+	if (isdir && au_test_subdir(d1, d2)) {
+		di_write_lock_child(d1);
+		di_write_lock_child2(d2);
+	} else {
+		/* there should be no races */
+		di_write_lock_child(d2);
+		di_write_lock_child2(d1);
+	}
+}
+
+void di_write_lock2_parent(struct dentry *d1, struct dentry *d2, int isdir)
+{
+	AuDebugOn(d1 == d2
+		  || d1->d_inode == d2->d_inode
+		  || d1->d_sb != d2->d_sb);
+
+	if (isdir && au_test_subdir(d1, d2)) {
+		di_write_lock_parent(d1);
+		di_write_lock_parent2(d2);
+	} else {
+		/* there should be no races */
+		di_write_lock_parent(d2);
+		di_write_lock_parent2(d1);
+	}
+}
+
+void di_write_unlock2(struct dentry *d1, struct dentry *d2)
+{
+	di_write_unlock(d1);
+	if (d1->d_inode == d2->d_inode)
+		up_write(&au_di(d2)->di_rwsem);
+	else
+		di_write_unlock(d2);
+}
+
+/* ---------------------------------------------------------------------- */
+
+struct dentry *au_h_dptr(struct dentry *dentry, aufs_bindex_t bindex)
+{
+	struct dentry *d;
+
+	if (au_dbstart(dentry) < 0 || bindex < au_dbstart(dentry))
+		return NULL;
+	AuDebugOn(bindex < 0);
+	d = au_di(dentry)->di_hdentry[0 + bindex].hd_dentry;
+	AuDebugOn(d && (atomic_read(&d->d_count) <= 0));
+	return d;
+}
+
+aufs_bindex_t au_dbtail(struct dentry *dentry)
+{
+	aufs_bindex_t bend, bwh;
+
+	bend = au_dbend(dentry);
+	if (0 <= bend) {
+		bwh = au_dbwh(dentry);
+		if (!bwh)
+			return bwh;
+		if (0 < bwh && bwh < bend)
+			return bwh - 1;
+	}
+	return bend;
+}
+
+aufs_bindex_t au_dbtaildir(struct dentry *dentry)
+{
+	aufs_bindex_t bend, bopq;
+
+	bend = au_dbtail(dentry);
+	if (0 <= bend) {
+		bopq = au_dbdiropq(dentry);
+		if (0 <= bopq && bopq < bend)
+			bend = bopq;
+	}
+	return bend;
+}
+
+/* ---------------------------------------------------------------------- */
+
+void au_set_h_dptr(struct dentry *dentry, aufs_bindex_t bindex,
+		   struct dentry *h_dentry)
+{
+	struct au_hdentry *hd = au_di(dentry)->di_hdentry + bindex;
+
+	if (hd->hd_dentry)
+		au_hdput(hd);
+	hd->hd_dentry = h_dentry;
+}
+
+void au_update_digen(struct dentry *dentry)
+{
+	atomic_set(&au_di(dentry)->di_generation, au_sigen(dentry->d_sb));
+	/* smp_mb(); */ /* atomic_set */
+}
+
+void au_update_dbrange(struct dentry *dentry, int do_put_zero)
+{
+	struct au_dinfo *dinfo;
+	struct dentry *h_d;
+
+	dinfo = au_di(dentry);
+	if (!dinfo || dinfo->di_bstart < 0)
+		return;
+
+	if (do_put_zero) {
+		aufs_bindex_t bindex, bend;
+
+		bend = dinfo->di_bend;
+		for (bindex = dinfo->di_bstart; bindex <= bend; bindex++) {
+			h_d = dinfo->di_hdentry[0 + bindex].hd_dentry;
+			if (h_d && !h_d->d_inode)
+				au_set_h_dptr(dentry, bindex, NULL);
+		}
+	}
+
+	dinfo->di_bstart = -1;
+	while (++dinfo->di_bstart <= dinfo->di_bend)
+		if (dinfo->di_hdentry[0 + dinfo->di_bstart].hd_dentry)
+			break;
+	if (dinfo->di_bstart > dinfo->di_bend) {
+		dinfo->di_bstart = -1;
+		dinfo->di_bend = -1;
+		return;
+	}
+
+	dinfo->di_bend++;
+	while (0 <= --dinfo->di_bend)
+		if (dinfo->di_hdentry[0 + dinfo->di_bend].hd_dentry)
+			break;
+	AuDebugOn(dinfo->di_bstart > dinfo->di_bend || dinfo->di_bend < 0);
+}
+
+void au_update_dbstart(struct dentry *dentry)
+{
+	aufs_bindex_t bindex, bend;
+	struct dentry *h_dentry;
+
+	bend = au_dbend(dentry);
+	for (bindex = au_dbstart(dentry); bindex <= bend; bindex++) {
+		h_dentry = au_h_dptr(dentry, bindex);
+		if (!h_dentry)
+			continue;
+		if (h_dentry->d_inode) {
+			au_set_dbstart(dentry, bindex);
+			return;
+		}
+		au_set_h_dptr(dentry, bindex, NULL);
+	}
+}
+
+void au_update_dbend(struct dentry *dentry)
+{
+	aufs_bindex_t bindex, bstart;
+	struct dentry *h_dentry;
+
+	bstart = au_dbstart(dentry);
+	for (bindex = au_dbend(dentry); bindex <= bstart; bindex--) {
+		h_dentry = au_h_dptr(dentry, bindex);
+		if (!h_dentry)
+			continue;
+		if (h_dentry->d_inode) {
+			au_set_dbend(dentry, bindex);
+			return;
+		}
+		au_set_h_dptr(dentry, bindex, NULL);
+	}
+}
+
+int au_find_dbindex(struct dentry *dentry, struct dentry *h_dentry)
+{
+	aufs_bindex_t bindex, bend;
+
+	bend = au_dbend(dentry);
+	for (bindex = au_dbstart(dentry); bindex <= bend; bindex++)
+		if (au_h_dptr(dentry, bindex) == h_dentry)
+			return bindex;
+	return -1;
+}
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 17/28] aufs file
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (15 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 16/28] aufs dentry and lookup J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 18/28] aufs direcotry J. R. Okajima
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
private data, file operations, vm operations, and address_space
operations.

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/f_op.c  |  551 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/file.c  |  552 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/file.h  |  148 +++++++++++++++
 fs/aufs/finfo.c |  126 +++++++++++++
 4 files changed, 1377 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/f_op.c
 create mode 100644 fs/aufs/file.c
 create mode 100644 fs/aufs/file.h
 create mode 100644 fs/aufs/finfo.c

diff --git a/fs/aufs/f_op.c b/fs/aufs/f_op.c
new file mode 100644
index 0000000..04cb948
--- /dev/null
+++ b/fs/aufs/f_op.c
@@ -0,0 +1,551 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * file and vm operations
+ */
+
+#include <linux/fs_stack.h>
+#include <linux/poll.h>
+#include "aufs.h"
+
+/* common function to regular file and dir */
+int aufs_flush(struct file *file, fl_owner_t id)
+{
+	int err;
+	aufs_bindex_t bindex, bend;
+	struct dentry *dentry;
+	struct file *h_file;
+
+	dentry = file->f_dentry;
+	si_noflush_read_lock(dentry->d_sb);
+	fi_read_lock(file);
+	di_read_lock_child(dentry, AuLock_IW);
+
+	err = 0;
+	bend = au_fbend(file);
+	for (bindex = au_fbstart(file); !err && bindex <= bend; bindex++) {
+		h_file = au_h_fptr(file, bindex);
+		if (!h_file || !h_file->f_op || !h_file->f_op->flush)
+			continue;
+
+		err = h_file->f_op->flush(h_file, id);
+		if (!err)
+			vfsub_update_h_iattr(&h_file->f_path, /*did*/NULL);
+		/*ignore*/
+	}
+	au_cpup_attr_timesizes(dentry->d_inode);
+
+	di_read_unlock(dentry, AuLock_IW);
+	fi_read_unlock(file);
+	si_read_unlock(dentry->d_sb);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int do_open_nondir(struct file *file, int flags)
+{
+	int err;
+	aufs_bindex_t bindex;
+	struct file *h_file;
+	struct dentry *dentry;
+
+	err = 0;
+	dentry = file->f_dentry;
+	au_fi(file)->fi_h_vm_ops = NULL;
+	bindex = au_dbstart(dentry);
+	/* O_TRUNC is processed already */
+	BUG_ON(au_test_ro(dentry->d_sb, bindex, dentry->d_inode)
+	       && (flags & O_TRUNC));
+
+	h_file = au_h_open(dentry, bindex, flags, file);
+	if (IS_ERR(h_file))
+		err = PTR_ERR(h_file);
+	else {
+		au_set_fbstart(file, bindex);
+		au_set_fbend(file, bindex);
+		au_set_h_fptr(file, bindex, h_file);
+		au_update_figen(file);
+		/* todo: necessary? */
+		/* file->f_ra = h_file->f_ra; */
+	}
+	return err;
+}
+
+static int aufs_open_nondir(struct inode *inode __maybe_unused,
+			    struct file *file)
+{
+	return au_do_open(file, do_open_nondir);
+}
+
+static int aufs_release_nondir(struct inode *inode __maybe_unused,
+			       struct file *file)
+{
+	struct super_block *sb = file->f_dentry->d_sb;
+
+	si_noflush_read_lock(sb);
+	au_finfo_fin(file);
+	si_read_unlock(sb);
+	return 0;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static ssize_t aufs_read(struct file *file, char __user *buf, size_t count,
+			 loff_t *ppos)
+{
+	ssize_t err;
+	struct dentry *dentry;
+	struct file *h_file;
+	struct super_block *sb;
+
+	dentry = file->f_dentry;
+	sb = dentry->d_sb;
+	si_read_lock(sb, AuLock_FLUSH);
+	err = au_reval_and_lock_fdi(file, au_reopen_nondir, /*wlock*/0);
+	if (unlikely(err))
+		goto out;
+
+	h_file = au_h_fptr(file, au_fbstart(file));
+	err = vfsub_read_u(h_file, buf, count, ppos);
+	/* todo: necessary? */
+	/* file->f_ra = h_file->f_ra; */
+	fsstack_copy_attr_atime(dentry->d_inode, h_file->f_dentry->d_inode);
+
+	di_read_unlock(dentry, AuLock_IR);
+	fi_read_unlock(file);
+ out:
+	si_read_unlock(sb);
+	return err;
+}
+
+static ssize_t aufs_write(struct file *file, const char __user *ubuf,
+			  size_t count, loff_t *ppos)
+{
+	ssize_t err;
+	aufs_bindex_t bstart;
+	struct au_pin pin;
+	struct dentry *dentry;
+	struct inode *inode;
+	struct super_block *sb;
+	struct file *h_file;
+	char __user *buf = (char __user *)ubuf;
+
+	dentry = file->f_dentry;
+	sb = dentry->d_sb;
+	inode = dentry->d_inode;
+	mutex_lock(&inode->i_mutex);
+	si_read_lock(sb, AuLock_FLUSH);
+
+	err = au_reval_and_lock_fdi(file, au_reopen_nondir, /*wlock*/1);
+	if (unlikely(err))
+		goto out;
+
+	err = au_ready_to_write(file, -1, &pin);
+	di_downgrade_lock(dentry, AuLock_IR);
+	if (unlikely(err))
+		goto out_unlock;
+
+	bstart = au_fbstart(file);
+	h_file = au_h_fptr(file, bstart);
+	au_unpin(&pin);
+	err = vfsub_write_u(h_file, buf, count, ppos);
+	au_cpup_attr_timesizes(inode);
+	inode->i_mode = h_file->f_dentry->d_inode->i_mode;
+
+ out_unlock:
+	di_read_unlock(dentry, AuLock_IR);
+	fi_write_unlock(file);
+ out:
+	si_read_unlock(sb);
+	mutex_unlock(&inode->i_mutex);
+	return err;
+}
+
+static ssize_t aufs_splice_read(struct file *file, loff_t *ppos,
+				struct pipe_inode_info *pipe, size_t len,
+				unsigned int flags)
+{
+	ssize_t err;
+	struct file *h_file;
+	struct dentry *dentry;
+	struct super_block *sb;
+
+	dentry = file->f_dentry;
+	sb = dentry->d_sb;
+	si_read_lock(sb, AuLock_FLUSH);
+	err = au_reval_and_lock_fdi(file, au_reopen_nondir, /*wlock*/0);
+	if (unlikely(err))
+		goto out;
+
+	err = -EINVAL;
+	h_file = au_h_fptr(file, au_fbstart(file));
+	if (au_test_loopback_kthread()) {
+		file->f_mapping = h_file->f_mapping;
+		smp_mb(); /* unnecessary? */
+	}
+	err = vfsub_splice_to(h_file, ppos, pipe, len, flags);
+	/* todo: necessasry? */
+	/* file->f_ra = h_file->f_ra; */
+	fsstack_copy_attr_atime(dentry->d_inode, h_file->f_dentry->d_inode);
+
+	di_read_unlock(dentry, AuLock_IR);
+	fi_read_unlock(file);
+
+ out:
+	si_read_unlock(sb);
+	return err;
+}
+
+static ssize_t
+aufs_splice_write(struct pipe_inode_info *pipe, struct file *file, loff_t *ppos,
+		  size_t len, unsigned int flags)
+{
+	ssize_t err;
+	struct au_pin pin;
+	struct dentry *dentry;
+	struct inode *inode;
+	struct super_block *sb;
+	struct file *h_file;
+
+	dentry = file->f_dentry;
+	inode = dentry->d_inode;
+	mutex_lock(&inode->i_mutex);
+	sb = dentry->d_sb;
+	si_read_lock(sb, AuLock_FLUSH);
+
+	err = au_reval_and_lock_fdi(file, au_reopen_nondir, /*wlock*/1);
+	if (unlikely(err))
+		goto out;
+
+	err = au_ready_to_write(file, -1, &pin);
+	di_downgrade_lock(dentry, AuLock_IR);
+	if (unlikely(err))
+		goto out_unlock;
+
+	h_file = au_h_fptr(file, au_fbstart(file));
+	au_unpin(&pin);
+	err = vfsub_splice_from(pipe, h_file, ppos, len, flags);
+	au_cpup_attr_timesizes(inode);
+	inode->i_mode = h_file->f_dentry->d_inode->i_mode;
+
+ out_unlock:
+	di_read_unlock(dentry, AuLock_IR);
+	fi_write_unlock(file);
+ out:
+	si_read_unlock(sb);
+	mutex_unlock(&inode->i_mutex);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static struct file *au_safe_file(struct vm_area_struct *vma)
+{
+	struct file *file;
+
+	file = vma->vm_file;
+	if (file->private_data && au_test_aufs(file->f_dentry->d_sb))
+		return file;
+	return NULL;
+}
+
+static void au_reset_file(struct vm_area_struct *vma, struct file *file)
+{
+	vma->vm_file = file;
+	/* smp_mb(); */ /* flush vm_file */
+}
+
+static int aufs_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	int err;
+	static DECLARE_WAIT_QUEUE_HEAD(wq);
+	struct file *file, *h_file;
+	struct au_finfo *finfo;
+
+	/* todo: non-robr mode, user vm_file as it is? */
+	wait_event(wq, (file = au_safe_file(vma)));
+
+	/* do not revalidate, no si lock */
+	finfo = au_fi(file);
+	h_file = finfo->fi_hfile[0 + finfo->fi_bstart].hf_file;
+	AuDebugOn(!h_file || !au_test_mmapped(file));
+
+	fi_write_lock(file);
+	vma->vm_file = h_file;
+	err = finfo->fi_h_vm_ops->fault(vma, vmf);
+	/* todo: necessary? */
+	/* file->f_ra = h_file->f_ra; */
+	au_reset_file(vma, file);
+	fi_write_unlock(file);
+#if 0 /* def CONFIG_SMP */
+	/* wake_up_nr(&wq, online_cpu - 1); */
+	wake_up_all(&wq);
+#else
+	wake_up(&wq);
+#endif
+
+	return err;
+}
+
+static struct vm_operations_struct aufs_vm_ops = {
+	.fault		= aufs_fault
+};
+
+/* ---------------------------------------------------------------------- */
+
+static struct vm_operations_struct *au_vm_ops(struct file *h_file,
+					      struct vm_area_struct *vma)
+{
+	struct vm_operations_struct *vm_ops;
+	int err;
+
+	err = h_file->f_op->mmap(h_file, vma);
+	vm_ops = ERR_PTR(err);
+	if (unlikely(err))
+		goto out;
+	vm_ops = vma->vm_ops;
+	err = do_munmap(current->mm, vma->vm_start,
+			vma->vm_end - vma->vm_start);
+	if (unlikely(err)) {
+		AuIOErr("failed internal unmapping %.*s, %d\n",
+			AuDLNPair(h_file->f_dentry), err);
+		vm_ops = ERR_PTR(-EIO);
+	}
+
+ out:
+	return vm_ops;
+}
+
+static int aufs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	int err;
+	unsigned char wlock, mmapped;
+	struct dentry *dentry;
+	struct super_block *sb;
+	struct file *h_file;
+	struct vm_operations_struct *vm_ops;
+
+	dentry = file->f_dentry;
+	mmapped = !!au_test_mmapped(file); /* can be harmless race condition */
+	wlock = !!(file->f_mode & FMODE_WRITE);
+	sb = dentry->d_sb;
+	si_read_lock(sb, AuLock_FLUSH);
+	err = au_reval_and_lock_fdi(file, au_reopen_nondir, wlock | !mmapped);
+	if (unlikely(err))
+		goto out;
+
+	if (wlock) {
+		struct au_pin pin;
+
+		err = au_ready_to_write(file, -1, &pin);
+		di_downgrade_lock(dentry, AuLock_IR);
+		if (unlikely(err))
+			goto out_unlock;
+		au_unpin(&pin);
+	} else if (!mmapped)
+		di_downgrade_lock(dentry, AuLock_IR);
+
+	h_file = au_h_fptr(file, au_fbstart(file));
+	if (au_test_fs_bad_mapping(h_file->f_dentry->d_sb)) {
+		/*
+		 * by this assignment, f_mapping will differs from aufs inode
+		 * i_mapping.
+		 * if someone else mixes the use of f_dentry->d_inode and
+		 * f_mapping->host, then a problem may arise.
+		 */
+		file->f_mapping = h_file->f_mapping;
+	}
+
+	vm_ops = NULL;
+	if (!mmapped) {
+		vm_ops = au_vm_ops(h_file, vma);
+		err = PTR_ERR(vm_ops);
+		if (IS_ERR(vm_ops))
+			goto out_unlock;
+	}
+
+	/*
+	 * unnecessary to handle MAP_DENYWRITE and deny_write_access()?
+	 * currently MAP_DENYWRITE from userspace is ignored, but elf loader
+	 * sets it. when FMODE_EXEC is set (by open_exec() or sys_uselib()),
+	 * both of the aufs file and the lower file is deny_write_access()-ed.
+	 * finally I hope we can skip handlling MAP_DENYWRITE here.
+	 */
+	err = generic_file_mmap(file, vma);
+	if (unlikely(err))
+		goto out_unlock;
+	vma->vm_ops = &aufs_vm_ops;
+	/* test again */
+	if (!au_test_mmapped(file))
+		au_fi(file)->fi_h_vm_ops = vm_ops;
+
+	vfsub_file_accessed(h_file);
+	fsstack_copy_attr_atime(dentry->d_inode, h_file->f_dentry->d_inode);
+
+ out_unlock:
+	di_read_unlock(dentry, AuLock_IR);
+	if (!wlock && mmapped)
+		fi_read_unlock(file);
+	else
+		fi_write_unlock(file);
+ out:
+	si_read_unlock(sb);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static unsigned int aufs_poll(struct file *file, poll_table *wait)
+{
+	unsigned int mask;
+	int err;
+	struct file *h_file;
+	struct dentry *dentry;
+	struct super_block *sb;
+
+	/* We should pretend an error happened. */
+	mask = POLLERR /* | POLLIN | POLLOUT */;
+	dentry = file->f_dentry;
+	sb = dentry->d_sb;
+	si_read_lock(sb, AuLock_FLUSH);
+	err = au_reval_and_lock_fdi(file, au_reopen_nondir, /*wlock*/0);
+	if (unlikely(err))
+		goto out;
+
+	/* it is not an error if h_file has no operation */
+	mask = DEFAULT_POLLMASK;
+	h_file = au_h_fptr(file, au_fbstart(file));
+	if (h_file->f_op && h_file->f_op->poll)
+		mask = h_file->f_op->poll(h_file, wait);
+
+	di_read_unlock(dentry, AuLock_IR);
+	fi_read_unlock(file);
+
+ out:
+	si_read_unlock(sb);
+	AuTraceErr((int)mask);
+	return mask;
+}
+
+static int aufs_fsync_nondir(struct file *file, struct dentry *dentry,
+			     int datasync)
+{
+	int err;
+	struct au_pin pin;
+	struct inode *inode;
+	struct file *h_file;
+	struct super_block *sb;
+
+	inode = dentry->d_inode;
+	IMustLock(file->f_mapping->host);
+	if (inode != file->f_mapping->host) {
+		mutex_unlock(&file->f_mapping->host->i_mutex);
+		mutex_lock(&inode->i_mutex);
+	}
+	IMustLock(inode);
+
+	sb = dentry->d_sb;
+	si_read_lock(sb, AuLock_FLUSH);
+
+	err = 0; /* -EBADF; */ /* posix? */
+	if (unlikely(!(file->f_mode & FMODE_WRITE)))
+		goto out;
+	err = au_reval_and_lock_fdi(file, au_reopen_nondir, /*wlock*/1);
+	if (unlikely(err))
+		goto out;
+
+	err = au_ready_to_write(file, -1, &pin);
+	di_downgrade_lock(dentry, AuLock_IR);
+	if (unlikely(err))
+		goto out_unlock;
+	au_unpin(&pin);
+
+	err = -EINVAL;
+	h_file = au_h_fptr(file, au_fbstart(file));
+	if (h_file->f_op && h_file->f_op->fsync) {
+		struct dentry *h_d;
+		struct mutex *h_mtx;
+
+		/*
+		 * no filemap_fdatawrite() since aufs file has no its own
+		 * mapping, but dir.
+		 */
+		h_d = h_file->f_dentry;
+		h_mtx = &h_d->d_inode->i_mutex;
+		mutex_lock_nested(h_mtx, AuLsc_I_CHILD);
+		err = h_file->f_op->fsync(h_file, h_d, datasync);
+		if (!err)
+			vfsub_update_h_iattr(&h_file->f_path, /*did*/NULL);
+		/*ignore*/
+		au_cpup_attr_timesizes(inode);
+		mutex_unlock(h_mtx);
+	}
+
+ out_unlock:
+	di_read_unlock(dentry, AuLock_IR);
+	fi_write_unlock(file);
+ out:
+	si_read_unlock(sb);
+	if (inode != file->f_mapping->host) {
+		mutex_unlock(&inode->i_mutex);
+		mutex_lock(&file->f_mapping->host->i_mutex);
+	}
+	return err;
+}
+
+static int aufs_fasync(int fd, struct file *file, int flag)
+{
+	int err;
+	struct file *h_file;
+	struct dentry *dentry;
+	struct super_block *sb;
+
+	dentry = file->f_dentry;
+	sb = dentry->d_sb;
+	si_read_lock(sb, AuLock_FLUSH);
+	err = au_reval_and_lock_fdi(file, au_reopen_nondir, /*wlock*/0);
+	if (unlikely(err))
+		goto out;
+
+	h_file = au_h_fptr(file, au_fbstart(file));
+	if (h_file->f_op && h_file->f_op->fasync)
+		err = h_file->f_op->fasync(fd, h_file, flag);
+
+	di_read_unlock(dentry, AuLock_IR);
+	fi_read_unlock(file);
+
+ out:
+	si_read_unlock(sb);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+struct file_operations aufs_file_fop = {
+	/*
+	 * while generic_file_llseek/_unlocked() don't use BKL,
+	 * don't use it since it operates file->f_mapping->host.
+	 * in aufs, it may be a real file and may confuse users by UDBA.
+	 */
+	/* .llseek		= generic_file_llseek, */
+
+	.read		= aufs_read,
+	.write		= aufs_write,
+	.poll		= aufs_poll,
+	.mmap		= aufs_mmap,
+	.open		= aufs_open_nondir,
+	.flush		= aufs_flush,
+	.release	= aufs_release_nondir,
+	.fsync		= aufs_fsync_nondir,
+	.fasync		= aufs_fasync,
+	.splice_write	= aufs_splice_write,
+	.splice_read	= aufs_splice_read
+};
diff --git a/fs/aufs/file.c b/fs/aufs/file.c
new file mode 100644
index 0000000..470281b
--- /dev/null
+++ b/fs/aufs/file.c
@@ -0,0 +1,552 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * handling file/dir, and address_space operation
+ */
+
+#include <linux/fsnotify.h>
+#include <linux/pagemap.h>
+#include "aufs.h"
+
+/*
+ * a dirty trick for handling deny_write_access().
+ * because FMODE_EXEC flag is not passed to f_op->open(),
+ * set it to file->private_data temporary.
+ */
+void au_store_oflag(struct nameidata *nd)
+{
+	if (nd
+	    && !(nd->flags & LOOKUP_CONTINUE)
+	    && (nd->flags & LOOKUP_OPEN))
+		nd->intent.open.file->private_data
+			= (void *)nd->intent.open.flags;
+	/* smp_mb(); */
+}
+
+/* drop flags for writing */
+unsigned int au_file_roflags(unsigned int flags)
+{
+	flags &= ~(O_WRONLY | O_RDWR | O_APPEND | O_CREAT | O_TRUNC);
+	flags |= O_RDONLY | O_NOATIME;
+	return flags;
+}
+
+/* common functions to regular file and dir */
+struct file *au_h_open(struct dentry *dentry, aufs_bindex_t bindex, int flags,
+		       struct file *file)
+{
+	struct file *h_file;
+	struct dentry *h_dentry;
+	struct inode *h_inode;
+	struct super_block *sb;
+	struct au_branch *br;
+	int err;
+
+	h_dentry = au_h_dptr(dentry, bindex);
+	h_inode = h_dentry->d_inode;
+	/* a race condition can happen between open and unlink/rmdir */
+	h_file = ERR_PTR(-ENOENT);
+	if (unlikely((!d_unhashed(dentry) && d_unhashed(h_dentry))
+		     || !h_inode))
+		goto out;
+
+	sb = dentry->d_sb;
+	br = au_sbr(sb, bindex);
+	h_file = ERR_PTR(-EACCES);
+	if (file && (file->f_mode & FMODE_EXEC)
+	    && (br->br_mnt->mnt_flags & MNT_NOEXEC))
+		goto out;
+
+	/* drop flags for writing */
+	if (au_test_ro(sb, bindex, dentry->d_inode))
+		flags = au_file_roflags(flags);
+	flags &= ~O_CREAT;
+	atomic_inc(&br->br_count);
+	h_file = dentry_open(dget(h_dentry), mntget(br->br_mnt), flags,
+			     current_cred());
+	if (IS_ERR(h_file))
+		goto out_br;
+
+	if (file && (file->f_mode & FMODE_EXEC)) {
+		h_file->f_mode |= FMODE_EXEC;
+		err = deny_write_access(h_file);
+		if (unlikely(err)) {
+			fput(h_file);
+			h_file = ERR_PTR(err);
+			goto out_br;
+		}
+	}
+	fsnotify_open(h_dentry);
+	goto out; /* success */
+
+ out_br:
+	atomic_dec(&br->br_count);
+ out:
+	return h_file;
+}
+
+int au_do_open(struct file *file, int (*open)(struct file *file, int flags))
+{
+	int err;
+	struct dentry *dentry;
+	struct super_block *sb;
+
+	dentry = file->f_dentry;
+	sb = dentry->d_sb;
+	si_read_lock(sb, AuLock_FLUSH);
+	err = au_finfo_init(file);
+	if (unlikely(err))
+		goto out;
+
+	di_read_lock_child(dentry, AuLock_IR);
+	err = open(file, file->f_flags);
+	di_read_unlock(dentry, AuLock_IR);
+
+	fi_write_unlock(file);
+	if (unlikely(err))
+		au_finfo_fin(file);
+ out:
+	si_read_unlock(sb);
+	return err;
+}
+
+int au_reopen_nondir(struct file *file)
+{
+	int err;
+	aufs_bindex_t bstart, bindex, bend;
+	struct dentry *dentry;
+	struct file *h_file, *h_file_tmp;
+
+	dentry = file->f_dentry;
+	bstart = au_dbstart(dentry);
+	h_file_tmp = NULL;
+	if (au_fbstart(file) == bstart) {
+		h_file = au_h_fptr(file, bstart);
+		if (file->f_mode == h_file->f_mode)
+			return 0; /* success */
+		h_file_tmp = h_file;
+		get_file(h_file_tmp);
+		au_set_h_fptr(file, bstart, NULL);
+	}
+	AuDebugOn(au_fbstart(file) < bstart
+		  || au_fi(file)->fi_hfile[0 + bstart].hf_file);
+
+	h_file = au_h_open(dentry, bstart, file->f_flags & ~O_TRUNC, file);
+	err = PTR_ERR(h_file);
+	if (IS_ERR(h_file))
+		goto out; /* todo: close all? */
+
+	err = 0;
+	au_set_fbstart(file, bstart);
+	au_set_h_fptr(file, bstart, h_file);
+	au_update_figen(file);
+	/* todo: necessary? */
+	/* file->f_ra = h_file->f_ra; */
+
+	/* close lower files */
+	bend = au_fbend(file);
+	for (bindex = bstart + 1; bindex <= bend; bindex++)
+		au_set_h_fptr(file, bindex, NULL);
+	au_set_fbend(file, bstart);
+
+ out:
+	if (h_file_tmp)
+		fput(h_file_tmp);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int au_reopen_wh(struct file *file, aufs_bindex_t btgt,
+			struct dentry *hi_wh)
+{
+	int err;
+	aufs_bindex_t bstart;
+	struct au_dinfo *dinfo;
+	struct dentry *h_dentry;
+
+	dinfo = au_di(file->f_dentry);
+	bstart = dinfo->di_bstart;
+	dinfo->di_bstart = btgt;
+	h_dentry = dinfo->di_hdentry[0 + btgt].hd_dentry;
+	dinfo->di_hdentry[0 + btgt].hd_dentry = hi_wh;
+	err = au_reopen_nondir(file);
+	dinfo->di_hdentry[0 + btgt].hd_dentry = h_dentry;
+	dinfo->di_bstart = bstart;
+
+	return err;
+}
+
+static int au_ready_to_write_wh(struct file *file, loff_t len,
+				aufs_bindex_t bcpup)
+{
+	int err;
+	struct inode *inode;
+	struct dentry *dentry, *hi_wh;
+	struct super_block *sb;
+
+	dentry = file->f_dentry;
+	inode = dentry->d_inode;
+	hi_wh = au_hi_wh(inode, bcpup);
+	if (!hi_wh)
+		err = au_sio_cpup_wh(dentry, bcpup, len, file);
+	else
+		/* already copied-up after unlink */
+		err = au_reopen_wh(file, bcpup, hi_wh);
+
+	sb = dentry->d_sb;
+	if (!err && inode->i_nlink > 1 && au_opt_test(au_mntflags(sb), PLINK))
+		au_plink_append(inode, bcpup, au_h_dptr(dentry, bcpup));
+
+	return err;
+}
+
+/*
+ * prepare the @file for writing.
+ */
+int au_ready_to_write(struct file *file, loff_t len, struct au_pin *pin)
+{
+	int err;
+	aufs_bindex_t bstart, bcpup;
+	struct dentry *dentry, *parent, *h_dentry;
+	struct inode *h_inode, *inode;
+	struct super_block *sb;
+
+	dentry = file->f_dentry;
+	sb = dentry->d_sb;
+	bstart = au_fbstart(file);
+	inode = dentry->d_inode;
+	err = au_test_ro(sb, bstart, inode);
+	if (!err && (au_h_fptr(file, bstart)->f_mode & FMODE_WRITE)) {
+		err = au_pin(pin, dentry, bstart, AuOpt_UDBA_NONE, /*flags*/0);
+		goto out;
+	}
+
+	/* need to cpup */
+	parent = dget_parent(dentry);
+	di_write_lock_parent(parent);
+	err = AuWbrCopyup(au_sbi(sb), dentry);
+	bcpup = err;
+	if (unlikely(err < 0))
+		goto out_dgrade;
+	err = 0;
+
+	if (!au_h_dptr(parent, bcpup)) {
+		err = au_cpup_dirs(dentry, bcpup);
+		if (unlikely(err))
+			goto out_dgrade;
+	}
+
+	err = au_pin(pin, dentry, bcpup, AuOpt_UDBA_NONE,
+		     AuPin_DI_LOCKED | AuPin_MNT_WRITE);
+	if (unlikely(err))
+		goto out_dgrade;
+
+	h_dentry = au_h_fptr(file, bstart)->f_dentry;
+	h_inode = h_dentry->d_inode;
+	mutex_lock_nested(&h_inode->i_mutex, AuLsc_I_CHILD);
+	if (d_unhashed(dentry) /* || d_unhashed(h_dentry) */
+	    /* || !h_inode->i_nlink */) {
+		err = au_ready_to_write_wh(file, len, bcpup);
+		di_downgrade_lock(parent, AuLock_IR);
+	} else {
+		di_downgrade_lock(parent, AuLock_IR);
+		if (!au_h_dptr(dentry, bcpup))
+			err = au_sio_cpup_simple(dentry, bcpup, len,
+						 AuCpup_DTIME);
+		if (!err)
+			err = au_reopen_nondir(file);
+	}
+	mutex_unlock(&h_inode->i_mutex);
+
+	if (!err) {
+		au_pin_set_parent_lflag(pin, /*lflag*/0);
+		goto out_dput; /* success */
+	}
+	au_unpin(pin);
+	goto out_unlock;
+
+ out_dgrade:
+	di_downgrade_lock(parent, AuLock_IR);
+ out_unlock:
+	di_read_unlock(parent, AuLock_IR);
+ out_dput:
+	dput(parent);
+ out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int au_file_refresh_by_inode(struct file *file, int *need_reopen)
+{
+	int err;
+	aufs_bindex_t bstart;
+	struct au_pin pin;
+	struct au_finfo *finfo;
+	struct dentry *dentry, *parent, *hi_wh;
+	struct inode *inode;
+	struct super_block *sb;
+
+	err = 0;
+	finfo = au_fi(file);
+	dentry = file->f_dentry;
+	sb = dentry->d_sb;
+	inode = dentry->d_inode;
+	bstart = au_ibstart(inode);
+	if (bstart == finfo->fi_bstart)
+		goto out;
+
+	parent = dget_parent(dentry);
+	if (au_test_ro(sb, bstart, inode)) {
+		di_read_lock_parent(parent, !AuLock_IR);
+		err = AuWbrCopyup(au_sbi(sb), dentry);
+		bstart = err;
+		di_read_unlock(parent, !AuLock_IR);
+		if (unlikely(err < 0))
+			goto out_parent;
+		err = 0;
+	}
+
+	di_read_lock_parent(parent, AuLock_IR);
+	hi_wh = au_hi_wh(inode, bstart);
+	if (au_opt_test(au_mntflags(sb), PLINK)
+	    && au_plink_test(inode)
+	    && !d_unhashed(dentry)) {
+		err = au_test_and_cpup_dirs(dentry, bstart);
+		if (unlikely(err))
+			goto out_unlock;
+
+		/* always superio. */
+		err = au_pin(&pin, dentry, bstart, AuOpt_UDBA_NONE,
+			     AuPin_DI_LOCKED | AuPin_MNT_WRITE);
+		if (!err)
+			err = au_sio_cpup_simple(dentry, bstart, -1,
+						 AuCpup_DTIME);
+		au_unpin(&pin);
+	} else if (hi_wh) {
+		/* already copied-up after unlink */
+		err = au_reopen_wh(file, bstart, hi_wh);
+		*need_reopen = 0;
+	}
+
+ out_unlock:
+	di_read_unlock(parent, AuLock_IR);
+ out_parent:
+	dput(parent);
+ out:
+	return err;
+}
+
+static void au_do_refresh_file(struct file *file)
+{
+	aufs_bindex_t bindex, bend, new_bindex, brid;
+	struct au_hfile *p, tmp, *q;
+	struct au_finfo *finfo;
+	struct super_block *sb;
+
+	sb = file->f_dentry->d_sb;
+	finfo = au_fi(file);
+	p = finfo->fi_hfile + finfo->fi_bstart;
+	brid = p->hf_br->br_id;
+	bend = finfo->fi_bend;
+	for (bindex = finfo->fi_bstart; bindex <= bend; bindex++, p++) {
+		if (!p->hf_file)
+			continue;
+
+		new_bindex = au_br_index(sb, p->hf_br->br_id);
+		if (new_bindex == bindex)
+			continue;
+		if (new_bindex < 0) {
+			au_set_h_fptr(file, bindex, NULL);
+			continue;
+		}
+
+		/* swap two lower inode, and loop again */
+		q = finfo->fi_hfile + new_bindex;
+		tmp = *q;
+		*q = *p;
+		*p = tmp;
+		if (tmp.hf_file) {
+			bindex--;
+			p--;
+		}
+	}
+
+	p = finfo->fi_hfile;
+	if (!au_test_mmapped(file) && !d_unhashed(file->f_dentry)) {
+		bend = au_sbend(sb);
+		for (finfo->fi_bstart = 0; finfo->fi_bstart <= bend;
+		     finfo->fi_bstart++, p++)
+			if (p->hf_file) {
+				if (p->hf_file->f_dentry
+				    && p->hf_file->f_dentry->d_inode)
+					break;
+				else
+					au_hfput(p, file);
+			}
+	} else {
+		bend = au_br_index(sb, brid);
+		for (finfo->fi_bstart = 0; finfo->fi_bstart < bend;
+		     finfo->fi_bstart++, p++)
+			if (p->hf_file)
+				au_hfput(p, file);
+		bend = au_sbend(sb);
+	}
+
+	p = finfo->fi_hfile + bend;
+	for (finfo->fi_bend = bend; finfo->fi_bend >= finfo->fi_bstart;
+	     finfo->fi_bend--, p--)
+		if (p->hf_file) {
+			if (p->hf_file->f_dentry
+			    && p->hf_file->f_dentry->d_inode)
+				break;
+			else
+				au_hfput(p, file);
+		}
+	AuDebugOn(finfo->fi_bend < finfo->fi_bstart);
+}
+
+/*
+ * after branch manipulating, refresh the file.
+ */
+static int refresh_file(struct file *file, int (*reopen)(struct file *file))
+{
+	int err, need_reopen;
+	struct dentry *dentry;
+	aufs_bindex_t bend, bindex;
+
+	dentry = file->f_dentry;
+	err = au_fi_realloc(au_fi(file), au_sbend(dentry->d_sb) + 1);
+	if (unlikely(err))
+		goto out;
+	au_do_refresh_file(file);
+
+	err = 0;
+	need_reopen = 1;
+	if (!au_test_mmapped(file))
+		err = au_file_refresh_by_inode(file, &need_reopen);
+	if (!err && need_reopen && !d_unhashed(dentry))
+		err = reopen(file);
+	if (!err) {
+		au_update_figen(file);
+		return 0; /* success */
+	}
+
+	/* error, close all lower files */
+	bend = au_fbend(file);
+	for (bindex = au_fbstart(file); bindex <= bend; bindex++)
+		au_set_h_fptr(file, bindex, NULL);
+
+ out:
+	return err;
+}
+
+/* common function to regular file and dir */
+int au_reval_and_lock_fdi(struct file *file, int (*reopen)(struct file *file),
+			  int wlock)
+{
+	int err;
+	unsigned int sigen, figen;
+	aufs_bindex_t bstart;
+	unsigned char pseudo_link;
+	struct dentry *dentry;
+
+	err = 0;
+	dentry = file->f_dentry;
+	sigen = au_sigen(dentry->d_sb);
+	fi_write_lock(file);
+	figen = au_figen(file);
+	di_write_lock_child(dentry);
+	bstart = au_dbstart(dentry);
+	pseudo_link = (bstart != au_ibstart(dentry->d_inode));
+	if (sigen == figen && !pseudo_link && au_fbstart(file) == bstart) {
+		if (!wlock) {
+			di_downgrade_lock(dentry, AuLock_IR);
+			fi_downgrade_lock(file);
+		}
+		goto out; /* success */
+	}
+
+	AuDbg("sigen %d, figen %d\n", sigen, figen);
+	if (sigen != au_digen(dentry)
+	    || sigen != au_iigen(dentry->d_inode)) {
+		err = au_reval_dpath(dentry, sigen);
+		if (unlikely(err < 0))
+			goto out;
+		AuDebugOn(au_digen(dentry) != sigen
+			  || au_iigen(dentry->d_inode) != sigen);
+	}
+
+	err = refresh_file(file, reopen);
+	if (!err) {
+		if (!wlock) {
+			di_downgrade_lock(dentry, AuLock_IR);
+			fi_downgrade_lock(file);
+		}
+	} else {
+		di_write_unlock(dentry);
+		fi_write_unlock(file);
+	}
+
+ out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* cf. aufs_nopage() */
+/* for madvise(2) */
+static int aufs_readpage(struct file *file __maybe_unused, struct page *page)
+{
+	unlock_page(page);
+	return 0;
+}
+
+/* they will never be called. */
+#ifdef CONFIG_AUFS_DEBUG
+static int aufs_write_begin(struct file *file, struct address_space *mapping,
+			    loff_t pos, unsigned len, unsigned flags,
+			    struct page **pagep, void **fsdata)
+{ AuUnsupport(); return 0; }
+static int aufs_write_end(struct file *file, struct address_space *mapping,
+			  loff_t pos, unsigned len, unsigned copied,
+			  struct page *page, void *fsdata)
+{ AuUnsupport(); return 0; }
+static int aufs_writepage(struct page *page, struct writeback_control *wbc)
+{ AuUnsupport(); return 0; }
+static void aufs_sync_page(struct page *page)
+{ AuUnsupport(); }
+
+static int aufs_set_page_dirty(struct page *page)
+{ AuUnsupport(); return 0; }
+static void aufs_invalidatepage(struct page *page, unsigned long offset)
+{ AuUnsupport(); }
+static int aufs_releasepage(struct page *page, gfp_t gfp)
+{ AuUnsupport(); return 0; }
+static ssize_t aufs_direct_IO(int rw, struct kiocb *iocb,
+			      const struct iovec *iov, loff_t offset,
+			      unsigned long nr_segs)
+{ AuUnsupport(); return 0; }
+#endif /* CONFIG_AUFS_DEBUG */
+
+struct address_space_operations aufs_aop = {
+	.readpage	= aufs_readpage,
+#ifdef CONFIG_AUFS_DEBUG
+	.writepage	= aufs_writepage,
+	.sync_page	= aufs_sync_page,
+	.set_page_dirty	= aufs_set_page_dirty,
+	.write_begin	= aufs_write_begin,
+	.write_end	= aufs_write_end,
+	.invalidatepage	= aufs_invalidatepage,
+	.releasepage	= aufs_releasepage,
+	.direct_IO	= aufs_direct_IO,
+#endif /* CONFIG_AUFS_DEBUG */
+};
diff --git a/fs/aufs/file.h b/fs/aufs/file.h
new file mode 100644
index 0000000..3004054
--- /dev/null
+++ b/fs/aufs/file.h
@@ -0,0 +1,148 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * file operations
+ */
+
+#ifndef __AUFS_FILE_H__
+#define __AUFS_FILE_H__
+
+#ifdef __KERNEL__
+
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/aufs_type.h>
+#include "rwsem.h"
+
+struct au_branch;
+struct au_hfile {
+	struct file		*hf_file;
+	struct au_branch	*hf_br;
+};
+
+struct au_vdir;
+struct au_finfo {
+	atomic_t		fi_generation;
+
+	struct rw_semaphore	fi_rwsem;
+	struct au_hfile		*fi_hfile;
+	aufs_bindex_t		fi_bstart, fi_bend;
+
+	union {
+		/* non-dir only */
+		struct vm_operations_struct	*fi_h_vm_ops;
+
+		/* dir only */
+		struct {
+			struct au_vdir		*fi_vdir_cache;
+			int			fi_maintain_plink;
+		};
+	};
+};
+
+/* ---------------------------------------------------------------------- */
+
+/* file.c */
+extern struct address_space_operations aufs_aop;
+void au_store_oflag(struct nameidata *nd);
+unsigned int au_file_roflags(unsigned int flags);
+struct file *au_h_open(struct dentry *dentry, aufs_bindex_t bindex, int flags,
+		       struct file *file);
+int au_do_open(struct file *file, int (*open)(struct file *file, int flags));
+int au_reopen_nondir(struct file *file);
+struct au_pin;
+int au_ready_to_write(struct file *file, loff_t len, struct au_pin *pin);
+int au_reval_and_lock_fdi(struct file *file, int (*reopen)(struct file *file),
+			  int wlock);
+
+/* f_op.c */
+extern struct file_operations aufs_file_fop;
+int aufs_flush(struct file *file, fl_owner_t id);
+
+/* finfo.c */
+void au_hfput(struct au_hfile *hf, struct file *file);
+void au_set_h_fptr(struct file *file, aufs_bindex_t bindex,
+		   struct file *h_file);
+
+void au_update_figen(struct file *file);
+
+void au_finfo_fin(struct file *file);
+int au_finfo_init(struct file *file);
+int au_fi_realloc(struct au_finfo *finfo, int nbr);
+
+/* ---------------------------------------------------------------------- */
+
+static inline struct au_finfo *au_fi(struct file *file)
+{
+	return file->private_data;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * fi_read_lock, fi_write_lock,
+ * fi_read_unlock, fi_write_unlock, fi_downgrade_lock
+ */
+AuSimpleRwsemFuncs(fi, struct file *f, &au_fi(f)->fi_rwsem);
+
+#define FiMustNoWaiters(f)	AuRwMustNoWaiters(&au_fi(f)->fi_rwsem)
+
+/* ---------------------------------------------------------------------- */
+
+/* todo: hard/soft set? */
+static inline aufs_bindex_t au_fbstart(struct file *file)
+{
+	return au_fi(file)->fi_bstart;
+}
+
+static inline aufs_bindex_t au_fbend(struct file *file)
+{
+	return au_fi(file)->fi_bend;
+}
+
+static inline struct au_vdir *au_fvdir_cache(struct file *file)
+{
+	return au_fi(file)->fi_vdir_cache;
+}
+
+static inline void au_set_fbstart(struct file *file, aufs_bindex_t bindex)
+{
+	au_fi(file)->fi_bstart = bindex;
+}
+
+static inline void au_set_fbend(struct file *file, aufs_bindex_t bindex)
+{
+	au_fi(file)->fi_bend = bindex;
+}
+
+static inline void au_set_fvdir_cache(struct file *file,
+				      struct au_vdir *vdir_cache)
+{
+	au_fi(file)->fi_vdir_cache = vdir_cache;
+}
+
+static inline struct file *au_h_fptr(struct file *file, aufs_bindex_t bindex)
+{
+	return au_fi(file)->fi_hfile[0 + bindex].hf_file;
+}
+
+/* todo: memory barrier? */
+static inline unsigned int au_figen(struct file *f)
+{
+	return atomic_read(&au_fi(f)->fi_generation);
+}
+
+static inline int au_test_mmapped(struct file *f)
+{
+	return !!(au_fi(f)->fi_h_vm_ops);
+}
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_FILE_H__ */
diff --git a/fs/aufs/finfo.c b/fs/aufs/finfo.c
new file mode 100644
index 0000000..49c4934
--- /dev/null
+++ b/fs/aufs/finfo.c
@@ -0,0 +1,126 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * file private data
+ */
+
+#include "aufs.h"
+
+void au_hfput(struct au_hfile *hf, struct file *file)
+{
+	if (file->f_mode & FMODE_EXEC)
+		allow_write_access(hf->hf_file);
+	fput(hf->hf_file);
+	hf->hf_file = NULL;
+	atomic_dec(&hf->hf_br->br_count);
+	hf->hf_br = NULL;
+}
+
+void au_set_h_fptr(struct file *file, aufs_bindex_t bindex, struct file *val)
+{
+	struct au_finfo *finfo = au_fi(file);
+	struct au_hfile *hf;
+
+	hf = finfo->fi_hfile + bindex;
+	if (hf->hf_file)
+		au_hfput(hf, file);
+	if (val) {
+		hf->hf_file = val;
+		hf->hf_br = au_sbr(file->f_dentry->d_sb, bindex);
+	}
+}
+
+void au_update_figen(struct file *file)
+{
+	atomic_set(&au_fi(file)->fi_generation, au_digen(file->f_dentry));
+	/* smp_mb(); */ /* atomic_set */
+}
+
+/* ---------------------------------------------------------------------- */
+
+void au_finfo_fin(struct file *file)
+{
+	struct au_finfo *finfo;
+	aufs_bindex_t bindex, bend;
+
+	fi_write_lock(file);
+	bend = au_fbend(file);
+	bindex = au_fbstart(file);
+	if (bindex >= 0)
+		/*
+		 * calls fput() instead of filp_close(),
+		 * since no dnotify or lock for the lower file.
+		 */
+		for (; bindex <= bend; bindex++)
+			au_set_h_fptr(file, bindex, NULL);
+
+	finfo = au_fi(file);
+	au_dbg_verify_hf(finfo);
+	kfree(finfo->fi_hfile);
+	fi_write_unlock(file);
+	au_rwsem_destroy(&finfo->fi_rwsem);
+	au_cache_free_finfo(finfo);
+}
+
+int au_finfo_init(struct file *file)
+{
+	struct au_finfo *finfo;
+	struct dentry *dentry;
+	union {
+		unsigned int u;
+		fmode_t m;
+	} u;
+
+	dentry = file->f_dentry;
+	finfo = au_cache_alloc_finfo();
+	if (unlikely(!finfo))
+		goto out;
+
+	finfo->fi_hfile = kcalloc(au_sbend(dentry->d_sb) + 1,
+				  sizeof(*finfo->fi_hfile), GFP_NOFS);
+	if (unlikely(!finfo->fi_hfile))
+		goto out_finfo;
+
+	init_rwsem(&finfo->fi_rwsem);
+	down_write(&finfo->fi_rwsem);
+	finfo->fi_bstart = -1;
+	finfo->fi_bend = -1;
+	atomic_set(&finfo->fi_generation, au_digen(dentry));
+	/* smp_mb(); */ /* atomic_set */
+
+	/* cf. au_store_oflag() */
+	u.u = (unsigned int)file->private_data;
+	file->f_mode |= (u.m & FMODE_EXEC);
+	file->private_data = finfo;
+	return 0; /* success */
+
+ out_finfo:
+	au_cache_free_finfo(finfo);
+ out:
+	return -ENOMEM;
+}
+
+int au_fi_realloc(struct au_finfo *finfo, int nbr)
+{
+	int err, sz;
+	struct au_hfile *hfp;
+
+	err = -ENOMEM;
+	sz = sizeof(*hfp) * (finfo->fi_bend + 1);
+	if (!sz)
+		sz = sizeof(*hfp);
+	hfp = au_kzrealloc(finfo->fi_hfile, sz, sizeof(*hfp) * nbr, GFP_NOFS);
+	if (hfp) {
+		finfo->fi_hfile = hfp;
+		err = 0;
+	}
+
+	return err;
+}
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 18/28] aufs direcotry
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (16 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 17/28] aufs file J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 19/28] aufs inode J. R. Okajima
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
directory operations
virtual or vertical(stacked) directory

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/dir.c  |  493 +++++++++++++++++++++++++++++++++++
 fs/aufs/dir.h  |  104 ++++++++
 fs/aufs/vdir.c |  776 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1373 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/dir.c
 create mode 100644 fs/aufs/dir.h
 create mode 100644 fs/aufs/vdir.c

diff --git a/fs/aufs/dir.c b/fs/aufs/dir.c
new file mode 100644
index 0000000..44c78c9
--- /dev/null
+++ b/fs/aufs/dir.c
@@ -0,0 +1,493 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * directory operations
+ */
+
+#include <linux/fs_stack.h>
+#include "aufs.h"
+
+void au_add_nlink(struct inode *dir, struct inode *h_dir)
+{
+	AuDebugOn(!S_ISDIR(dir->i_mode) || !S_ISDIR(h_dir->i_mode));
+
+	dir->i_nlink += h_dir->i_nlink - 2;
+	if (h_dir->i_nlink < 2)
+		dir->i_nlink += 2;
+}
+
+void au_sub_nlink(struct inode *dir, struct inode *h_dir)
+{
+	AuDebugOn(!S_ISDIR(dir->i_mode) || !S_ISDIR(h_dir->i_mode));
+
+	dir->i_nlink -= h_dir->i_nlink - 2;
+	if (h_dir->i_nlink < 2)
+		dir->i_nlink -= 2;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int reopen_dir(struct file *file)
+{
+	int err;
+	unsigned int flags;
+	aufs_bindex_t bindex, btail, bstart;
+	struct dentry *dentry, *h_dentry;
+	struct file *h_file;
+
+	/* open all lower dirs */
+	dentry = file->f_dentry;
+	bstart = au_dbstart(dentry);
+	for (bindex = au_fbstart(file); bindex < bstart; bindex++)
+		au_set_h_fptr(file, bindex, NULL);
+	au_set_fbstart(file, bstart);
+
+	btail = au_dbtaildir(dentry);
+	for (bindex = au_fbend(file); btail < bindex; bindex--)
+		au_set_h_fptr(file, bindex, NULL);
+	au_set_fbend(file, btail);
+
+	flags = file->f_flags;
+	for (bindex = bstart; bindex <= btail; bindex++) {
+		h_dentry = au_h_dptr(dentry, bindex);
+		if (!h_dentry)
+			continue;
+		h_file = au_h_fptr(file, bindex);
+		if (h_file)
+			continue;
+
+		h_file = au_h_open(dentry, bindex, flags, file);
+		err = PTR_ERR(h_file);
+		if (IS_ERR(h_file))
+			goto out; /* close all? */
+		au_set_h_fptr(file, bindex, h_file);
+	}
+	au_update_figen(file);
+	/* todo: necessary? */
+	/* file->f_ra = h_file->f_ra; */
+	err = 0;
+
+ out:
+	return err;
+}
+
+static int do_open_dir(struct file *file, int flags)
+{
+	int err;
+	aufs_bindex_t bindex, btail;
+	struct dentry *dentry, *h_dentry;
+	struct file *h_file;
+
+	err = 0;
+	dentry = file->f_dentry;
+	au_set_fvdir_cache(file, NULL);
+	au_fi(file)->fi_maintain_plink = 0;
+	file->f_version = dentry->d_inode->i_version;
+	bindex = au_dbstart(dentry);
+	au_set_fbstart(file, bindex);
+	btail = au_dbtaildir(dentry);
+	au_set_fbend(file, btail);
+	for (; !err && bindex <= btail; bindex++) {
+		h_dentry = au_h_dptr(dentry, bindex);
+		if (!h_dentry)
+			continue;
+
+		h_file = au_h_open(dentry, bindex, flags, file);
+		if (IS_ERR(h_file)) {
+			err = PTR_ERR(h_file);
+			break;
+		}
+		au_set_h_fptr(file, bindex, h_file);
+	}
+	au_update_figen(file);
+	/* todo: necessary? */
+	/* file->f_ra = h_file->f_ra; */
+	if (!err)
+		return 0; /* success */
+
+	/* close all */
+	for (bindex = au_fbstart(file); bindex <= btail; bindex++)
+		au_set_h_fptr(file, bindex, NULL);
+	au_set_fbstart(file, -1);
+	au_set_fbend(file, -1);
+	return err;
+}
+
+static int aufs_open_dir(struct inode *inode __maybe_unused,
+			 struct file *file)
+{
+	return au_do_open(file, do_open_dir);
+}
+
+static int aufs_release_dir(struct inode *inode __maybe_unused,
+			    struct file *file)
+{
+	struct au_vdir *vdir_cache;
+	struct super_block *sb;
+	struct au_sbinfo *sbinfo;
+
+	sb = file->f_dentry->d_sb;
+	si_noflush_read_lock(sb);
+	fi_write_lock(file);
+	vdir_cache = au_fvdir_cache(file);
+	if (vdir_cache)
+		au_vdir_free(vdir_cache);
+	if (au_fi(file)->fi_maintain_plink) {
+		sbinfo = au_sbi(sb);
+		au_fclr_si(sbinfo, MAINTAIN_PLINK);
+		wake_up_all(&sbinfo->si_plink_wq);
+	}
+	fi_write_unlock(file);
+	au_finfo_fin(file);
+	si_read_unlock(sb);
+	return 0;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int au_do_fsync_dir_no_file(struct dentry *dentry, int datasync)
+{
+	int err;
+	aufs_bindex_t bend, bindex;
+	struct inode *inode;
+	struct super_block *sb;
+
+	err = 0;
+	sb = dentry->d_sb;
+	inode = dentry->d_inode;
+	IMustLock(inode);
+	bend = au_dbend(dentry);
+	for (bindex = au_dbstart(dentry); !err && bindex <= bend; bindex++) {
+		struct path h_path;
+		struct inode *h_inode;
+		struct file_operations *fop;
+
+		if (au_test_ro(sb, bindex, inode))
+			continue;
+		h_path.dentry = au_h_dptr(dentry, bindex);
+		if (!h_path.dentry)
+			continue;
+		h_inode = h_path.dentry->d_inode;
+		if (!h_inode)
+			continue;
+
+		/* no mnt_want_write() */
+		/* cf. fs/nsfd/vfs.c and fs/nfsd/nfs4recover.c */
+		/* todo: inotiry fired? */
+		h_path.mnt = au_sbr_mnt(sb, bindex);
+		mutex_lock(&h_inode->i_mutex);
+		fop = (void *)h_inode->i_fop;
+		err = filemap_fdatawrite(h_inode->i_mapping);
+		if (!err && fop && fop->fsync)
+			err = fop->fsync(NULL, h_path.dentry, datasync);
+		if (!err)
+			err = filemap_fdatawrite(h_inode->i_mapping);
+		if (!err)
+			vfsub_update_h_iattr(&h_path, /*did*/NULL); /*ignore*/
+		mutex_unlock(&h_inode->i_mutex);
+	}
+
+	return err;
+}
+
+static int au_do_fsync_dir(struct file *file, int datasync)
+{
+	int err;
+	aufs_bindex_t bend, bindex;
+	struct file *h_file;
+	struct super_block *sb;
+	struct inode *inode;
+	struct mutex *h_mtx;
+
+	err = au_reval_and_lock_fdi(file, reopen_dir, /*wlock*/1);
+	if (unlikely(err))
+		goto out;
+
+	sb = file->f_dentry->d_sb;
+	inode = file->f_dentry->d_inode;
+	bend = au_fbend(file);
+	for (bindex = au_fbstart(file); !err && bindex <= bend; bindex++) {
+		h_file = au_h_fptr(file, bindex);
+		if (!h_file || au_test_ro(sb, bindex, inode))
+			continue;
+
+		err = vfs_fsync(h_file, h_file->f_dentry, datasync);
+		if (!err) {
+			h_mtx = &h_file->f_dentry->d_inode->i_mutex;
+			mutex_lock(h_mtx);
+			vfsub_update_h_iattr(&h_file->f_path, /*did*/NULL);
+			/*ignore*/
+			mutex_unlock(h_mtx);
+		}
+	}
+
+ out:
+	return err;
+}
+
+/*
+ * @file may be NULL
+ */
+static int aufs_fsync_dir(struct file *file, struct dentry *dentry,
+			  int datasync)
+{
+	int err;
+	struct super_block *sb;
+
+	IMustLock(dentry->d_inode);
+
+	err = 0;
+	sb = dentry->d_sb;
+	si_noflush_read_lock(sb);
+	if (file)
+		err = au_do_fsync_dir(file, datasync);
+	else {
+		di_write_lock_child(dentry);
+		err = au_do_fsync_dir_no_file(dentry, datasync);
+	}
+	au_cpup_attr_timesizes(dentry->d_inode);
+	di_write_unlock(dentry);
+	if (file)
+		fi_write_unlock(file);
+
+	si_read_unlock(sb);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int aufs_readdir(struct file *file, void *dirent, filldir_t filldir)
+{
+	int err;
+	struct dentry *dentry;
+	struct inode *inode;
+	struct super_block *sb;
+
+	dentry = file->f_dentry;
+	inode = dentry->d_inode;
+	IMustLock(inode);
+
+	sb = dentry->d_sb;
+	si_read_lock(sb, AuLock_FLUSH);
+	err = au_reval_and_lock_fdi(file, reopen_dir, /*wlock*/1);
+	if (unlikely(err))
+		goto out;
+	err = au_vdir_init(file);
+	di_downgrade_lock(dentry, AuLock_IR);
+	if (unlikely(err))
+		goto out_unlock;
+
+	err = au_vdir_fill_de(file, dirent, filldir);
+	fsstack_copy_attr_atime(inode, au_h_iptr(inode, au_ibstart(inode)));
+
+ out_unlock:
+	di_read_unlock(dentry, AuLock_IR);
+	fi_write_unlock(file);
+ out:
+	si_read_unlock(sb);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+#define AuTestEmpty_WHONLY	1
+#define AuTestEmpty_CALLED	(1 << 2)
+#define au_ftest_testempty(flags, name)	((flags) & AuTestEmpty_##name)
+#define au_fset_testempty(flags, name)	{ (flags) |= AuTestEmpty_##name; }
+#define au_fclr_testempty(flags, name)	{ (flags) &= ~AuTestEmpty_##name; }
+
+struct test_empty_arg {
+	struct au_nhash *whlist;
+	unsigned int flags;
+	int err;
+	aufs_bindex_t bindex;
+};
+
+static int test_empty_cb(void *__arg, const char *__name, int namelen,
+			 loff_t offset __maybe_unused, u64 ino __maybe_unused,
+			 unsigned int d_type __maybe_unused)
+{
+	struct test_empty_arg *arg = __arg;
+	char *name = (void *)__name;
+
+	arg->err = 0;
+	au_fset_testempty(arg->flags, CALLED);
+	/* smp_mb(); */
+	if (name[0] == '.'
+	    && (namelen == 1 || (name[1] == '.' && namelen == 2)))
+		goto out; /* success */
+
+	if (namelen <= AUFS_WH_PFX_LEN
+	    || memcmp(name, AUFS_WH_PFX, AUFS_WH_PFX_LEN)) {
+		if (au_ftest_testempty(arg->flags, WHONLY)
+		    && !au_nhash_test_known_wh(arg->whlist, name, namelen))
+			arg->err = -ENOTEMPTY;
+		goto out;
+	}
+
+	name += AUFS_WH_PFX_LEN;
+	namelen -= AUFS_WH_PFX_LEN;
+	if (!au_nhash_test_known_wh(arg->whlist, name, namelen))
+		arg->err = au_nhash_append_wh
+			(arg->whlist, name, namelen, arg->bindex);
+
+ out:
+	/* smp_mb(); */
+	AuTraceErr(arg->err);
+	return arg->err;
+}
+
+static int do_test_empty(struct dentry *dentry, struct test_empty_arg *arg)
+{
+	int err;
+	struct file *h_file;
+
+	h_file = au_h_open(dentry, arg->bindex,
+			   O_RDONLY | O_NONBLOCK | O_DIRECTORY | O_LARGEFILE,
+			   /*file*/NULL);
+	err = PTR_ERR(h_file);
+	if (IS_ERR(h_file))
+		goto out;
+
+	err = 0;
+	if (!au_opt_test(au_mntflags(dentry->d_sb), UDBA_NONE)
+	    && !h_file->f_dentry->d_inode->i_nlink)
+		goto out_put;
+
+	do {
+		arg->err = 0;
+		au_fclr_testempty(arg->flags, CALLED);
+		/* smp_mb(); */
+		err = vfsub_readdir(h_file, test_empty_cb, arg);
+		if (err >= 0)
+			err = arg->err;
+	} while (!err && au_ftest_testempty(arg->flags, CALLED));
+
+ out_put:
+	fput(h_file);
+	au_sbr_put(dentry->d_sb, arg->bindex);
+ out:
+	return err;
+}
+
+struct do_test_empty_args {
+	int *errp;
+	struct dentry *dentry;
+	struct test_empty_arg *arg;
+};
+
+static void call_do_test_empty(void *args)
+{
+	struct do_test_empty_args *a = args;
+	*a->errp = do_test_empty(a->dentry, a->arg);
+}
+
+static int sio_test_empty(struct dentry *dentry, struct test_empty_arg *arg)
+{
+	int err, wkq_err;
+	struct dentry *h_dentry;
+	struct inode *h_inode;
+
+	h_dentry = au_h_dptr(dentry, arg->bindex);
+	h_inode = h_dentry->d_inode;
+	mutex_lock_nested(&h_inode->i_mutex, AuLsc_I_CHILD);
+	err = au_test_h_perm_sio(h_inode, MAY_EXEC | MAY_READ);
+	mutex_unlock(&h_inode->i_mutex);
+	if (!err)
+		err = do_test_empty(dentry, arg);
+	else {
+		struct do_test_empty_args args = {
+			.errp	= &err,
+			.dentry	= dentry,
+			.arg	= arg
+		};
+		unsigned int flags = arg->flags;
+
+		wkq_err = au_wkq_wait(call_do_test_empty, &args);
+		if (unlikely(wkq_err))
+			err = wkq_err;
+		arg->flags = flags;
+	}
+
+	return err;
+}
+
+int au_test_empty_lower(struct dentry *dentry)
+{
+	int err;
+	aufs_bindex_t bindex, bstart, btail;
+	struct test_empty_arg arg;
+	struct au_nhash *whlist;
+
+	whlist = au_nhash_new(GFP_NOFS);
+	err = PTR_ERR(whlist);
+	if (IS_ERR(whlist))
+		goto out;
+
+	bstart = au_dbstart(dentry);
+	arg.whlist = whlist;
+	arg.flags = 0;
+	arg.bindex = bstart;
+	err = do_test_empty(dentry, &arg);
+	if (unlikely(err))
+		goto out_whlist;
+
+	au_fset_testempty(arg.flags, WHONLY);
+	btail = au_dbtaildir(dentry);
+	for (bindex = bstart + 1; !err && bindex <= btail; bindex++) {
+		struct dentry *h_dentry;
+
+		h_dentry = au_h_dptr(dentry, bindex);
+		if (h_dentry && h_dentry->d_inode) {
+			arg.bindex = bindex;
+			err = do_test_empty(dentry, &arg);
+		}
+	}
+
+ out_whlist:
+	au_nhash_del(whlist);
+ out:
+	return err;
+}
+
+int au_test_empty(struct dentry *dentry, struct au_nhash *whlist)
+{
+	int err;
+	struct test_empty_arg arg;
+	aufs_bindex_t bindex, btail;
+
+	err = 0;
+	arg.whlist = whlist;
+	arg.flags = AuTestEmpty_WHONLY;
+	btail = au_dbtaildir(dentry);
+	for (bindex = au_dbstart(dentry); !err && bindex <= btail; bindex++) {
+		struct dentry *h_dentry;
+
+		h_dentry = au_h_dptr(dentry, bindex);
+		if (h_dentry && h_dentry->d_inode) {
+			arg.bindex = bindex;
+			err = sio_test_empty(dentry, &arg);
+		}
+	}
+
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+struct file_operations aufs_dir_fop = {
+	.read		= generic_read_dir,
+	.readdir	= aufs_readdir,
+	.unlocked_ioctl	= aufs_ioctl_dir,
+	.open		= aufs_open_dir,
+	.release	= aufs_release_dir,
+	.flush		= aufs_flush,
+	.fsync		= aufs_fsync_dir
+};
diff --git a/fs/aufs/dir.h b/fs/aufs/dir.h
new file mode 100644
index 0000000..1e20ca3
--- /dev/null
+++ b/fs/aufs/dir.h
@@ -0,0 +1,104 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * directory operations
+ */
+
+#ifndef __AUFS_DIR_H__
+#define __AUFS_DIR_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+#include <linux/aufs_type.h>
+
+/* ---------------------------------------------------------------------- */
+
+/* need to be faster and smaller */
+
+#define AuSize_DEBLK	512
+#define AuSize_NHASH	32
+
+typedef char au_vdir_deblk_t[AuSize_DEBLK];
+
+struct au_nhash {
+	struct hlist_head heads[AuSize_NHASH];
+};
+
+struct au_vdir_destr {
+	unsigned char	len;
+	char		name[0];
+} __packed;
+
+struct au_vdir_dehstr {
+	struct hlist_node	hash;
+	struct au_vdir_destr	*str;
+};
+
+struct au_vdir_de {
+	ino_t			de_ino;
+	unsigned char		de_type;
+	/* caution: packed */
+	struct au_vdir_destr	de_str;
+} __packed;
+
+struct au_vdir_wh {
+	struct hlist_node	wh_hash;
+	aufs_bindex_t		wh_bindex;
+	struct au_vdir_destr	wh_str;
+} __packed;
+
+union au_vdir_deblk_p {
+	unsigned char		*p;
+	au_vdir_deblk_t		*deblk;
+	struct au_vdir_de	*de;
+};
+
+struct au_vdir {
+	au_vdir_deblk_t	**vd_deblk;
+	int		vd_nblk;
+	struct {
+		int			i;
+		union au_vdir_deblk_p	p;
+	} vd_last;
+
+	unsigned long	vd_version;
+	unsigned long	vd_jiffy;
+};
+
+/* ---------------------------------------------------------------------- */
+
+/* dir.c */
+extern struct file_operations aufs_dir_fop;
+void au_add_nlink(struct inode *dir, struct inode *h_dir);
+void au_sub_nlink(struct inode *dir, struct inode *h_dir);
+int au_test_empty_lower(struct dentry *dentry);
+int au_test_empty(struct dentry *dentry, struct au_nhash *whlist);
+
+/* vdir.c */
+struct au_nhash *au_nhash_new(gfp_t gfp);
+void au_nhash_del(struct au_nhash *nhash);
+void au_nhash_init(struct au_nhash *nhash);
+void au_nhash_move(struct au_nhash *dst, struct au_nhash *src);
+void au_nhash_fin(struct au_nhash *nhash);
+int au_nhash_test_longer_wh(struct au_nhash *whlist, aufs_bindex_t btgt,
+			    int limit);
+int au_nhash_test_known_wh(struct au_nhash *whlist, char *name, int namelen);
+int au_nhash_append_wh(struct au_nhash *whlist, char *name, int namelen,
+		       aufs_bindex_t bindex);
+void au_vdir_free(struct au_vdir *vdir);
+int au_vdir_init(struct file *file);
+int au_vdir_fill_de(struct file *file, void *dirent, filldir_t filldir);
+
+/* ioctl.c */
+long aufs_ioctl_dir(struct file *file, unsigned int cmd, unsigned long arg);
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_DIR_H__ */
diff --git a/fs/aufs/vdir.c b/fs/aufs/vdir.c
new file mode 100644
index 0000000..b0580c2
--- /dev/null
+++ b/fs/aufs/vdir.c
@@ -0,0 +1,776 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * virtual or vertical directory
+ */
+
+#include "aufs.h"
+
+static int calc_size(int namelen)
+{
+	int sz;
+	const int mask = sizeof(ino_t) - 1;
+
+	BUILD_BUG_ON(sizeof(ino_t) != sizeof(long));
+
+	sz = sizeof(struct au_vdir_de) + namelen;
+	if (sz & mask) {
+		sz += sizeof(ino_t);
+		sz &= ~mask;
+	}
+
+	AuDebugOn(sz % sizeof(ino_t));
+	return sz;
+}
+
+static int set_deblk_end(union au_vdir_deblk_p *p,
+			 union au_vdir_deblk_p *deblk_end)
+{
+	if (calc_size(0) <= deblk_end->p - p->p) {
+		p->de->de_str.len = 0;
+		/* smp_mb(); */
+		return 0;
+	}
+	return -1; /* error */
+}
+
+/* returns true or false */
+static int is_deblk_end(union au_vdir_deblk_p *p,
+			union au_vdir_deblk_p *deblk_end)
+{
+	if (calc_size(0) <= deblk_end->p - p->p)
+		return !p->de->de_str.len;
+	return 1;
+}
+
+static au_vdir_deblk_t *last_deblk(struct au_vdir *vdir)
+{
+	return vdir->vd_deblk[vdir->vd_nblk - 1];
+}
+
+void au_nhash_init(struct au_nhash *nhash)
+{
+	int i;
+	struct hlist_head *heads;
+
+	heads = nhash->heads;
+	for (i = 0; i < AuSize_NHASH; i++)
+		INIT_HLIST_HEAD(heads++);
+}
+
+struct au_nhash *au_nhash_new(gfp_t gfp)
+{
+	struct au_nhash *nhash;
+
+	nhash = kmalloc(sizeof(*nhash), gfp);
+	if (nhash) {
+		au_nhash_init(nhash);
+		return nhash;
+	}
+	return ERR_PTR(-ENOMEM);
+}
+
+void au_nhash_del(struct au_nhash *nhash)
+{
+	au_nhash_fin(nhash);
+	kfree(nhash);
+}
+
+void au_nhash_move(struct au_nhash *dst, struct au_nhash *src)
+{
+	int i;
+	struct hlist_head *dsth, *srch;
+
+	*dst = *src;
+	srch = src->heads;
+	dsth = dst->heads;
+	for (i = 0; i < AuSize_NHASH; i++) {
+		if (dsth->first)
+			dsth->first->pprev = &dsth->first;
+		dsth++;
+		INIT_HLIST_HEAD(srch++);
+	}
+	/* smp_mb(); */
+}
+
+/* ---------------------------------------------------------------------- */
+
+void au_nhash_fin(struct au_nhash *whlist)
+{
+	int i;
+	struct hlist_head *head;
+	struct au_vdir_wh *tpos;
+	struct hlist_node *pos, *n;
+
+	head = whlist->heads;
+	for (i = 0; i < AuSize_NHASH; i++) {
+		hlist_for_each_entry_safe(tpos, pos, n, head, wh_hash) {
+			/* hlist_del(pos); */
+			kfree(tpos);
+		}
+		head++;
+	}
+}
+
+int au_nhash_test_longer_wh(struct au_nhash *whlist, aufs_bindex_t btgt,
+			    int limit)
+{
+	int n, i;
+	struct hlist_head *head;
+	struct au_vdir_wh *tpos;
+	struct hlist_node *pos;
+
+	n = 0;
+	head = whlist->heads;
+	for (i = 0; i < AuSize_NHASH; i++) {
+		hlist_for_each_entry(tpos, pos, head, wh_hash)
+			if (tpos->wh_bindex == btgt && ++n > limit)
+				return 1;
+		head++;
+	}
+	return 0;
+}
+
+static unsigned int au_name_hash(const unsigned char *name, unsigned int len)
+{
+	return full_name_hash(name, len) % AuSize_NHASH;
+}
+
+/* returns found or not */
+int au_nhash_test_known_wh(struct au_nhash *whlist, char *name, int namelen)
+{
+	struct hlist_head *head;
+	struct au_vdir_wh *tpos;
+	struct hlist_node *pos;
+	struct au_vdir_destr *str;
+
+	head = whlist->heads + au_name_hash(name, namelen);
+	hlist_for_each_entry(tpos, pos, head, wh_hash) {
+		str = &tpos->wh_str;
+		AuDbg("%.*s\n", str->len, str->name);
+		if (str->len == namelen && !memcmp(str->name, name, namelen))
+			return 1;
+	}
+	return 0;
+}
+
+int au_nhash_append_wh(struct au_nhash *whlist, char *name, int namelen,
+		       aufs_bindex_t bindex)
+{
+	int err;
+	struct au_vdir_destr *str;
+	struct au_vdir_wh *wh;
+
+	err = -ENOMEM;
+	wh = kmalloc(sizeof(*wh) + namelen, GFP_NOFS);
+	if (unlikely(!wh))
+		goto out;
+
+	err = 0;
+	wh->wh_bindex = bindex;
+	str = &wh->wh_str;
+	str->len = namelen;
+	memcpy(str->name, name, namelen);
+	hlist_add_head(&wh->wh_hash,
+		       whlist->heads + au_name_hash(name, namelen));
+	/* smp_mb(); */
+
+ out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+void au_vdir_free(struct au_vdir *vdir)
+{
+	au_vdir_deblk_t **deblk;
+
+	deblk = vdir->vd_deblk;
+	while (vdir->vd_nblk--)
+		kfree(*deblk++);
+	kfree(vdir->vd_deblk);
+	au_cache_free_vdir(vdir);
+}
+
+static int append_deblk(struct au_vdir *vdir)
+{
+	int err, sz, i;
+	au_vdir_deblk_t **o;
+	union au_vdir_deblk_p p, deblk_end;
+
+	err = -ENOMEM;
+	sz = sizeof(*o) * vdir->vd_nblk;
+	o = au_kzrealloc(vdir->vd_deblk, sz, sz + sizeof(*o), GFP_NOFS);
+	if (unlikely(!o))
+		goto out;
+
+	vdir->vd_deblk = o;
+	p.deblk = kmalloc(sizeof(*p.deblk), GFP_NOFS);
+	if (p.deblk) {
+		i = vdir->vd_nblk++;
+		vdir->vd_deblk[i] = p.deblk;
+		vdir->vd_last.i = i;
+		vdir->vd_last.p.p = p.p;
+		deblk_end.deblk = p.deblk + 1;
+		err = set_deblk_end(&p, &deblk_end);
+	}
+
+ out:
+	return err;
+}
+
+static struct au_vdir *alloc_vdir(void)
+{
+	struct au_vdir *vdir;
+	int err;
+
+	err = -ENOMEM;
+	vdir = au_cache_alloc_vdir();
+	if (unlikely(!vdir))
+		goto out;
+
+	vdir->vd_deblk = kzalloc(sizeof(*vdir->vd_deblk), GFP_NOFS);
+	if (unlikely(!vdir->vd_deblk))
+		goto out_free;
+
+	vdir->vd_nblk = 0;
+	vdir->vd_version = 0;
+	vdir->vd_jiffy = 0;
+	err = append_deblk(vdir);
+	if (!err)
+		return vdir; /* success */
+
+	kfree(vdir->vd_deblk);
+
+ out_free:
+	au_cache_free_vdir(vdir);
+ out:
+	vdir = ERR_PTR(err);
+	return vdir;
+}
+
+static int reinit_vdir(struct au_vdir *vdir)
+{
+	int err;
+	union au_vdir_deblk_p p, deblk_end;
+
+	while (vdir->vd_nblk > 1) {
+		kfree(vdir->vd_deblk[vdir->vd_nblk - 1]);
+		vdir->vd_deblk[vdir->vd_nblk - 1] = NULL;
+		vdir->vd_nblk--;
+	}
+	p.deblk = vdir->vd_deblk[0];
+	deblk_end.deblk = p.deblk + 1;
+	err = set_deblk_end(&p, &deblk_end);
+	vdir->vd_version = 0;
+	vdir->vd_jiffy = 0;
+	vdir->vd_last.i = 0;
+	vdir->vd_last.p.deblk = vdir->vd_deblk[0];
+	/* smp_mb(); */
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static void free_dehlist(struct au_nhash *dehlist)
+{
+	int i;
+	struct hlist_head *head;
+	struct au_vdir_dehstr *tpos;
+	struct hlist_node *pos, *n;
+
+	head = dehlist->heads;
+	for (i = 0; i < AuSize_NHASH; i++) {
+		hlist_for_each_entry_safe(tpos, pos, n, head, hash) {
+			/* hlist_del(pos); */
+			au_cache_free_dehstr(tpos);
+		}
+		head++;
+	}
+}
+
+/* returns found(true) or not */
+static int test_known(struct au_nhash *delist, char *name, int namelen)
+{
+	struct hlist_head *head;
+	struct au_vdir_dehstr *tpos;
+	struct hlist_node *pos;
+	struct au_vdir_destr *str;
+
+	head = delist->heads + au_name_hash(name, namelen);
+	hlist_for_each_entry(tpos, pos, head, hash) {
+		str = tpos->str;
+		AuDbg("%.*s\n", str->len, str->name);
+		if (str->len == namelen && !memcmp(str->name, name, namelen))
+			return 1;
+	}
+	return 0;
+
+}
+
+static int append_de(struct au_vdir *vdir, char *name, int namelen, ino_t ino,
+		     unsigned int d_type, struct au_nhash *delist)
+{
+	int err, sz;
+	union au_vdir_deblk_p p, *room, deblk_end;
+	struct au_vdir_dehstr *dehstr;
+
+	p.deblk = last_deblk(vdir);
+	deblk_end.deblk = p.deblk + 1;
+	room = &vdir->vd_last.p;
+	AuDebugOn(room->p < p.p || deblk_end.p <= room->p
+		  || !is_deblk_end(room, &deblk_end));
+
+	sz = calc_size(namelen);
+	if (unlikely(sz > deblk_end.p - room->p)) {
+		err = append_deblk(vdir);
+		if (unlikely(err))
+			goto out;
+
+		p.deblk = last_deblk(vdir);
+		deblk_end.deblk = p.deblk + 1;
+		/* smp_mb(); */
+		AuDebugOn(room->p != p.p);
+	}
+
+	err = -ENOMEM;
+	dehstr = au_cache_alloc_dehstr();
+	if (unlikely(!dehstr))
+		goto out;
+
+	dehstr->str = &room->de->de_str;
+	hlist_add_head(&dehstr->hash,
+		       delist->heads + au_name_hash(name, namelen));
+	room->de->de_ino = ino;
+	room->de->de_type = d_type;
+	room->de->de_str.len = namelen;
+	memcpy(room->de->de_str.name, name, namelen);
+
+	err = 0;
+	room->p += sz;
+	if (unlikely(set_deblk_end(room, &deblk_end)))
+		err = append_deblk(vdir);
+	/* smp_mb(); */
+
+ out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int au_ino(struct super_block *sb, aufs_bindex_t bindex, ino_t h_ino,
+		  unsigned int d_type, ino_t *ino)
+{
+	int err;
+	struct mutex *mtx;
+	const int isdir = (d_type == DT_DIR);
+
+	/* prevent hardlinks from race condition */
+	mtx = NULL;
+	if (!isdir) {
+		mtx = &au_sbr(sb, bindex)->br_xino.xi_nondir_mtx;
+		mutex_lock(mtx);
+	}
+	err = au_xino_read(sb, bindex, h_ino, ino);
+	if (unlikely(err))
+		goto out;
+
+	if (!*ino) {
+		err = -EIO;
+		*ino = au_xino_new_ino(sb);
+		if (unlikely(!*ino))
+			goto out;
+		err = au_xino_write(sb, bindex, h_ino, *ino);
+		if (unlikely(err))
+			goto out;
+	}
+
+ out:
+	if (!isdir)
+		mutex_unlock(mtx);
+	return err;
+}
+
+#define AuFillVdir_CALLED	1
+#define au_ftest_fillvdir(flags, name)	((flags) & AuFillVdir_##name)
+#define au_fset_fillvdir(flags, name)	{ (flags) |= AuFillVdir_##name; }
+#define au_fclr_fillvdir(flags, name)	{ (flags) &= ~AuFillVdir_##name; }
+
+struct fillvdir_arg {
+	struct file		*file;
+	struct au_vdir		*vdir;
+	struct au_nhash		*delist;
+	struct au_nhash		*whlist;
+	aufs_bindex_t		bindex;
+	unsigned int		flags;
+	int			err;
+};
+
+static int fillvdir(void *__arg, const char *__name, int namelen,
+		    loff_t offset __maybe_unused, u64 h_ino,
+		    unsigned int d_type)
+{
+	struct fillvdir_arg *arg = __arg;
+	char *name = (void *)__name;
+	struct super_block *sb;
+	struct au_nhash *delist, *whlist;
+	ino_t ino;
+	aufs_bindex_t bindex, bend;
+
+	bend = arg->bindex;
+	arg->err = 0;
+	au_fset_fillvdir(arg->flags, CALLED);
+	/* smp_mb(); */
+	if (namelen <= AUFS_WH_PFX_LEN
+	    || memcmp(name, AUFS_WH_PFX, AUFS_WH_PFX_LEN)) {
+		delist = arg->delist;
+		for (bindex = 0; bindex < bend; bindex++)
+			if (test_known(delist++, name, namelen)
+			    || au_nhash_test_known_wh(arg->whlist + bindex,
+						      name, namelen))
+				goto out; /* already exists or whiteouted */
+
+		ino = 1; /* why does gcc warn? */
+		sb = arg->file->f_dentry->d_sb;
+		arg->err = au_ino(sb, bend, h_ino, d_type, &ino);
+		if (!arg->err)
+			arg->err = append_de(arg->vdir, name, namelen, ino,
+					     d_type, arg->delist + bend);
+	} else {
+		name += AUFS_WH_PFX_LEN;
+		namelen -= AUFS_WH_PFX_LEN;
+		whlist = arg->whlist;
+		for (bindex = 0; bindex < bend; bindex++)
+			if (au_nhash_test_known_wh(whlist++, name, namelen))
+				goto out; /* already whiteouted */
+
+		ino = 1; /* dummy */
+		if (!arg->err)
+			arg->err = au_nhash_append_wh
+				(arg->whlist + bend, name, namelen, bend);
+	}
+
+ out:
+	if (!arg->err)
+		arg->vdir->vd_jiffy = jiffies;
+	/* smp_mb(); */
+	AuTraceErr(arg->err);
+	return arg->err;
+}
+
+static int au_do_read_vdir(struct fillvdir_arg *arg)
+{
+	int err;
+	loff_t offset;
+	aufs_bindex_t bend, bindex, bstart;
+	struct file *hf, *file;
+	struct au_nhash *delist, *whlist;
+
+	err = -ENOMEM;
+	bend = au_fbend(arg->file);
+	arg->delist = kmalloc(sizeof(*arg->delist) * (bend + 1), GFP_NOFS);
+	if (unlikely(!arg->delist))
+		goto out;
+	arg->whlist = kmalloc(sizeof(*arg->whlist) * (bend + 1), GFP_NOFS);
+	if (unlikely(!arg->whlist))
+		goto out_delist;
+
+	err = 0;
+	delist = arg->delist;
+	whlist = arg->whlist;
+	for (bindex = 0; bindex <= bend; bindex++) {
+		au_nhash_init(delist++);
+		au_nhash_init(whlist++);
+	}
+
+	arg->flags = 0;
+	file = arg->file;
+	bstart = au_fbstart(file);
+	for (bindex = bstart; !err && bindex <= bend; bindex++) {
+		hf = au_h_fptr(file, bindex);
+		if (!hf)
+			continue;
+
+		offset = vfsub_llseek(hf, 0, SEEK_SET);
+		err = offset;
+		if (unlikely(offset))
+			break;
+
+		arg->bindex = bindex;
+		do {
+			arg->err = 0;
+			au_fclr_fillvdir(arg->flags, CALLED);
+			/* smp_mb(); */
+			err = vfsub_readdir(hf, fillvdir, arg);
+			if (err >= 0)
+				err = arg->err;
+		} while (!err && au_ftest_fillvdir(arg->flags, CALLED));
+	}
+
+	delist = arg->delist + bstart;
+	whlist = arg->whlist + bstart;
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		free_dehlist(delist++);
+		au_nhash_fin(whlist++);
+	}
+	kfree(arg->whlist);
+
+ out_delist:
+	kfree(arg->delist);
+ out:
+	return err;
+}
+
+static int read_vdir(struct file *file, int may_read)
+{
+	int err;
+	unsigned long expire;
+	unsigned char do_read;
+	struct fillvdir_arg arg;
+	struct inode *inode;
+	struct au_vdir *vdir, *allocated;
+
+	err = 0;
+	inode = file->f_dentry->d_inode;
+	IMustLock(inode);
+	allocated = NULL;
+	do_read = 0;
+	expire = au_sbi(inode->i_sb)->si_rdcache;
+	vdir = au_ivdir(inode);
+	if (!vdir) {
+		do_read = 1;
+		vdir = alloc_vdir();
+		err = PTR_ERR(vdir);
+		if (IS_ERR(vdir))
+			goto out;
+		err = 0;
+		allocated = vdir;
+	} else if (may_read
+		   && (inode->i_version != vdir->vd_version
+		       || time_after(jiffies, vdir->vd_jiffy + expire))) {
+		do_read = 1;
+		err = reinit_vdir(vdir);
+		if (unlikely(err))
+			goto out;
+	}
+
+	if (!do_read)
+		return 0; /* success */
+
+	arg.file = file;
+	arg.vdir = vdir;
+	err = au_do_read_vdir(&arg);
+	if (!err) {
+		/* file->f_pos = 0; */
+		vdir->vd_version = inode->i_version;
+		vdir->vd_last.i = 0;
+		vdir->vd_last.p.deblk = vdir->vd_deblk[0];
+		if (allocated)
+			au_set_ivdir(inode, allocated);
+	} else if (allocated)
+		au_vdir_free(allocated);
+
+ out:
+	return err;
+}
+
+static int copy_vdir(struct au_vdir *tgt, struct au_vdir *src)
+{
+	int err, i, rerr, n;
+
+	AuDebugOn(tgt->vd_nblk != 1);
+
+	err = -ENOMEM;
+	if (tgt->vd_nblk < src->vd_nblk) {
+		au_vdir_deblk_t **p;
+
+		p = au_kzrealloc(tgt->vd_deblk, sizeof(*p) * tgt->vd_nblk,
+				 sizeof(*p) * src->vd_nblk, GFP_NOFS);
+		if (unlikely(!p))
+			goto out;
+		tgt->vd_deblk = p;
+	}
+
+	tgt->vd_nblk = src->vd_nblk;
+	n = src->vd_nblk;
+	memcpy(tgt->vd_deblk[0], src->vd_deblk[0], AuSize_DEBLK);
+	/* tgt->vd_last.i = 0; */
+	/* tgt->vd_last.p.deblk = tgt->vd_deblk[0]; */
+	tgt->vd_version = src->vd_version;
+	tgt->vd_jiffy = src->vd_jiffy;
+
+	for (i = 1; i < n; i++) {
+		tgt->vd_deblk[i] = kmalloc(AuSize_DEBLK, GFP_NOFS);
+		if (tgt->vd_deblk[i])
+			memcpy(tgt->vd_deblk[i], src->vd_deblk[i],
+			       AuSize_DEBLK);
+		else
+			goto out;
+	}
+	/* smp_mb(); */
+	return 0; /* success */
+
+ out:
+	rerr = reinit_vdir(tgt);
+	BUG_ON(rerr);
+	return err;
+}
+
+int au_vdir_init(struct file *file)
+{
+	int err;
+	struct inode *inode;
+	struct au_vdir *vdir_cache, *allocated;
+
+	err = read_vdir(file, !file->f_pos);
+	if (unlikely(err))
+		goto out;
+
+	allocated = NULL;
+	vdir_cache = au_fvdir_cache(file);
+	if (!vdir_cache) {
+		vdir_cache = alloc_vdir();
+		err = PTR_ERR(vdir_cache);
+		if (IS_ERR(vdir_cache))
+			goto out;
+		allocated = vdir_cache;
+	} else if (!file->f_pos && vdir_cache->vd_version != file->f_version) {
+		err = reinit_vdir(vdir_cache);
+		if (unlikely(err))
+			goto out;
+	} else
+		return 0; /* success */
+
+	inode = file->f_dentry->d_inode;
+	err = copy_vdir(vdir_cache, au_ivdir(inode));
+	if (!err) {
+		file->f_version = inode->i_version;
+		if (allocated)
+			au_set_fvdir_cache(file, allocated);
+	} else if (allocated)
+		au_vdir_free(allocated);
+
+ out:
+	return err;
+}
+
+static loff_t calc_offset(struct au_vdir *vdir)
+{
+	loff_t offset;
+	union au_vdir_deblk_p p;
+
+	p.deblk = vdir->vd_deblk[vdir->vd_last.i];
+	offset = vdir->vd_last.p.p - p.p;
+	offset += sizeof(*p.deblk) * vdir->vd_last.i;
+	return offset;
+}
+
+/* returns true or false */
+static int seek_vdir(struct file *file)
+{
+	int valid, i, n;
+	loff_t offset;
+	union au_vdir_deblk_p p, deblk_end;
+	struct au_vdir *vdir_cache;
+
+	valid = 1;
+	vdir_cache = au_fvdir_cache(file);
+	offset = calc_offset(vdir_cache);
+	AuDbg("offset %lld\n", offset);
+	if (file->f_pos == offset)
+		goto out;
+
+	vdir_cache->vd_last.i = 0;
+	vdir_cache->vd_last.p.deblk = vdir_cache->vd_deblk[0];
+	if (!file->f_pos)
+		goto out;
+
+	valid = 0;
+	i = file->f_pos / AuSize_DEBLK;
+	AuDbg("i %d\n", i);
+	if (i >= vdir_cache->vd_nblk)
+		goto out;
+
+	n = vdir_cache->vd_nblk;
+	for (; i < n; i++) {
+		p.deblk = vdir_cache->vd_deblk[i];
+		deblk_end.deblk = p.deblk + 1;
+		offset = i;
+		offset *= AuSize_DEBLK;
+		while (!is_deblk_end(&p, &deblk_end) && offset < file->f_pos) {
+			int l;
+
+			l = calc_size(p.de->de_str.len);
+			offset += l;
+			p.p += l;
+		}
+		if (!is_deblk_end(&p, &deblk_end)) {
+			valid = 1;
+			vdir_cache->vd_last.i = i;
+			vdir_cache->vd_last.p = p;
+			break;
+		}
+	}
+
+ out:
+	/* smp_mb(); */
+	AuTraceErr(!valid);
+	return valid;
+}
+
+int au_vdir_fill_de(struct file *file, void *dirent, filldir_t filldir)
+{
+	int err, l;
+	union au_vdir_deblk_p deblk_end;
+	struct au_vdir *vdir_cache;
+	struct au_vdir_de *de;
+
+	BUILD_BUG_ON(AuSize_DEBLK < NAME_MAX || PAGE_SIZE < AuSize_DEBLK);
+
+	vdir_cache = au_fvdir_cache(file);
+	if (!seek_vdir(file))
+		return 0;
+
+	while (1) {
+		deblk_end.deblk
+			= vdir_cache->vd_deblk[vdir_cache->vd_last.i] + 1;
+		while (!is_deblk_end(&vdir_cache->vd_last.p, &deblk_end)) {
+			de = vdir_cache->vd_last.p.de;
+			AuDbg("%.*s, off%lld, i%lu, dt%d\n",
+			      de->de_str.len, de->de_str.name,
+			      file->f_pos, (unsigned long)de->de_ino,
+			      de->de_type);
+			err = filldir(dirent, de->de_str.name, de->de_str.len,
+				      file->f_pos, de->de_ino, de->de_type);
+			if (unlikely(err)) {
+				AuTraceErr(err);
+				/* todo: ignore the error caused by udba? */
+				/* return err; */
+				return 0;
+			}
+
+			l = calc_size(de->de_str.len);
+			vdir_cache->vd_last.p.p += l;
+			file->f_pos += l;
+		}
+		if (vdir_cache->vd_last.i < vdir_cache->vd_nblk - 1) {
+			vdir_cache->vd_last.i++;
+			vdir_cache->vd_last.p.deblk
+				= vdir_cache->vd_deblk[vdir_cache->vd_last.i];
+			file->f_pos = sizeof(*vdir_cache->vd_last.p.deblk)
+				* vdir_cache->vd_last.i;
+			continue;
+		}
+		break;
+	}
+
+	/* smp_mb(); */
+	return 0;
+}
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 19/28] aufs inode
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (17 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 18/28] aufs direcotry J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 20/28] aufs ioctl J. R. Okajima
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
inode operations and private data

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/i_op.c     |  857 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/i_op_add.c |  625 +++++++++++++++++++++++++++++++++++
 fs/aufs/i_op_del.c |  471 ++++++++++++++++++++++++++
 fs/aufs/i_op_ren.c |  929 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/iinfo.c    |  257 +++++++++++++++
 fs/aufs/inode.c    |  356 ++++++++++++++++++++
 fs/aufs/inode.h    |  471 ++++++++++++++++++++++++++
 7 files changed, 3966 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/i_op.c
 create mode 100644 fs/aufs/i_op_add.c
 create mode 100644 fs/aufs/i_op_del.c
 create mode 100644 fs/aufs/i_op_ren.c
 create mode 100644 fs/aufs/iinfo.c
 create mode 100644 fs/aufs/inode.c
 create mode 100644 fs/aufs/inode.h

diff --git a/fs/aufs/i_op.c b/fs/aufs/i_op.c
new file mode 100644
index 0000000..e7a7279
--- /dev/null
+++ b/fs/aufs/i_op.c
@@ -0,0 +1,857 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * inode operations (except add/del/rename)
+ */
+
+#include <linux/device_cgroup.h>
+#include <linux/fs_stack.h>
+#include <linux/uaccess.h>
+#include "aufs.h"
+
+static int h_permission(struct inode *h_inode, int mask,
+			struct vfsmount *h_mnt, int brperm)
+{
+	int err;
+	const unsigned char write_mask = !!(mask & (MAY_WRITE | MAY_APPEND));
+
+	err = -EACCES;
+	if ((write_mask && IS_IMMUTABLE(h_inode))
+	    || ((mask & MAY_EXEC)
+		&& S_ISREG(h_inode->i_mode)
+		&& ((h_mnt->mnt_flags & MNT_NOEXEC)
+		    || !(h_inode->i_mode & S_IXUGO))))
+		goto out;
+
+	/*
+	 * - skip the lower fs test in the case of write to ro branch.
+	 * - nfs dir permission write check is optimized, but a policy for
+	 *   link/rename requires a real check.
+	 */
+	if ((write_mask && !au_br_writable(brperm))
+	    || (au_test_nfs(h_inode->i_sb) && S_ISDIR(h_inode->i_mode)
+		&& write_mask && !(mask & MAY_READ))
+	    || !h_inode->i_op->permission) {
+		/* AuLabel(generic_permission); */
+		err = generic_permission(h_inode, mask, NULL);
+	} else {
+		/* AuLabel(h_inode->permission); */
+		err = h_inode->i_op->permission(h_inode, mask);
+		AuTraceErr(err);
+	}
+
+	if (!err)
+		err = devcgroup_inode_permission(h_inode, mask);
+	if (!err)
+		err = security_inode_permission
+			(h_inode, mask & (MAY_READ | MAY_WRITE | MAY_EXEC
+					  | MAY_APPEND));
+
+ out:
+	return err;
+}
+
+static int aufs_permission(struct inode *inode, int mask)
+{
+	int err;
+	aufs_bindex_t bindex, bend;
+	const unsigned char isdir = !!S_ISDIR(inode->i_mode);
+	const unsigned char write_mask = !!(mask & (MAY_WRITE | MAY_APPEND));
+	struct inode *h_inode;
+	struct super_block *sb;
+	struct au_branch *br;
+
+	sb = inode->i_sb;
+	si_read_lock(sb, AuLock_FLUSH);
+	ii_read_lock_child(inode);
+
+	if (!isdir || write_mask) {
+		h_inode = au_h_iptr(inode, au_ibstart(inode));
+		AuDebugOn(!h_inode
+			  || ((h_inode->i_mode & S_IFMT)
+			      != (inode->i_mode & S_IFMT)));
+		err = 0;
+		bindex = au_ibstart(inode);
+		br = au_sbr(sb, bindex);
+		err = h_permission(h_inode, mask, br->br_mnt, br->br_perm);
+
+		if (write_mask && !err) {
+			/* test whether the upper writable branch exists */
+			err = -EROFS;
+			for (; bindex >= 0; bindex--)
+				if (!au_br_rdonly(au_sbr(sb, bindex))) {
+					err = 0;
+					break;
+				}
+		}
+		goto out;
+	}
+
+	/* non-write to dir */
+	err = 0;
+	bend = au_ibend(inode);
+	for (bindex = au_ibstart(inode); !err && bindex <= bend; bindex++) {
+		h_inode = au_h_iptr(inode, bindex);
+		if (h_inode) {
+			AuDebugOn(!S_ISDIR(h_inode->i_mode));
+			br = au_sbr(sb, bindex);
+			err = h_permission(h_inode, mask, br->br_mnt,
+					   br->br_perm);
+		}
+	}
+
+ out:
+	ii_read_unlock(inode);
+	si_read_unlock(sb);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static struct dentry *aufs_lookup(struct inode *dir, struct dentry *dentry,
+				  struct nameidata *nd)
+{
+	struct dentry *ret, *parent;
+	struct inode *inode, *h_inode;
+	struct mutex *mtx;
+	struct super_block *sb;
+	int err, npositive;
+	aufs_bindex_t bstart;
+
+	IMustLock(dir);
+
+	sb = dir->i_sb;
+	si_read_lock(sb, AuLock_FLUSH);
+	err = au_alloc_dinfo(dentry);
+	ret = ERR_PTR(err);
+	if (unlikely(err))
+		goto out;
+
+	parent = dentry->d_parent; /* dir inode is locked */
+	di_read_lock_parent(parent, AuLock_IR);
+	npositive = au_lkup_dentry(dentry, au_dbstart(parent), /*type*/0, nd);
+	di_read_unlock(parent, AuLock_IR);
+	err = npositive;
+	ret = ERR_PTR(err);
+	if (unlikely(err < 0))
+		goto out_unlock;
+
+	inode = NULL;
+	if (npositive) {
+		bstart = au_dbstart(dentry);
+		h_inode = au_h_dptr(dentry, bstart)->d_inode;
+		if (!S_ISDIR(h_inode->i_mode)) {
+			/*
+			 * stop 'race'-ing between hardlinks under different
+			 * parents.
+			 */
+			mtx = &au_sbr(sb, bstart)->br_xino.xi_nondir_mtx;
+			mutex_lock(mtx);
+			inode = au_new_inode(dentry, /*must_new*/0);
+			mutex_unlock(mtx);
+		} else
+			inode = au_new_inode(dentry, /*must_new*/0);
+		ret = (void *)inode;
+	}
+	if (IS_ERR(inode))
+		goto out_unlock;
+
+	ret = d_splice_alias(inode, dentry);
+	if (unlikely(IS_ERR(ret) && inode))
+		ii_write_unlock(inode);
+	au_store_oflag(nd);
+
+ out_unlock:
+	di_write_unlock(dentry);
+ out:
+	si_read_unlock(sb);
+	return ret;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int au_wr_dir_cpup(struct dentry *dentry, struct dentry *parent,
+			  const unsigned char add_entry, aufs_bindex_t bcpup,
+			  aufs_bindex_t bstart)
+{
+	int err;
+	struct dentry *h_parent;
+	struct inode *h_dir;
+
+	if (add_entry) {
+		au_update_dbstart(dentry);
+		IMustLock(parent->d_inode);
+	} else
+		di_write_lock_parent(parent);
+
+	err = 0;
+	if (!au_h_dptr(parent, bcpup)) {
+		if (bstart < bcpup)
+			err = au_cpdown_dirs(dentry, bcpup);
+		else
+			err = au_cpup_dirs(dentry, bcpup);
+	}
+	if (!err && add_entry) {
+		h_parent = au_h_dptr(parent, bcpup);
+		h_dir = h_parent->d_inode;
+		mutex_lock_nested(&h_dir->i_mutex, AuLsc_I_PARENT);
+		err = au_lkup_neg(dentry, bcpup);
+		/* todo: no unlock here */
+		mutex_unlock(&h_dir->i_mutex);
+		if (bstart < bcpup && au_dbstart(dentry) < 0) {
+			au_set_dbstart(dentry, 0);
+			au_update_dbrange(dentry, /*do_put_zero*/0);
+		}
+	}
+
+	if (!add_entry)
+		di_write_unlock(parent);
+	if (!err)
+		err = bcpup; /* success */
+
+	return err;
+}
+
+/*
+ * decide the branch and the parent dir where we will create a new entry.
+ * returns new bindex or an error.
+ * copyup the parent dir if needed.
+ */
+int au_wr_dir(struct dentry *dentry, struct dentry *src_dentry,
+	      struct au_wr_dir_args *args)
+{
+	int err;
+	aufs_bindex_t bcpup, bstart, src_bstart;
+	const unsigned char add_entry = !!au_ftest_wrdir(args->flags,
+							 ADD_ENTRY);
+	struct super_block *sb;
+	struct dentry *parent;
+	struct au_sbinfo *sbinfo;
+
+	sb = dentry->d_sb;
+	sbinfo = au_sbi(sb);
+	parent = dget_parent(dentry);
+	bstart = au_dbstart(dentry);
+	bcpup = bstart;
+	if (args->force_btgt < 0) {
+		if (src_dentry) {
+			src_bstart = au_dbstart(src_dentry);
+			if (src_bstart < bstart)
+				bcpup = src_bstart;
+		} else if (add_entry) {
+			err = AuWbrCreate(sbinfo, dentry,
+					  au_ftest_wrdir(args->flags, ISDIR));
+			bcpup = err;
+		}
+
+		if (bcpup < 0 || au_test_ro(sb, bcpup, dentry->d_inode)) {
+			if (add_entry)
+				err = AuWbrCopyup(sbinfo, dentry);
+			else {
+				if (!IS_ROOT(dentry)) {
+					di_read_lock_parent(parent, !AuLock_IR);
+					err = AuWbrCopyup(sbinfo, dentry);
+					di_read_unlock(parent, !AuLock_IR);
+				} else
+					err = AuWbrCopyup(sbinfo, dentry);
+			}
+			bcpup = err;
+			if (unlikely(err < 0))
+				goto out;
+		}
+	} else {
+		bcpup = args->force_btgt;
+		AuDebugOn(au_test_ro(sb, bcpup, dentry->d_inode));
+	}
+	AuDbg("bstart %d, bcpup %d\n", bstart, bcpup);
+	if (bstart < bcpup)
+		au_update_dbrange(dentry, /*do_put_zero*/1);
+
+	err = bcpup;
+	if (bcpup == bstart)
+		goto out; /* success */
+
+	/* copyup the new parent into the branch we process */
+	err = au_wr_dir_cpup(dentry, parent, add_entry, bcpup, bstart);
+
+ out:
+	dput(parent);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+struct dentry *au_pinned_h_parent(struct au_pin *pin)
+{
+	if (pin && pin->parent)
+		return au_h_dptr(pin->parent, pin->bindex);
+	return NULL;
+}
+
+void au_unpin(struct au_pin *p)
+{
+	if (au_ftest_pin(p->flags, MNT_WRITE))
+		mnt_drop_write(p->h_mnt);
+	if (!p->hdir)
+		return;
+
+	au_hin_imtx_unlock(p->hdir);
+	if (!au_ftest_pin(p->flags, DI_LOCKED))
+		di_read_unlock(p->parent, AuLock_IR);
+	iput(p->hdir->hi_inode);
+	dput(p->parent);
+	p->parent = NULL;
+	p->hdir = NULL;
+	p->h_mnt = NULL;
+}
+
+int au_do_pin(struct au_pin *p)
+{
+	int err;
+	struct super_block *sb;
+	struct dentry *h_dentry, *h_parent;
+	struct au_branch *br;
+	struct inode *h_dir;
+
+	err = 0;
+	sb = p->dentry->d_sb;
+	br = au_sbr(sb, p->bindex);
+	if (IS_ROOT(p->dentry)) {
+		if (au_ftest_pin(p->flags, MNT_WRITE)) {
+			p->h_mnt = br->br_mnt;
+			err = mnt_want_write(p->h_mnt);
+			if (unlikely(err)) {
+				au_fclr_pin(p->flags, MNT_WRITE);
+				goto out_err;
+			}
+		}
+		goto out;
+	}
+
+	h_dentry = NULL;
+	if (p->bindex <= au_dbend(p->dentry))
+		h_dentry = au_h_dptr(p->dentry, p->bindex);
+
+	p->parent = dget_parent(p->dentry);
+	if (!au_ftest_pin(p->flags, DI_LOCKED))
+		di_read_lock(p->parent, AuLock_IR, p->lsc_di);
+
+	h_dir = NULL;
+	h_parent = au_h_dptr(p->parent, p->bindex);
+	p->hdir = au_hi(p->parent->d_inode, p->bindex);
+	if (p->hdir)
+		h_dir = p->hdir->hi_inode;
+
+	/* udba case */
+	if (unlikely(!p->hdir || !h_dir)) {
+		err = -EBUSY;
+		if (!au_ftest_pin(p->flags, DI_LOCKED))
+			di_read_unlock(p->parent, AuLock_IR);
+		dput(p->parent);
+		p->parent = NULL;
+		goto out_err;
+	}
+
+	au_igrab(h_dir);
+	au_hin_imtx_lock_nested(p->hdir, p->lsc_hi);
+
+	if (h_dentry) {
+		err = au_h_verify(h_dentry, p->udba, h_dir, h_parent, br);
+		if (unlikely(err)) {
+			au_fclr_pin(p->flags, MNT_WRITE);
+			goto out_unpin;
+		}
+	}
+
+	if (au_ftest_pin(p->flags, MNT_WRITE)) {
+		p->h_mnt = br->br_mnt;
+		err = mnt_want_write(p->h_mnt);
+		if (unlikely(err)) {
+			au_fclr_pin(p->flags, MNT_WRITE);
+			goto out_unpin;
+		}
+	}
+	goto out; /* success */
+
+ out_unpin:
+	au_unpin(p);
+ out_err:
+	AuErr("err %d\n", err);
+	err = -EBUSY;
+ out:
+	return err;
+}
+
+void au_pin_init(struct au_pin *p, struct dentry *dentry,
+		 aufs_bindex_t bindex, int lsc_di, int lsc_hi,
+		 unsigned int udba, unsigned char flags)
+{
+	p->dentry = dentry;
+	p->udba = udba;
+	p->lsc_di = lsc_di;
+	p->lsc_hi = lsc_hi;
+	p->flags = flags;
+	p->bindex = bindex;
+
+	p->parent = NULL;
+	p->hdir = NULL;
+	p->h_mnt = NULL;
+}
+
+int au_pin(struct au_pin *pin, struct dentry *dentry, aufs_bindex_t bindex,
+	   unsigned int udba, unsigned char flags)
+{
+	au_pin_init(pin, dentry, bindex, AuLsc_DI_PARENT, AuLsc_I_PARENT2,
+		    udba, flags);
+	return au_do_pin(pin);
+}
+
+/* ---------------------------------------------------------------------- */
+
+#define AuIcpup_DID_CPUP	1
+#define au_ftest_icpup(flags, name)	((flags) & AuIcpup_##name)
+#define au_fset_icpup(flags, name)	{ (flags) |= AuIcpup_##name; }
+#define au_fclr_icpup(flags, name)	{ (flags) &= ~AuIcpup_##name; }
+
+struct au_icpup_args {
+	unsigned char flags;
+	unsigned char pin_flags;
+	aufs_bindex_t btgt;
+	struct au_pin pin;
+	struct path h_path;
+	struct inode *h_inode;
+};
+
+static int au_lock_and_icpup(struct dentry *dentry, struct iattr *ia,
+			     struct au_icpup_args *a)
+{
+	int err;
+	unsigned int udba;
+	loff_t sz;
+	aufs_bindex_t bstart;
+	struct dentry *hi_wh, *parent;
+	struct inode *inode;
+	struct au_wr_dir_args wr_dir_args = {
+		.force_btgt	= -1,
+		.flags		= 0
+	};
+
+	di_write_lock_child(dentry);
+	bstart = au_dbstart(dentry);
+	inode = dentry->d_inode;
+	if (S_ISDIR(inode->i_mode))
+		au_fset_wrdir(wr_dir_args.flags, ISDIR);
+	/* plink or hi_wh() case */
+	if (bstart != au_ibstart(inode))
+		wr_dir_args.force_btgt = au_ibstart(inode);
+	err = au_wr_dir(dentry, /*src_dentry*/NULL, &wr_dir_args);
+	if (unlikely(err < 0))
+		goto out_dentry;
+	a->btgt = err;
+	if (err != bstart)
+		au_fset_icpup(a->flags, DID_CPUP);
+
+	err = 0;
+	a->pin_flags = AuPin_MNT_WRITE;
+	parent = NULL;
+	if (!IS_ROOT(dentry)) {
+		au_fset_pin(a->pin_flags, DI_LOCKED);
+		parent = dget_parent(dentry);
+		di_write_lock_parent(parent);
+	}
+
+	udba = au_opt_udba(dentry->d_sb);
+	if (d_unhashed(dentry) || (ia->ia_valid & ATTR_FILE))
+		udba = AuOpt_UDBA_NONE;
+	err = au_pin(&a->pin, dentry, a->btgt, udba, a->pin_flags);
+	if (unlikely(err)) {
+		if (parent) {
+			di_write_unlock(parent);
+			dput(parent);
+		}
+		goto out_dentry;
+	}
+	a->h_path.dentry = au_h_dptr(dentry, bstart);
+	a->h_inode = a->h_path.dentry->d_inode;
+	mutex_lock_nested(&a->h_inode->i_mutex, AuLsc_I_CHILD);
+	sz = -1;
+	if ((ia->ia_valid & ATTR_SIZE) && ia->ia_size < i_size_read(a->h_inode))
+		sz = ia->ia_size;
+
+	hi_wh = NULL;
+	if (au_ftest_icpup(a->flags, DID_CPUP) && d_unhashed(dentry)) {
+		hi_wh = au_hi_wh(inode, a->btgt);
+		if (!hi_wh) {
+			err = au_sio_cpup_wh(dentry, a->btgt, sz, /*file*/NULL);
+			if (unlikely(err))
+				goto out_unlock;
+			hi_wh = au_hi_wh(inode, a->btgt);
+			/* todo: revalidate hi_wh? */
+		}
+	}
+
+	if (parent) {
+		au_pin_set_parent_lflag(&a->pin, /*lflag*/0);
+		di_downgrade_lock(parent, AuLock_IR);
+		dput(parent);
+	}
+	if (!au_ftest_icpup(a->flags, DID_CPUP))
+		goto out; /* success */
+
+	if (!d_unhashed(dentry)) {
+		err = au_sio_cpup_simple(dentry, a->btgt, sz, AuCpup_DTIME);
+		if (!err)
+			a->h_path.dentry = au_h_dptr(dentry, a->btgt);
+	} else if (!hi_wh)
+		a->h_path.dentry = au_h_dptr(dentry, a->btgt);
+	else
+		a->h_path.dentry = hi_wh; /* do not dget here */
+
+ out_unlock:
+	mutex_unlock(&a->h_inode->i_mutex);
+	a->h_inode = a->h_path.dentry->d_inode;
+	if (!err) {
+		mutex_lock_nested(&a->h_inode->i_mutex, AuLsc_I_CHILD);
+		goto out; /* success */
+	}
+
+	au_unpin(&a->pin);
+
+ out_dentry:
+	di_write_unlock(dentry);
+ out:
+	return err;
+}
+
+static int aufs_setattr(struct dentry *dentry, struct iattr *ia)
+{
+	int err;
+	struct inode *inode;
+	struct super_block *sb;
+	struct file *file;
+	struct au_icpup_args *a;
+
+	err = -ENOMEM;
+	a = kzalloc(sizeof(*a), GFP_NOFS);
+	if (unlikely(!a))
+		goto out;
+
+	inode = dentry->d_inode;
+	IMustLock(inode);
+	sb = dentry->d_sb;
+	si_read_lock(sb, AuLock_FLUSH);
+
+	file = NULL;
+	if (ia->ia_valid & ATTR_FILE) {
+		/* currently ftruncate(2) only */
+		file = ia->ia_file;
+		fi_write_lock(file);
+		ia->ia_file = au_h_fptr(file, au_fbstart(file));
+	}
+
+	if (ia->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID))
+		ia->ia_valid &= ~ATTR_MODE;
+
+	err = au_lock_and_icpup(dentry, ia, a);
+	if (unlikely(err < 0))
+		goto out_si;
+	if (au_ftest_icpup(a->flags, DID_CPUP)) {
+		ia->ia_file = NULL;
+		ia->ia_valid &= ~ATTR_FILE;
+	}
+
+	a->h_path.mnt = au_sbr_mnt(sb, a->btgt);
+	if (ia->ia_valid & ATTR_SIZE) {
+		struct file *f;
+
+		if (ia->ia_size < i_size_read(inode)) {
+			/* unmap only */
+			err = vmtruncate(inode, ia->ia_size);
+			if (unlikely(err))
+				goto out_unlock;
+		}
+
+		f = NULL;
+		if (ia->ia_valid & ATTR_FILE)
+			f = ia->ia_file;
+		mutex_unlock(&a->h_inode->i_mutex);
+		err = vfsub_trunc(&a->h_path, ia->ia_size, ia->ia_valid, f);
+		mutex_lock_nested(&a->h_inode->i_mutex, AuLsc_I_CHILD);
+	} else
+		err = vfsub_notify_change(&a->h_path, ia);
+	if (!err)
+		au_cpup_attr_changeable(inode);
+
+ out_unlock:
+	mutex_unlock(&a->h_inode->i_mutex);
+	au_unpin(&a->pin);
+	di_write_unlock(dentry);
+ out_si:
+	if (file) {
+		fi_write_unlock(file);
+		ia->ia_file = file;
+		ia->ia_valid |= ATTR_FILE;
+	}
+	si_read_unlock(sb);
+	kfree(a);
+ out:
+	return err;
+}
+
+static int au_getattr_lock_reval(struct dentry *dentry, unsigned int sigen)
+{
+	int err;
+	struct inode *inode;
+	struct dentry *parent;
+
+	err = 0;
+	inode = dentry->d_inode;
+	di_write_lock_child(dentry);
+	if (au_digen(dentry) != sigen || au_iigen(inode) != sigen) {
+		parent = dget_parent(dentry);
+		di_read_lock_parent(parent, AuLock_IR);
+		/* returns a number of positive dentries */
+		err = au_refresh_hdentry(dentry, inode->i_mode & S_IFMT);
+		if (err > 0)
+			err = au_refresh_hinode(inode, dentry);
+		di_read_unlock(parent, AuLock_IR);
+		dput(parent);
+		if (unlikely(!err))
+			err = -EIO;
+	}
+	di_downgrade_lock(dentry, AuLock_IR);
+
+	return err;
+}
+
+static void au_refresh_iattr(struct inode *inode, struct kstat *st,
+			     unsigned int nlink)
+{
+	inode->i_mode = st->mode;
+	inode->i_uid = st->uid;
+	inode->i_gid = st->gid;
+	inode->i_atime = st->atime;
+	inode->i_mtime = st->mtime;
+	inode->i_ctime = st->ctime;
+
+	au_cpup_attr_nlink(inode, /*force*/0);
+	if (S_ISDIR(inode->i_mode)) {
+		inode->i_nlink -= nlink;
+		inode->i_nlink += st->nlink;
+	}
+
+	spin_lock(&inode->i_lock);
+	inode->i_blocks = st->blocks;
+	i_size_write(inode, st->size);
+	spin_unlock(&inode->i_lock);
+}
+
+static int aufs_getattr(struct vfsmount *mnt __maybe_unused,
+			struct dentry *dentry, struct kstat *st)
+{
+	int err;
+	unsigned int mnt_flags;
+	aufs_bindex_t bindex;
+	unsigned char udba_none, positive, did_lock;
+	struct super_block *sb, *h_sb;
+	struct inode *inode;
+	struct vfsmount *h_mnt;
+	struct dentry *h_dentry;
+
+	err = 0;
+	did_lock = 0;
+	sb = dentry->d_sb;
+	inode = dentry->d_inode;
+	si_read_lock(sb, AuLock_FLUSH);
+	if (IS_ROOT(dentry)) {
+		/* lock free root dinfo */
+		h_dentry = dget(au_di(dentry)->di_hdentry->hd_dentry);
+		h_mnt = au_sbr_mnt(sb, 0);
+		goto getattr;
+	}
+
+	did_lock = 1;
+	mnt_flags = au_mntflags(sb);
+	udba_none = !!au_opt_test(mnt_flags, UDBA_NONE);
+
+	/* support fstat(2) */
+	if (!d_unhashed(dentry) && !udba_none) {
+		unsigned int sigen = au_sigen(sb);
+		if (au_digen(dentry) == sigen && au_iigen(inode) == sigen)
+			di_read_lock_child(dentry, AuLock_IR);
+		else {
+			err = au_getattr_lock_reval(dentry, sigen);
+			if (unlikely(err))
+				goto out;
+		}
+	} else
+		di_read_lock_child(dentry, AuLock_IR);
+
+	bindex = au_ibstart(inode);
+	h_mnt = au_sbr_mnt(sb, bindex);
+	h_sb = h_mnt->mnt_sb;
+	if (!au_test_fs_bad_iattr(h_sb) && udba_none)
+		goto out_fill; /* success */
+
+	if (au_dbstart(dentry) == bindex)
+		h_dentry = dget(au_h_dptr(dentry, bindex));
+	else if (au_opt_test(mnt_flags, PLINK) && au_plink_test(inode)) {
+		h_dentry = au_plink_lkup(inode, bindex);
+		if (IS_ERR(h_dentry))
+			goto out_fill; /* pretending success */
+	} else
+		/* illegally overlapped or something */
+		goto out_fill; /* pretending success */
+
+ getattr:
+	positive = !!h_dentry->d_inode;
+	if (positive)
+		err = vfs_getattr(h_mnt, h_dentry, st);
+	dput(h_dentry);
+	if (!err) {
+		if (positive)
+			au_refresh_iattr(inode, st, h_dentry->d_inode->i_nlink);
+		goto out_fill; /* success */
+	}
+	goto out;
+
+ out_fill:
+	generic_fillattr(inode, st);
+ out:
+	if (did_lock)
+		di_read_unlock(dentry, AuLock_IR);
+	si_read_unlock(sb);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int h_readlink(struct dentry *dentry, int bindex, char __user *buf,
+		      int bufsiz)
+{
+	int err;
+	struct super_block *sb;
+	struct dentry *h_dentry;
+
+	err = -EINVAL;
+	h_dentry = au_h_dptr(dentry, bindex);
+	if (unlikely(/* !h_dentry
+		     || !h_dentry->d_inode
+		     || !h_dentry->d_inode->i_op
+		     || */ !h_dentry->d_inode->i_op->readlink))
+		goto out;
+
+	err = security_inode_readlink(h_dentry);
+	if (unlikely(err))
+		goto out;
+
+	sb = dentry->d_sb;
+	if (!au_test_ro(sb, bindex, dentry->d_inode)) {
+		vfsub_touch_atime(au_sbr_mnt(sb, bindex), h_dentry);
+		fsstack_copy_attr_atime(dentry->d_inode, h_dentry->d_inode);
+	}
+	err = h_dentry->d_inode->i_op->readlink(h_dentry, buf, bufsiz);
+
+ out:
+	return err;
+}
+
+static int aufs_readlink(struct dentry *dentry, char __user *buf, int bufsiz)
+{
+	int err;
+
+	aufs_read_lock(dentry, AuLock_IR);
+	err = h_readlink(dentry, au_dbstart(dentry), buf, bufsiz);
+	aufs_read_unlock(dentry, AuLock_IR);
+
+	return err;
+}
+
+static void *aufs_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+	int err;
+	char *buf;
+	mm_segment_t old_fs;
+
+	err = -ENOMEM;
+	buf = __getname();
+	if (unlikely(!buf))
+		goto out;
+
+	aufs_read_lock(dentry, AuLock_IR);
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+	err = h_readlink(dentry, au_dbstart(dentry), (char __user *)buf,
+			 PATH_MAX);
+	set_fs(old_fs);
+	aufs_read_unlock(dentry, AuLock_IR);
+
+	if (err >= 0) {
+		buf[err] = 0;
+		/* will be freed by put_link */
+		nd_set_link(nd, buf);
+		return NULL; /* success */
+	}
+	__putname(buf);
+
+ out:
+	path_put(&nd->path);
+	AuTraceErr(err);
+	return ERR_PTR(err);
+}
+
+static void aufs_put_link(struct dentry *dentry __maybe_unused,
+			  struct nameidata *nd, void *cookie __maybe_unused)
+{
+	__putname(nd_get_link(nd));
+}
+
+/* ---------------------------------------------------------------------- */
+
+static void aufs_truncate_range(struct inode *inode __maybe_unused,
+				loff_t start __maybe_unused,
+				loff_t end __maybe_unused)
+{
+	AuUnsupport();
+}
+
+/* ---------------------------------------------------------------------- */
+
+struct inode_operations aufs_symlink_iop = {
+	.permission	= aufs_permission,
+	.setattr	= aufs_setattr,
+	.getattr	= aufs_getattr,
+	.readlink	= aufs_readlink,
+	.follow_link	= aufs_follow_link,
+	.put_link	= aufs_put_link
+};
+
+struct inode_operations aufs_dir_iop = {
+	.create		= aufs_create,
+	.lookup		= aufs_lookup,
+	.link		= aufs_link,
+	.unlink		= aufs_unlink,
+	.symlink	= aufs_symlink,
+	.mkdir		= aufs_mkdir,
+	.rmdir		= aufs_rmdir,
+	.mknod		= aufs_mknod,
+	.rename		= aufs_rename,
+
+	.permission	= aufs_permission,
+	.setattr	= aufs_setattr,
+	.getattr	= aufs_getattr
+};
+
+struct inode_operations aufs_iop = {
+	.permission	= aufs_permission,
+	.setattr	= aufs_setattr,
+	.getattr	= aufs_getattr,
+	.truncate_range	= aufs_truncate_range
+};
diff --git a/fs/aufs/i_op_add.c b/fs/aufs/i_op_add.c
new file mode 100644
index 0000000..516120d
--- /dev/null
+++ b/fs/aufs/i_op_add.c
@@ -0,0 +1,625 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * inode operations (add entry)
+ */
+
+#include "aufs.h"
+
+/*
+ * final procedure of adding a new entry, except link(2).
+ * remove whiteout, instantiate, copyup the parent dir's times and size
+ * and update version.
+ * if it failed, re-create the removed whiteout.
+ */
+static int epilog(struct inode *dir, aufs_bindex_t bindex,
+		  struct dentry *wh_dentry, struct dentry *dentry)
+{
+	int err, rerr;
+	aufs_bindex_t bwh;
+	struct path h_path;
+	struct inode *inode, *h_dir;
+	struct dentry *wh;
+
+	bwh = -1;
+	if (wh_dentry) {
+		h_dir = wh_dentry->d_parent->d_inode; /* dir inode is locked */
+		IMustLock(h_dir);
+		AuDebugOn(au_h_iptr(dir, bindex) != h_dir);
+		bwh = au_dbwh(dentry);
+		h_path.dentry = wh_dentry;
+		h_path.mnt = au_sbr_mnt(dir->i_sb, bindex);
+		err = au_wh_unlink_dentry(au_h_iptr(dir, bindex), &h_path,
+					  dentry);
+		if (unlikely(err))
+			goto out;
+	}
+
+	inode = au_new_inode(dentry, /*must_new*/1);
+	if (!IS_ERR(inode)) {
+		d_instantiate(dentry, inode);
+		dir = dentry->d_parent->d_inode; /* dir inode is locked */
+		IMustLock(dir);
+		if (au_ibstart(dir) == au_dbstart(dentry))
+			au_cpup_attr_timesizes(dir);
+		dir->i_version++;
+		return 0; /* success */
+	}
+
+	err = PTR_ERR(inode);
+	if (!wh_dentry)
+		goto out;
+
+	/* revert */
+	/* dir inode is locked */
+	wh = au_wh_create(dentry, bwh, wh_dentry->d_parent);
+	rerr = PTR_ERR(wh);
+	if (IS_ERR(wh)) {
+		AuIOErr("%.*s reverting whiteout failed(%d, %d)\n",
+			AuDLNPair(dentry), err, rerr);
+		err = -EIO;
+	} else
+		dput(wh);
+
+ out:
+	return err;
+}
+
+/*
+ * simple tests for the adding inode operations.
+ * following the checks in vfs, plus the parent-child relationship.
+ */
+int au_may_add(struct dentry *dentry, aufs_bindex_t bindex,
+	       struct dentry *h_parent, int isdir)
+{
+	int err;
+	umode_t h_mode;
+	struct dentry *h_dentry;
+	struct inode *h_inode;
+
+	h_dentry = au_h_dptr(dentry, bindex);
+	h_inode = h_dentry->d_inode;
+	if (!dentry->d_inode) {
+		err = -EEXIST;
+		if (unlikely(h_inode))
+			goto out;
+	} else {
+		/* rename(2) case */
+		err = -EIO;
+		if (unlikely(!h_inode || !h_inode->i_nlink))
+			goto out;
+
+		h_mode = h_inode->i_mode;
+		if (!isdir) {
+			err = -EISDIR;
+			if (unlikely(S_ISDIR(h_mode)))
+				goto out;
+		} else if (unlikely(!S_ISDIR(h_mode))) {
+			err = -ENOTDIR;
+			goto out;
+		}
+	}
+
+	err = -EIO;
+	/* expected parent dir is locked */
+	if (unlikely(h_parent != h_dentry->d_parent))
+		goto out;
+	err = 0;
+
+ out:
+	return err;
+}
+
+/*
+ * initial procedure of adding a new entry.
+ * prepare writable branch and the parent dir, lock it,
+ * and lookup whiteout for the new entry.
+ */
+static struct dentry*
+lock_hdir_lkup_wh(struct dentry *dentry, struct au_dtime *dt,
+		  struct dentry *src_dentry, struct au_pin *pin,
+		  struct au_wr_dir_args *wr_dir_args)
+{
+	struct dentry *wh_dentry, *h_parent;
+	struct super_block *sb;
+	struct au_branch *br;
+	int err;
+	unsigned int udba;
+	aufs_bindex_t bcpup;
+
+	err = au_wr_dir(dentry, src_dentry, wr_dir_args);
+	bcpup = err;
+	wh_dentry = ERR_PTR(err);
+	if (unlikely(err < 0))
+		goto out;
+
+	sb = dentry->d_sb;
+	udba = au_opt_udba(sb);
+	err = au_pin(pin, dentry, bcpup, udba,
+		     AuPin_DI_LOCKED | AuPin_MNT_WRITE);
+	wh_dentry = ERR_PTR(err);
+	if (unlikely(err))
+		goto out;
+
+	h_parent = au_pinned_h_parent(pin);
+	if (udba != AuOpt_UDBA_NONE
+	    && au_dbstart(dentry) == bcpup) {
+		err = au_may_add(dentry, bcpup, h_parent,
+				 au_ftest_wrdir(wr_dir_args->flags, ISDIR));
+		wh_dentry = ERR_PTR(err);
+		if (unlikely(err))
+			goto out_unpin;
+	}
+
+	br = au_sbr(sb, bcpup);
+	if (dt) {
+		struct path tmp = {
+			.dentry	= h_parent,
+			.mnt	= br->br_mnt
+		};
+		au_dtime_store(dt, au_pinned_parent(pin), &tmp);
+	}
+
+	wh_dentry = NULL;
+	if (bcpup != au_dbwh(dentry))
+		goto out; /* success */
+
+	wh_dentry = au_wh_lkup(h_parent, &dentry->d_name, br);
+
+ out_unpin:
+	if (IS_ERR(wh_dentry))
+		au_unpin(pin);
+ out:
+	return wh_dentry;
+}
+
+/* ---------------------------------------------------------------------- */
+
+enum { Mknod, Symlink, Creat };
+struct simple_arg {
+	int type;
+	union {
+		struct {
+			int mode;
+			struct nameidata *nd;
+		} c;
+		struct {
+			const char *symname;
+		} s;
+		struct {
+			int mode;
+			dev_t dev;
+		} m;
+	} u;
+};
+
+static int add_simple(struct inode *dir, struct dentry *dentry,
+		      struct simple_arg *arg)
+{
+	int err;
+	aufs_bindex_t bstart;
+	unsigned char created;
+	struct au_dtime dt;
+	struct au_pin pin;
+	struct path h_path;
+	struct dentry *wh_dentry, *parent;
+	struct inode *h_dir;
+	struct au_wr_dir_args wr_dir_args = {
+		.force_btgt	= -1,
+		.flags		= AuWrDir_ADD_ENTRY
+	};
+
+	IMustLock(dir);
+
+	parent = dentry->d_parent; /* dir inode is locked */
+	aufs_read_lock(dentry, AuLock_DW);
+	di_write_lock_parent(parent);
+	wh_dentry = lock_hdir_lkup_wh(dentry, &dt, /*src_dentry*/NULL, &pin,
+				      &wr_dir_args);
+	err = PTR_ERR(wh_dentry);
+	if (IS_ERR(wh_dentry))
+		goto out;
+
+	bstart = au_dbstart(dentry);
+	h_path.dentry = au_h_dptr(dentry, bstart);
+	h_path.mnt = au_sbr_mnt(dentry->d_sb, bstart);
+	h_dir = au_pinned_h_dir(&pin);
+	switch (arg->type) {
+	case Creat:
+		err = vfsub_create(h_dir, &h_path, arg->u.c.mode);
+		break;
+	case Symlink:
+		err = vfsub_symlink(h_dir, &h_path, arg->u.s.symname);
+		break;
+	case Mknod:
+		err = vfsub_mknod(h_dir, &h_path, arg->u.m.mode, arg->u.m.dev);
+		break;
+	default:
+		BUG();
+	}
+	created = !err;
+	if (!err)
+		err = epilog(dir, bstart, wh_dentry, dentry);
+
+	/* revert */
+	if (unlikely(created && err && h_path.dentry->d_inode)) {
+		int rerr;
+		rerr = vfsub_unlink(h_dir, &h_path, /*force*/0);
+		if (rerr) {
+			AuIOErr("%.*s revert failure(%d, %d)\n",
+				AuDLNPair(dentry), err, rerr);
+			err = -EIO;
+		}
+		au_dtime_revert(&dt);
+		d_drop(dentry);
+	}
+
+	au_unpin(&pin);
+	dput(wh_dentry);
+
+ out:
+	if (unlikely(err)) {
+		au_update_dbstart(dentry);
+		d_drop(dentry);
+	}
+	di_write_unlock(parent);
+	aufs_read_unlock(dentry, AuLock_DW);
+	return err;
+}
+
+int aufs_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
+{
+	struct simple_arg arg = {
+		.type = Mknod,
+		.u.m = {
+			.mode	= mode,
+			.dev	= dev
+		}
+	};
+	return add_simple(dir, dentry, &arg);
+}
+
+int aufs_symlink(struct inode *dir, struct dentry *dentry, const char *symname)
+{
+	struct simple_arg arg = {
+		.type = Symlink,
+		.u.s.symname = symname
+	};
+	return add_simple(dir, dentry, &arg);
+}
+
+int aufs_create(struct inode *dir, struct dentry *dentry, int mode,
+		struct nameidata *nd)
+{
+	struct simple_arg arg = {
+		.type = Creat,
+		.u.c = {
+			.mode	= mode,
+			.nd	= nd
+		}
+	};
+	return add_simple(dir, dentry, &arg);
+}
+
+/* ---------------------------------------------------------------------- */
+
+struct au_link_args {
+	aufs_bindex_t bdst, bsrc;
+	struct au_pin pin;
+	struct path h_path;
+	struct dentry *src_parent, *parent;
+};
+
+static int au_cpup_before_link(struct dentry *src_dentry,
+			       struct au_link_args *a)
+{
+	int err;
+	struct dentry *h_src_dentry;
+	struct mutex *h_mtx;
+
+	di_read_lock_parent(a->src_parent, AuLock_IR);
+	err = au_test_and_cpup_dirs(src_dentry, a->bdst);
+	if (unlikely(err))
+		goto out;
+
+	h_src_dentry = au_h_dptr(src_dentry, a->bsrc);
+	h_mtx = &h_src_dentry->d_inode->i_mutex;
+	err = au_pin(&a->pin, src_dentry, a->bdst,
+		     au_opt_udba(src_dentry->d_sb),
+		     AuPin_DI_LOCKED | AuPin_MNT_WRITE);
+	if (unlikely(err))
+		goto out;
+	mutex_lock_nested(h_mtx, AuLsc_I_CHILD);
+	err = au_sio_cpup_simple(src_dentry, a->bdst, -1,
+				 AuCpup_DTIME /* | AuCpup_KEEPLINO */);
+	mutex_unlock(h_mtx);
+	au_unpin(&a->pin);
+
+ out:
+	di_read_unlock(a->src_parent, AuLock_IR);
+	return err;
+}
+
+static int au_cpup_or_link(struct dentry *src_dentry, struct au_link_args *a)
+{
+	int err;
+	unsigned char plink;
+	struct inode *h_inode, *inode;
+	struct dentry *h_src_dentry;
+	struct super_block *sb;
+
+	plink = 0;
+	h_inode = NULL;
+	sb = src_dentry->d_sb;
+	inode = src_dentry->d_inode;
+	if (au_ibstart(inode) <= a->bdst)
+		h_inode = au_h_iptr(inode, a->bdst);
+	if (!h_inode || !h_inode->i_nlink) {
+		/* copyup src_dentry as the name of dentry. */
+		au_set_dbstart(src_dentry, a->bdst);
+		au_set_h_dptr(src_dentry, a->bdst, dget(a->h_path.dentry));
+		h_inode = au_h_dptr(src_dentry, a->bsrc)->d_inode;
+		mutex_lock_nested(&h_inode->i_mutex, AuLsc_I_CHILD);
+		err = au_sio_cpup_single(src_dentry, a->bdst, a->bsrc, -1,
+					 AuCpup_KEEPLINO, a->parent);
+		mutex_unlock(&h_inode->i_mutex);
+		au_set_h_dptr(src_dentry, a->bdst, NULL);
+		au_set_dbstart(src_dentry, a->bsrc);
+	} else {
+		/* the inode of src_dentry already exists on a.bdst branch */
+		h_src_dentry = d_find_alias(h_inode);
+		if (!h_src_dentry && au_plink_test(inode)) {
+			plink = 1;
+			h_src_dentry = au_plink_lkup(inode, a->bdst);
+			err = PTR_ERR(h_src_dentry);
+			if (IS_ERR(h_src_dentry))
+				goto out;
+
+			if (unlikely(!h_src_dentry->d_inode)) {
+				dput(h_src_dentry);
+				h_src_dentry = NULL;
+			}
+
+		}
+		if (h_src_dentry) {
+			err = vfsub_link(h_src_dentry, au_pinned_h_dir(&a->pin),
+					 &a->h_path);
+			dput(h_src_dentry);
+		} else {
+			AuIOErr("no dentry found for hi%lu on b%d\n",
+				h_inode->i_ino, a->bdst);
+			err = -EIO;
+		}
+	}
+
+	if (!err && !plink)
+		au_plink_append(inode, a->bdst, a->h_path.dentry);
+
+out:
+	return err;
+}
+
+int aufs_link(struct dentry *src_dentry, struct inode *dir,
+	      struct dentry *dentry)
+{
+	int err, rerr;
+	struct au_dtime dt;
+	struct au_link_args *a;
+	struct dentry *wh_dentry, *h_src_dentry;
+	struct inode *inode;
+	struct super_block *sb;
+	struct au_wr_dir_args wr_dir_args = {
+		/* .force_btgt	= -1, */
+		.flags		= AuWrDir_ADD_ENTRY
+	};
+
+	IMustLock(dir);
+	inode = src_dentry->d_inode;
+	IMustLock(inode);
+
+	err = -ENOMEM;
+	a = kzalloc(sizeof(*a), GFP_NOFS);
+	if (unlikely(!a))
+		goto out;
+
+	a->parent = dentry->d_parent; /* dir inode is locked */
+	aufs_read_and_write_lock2(dentry, src_dentry, /*AuLock_FLUSH*/0);
+	a->src_parent = dget_parent(src_dentry);
+	wr_dir_args.force_btgt = au_dbstart(src_dentry);
+
+	di_write_lock_parent(a->parent);
+	wr_dir_args.force_btgt = au_wbr(dentry, wr_dir_args.force_btgt);
+	wh_dentry = lock_hdir_lkup_wh(dentry, &dt, src_dentry, &a->pin,
+				      &wr_dir_args);
+	err = PTR_ERR(wh_dentry);
+	if (IS_ERR(wh_dentry))
+		goto out_unlock;
+
+	err = 0;
+	sb = dentry->d_sb;
+	a->bdst = au_dbstart(dentry);
+	a->h_path.dentry = au_h_dptr(dentry, a->bdst);
+	a->h_path.mnt = au_sbr_mnt(sb, a->bdst);
+	a->bsrc = au_dbstart(src_dentry);
+	if (au_opt_test(au_mntflags(sb), PLINK)) {
+		if (a->bdst < a->bsrc
+		    /* && h_src_dentry->d_sb != a->h_path.dentry->d_sb */)
+			err = au_cpup_or_link(src_dentry, a);
+		else {
+			h_src_dentry = au_h_dptr(src_dentry, a->bdst);
+			err = vfsub_link(h_src_dentry, au_pinned_h_dir(&a->pin),
+					 &a->h_path);
+		}
+	} else {
+		/*
+		 * copyup src_dentry to the branch we process,
+		 * and then link(2) to it.
+		 */
+		if (a->bdst < a->bsrc
+		    /* && h_src_dentry->d_sb != a->h_path.dentry->d_sb */) {
+			au_unpin(&a->pin);
+			di_write_unlock(a->parent);
+			err = au_cpup_before_link(src_dentry, a);
+			if (!err) {
+				di_write_lock_parent(a->parent);
+				err = au_pin(&a->pin, dentry, a->bdst,
+					     au_opt_udba(sb),
+					     AuPin_DI_LOCKED | AuPin_MNT_WRITE);
+				if (unlikely(err))
+					goto out_wh;
+			}
+		}
+		if (!err) {
+			h_src_dentry = au_h_dptr(src_dentry, a->bdst);
+			err = vfsub_link(h_src_dentry, au_pinned_h_dir(&a->pin),
+					 &a->h_path);
+		}
+	}
+	if (unlikely(err))
+		goto out_unpin;
+
+	if (wh_dentry) {
+		a->h_path.dentry = wh_dentry;
+		err = au_wh_unlink_dentry(au_pinned_h_dir(&a->pin), &a->h_path,
+					  dentry);
+		if (unlikely(err))
+			goto out_revert;
+	}
+
+	dir->i_version++;
+	if (au_ibstart(dir) == au_dbstart(dentry))
+		au_cpup_attr_timesizes(dir);
+	inc_nlink(inode);
+	inode->i_ctime = dir->i_ctime;
+	if (!d_unhashed(a->h_path.dentry))
+		d_instantiate(dentry, au_igrab(inode));
+	else
+		/* some filesystem calls d_drop() */
+		d_drop(dentry);
+	goto out_unpin; /* success */
+
+ out_revert:
+	rerr = vfsub_unlink(au_pinned_h_dir(&a->pin), &a->h_path, /*force*/0);
+	if (!rerr)
+		goto out_dt;
+	AuIOErr("%.*s reverting failed(%d, %d)\n",
+		AuDLNPair(dentry), err, rerr);
+	err = -EIO;
+ out_dt:
+	d_drop(dentry);
+	au_dtime_revert(&dt);
+ out_unpin:
+	au_unpin(&a->pin);
+ out_wh:
+	dput(wh_dentry);
+ out_unlock:
+	if (unlikely(err)) {
+		au_update_dbstart(dentry);
+		d_drop(dentry);
+	}
+	di_write_unlock(a->parent);
+	dput(a->src_parent);
+	aufs_read_and_write_unlock2(dentry, src_dentry);
+	kfree(a);
+ out:
+	return err;
+}
+
+int aufs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
+{
+	int err, rerr;
+	aufs_bindex_t bindex;
+	unsigned char diropq;
+	struct au_pin pin;
+	struct path h_path;
+	struct dentry *wh_dentry, *parent, *opq_dentry;
+	struct mutex *h_mtx;
+	struct super_block *sb;
+	struct au_dtime dt;
+	struct au_wr_dir_args wr_dir_args = {
+		.force_btgt	= -1,
+		.flags		= AuWrDir_ADD_ENTRY | AuWrDir_ISDIR
+	};
+
+	IMustLock(dir);
+
+	aufs_read_lock(dentry, AuLock_DW);
+	parent = dentry->d_parent; /* dir inode is locked */
+	di_write_lock_parent(parent);
+	wh_dentry = lock_hdir_lkup_wh(dentry, &dt, /*src_dentry*/NULL, &pin,
+				      &wr_dir_args);
+	err = PTR_ERR(wh_dentry);
+	if (IS_ERR(wh_dentry))
+		goto out;
+
+	sb = dentry->d_sb;
+	bindex = au_dbstart(dentry);
+	h_path.dentry = au_h_dptr(dentry, bindex);
+	h_path.mnt = au_sbr_mnt(sb, bindex);
+	err = vfsub_mkdir(au_pinned_h_dir(&pin), &h_path, mode);
+	if (unlikely(err))
+		goto out_unlock;
+
+	/* make the dir opaque */
+	diropq = 0;
+	h_mtx = &h_path.dentry->d_inode->i_mutex;
+	if (wh_dentry
+	    || au_opt_test(au_mntflags(sb), ALWAYS_DIROPQ)) {
+		mutex_lock_nested(h_mtx, AuLsc_I_CHILD);
+		opq_dentry = au_diropq_create(dentry, bindex);
+		mutex_unlock(h_mtx);
+		err = PTR_ERR(opq_dentry);
+		if (IS_ERR(opq_dentry))
+			goto out_dir;
+		dput(opq_dentry);
+		diropq = 1;
+	}
+
+	err = epilog(dir, bindex, wh_dentry, dentry);
+	if (!err) {
+		inc_nlink(dir);
+		goto out_unlock; /* success */
+	}
+
+	/* revert */
+	if (diropq) {
+		AuLabel(revert opq);
+		mutex_lock_nested(h_mtx, AuLsc_I_CHILD);
+		rerr = au_diropq_remove(dentry, bindex);
+		mutex_unlock(h_mtx);
+		if (rerr) {
+			AuIOErr("%.*s reverting diropq failed(%d, %d)\n",
+				AuDLNPair(dentry), err, rerr);
+			err = -EIO;
+		}
+	}
+
+ out_dir:
+	AuLabel(revert dir);
+	rerr = vfsub_rmdir(au_pinned_h_dir(&pin), &h_path);
+	if (rerr) {
+		AuIOErr("%.*s reverting dir failed(%d, %d)\n",
+			AuDLNPair(dentry), err, rerr);
+		err = -EIO;
+	}
+	d_drop(dentry);
+	au_dtime_revert(&dt);
+ out_unlock:
+	au_unpin(&pin);
+	dput(wh_dentry);
+ out:
+	if (unlikely(err)) {
+		au_update_dbstart(dentry);
+		d_drop(dentry);
+	}
+	di_write_unlock(parent);
+	aufs_read_unlock(dentry, AuLock_DW);
+	return err;
+}
diff --git a/fs/aufs/i_op_del.c b/fs/aufs/i_op_del.c
new file mode 100644
index 0000000..681edb4
--- /dev/null
+++ b/fs/aufs/i_op_del.c
@@ -0,0 +1,471 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * inode operations (del entry)
+ */
+
+#include "aufs.h"
+
+/*
+ * decide if a new whiteout for @dentry is necessary or not.
+ * when it is necessary, prepare the parent dir for the upper branch whose
+ * branch index is @bcpup for creation. the actual creation of the whiteout will
+ * be done by caller.
+ * return value:
+ * 0: wh is unnecessary
+ * plus: wh is necessary
+ * minus: error
+ */
+int au_wr_dir_need_wh(struct dentry *dentry, int isdir, aufs_bindex_t *bcpup)
+{
+	int need_wh, err;
+	aufs_bindex_t bstart;
+	struct super_block *sb;
+
+	sb = dentry->d_sb;
+	bstart = au_dbstart(dentry);
+	if (*bcpup < 0) {
+		*bcpup = bstart;
+		if (au_test_ro(sb, bstart, dentry->d_inode)) {
+			err = AuWbrCopyup(au_sbi(sb), dentry);
+			*bcpup = err;
+			if (unlikely(err < 0))
+				goto out;
+		}
+	} else
+		AuDebugOn(bstart < *bcpup
+			  || au_test_ro(sb, *bcpup, dentry->d_inode));
+	AuDbg("bcpup %d, bstart %d\n", *bcpup, bstart);
+
+	if (*bcpup != bstart) {
+		err = au_cpup_dirs(dentry, *bcpup);
+		if (unlikely(err))
+			goto out;
+		need_wh = 1;
+	} else {
+		aufs_bindex_t old_bend, new_bend, bdiropq = -1;
+
+		old_bend = au_dbend(dentry);
+		if (isdir) {
+			bdiropq = au_dbdiropq(dentry);
+			au_set_dbdiropq(dentry, -1);
+		}
+		need_wh = au_lkup_dentry(dentry, bstart + 1, /*type*/0,
+					 /*nd*/NULL);
+		err = need_wh;
+		if (isdir)
+			au_set_dbdiropq(dentry, bdiropq);
+		if (unlikely(err < 0))
+			goto out;
+		new_bend = au_dbend(dentry);
+		if (!need_wh && old_bend != new_bend) {
+			au_set_h_dptr(dentry, new_bend, NULL);
+			au_set_dbend(dentry, old_bend);
+		}
+	}
+	AuDbg("need_wh %d\n", need_wh);
+	err = need_wh;
+
+ out:
+	return err;
+}
+
+/*
+ * simple tests for the del-entry operations.
+ * following the checks in vfs, plus the parent-child relationship.
+ */
+int au_may_del(struct dentry *dentry, aufs_bindex_t bindex,
+	       struct dentry *h_parent, int isdir)
+{
+	int err;
+	umode_t h_mode;
+	struct dentry *h_dentry, *h_latest;
+	struct inode *h_inode;
+
+	h_dentry = au_h_dptr(dentry, bindex);
+	h_inode = h_dentry->d_inode;
+	if (dentry->d_inode) {
+		err = -ENOENT;
+		if (unlikely(!h_inode || !h_inode->i_nlink))
+			goto out;
+
+		h_mode = h_inode->i_mode;
+		if (!isdir) {
+			err = -EISDIR;
+			if (unlikely(S_ISDIR(h_mode)))
+				goto out;
+		} else if (unlikely(!S_ISDIR(h_mode))) {
+			err = -ENOTDIR;
+			goto out;
+		}
+	} else {
+		/* rename(2) case */
+		err = -EIO;
+		if (unlikely(h_inode))
+			goto out;
+	}
+
+	err = -ENOENT;
+	/* expected parent dir is locked */
+	if (unlikely(h_parent != h_dentry->d_parent))
+		goto out;
+	err = 0;
+
+	/*
+	 * rmdir a dir may break the consistency on some filesystem.
+	 * let's try heavy test.
+	 */
+	err = -EACCES;
+	if (unlikely(au_test_h_perm(h_parent->d_inode, MAY_EXEC | MAY_WRITE)))
+		goto out;
+
+	h_latest = au_sio_lkup_one(&dentry->d_name, h_parent,
+				   au_sbr(dentry->d_sb, bindex));
+	err = -EIO;
+	if (IS_ERR(h_latest))
+		goto out;
+	if (h_latest == h_dentry)
+		err = 0;
+	dput(h_latest);
+
+ out:
+	return err;
+}
+
+/*
+ * decide the branch where we operate for @dentry. the branch index will be set
+ * @rbcpup. after diciding it, 'pin' it and store the timestamps of the parent
+ * dir for reverting.
+ * when a new whiteout is necessary, create it.
+ */
+static struct dentry*
+lock_hdir_create_wh(struct dentry *dentry, int isdir, aufs_bindex_t *rbcpup,
+		    struct au_dtime *dt, struct au_pin *pin)
+{
+	struct dentry *wh_dentry;
+	struct super_block *sb;
+	struct path h_path;
+	int err, need_wh;
+	unsigned int udba;
+	aufs_bindex_t bcpup;
+
+	need_wh = au_wr_dir_need_wh(dentry, isdir, rbcpup);
+	wh_dentry = ERR_PTR(need_wh);
+	if (unlikely(need_wh < 0))
+		goto out;
+
+	sb = dentry->d_sb;
+	udba = au_opt_udba(sb);
+	bcpup = *rbcpup;
+	err = au_pin(pin, dentry, bcpup, udba,
+		     AuPin_DI_LOCKED | AuPin_MNT_WRITE);
+	wh_dentry = ERR_PTR(err);
+	if (unlikely(err))
+		goto out;
+
+	h_path.dentry = au_pinned_h_parent(pin);
+	if (udba != AuOpt_UDBA_NONE
+	    && au_dbstart(dentry) == bcpup) {
+		err = au_may_del(dentry, bcpup, h_path.dentry, isdir);
+		wh_dentry = ERR_PTR(err);
+		if (unlikely(err))
+			goto out_unpin;
+	}
+
+	h_path.mnt = au_sbr_mnt(sb, bcpup);
+	au_dtime_store(dt, au_pinned_parent(pin), &h_path);
+	wh_dentry = NULL;
+	if (!need_wh)
+		goto out; /* success, no need to create whiteout */
+
+	wh_dentry = au_wh_create(dentry, bcpup, h_path.dentry);
+	if (!IS_ERR(wh_dentry))
+		goto out; /* success */
+	/* returns with the parent is locked and wh_dentry is dget-ed */
+
+ out_unpin:
+	au_unpin(pin);
+ out:
+	return wh_dentry;
+}
+
+/*
+ * when removing a dir, rename it to a unique temporary whiteout-ed name first
+ * in order to be revertible and save time for removing many child whiteouts
+ * under the dir.
+ * returns 1 when there are too many child whiteout and caller should remove
+ * them asynchronously. returns 0 when the number of children is enough small to
+ * remove now or the branch fs is a remote fs.
+ * otherwise return an error.
+ */
+static int renwh_and_rmdir(struct dentry *dentry, aufs_bindex_t bindex,
+			   struct au_nhash *whlist, struct inode *dir)
+{
+	int rmdir_later, err, dirwh;
+	struct dentry *h_dentry;
+	struct super_block *sb;
+
+	sb = dentry->d_sb;
+	h_dentry = au_h_dptr(dentry, bindex);
+	err = au_whtmp_ren(h_dentry, au_sbr(sb, bindex));
+	if (unlikely(err))
+		goto out;
+
+	/* stop monitoring */
+	au_hin_free(au_hi(dentry->d_inode, bindex));
+
+	if (!au_test_fs_remote(h_dentry->d_sb)) {
+		dirwh = au_sbi(sb)->si_dirwh;
+		rmdir_later = (dirwh <= 1);
+		if (!rmdir_later)
+			rmdir_later = au_nhash_test_longer_wh(whlist, bindex,
+							      dirwh);
+		if (rmdir_later)
+			return rmdir_later;
+	}
+
+	err = au_whtmp_rmdir(dir, bindex, h_dentry, whlist);
+	if (unlikely(err)) {
+		AuIOErr("rmdir %.*s, b%d failed, %d. ignored\n",
+			AuDLNPair(h_dentry), bindex, err);
+		err = 0;
+	}
+
+ out:
+	return err;
+}
+
+/*
+ * final procedure for deleting a entry.
+ * maintain dentry and iattr.
+ */
+static void epilog(struct inode *dir, struct dentry *dentry,
+		   aufs_bindex_t bindex)
+{
+	struct inode *inode;
+
+	/*
+	 * even if this is not a dir,
+	 * set S_DEAD here since we need to detect the dead inode.
+	 */
+	inode = dentry->d_inode;
+	if (!inode->i_nlink)
+		inode->i_flags |= S_DEAD;
+	d_drop(dentry);
+	inode->i_ctime = dir->i_ctime;
+
+	if (atomic_read(&dentry->d_count) == 1) {
+		au_set_h_dptr(dentry, au_dbstart(dentry), NULL);
+		au_update_dbstart(dentry);
+	}
+	if (au_ibstart(dir) == bindex)
+		au_cpup_attr_timesizes(dir);
+	dir->i_version++;
+}
+
+/*
+ * when an error happened, remove the created whiteout and revert everything.
+ */
+static int do_revert(int err, struct inode *dir, aufs_bindex_t bwh,
+		     struct dentry *wh_dentry, struct dentry *dentry,
+		     struct au_dtime *dt)
+{
+	int rerr;
+	struct path h_path = {
+		.dentry	= wh_dentry,
+		.mnt	= au_sbr_mnt(dir->i_sb, bwh)
+	};
+
+	rerr = au_wh_unlink_dentry(au_h_iptr(dir, bwh), &h_path, dentry);
+	if (!rerr) {
+		au_set_dbwh(dentry, bwh);
+		au_dtime_revert(dt);
+		return 0;
+	}
+
+	AuIOErr("%.*s reverting whiteout failed(%d, %d)\n",
+		AuDLNPair(dentry), err, rerr);
+	return -EIO;
+}
+
+/* ---------------------------------------------------------------------- */
+
+int aufs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	int err;
+	aufs_bindex_t bwh, bindex, bstart;
+	struct au_dtime dt;
+	struct au_pin pin;
+	struct path h_path;
+	struct inode *inode, *h_dir;
+	struct dentry *parent, *wh_dentry;
+
+	IMustLock(dir);
+	inode = dentry->d_inode;
+	if (unlikely(!inode))
+		return -ENOENT; /* possible? */
+	IMustLock(inode);
+
+	aufs_read_lock(dentry, AuLock_DW);
+	parent = dentry->d_parent; /* dir inode is locked */
+	di_write_lock_parent(parent);
+
+	bstart = au_dbstart(dentry);
+	bwh = au_dbwh(dentry);
+	bindex = -1;
+	wh_dentry = lock_hdir_create_wh(dentry, /*isdir*/0, &bindex, &dt, &pin);
+	err = PTR_ERR(wh_dentry);
+	if (IS_ERR(wh_dentry))
+		goto out;
+
+	h_path.mnt = au_sbr_mnt(dentry->d_sb, bstart);
+	h_path.dentry = au_h_dptr(dentry, bstart);
+	dget(h_path.dentry);
+	if (bindex == bstart) {
+		h_dir = au_pinned_h_dir(&pin);
+		err = vfsub_unlink(h_dir, &h_path, /*force*/0);
+	} else {
+		/* dir inode is locked */
+		h_dir = wh_dentry->d_parent->d_inode;
+		IMustLock(h_dir);
+		err = 0;
+	}
+
+	if (!err) {
+		drop_nlink(inode);
+		epilog(dir, dentry, bindex);
+
+		/* update target timestamps */
+		if (bindex == bstart) {
+			vfsub_update_h_iattr(&h_path, /*did*/NULL); /*ignore*/
+			inode->i_ctime = h_path.dentry->d_inode->i_ctime;
+		} else
+			/* todo: this timestamp may be reverted later */
+			inode->i_ctime = h_dir->i_ctime;
+		goto out_unlock; /* success */
+	}
+
+	/* revert */
+	if (wh_dentry) {
+		int rerr;
+
+		rerr = do_revert(err, dir, bwh, wh_dentry, dentry, &dt);
+		if (rerr)
+			err = rerr;
+	}
+
+ out_unlock:
+	au_unpin(&pin);
+	dput(wh_dentry);
+	dput(h_path.dentry);
+ out:
+	di_write_unlock(parent);
+	aufs_read_unlock(dentry, AuLock_DW);
+	return err;
+}
+
+int aufs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	int err, rmdir_later;
+	aufs_bindex_t bwh, bindex, bstart;
+	struct au_dtime dt;
+	struct au_pin pin;
+	struct inode *inode;
+	struct dentry *parent, *wh_dentry, *h_dentry;
+	struct au_whtmp_rmdir_args *args;
+	struct au_nhash *whlist;
+
+	IMustLock(dir);
+	inode = dentry->d_inode;
+	err = -ENOENT; /* possible? */
+	if (unlikely(!inode))
+		goto out;
+	IMustLock(inode);
+
+	whlist = au_nhash_new(GFP_NOFS);
+	err = PTR_ERR(whlist);
+	if (IS_ERR(whlist))
+		goto out;
+
+	err = -ENOMEM;
+	args = kmalloc(sizeof(*args), GFP_NOFS);
+	if (unlikely(!args))
+		goto out_whlist;
+
+	aufs_read_lock(dentry, AuLock_DW | AuLock_FLUSH);
+	parent = dentry->d_parent; /* dir inode is locked */
+	di_write_lock_parent(parent);
+	err = au_test_empty(dentry, whlist);
+	if (unlikely(err))
+		goto out_args;
+
+	bstart = au_dbstart(dentry);
+	bwh = au_dbwh(dentry);
+	bindex = -1;
+	wh_dentry = lock_hdir_create_wh(dentry, /*isdir*/1, &bindex, &dt, &pin);
+	err = PTR_ERR(wh_dentry);
+	if (IS_ERR(wh_dentry))
+		goto out_args;
+
+	h_dentry = au_h_dptr(dentry, bstart);
+	dget(h_dentry);
+	rmdir_later = 0;
+	if (bindex == bstart) {
+		err = renwh_and_rmdir(dentry, bstart, whlist, dir);
+		if (err > 0) {
+			rmdir_later = err;
+			err = 0;
+		}
+	} else {
+		/* stop monitoring */
+		au_hin_free(au_hi(inode, bstart));
+
+		/* dir inode is locked */
+		IMustLock(wh_dentry->d_parent->d_inode);
+		err = 0;
+	}
+
+	if (!err) {
+		clear_nlink(inode);
+		au_set_dbdiropq(dentry, -1);
+		epilog(dir, dentry, bindex);
+
+		if (rmdir_later) {
+			au_whtmp_kick_rmdir(dir, bstart, h_dentry, whlist,
+					    args);
+			args = NULL;
+		}
+
+		goto out_unlock; /* success */
+	}
+
+	/* revert */
+	AuLabel(revert);
+	if (wh_dentry) {
+		int rerr;
+
+		rerr = do_revert(err, dir, bwh, wh_dentry, dentry, &dt);
+		if (rerr)
+			err = rerr;
+	}
+
+ out_unlock:
+	au_unpin(&pin);
+	dput(wh_dentry);
+	dput(h_dentry);
+ out_args:
+	di_write_unlock(parent);
+	aufs_read_unlock(dentry, AuLock_DW);
+	kfree(args);
+ out_whlist:
+	au_nhash_del(whlist);
+ out:
+	return err;
+}
diff --git a/fs/aufs/i_op_ren.c b/fs/aufs/i_op_ren.c
new file mode 100644
index 0000000..e0dd048
--- /dev/null
+++ b/fs/aufs/i_op_ren.c
@@ -0,0 +1,929 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * inode operation (rename entry)
+ * todo: this is crazy monster
+ */
+
+#include "aufs.h"
+
+enum { AuSRC, AuDST, AuSrcDst };
+enum { AuPARENT, AuCHILD, AuParentChild };
+
+#define AuRen_ISDIR	1
+#define AuRen_ISSAMEDIR	(1 << 1)
+#define AuRen_WHSRC	(1 << 2)
+#define AuRen_WHDST	(1 << 3)
+#define AuRen_MNT_WRITE	(1 << 4)
+#define AuRen_DT_DSTDIR	(1 << 5)
+#define AuRen_DIROPQ	(1 << 6)
+#define AuRen_CPUP	(1 << 7)
+#define au_ftest_ren(flags, name)	((flags) & AuRen_##name)
+#define au_fset_ren(flags, name)	{ (flags) |= AuRen_##name; }
+#define au_fclr_ren(flags, name)	{ (flags) &= ~AuRen_##name; }
+
+struct au_ren_args {
+	struct {
+		struct dentry *dentry, *h_dentry, *parent, *h_parent,
+			*wh_dentry;
+		struct inode *dir, *inode;
+		struct au_hinode *hdir;
+		struct au_dtime dt[AuParentChild];
+		aufs_bindex_t bstart;
+	} sd[AuSrcDst];
+
+#define src_dentry	sd[AuSRC].dentry
+#define src_dir		sd[AuSRC].dir
+#define src_inode	sd[AuSRC].inode
+#define src_h_dentry	sd[AuSRC].h_dentry
+#define src_parent	sd[AuSRC].parent
+#define src_h_parent	sd[AuSRC].h_parent
+#define src_wh_dentry	sd[AuSRC].wh_dentry
+#define src_hdir	sd[AuSRC].hdir
+#define src_h_dir	sd[AuSRC].hdir->hi_inode
+#define src_dt		sd[AuSRC].dt
+#define src_bstart	sd[AuSRC].bstart
+
+#define dst_dentry	sd[AuDST].dentry
+#define dst_dir		sd[AuDST].dir
+#define dst_inode	sd[AuDST].inode
+#define dst_h_dentry	sd[AuDST].h_dentry
+#define dst_parent	sd[AuDST].parent
+#define dst_h_parent	sd[AuDST].h_parent
+#define dst_wh_dentry	sd[AuDST].wh_dentry
+#define dst_hdir	sd[AuDST].hdir
+#define dst_h_dir	sd[AuDST].hdir->hi_inode
+#define dst_dt		sd[AuDST].dt
+#define dst_bstart	sd[AuDST].bstart
+
+	struct dentry *h_trap;
+	struct au_branch *br;
+	struct au_hinode *src_hinode;
+	struct path h_path;
+	struct au_nhash whlist;
+	aufs_bindex_t btgt;
+
+	unsigned int flags;
+
+	struct au_whtmp_rmdir_args *thargs;
+	struct dentry *h_dst;
+};
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * functions for reverting.
+ * when an error happened in a single rename systemcall, we should revert
+ * everything as if nothing happend.
+ * we don't need to revert the copied-up/down the parent dir since they are
+ * harmless.
+ */
+
+#define RevertFailure(fmt, args...) do { \
+	AuIOErr("revert failure: " fmt " (%d, %d)\n", \
+		##args, err, rerr); \
+	err = -EIO; \
+} while (0)
+
+static void au_ren_rev_diropq(int err, struct au_ren_args *a)
+{
+	int rerr;
+
+	au_hin_imtx_lock_nested(a->src_hinode, AuLsc_I_CHILD);
+	rerr = au_diropq_remove(a->src_dentry, a->btgt);
+	au_hin_imtx_unlock(a->src_hinode);
+	if (rerr)
+		RevertFailure("remove diropq %.*s", AuDLNPair(a->src_dentry));
+}
+
+
+static void au_ren_rev_rename(int err, struct au_ren_args *a)
+{
+	int rerr;
+
+	a->h_path.dentry = au_lkup_one(&a->src_dentry->d_name, a->src_h_parent,
+				       a->br, /*nd*/NULL);
+	rerr = PTR_ERR(a->h_path.dentry);
+	if (IS_ERR(a->h_path.dentry)) {
+		RevertFailure("au_lkup_one %.*s", AuDLNPair(a->src_dentry));
+		return;
+	}
+
+	rerr = vfsub_rename(a->dst_h_dir,
+			    au_h_dptr(a->src_dentry, a->btgt),
+			    a->src_h_dir, &a->h_path);
+	d_drop(a->h_path.dentry);
+	dput(a->h_path.dentry);
+	/* au_set_h_dptr(a->src_dentry, a->btgt, NULL); */
+	if (rerr)
+		RevertFailure("rename %.*s", AuDLNPair(a->src_dentry));
+}
+
+static void au_ren_rev_cpup(int err, struct au_ren_args *a)
+{
+	int rerr;
+
+	a->h_path.dentry = a->dst_h_dentry;
+	rerr = vfsub_unlink(a->dst_h_dir, &a->h_path, /*force*/0);
+	au_set_h_dptr(a->src_dentry, a->btgt, NULL);
+	au_set_dbstart(a->src_dentry, a->src_bstart);
+	if (rerr)
+		RevertFailure("unlink %.*s", AuDLNPair(a->dst_h_dentry));
+}
+
+
+static void au_ren_rev_whtmp(int err, struct au_ren_args *a)
+{
+	int rerr;
+
+	a->h_path.dentry = au_lkup_one(&a->dst_dentry->d_name, a->dst_h_parent,
+				       a->br, /*nd*/NULL);
+	rerr = PTR_ERR(a->h_path.dentry);
+	if (IS_ERR(a->h_path.dentry)) {
+		RevertFailure("lookup %.*s", AuDLNPair(a->dst_dentry));
+		return;
+	}
+	if (a->h_path.dentry->d_inode) {
+		d_drop(a->h_path.dentry);
+		dput(a->h_path.dentry);
+		return;
+	}
+
+	rerr = vfsub_rename(a->dst_h_dir, a->h_dst, a->dst_h_dir, &a->h_path);
+	d_drop(a->h_path.dentry);
+	dput(a->h_path.dentry);
+	if (!rerr) {
+		au_set_h_dptr(a->dst_dentry, a->btgt, NULL);
+		au_set_h_dptr(a->dst_dentry, a->btgt, dget(a->h_dst));
+	} else
+		RevertFailure("rename %.*s", AuDLNPair(a->h_dst));
+}
+
+static void au_ren_rev_whsrc(int err, struct au_ren_args *a)
+{
+	int rerr;
+
+	a->h_path.dentry = a->src_wh_dentry;
+	rerr = au_wh_unlink_dentry(a->src_h_dir, &a->h_path, a->src_dentry);
+	if (rerr)
+		RevertFailure("unlink %.*s", AuDLNPair(a->src_wh_dentry));
+}
+
+static void au_ren_rev_drop(struct au_ren_args *a)
+{
+	struct dentry *d, *h_d;
+	int i;
+	aufs_bindex_t bend, bindex;
+
+	for (i = 0; i < AuSrcDst; i++) {
+		d = a->sd[i].dentry;
+		d_drop(d);
+		bend = au_dbend(d);
+		for (bindex = au_dbstart(d); bindex <= bend; bindex++) {
+			h_d = au_h_dptr(d, bindex);
+			if (h_d)
+				d_drop(h_d);
+		}
+	}
+
+	au_update_dbstart(a->dst_dentry);
+	if (a->thargs)
+		d_drop(a->h_dst);
+}
+#undef RevertFailure
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * when we have to copyup the renaming entry, do it with the rename-target name
+ * in order to minimize the cost (the later actual rename is unnecessary).
+ * otherwise rename it on the target branch.
+ */
+static int au_ren_or_cpup(struct au_ren_args *a)
+{
+	int err;
+	struct dentry *d;
+
+	d = a->src_dentry;
+	if (au_dbstart(d) == a->btgt) {
+		a->h_path.dentry = a->dst_h_dentry;
+		if (au_ftest_ren(a->flags, DIROPQ)
+		    && au_dbdiropq(d) == a->btgt)
+			au_fclr_ren(a->flags, DIROPQ);
+		AuDebugOn(au_dbstart(d) != a->btgt);
+		err = vfsub_rename(a->src_h_dir, au_h_dptr(d, a->btgt),
+				   a->dst_h_dir, &a->h_path);
+	} else {
+		struct mutex *h_mtx = &a->src_h_dentry->d_inode->i_mutex;
+
+		au_fset_ren(a->flags, CPUP);
+		mutex_lock_nested(h_mtx, AuLsc_I_CHILD);
+		au_set_dbstart(d, a->btgt);
+		au_set_h_dptr(d, a->btgt, dget(a->dst_h_dentry));
+		err = au_sio_cpup_single(d, a->btgt, a->src_bstart, -1,
+					 !AuCpup_DTIME, a->dst_parent);
+		if (unlikely(err)) {
+			au_set_h_dptr(d, a->btgt, NULL);
+			au_set_dbstart(d, a->src_bstart);
+		}
+		mutex_unlock(h_mtx);
+	}
+
+	return err;
+}
+
+/* cf. aufs_rmdir() */
+static int au_ren_del_whtmp(struct au_ren_args *a)
+{
+	int err;
+	struct inode *dir;
+
+	dir = a->dst_dir;
+	if (!au_nhash_test_longer_wh(&a->whlist, a->btgt,
+				     au_sbi(dir->i_sb)->si_dirwh)
+	    || au_test_fs_remote(a->h_dst->d_sb)) {
+		err = au_whtmp_rmdir(dir, a->btgt, a->h_dst, &a->whlist);
+		if (unlikely(err))
+			AuWarn("failed removing whtmp dir %.*s (%d), "
+			       "ignored.\n", AuDLNPair(a->h_dst), err);
+	} else {
+		au_whtmp_kick_rmdir(dir, a->btgt, a->h_dst, &a->whlist,
+				    a->thargs);
+		dput(a->h_dst);
+		a->thargs = NULL;
+	}
+
+	return 0;
+}
+
+/* make it 'opaque' dir. */
+static int au_ren_diropq(struct au_ren_args *a)
+{
+	int err;
+	struct dentry *diropq;
+
+	err = 0;
+	a->src_hinode = au_hi(a->src_inode, a->btgt);
+	au_hin_imtx_lock_nested(a->src_hinode, AuLsc_I_CHILD);
+	diropq = au_diropq_create(a->src_dentry, a->btgt);
+	au_hin_imtx_unlock(a->src_hinode);
+	if (IS_ERR(diropq))
+		err = PTR_ERR(diropq);
+	dput(diropq);
+
+	return err;
+}
+
+static int do_rename(struct au_ren_args *a)
+{
+	int err;
+	struct dentry *d, *h_d;
+
+	/* prepare workqueue args for asynchronous rmdir */
+	h_d = a->dst_h_dentry;
+	if (au_ftest_ren(a->flags, ISDIR) && h_d->d_inode) {
+		err = -ENOMEM;
+		a->thargs = kmalloc(sizeof(*a->thargs), GFP_NOFS);
+		if (unlikely(!a->thargs))
+			goto out;
+		a->h_dst = dget(h_d);
+	}
+
+	/* create whiteout for src_dentry */
+	if (au_ftest_ren(a->flags, WHSRC)) {
+		a->src_wh_dentry
+			= au_wh_create(a->src_dentry, a->btgt, a->src_h_parent);
+		err = PTR_ERR(a->src_wh_dentry);
+		if (IS_ERR(a->src_wh_dentry))
+			goto out_thargs;
+	}
+
+	/* lookup whiteout for dentry */
+	if (au_ftest_ren(a->flags, WHDST)) {
+		h_d = au_wh_lkup(a->dst_h_parent, &a->dst_dentry->d_name,
+				 a->br);
+		err = PTR_ERR(h_d);
+		if (IS_ERR(h_d))
+			goto out_whsrc;
+		if (!h_d->d_inode)
+			dput(h_d);
+		else
+			a->dst_wh_dentry = h_d;
+	}
+
+	/* rename dentry to tmpwh */
+	if (a->thargs) {
+		err = au_whtmp_ren(a->dst_h_dentry, a->br);
+		if (unlikely(err))
+			goto out_whdst;
+
+		d = a->dst_dentry;
+		au_set_h_dptr(d, a->btgt, NULL);
+		err = au_lkup_neg(d, a->btgt);
+		if (unlikely(err))
+			goto out_whtmp;
+		a->dst_h_dentry = au_h_dptr(d, a->btgt);
+	}
+
+	/* cpup src */
+	if (a->dst_h_dentry->d_inode && a->src_bstart != a->btgt) {
+		struct mutex *h_mtx = &a->src_h_dentry->d_inode->i_mutex;
+
+		mutex_lock_nested(h_mtx, AuLsc_I_CHILD);
+		err = au_sio_cpup_simple(a->src_dentry, a->btgt, -1,
+					 !AuCpup_DTIME);
+		mutex_unlock(h_mtx);
+		if (unlikely(err))
+			goto out_whtmp;
+	}
+
+	/* rename by vfs_rename or cpup */
+	d = a->dst_dentry;
+	if (au_ftest_ren(a->flags, ISDIR)
+	    && (a->dst_wh_dentry
+		|| au_dbdiropq(d) == a->btgt
+		/* hide the lower to keep xino */
+		|| a->btgt < au_dbend(d)
+		|| au_opt_test(au_mntflags(d->d_sb), ALWAYS_DIROPQ)))
+		au_fset_ren(a->flags, DIROPQ);
+	err = au_ren_or_cpup(a);
+	if (unlikely(err))
+		/* leave the copied-up one */
+		goto out_whtmp;
+
+	/* make dir opaque */
+	if (au_ftest_ren(a->flags, DIROPQ)) {
+		err = au_ren_diropq(a);
+		if (unlikely(err))
+			goto out_rename;
+	}
+
+	/* update target timestamps */
+	AuDebugOn(au_dbstart(a->src_dentry) != a->btgt);
+	a->h_path.dentry = au_h_dptr(a->src_dentry, a->btgt);
+	vfsub_update_h_iattr(&a->h_path, /*did*/NULL); /*ignore*/
+	a->src_inode->i_ctime = a->h_path.dentry->d_inode->i_ctime;
+
+	/* remove whiteout for dentry */
+	if (a->dst_wh_dentry) {
+		a->h_path.dentry = a->dst_wh_dentry;
+		err = au_wh_unlink_dentry(a->dst_h_dir, &a->h_path,
+					  a->dst_dentry);
+		if (unlikely(err))
+			goto out_diropq;
+	}
+
+	/* remove whtmp */
+	if (a->thargs)
+		au_ren_del_whtmp(a); /* ignore this error */
+
+	err = 0;
+	goto out_success;
+
+ out_diropq:
+	if (au_ftest_ren(a->flags, DIROPQ))
+		au_ren_rev_diropq(err, a);
+ out_rename:
+	if (!au_ftest_ren(a->flags, CPUP))
+		au_ren_rev_rename(err, a);
+	else
+		au_ren_rev_cpup(err, a);
+ out_whtmp:
+	if (a->thargs)
+		au_ren_rev_whtmp(err, a);
+ out_whdst:
+	dput(a->dst_wh_dentry);
+	a->dst_wh_dentry = NULL;
+ out_whsrc:
+	if (a->src_wh_dentry)
+		au_ren_rev_whsrc(err, a);
+	au_ren_rev_drop(a);
+ out_success:
+	dput(a->src_wh_dentry);
+	dput(a->dst_wh_dentry);
+ out_thargs:
+	if (a->thargs) {
+		dput(a->h_dst);
+		kfree(a->thargs);
+	}
+ out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * test if @dentry dir can be rename destination or not.
+ * success means, it is a logically empty dir.
+ */
+static int may_rename_dstdir(struct dentry *dentry, struct au_nhash *whlist)
+{
+	return au_test_empty(dentry, whlist);
+}
+
+/*
+ * test if @dentry dir can be rename source or not.
+ * if it can, return 0 and @children is filled.
+ * success means,
+ * - it is a logically empty dir.
+ * - or, it exists on writable branch and has no children including whiteouts
+ *       on the lower branch.
+ */
+static int may_rename_srcdir(struct dentry *dentry, aufs_bindex_t btgt)
+{
+	int err;
+	aufs_bindex_t bstart;
+
+	bstart = au_dbstart(dentry);
+	if (bstart != btgt) {
+		struct au_nhash *whlist;
+
+		whlist = au_nhash_new(GFP_NOFS);
+		err = PTR_ERR(whlist);
+		if (IS_ERR(whlist))
+			goto out;
+		err = au_test_empty(dentry, whlist);
+		au_nhash_del(whlist);
+		goto out;
+	}
+
+	if (bstart == au_dbtaildir(dentry))
+		return 0; /* success */
+
+	err = au_test_empty_lower(dentry);
+
+ out:
+	if (err == -ENOTEMPTY) {
+		AuWarn1("renaming dir who has child(ren) on multiple branches,"
+			" is not supported\n");
+		err = -EXDEV;
+	}
+	return err;
+}
+
+/* side effect: sets whlist and h_dentry */
+static int au_ren_may_dir(struct au_ren_args *a)
+{
+	int err;
+	struct dentry *d;
+
+	err = 0;
+	au_nhash_init(&a->whlist);
+	d = a->dst_dentry;
+	if (au_ftest_ren(a->flags, ISDIR) && a->dst_inode) {
+		au_set_dbstart(d, a->dst_bstart);
+		err = may_rename_dstdir(d, &a->whlist);
+		au_set_dbstart(d, a->btgt);
+	}
+	a->dst_h_dentry = au_h_dptr(d, au_dbstart(d));
+	if (unlikely(err))
+		goto out;
+
+	d = a->src_dentry;
+	a->src_h_dentry = au_h_dptr(d, au_dbstart(d));
+	if (au_ftest_ren(a->flags, ISDIR)) {
+		err = may_rename_srcdir(d, a->btgt);
+		if (unlikely(err))
+			au_nhash_fin(&a->whlist);
+	}
+
+ out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * simple tests for rename.
+ * following the checks in vfs, plus the parent-child relationship.
+ */
+static int au_may_ren(struct au_ren_args *a)
+{
+	int err, isdir;
+	struct inode *h_inode;
+
+	if (a->src_bstart == a->btgt) {
+		err = au_may_del(a->src_dentry, a->btgt, a->src_h_parent,
+				 au_ftest_ren(a->flags, ISDIR));
+		if (unlikely(err))
+			goto out;
+		err = -EINVAL;
+		if (unlikely(a->src_h_dentry == a->h_trap))
+			goto out;
+	}
+
+	err = 0;
+	if (a->dst_bstart != a->btgt)
+		goto out;
+
+	err = -EIO;
+	h_inode = a->dst_h_dentry->d_inode;
+	isdir = !!au_ftest_ren(a->flags, ISDIR);
+	if (!a->dst_dentry->d_inode) {
+		if (unlikely(h_inode))
+			goto out;
+		err = au_may_add(a->dst_dentry, a->btgt, a->dst_h_parent,
+				 isdir);
+	} else {
+		if (unlikely(!h_inode || !h_inode->i_nlink))
+			goto out;
+		err = au_may_del(a->dst_dentry, a->btgt, a->dst_h_parent,
+				 isdir);
+		if (unlikely(err))
+			goto out;
+		err = -ENOTEMPTY;
+		if (unlikely(a->dst_h_dentry == a->h_trap))
+			goto out;
+		err = 0;
+	}
+
+ out:
+	if (unlikely(err == -ENOENT || err == -EEXIST))
+		err = -EIO;
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * locking order
+ * (VFS)
+ * - src_dir and dir by lock_rename()
+ * - inode if exitsts
+ * (aufs)
+ * - lock all
+ *   + src_dentry and dentry by aufs_read_and_write_lock2() which calls,
+ *     + si_read_lock
+ *     + di_write_lock2_child()
+ *       + di_write_lock_child()
+ *	   + ii_write_lock_child()
+ *       + di_write_lock_child2()
+ *	   + ii_write_lock_child2()
+ *     + src_parent and parent
+ *       + di_write_lock_parent()
+ *	   + ii_write_lock_parent()
+ *       + di_write_lock_parent2()
+ *	   + ii_write_lock_parent2()
+ *   + lower src_dir and dir by vfsub_lock_rename()
+ *   + verify the every relationships between child and parent. if any
+ *     of them failed, unlock all and return -EBUSY.
+ */
+static void au_ren_unlock(struct au_ren_args *a)
+{
+	struct super_block *sb;
+
+	sb = a->dst_dentry->d_sb;
+	if (au_ftest_ren(a->flags, MNT_WRITE))
+		mnt_drop_write(a->br->br_mnt);
+	vfsub_unlock_rename(a->src_h_parent, a->src_hdir,
+			    a->dst_h_parent, a->dst_hdir);
+}
+
+static int au_ren_lock(struct au_ren_args *a)
+{
+	int err;
+	unsigned int udba;
+
+	err = 0;
+	a->src_h_parent = au_h_dptr(a->src_parent, a->btgt);
+	a->src_hdir = au_hi(a->src_dir, a->btgt);
+	a->dst_h_parent = au_h_dptr(a->dst_parent, a->btgt);
+	a->dst_hdir = au_hi(a->dst_dir, a->btgt);
+	a->h_trap = vfsub_lock_rename(a->src_h_parent, a->src_hdir,
+				      a->dst_h_parent, a->dst_hdir);
+	udba = au_opt_udba(a->src_dentry->d_sb);
+	if (au_dbstart(a->src_dentry) == a->btgt)
+		err = au_h_verify(a->src_h_dentry, udba,
+				  a->src_h_parent->d_inode, a->src_h_parent,
+				  a->br);
+	if (!err && au_dbstart(a->dst_dentry) == a->btgt)
+		err = au_h_verify(a->dst_h_dentry, udba,
+				  a->dst_h_parent->d_inode, a->dst_h_parent,
+				  a->br);
+	if (!err) {
+		err = mnt_want_write(a->br->br_mnt);
+		if (unlikely(err))
+			goto out_unlock;
+		au_fset_ren(a->flags, MNT_WRITE);
+		goto out; /* success */
+	}
+
+	err = -EBUSY;
+
+ out_unlock:
+	au_ren_unlock(a);
+ out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static void au_ren_refresh_dir(struct au_ren_args *a)
+{
+	struct inode *dir;
+
+	dir = a->dst_dir;
+	dir->i_version++;
+	if (au_ftest_ren(a->flags, ISDIR)) {
+		/* is this updating defined in POSIX? */
+		au_cpup_attr_timesizes(a->src_inode);
+		au_cpup_attr_nlink(dir, /*force*/1);
+		if (a->dst_inode) {
+			clear_nlink(a->dst_inode);
+			au_cpup_attr_timesizes(a->dst_inode);
+		}
+	}
+	if (au_ibstart(dir) == a->btgt)
+		au_cpup_attr_timesizes(dir);
+
+	if (au_ftest_ren(a->flags, ISSAMEDIR))
+		return;
+
+	dir = a->src_dir;
+	dir->i_version++;
+	if (au_ftest_ren(a->flags, ISDIR))
+		au_cpup_attr_nlink(dir, /*force*/1);
+	if (au_ibstart(dir) == a->btgt)
+		au_cpup_attr_timesizes(dir);
+}
+
+static void au_ren_refresh(struct au_ren_args *a)
+{
+	aufs_bindex_t bend, bindex;
+	struct dentry *d, *h_d;
+	struct inode *i, *h_i;
+	struct super_block *sb;
+
+	d = a->src_dentry;
+	au_set_dbwh(d, -1);
+	bend = au_dbend(d);
+	for (bindex = a->btgt + 1; bindex <= bend; bindex++) {
+		h_d = au_h_dptr(d, bindex);
+		if (h_d)
+			au_set_h_dptr(d, bindex, NULL);
+	}
+	au_set_dbend(d, a->btgt);
+
+	sb = d->d_sb;
+	i = a->src_inode;
+	if (au_opt_test(au_mntflags(sb), PLINK) && au_plink_test(i))
+		return; /* success */
+
+	bend = au_ibend(i);
+	for (bindex = a->btgt + 1; bindex <= bend; bindex++) {
+		h_i = au_h_iptr(i, bindex);
+		if (h_i) {
+			au_xino_write0(sb, bindex, h_i->i_ino, 0);
+			/* ignore this error */
+			au_set_h_iptr(i, bindex, NULL, 0);
+		}
+	}
+	au_set_ibend(i, a->btgt);
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* mainly for link(2) and rename(2) */
+int au_wbr(struct dentry *dentry, aufs_bindex_t btgt)
+{
+	aufs_bindex_t bdiropq, bwh;
+	struct dentry *parent;
+	struct au_branch *br;
+
+	parent = dentry->d_parent;
+	IMustLock(parent->d_inode); /* dir is locked */
+
+	bdiropq = au_dbdiropq(parent);
+	bwh = au_dbwh(dentry);
+	br = au_sbr(dentry->d_sb, btgt);
+	if (au_br_rdonly(br)
+	    || (0 <= bdiropq && bdiropq < btgt)
+	    || (0 <= bwh && bwh < btgt))
+		btgt = -1;
+
+	AuDbg("btgt %d\n", btgt);
+	return btgt;
+}
+
+/* sets src_bstart, dst_bstart and btgt */
+static int au_ren_wbr(struct au_ren_args *a)
+{
+	int err;
+	struct au_wr_dir_args wr_dir_args = {
+		/* .force_btgt	= -1, */
+		.flags		= AuWrDir_ADD_ENTRY
+	};
+
+	a->src_bstart = au_dbstart(a->src_dentry);
+	a->dst_bstart = au_dbstart(a->dst_dentry);
+	if (au_ftest_ren(a->flags, ISDIR))
+		au_fset_wrdir(wr_dir_args.flags, ISDIR);
+	wr_dir_args.force_btgt = a->src_bstart;
+	if (a->dst_inode && a->dst_bstart < a->src_bstart)
+		wr_dir_args.force_btgt = a->dst_bstart;
+	wr_dir_args.force_btgt = au_wbr(a->dst_dentry, wr_dir_args.force_btgt);
+	err = au_wr_dir(a->dst_dentry, a->src_dentry, &wr_dir_args);
+	a->btgt = err;
+
+	return err;
+}
+
+static void au_ren_dt(struct au_ren_args *a)
+{
+	a->h_path.dentry = a->src_h_parent;
+	au_dtime_store(a->src_dt + AuPARENT, a->src_parent, &a->h_path);
+	if (!au_ftest_ren(a->flags, ISSAMEDIR)) {
+		a->h_path.dentry = a->dst_h_parent;
+		au_dtime_store(a->dst_dt + AuPARENT, a->dst_parent, &a->h_path);
+	}
+
+	au_fclr_ren(a->flags, DT_DSTDIR);
+	if (!au_ftest_ren(a->flags, ISDIR))
+		return;
+
+	a->h_path.dentry = a->src_h_dentry;
+	au_dtime_store(a->src_dt + AuCHILD, a->src_dentry, &a->h_path);
+	if (a->dst_h_dentry->d_inode) {
+		au_fset_ren(a->flags, DT_DSTDIR);
+		a->h_path.dentry = a->dst_h_dentry;
+		au_dtime_store(a->dst_dt + AuCHILD, a->dst_dentry, &a->h_path);
+	}
+}
+
+static void au_ren_rev_dt(int err, struct au_ren_args *a)
+{
+	struct dentry *h_d;
+	struct mutex *h_mtx;
+
+	au_dtime_revert(a->src_dt + AuPARENT);
+	if (!au_ftest_ren(a->flags, ISSAMEDIR))
+		au_dtime_revert(a->dst_dt + AuPARENT);
+
+	if (au_ftest_ren(a->flags, ISDIR) && err != -EIO) {
+		h_d = a->src_dt[AuCHILD].dt_h_path.dentry;
+		h_mtx = &h_d->d_inode->i_mutex;
+		mutex_lock_nested(h_mtx, AuLsc_I_CHILD);
+		au_dtime_revert(a->src_dt + AuCHILD);
+		mutex_unlock(h_mtx);
+
+		if (au_ftest_ren(a->flags, DT_DSTDIR)) {
+			h_d = a->dst_dt[AuCHILD].dt_h_path.dentry;
+			h_mtx = &h_d->d_inode->i_mutex;
+			mutex_lock_nested(h_mtx, AuLsc_I_CHILD);
+			au_dtime_revert(a->dst_dt + AuCHILD);
+			mutex_unlock(h_mtx);
+		}
+	}
+}
+
+/* ---------------------------------------------------------------------- */
+
+int aufs_rename(struct inode *_src_dir, struct dentry *_src_dentry,
+		struct inode *_dst_dir, struct dentry *_dst_dentry)
+{
+	int err;
+	/* reduce stack space */
+	struct au_ren_args *a;
+
+	IMustLock(_src_dir);
+	IMustLock(_dst_dir);
+
+	err = -ENOMEM;
+	BUILD_BUG_ON(sizeof(*a) > PAGE_SIZE);
+	a = kzalloc(sizeof(*a), GFP_NOFS);
+	if (unlikely(!a))
+		goto out;
+
+	a->src_dir = _src_dir;
+	a->src_dentry = _src_dentry;
+	a->src_inode = a->src_dentry->d_inode;
+	a->src_parent = a->src_dentry->d_parent; /* dir inode is locked */
+	a->dst_dir = _dst_dir;
+	a->dst_dentry = _dst_dentry;
+	a->dst_inode = a->dst_dentry->d_inode;
+	a->dst_parent = a->dst_dentry->d_parent; /* dir inode is locked */
+	if (a->dst_inode) {
+		IMustLock(a->dst_inode);
+		au_igrab(a->dst_inode);
+	}
+
+	err = -ENOTDIR;
+	if (S_ISDIR(a->src_inode->i_mode)) {
+		au_fset_ren(a->flags, ISDIR);
+		if (unlikely(a->dst_inode && !S_ISDIR(a->dst_inode->i_mode)))
+			goto out_free;
+		aufs_read_and_write_lock2(a->dst_dentry, a->src_dentry,
+					  AuLock_DIR | AuLock_FLUSH);
+	} else
+		aufs_read_and_write_lock2(a->dst_dentry, a->src_dentry,
+					  AuLock_FLUSH);
+
+	au_fset_ren(a->flags, ISSAMEDIR); /* temporary */
+	di_write_lock_parent(a->dst_parent);
+
+	/* which branch we process */
+	err = au_ren_wbr(a);
+	if (unlikely(err < 0))
+		goto out_unlock;
+	a->br = au_sbr(a->dst_dentry->d_sb, a->btgt);
+	a->h_path.mnt = a->br->br_mnt;
+
+	/* are they available to be renamed */
+	err = au_ren_may_dir(a);
+	if (unlikely(err))
+		goto out_unlock;
+
+	/* prepare the writable parent dir on the same branch */
+	if (a->dst_bstart == a->btgt) {
+		au_fset_ren(a->flags, WHDST);
+	} else {
+		err = au_cpup_dirs(a->dst_dentry, a->btgt);
+		if (unlikely(err))
+			goto out_children;
+	}
+
+	if (a->src_dir != a->dst_dir) {
+		/*
+		 * this temporary unlock is safe,
+		 * because both dir->i_mutex are locked.
+		 */
+		di_write_unlock(a->dst_parent);
+		di_write_lock_parent(a->src_parent);
+		err = au_wr_dir_need_wh(a->src_dentry,
+					au_ftest_ren(a->flags, ISDIR),
+					&a->btgt);
+		di_write_unlock(a->src_parent);
+		di_write_lock2_parent(a->src_parent, a->dst_parent, /*isdir*/1);
+		au_fclr_ren(a->flags, ISSAMEDIR);
+	} else
+		err = au_wr_dir_need_wh(a->src_dentry,
+					au_ftest_ren(a->flags, ISDIR),
+					&a->btgt);
+	if (unlikely(err < 0))
+		goto out_children;
+	if (err)
+		au_fset_ren(a->flags, WHSRC);
+
+	/* lock them all */
+	err = au_ren_lock(a);
+	if (unlikely(err))
+		goto out_children;
+
+	if (!au_opt_test(au_mntflags(a->dst_dir->i_sb), UDBA_NONE)) {
+		err = au_may_ren(a);
+		if (unlikely(err))
+			goto out_hdir;
+	}
+
+	/* store timestamps to be revertible */
+	au_ren_dt(a);
+
+	/* here we go */
+	err = do_rename(a);
+	if (unlikely(err))
+		goto out_dt;
+
+	/* update dir attributes */
+	au_ren_refresh_dir(a);
+
+	/* dput/iput all lower dentries */
+	au_ren_refresh(a);
+
+	goto out_hdir; /* success */
+
+ out_dt:
+	au_ren_rev_dt(err, a);
+ out_hdir:
+	au_ren_unlock(a);
+ out_children:
+	au_nhash_fin(&a->whlist);
+ out_unlock:
+	if (unlikely(err && au_ftest_ren(a->flags, ISDIR))) {
+		au_update_dbstart(a->dst_dentry);
+		d_drop(a->dst_dentry);
+	}
+	if (!err) {
+		d_move(a->src_dentry, a->dst_dentry);
+		if (a->dst_inode
+		    && (a->dst_inode->i_nlink <= 1
+			|| au_ftest_ren(a->flags, ISDIR)))
+			a->dst_inode->i_flags |= S_DEAD;
+	}
+	if (au_ftest_ren(a->flags, ISSAMEDIR))
+		di_write_unlock(a->dst_parent);
+	else
+		di_write_unlock2(a->src_parent, a->dst_parent);
+	aufs_read_and_write_unlock2(a->dst_dentry, a->src_dentry);
+ out_free:
+	iput(a->dst_inode);
+	kfree(a);
+ out:
+	return err;
+}
diff --git a/fs/aufs/iinfo.c b/fs/aufs/iinfo.c
new file mode 100644
index 0000000..ed9c55c
--- /dev/null
+++ b/fs/aufs/iinfo.c
@@ -0,0 +1,257 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * inode private data
+ */
+
+#include "aufs.h"
+
+struct inode *au_h_iptr(struct inode *inode, aufs_bindex_t bindex)
+{
+	struct inode *h_inode;
+
+	h_inode = au_ii(inode)->ii_hinode[0 + bindex].hi_inode;
+	AuDebugOn(h_inode && atomic_read(&h_inode->i_count) <= 0);
+	return h_inode;
+}
+
+/* todo: hard/soft set? */
+void au_set_ibstart(struct inode *inode, aufs_bindex_t bindex)
+{
+	struct au_iinfo *iinfo = au_ii(inode);
+	struct inode *h_inode;
+
+	iinfo->ii_bstart = bindex;
+	h_inode = iinfo->ii_hinode[bindex + 0].hi_inode;
+	if (h_inode)
+		au_cpup_igen(inode, h_inode);
+}
+
+void au_hiput(struct au_hinode *hinode)
+{
+	au_hin_free(hinode);
+	dput(hinode->hi_whdentry);
+	iput(hinode->hi_inode);
+}
+
+unsigned int au_hi_flags(struct inode *inode, int isdir)
+{
+	unsigned int flags;
+	const unsigned int mnt_flags = au_mntflags(inode->i_sb);
+
+	flags = 0;
+	if (au_opt_test(mnt_flags, XINO))
+		au_fset_hi(flags, XINO);
+	if (isdir && au_opt_test(mnt_flags, UDBA_HINOTIFY))
+		au_fset_hi(flags, HINOTIFY);
+	return flags;
+}
+
+void au_set_h_iptr(struct inode *inode, aufs_bindex_t bindex,
+		   struct inode *h_inode, unsigned int flags)
+{
+	struct au_hinode *hinode;
+	struct inode *hi;
+	struct au_iinfo *iinfo = au_ii(inode);
+
+	hinode = iinfo->ii_hinode + bindex;
+	hi = hinode->hi_inode;
+	AuDebugOn(h_inode && atomic_read(&h_inode->i_count) <= 0);
+	AuDebugOn(h_inode && hi);
+
+	if (hi)
+		au_hiput(hinode);
+	hinode->hi_inode = h_inode;
+	if (h_inode) {
+		int err;
+		struct super_block *sb = inode->i_sb;
+		struct au_branch *br;
+
+		if (bindex == iinfo->ii_bstart)
+			au_cpup_igen(inode, h_inode);
+		br = au_sbr(sb, bindex);
+		hinode->hi_id = br->br_id;
+		if (au_ftest_hi(flags, XINO)) {
+			err = au_xino_write(sb, bindex, h_inode->i_ino,
+					    inode->i_ino);
+			if (unlikely(err))
+				AuIOErr1("failed au_xino_write() %d\n", err);
+		}
+
+		if (au_ftest_hi(flags, HINOTIFY)
+		    && au_br_hinotifyable(br->br_perm)) {
+			err = au_hin_alloc(hinode, inode, h_inode);
+			if (unlikely(err))
+				AuIOErr1("au_hin_alloc() %d\n", err);
+		}
+	}
+}
+
+void au_set_hi_wh(struct inode *inode, aufs_bindex_t bindex,
+		  struct dentry *h_wh)
+{
+	struct au_hinode *hinode;
+
+	hinode = au_ii(inode)->ii_hinode + bindex;
+	AuDebugOn(hinode->hi_whdentry);
+	hinode->hi_whdentry = h_wh;
+}
+
+void au_update_iigen(struct inode *inode)
+{
+	atomic_set(&au_ii(inode)->ii_generation, au_sigen(inode->i_sb));
+	/* smp_mb(); */ /* atomic_set */
+}
+
+/* it may be called at remount time, too */
+void au_update_brange(struct inode *inode, int do_put_zero)
+{
+	struct au_iinfo *iinfo;
+
+	iinfo = au_ii(inode);
+	if (!iinfo || iinfo->ii_bstart < 0)
+		return;
+
+	if (do_put_zero) {
+		aufs_bindex_t bindex;
+
+		for (bindex = iinfo->ii_bstart; bindex <= iinfo->ii_bend;
+		     bindex++) {
+			struct inode *h_i;
+
+			h_i = iinfo->ii_hinode[0 + bindex].hi_inode;
+			if (h_i && !h_i->i_nlink)
+				au_set_h_iptr(inode, bindex, NULL, 0);
+		}
+	}
+
+	iinfo->ii_bstart = -1;
+	while (++iinfo->ii_bstart <= iinfo->ii_bend)
+		if (iinfo->ii_hinode[0 + iinfo->ii_bstart].hi_inode)
+			break;
+	if (iinfo->ii_bstart > iinfo->ii_bend) {
+		iinfo->ii_bstart = -1;
+		iinfo->ii_bend = -1;
+		return;
+	}
+
+	iinfo->ii_bend++;
+	while (0 <= --iinfo->ii_bend)
+		if (iinfo->ii_hinode[0 + iinfo->ii_bend].hi_inode)
+			break;
+	AuDebugOn(iinfo->ii_bstart > iinfo->ii_bend || iinfo->ii_bend < 0);
+}
+
+/* ---------------------------------------------------------------------- */
+
+int au_iinfo_init(struct inode *inode)
+{
+	struct au_iinfo *iinfo;
+	struct super_block *sb;
+	int nbr, i;
+
+	sb = inode->i_sb;
+	iinfo = &(container_of(inode, struct au_icntnr, vfs_inode)->iinfo);
+	nbr = au_sbend(sb) + 1;
+	if (unlikely(nbr <= 0))
+		nbr = 1;
+	iinfo->ii_hinode = kcalloc(nbr, sizeof(*iinfo->ii_hinode), GFP_NOFS);
+	if (iinfo->ii_hinode) {
+		for (i = 0; i < nbr; i++)
+			iinfo->ii_hinode[i].hi_id = -1;
+
+		atomic_set(&iinfo->ii_generation, au_sigen(sb));
+		/* smp_mb(); */ /* atomic_set */
+		init_rwsem(&iinfo->ii_rwsem);
+		iinfo->ii_bstart = -1;
+		iinfo->ii_bend = -1;
+		iinfo->ii_vdir = NULL;
+		return 0;
+	}
+	return -ENOMEM;
+}
+
+int au_ii_realloc(struct au_iinfo *iinfo, int nbr)
+{
+	int err, sz;
+	struct au_hinode *hip;
+
+	err = -ENOMEM;
+	sz = sizeof(*hip) * (iinfo->ii_bend + 1);
+	if (!sz)
+		sz = sizeof(*hip);
+	hip = au_kzrealloc(iinfo->ii_hinode, sz, sizeof(*hip) * nbr, GFP_NOFS);
+	if (hip) {
+		iinfo->ii_hinode = hip;
+		err = 0;
+	}
+
+	return err;
+}
+
+static int au_iinfo_write0(struct super_block *sb, struct au_hinode *hinode,
+			   ino_t ino)
+{
+	int err;
+	aufs_bindex_t bindex;
+	unsigned char locked;
+
+	err = 0;
+	locked = !!si_noflush_read_trylock(sb);
+	bindex = au_br_index(sb, hinode->hi_id);
+	if (bindex >= 0)
+		err = au_xino_write0(sb, bindex, hinode->hi_inode->i_ino, ino);
+	/* error action? */
+	if (locked)
+		si_read_unlock(sb);
+	return err;
+}
+
+void au_iinfo_fin(struct inode *inode)
+{
+	ino_t ino;
+	aufs_bindex_t bend;
+	unsigned char unlinked;
+	struct au_iinfo *iinfo;
+	struct au_hinode *hi;
+	struct super_block *sb;
+
+	iinfo = au_ii(inode);
+	/* bad_inode case */
+	if (!iinfo)
+		return;
+
+	if (iinfo->ii_vdir)
+		au_vdir_free(iinfo->ii_vdir);
+
+	if (iinfo->ii_bstart >= 0) {
+		sb = inode->i_sb;
+		unlinked = !inode->i_nlink;
+		ino = 0;
+		if (unlinked)
+			ino = inode->i_ino;
+		hi = iinfo->ii_hinode + iinfo->ii_bstart;
+		bend = iinfo->ii_bend;
+		while (iinfo->ii_bstart++ <= bend) {
+			if (hi->hi_inode) {
+				if (unlinked || !hi->hi_inode->i_nlink) {
+					au_iinfo_write0(sb, hi, ino);
+					/* ignore this error */
+					ino = 0;
+				}
+				au_hiput(hi);
+			}
+			hi++;
+		}
+	}
+
+	kfree(iinfo->ii_hinode);
+	au_rwsem_destroy(&iinfo->ii_rwsem);
+}
diff --git a/fs/aufs/inode.c b/fs/aufs/inode.c
new file mode 100644
index 0000000..c4a962b
--- /dev/null
+++ b/fs/aufs/inode.c
@@ -0,0 +1,356 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * inode functions
+ */
+
+#include "aufs.h"
+
+static void au_refresh_hinode_attr(struct inode *inode, int do_version)
+{
+	au_cpup_attr_all(inode, /*force*/0);
+	au_update_iigen(inode);
+	if (do_version)
+		inode->i_version++;
+}
+
+int au_refresh_hinode_self(struct inode *inode, int do_attr)
+{
+	int err;
+	aufs_bindex_t bindex, new_bindex;
+	unsigned char update;
+	struct inode *first;
+	struct au_hinode *p, *q, tmp;
+	struct super_block *sb;
+	struct au_iinfo *iinfo;
+
+	update = 0;
+	sb = inode->i_sb;
+	iinfo = au_ii(inode);
+	err = au_ii_realloc(iinfo, au_sbend(sb) + 1);
+	if (unlikely(err))
+		goto out;
+
+	p = iinfo->ii_hinode + iinfo->ii_bstart;
+	first = p->hi_inode;
+	err = 0;
+	for (bindex = iinfo->ii_bstart; bindex <= iinfo->ii_bend;
+	     bindex++, p++) {
+		if (!p->hi_inode)
+			continue;
+
+		new_bindex = au_br_index(sb, p->hi_id);
+		if (new_bindex == bindex)
+			continue;
+
+		if (new_bindex < 0) {
+			update++;
+			au_hiput(p);
+			p->hi_inode = NULL;
+			continue;
+		}
+
+		if (new_bindex < iinfo->ii_bstart)
+			iinfo->ii_bstart = new_bindex;
+		if (iinfo->ii_bend < new_bindex)
+			iinfo->ii_bend = new_bindex;
+		/* swap two lower inode, and loop again */
+		q = iinfo->ii_hinode + new_bindex;
+		tmp = *q;
+		*q = *p;
+		*p = tmp;
+		if (tmp.hi_inode) {
+			bindex--;
+			p--;
+		}
+	}
+	au_update_brange(inode, /*do_put_zero*/0);
+	if (do_attr)
+		au_refresh_hinode_attr(inode, update && S_ISDIR(inode->i_mode));
+
+ out:
+	return err;
+}
+
+int au_refresh_hinode(struct inode *inode, struct dentry *dentry)
+{
+	int err, update;
+	unsigned int flags;
+	aufs_bindex_t bindex, bend;
+	unsigned char isdir;
+	struct inode *first;
+	struct au_hinode *p;
+	struct au_iinfo *iinfo;
+
+	err = au_refresh_hinode_self(inode, /*do_attr*/0);
+	if (unlikely(err))
+		goto out;
+
+	update = 0;
+	iinfo = au_ii(inode);
+	p = iinfo->ii_hinode + iinfo->ii_bstart;
+	first = p->hi_inode;
+	isdir = S_ISDIR(inode->i_mode);
+	flags = au_hi_flags(inode, isdir);
+	bend = au_dbend(dentry);
+	for (bindex = au_dbstart(dentry); bindex <= bend; bindex++) {
+		struct inode *h_i;
+		struct dentry *h_d;
+
+		h_d = au_h_dptr(dentry, bindex);
+		if (!h_d || !h_d->d_inode)
+			continue;
+
+		if (iinfo->ii_bstart <= bindex && bindex <= iinfo->ii_bend) {
+			h_i = au_h_iptr(inode, bindex);
+			if (h_i) {
+				if (h_i == h_d->d_inode)
+					continue;
+				err = -EIO;
+				break;
+			}
+		}
+		if (bindex < iinfo->ii_bstart)
+			iinfo->ii_bstart = bindex;
+		if (iinfo->ii_bend < bindex)
+			iinfo->ii_bend = bindex;
+		au_set_h_iptr(inode, bindex, au_igrab(h_d->d_inode), flags);
+		update = 1;
+	}
+	au_update_brange(inode, /*do_put_zero*/0);
+
+	if (unlikely(err))
+		goto out;
+
+	au_refresh_hinode_attr(inode, update && isdir);
+
+ out:
+	return err;
+}
+
+static int set_inode(struct inode *inode, struct dentry *dentry)
+{
+	int err;
+	unsigned int flags;
+	umode_t mode;
+	aufs_bindex_t bindex, bstart, btail;
+	unsigned char isdir;
+	struct dentry *h_dentry;
+	struct inode *h_inode;
+	struct au_iinfo *iinfo;
+
+	err = 0;
+	isdir = 0;
+	bstart = au_dbstart(dentry);
+	h_inode = au_h_dptr(dentry, bstart)->d_inode;
+	mode = h_inode->i_mode;
+	switch (mode & S_IFMT) {
+	case S_IFREG:
+		btail = au_dbtail(dentry);
+		inode->i_op = &aufs_iop;
+		inode->i_fop = &aufs_file_fop;
+		inode->i_mapping->a_ops = &aufs_aop;
+		break;
+	case S_IFDIR:
+		isdir = 1;
+		btail = au_dbtaildir(dentry);
+		inode->i_op = &aufs_dir_iop;
+		inode->i_fop = &aufs_dir_fop;
+		break;
+	case S_IFLNK:
+		btail = au_dbtail(dentry);
+		inode->i_op = &aufs_symlink_iop;
+		break;
+	case S_IFBLK:
+	case S_IFCHR:
+	case S_IFIFO:
+	case S_IFSOCK:
+		btail = au_dbtail(dentry);
+		inode->i_op = &aufs_iop;
+		init_special_inode(inode, mode, h_inode->i_rdev);
+		break;
+	default:
+		AuIOErr("Unknown file type 0%o\n", mode);
+		err = -EIO;
+		goto out;
+	}
+
+	flags = au_hi_flags(inode, isdir);
+	iinfo = au_ii(inode);
+	iinfo->ii_bstart = bstart;
+	iinfo->ii_bend = btail;
+	for (bindex = bstart; bindex <= btail; bindex++) {
+		h_dentry = au_h_dptr(dentry, bindex);
+		if (h_dentry)
+			au_set_h_iptr(inode, bindex,
+				      au_igrab(h_dentry->d_inode), flags);
+	}
+	au_cpup_attr_all(inode, /*force*/1);
+
+ out:
+	return err;
+}
+
+/* successful returns with iinfo write_locked */
+static int reval_inode(struct inode *inode, struct dentry *dentry, int *matched)
+{
+	int err;
+	aufs_bindex_t bindex, bend;
+	struct inode *h_inode, *h_dinode;
+
+	*matched = 0;
+
+	/*
+	 * before this function, if aufs got any iinfo lock, it must be only
+	 * one, the parent dir.
+	 * it can happen by UDBA and the obsoleted inode number.
+	 */
+	err = -EIO;
+	if (unlikely(inode->i_ino == parent_ino(dentry)))
+		goto out;
+
+	ii_write_lock_new_child(inode);
+	if (unlikely(IS_DEADDIR(inode)))
+		goto out_unlock;
+
+	err = 0;
+	h_dinode = au_h_dptr(dentry, au_dbstart(dentry))->d_inode;
+	bend = au_ibend(inode);
+	for (bindex = au_ibstart(inode); bindex <= bend; bindex++) {
+		h_inode = au_h_iptr(inode, bindex);
+		if (h_inode && h_inode == h_dinode) {
+			*matched = 1;
+			err = 0;
+			if (au_iigen(inode) != au_digen(dentry))
+				err = au_refresh_hinode(inode, dentry);
+			break;
+		}
+	}
+
+ out_unlock:
+	if (unlikely(err))
+		ii_write_unlock(inode);
+ out:
+	return err;
+}
+
+/* successful returns with iinfo write_locked */
+/* todo: return with unlocked? */
+struct inode *au_new_inode(struct dentry *dentry, int must_new)
+{
+	struct inode *inode;
+	struct dentry *h_dentry;
+	struct super_block *sb;
+	ino_t h_ino, ino;
+	int err, match;
+	aufs_bindex_t bstart;
+
+	sb = dentry->d_sb;
+	bstart = au_dbstart(dentry);
+	h_dentry = au_h_dptr(dentry, bstart);
+	h_ino = h_dentry->d_inode->i_ino;
+	err = au_xino_read(sb, bstart, h_ino, &ino);
+	inode = ERR_PTR(err);
+	if (unlikely(err))
+		goto out;
+ new_ino:
+	if (!ino) {
+		ino = au_xino_new_ino(sb);
+		if (unlikely(!ino)) {
+			inode = ERR_PTR(-EIO);
+			goto out;
+		}
+	}
+
+	AuDbg("i%lu\n", (unsigned long)ino);
+	inode = au_iget_locked(sb, ino);
+	err = PTR_ERR(inode);
+	if (IS_ERR(inode))
+		goto out;
+
+	AuDbg("%lx, new %d\n", inode->i_state, !!(inode->i_state & I_NEW));
+	if (inode->i_state & I_NEW) {
+		ii_write_lock_new_child(inode);
+		err = set_inode(inode, dentry);
+		unlock_new_inode(inode);
+		if (!err)
+			goto out; /* success */
+
+		iget_failed(inode);
+		ii_write_unlock(inode);
+		goto out_iput;
+	} else if (!must_new) {
+		err = reval_inode(inode, dentry, &match);
+		if (!err)
+			goto out; /* success */
+		else if (match)
+			goto out_iput;
+	}
+
+	if (unlikely(au_test_fs_unique_ino(h_dentry->d_inode)))
+		AuWarn1("Un-notified UDBA or repeatedly renamed dir,"
+			" b%d, %s, %.*s, hi%lu, i%lu.\n",
+			bstart, au_sbtype(h_dentry->d_sb), AuDLNPair(dentry),
+			(unsigned long)h_ino, (unsigned long)ino);
+	ino = 0;
+	err = au_xino_write0(sb, bstart, h_ino, 0);
+	if (!err) {
+		iput(inode);
+		goto new_ino;
+	}
+
+ out_iput:
+	iput(inode);
+	inode = ERR_PTR(err);
+ out:
+	return inode;
+}
+
+/* ---------------------------------------------------------------------- */
+
+int au_test_ro(struct super_block *sb, aufs_bindex_t bindex,
+	       struct inode *inode)
+{
+	int err;
+
+	err = au_br_rdonly(au_sbr(sb, bindex));
+
+	/* pseudo-link after flushed may happen out of bounds */
+	if (!err
+	    && inode
+	    && au_ibstart(inode) <= bindex
+	    && bindex <= au_ibend(inode)) {
+		/*
+		 * permission check is unnecessary since vfsub routine
+		 * will be called later
+		 */
+		struct inode *hi = au_h_iptr(inode, bindex);
+		if (hi)
+			err = IS_IMMUTABLE(hi) ? -EROFS : 0;
+	}
+
+	return err;
+}
+
+int au_test_h_perm(struct inode *h_inode, int mask)
+{
+	if (!current_fsuid())
+		return 0;
+	return inode_permission(h_inode, mask);
+}
+
+int au_test_h_perm_sio(struct inode *h_inode, int mask)
+{
+	if (au_test_nfs(h_inode->i_sb)
+	    && (mask & MAY_WRITE)
+	    && S_ISDIR(h_inode->i_mode))
+		mask |= MAY_READ; /* force permission check */
+	return au_test_h_perm(h_inode, mask);
+}
diff --git a/fs/aufs/inode.h b/fs/aufs/inode.h
new file mode 100644
index 0000000..ffda951
--- /dev/null
+++ b/fs/aufs/inode.h
@@ -0,0 +1,471 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * inode operations
+ */
+
+#ifndef __AUFS_INODE_H__
+#define __AUFS_INODE_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+#include <linux/inotify.h>
+#include <linux/namei.h>
+#include <linux/aufs_type.h>
+#include "rwsem.h"
+
+struct au_hinotify {
+#ifdef CONFIG_AUFS_HINOTIFY
+	struct inotify_watch	hin_watch;
+	struct inode		*hin_aufs_inode;	/* no get/put */
+#endif
+};
+
+struct au_hinode {
+	struct inode		*hi_inode;
+	aufs_bindex_t		hi_id;
+#ifdef CONFIG_AUFS_HINOTIFY
+	struct au_hinotify	*hi_notify;
+#endif
+
+	/* reference to the copied-up whiteout with get/put */
+	struct dentry		*hi_whdentry;
+};
+
+struct au_vdir;
+struct au_iinfo {
+	atomic_t		ii_generation;
+	struct super_block	*ii_hsb1;	/* no get/put */
+
+	struct rw_semaphore	ii_rwsem;
+	aufs_bindex_t		ii_bstart, ii_bend;
+	__u32			ii_higen;
+	struct au_hinode	*ii_hinode;
+	struct au_vdir		*ii_vdir;
+};
+
+struct au_icntnr {
+	struct au_iinfo iinfo;
+	struct inode vfs_inode;
+};
+
+/* au_pin flags */
+#define AuPin_DI_LOCKED		1
+#define AuPin_MNT_WRITE		(1 << 1)
+#define au_ftest_pin(flags, name)	((flags) & AuPin_##name)
+#define au_fset_pin(flags, name)	{ (flags) |= AuPin_##name; }
+#define au_fclr_pin(flags, name)	{ (flags) &= ~AuPin_##name; }
+
+struct au_pin {
+	/* input */
+	struct dentry *dentry;
+	unsigned int udba;
+	unsigned char lsc_di, lsc_hi, flags;
+	aufs_bindex_t bindex;
+
+	/* output */
+	struct dentry *parent;
+	struct au_hinode *hdir;
+	struct vfsmount *h_mnt;
+};
+
+/* ---------------------------------------------------------------------- */
+
+static inline struct au_iinfo *au_ii(struct inode *inode)
+{
+	struct au_iinfo *iinfo;
+
+	iinfo = &(container_of(inode, struct au_icntnr, vfs_inode)->iinfo);
+	if (iinfo->ii_hinode)
+		return iinfo;
+	return NULL; /* debugging bad_inode case */
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* inode.c */
+int au_refresh_hinode_self(struct inode *inode, int do_attr);
+int au_refresh_hinode(struct inode *inode, struct dentry *dentry);
+struct inode *au_new_inode(struct dentry *dentry, int must_new);
+int au_test_ro(struct super_block *sb, aufs_bindex_t bindex,
+	       struct inode *inode);
+int au_test_h_perm(struct inode *h_inode, int mask);
+int au_test_h_perm_sio(struct inode *h_inode, int mask);
+
+/* i_op.c */
+extern struct inode_operations aufs_iop, aufs_symlink_iop, aufs_dir_iop;
+
+/* au_wr_dir flags */
+#define AuWrDir_ADD_ENTRY	1
+#define AuWrDir_ISDIR		(1 << 1)
+#define au_ftest_wrdir(flags, name)	((flags) & AuWrDir_##name)
+#define au_fset_wrdir(flags, name)	{ (flags) |= AuWrDir_##name; }
+#define au_fclr_wrdir(flags, name)	{ (flags) &= ~AuWrDir_##name; }
+
+struct au_wr_dir_args {
+	aufs_bindex_t force_btgt;
+	unsigned char flags;
+};
+int au_wr_dir(struct dentry *dentry, struct dentry *src_dentry,
+	      struct au_wr_dir_args *args);
+
+struct dentry *au_pinned_h_parent(struct au_pin *pin);
+void au_pin_init(struct au_pin *pin, struct dentry *dentry,
+		 aufs_bindex_t bindex, int lsc_di, int lsc_hi,
+		 unsigned int udba, unsigned char flags);
+int au_pin(struct au_pin *pin, struct dentry *dentry, aufs_bindex_t bindex,
+	   unsigned int udba, unsigned char flags) __must_check;
+int au_do_pin(struct au_pin *pin) __must_check;
+void au_unpin(struct au_pin *pin);
+
+/* i_op_add.c */
+int au_may_add(struct dentry *dentry, aufs_bindex_t bindex,
+	       struct dentry *h_parent, int isdir);
+int aufs_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
+int aufs_symlink(struct inode *dir, struct dentry *dentry, const char *symname);
+int aufs_create(struct inode *dir, struct dentry *dentry, int mode,
+		struct nameidata *nd);
+int aufs_link(struct dentry *src_dentry, struct inode *dir,
+	      struct dentry *dentry);
+int aufs_mkdir(struct inode *dir, struct dentry *dentry, int mode);
+
+/* i_op_del.c */
+int au_wr_dir_need_wh(struct dentry *dentry, int isdir, aufs_bindex_t *bcpup);
+int au_may_del(struct dentry *dentry, aufs_bindex_t bindex,
+	       struct dentry *h_parent, int isdir);
+int aufs_unlink(struct inode *dir, struct dentry *dentry);
+int aufs_rmdir(struct inode *dir, struct dentry *dentry);
+
+/* i_op_ren.c */
+int au_wbr(struct dentry *dentry, aufs_bindex_t btgt);
+int aufs_rename(struct inode *src_dir, struct dentry *src_dentry,
+		struct inode *dir, struct dentry *dentry);
+
+/* iinfo.c */
+struct inode *au_h_iptr(struct inode *inode, aufs_bindex_t bindex);
+void au_hiput(struct au_hinode *hinode);
+void au_set_ibstart(struct inode *inode, aufs_bindex_t bindex);
+void au_set_hi_wh(struct inode *inode, aufs_bindex_t bindex,
+		  struct dentry *h_wh);
+unsigned int au_hi_flags(struct inode *inode, int isdir);
+
+/* hinode flags */
+#define AuHi_XINO	1
+#define AuHi_HINOTIFY	(1 << 1)
+#define au_ftest_hi(flags, name)	((flags) & AuHi_##name)
+#define au_fset_hi(flags, name)		{ (flags) |= AuHi_##name; }
+#define au_fclr_hi(flags, name)		{ (flags) &= ~AuHi_##name; }
+
+#ifndef CONFIG_AUFS_HINOTIFY
+#undef AuHi_HINOTIFY
+#define AuHi_HINOTIFY	0
+#endif
+
+void au_set_h_iptr(struct inode *inode, aufs_bindex_t bindex,
+		   struct inode *h_inode, unsigned int flags);
+
+void au_update_iigen(struct inode *inode);
+void au_update_brange(struct inode *inode, int do_put_zero);
+
+int au_iinfo_init(struct inode *inode);
+void au_iinfo_fin(struct inode *inode);
+int au_ii_realloc(struct au_iinfo *iinfo, int nbr);
+
+/* plink.c */
+void au_plink_block_maintain(struct super_block *sb);
+#ifdef CONFIG_AUFS_DEBUG
+void au_plink_list(struct super_block *sb);
+#else
+static inline void au_plink_list(struct super_block *sb)
+{
+	/* nothing */
+}
+#endif
+int au_plink_test(struct inode *inode);
+struct dentry *au_plink_lkup(struct inode *inode, aufs_bindex_t bindex);
+void au_plink_append(struct inode *inode, aufs_bindex_t bindex,
+		     struct dentry *h_dentry);
+void au_plink_put(struct super_block *sb);
+void au_plink_half_refresh(struct super_block *sb, aufs_bindex_t br_id);
+
+/* ---------------------------------------------------------------------- */
+
+/* lock subclass for iinfo */
+enum {
+	AuLsc_II_CHILD,		/* child first */
+	AuLsc_II_CHILD2,	/* rename(2), link(2), and cpup at hinotify */
+	AuLsc_II_CHILD3,	/* copyup dirs */
+	AuLsc_II_PARENT,	/* see AuLsc_I_PARENT in vfsub.h */
+	AuLsc_II_PARENT2,
+	AuLsc_II_PARENT3,	/* copyup dirs */
+	AuLsc_II_NEW_CHILD
+};
+
+/*
+ * ii_read_lock_child, ii_write_lock_child,
+ * ii_read_lock_child2, ii_write_lock_child2,
+ * ii_read_lock_child3, ii_write_lock_child3,
+ * ii_read_lock_parent, ii_write_lock_parent,
+ * ii_read_lock_parent2, ii_write_lock_parent2,
+ * ii_read_lock_parent3, ii_write_lock_parent3,
+ * ii_read_lock_new_child, ii_write_lock_new_child,
+ */
+#define AuReadLockFunc(name, lsc) \
+static inline void ii_read_lock_##name(struct inode *i) \
+{ \
+	down_read_nested(&au_ii(i)->ii_rwsem, AuLsc_II_##lsc); \
+}
+
+#define AuWriteLockFunc(name, lsc) \
+static inline void ii_write_lock_##name(struct inode *i) \
+{ \
+	down_write_nested(&au_ii(i)->ii_rwsem, AuLsc_II_##lsc); \
+}
+
+#define AuRWLockFuncs(name, lsc) \
+	AuReadLockFunc(name, lsc) \
+	AuWriteLockFunc(name, lsc)
+
+AuRWLockFuncs(child, CHILD);
+AuRWLockFuncs(child2, CHILD2);
+AuRWLockFuncs(child3, CHILD3);
+AuRWLockFuncs(parent, PARENT);
+AuRWLockFuncs(parent2, PARENT2);
+AuRWLockFuncs(parent3, PARENT3);
+AuRWLockFuncs(new_child, NEW_CHILD);
+
+#undef AuReadLockFunc
+#undef AuWriteLockFunc
+#undef AuRWLockFuncs
+
+/*
+ * ii_read_unlock, ii_write_unlock, ii_downgrade_lock
+ */
+AuSimpleUnlockRwsemFuncs(ii, struct inode *i, &au_ii(i)->ii_rwsem);
+
+#define IiMustNoWaiters(i)	AuRwMustNoWaiters(&au_ii(i)->ii_rwsem)
+
+/* ---------------------------------------------------------------------- */
+
+static inline unsigned int au_iigen(struct inode *inode)
+{
+	return atomic_read(&au_ii(inode)->ii_generation);
+}
+
+/* tiny test for inode number */
+/* tmpfs generation is too rough */
+static inline int au_test_higen(struct inode *inode, struct inode *h_inode)
+{
+	struct au_iinfo *iinfo;
+
+	iinfo = au_ii(inode);
+	return !(iinfo->ii_hsb1 == h_inode->i_sb
+		 && iinfo->ii_higen == h_inode->i_generation);
+}
+
+static inline struct inode *au_igrab(struct inode *inode)
+{
+	if (inode) {
+		AuDebugOn(!atomic_read(&inode->i_count));
+		atomic_inc(&inode->i_count);
+	}
+	return inode;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static inline aufs_bindex_t au_ii_br_id(struct inode *inode,
+					aufs_bindex_t bindex)
+{
+	return au_ii(inode)->ii_hinode[0 + bindex].hi_id;
+}
+
+static inline aufs_bindex_t au_ibstart(struct inode *inode)
+{
+	return au_ii(inode)->ii_bstart;
+}
+
+static inline aufs_bindex_t au_ibend(struct inode *inode)
+{
+	return au_ii(inode)->ii_bend;
+}
+
+static inline struct au_vdir *au_ivdir(struct inode *inode)
+{
+	return au_ii(inode)->ii_vdir;
+}
+
+static inline struct dentry *au_hi_wh(struct inode *inode, aufs_bindex_t bindex)
+{
+	return au_ii(inode)->ii_hinode[0 + bindex].hi_whdentry;
+}
+
+static inline void au_set_ibend(struct inode *inode, aufs_bindex_t bindex)
+{
+	au_ii(inode)->ii_bend = bindex;
+}
+
+static inline void au_set_ivdir(struct inode *inode, struct au_vdir *vdir)
+{
+	au_ii(inode)->ii_vdir = vdir;
+}
+
+static inline struct au_hinode *au_hi(struct inode *inode, aufs_bindex_t bindex)
+{
+	return au_ii(inode)->ii_hinode + bindex;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static inline struct dentry *au_pinned_parent(struct au_pin *pin)
+{
+	if (pin)
+		return pin->parent;
+	return NULL;
+}
+
+static inline struct inode *au_pinned_h_dir(struct au_pin *pin)
+{
+	if (pin && pin->hdir)
+		return pin->hdir->hi_inode;
+	return NULL;
+}
+
+static inline struct au_hinode *au_pinned_hdir(struct au_pin *pin)
+{
+	if (pin)
+		return pin->hdir;
+	return NULL;
+}
+
+static inline void au_pin_set_dentry(struct au_pin *pin, struct dentry *dentry)
+{
+	if (pin)
+		pin->dentry = dentry;
+}
+
+static inline void au_pin_set_parent_lflag(struct au_pin *pin,
+					   unsigned char lflag)
+{
+	if (pin) {
+		/* dirty macros require brackets */
+		if (lflag) {
+			au_fset_pin(pin->flags, DI_LOCKED);
+		} else {
+			au_fclr_pin(pin->flags, DI_LOCKED);
+		}
+	}
+}
+
+static inline void au_pin_set_parent(struct au_pin *pin, struct dentry *parent)
+{
+	if (pin) {
+		dput(pin->parent);
+		pin->parent = dget(parent);
+	}
+}
+
+/* ---------------------------------------------------------------------- */
+
+#ifdef CONFIG_AUFS_HINOTIFY
+/* hinotify.c */
+int au_hin_alloc(struct au_hinode *hinode, struct inode *inode,
+		 struct inode *h_inode);
+void au_hin_free(struct au_hinode *hinode);
+void au_hin_ctl(struct au_hinode *hinode, int do_set);
+void au_reset_hinotify(struct inode *inode, unsigned int flags);
+
+int __init au_hinotify_init(void);
+void au_hinotify_fin(void);
+
+static inline
+void au_hin_init(struct au_hinode *hinode, struct au_hinotify *val)
+{
+	hinode->hi_notify = val;
+}
+
+static inline void au_iigen_dec(struct inode *inode)
+{
+	atomic_dec(&au_ii(inode)->ii_generation);
+}
+
+#else
+static inline
+int au_hin_alloc(struct au_hinode *hinode __maybe_unused,
+		 struct inode *inode __maybe_unused,
+		 struct inode *h_inode __maybe_unused)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void au_hin_free(struct au_hinode *hinode __maybe_unused)
+{
+	/* nothing */
+}
+
+static inline void au_hin_ctl(struct au_hinode *hinode __maybe_unused,
+			      int do_set __maybe_unused)
+{
+	/* nothing */
+}
+
+static inline void au_reset_hinotify(struct inode *inode __maybe_unused,
+				     unsigned int flags __maybe_unused)
+{
+	/* nothing */
+}
+
+static inline int au_hinotify_init(void)
+{
+	return 0;
+}
+
+#define au_hinotify_fin()	do {} while (0)
+
+static inline
+void au_hin_init(struct au_hinode *hinode __maybe_unused,
+		 struct au_hinotify *val __maybe_unused)
+{
+	/* empty */
+}
+#endif /* CONFIG_AUFS_HINOTIFY */
+
+static inline void au_hin_suspend(struct au_hinode *hdir)
+{
+	au_hin_ctl(hdir, /*do_set*/0);
+}
+
+static inline void au_hin_resume(struct au_hinode *hdir)
+{
+	au_hin_ctl(hdir, /*do_set*/1);
+}
+
+static inline void au_hin_imtx_lock(struct au_hinode *hdir)
+{
+	mutex_lock(&hdir->hi_inode->i_mutex);
+	au_hin_suspend(hdir);
+}
+
+static inline void au_hin_imtx_lock_nested(struct au_hinode *hdir,
+					   unsigned int sc __maybe_unused)
+{
+	mutex_lock_nested(&hdir->hi_inode->i_mutex, sc);
+	au_hin_suspend(hdir);
+}
+
+static inline void au_hin_imtx_unlock(struct au_hinode *hdir)
+{
+	au_hin_resume(hdir);
+	mutex_unlock(&hdir->hi_inode->i_mutex);
+}
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_INODE_H__ */
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 20/28] aufs ioctl
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (18 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 19/28] aufs inode J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 21/28] aufs sysfs entries J. R. Okajima
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
currently for maintenance only

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/ioctl.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 54 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/ioctl.c

diff --git a/fs/aufs/ioctl.c b/fs/aufs/ioctl.c
new file mode 100644
index 0000000..ff17d77
--- /dev/null
+++ b/fs/aufs/ioctl.c
@@ -0,0 +1,54 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/uaccess.h>
+#include "aufs.h"
+
+long aufs_ioctl_dir(struct file *file, unsigned int cmd,
+		    unsigned long arg __maybe_unused)
+{
+	long err;
+	struct super_block *sb;
+	struct au_sbinfo *sbinfo;
+
+	err = -EACCES;
+	if (!capable(CAP_SYS_ADMIN))
+		goto out;
+
+	err = 0;
+	sb = file->f_dentry->d_sb;
+	sbinfo = au_sbi(sb);
+	switch (cmd) {
+	case AUFS_CTL_PLINK_MAINT:
+		/*
+		 * pseudo-link maintenance mode,
+		 * cleared by aufs_release_dir()
+		 */
+		si_write_lock(sb);
+		if (!au_ftest_si(sbinfo, MAINTAIN_PLINK)) {
+			au_fset_si(sbinfo, MAINTAIN_PLINK);
+			au_fi(file)->fi_maintain_plink = 1;
+		} else
+			err = -EBUSY;
+		si_write_unlock(sb);
+		break;
+	case AUFS_CTL_PLINK_CLEAN:
+		if (au_opt_test(sbinfo->si_mntflags, PLINK)) {
+			aufs_write_lock(sb->s_root);
+			au_plink_put(sb);
+			aufs_write_unlock(sb->s_root);
+		}
+		break;
+	default:
+		err = -EINVAL;
+	}
+
+ out:
+	return err;
+}
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 21/28] aufs sysfs entries
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (19 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 20/28] aufs ioctl J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16 18:05   ` Greg KH
  2009-03-16  7:20 ` [RFC Aufs2 #2 22/28] aufs branch for loopback block device J. R. Okajima
                   ` (6 subsequent siblings)
  27 siblings, 1 reply; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
sysfs entries, compiled only when CONFIG_SYSFS is enabled

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/sysfs.c |  308 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 308 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/sysfs.c

diff --git a/fs/aufs/sysfs.c b/fs/aufs/sysfs.c
new file mode 100644
index 0000000..b03c2f7
--- /dev/null
+++ b/fs/aufs/sysfs.c
@@ -0,0 +1,308 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * sysfs interface
+ */
+
+#include <linux/fs.h>
+#include <linux/seq_file.h>
+#include <linux/sysfs.h>
+#include "aufs.h"
+
+#ifdef CONFIG_AUFS_DEBUG
+static ssize_t debug_show(struct kobject *kobj __maybe_unused,
+			  struct kobj_attribute *attr __maybe_unused,
+			  char *buf)
+{
+	return sprintf(buf, "%d\n", au_debug_test());
+}
+
+static ssize_t debug_store(struct kobject *kobj __maybe_unused,
+			   struct kobj_attribute *attr __maybe_unused,
+			   const char *buf, size_t sz)
+{
+	if (unlikely(!sz || (*buf != '0' && *buf != '1')))
+		return -EOPNOTSUPP;
+
+	if (*buf == '0')
+		au_debug(0);
+	else if (*buf == '1')
+		au_debug(1);
+	return sz;
+}
+
+static struct kobj_attribute au_debug_attr = __ATTR(debug, S_IRUGO | S_IWUSR,
+						    debug_show, debug_store);
+#endif
+
+static struct attribute *au_attr[] = {
+#ifdef CONFIG_AUFS_DEBUG
+	&au_debug_attr.attr,
+#endif
+	NULL,	/* need to NULL terminate the list of attributes */
+};
+
+static struct attribute_group sysaufs_attr_group_body = {
+	.attrs = au_attr
+};
+
+struct attribute_group *sysaufs_attr_group = &sysaufs_attr_group_body;
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * they are copied from linux/lib/kobject.c,
+ * and will be exported in the future.
+ */
+static ssize_t au_attr_show(struct kobject *kobj, struct attribute *attr,
+			    char *buf)
+{
+	struct kobj_attribute *kattr;
+	ssize_t ret = -EIO;
+
+	kattr = container_of(attr, struct kobj_attribute, attr);
+	if (kattr->show)
+		ret = kattr->show(kobj, kattr, buf);
+	return ret;
+}
+
+#ifdef CONFIG_AUFS_DEBUG
+static ssize_t au_attr_store(struct kobject *kobj, struct attribute *attr,
+			     const char *buf, size_t count)
+{
+	struct kobj_attribute *kattr;
+	ssize_t ret = -EIO;
+
+	kattr = container_of(attr, struct kobj_attribute, attr);
+	if (kattr->store)
+		ret = kattr->store(kobj, kattr, buf, count);
+	return ret;
+}
+#endif
+
+static struct sysfs_ops sysaufs_ops = {
+	.show   = au_attr_show,
+#ifdef CONFIG_AUFS_DEBUG
+	.store  = au_attr_store
+#endif
+};
+
+static struct kobj_type sysaufs_ktype_body = {
+	.sysfs_ops = &sysaufs_ops
+};
+struct kobj_type *sysaufs_ktype = &sysaufs_ktype_body;
+
+/* ---------------------------------------------------------------------- */
+
+static int sysaufs_sbi_xi(struct seq_file *seq, struct file *xf,
+			  struct kstat *st)
+{
+	int err;
+
+	err = vfs_getattr(xf->f_vfsmnt, xf->f_dentry, st);
+	if (!err) {
+		seq_printf(seq, "%llux%lu %lld",
+			   st->blocks, st->blksize, (long long)st->size);
+		seq_putc(seq, '\n');
+	} else
+		seq_printf(seq, "err %d\n", err);
+
+	return err;
+}
+
+int sysaufs_si_xino(struct seq_file *seq, struct super_block *sb)
+{
+	int err;
+	aufs_bindex_t bend, bindex;
+	struct kstat st;
+	struct au_sbinfo *sbinfo;
+	struct file *xf;
+
+	err = 0;
+	sbinfo = au_sbi(sb);
+	if (!au_opt_test(au_mntflags(sb), XINO))
+		goto out; /* success */
+
+	xf = sbinfo->si_xib;
+	err = au_xino_path(seq, xf);
+	seq_putc(seq, '\n');
+	if (!err)
+		err = sysaufs_sbi_xi(seq, xf, &st);
+
+	bend = au_sbend(sb);
+	for (bindex = 0; !err && bindex <= bend; bindex++) {
+		xf = au_sbr(sb, bindex)->br_xino.xi_file;
+		if (!xf)
+			continue;
+
+		seq_printf(seq, "%d: ", bindex);
+		err = vfs_getattr(xf->f_vfsmnt, xf->f_dentry, &st);
+		if (!err) {
+			seq_printf(seq, "%ld, %llux%lu %lld",
+				   (long)file_count(xf), st.blocks, st.blksize,
+				   (long long)st.size);
+			seq_putc(seq, '\n');
+		} else
+			seq_printf(seq, "err %d\n", err);
+	}
+
+ out:
+	return err;
+}
+
+/*
+ * the lifetime of branch is independent from the entry under sysfs.
+ * sysfs handles the lifetime of the entry, and never call ->show() after it is
+ * unlinked.
+ */
+#define SysaufsBr_PREFIX "br"
+static int sysaufs_sbi_br(struct seq_file *seq, struct super_block *sb,
+			  aufs_bindex_t bindex)
+{
+	int err;
+	struct path path;
+	struct dentry *root;
+	struct au_branch *br;
+
+	err = -ENOENT;
+	if (unlikely(au_sbend(sb) < bindex))
+		goto out;
+
+	err = 0;
+	root = sb->s_root;
+	di_read_lock_parent(root, !AuLock_IR);
+	br = au_sbr(sb, bindex);
+	path.mnt = br->br_mnt;
+	path.dentry = au_h_dptr(root, bindex);
+	au_seq_path(seq, &path);
+	di_read_unlock(root, !AuLock_IR);
+	seq_printf(seq, "=%s\n", au_optstr_br_perm(br->br_perm));
+
+ out:
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static struct seq_file *au_seq(char *p, ssize_t len)
+{
+	struct seq_file *seq;
+
+	seq = kzalloc(sizeof(*seq), GFP_NOFS);
+	if (seq) {
+		/* mutex_init(&seq.lock); */
+		seq->buf = p;
+		seq->size = len;
+		return seq; /* success */
+	}
+
+	seq = ERR_PTR(-ENOMEM);
+	return seq;
+}
+
+/* todo: file size may exceed PAGE_SIZE */
+ssize_t sysaufs_si_show(struct kobject *kobj, struct attribute *attr,
+			 char *buf)
+{
+	ssize_t err;
+	long l;
+	struct au_sbinfo *sbinfo;
+	struct super_block *sb;
+	struct seq_file *seq;
+	char *name;
+	struct attribute **cattr;
+
+	sbinfo = container_of(kobj, struct au_sbinfo, si_kobj);
+	sb = sbinfo->si_sb;
+	si_noflush_read_lock(sb);
+
+	seq = au_seq(buf, PAGE_SIZE);
+	err = PTR_ERR(seq);
+	if (IS_ERR(seq))
+		goto out;
+
+	name = (void *)attr->name;
+	cattr = sysaufs_si_attrs;
+	while (*cattr) {
+		if (!strcmp(name, (*cattr)->name)) {
+			err = container_of(*cattr, struct sysaufs_si_attr, attr)
+				->show(seq, sb);
+			goto out_seq;
+		}
+		cattr++;
+	}
+
+	if (!strncmp(name, SysaufsBr_PREFIX, sizeof(SysaufsBr_PREFIX) - 1)) {
+		name += sizeof(SysaufsBr_PREFIX) - 1;
+		err = strict_strtol(name, 10, &l);
+		if (!err)
+			err = sysaufs_sbi_br(seq, sb, (aufs_bindex_t)l);
+		goto out_seq;
+	}
+	BUG();
+
+ out_seq:
+	if (!err) {
+		err = seq->count;
+		/* sysfs limit */
+		if (unlikely(err == PAGE_SIZE))
+			err = -EFBIG;
+	}
+	kfree(seq);
+ out:
+	si_read_unlock(sb);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+void sysaufs_br_init(struct au_branch *br)
+{
+	br->br_attr.name = br->br_name;
+	br->br_attr.mode = S_IRUGO;
+	br->br_attr.owner = THIS_MODULE;
+}
+
+void sysaufs_brs_del(struct super_block *sb, aufs_bindex_t bindex)
+{
+	struct au_sbinfo *sbinfo;
+	aufs_bindex_t bend;
+
+	if (!sysaufs_brs)
+		return;
+
+	sbinfo = au_sbi(sb);
+	bend = au_sbend(sb);
+	for (; bindex <= bend; bindex++)
+		sysfs_remove_file(&sbinfo->si_kobj,
+				  &au_sbr(sb, bindex)->br_attr);
+}
+
+void sysaufs_brs_add(struct super_block *sb, aufs_bindex_t bindex)
+{
+	int err;
+	aufs_bindex_t bend;
+	struct kobject *kobj;
+	struct au_branch *br;
+
+	if (!sysaufs_brs)
+		return;
+
+	kobj = &au_sbi(sb)->si_kobj;
+	bend = au_sbend(sb);
+	for (; bindex <= bend; bindex++) {
+		br = au_sbr(sb, bindex);
+		snprintf(br->br_name, sizeof(br->br_name),
+			 SysaufsBr_PREFIX "%d", bindex);
+		err = sysfs_create_file(kobj, &br->br_attr);
+		if (unlikely(err))
+			AuWarn("failed %s under sysfs(%d)\n", br->br_name, err);
+	}
+}
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 22/28] aufs branch for loopback block device
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (20 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 21/28] aufs sysfs entries J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 23/28] aufs internal inotify J. R. Okajima
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
support and test loopback block device as a branch

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/loop.c |   46 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/loop.h |   41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/loop.c
 create mode 100644 fs/aufs/loop.h

diff --git a/fs/aufs/loop.c b/fs/aufs/loop.c
new file mode 100644
index 0000000..84a940f
--- /dev/null
+++ b/fs/aufs/loop.c
@@ -0,0 +1,46 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * support for loopback block device as a branch
+ */
+
+#include <linux/loop.h>
+#include "aufs.h"
+
+/*
+ * test if two lower dentries have overlapping branches.
+ */
+int au_test_loopback_overlap(struct super_block *sb, struct dentry *h_d1,
+			     struct dentry *h_d2)
+{
+	struct inode *h_inode;
+	struct loop_device *l;
+
+	h_inode = h_d1->d_inode;
+	if (MAJOR(h_inode->i_sb->s_dev) != LOOP_MAJOR)
+		return 0;
+
+	l = h_inode->i_sb->s_bdev->bd_disk->private_data;
+	h_d1 = l->lo_backing_file->f_dentry;
+	/* h_d1 can be local NFS. in this case aufs cannot detect the loop */
+	if (unlikely(h_d1->d_sb == sb))
+		return 1;
+	return !!au_test_subdir(h_d1, h_d2);
+}
+
+/* true if a kernel thread named 'loop[0-9].*' accesses a file */
+int au_test_loopback_kthread(void)
+{
+	const char c = current->comm[4];
+
+	return current->mm == NULL
+	       && '0' <= c && c <= '9'
+	       && strncmp(current->comm, "loop", 4) == 0;
+}
diff --git a/fs/aufs/loop.h b/fs/aufs/loop.h
new file mode 100644
index 0000000..1c0ee40
--- /dev/null
+++ b/fs/aufs/loop.h
@@ -0,0 +1,41 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * support for loopback mount as a branch
+ */
+
+#ifndef __AUFS_LOOP_H__
+#define __AUFS_LOOP_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+
+#ifdef CONFIG_AUFS_BDEV_LOOP
+/* loop.c */
+int au_test_loopback_overlap(struct super_block *sb, struct dentry *h_d1,
+			     struct dentry *h_d2);
+int au_test_loopback_kthread(void);
+#else
+static inline
+int au_test_loopback_overlap(struct super_block *sb, struct dentry *h_d1,
+			     struct dentry *h_d2)
+{
+	return 0;
+}
+
+static inline int au_test_loopback_kthread(void)
+{
+	return 0;
+}
+#endif /* BLK_DEV_LOOP */
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_LOOP_H__ */
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 23/28] aufs internal inotify
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (21 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 22/28] aufs branch for loopback block device J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 24/28] aufs test for fstype J. R. Okajima
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
detect bypassing aufs by inotify-watch

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/hinotify.c |  746 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 746 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/hinotify.c

diff --git a/fs/aufs/hinotify.c b/fs/aufs/hinotify.c
new file mode 100644
index 0000000..af8ebdb
--- /dev/null
+++ b/fs/aufs/hinotify.c
@@ -0,0 +1,746 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * inotify for the lower directories
+ */
+
+#include "aufs.h"
+
+static const __u32 AuHinMask = (IN_MOVE | IN_DELETE | IN_CREATE);
+static struct inotify_handle *au_hin_handle;
+
+AuCacheFuncs(hinotify, HINOTIFY);
+
+int au_hin_alloc(struct au_hinode *hinode, struct inode *inode,
+		 struct inode *h_inode)
+{
+	int err;
+	struct au_hinotify *hin;
+	s32 wd;
+
+	err = -ENOMEM;
+	hin = au_cache_alloc_hinotify();
+	if (hin) {
+		AuDebugOn(hinode->hi_notify);
+		hinode->hi_notify = hin;
+		hin->hin_aufs_inode = inode;
+
+		inotify_init_watch(&hin->hin_watch);
+		wd = inotify_add_watch(au_hin_handle, &hin->hin_watch, h_inode,
+				       AuHinMask);
+		if (wd >= 0)
+			return 0; /* success */
+
+		err = wd;
+		put_inotify_watch(&hin->hin_watch);
+		au_cache_free_hinotify(hin);
+		hinode->hi_notify = NULL;
+	}
+
+	return err;
+}
+
+void au_hin_free(struct au_hinode *hinode)
+{
+	int err;
+	struct au_hinotify *hin;
+
+	hin = hinode->hi_notify;
+	if (hin) {
+		err = 0;
+		if (atomic_read(&hin->hin_watch.count))
+			err = inotify_rm_watch(au_hin_handle, &hin->hin_watch);
+		if (unlikely(err))
+			/* it means the watch is already removed */
+			AuWarn("failed inotify_rm_watch() %d\n", err);
+		au_cache_free_hinotify(hin);
+		hinode->hi_notify = NULL;
+	}
+}
+
+/* ---------------------------------------------------------------------- */
+
+void au_hin_ctl(struct au_hinode *hinode, int do_set)
+{
+	struct inode *h_inode;
+	struct inotify_watch *watch;
+
+	if (!hinode->hi_notify)
+		return;
+
+	h_inode = hinode->hi_inode;
+	IMustLock(h_inode);
+
+	/* todo: try inotify_find_update_watch()? */
+	watch = &hinode->hi_notify->hin_watch;
+	mutex_lock(&h_inode->inotify_mutex);
+	/* mutex_lock(&watch->ih->mutex); */
+	if (do_set) {
+		AuDebugOn(watch->mask & AuHinMask);
+		watch->mask |= AuHinMask;
+	} else {
+		AuDebugOn(!(watch->mask & AuHinMask));
+		watch->mask &= ~AuHinMask;
+	}
+	/* mutex_unlock(&watch->ih->mutex); */
+	mutex_unlock(&h_inode->inotify_mutex);
+}
+
+void au_reset_hinotify(struct inode *inode, unsigned int flags)
+{
+	aufs_bindex_t bindex, bend;
+	struct inode *hi;
+	struct dentry *iwhdentry;
+
+	bend = au_ibend(inode);
+	for (bindex = au_ibstart(inode); bindex <= bend; bindex++) {
+		hi = au_h_iptr(inode, bindex);
+		if (!hi)
+			continue;
+
+		/* mutex_lock_nested(&hi->i_mutex, AuLsc_I_CHILD); */
+		iwhdentry = au_hi_wh(inode, bindex);
+		if (iwhdentry)
+			dget(iwhdentry);
+		au_igrab(hi);
+		au_set_h_iptr(inode, bindex, NULL, 0);
+		au_set_h_iptr(inode, bindex, au_igrab(hi),
+			      flags & ~AuHi_XINO);
+		iput(hi);
+		dput(iwhdentry);
+		/* mutex_unlock(&hi->i_mutex); */
+	}
+}
+
+/* ---------------------------------------------------------------------- */
+
+static int hin_xino(struct inode *inode, struct inode *h_inode)
+{
+	int err;
+	aufs_bindex_t bindex, bend, bfound, bstart;
+	struct inode *h_i;
+
+	err = 0;
+	if (unlikely(inode->i_ino == AUFS_ROOT_INO)) {
+		AuWarn("branch root dir was changed\n");
+		goto out;
+	}
+
+	bfound = -1;
+	bend = au_ibend(inode);
+	bstart = au_ibstart(inode);
+#if 0 /* reserved for future use */
+	if (bindex == bend) {
+		/* keep this ino in rename case */
+		goto out;
+	}
+#endif
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		if (au_h_iptr(inode, bindex) == h_inode) {
+			bfound = bindex;
+			break;
+		}
+	}
+	if (bfound < 0)
+		goto out;
+
+	for (bindex = bstart; bindex <= bend; bindex++) {
+		h_i = au_h_iptr(inode, bindex);
+		if (!h_i)
+			continue;
+
+		err = au_xino_write0(inode->i_sb, bindex, h_i->i_ino, 0);
+		/* ignore this error */
+		/* bad action? */
+	}
+
+	/* children inode number will be broken */
+
+ out:
+	AuTraceErr(err);
+	return err;
+}
+
+static int hin_gen_tree(struct dentry *dentry)
+{
+	int err, i, j, ndentry;
+	struct au_dcsub_pages dpages;
+	struct au_dpage *dpage;
+	struct dentry **dentries;
+
+	err = au_dpages_init(&dpages, GFP_NOFS);
+	if (unlikely(err))
+		goto out;
+	err = au_dcsub_pages(&dpages, dentry, NULL, NULL);
+	if (unlikely(err))
+		goto out_dpages;
+
+	for (i = 0; i < dpages.ndpage; i++) {
+		dpage = dpages.dpages + i;
+		dentries = dpage->dentries;
+		ndentry = dpage->ndentry;
+		for (j = 0; j < ndentry; j++) {
+			struct dentry *d;
+
+			d = dentries[j];
+			if (IS_ROOT(d))
+				continue;
+
+			d_drop(d);
+			au_digen_dec(d);
+			if (d->d_inode)
+				/* todo: reset children xino?
+				   cached children only? */
+				au_iigen_dec(d->d_inode);
+		}
+	}
+
+ out_dpages:
+	au_dpages_free(&dpages);
+
+	/* discard children */
+	dentry_unhash(dentry);
+	dput(dentry);
+ out:
+	return err;
+}
+
+/*
+ * return 0 if processed.
+ */
+static int hin_gen_by_inode(char *name, unsigned int nlen, struct inode *inode,
+			    const unsigned int isdir)
+{
+	int err;
+	struct dentry *d;
+	struct qstr *dname;
+
+	err = 1;
+	if (unlikely(inode->i_ino == AUFS_ROOT_INO)) {
+		AuWarn("branch root dir was changed\n");
+		err = 0;
+		goto out;
+	}
+
+	if (!isdir) {
+		AuDebugOn(!name);
+		au_iigen_dec(inode);
+		spin_lock(&dcache_lock);
+		list_for_each_entry(d, &inode->i_dentry, d_alias) {
+			dname = &d->d_name;
+			if (dname->len != nlen
+			    && memcmp(dname->name, name, nlen))
+				continue;
+			err = 0;
+			spin_lock(&d->d_lock);
+			__d_drop(d);
+			au_digen_dec(d);
+			spin_unlock(&d->d_lock);
+			break;
+		}
+		spin_unlock(&dcache_lock);
+	} else {
+		au_fset_si(au_sbi(inode->i_sb), FAILED_REFRESH_DIRS);
+		d = d_find_alias(inode);
+		if (!d) {
+			au_iigen_dec(inode);
+			goto out;
+		}
+
+		dname = &d->d_name;
+		if (dname->len == nlen && !memcmp(dname->name, name, nlen))
+			err = hin_gen_tree(d);
+		dput(d);
+	}
+
+ out:
+	AuTraceErr(err);
+	return err;
+}
+
+static int hin_gen_by_name(struct dentry *dentry, const unsigned int isdir)
+{
+	int err;
+	struct inode *inode;
+
+	inode = dentry->d_inode;
+	if (IS_ROOT(dentry)
+	    /* || (inode && inode->i_ino == AUFS_ROOT_INO) */
+		) {
+		AuWarn("branch root dir was changed\n");
+		return 0;
+	}
+
+	err = 0;
+	if (!isdir) {
+		d_drop(dentry);
+		au_digen_dec(dentry);
+		if (inode)
+			au_iigen_dec(inode);
+	} else {
+		au_fset_si(au_sbi(dentry->d_sb), FAILED_REFRESH_DIRS);
+		if (inode)
+			err = hin_gen_tree(dentry);
+	}
+
+	AuTraceErr(err);
+	return err;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* hinotify job flags */
+#define AuHinJob_XINO0		1
+#define AuHinJob_GEN		(1 << 1)
+#define AuHinJob_DIRENT		(1 << 2)
+#define AuHinJob_ISDIR		(1 << 3)
+#define AuHinJob_TRYXINO0	(1 << 4)
+#define AuHinJob_MNTPNT		(1 << 5)
+#define au_ftest_hinjob(flags, name)	((flags) & AuHinJob_##name)
+#define au_fset_hinjob(flags, name)	{ (flags) |= AuHinJob_##name; }
+#define au_fclr_hinjob(flags, name)	{ (flags) &= ~AuHinJob_##name; }
+
+struct hin_job_args {
+	unsigned int flags;
+	struct inode *inode, *h_inode, *dir, *h_dir;
+	struct dentry *dentry;
+	char *h_name;
+	int h_nlen;
+};
+
+static int hin_job(struct hin_job_args *a)
+{
+	const unsigned int isdir = au_ftest_hinjob(a->flags, ISDIR);
+
+	/* reset xino */
+	if (au_ftest_hinjob(a->flags, XINO0) && a->inode)
+		hin_xino(a->inode, a->h_inode); /* ignore this error */
+
+	if (au_ftest_hinjob(a->flags, TRYXINO0)
+	    && a->inode
+	    && a->h_inode) {
+		mutex_lock_nested(&a->h_inode->i_mutex, AuLsc_I_CHILD);
+		if (!a->h_inode->i_nlink)
+			hin_xino(a->inode, a->h_inode); /* ignore this error */
+		mutex_unlock(&a->h_inode->i_mutex);
+	}
+
+	/* make the generation obsolete */
+	if (au_ftest_hinjob(a->flags, GEN)) {
+		int err = -1;
+		if (a->inode)
+			err = hin_gen_by_inode(a->h_name, a->h_nlen, a->inode,
+					       isdir);
+		if (err && a->dentry)
+			hin_gen_by_name(a->dentry, isdir);
+		/* ignore this error */
+	}
+
+	/* make dir entries obsolete */
+	if (au_ftest_hinjob(a->flags, DIRENT) && a->inode) {
+		struct au_vdir *vdir;
+
+		vdir = au_ivdir(a->inode);
+		if (vdir)
+			vdir->vd_jiffy = 0;
+		/* IMustLock(a->inode); */
+		/* a->inode->i_version++; */
+	}
+
+	/* can do nothing but warn */
+	if (au_ftest_hinjob(a->flags, MNTPNT)
+	    && a->dentry
+	    && d_mountpoint(a->dentry))
+		AuWarn("mount-point %.*s is removed or renamed\n",
+		       AuDLNPair(a->dentry));
+
+	return 0;
+}
+
+/* ---------------------------------------------------------------------- */
+
+static char *in_name(u32 mask)
+{
+#ifdef CONFIG_AUFS_DEBUG
+#define test_ret(flag)	if (mask & flag) \
+				return #flag;
+	test_ret(IN_ACCESS);
+	test_ret(IN_MODIFY);
+	test_ret(IN_ATTRIB);
+	test_ret(IN_CLOSE_WRITE);
+	test_ret(IN_CLOSE_NOWRITE);
+	test_ret(IN_OPEN);
+	test_ret(IN_MOVED_FROM);
+	test_ret(IN_MOVED_TO);
+	test_ret(IN_CREATE);
+	test_ret(IN_DELETE);
+	test_ret(IN_DELETE_SELF);
+	test_ret(IN_MOVE_SELF);
+	test_ret(IN_UNMOUNT);
+	test_ret(IN_Q_OVERFLOW);
+	test_ret(IN_IGNORED);
+	return "";
+#undef test_ret
+#else
+	return "??";
+#endif
+}
+
+static struct dentry *lookup_wlock_by_name(char *name, unsigned int nlen,
+					   struct inode *dir)
+{
+	struct dentry *dentry, *d, *parent;
+	struct qstr *dname;
+
+	parent = d_find_alias(dir);
+	if (!parent)
+		return NULL;
+
+	dentry = NULL;
+	spin_lock(&dcache_lock);
+	list_for_each_entry(d, &parent->d_subdirs, d_u.d_child) {
+		/* AuDbg("%.*s\n", AuDLNPair(d)); */
+		dname = &d->d_name;
+		if (dname->len != nlen || memcmp(dname->name, name, nlen))
+			continue;
+		if (!atomic_read(&d->d_count) || !d->d_fsdata) {
+			spin_lock(&d->d_lock);
+			__d_drop(d);
+			spin_unlock(&d->d_lock);
+			continue;
+		}
+
+		dentry = dget(d);
+		break;
+	}
+	spin_unlock(&dcache_lock);
+	dput(parent);
+
+	if (dentry)
+		di_write_lock_child(dentry);
+
+	return dentry;
+}
+
+static struct inode *lookup_wlock_by_ino(struct super_block *sb,
+					 aufs_bindex_t bindex, ino_t h_ino)
+{
+	struct inode *inode;
+	ino_t ino;
+	int err;
+
+	inode = NULL;
+	err = au_xino_read(sb, bindex, h_ino, &ino);
+	if (!err && ino)
+		inode = ilookup(sb, ino);
+	if (!inode)
+		goto out;
+
+	if (unlikely(inode->i_ino == AUFS_ROOT_INO)) {
+		AuWarn("wrong root branch\n");
+		iput(inode);
+		inode = NULL;
+		goto out;
+	}
+
+	ii_write_lock_child(inode);
+
+ out:
+	return inode;
+}
+
+enum { CHILD, PARENT };
+struct postproc_args {
+	struct inode *h_dir, *dir, *h_child_inode;
+	u32 mask;
+	unsigned int flags[2];
+	unsigned int h_child_nlen;
+	char h_child_name[];
+};
+
+static void postproc(void *_args)
+{
+	struct postproc_args *a = _args;
+	struct super_block *sb;
+	aufs_bindex_t bindex, bend, bfound;
+	unsigned char xino, try_iput;
+	int err;
+	struct inode *inode;
+	ino_t h_ino;
+	struct hin_job_args args;
+	struct dentry *dentry;
+	struct au_sbinfo *sbinfo;
+
+	AuDebugOn(!_args);
+	AuDebugOn(!a->h_dir);
+	AuDebugOn(!a->dir);
+	AuDebugOn(!a->mask);
+	AuDbg("mask 0x%x %s, i%lu, hi%lu, hci%lu\n",
+	      a->mask, in_name(a->mask), a->dir->i_ino, a->h_dir->i_ino,
+	      a->h_child_inode ? a->h_child_inode->i_ino : 0);
+
+	inode = NULL;
+	dentry = NULL;
+	/*
+	 * do not lock a->dir->i_mutex here
+	 * because of d_revalidate() may cause a deadlock.
+	 */
+	sb = a->dir->i_sb;
+	AuDebugOn(!sb);
+	sbinfo = au_sbi(sb);
+	AuDebugOn(!sbinfo);
+	/* big aufs lock */
+	si_noflush_write_lock(sb);
+
+	ii_read_lock_parent(a->dir);
+	bfound = -1;
+	bend = au_ibend(a->dir);
+	for (bindex = au_ibstart(a->dir); bindex <= bend; bindex++)
+		if (au_h_iptr(a->dir, bindex) == a->h_dir) {
+			bfound = bindex;
+			break;
+		}
+	ii_read_unlock(a->dir);
+	if (unlikely(bfound < 0))
+		goto out;
+
+	xino = !!au_opt_test(au_mntflags(sb), XINO);
+	h_ino = 0;
+	if (a->h_child_inode)
+		h_ino = a->h_child_inode->i_ino;
+
+	if (a->h_child_nlen
+	    && (au_ftest_hinjob(a->flags[CHILD], GEN)
+		|| au_ftest_hinjob(a->flags[CHILD], MNTPNT)))
+		dentry = lookup_wlock_by_name(a->h_child_name, a->h_child_nlen,
+					      a->dir);
+	try_iput = 0;
+	if (dentry)
+		inode = dentry->d_inode;
+	if (xino && !inode && h_ino
+	    && (au_ftest_hinjob(a->flags[CHILD], XINO0)
+		|| au_ftest_hinjob(a->flags[CHILD], TRYXINO0)
+		|| au_ftest_hinjob(a->flags[CHILD], GEN))) {
+		inode = lookup_wlock_by_ino(sb, bfound, h_ino);
+		try_iput = 1;
+	    }
+
+	args.flags = a->flags[CHILD];
+	args.dentry = dentry;
+	args.inode = inode;
+	args.h_inode = a->h_child_inode;
+	args.dir = a->dir;
+	args.h_dir = a->h_dir;
+	args.h_name = a->h_child_name;
+	args.h_nlen = a->h_child_nlen;
+	err = hin_job(&args);
+	if (dentry) {
+		if (dentry->d_fsdata)
+			di_write_unlock(dentry);
+		dput(dentry);
+	}
+	if (inode && try_iput) {
+		ii_write_unlock(inode);
+		iput(inode);
+	}
+
+	ii_write_lock_parent(a->dir);
+	args.flags = a->flags[PARENT];
+	args.dentry = NULL;
+	args.inode = a->dir;
+	args.h_inode = a->h_dir;
+	args.dir = NULL;
+	args.h_dir = NULL;
+	args.h_name = NULL;
+	args.h_nlen = 0;
+	err = hin_job(&args);
+	ii_write_unlock(a->dir);
+
+ out:
+	au_nwt_done(&sbinfo->si_nowait);
+	si_write_unlock(sb);
+
+	iput(a->h_child_inode);
+	iput(a->h_dir);
+	iput(a->dir);
+	kfree(a);
+}
+
+/* ---------------------------------------------------------------------- */
+
+static void aufs_inotify(struct inotify_watch *watch, u32 wd __maybe_unused,
+			 u32 mask, u32 cookie __maybe_unused,
+			 const char *h_child_name, struct inode *h_child_inode)
+{
+	struct au_hinotify *hinotify;
+	struct postproc_args *args;
+	int len, wkq_err;
+	unsigned char isdir, isroot, wh;
+	char *p;
+	struct inode *dir;
+	unsigned int flags[2];
+
+	/* if IN_UNMOUNT happens, there must be another bug */
+	AuDebugOn(mask & IN_UNMOUNT);
+	if (mask & (IN_IGNORED | IN_UNMOUNT)) {
+		put_inotify_watch(watch);
+		return;
+	}
+#ifdef AuDbgHinotify
+	au_debug(1);
+	if (1 || !h_child_name || strcmp(h_child_name, AUFS_XINO_FNAME)) {
+		AuDbg("i%lu, wd %d, mask 0x%x %s, cookie 0x%x, hcname %s,"
+		      " hi%lu\n",
+		      watch->inode->i_ino, wd, mask, in_name(mask), cookie,
+		      h_child_name ? h_child_name : "",
+		      h_child_inode ? h_child_inode->i_ino : 0);
+		WARN_ON(1);
+	}
+	au_debug(0);
+#endif
+
+	hinotify = container_of(watch, struct au_hinotify, hin_watch);
+	AuDebugOn(!hinotify || !hinotify->hin_aufs_inode);
+	dir = igrab(hinotify->hin_aufs_inode);
+	if (!dir)
+		return;
+
+	isroot = (dir->i_ino == AUFS_ROOT_INO);
+	len = 0;
+	wh = 0;
+	if (h_child_name) {
+		len = strlen(h_child_name);
+		if (!memcmp(h_child_name, AUFS_WH_PFX, AUFS_WH_PFX_LEN)) {
+			h_child_name += AUFS_WH_PFX_LEN;
+			len -= AUFS_WH_PFX_LEN;
+			wh = 1;
+		}
+	}
+
+	isdir = 0;
+	if (h_child_inode)
+		isdir = !!S_ISDIR(h_child_inode->i_mode);
+	flags[PARENT] = AuHinJob_ISDIR;
+	flags[CHILD] = 0;
+	if (isdir)
+		flags[CHILD] = AuHinJob_ISDIR;
+	switch (mask & IN_ALL_EVENTS) {
+	case IN_MOVED_FROM:
+	case IN_MOVED_TO:
+		AuDebugOn(!h_child_name || !h_child_inode);
+		au_fset_hinjob(flags[CHILD], GEN);
+		au_fset_hinjob(flags[CHILD], XINO0);
+		au_fset_hinjob(flags[CHILD], MNTPNT);
+		au_fset_hinjob(flags[PARENT], DIRENT);
+		break;
+
+	case IN_CREATE:
+		AuDebugOn(!h_child_name || !h_child_inode);
+		au_fset_hinjob(flags[PARENT], DIRENT);
+		au_fset_hinjob(flags[CHILD], GEN);
+		break;
+
+	case IN_DELETE:
+		/*
+		 * aufs never be able to get this child inode.
+		 * revalidation should be in d_revalidate()
+		 * by checking i_nlink, i_generation or d_unhashed().
+		 */
+		AuDebugOn(!h_child_name);
+		au_fset_hinjob(flags[PARENT], DIRENT);
+		au_fset_hinjob(flags[CHILD], GEN);
+		au_fset_hinjob(flags[CHILD], TRYXINO0);
+		au_fset_hinjob(flags[CHILD], MNTPNT);
+		break;
+
+	default:
+		AuDebugOn(1);
+	}
+
+	if (wh)
+		h_child_inode = NULL;
+
+	/* iput() and kfree() will be called in postproc() */
+	/*
+	 * inotify_mutex is already acquired and kmalloc/prune_icache may lock
+	 * iprune_mutex. strange.
+	 */
+	lockdep_off();
+	args = kmalloc(sizeof(*args) + len + 1, GFP_NOFS);
+	lockdep_on();
+	if (unlikely(!args)) {
+		AuErr1("no memory\n");
+		iput(dir);
+		return;
+	}
+	args->flags[PARENT] = flags[PARENT];
+	args->flags[CHILD] = flags[CHILD];
+	args->mask = mask;
+	args->dir = dir;
+	args->h_dir = igrab(watch->inode);
+	if (h_child_inode)
+		h_child_inode = igrab(h_child_inode); /* can be NULL */
+	args->h_child_inode = h_child_inode;
+	args->h_child_nlen = len;
+	if (len) {
+		p = (void *)args;
+		p += sizeof(*args);
+		memcpy(p, h_child_name, len + 1);
+	}
+
+	lockdep_off();
+	wkq_err = au_wkq_nowait(postproc, args, dir->i_sb);
+	lockdep_on();
+	if (unlikely(wkq_err))
+		AuErr("wkq %d\n", wkq_err);
+}
+
+static void aufs_inotify_destroy(struct inotify_watch *watch __maybe_unused)
+{
+	return;
+}
+
+static struct inotify_operations aufs_inotify_ops = {
+	.handle_event	= aufs_inotify,
+	.destroy_watch	= aufs_inotify_destroy
+};
+
+/* ---------------------------------------------------------------------- */
+
+static void au_hin_destroy_cache(void)
+{
+	kmem_cache_destroy(au_cachep[AuCache_HINOTIFY]);
+	au_cachep[AuCache_HINOTIFY] = NULL;
+}
+
+int __init au_hinotify_init(void)
+{
+	int err;
+
+	err = -ENOMEM;
+	au_cachep[AuCache_HINOTIFY] = AuCache(au_hinotify);
+	if (au_cachep[AuCache_HINOTIFY]) {
+		err = 0;
+		au_hin_handle = inotify_init(&aufs_inotify_ops);
+		if (IS_ERR(au_hin_handle)) {
+			err = PTR_ERR(au_hin_handle);
+			au_hin_destroy_cache();
+		}
+	}
+	AuTraceErr(err);
+	return err;
+}
+
+void au_hinotify_fin(void)
+{
+	inotify_destroy(au_hin_handle);
+	if (au_cachep[AuCache_HINOTIFY])
+		au_hin_destroy_cache();
+}
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 24/28] aufs test for fstype
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (22 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 23/28] aufs internal inotify J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 25/28] aufs debug J. R. Okajima
                   ` (3 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
judge the type of filesystem of branch

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/fstype.h |  434 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/magic.mk |   52 +++++++
 2 files changed, 486 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/fstype.h
 create mode 100644 fs/aufs/magic.mk

diff --git a/fs/aufs/fstype.h b/fs/aufs/fstype.h
new file mode 100644
index 0000000..eac06aa
--- /dev/null
+++ b/fs/aufs/fstype.h
@@ -0,0 +1,434 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * judging filesystem type
+ */
+
+#ifndef __AUFS_FSTYPE_H__
+#define __AUFS_FSTYPE_H__
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+#include <linux/cramfs_fs.h>
+#include <linux/magic.h>
+#include <linux/romfs_fs.h>
+#include <linux/aufs_type.h>
+
+static inline int au_test_aufs(struct super_block *sb)
+{
+	return sb->s_magic == AUFS_SUPER_MAGIC;
+}
+
+static inline const char *au_sbtype(struct super_block *sb)
+{
+	return sb->s_type->name;
+}
+
+static inline int au_test_iso9660(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_ROMFS_FS) || defined(CONFIG_ROMFS_FS_MODULE)
+	return sb->s_magic == ROMFS_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_romfs(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_ISO9660_FS) || defined(CONFIG_ISO9660_FS_MODULE)
+	return sb->s_magic == ISOFS_SUPER_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_cramfs(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_CRAMFS) || defined(CONFIG_CRAMFS_MODULE)
+	return sb->s_magic == CRAMFS_MAGIC;
+#endif
+	return 0;
+}
+
+static inline int au_test_nfs(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_NFS_FS) || defined(CONFIG_NFS_FS_MODULE)
+	return sb->s_magic == NFS_SUPER_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_fuse(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_FUSE_FS) || defined(CONFIG_FUSE_FS_MODULE)
+	return sb->s_magic == FUSE_SUPER_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_xfs(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_XFS_FS) || defined(CONFIG_XFS_FS_MODULE)
+	return sb->s_magic == XFS_SB_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_tmpfs(struct super_block *sb __maybe_unused)
+{
+#ifdef CONFIG_TMPFS
+	return sb->s_magic == TMPFS_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_ecryptfs(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_ECRYPT_FS) || defined(CONFIG_ECRYPT_FS_MODULE)
+	return !strcmp(au_sbtype(sb), "ecryptfs");
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_smbfs(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_SMB_FS) || defined(CONFIG_SMB_FS_MODULE)
+	return sb->s_magic == SMB_SUPER_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_ocfs2(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_OCFS2_FS) || defined(CONFIG_OCFS2_FS_MODULE)
+	return sb->s_magic == OCFS2_SUPER_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_ocfs2_dlmfs(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_OCFS2_FS_O2CB) || defined(CONFIG_OCFS2_FS_O2CB_MODULE)
+	return sb->s_magic == DLMFS_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_coda(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_CODA_FS) || defined(CONFIG_CODA_FS_MODULE)
+	return sb->s_magic == CODA_SUPER_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_v9fs(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_9P_FS) || defined(CONFIG_9P_FS_MODULE)
+	return sb->s_magic == V9FS_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_ext4(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_EXT4DEV_FS) || defined(CONFIG_EXT4DEV_FS_MODULE)
+	return sb->s_magic == EXT4_SUPER_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_sysv(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_SYSV_FS) || defined(CONFIG_SYSV_FS_MODULE)
+	return !strcmp(au_sbtype(sb), "sysv");
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_ramfs(struct super_block *sb)
+{
+	return sb->s_magic == RAMFS_MAGIC;
+}
+
+static inline int au_test_ubifs(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_UBIFS_FS) || defined(CONFIG_UBIFS_FS_MODULE)
+	return sb->s_magic == UBIFS_SUPER_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_procfs(struct super_block *sb __maybe_unused)
+{
+#ifdef CONFIG_PROC_FS
+	return sb->s_magic == PROC_SUPER_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_sysfs(struct super_block *sb __maybe_unused)
+{
+#ifdef CONFIG_SYSFS
+	return sb->s_magic == SYSFS_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_configfs(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_CONFIGFS_FS) || defined(CONFIG_CONFIGFS_FS_MODULE)
+	return sb->s_magic == CONFIGFS_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_minix(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_MINIX_FS) || defined(CONFIG_MINIX_FS_MODULE)
+	return sb->s_magic == MINIX3_SUPER_MAGIC
+		|| sb->s_magic == MINIX2_SUPER_MAGIC
+		|| sb->s_magic == MINIX2_SUPER_MAGIC2
+		|| sb->s_magic == MINIX_SUPER_MAGIC
+		|| sb->s_magic == MINIX_SUPER_MAGIC2;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_cifs(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_CIFS_FS) || defined(CONFIGCIFS_FS_MODULE)
+	return sb->s_magic == CIFS_MAGIC_NUMBER;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_fat(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_FAT_FS) || defined(CONFIG_FAT_FS_MODULE)
+	return sb->s_magic == MSDOS_SUPER_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_msdos(struct super_block *sb)
+{
+	return au_test_fat(sb);
+}
+
+static inline int au_test_vfat(struct super_block *sb)
+{
+	return au_test_fat(sb);
+}
+
+static inline int au_test_securityfs(struct super_block *sb __maybe_unused)
+{
+#ifdef CONFIG_SECURITYFS
+	return sb->s_magic == SECURITYFS_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_squashfs(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_SQUASHFS) || defined(CONFIG_SQUASHFS_MODULE)
+	return sb->s_magic == SQUASHFS_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_btrfs(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_BTRFS_FS) || defined(CONFIG_BTRFS_FS_MODULE)
+	return sb->s_magic == BTRFS_SUPER_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+static inline int au_test_xenfs(struct super_block *sb __maybe_unused)
+{
+#if defined(CONFIG_XENFS) || defined(CONFIG_XENFS_MODULE)
+	return sb->s_magic == XENFS_SUPER_MAGIC;
+#else
+	return 0;
+#endif
+}
+
+/* ---------------------------------------------------------------------- */
+/*
+ * they can't be an aufs branch.
+ */
+static inline int au_test_fs_unsuppoted(struct super_block *sb)
+{
+	return au_test_procfs(sb)
+		|| au_test_sysfs(sb)
+		|| au_test_configfs(sb)
+		|| au_test_securityfs(sb)
+		|| au_test_xenfs(sb)
+		|| au_test_ramfs(sb)
+		/* || !strcmp(au_sbtype(sb), "unionfs") */
+		|| au_test_aufs(sb); /* will be supported in next version */
+}
+
+/*
+ * If the filesystem supports NFS-export, then it has to support NULL as
+ * a nameidata parameter for ->create(), ->lookup() and ->d_revalidate().
+ * We can apply this principle when we handle a lower filesystem.
+ */
+static inline int au_test_fs_null_nd(struct super_block *sb)
+{
+	return !!sb->s_export_op;
+}
+
+static inline int au_test_fs_remote(struct super_block *sb)
+{
+	return !au_test_tmpfs(sb)
+		&& !(sb->s_type->fs_flags & FS_REQUIRES_DEV);
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * Note: these functions (below) are created after reading ->getattr() in all
+ * filesystems under linux/fs. it means we have to do so in every update...
+ */
+
+/*
+ * some filesystems require getattr to refresh the inode attributes before
+ * referencing.
+ * in most cases, we can rely on the inode attribute in NFS (or every remote fs)
+ * and leave the work for d_revalidate()
+ */
+static inline int au_test_fs_refresh_iattr(struct super_block *sb)
+{
+	return au_test_nfs(sb)
+		|| au_test_fuse(sb)
+		/* || au_test_smbfs(sb) */	/* untested */
+		/* || au_test_ocfs2(sb) */	/* untested */
+		/* || au_test_btrfs(sb) */	/* untested */
+		/* || au_test_coda(sb) */	/* untested */
+		/* || au_test_v9fs(sb) */	/* untested */
+		;
+}
+
+/*
+ * filesystems which don't maintain i_size or i_blocks.
+ */
+static inline int au_test_fs_bad_iattr_size(struct super_block *sb)
+{
+	return au_test_xfs(sb)
+		/* || au_test_ext4(sb) */	/* untested */
+		/* || au_test_ocfs2(sb) */	/* untested */
+		/* || au_test_ocfs2_dlmfs(sb) */ /* untested */
+		/* || au_test_sysv(sb) */	/* untested */
+		/* || au_test_ramfs(sb)*/	/* unsupported */
+		/* || au_test_ubifs(sb) */	/* untested */
+		/* || au_test_minix(sb) */	/* untested */
+		;
+}
+
+/*
+ * filesystems which don't store the correct value in some of their inode
+ * attributes.
+ */
+static inline int au_test_fs_bad_iattr(struct super_block *sb)
+{
+	return au_test_fs_bad_iattr_size(sb)
+		/* || au_test_cifs(sb) */	/* untested */
+		|| au_test_fat(sb)
+		|| au_test_msdos(sb)
+		|| au_test_vfat(sb);
+}
+
+/*
+ * filesystems which sets S_NOATIME and S_NOCMTIME.
+ */
+static inline int au_test_fs_notime(struct super_block *sb)
+{
+	return au_test_nfs(sb)
+		|| au_test_fuse(sb)
+		/* || au_test_cifs(sb) */	/* untested */
+		/* || au_test_ubifs(sb) */	/* untested */
+		;
+}
+
+/*
+ * filesystems which requires replacing i_mapping.
+ */
+static inline int au_test_fs_bad_mapping(struct super_block *sb)
+{
+	return au_test_fuse(sb);
+}
+
+/* temporary support for i#1 in cramfs */
+static inline int au_test_fs_unique_ino(struct inode *inode)
+{
+	if (au_test_cramfs(inode->i_sb))
+		return inode->i_ino != 1;
+	return 1;
+}
+
+/* ---------------------------------------------------------------------- */
+
+/*
+ * the filesystem where the xino files placed must support i/o after unlink and
+ * maintain i_size and i_blocks.
+ */
+static inline int au_test_fs_bad_xino(struct super_block *sb)
+{
+	return au_test_fs_remote(sb)
+		|| au_test_fs_bad_iattr_size(sb)
+		|| !au_test_fs_null_nd(sb) /* to keep xino code simple */
+		/* don't want unnecessary work for xino */
+		|| au_test_aufs(sb)
+		|| au_test_ecryptfs(sb);
+}
+
+static inline int au_test_fs_trunc_xino(struct super_block *sb)
+{
+	return au_test_tmpfs(sb);
+}
+
+/*
+ * test if the @sb is real-readonly.
+ */
+static inline int au_test_fs_rr(struct super_block *sb)
+{
+	return au_test_squashfs(sb)
+		|| au_test_iso9660(sb)
+		|| au_test_cramfs(sb)
+		|| au_test_romfs(sb);
+}
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_FSTYPE_H__ */
diff --git a/fs/aufs/magic.mk b/fs/aufs/magic.mk
new file mode 100644
index 0000000..83d613c
--- /dev/null
+++ b/fs/aufs/magic.mk
@@ -0,0 +1,52 @@
+
+# defined in ${srctree}/fs/fuse/inode.c
+# tristate
+ifdef CONFIG_FUSE_FS
+ccflags-y += -DFUSE_SUPER_MAGIC=0x65735546
+endif
+
+# defined in ${srctree}/fs/ocfs2/ocfs2_fs.h
+# tristate
+ifdef CONFIG_OCFS2_FS
+ccflags-y += -DOCFS2_SUPER_MAGIC=0x7461636f
+endif
+
+# defined in ${srctree}/fs/ocfs2/dlm/userdlm.h
+# tristate
+ifdef CONFIG_OCFS2_FS_O2CB
+ccflags-y += -DDLMFS_MAGIC=0x76a9f425
+endif
+
+# defined in ${srctree}/fs/ramfs/inode.c
+# always true
+ccflags-y += -DRAMFS_MAGIC=0x858458f6
+
+# defined in ${srctree}/fs/cifs/cifsfs.c
+# tristate
+ifdef CONFIG_CIFS_FS
+ccflags-y += -DCIFS_MAGIC_NUMBER=0xFF534D42
+endif
+
+# defined in ${srctree}/fs/xfs/xfs_sb.h
+# tristate
+ifdef CONFIG_XFS_FS
+ccflags-y += -DXFS_SB_MAGIC=0x58465342
+endif
+
+# defined in ${srctree}/fs/configfs/mount.c
+# tristate
+ifdef CONFIG_CONFIGFS_FS
+ccflags-y += -DCONFIGFS_MAGIC=0x62656570
+endif
+
+# defined in ${srctree}/fs/9p/v9fs.h
+# tristate
+ifdef CONFIG_9P_FS
+ccflags-y += -DV9FS_MAGIC=0x01021997
+endif
+
+# defined in ${srctree}/fs/ubifs/ubifs.h
+# tristate
+ifdef CONFIG_UBIFS_FS
+ccflags-y += -DUBIFS_SUPER_MAGIC=0x24051905
+endif
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 25/28] aufs debug
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (23 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 24/28] aufs test for fstype J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 26/28] export splice functions J. R. Okajima
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

initial commit
debug print

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/aufs/debug.c |  406 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/aufs/debug.h |  245 +++++++++++++++++++++++++++++++++
 fs/aufs/sysrq.c |  106 +++++++++++++++
 3 files changed, 757 insertions(+), 0 deletions(-)
 create mode 100644 fs/aufs/debug.c
 create mode 100644 fs/aufs/debug.h
 create mode 100644 fs/aufs/sysrq.c

diff --git a/fs/aufs/debug.c b/fs/aufs/debug.c
new file mode 100644
index 0000000..e8bc59c
--- /dev/null
+++ b/fs/aufs/debug.c
@@ -0,0 +1,406 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * debug print functions
+ */
+
+#include "aufs.h"
+
+atomic_t au_cond = ATOMIC_INIT(0);
+
+char *au_plevel = KERN_DEBUG;
+#define dpri(fmt, arg...) do { \
+	if (au_debug_test()) \
+		printk("%s" fmt, au_plevel, ##arg); \
+} while (0)
+
+/* ---------------------------------------------------------------------- */
+
+void au_dpri_whlist(struct au_nhash *whlist)
+{
+	int i;
+	struct hlist_head *head;
+	struct au_vdir_wh *tpos;
+	struct hlist_node *pos;
+
+	for (i = 0; i < AuSize_NHASH; i++) {
+		head = whlist->heads + i;
+		hlist_for_each_entry(tpos, pos, head, wh_hash)
+			dpri("b%d, %.*s, %d\n",
+			     tpos->wh_bindex,
+			     tpos->wh_str.len, tpos->wh_str.name,
+			     tpos->wh_str.len);
+	}
+}
+
+void au_dpri_vdir(struct au_vdir *vdir)
+{
+	int i;
+	union au_vdir_deblk_p p;
+	unsigned char *o;
+
+	if (!vdir || IS_ERR(vdir)) {
+		dpri("err %ld\n", PTR_ERR(vdir));
+		return;
+	}
+
+	dpri("nblk %d, deblk %p, last{%d, %p}, ver %lu\n",
+	     vdir->vd_nblk, vdir->vd_deblk,
+	     vdir->vd_last.i, vdir->vd_last.p.p, vdir->vd_version);
+	for (i = 0; i < vdir->vd_nblk; i++) {
+		p.deblk = vdir->vd_deblk[i];
+		o = p.p;
+		dpri("[%d]: %p\n", i, o);
+	}
+}
+
+static int do_pri_inode(aufs_bindex_t bindex, struct inode *inode,
+			struct dentry *wh)
+{
+	char *n = NULL;
+	int l = 0;
+
+	if (!inode || IS_ERR(inode)) {
+		dpri("i%d: err %ld\n", bindex, PTR_ERR(inode));
+		return -1;
+	}
+
+	/* the type of i_blocks depends upon CONFIG_LSF */
+	BUILD_BUG_ON(sizeof(inode->i_blocks) != sizeof(unsigned long)
+		     && sizeof(inode->i_blocks) != sizeof(u64));
+	if (wh) {
+		n = (void *)wh->d_name.name;
+		l = wh->d_name.len;
+	}
+
+	dpri("i%d: i%lu, %s, cnt %d, nl %u, 0%o, sz %llu, blk %llu,"
+	     " ct %lld, np %lu, st 0x%lx, f 0x%x, g %x%s%.*s\n",
+	     bindex,
+	     inode->i_ino, inode->i_sb ? au_sbtype(inode->i_sb) : "??",
+	     atomic_read(&inode->i_count), inode->i_nlink, inode->i_mode,
+	     i_size_read(inode), (unsigned long long)inode->i_blocks,
+	     (long long)timespec_to_ns(&inode->i_ctime) & 0x0ffff,
+	     inode->i_mapping ? inode->i_mapping->nrpages : 0,
+	     inode->i_state, inode->i_flags, inode->i_generation,
+	     l ? ", wh " : "", l, n);
+	return 0;
+}
+
+void au_dpri_inode(struct inode *inode)
+{
+	struct au_iinfo *iinfo;
+	aufs_bindex_t bindex;
+	int err;
+
+	err = do_pri_inode(-1, inode, NULL);
+	if (err || !au_test_aufs(inode->i_sb))
+		return;
+
+	iinfo = au_ii(inode);
+	if (!iinfo)
+		return;
+	dpri("i-1: bstart %d, bend %d, gen %d\n",
+	     iinfo->ii_bstart, iinfo->ii_bend, au_iigen(inode));
+	if (iinfo->ii_bstart < 0)
+		return;
+	for (bindex = iinfo->ii_bstart; bindex <= iinfo->ii_bend; bindex++)
+		do_pri_inode(bindex, iinfo->ii_hinode[0 + bindex].hi_inode,
+			     iinfo->ii_hinode[0 + bindex].hi_whdentry);
+}
+
+static int do_pri_dentry(aufs_bindex_t bindex, struct dentry *dentry)
+{
+	struct dentry *wh = NULL;
+
+	if (!dentry || IS_ERR(dentry)) {
+		dpri("d%d: err %ld\n", bindex, PTR_ERR(dentry));
+		return -1;
+	}
+	/* do not call dget_parent() here */
+	dpri("d%d: %.*s?/%.*s, %s, cnt %d, flags 0x%x\n",
+	     bindex,
+	     AuDLNPair(dentry->d_parent), AuDLNPair(dentry),
+	     dentry->d_sb ? au_sbtype(dentry->d_sb) : "??",
+	     atomic_read(&dentry->d_count), dentry->d_flags);
+	if (bindex >= 0 && dentry->d_inode && au_test_aufs(dentry->d_sb)) {
+		struct au_iinfo *iinfo = au_ii(dentry->d_inode);
+		if (iinfo)
+			wh = iinfo->ii_hinode[0 + bindex].hi_whdentry;
+	}
+	do_pri_inode(bindex, dentry->d_inode, wh);
+	return 0;
+}
+
+void au_dpri_dentry(struct dentry *dentry)
+{
+	struct au_dinfo *dinfo;
+	aufs_bindex_t bindex;
+	int err;
+
+	err = do_pri_dentry(-1, dentry);
+	if (err || !au_test_aufs(dentry->d_sb))
+		return;
+
+	dinfo = au_di(dentry);
+	if (!dinfo)
+		return;
+	dpri("d-1: bstart %d, bend %d, bwh %d, bdiropq %d, gen %d\n",
+	     dinfo->di_bstart, dinfo->di_bend,
+	     dinfo->di_bwh, dinfo->di_bdiropq, au_digen(dentry));
+	if (dinfo->di_bstart < 0)
+		return;
+	for (bindex = dinfo->di_bstart; bindex <= dinfo->di_bend; bindex++)
+		do_pri_dentry(bindex, dinfo->di_hdentry[0 + bindex].hd_dentry);
+}
+
+static int do_pri_file(aufs_bindex_t bindex, struct file *file)
+{
+	char a[32];
+
+	if (!file || IS_ERR(file)) {
+		dpri("f%d: err %ld\n", bindex, PTR_ERR(file));
+		return -1;
+	}
+	a[0] = 0;
+	if (bindex < 0
+	    && file->f_dentry
+	    && au_test_aufs(file->f_dentry->d_sb)
+	    && au_fi(file))
+		snprintf(a, sizeof(a), ", mmapped %d", au_test_mmapped(file));
+	dpri("f%d: mode 0x%x, flags 0%o, cnt %ld, pos %llu%s\n",
+	     bindex, file->f_mode, file->f_flags, (long)file_count(file),
+	     file->f_pos, a);
+	if (file->f_dentry)
+		do_pri_dentry(bindex, file->f_dentry);
+	return 0;
+}
+
+void au_dpri_file(struct file *file)
+{
+	struct au_finfo *finfo;
+	aufs_bindex_t bindex;
+	int err;
+
+	err = do_pri_file(-1, file);
+	if (err || !file->f_dentry || !au_test_aufs(file->f_dentry->d_sb))
+		return;
+
+	finfo = au_fi(file);
+	if (!finfo)
+		return;
+	if (finfo->fi_bstart < 0)
+		return;
+	for (bindex = finfo->fi_bstart; bindex <= finfo->fi_bend; bindex++) {
+		struct au_hfile *hf;
+
+		hf = finfo->fi_hfile + bindex;
+		do_pri_file(bindex, hf ? hf->hf_file : NULL);
+	}
+}
+
+static int do_pri_br(aufs_bindex_t bindex, struct au_branch *br)
+{
+	struct vfsmount *mnt;
+	struct super_block *sb;
+
+	if (!br || IS_ERR(br)
+	    || !(mnt = br->br_mnt) || IS_ERR(mnt)
+	    || !(sb = mnt->mnt_sb) || IS_ERR(sb)) {
+		dpri("s%d: err %ld\n", bindex, PTR_ERR(br));
+		return -1;
+	}
+
+	dpri("s%d: {perm 0x%x, cnt %d, wbr %p}, "
+	     "%s, dev 0x%02x%02x, flags 0x%lx, cnt(BIAS) %d, active %d, "
+	     "xino %d\n",
+	     bindex, br->br_perm, atomic_read(&br->br_count), br->br_wbr,
+	     au_sbtype(sb), MAJOR(sb->s_dev), MINOR(sb->s_dev),
+	     sb->s_flags, sb->s_count - S_BIAS,
+	     atomic_read(&sb->s_active), !!br->br_xino.xi_file);
+	return 0;
+}
+
+void au_dpri_sb(struct super_block *sb)
+{
+	struct au_sbinfo *sbinfo;
+	aufs_bindex_t bindex;
+	int err;
+	/* to reuduce stack size */
+	struct {
+		struct vfsmount mnt;
+		struct au_branch fake;
+	} *a;
+
+	/* this function can be called from magic sysrq */
+	a = kzalloc(sizeof(*a), GFP_ATOMIC);
+	if (unlikely(!a)) {
+		dpri("no memory\n");
+		return;
+	}
+
+	a->mnt.mnt_sb = sb;
+	a->fake.br_perm = 0;
+	a->fake.br_mnt = &a->mnt;
+	a->fake.br_xino.xi_file = NULL;
+	atomic_set(&a->fake.br_count, 0);
+	smp_mb(); /* atomic_set */
+	err = do_pri_br(-1, &a->fake);
+	kfree(a);
+	dpri("dev 0x%x\n", sb->s_dev);
+	if (err || !au_test_aufs(sb))
+		return;
+
+	sbinfo = au_sbi(sb);
+	if (!sbinfo)
+		return;
+	dpri("nw %d, gen %u, kobj %d\n",
+	     atomic_read(&sbinfo->si_nowait.nw_len), sbinfo->si_generation,
+	     atomic_read(&sbinfo->si_kobj.kref.refcount));
+	for (bindex = 0; bindex <= sbinfo->si_bend; bindex++)
+		do_pri_br(bindex, sbinfo->si_branch[0 + bindex]);
+}
+
+/* ---------------------------------------------------------------------- */
+
+void au_dbg_sleep_jiffy(int jiffy)
+{
+	while (jiffy)
+		jiffy = schedule_timeout_uninterruptible(jiffy);
+}
+
+void au_dbg_iattr(struct iattr *ia)
+{
+#define AuBit(name)	if (ia->ia_valid & ATTR_ ## name) \
+				dpri(#name "\n")
+	AuBit(MODE);
+	AuBit(UID);
+	AuBit(GID);
+	AuBit(SIZE);
+	AuBit(ATIME);
+	AuBit(MTIME);
+	AuBit(CTIME);
+	AuBit(ATIME_SET);
+	AuBit(MTIME_SET);
+	AuBit(FORCE);
+	AuBit(ATTR_FLAG);
+	AuBit(KILL_SUID);
+	AuBit(KILL_SGID);
+	AuBit(FILE);
+	AuBit(KILL_PRIV);
+	AuBit(OPEN);
+	AuBit(TIMES_SET);
+#undef	AuBit
+	dpri("ia_file %p\n", ia->ia_file);
+}
+
+/* ---------------------------------------------------------------------- */
+
+void au_dbg_verify_dir_parent(struct dentry *dentry, unsigned int sigen)
+{
+	struct dentry *parent;
+
+	parent = dget_parent(dentry);
+	AuDebugOn(!S_ISDIR(dentry->d_inode->i_mode)
+		  || IS_ROOT(dentry)
+		  || au_digen(parent) != sigen);
+	dput(parent);
+}
+
+void au_dbg_verify_nondir_parent(struct dentry *dentry, unsigned int sigen)
+{
+	struct dentry *parent;
+
+	parent = dget_parent(dentry);
+	AuDebugOn(S_ISDIR(dentry->d_inode->i_mode)
+		  || au_digen(parent) != sigen);
+	dput(parent);
+}
+
+void au_dbg_verify_gen(struct dentry *parent, unsigned int sigen)
+{
+	int err, i, j;
+	struct au_dcsub_pages dpages;
+	struct au_dpage *dpage;
+	struct dentry **dentries;
+
+	err = au_dpages_init(&dpages, GFP_NOFS);
+	AuDebugOn(err);
+	err = au_dcsub_pages_rev(&dpages, parent, /*do_include*/1, NULL, NULL);
+	AuDebugOn(err);
+	for (i = dpages.ndpage - 1; !err && i >= 0; i--) {
+		dpage = dpages.dpages + i;
+		dentries = dpage->dentries;
+		for (j = dpage->ndentry - 1; !err && j >= 0; j--)
+			AuDebugOn(au_digen(dentries[j]) != sigen);
+	}
+	au_dpages_free(&dpages);
+}
+
+void au_dbg_verify_hf(struct au_finfo *finfo)
+{
+	struct au_hfile *hf;
+	aufs_bindex_t bend, bindex;
+
+	if (finfo->fi_bstart >= 0) {
+		bend = finfo->fi_bend;
+		for (bindex = finfo->fi_bstart; bindex <= bend; bindex++) {
+			hf = finfo->fi_hfile + bindex;
+			AuDebugOn(hf->hf_file || hf->hf_br);
+		}
+	}
+}
+
+void au_dbg_verify_kthread(void)
+{
+	if (au_test_wkq(current)) {
+		au_dbg_blocked();
+		BUG();
+	}
+}
+
+/* ---------------------------------------------------------------------- */
+
+void au_debug_sbinfo_init(struct au_sbinfo *sbinfo __maybe_unused)
+{
+#ifdef AuForceNoPlink
+	au_opt_clr(sbinfo->si_mntflags, PLINK);
+#endif
+#ifdef AuForceNoXino
+	au_opt_clr(sbinfo->si_mntflags, XINO);
+#endif
+#ifdef AuForceNoRefrof
+	au_opt_clr(sbinfo->si_mntflags, REFROF);
+#endif
+#ifdef AuForceHinotify
+	au_opt_set_udba(sbinfo->si_mntflags, UDBA_HINOTIFY);
+#endif
+}
+
+int __init au_debug_init(void)
+{
+	aufs_bindex_t bindex;
+	struct au_vdir_destr destr;
+
+	bindex = -1;
+	AuDebugOn(bindex >= 0);
+
+	destr.len = -1;
+	AuDebugOn(destr.len < NAME_MAX);
+
+#ifdef CONFIG_4KSTACKS
+	AuWarn("CONFIG_4KSTACKS is defined.\n");
+#endif
+
+#ifdef AuForceNoBrs
+	sysaufs_brs = 0;
+#endif
+
+	return 0;
+}
diff --git a/fs/aufs/debug.h b/fs/aufs/debug.h
new file mode 100644
index 0000000..0b377a8
--- /dev/null
+++ b/fs/aufs/debug.h
@@ -0,0 +1,245 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * debug print functions
+ */
+
+#ifndef __AUFS_DEBUG_H__
+#define __AUFS_DEBUG_H__
+
+#ifdef __KERNEL__
+
+#include <linux/delay.h>
+#include <linux/fs.h>
+#include <linux/kd.h>
+#include <linux/vt_kern.h>
+#include <linux/sysrq.h>
+#include <linux/aufs_type.h>
+
+#ifdef CONFIG_AUFS_DEBUG
+#define AuDebugOn(a)		BUG_ON(a)
+extern atomic_t au_cond;
+static inline void au_debug(int n)
+{
+	atomic_set(&au_cond, n);
+	smp_mb();
+}
+
+static inline int au_debug_test(void)
+{
+	int ret;
+
+	ret = atomic_read(&au_cond);
+	smp_mb();
+	return ret;
+}
+#else
+#define AuDebugOn(a)		do {} while (0)
+#define au_debug()		do {} while (0)
+static inline int au_debug_test(void)
+{
+	return 0;
+}
+#endif /* CONFIG_AUFS_DEBUG */
+
+/* ---------------------------------------------------------------------- */
+
+/* debug print */
+
+#define AuDpri(lvl, fmt, arg...) \
+	printk(lvl AUFS_NAME " %s:%d:%s[%d]: " fmt, \
+	       __func__, __LINE__, current->comm, current->pid, ##arg)
+#define AuDbg(fmt, arg...) do { \
+	if (au_debug_test()) \
+		AuDpri(KERN_DEBUG, fmt, ##arg); \
+} while (0)
+#define AuLabel(l) 		AuDbg(#l "\n")
+#define AuInfo(fmt, arg...)	AuDpri(KERN_INFO, fmt, ##arg)
+#define AuWarn(fmt, arg...)	AuDpri(KERN_WARNING, fmt, ##arg)
+#define AuErr(fmt, arg...)	AuDpri(KERN_ERR, fmt, ##arg)
+#define AuIOErr(fmt, arg...)	AuErr("I/O Error, " fmt, ##arg)
+#define AuWarn1(fmt, arg...) do { \
+	static unsigned char _c; \
+	if (!_c++) \
+		AuWarn(fmt, ##arg); \
+} while (0)
+
+#define AuErr1(fmt, arg...) do { \
+	static unsigned char _c; \
+	if (!_c++) \
+		AuErr(fmt, ##arg); \
+} while (0)
+
+#define AuIOErr1(fmt, arg...) do { \
+	static unsigned char _c; \
+	if (!_c++) \
+		AuIOErr(fmt, ##arg); \
+} while (0)
+
+#define AuUnsupportMsg	"This operation is not supported." \
+			" Please report this application to aufs-users ML."
+#define AuUnsupport(fmt, args...) do { \
+	AuErr(AuUnsupportMsg "\n" fmt, ##args); \
+	dump_stack(); \
+} while (0)
+
+#define AuTraceErr(e) do { \
+	if (unlikely((e) < 0)) \
+		AuDbg("err %d\n", (int)(e)); \
+} while (0)
+
+#define AuTraceErrPtr(p) do { \
+	if (IS_ERR(p)) \
+		AuDbg("err %ld\n", PTR_ERR(p)); \
+} while (0)
+
+/* dirty macros for debug print, use with "%.*s" and caution */
+#define AuLNPair(qstr)		(qstr)->len, (qstr)->name
+#define AuDLNPair(d)		AuLNPair(&(d)->d_name)
+
+/* ---------------------------------------------------------------------- */
+
+struct au_sbinfo;
+struct au_finfo;
+#ifdef CONFIG_AUFS_DEBUG
+extern char *au_plevel;
+struct au_nhash;
+void au_dpri_whlist(struct au_nhash *whlist);
+struct au_vdir;
+void au_dpri_vdir(struct au_vdir *vdir);
+void au_dpri_inode(struct inode *inode);
+void au_dpri_dentry(struct dentry *dentry);
+void au_dpri_file(struct file *filp);
+void au_dpri_sb(struct super_block *sb);
+
+void au_dbg_sleep_jiffy(int jiffy);
+void au_dbg_iattr(struct iattr *ia);
+
+void au_dbg_verify_dir_parent(struct dentry *dentry, unsigned int sigen);
+void au_dbg_verify_nondir_parent(struct dentry *dentry, unsigned int sigen);
+void au_dbg_verify_gen(struct dentry *parent, unsigned int sigen);
+void au_dbg_verify_hf(struct au_finfo *finfo);
+void au_dbg_verify_kthread(void);
+
+int __init au_debug_init(void);
+void au_debug_sbinfo_init(struct au_sbinfo *sbinfo);
+#define AuDbgWhlist(w) do { \
+	AuDbg(#w "\n"); \
+	au_dpri_whlist(w); \
+} while (0)
+
+#define AuDbgVdir(v) do { \
+	AuDbg(#v "\n"); \
+	au_dpri_vdir(v); \
+} while (0)
+
+#define AuDbgInode(i) do { \
+	AuDbg(#i "\n"); \
+	au_dpri_inode(i); \
+} while (0)
+
+#define AuDbgDentry(d) do { \
+	AuDbg(#d "\n"); \
+	au_dpri_dentry(d); \
+} while (0)
+
+#define AuDbgFile(f) do { \
+	AuDbg(#f "\n"); \
+	au_dpri_file(f); \
+} while (0)
+
+#define AuDbgSb(sb) do { \
+	AuDbg(#sb "\n"); \
+	au_dpri_sb(sb); \
+} while (0)
+
+#define AuDbgSleep(sec) do { \
+	AuDbg("sleep %d sec\n", sec); \
+	ssleep(sec); \
+} while (0)
+
+#define AuDbgSleepJiffy(jiffy) do { \
+	AuDbg("sleep %d jiffies\n", jiffy); \
+	au_dbg_sleep_jiffy(jiffy); \
+} while (0)
+
+#define AuDbgIAttr(ia) do { \
+	AuDbg("ia_valid 0x%x\n", (ia)->ia_valid); \
+	au_dbg_iattr(ia); \
+} while (0)
+#else
+static inline void au_dbg_verify_dir_parent(struct dentry *dentry,
+					    unsigned int sigen)
+{
+	/* empty */
+}
+static inline void au_dbg_verify_nondir_parent(struct dentry *dentry,
+					   unsigned int sigen)
+{
+	/* empty */
+}
+static inline void au_dbg_verify_gen(struct dentry *parent, unsigned int sigen)
+{
+	/* empty */
+}
+static inline void au_dbg_verify_hf(struct au_finfo *finfo)
+{
+	/* empty */
+}
+static inline void au_dbg_verify_kthread(void)
+{
+	/* empty */
+}
+
+static inline int au_debug_init(void)
+{
+	return 0;
+}
+static inline void au_debug_sbinfo_init(struct au_sbinfo *sbinfo)
+{
+	/* empty */
+}
+#define AuDbgWhlist(w)		do {} while (0)
+#define AuDbgVdir(v)		do {} while (0)
+#define AuDbgInode(i)		do {} while (0)
+#define AuDbgDentry(d)		do {} while (0)
+#define AuDbgFile(f)		do {} while (0)
+#define AuDbgSb(sb)		do {} while (0)
+#define AuDbgSleep(sec)		do {} while (0)
+#define AuDbgSleepJiffy(jiffy)	do {} while (0)
+#define AuDbgIAttr(ia)		do {} while (0)
+#endif /* CONFIG_AUFS_DEBUG */
+
+/* ---------------------------------------------------------------------- */
+
+#ifdef CONFIG_AUFS_MAGIC_SYSRQ
+int __init au_sysrq_init(void);
+void au_sysrq_fin(void);
+
+#ifdef CONFIG_HW_CONSOLE
+#define au_dbg_blocked() do { \
+	WARN_ON(1); \
+	handle_sysrq('w', vc_cons[fg_console].d->vc_tty); \
+} while (0)
+#else
+#define au_dbg_blocked()	do {} while (0)
+#endif
+
+#else
+static inline int au_sysrq_init(void)
+{
+	return 0;
+}
+#define au_sysrq_fin()		do {} while (0)
+#define au_dbg_blocked()	do {} while (0)
+#endif /* CONFIG_AUFS_MAGIC_SYSRQ */
+
+#endif /* __KERNEL__ */
+#endif /* __AUFS_DEBUG_H__ */
diff --git a/fs/aufs/sysrq.c b/fs/aufs/sysrq.c
new file mode 100644
index 0000000..419ff2c
--- /dev/null
+++ b/fs/aufs/sysrq.c
@@ -0,0 +1,106 @@
+/*
+ * Copyright (C) 2005-2009 Junjiro R. Okajima
+ *
+ * This program, aufs is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+/*
+ * magic sysrq hanlder
+ */
+
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+/* #include <linux/sysrq.h> */
+#include "aufs.h"
+
+/* ---------------------------------------------------------------------- */
+
+static void sysrq_sb(struct super_block *sb)
+{
+	char *plevel;
+	struct au_sbinfo *sbinfo;
+	struct file *file;
+
+	plevel = au_plevel;
+	au_plevel = KERN_WARNING;
+	au_debug(1);
+
+	sbinfo = au_sbi(sb);
+	pr_warning("si=%lx\n", sysaufs_si_id(sbinfo));
+	pr_warning(AUFS_NAME ": superblock\n");
+	au_dpri_sb(sb);
+	pr_warning(AUFS_NAME ": root dentry\n");
+	au_dpri_dentry(sb->s_root);
+	pr_warning(AUFS_NAME ": root inode\n");
+	au_dpri_inode(sb->s_root->d_inode);
+#if 0
+	struct inode *i;
+	pr_warning(AUFS_NAME ": isolated inode\n");
+	list_for_each_entry(i, &sb->s_inodes, i_sb_list)
+		if (list_empty(&i->i_dentry))
+			au_dpri_inode(i);
+#endif
+	pr_warning(AUFS_NAME ": files\n");
+	list_for_each_entry(file, &sb->s_files, f_u.fu_list)
+		if (!special_file(file->f_dentry->d_inode->i_mode))
+			au_dpri_file(file);
+
+	au_plevel = plevel;
+	au_debug(0);
+}
+
+/* ---------------------------------------------------------------------- */
+
+/* module parameter */
+static char *aufs_sysrq_key = "a";
+module_param_named(sysrq, aufs_sysrq_key, charp, S_IRUGO);
+MODULE_PARM_DESC(sysrq, "MagicSysRq key for " AUFS_NAME);
+
+static void au_sysrq(int key __maybe_unused,
+		     struct tty_struct *tty __maybe_unused)
+{
+	struct kobject *kobj;
+	struct au_sbinfo *sbinfo;
+
+	/* spin_lock(&sysaufs_ket->list_lock); */
+	list_for_each_entry(kobj, &sysaufs_ket->list, entry) {
+		sbinfo = container_of(kobj, struct au_sbinfo, si_kobj);
+		sysrq_sb(sbinfo->si_sb);
+	}
+	/* spin_unlock(&sysaufs_ket->list_lock); */
+}
+
+static struct sysrq_key_op au_sysrq_op = {
+	.handler	= au_sysrq,
+	.help_msg	= "Aufs",
+	.action_msg	= "Aufs",
+	.enable_mask	= SYSRQ_ENABLE_DUMP
+};
+
+/* ---------------------------------------------------------------------- */
+
+int __init au_sysrq_init(void)
+{
+	int err;
+	char key;
+
+	err = -1;
+	key = *aufs_sysrq_key;
+	if ('a' <= key && key <= 'z')
+		err = register_sysrq_key(key, &au_sysrq_op);
+	if (unlikely(err))
+		AuErr("err %d, sysrq=%c\n", err, key);
+	return err;
+}
+
+void au_sysrq_fin(void)
+{
+	int err;
+	err = unregister_sysrq_key(*aufs_sysrq_key, &au_sysrq_op);
+	if (unlikely(err))
+		AuErr("err %d (ignored)\n", err);
+}
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 26/28] export splice functions
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (24 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 25/28] aufs debug J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 27/28] export lookup functions J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 28/28] kbuild aufs J. R. Okajima
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

export splice functions to aufs module and others

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/splice.c            |   10 +++++-----
 include/linux/splice.h |    6 ++++++
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 4ed0ba4..2fb3d17 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -888,8 +888,8 @@ EXPORT_SYMBOL(generic_splice_sendpage);
 /*
  * Attempt to initiate a splice from pipe to file.
  */
-static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
-			   loff_t *ppos, size_t len, unsigned int flags)
+long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
+		    loff_t *ppos, size_t len, unsigned int flags)
 {
 	int ret;
 
@@ -912,9 +912,9 @@ static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
 /*
  * Attempt to initiate a splice from a file to a pipe.
  */
-static long do_splice_to(struct file *in, loff_t *ppos,
-			 struct pipe_inode_info *pipe, size_t len,
-			 unsigned int flags)
+long do_splice_to(struct file *in, loff_t *ppos,
+		  struct pipe_inode_info *pipe, size_t len,
+		  unsigned int flags)
 {
 	int ret;
 
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 528dcb9..5123bc6 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -71,4 +71,10 @@ extern ssize_t splice_to_pipe(struct pipe_inode_info *,
 extern ssize_t splice_direct_to_actor(struct file *, struct splice_desc *,
 				      splice_direct_actor *);
 
+extern long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
+			   loff_t *ppos, size_t len, unsigned int flags);
+extern long do_splice_to(struct file *in, loff_t *ppos,
+			 struct pipe_inode_info *pipe, size_t len,
+			 unsigned int flags);
+
 #endif
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 27/28] export lookup functions
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (25 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 26/28] export splice functions J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  2009-03-16  7:20 ` [RFC Aufs2 #2 28/28] kbuild aufs J. R. Okajima
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

export lookup functions to aufs module and others

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/namei.c            |    4 ++--
 include/linux/namei.h |    3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index bbc15c2..db581b4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1196,7 +1196,7 @@ out:
  * needs parent already locked. Doesn't follow mounts.
  * SMP-safe.
  */
-static struct dentry *lookup_hash(struct nameidata *nd)
+struct dentry *lookup_hash(struct nameidata *nd)
 {
 	int err;
 
@@ -1206,7 +1206,7 @@ static struct dentry *lookup_hash(struct nameidata *nd)
 	return __lookup_hash(&nd->last, nd->path.dentry, nd);
 }
 
-static int __lookup_one_len(const char *name, struct qstr *this,
+int __lookup_one_len(const char *name, struct qstr *this,
 		struct dentry *base, int len)
 {
 	unsigned long hash;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index fc2e035..182d43b 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -75,6 +75,9 @@ extern struct file *lookup_instantiate_filp(struct nameidata *nd, struct dentry
 extern struct file *nameidata_to_filp(struct nameidata *nd, int flags);
 extern void release_open_intent(struct nameidata *);
 
+extern struct dentry *lookup_hash(struct nameidata *nd);
+extern int __lookup_one_len(const char *name, struct qstr *this,
+			    struct dentry *base, int len);
 extern struct dentry *lookup_one_len(const char *, struct dentry *, int);
 extern struct dentry *lookup_one_noperm(const char *, struct dentry *);
 
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC Aufs2 #2 28/28] kbuild aufs
  2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
                   ` (26 preceding siblings ...)
  2009-03-16  7:20 ` [RFC Aufs2 #2 27/28] export lookup functions J. R. Okajima
@ 2009-03-16  7:20 ` J. R. Okajima
  27 siblings, 0 replies; 38+ messages in thread
From: J. R. Okajima @ 2009-03-16  7:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, J. R. Okajima

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/Kconfig  |    1 +
 fs/Makefile |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 93945dd..75156dd 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -222,6 +222,7 @@ source "fs/qnx4/Kconfig"
 source "fs/romfs/Kconfig"
 source "fs/sysv/Kconfig"
 source "fs/ufs/Kconfig"
+source "fs/aufs/Kconfig"
 
 endif # MISC_FILESYSTEMS
 
diff --git a/fs/Makefile b/fs/Makefile
index dc20db3..a4e9a65 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -124,3 +124,4 @@ obj-$(CONFIG_DEBUG_FS)		+= debugfs/
 obj-$(CONFIG_OCFS2_FS)		+= ocfs2/
 obj-$(CONFIG_BTRFS_FS)		+= btrfs/
 obj-$(CONFIG_GFS2_FS)           += gfs2/
+obj-$(CONFIG_AUFS_FS)           += aufs/
-- 
1.6.1.284.g5dc13


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC Aufs2 #2 21/28] aufs sysfs entries
  2009-03-16  7:20 ` [RFC Aufs2 #2 21/28] aufs sysfs entries J. R. Okajima
@ 2009-03-16 18:05   ` Greg KH
  2009-03-17  3:26     ` hooanon05
  0 siblings, 1 reply; 38+ messages in thread
From: Greg KH @ 2009-03-16 18:05 UTC (permalink / raw)
  To: J. R. Okajima; +Cc: linux-kernel, linux-fsdevel

On Mon, Mar 16, 2009 at 04:20:33PM +0900, J. R. Okajima wrote:
> initial commit
> sysfs entries, compiled only when CONFIG_SYSFS is enabled

debug stuff should be in debugfs, not sysfs.

Also, any sysfs files that are created, need to be documented in
Documentation/ABI/.

> +#ifdef CONFIG_AUFS_DEBUG
> +static ssize_t debug_show(struct kobject *kobj __maybe_unused,
> +			  struct kobj_attribute *attr __maybe_unused,
> +			  char *buf)
> +{
> +	return sprintf(buf, "%d\n", au_debug_test());
> +}
> +
> +static ssize_t debug_store(struct kobject *kobj __maybe_unused,
> +			   struct kobj_attribute *attr __maybe_unused,
> +			   const char *buf, size_t sz)
> +{
> +	if (unlikely(!sz || (*buf != '0' && *buf != '1')))
> +		return -EOPNOTSUPP;
> +
> +	if (*buf == '0')
> +		au_debug(0);
> +	else if (*buf == '1')
> +		au_debug(1);
> +	return sz;
> +}

This should be a simple module parameter, like almost all drivers have
it.  That way it shows up in /sys/modules/aufs/ properly, as a boolean
value.  No need to reinvent the wheel here.

> +/* todo: file size may exceed PAGE_SIZE */
> +ssize_t sysaufs_si_show(struct kobject *kobj, struct attribute *attr,
> +			 char *buf)

NO IT CAN NOT!!!!

You are using sysfs wrong if you even think you are getting close to
PAGE_SIZE.

Please, use debugfs for this.  sysfs is a "one value per file" type
filesystem.  You should never be using the seqfile interface for a sysfs
file, that's a sure sign something is wrong in your design.

I think I'll stop reading now :(

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC Aufs2 #2 21/28] aufs sysfs entries
  2009-03-16 18:05   ` Greg KH
@ 2009-03-17  3:26     ` hooanon05
  2009-03-17  3:37       ` Greg KH
  0 siblings, 1 reply; 38+ messages in thread
From: hooanon05 @ 2009-03-17  3:26 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, linux-fsdevel


Thank you very much for reviewing.

Greg KH:
> > +/* todo: file size may exceed PAGE_SIZE */
> > +ssize_t sysaufs_si_show(struct kobject *kobj, struct attribute *attr,
> > +			 char *buf)
> 
> NO IT CAN NOT!!!!
> 
> You are using sysfs wrong if you even think you are getting close to
> PAGE_SIZE.

The main reason I left a comment about PAGE_SIZE is that some of aufs
entry under sysfs contain the file path. And I don't think it is a good
idea to create a symlink since the path may be changed as a result of
users operation.
I left it as "todo" because such long path is rare.

If you have an idea to show users any file path which may be changed
dynamically, please let me know. At the beginning I have tried procfs,
but I was told that newer module should use sysfs instead of procfs. If
sysfs doesn't suit a path, what should I use? Debugfs?


> Please, use debugfs for this.  sysfs is a "one value per file" type
> filesystem.  You should never be using the seqfile interface for a sysfs
> file, that's a sure sign something is wrong in your design.
> 
> I think I'll stop reading now :(

I will change some entries which doesn't contain a file path (or make
them not to have have) and post again. So please don't stop reading.


J. R. Okajima

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC Aufs2 #2 21/28] aufs sysfs entries
  2009-03-17  3:26     ` hooanon05
@ 2009-03-17  3:37       ` Greg KH
  2009-03-17  4:05         ` hooanon05
  2009-03-19  8:04         ` hooanon05
  0 siblings, 2 replies; 38+ messages in thread
From: Greg KH @ 2009-03-17  3:37 UTC (permalink / raw)
  To: hooanon05; +Cc: linux-kernel, linux-fsdevel

On Tue, Mar 17, 2009 at 12:26:42PM +0900, hooanon05@yahoo.co.jp wrote:
> 
> Thank you very much for reviewing.
> 
> Greg KH:
> > > +/* todo: file size may exceed PAGE_SIZE */
> > > +ssize_t sysaufs_si_show(struct kobject *kobj, struct attribute *attr,
> > > +			 char *buf)
> > 
> > NO IT CAN NOT!!!!
> > 
> > You are using sysfs wrong if you even think you are getting close to
> > PAGE_SIZE.
> 
> The main reason I left a comment about PAGE_SIZE is that some of aufs
> entry under sysfs contain the file path.

Which should be MAX_PATH, right?  Is that ever bigger than PAGE_SIZE?

> And I don't think it is a good idea to create a symlink since the path
> may be changed as a result of users operation.
> I left it as "todo" because such long path is rare.
> 
> If you have an idea to show users any file path which may be changed
> dynamically, please let me know. At the beginning I have tried procfs,
> but I was told that newer module should use sysfs instead of procfs. If
> sysfs doesn't suit a path, what should I use? Debugfs?

What specifically are you trying to show here?

> > Please, use debugfs for this.  sysfs is a "one value per file" type
> > filesystem.  You should never be using the seqfile interface for a sysfs
> > file, that's a sure sign something is wrong in your design.
> > 
> > I think I'll stop reading now :(
> 
> I will change some entries which doesn't contain a file path (or make
> them not to have have) and post again. So please don't stop reading.

Please document your sysfs entries, and I'll be glad to review it again.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC Aufs2 #2 21/28] aufs sysfs entries
  2009-03-17  3:37       ` Greg KH
@ 2009-03-17  4:05         ` hooanon05
  2009-03-17  4:22           ` Greg KH
  2009-03-17 10:39           ` Evgeniy Polyakov
  2009-03-19  8:04         ` hooanon05
  1 sibling, 2 replies; 38+ messages in thread
From: hooanon05 @ 2009-03-17  4:05 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, linux-fsdevel


Greg KH:
> Which should be MAX_PATH, right?  Is that ever bigger than PAGE_SIZE?

Is the limit of PATH_MAX applied to the absolute path too?


> What specifically are you trying to show here?

- the abs path of a directory which is a member of a union.
- the abs path of a xino file which is a regular file maintained by
  aufs.


> Please document your sysfs entries, and I'll be glad to review it again.

Thank you.
I will create Documentation/ABI/testing/sysfs-aufs and describe them in
it. I didn't know about it.


J. R. Okajima

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC Aufs2 #2 21/28] aufs sysfs entries
  2009-03-17  4:05         ` hooanon05
@ 2009-03-17  4:22           ` Greg KH
  2009-03-17  4:50             ` hooanon05
  2009-03-17 10:39           ` Evgeniy Polyakov
  1 sibling, 1 reply; 38+ messages in thread
From: Greg KH @ 2009-03-17  4:22 UTC (permalink / raw)
  To: hooanon05; +Cc: linux-kernel, linux-fsdevel

On Tue, Mar 17, 2009 at 01:05:47PM +0900, hooanon05@yahoo.co.jp wrote:
> 
> Greg KH:
> > Which should be MAX_PATH, right?  Is that ever bigger than PAGE_SIZE?
> 
> Is the limit of PATH_MAX applied to the absolute path too?
> 
> 
> > What specifically are you trying to show here?
> 
> - the abs path of a directory which is a member of a union.
> - the abs path of a xino file which is a regular file maintained by
>   aufs.

And what do people do with this information?  Is it just debugging
stuff?  If so, please put it in debugfs, not sysfs.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC Aufs2 #2 21/28] aufs sysfs entries
  2009-03-17  4:22           ` Greg KH
@ 2009-03-17  4:50             ` hooanon05
  0 siblings, 0 replies; 38+ messages in thread
From: hooanon05 @ 2009-03-17  4:50 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, linux-fsdevel


Greg KH:
> > - the abs path of a directory which is a member of a union.
> > - the abs path of a xino file which is a regular file maintained by
> >   aufs.
> 
> And what do people do with this information?  Is it just debugging
> stuff?  If so, please put it in debugfs, not sysfs.

No, they are not debugging stuff.
Users will need to know or confirm the member of union and the xino file
path. To show their abs path is necessary. Initially aufs showed them in
/proc/mounts (and /etc/mtab). But some users met the size limitation of
them since they build a union with many members. And I moved them to
sysfs (one path per a sysfs file).


J. R. Okajima



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC Aufs2 #2 21/28] aufs sysfs entries
  2009-03-17  4:05         ` hooanon05
  2009-03-17  4:22           ` Greg KH
@ 2009-03-17 10:39           ` Evgeniy Polyakov
  2009-03-17 12:56             ` hooanon05
  1 sibling, 1 reply; 38+ messages in thread
From: Evgeniy Polyakov @ 2009-03-17 10:39 UTC (permalink / raw)
  To: hooanon05; +Cc: Greg KH, linux-kernel, linux-fsdevel

On Tue, Mar 17, 2009 at 01:05:47PM +0900, hooanon05@yahoo.co.jp (hooanon05@yahoo.co.jp) wrote:
> Is the limit of PATH_MAX applied to the absolute path too?

No, only path used in the given operation (related or absolute) is
limited. If related path is converted into the absolute just for the
purpose of dump and not to be used in some syscall, it can have any
length, so you may need to rethink which pathes are presented in sysfs
and what should be moved outside.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC Aufs2 #2 21/28] aufs sysfs entries
  2009-03-17 10:39           ` Evgeniy Polyakov
@ 2009-03-17 12:56             ` hooanon05
  0 siblings, 0 replies; 38+ messages in thread
From: hooanon05 @ 2009-03-17 12:56 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Greg KH, linux-kernel, linux-fsdevel


Evgeniy Polyakov:
> On Tue, Mar 17, 2009 at 01:05:47PM +0900, hooanon05@yahoo.co.jp (hooanon05@yahoo.co.jp) wrote:
> > Is the limit of PATH_MAX applied to the absolute path too?
> 
> No, only path used in the given operation (related or absolute) is
> limited. If related path is converted into the absolute just for the
> purpose of dump and not to be used in some syscall, it can have any
> length, so you may need to rethink which pathes are presented in sysfs
> and what should be moved outside.

As a result, current code is implemented.
- aufs entries under sysfs print the absolute path.
- sets limit (PAGE_SIZE) to its length.
- and I think it is enough.

Theoretically, every entries under procfs (or sysfs?) which contains the
absolute path have this problem.


J. R. Okajima

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC Aufs2 #2 21/28] aufs sysfs entries
  2009-03-17  3:37       ` Greg KH
  2009-03-17  4:05         ` hooanon05
@ 2009-03-19  8:04         ` hooanon05
  1 sibling, 0 replies; 38+ messages in thread
From: hooanon05 @ 2009-03-19  8:04 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, linux-fsdevel


Greg KH:
> Please document your sysfs entries, and I'll be glad to review it again.

Greg,
I posted the updated files based upon your comments,
titled "RFC Aufs2 #3 ...".
Please review them again.

Of course, any comments from anyone are appriciated too.


Thanx in advance
J. R. Okajima


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2009-03-19  8:04 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-16  7:20 [RFC Aufs2 #2 00/28] source files J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 01/28] aufs documents J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 02/28] aufs public header file J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 03/28] aufs module global J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 04/28] aufs super_block J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 05/28] aufs branch directory/filesystem J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 06/28] aufs xino J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 07/28] aufs object lifetime management via sysfs J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 08/28] aufs mount options/flags J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 09/28] aufs workqueue J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 10/28] aufs sub-VFS J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 11/28] aufs sub-dcache J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 12/28] aufs copy-up J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 13/28] aufs whiteout J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 14/28] aufs pseudo-link J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 15/28] aufs policies to select one among multiple writable branches J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 16/28] aufs dentry and lookup J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 17/28] aufs file J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 18/28] aufs direcotry J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 19/28] aufs inode J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 20/28] aufs ioctl J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 21/28] aufs sysfs entries J. R. Okajima
2009-03-16 18:05   ` Greg KH
2009-03-17  3:26     ` hooanon05
2009-03-17  3:37       ` Greg KH
2009-03-17  4:05         ` hooanon05
2009-03-17  4:22           ` Greg KH
2009-03-17  4:50             ` hooanon05
2009-03-17 10:39           ` Evgeniy Polyakov
2009-03-17 12:56             ` hooanon05
2009-03-19  8:04         ` hooanon05
2009-03-16  7:20 ` [RFC Aufs2 #2 22/28] aufs branch for loopback block device J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 23/28] aufs internal inotify J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 24/28] aufs test for fstype J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 25/28] aufs debug J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 26/28] export splice functions J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 27/28] export lookup functions J. R. Okajima
2009-03-16  7:20 ` [RFC Aufs2 #2 28/28] kbuild aufs J. R. Okajima

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.