* [RFC 0/8] Aufs2 documents
@ 2009-02-23  7:31 hooanon05
  2009-02-23  7:33 ` [RFC 1/8] Aufs2: introduction hooanon05
                   ` (8 more replies)
  0 siblings, 9 replies; 42+ messages in thread
From: hooanon05 @ 2009-02-23  7:31 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel


Hello,

This is my second trial to ask incorporating aufs into mainline.
Aufs2 is a refined version of old aufs1:
- to be reviewed easily and widely.
- to make the source files simpler and smaller by dropping several
  original features.

Before posting source files, here are some documents from aufs2 which
describe some of the ideas, designs and approaches that are implemented.
Kindly review and let me know your comments.

J. R. Okajima


----------------------------------------------------------------------

Aufs2 -- advanced multi layered unification filesystem version 2
http://aufs.sf.net
Junjiro R. Okajima


0. Introduction
----------------------------------------
In the early days, aufs was an entirely re-designed and re-implemented
Unionfs Version 1.x series. After many original ideas, approaches,
improvements and implementations, it became totally different from
Unionfs while keeping the basic features.
Recently, the Unionfs Version 2.x series has begun taking some of the
same approaches as aufs1.
Unionfs is being developed by Professor Erez Zadok at Stony Brook
University and his team.

This version of AUFS, aufs2, has several purposes.
- to be reviewed easily and widely.
- to make the source files simpler and smaller by dropping several
  original features.

Through this work, I found some bad things in the aufs1 source code and
fixed them. Some of the dropped features will be restored in the future,
but not all, I'm afraid.
Aufs2 supports linux-2.6.27 and later. If you want support for an older
kernel version, try aufs1 from CVS on SourceForge.


1. Features
----------------------------------------
- unite several directories into a single virtual filesystem. A member
  directory is called a branch.
- you can specify permission flags for each branch, which are 'readonly',
  'readwrite' and 'whiteout-able'.
- with an upper writable branch, internal copyup and whiteout, files/dirs
  on a readonly branch are logically modifiable.
- dynamic branch manipulation, add, del.
- etc...

Also there are many enhancements in aufs1, such as:
- keep inode number by external inode number table
- keep the timestamps of file/dir in internal copyup operation
- seekable directory, supporting NFS readdir.
- support mmap(2) including /proc/PID/exe symlink, without page-copy
- whiteout is hardlinked in order to reduce the consumption of inodes
  on branch
- do not copyup, nor create a whiteout when it is unnecessary
- revert a single systemcall when an error occurs in aufs
- remount interface instead of ioctl
- maintain /etc/mtab by an external shell script, /sbin/mount.aufs.
- loopback mounted filesystem as a branch
- kernel thread for removing a dir which has plenty of whiteouts
- support copyup sparse file (a file which has a 'hole' in it)
- default permission flags for branches
- selectable permission flags for ro branch, whether whiteout can
  exist or not
- export via NFS.
- support <sysfs>/fs/aufs.
- support multiple writable branches, some policies to select one
  among multiple writable branches.
- a new semantics for link(2) and rename(2) to support multiple
  writable branches.
- a delegation of the internal branch access to support task I/O
  accounting, which also supports Linux Security Modules (LSM) mainly
  for Suse AppArmor.
- nested mount, i.e. aufs as readonly no-whiteout branch of another aufs.
- copyup-on-open or copyup-on-write
- show-whiteout mode
- show configuration even out of kernel tree
- no glibc changes are required.
- pseudo hardlink (hardlink over branches)
- allow direct manual access to a file on a branch, i.e. bypassing aufs,
  including an NFS or other remote filesystem branch.
- and more...

Currently these features are temporarily dropped from this version, aufs2.
See design/08plan.txt for details.
- exporting via NFS
- test only the highest one for the directory permission (dirperm1)
- show whiteout mode (shwh)
- copyup on open (coo=)
- being another aufs's readonly branch (robr)
- statistics of aufs thread (/sys/fs/aufs/stat)
- delegation mode (dlgt)
- intent.open/create (file open in a single lookup)


2. Download
----------------------------------------
One of the aufs users, the Center for Scientific Computing and Free
Software (C3SL), Federal University of Parana, kindly offered me public
GIT tree space.

There are three GIT trees, aufs2-2.6, aufs2-standalone and aufs2-util.
While aufs2-util is always necessary, you need only one of aufs2-2.6
and aufs2-standalone.

The aufs2-2.6 tree includes the whole linux-2.6 GIT tree,
git://git.kernel.org/.../torvalds/linux-2.6.git.
And you cannot select CONFIG_AUFS_FS=m for this version, i.e. you cannot
build aufs2 as an external kernel module.
If you already have a linux-2.6 GIT tree, you may want to pull and merge
the "aufs2" branch from this tree.

On the other hand, the aufs2-standalone tree has only aufs2 source files
and a necessary patch, and you can select CONFIG_AUFS_FS=m. In other
words, the aufs2-standalone tree is generated from the aufs2-2.6 tree by:
- extracting new files and modifications.
- generating a single patch file from the modifications.
- generating a ChangeLog file from git-log.
- committing the files anew, without log messages. This is not git-pull.

Both the aufs2-2.6 and aufs2-standalone trees have branches whose names
are in the form of "aufs2-xx" where "xx" represents the linux kernel
version, "linux-2.6.xx".

o aufs2-2.6 tree
$ git clone --reference /your/linux-2.6/git/tree \
	http://git.c3sl.ufpr.br/pub/scm/aufs/aufs2-2.6.git \
	aufs2-2.6.git
- if you don't have a linux-2.6 GIT tree, then remove "--reference ..."
$ cd aufs2-2.6.git
$ git checkout aufs2-xx	# for instance, aufs2-27 for linux-2.6.27

o aufs2-standalone tree
$ git clone http://git.c3sl.ufpr.br/pub/scm/aufs/aufs2-standalone.git \
	aufs2-standalone.git
$ cd aufs2-standalone.git
$ git checkout aufs2-xx	# for instance, aufs2-27 for linux-2.6.27
- apply "aufs2-standalone.patch" to your kernel source files.

o aufs2-util tree
$ git clone http://git.c3sl.ufpr.br/pub/scm/aufs/aufs2-util.git \
	aufs2-util.git
$ cd aufs2-util.git
- no particular tag/branch currently.


3. Configuration and Compilation
----------------------------------------
For aufs2-2.6 tree,
- enable CONFIG_EXPERIMENTAL and CONFIG_AUFS_FS.
- set other aufs configurations if necessary.

For aufs2-standalone tree,
- enable CONFIG_EXPERIMENTAL and CONFIG_AUFS_FS, you can select =m.
- edit fs/aufs/config.mk and set other aufs configurations if necessary.

And then,
- build your kernel (or a module) by "make"
- install it and reboot your system
- read README in aufs2-util, build and install it


4. Usage
----------------------------------------
First, make sure aufs2-util is installed, and please read the aufs
manual, ./Documentation/filesystems/aufs/aufs.5.
$ man -l aufs.5

And then,
$ mkdir /tmp/rw /tmp/aufs
# mount -t aufs -o br=/tmp/rw:${HOME}=ro none /tmp/aufs

Here is another example. The result is equivalent.
# mount -t aufs -o br=/tmp/rw:${HOME} none /tmp/aufs
  Or
# mount -t aufs -o br:/tmp/rw none /tmp/aufs
# mount -o remount,append:${HOME} /tmp/aufs

Then you can see the whole tree of your home dir through /tmp/aufs. If
you modify a file under /tmp/aufs, the one in your home directory is
not affected; instead a file with the same name is newly created under
/tmp/rw, and all of your modifications are applied to the one under
/tmp/rw. This is called the file-based Copy on Write (COW) method.
Aufs mount options are described in aufs.5.
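
As an illustration only, the file-based COW behaviour described above
can be modelled in user space. The following Python sketch is not aufs
code: the helper names (union_open_for_write, union_read, demo) and the
file name note.txt are made up for this example, and it handles only
files directly under the top directory of each branch.

```python
import os
import shutil
import tempfile


def union_open_for_write(rw_dir, ro_dir, name):
    """Return the path to write to, copying the file up first if needed."""
    rw_path = os.path.join(rw_dir, name)
    ro_path = os.path.join(ro_dir, name)
    if not os.path.exists(rw_path) and os.path.exists(ro_path):
        # Copy-up: duplicate the whole file (data, mode, timestamps) onto
        # the writable branch before the first modification.
        shutil.copy2(ro_path, rw_path)
    return rw_path


def union_read(rw_dir, ro_dir, name):
    """Read through the union: the upper (writable) branch wins."""
    for branch in (rw_dir, ro_dir):
        path = os.path.join(branch, name)
        if os.path.exists(path):
            with open(path) as f:
                return f.read()
    raise FileNotFoundError(name)


def demo():
    with tempfile.TemporaryDirectory() as top:
        rw = os.path.join(top, "rw")
        ro = os.path.join(top, "ro")
        os.mkdir(rw)
        os.mkdir(ro)
        with open(os.path.join(ro, "note.txt"), "w") as f:
            f.write("original\n")

        # Modify the file "through the union".
        with open(union_open_for_write(rw, ro, "note.txt"), "a") as f:
            f.write("modified\n")

        with open(os.path.join(ro, "note.txt")) as f:
            lower = f.read()
        return lower, union_read(rw, ro, "note.txt")
```

In demo(), the copy on the readonly side keeps its original content
while the merged view returns the modified copy from the writable side.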

Additionally, there are some sample usages of aufs, such as a
diskless system with network booting and a LiveCD over NFS.
See the sample dir in the CVS tree on SourceForge.


5. Contact
----------------------------------------
When you have any problems or strange behaviour in aufs, please let me
know with:
- /proc/mounts (instead of the output of mount(8))
- /sys/fs/aufs/* (if you have them)
- /sys/module/aufs/*
- linux kernel version
  if your kernel is not plain, for example modified by a distributor,
  the URL where I can download its source is necessary too.
- aufs version which was printed at loading the module or booting the
  system, instead of the date you downloaded.
- configuration (define/undefine CONFIG_AUFS_xxx)
- kernel configuration or /proc/config.gz (if you have it)
- behaviour which you think to be incorrect
- actual operation, reproducible one is better
- mailto: aufs-users at lists.sourceforge.net

Usually, I don't watch the Public Areas (Bugs, Support Requests, Patches,
and Feature Requests) on SourceForge. Please join and write to the
aufs-users ML.


6. Acknowledgements
----------------------------------------
Thanks to everyone who has tried or is using aufs, and to whoever
has reported a bug or given any feedback.

Especially donors:
Tomas Matejicek(slax.org) made a donation (much more than once).
Dai Itasaka made a donation (2007/8).
Chuck Smith made a donation (2008/4, 10 and 12).
Henk Schoneveld made a donation (2008/9).
Chih-Wei Huang, ASUS, CTC donated Eee PC 4G (2008/10).
Francois Dupoux made a donation (2008/11).
Bruno Cesar Ribas and Luis Carlos Erpen de Bona, C3SL, serve the public
GIT tree (2009/2).

Thank you very much.
Donations, including future ones, are always very important and
helpful for me to keep on developing aufs.


7.
----------------------------------------
If you are an experienced user, no explanation is needed. Aufs is
just a linux filesystem.


Enjoy!

# Local variables: ;
# mode: text;
# End: ;


* [RFC 1/8] Aufs2: introduction
  2009-02-23  7:31 [RFC 0/8] Aufs2 documents hooanon05
@ 2009-02-23  7:33 ` hooanon05
  2009-02-23  7:34 ` [RFC 2/8] Aufs2: structure hooanon05
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: hooanon05 @ 2009-02-23  7:33 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel


> This is my second trial to ask incorporating aufs into mainline.


Introduction
----------------------------------------

aufs [ei ju: ef es] | [a u f s]
1. abbrev. for "advanced multi-layered unification filesystem".
2. abbrev. for "another unionfs".
3. abbrev. for "auf das" in German which means "on the" in English.
   Ex. "Butter aufs Brot"(G) means "butter onto bread"(E).
   But "Filesystem aufs Filesystem" is hard to understand.

AUFS is a filesystem with features:
- multi layered stackable unification filesystem, a member directory
  is called a branch.
- branch permission and attribute, 'readonly', 'real-readonly',
  'readwrite', 'whiteout-able', 'link-able whiteout' and their
  combination.
- internal "file copy-on-write".
- logical deletion, whiteout.
- dynamic branch manipulation, adding, deleting and changing permission.
- allow bypassing aufs, user's direct branch access.
- external inode number translation table and bitmap which maintains the
  persistent aufs inode number.
- seekable directory, including NFS readdir.
- file mapping, mmap and sharing pages.
- pseudo-link, hardlink over branches.
- loopback mounted filesystem as a branch.
- several policies to select one among multiple writable branches.
- revert a single systemcall when an error occurs in aufs.
- and more...


Multi Layered Stackable Unification Filesystem
----------------------------------------------------------------------
Most people already know what it is.
It is a filesystem which unifies several directories and provides a
single merged directory. When users access a file, the access is
redirected to the real file on the member filesystem. The member
filesystem is called a 'lower filesystem' or 'branch' and has a mode,
either 'readonly' or 'readwrite'. The deletion of a file on a lower
readonly branch is handled by creating a 'whiteout' on the upper writable
branch.

On LKML, there have been discussions about UnionMount (Jan Blunck and
Bharata B Rao) and Unionfs (Erez Zadok). They took different approaches
to implementing the merged view.
The former tries putting it into VFS, and the latter implements it as a
separate filesystem.
(If I misunderstand these implementations, please let me know and
I shall correct it, because it is a long time since I last read their
source files.)
UnionMount's approach will be able to be small, but it may be hard to
share branches between several UnionMounts since the whiteout in it is
implemented in the inode on the branch filesystem and is always
shared. According to Bharata's post, readdir does not seem to be
finished yet.
Unionfs has a longer history. When I started implementing a stacking
filesystem (Aug 2005), it already existed. It has virtual super_block,
inode, dentry and file objects, and they have an array pointing to the
lower objects of the same kind. After contributing many patches for
Unionfs, I re-started my project AUFS (Jun 2006).

In AUFS, the structure of the filesystem resembles Unionfs, but I
implemented my own ideas, approaches and enhancements and it became
a totally different one.


Several characters/aspects of aufs
----------------------------------------------------------------------

Aufs has several characters or aspects.
1. a filesystem, callee of VFS helper
2. sub-VFS, caller of VFS helper for branches
3. a virtual filesystem which maintains persistent inode number
4. reader/writer of files on branches, like an application

1. Callee of VFS Helper
As an ordinary linux filesystem, aufs is a callee of VFS. For instance,
unlink(2) from an application reaches the sys_unlink() kernel function
and then vfs_unlink() is called. vfs_unlink() is one of the VFS helpers
and it calls the filesystem-specific unlink operation. Aufs does
implement the unlink operation, but it behaves like a redirector.

2. Caller of VFS Helper for Branches
aufs_unlink() passes the unlink request to the branch filesystem as if
it were called from VFS. So the called unlink operation of the branch
filesystem acts as usual. As a caller of VFS helper, aufs should handle
every necessary pre/post operation for the branch filesystem.
- acquire the lock for the parent dir on a branch
- lookup in a branch
- revalidate dentry on a branch
- mnt_want_write() for a branch
- vfs_unlink() for a branch
- mnt_drop_write() for a branch
- release the lock on a branch
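
The pairing of the pre/post operations listed above can be sketched in
user space as follows. This is only a model written in Python, not
kernel code: the strings merely record the order of the steps, the
helper names (paired, branch_unlink, demo) are made up, and the
behaviour on the error path (both release steps still run) is my
assumption.

```python
from contextlib import contextmanager


@contextmanager
def paired(trace, enter, leave):
    """Record `enter`, run the body, and always record `leave`."""
    trace.append(enter)
    try:
        yield
    finally:
        trace.append(leave)


def branch_unlink(trace, name, fail=False):
    """Model of the steps around vfs_unlink() for one branch."""
    with paired(trace, "lock parent dir", "unlock parent dir"):
        trace.append("lookup " + name)
        trace.append("revalidate " + name)
        with paired(trace, "mnt_want_write", "mnt_drop_write"):
            if fail:
                raise OSError("simulated failure in the branch filesystem")
            trace.append("vfs_unlink " + name)


def demo():
    ok, failed = [], []
    branch_unlink(ok, "fileA")
    try:
        branch_unlink(failed, "fileA", fail=True)
    except OSError:
        pass
    return ok, failed
```

In the failing run the trace has no "vfs_unlink" entry, but it still
ends with "mnt_drop_write" and "unlock parent dir".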

3. Persistent Inode Number
One of the most important issues for a filesystem is maintaining inode
numbers. This is particularly important for exporting a filesystem via
NFS. Aufs is a virtual filesystem which doesn't have a backend block
device of its own, but some storage is necessary to maintain inode
numbers. It may be a large space and may not be suitable to keep in
memory. Aufs rents some space from its first writable branch
filesystem (by default) and creates file(s) on it. These files are
created by aufs internally and (currently) removed soon afterwards while
being kept open.
Note: Because these files are removed, they are totally gone after
      unmounting aufs. It means the inode numbers are not persistent
      across unmount or reboot. I have a plan to make them really
      persistent, which will be important for aufs on an NFS server.

4. Read/Write Files Internally (copy-on-write)
Because a branch can be readonly, when you write to a file on it, aufs
will "copy-up" the file to the upper writable branch internally, and then
write the originally requested data to the copy. Generally the kernel
doesn't open/read/write files actively. In aufs, even a single write may
cause an internal "file copy". This behaviour is very similar to the
cp(1) command.

Some people may think it is better to pass such work to a user space
helper, instead of doing it in kernel space. Actually I am still thinking
about it. But currently I have implemented it in kernel space.


* [RFC 2/8] Aufs2: structure
  2009-02-23  7:31 [RFC 0/8] Aufs2 documents hooanon05
  2009-02-23  7:33 ` [RFC 1/8] Aufs2: introduction hooanon05
@ 2009-02-23  7:34 ` hooanon05
  2009-02-23  9:13   ` Tomas M
  2009-02-23  7:35 ` [RFC 3/8] Aufs2: lookup hooanon05
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: hooanon05 @ 2009-02-23  7:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel


> This is my second trial to ask incorporating aufs into mainline.


Basic Aufs Internal Structure

Superblock/Inode/Dentry/File Objects
----------------------------------------------------------------------
Like an ordinary filesystem, aufs has its own
superblock/inode/dentry/file objects. All these objects have a
dynamically allocated array and store the same kind of pointers to the
lower filesystem, branch.
For example, suppose you build a union /au from one readwrite branch /rw
and one readonly branch /ro.
- /au = /rw + /ro
- /ro/fileA exists but /rw/fileA does not

The aufs lookup operation finds /ro/fileA and gets the dentry for it. This
pointer is stored in an aufs dentry. The array in the aufs dentry will be,
- [0] = NULL
- [1] = /ro/fileA

This style of array is essentially the same for all of the aufs
superblock/inode/dentry/file objects.

Because aufs supports manipulating branches, i.e. add/delete/change
dynamically, each of these objects has its own generation. When branches
are changed, the generation in the aufs superblock is incremented, and
the generation in any other object is compared with it when the object
is accessed.
When the generation in an object is obsolete, aufs refreshes the
object's internal array.
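
The generation check can be modelled in a few lines of Python. This is
a sketch under my own assumptions, not the aufs data structures: the
class names, the "exists" callback standing in for a real lookup, and
the branch name /ro2 are all hypothetical.

```python
class Superblock:
    def __init__(self, branches):
        self.branches = list(branches)
        self.generation = 0

    def add_branch(self, path):
        self.branches.append(path)
        self.generation += 1  # any branch change bumps the generation


class Dentry:
    def __init__(self, sb, name, exists):
        self.sb = sb
        self.name = name
        self._exists = exists  # hypothetical stand-in for a branch lookup
        self.refresh()

    def refresh(self):
        """Rebuild the per-branch array and record the current generation."""
        self.lower = [
            branch + "/" + self.name if self._exists(branch, self.name) else None
            for branch in self.sb.branches
        ]
        self.generation = self.sb.generation

    def lower_entries(self):
        # Compare generations on access; refresh only when obsolete.
        if self.generation != self.sb.generation:
            self.refresh()
        return self.lower
```

With branches /rw and /ro and only /ro/fileA present, lower_entries()
gives [None, "/ro/fileA"], which matches the array shown earlier. After
add_branch() the stale array is rebuilt on the next access.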


Superblock
----------------------------------------------------------------------
Additionally aufs superblock has some data for policies to select one
among multiple writable branches, XIB files, pseudo-links and kobject.
See below for details.
About the policies which support copying down a directory, see policy.txt
too.


Branch and XINO(External Inode Number Translation Table)
----------------------------------------------------------------------
Every branch has its own xino (external inode number translation table)
file. The xino file is created and unlinked by aufs internally. When two
members of a union exist on the same filesystem, they share a single
xino file.
The structure of a xino file is simple, just a sequence of aufs inode
numbers which is indexed by the lower inode number.
In the above sample, assume the inode number of /ro/fileA is i111 and
aufs assigns the inode number i999 for fileA. Then aufs writes 999 as a
4 (or 8) byte value at offset 111 * 4 (or 8) bytes in the xino file.
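
The offset arithmetic can be tried out in user space. The following
Python sketch assumes 8-byte little-endian entries and treats 0 as "no
aufs inode number assigned yet"; both are assumptions made for this
example, as are the function names.

```python
import struct
import tempfile

INO_FMT = "<Q"  # assumption: 8-byte little-endian inode numbers
INO_SIZE = struct.calcsize(INO_FMT)


def xino_write(xino, lower_ino, aufs_ino):
    """Store aufs_ino in the slot indexed by the lower inode number."""
    xino.seek(lower_ino * INO_SIZE)
    xino.write(struct.pack(INO_FMT, aufs_ino))


def xino_read(xino, lower_ino):
    """Return the aufs inode number for lower_ino, or 0 if none is stored."""
    xino.seek(lower_ino * INO_SIZE)
    raw = xino.read(INO_SIZE)
    if len(raw) < INO_SIZE:
        return 0  # beyond end of file: nothing was ever written here
    return struct.unpack(INO_FMT, raw)[0]


def demo():
    with tempfile.TemporaryFile() as xino:
        xino_write(xino, 111, 999)
        size = xino.seek(0, 2)  # file size after the single write
        return xino_read(xino, 111), xino_read(xino, 110), size
```

Writing the entry for lower inode 111 lands at byte offset 888, so the
file is 896 bytes long afterwards, and the never-written slot for inode
110 reads back as 0.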

Also a writable branch has three kinds of "whiteout bases". All of these
exist once the branch is joined to aufs, and their names are
whiteout-ed doubly, so that users will never see their names in the aufs
hierarchy.
1. a regular file which will be linked to all whiteouts.
2. a directory to store a pseudo-link.
3. a directory to store an "orphan-ed" file temporarily.

1. Whiteout Base
   When you remove a file on a readonly branch, aufs handles it as a
   logical deletion and creates a whiteout on the upper writable branch
   as a hardlink of this file in order not to consume an inode on the
   writable branch.
2. Pseudo-link Dir
   See below, Pseudo-link.
3. Step-Parent Dir
   When "fileC" exists on the lower readonly branch only and it is
   opened and removed with its parent dir, and then the user writes
   something into it, aufs copies-up fileC to this directory, because
   there is no other dir to store fileC. After creating the file under
   this dir, the file is unlinked.
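
The inode saving from the hardlinked whiteout (item 1) can be observed
with ordinary hardlinks. In this Python sketch the ".wh." name prefix
and the base file name "whiteout-base" are assumptions for
illustration; the text above does not specify the on-disk names.

```python
import os
import tempfile

WH_PREFIX = ".wh."  # assumption: a whiteout is marked by a name prefix


def create_whiteout(branch_dir, base_path, name):
    """Create a whiteout for `name` as a hardlink of the whiteout base."""
    wh_path = os.path.join(branch_dir, WH_PREFIX + name)
    os.link(base_path, wh_path)
    return wh_path


def demo():
    with tempfile.TemporaryDirectory() as rw:
        base = os.path.join(rw, "whiteout-base")  # hypothetical base name
        open(base, "w").close()
        paths = [create_whiteout(rw, base, n) for n in ("fileA", "fileB")]
        inodes = {os.stat(p).st_ino for p in paths + [base]}
        return len(inodes), os.stat(base).st_nlink
```

Both whiteouts and the base share one inode; only the link count of the
base grows.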

Because aufs supports manipulating branches, ie. add/delete/change
dynamically, a branch has its own id. When the branch order changes, aufs
finds the new index by searching the branch id.


Pseudo-link
----------------------------------------------------------------------
Assume "fileA" exists on the lower readonly branch only and it is
hardlinked to "fileB" on the branch. When you write something to fileA,
aufs copies it up to the upper writable branch. Additionally aufs
creates a hardlink under the Pseudo-link Directory of the writable
branch. The inodes of pseudo-links are kept in the aufs super_block as a
simple list. If fileB is read after unlinking fileA, aufs returns
file data from the pseudo-link instead of the lower readonly
branch. Because the pseudo-link is based upon the inode, keeping the
inode number by xino (see above) is important.

All the hardlinks under the Pseudo-link Directory of the writable branch
should be restored to a proper location later. Aufs provides a utility
to do this. The userspace helpers are executed at remounting and
unmounting aufs by default.


XIB(external inode number bitmap)
----------------------------------------------------------------------
In addition to the per-branch xino file, aufs has an external inode
number bitmap in the superblock object. It is also a file, like a xino
file. It is a simple bitmap to mark whether an aufs inode number is in
use or not.
To reduce the file I/O, aufs prepares a single memory page to cache xib.

Aufs implements a feature to truncate/refresh both xino and xib to
reduce the number of consumed disk blocks for these files.
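
A bitmap of this kind can be sketched as below. This Python model
keeps the bits in memory instead of in a file, and the starting number
2 for allocation is an arbitrary assumption; the class and method names
are made up for the example.

```python
class InodeBitmap:
    """One bit per aufs inode number: 1 = in use, 0 = free."""

    def __init__(self):
        self.bits = bytearray()

    def in_use(self, ino):
        byte, bit = divmod(ino, 8)
        return byte < len(self.bits) and bool(self.bits[byte] & (1 << bit))

    def mark(self, ino):
        byte, bit = divmod(ino, 8)
        if byte >= len(self.bits):
            # Grow the bitmap with zero bytes up to the needed position.
            self.bits.extend(bytes(byte + 1 - len(self.bits)))
        self.bits[byte] |= 1 << bit

    def free(self, ino):
        byte, bit = divmod(ino, 8)
        if byte < len(self.bits):
            self.bits[byte] &= ~(1 << bit) & 0xFF

    def alloc(self, first=2):
        """Return the lowest free inode number >= first and mark it used."""
        ino = first
        while self.in_use(ino):
            ino += 1
        self.mark(ino)
        return ino
```

Freed numbers are handed out again by the next alloc(), which is the
reason for keeping such a bitmap at all.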


Virtual or Vertical Dir
----------------------------------------------------------------------
In order to support multiple layers (branches), the aufs readdir
operation constructs a virtual dir block in memory. For readdir, aufs
calls vfs_readdir() internally for each dir on the branches, merges their
entries while eliminating the whiteout-ed ones, and sets the result on
the file (dir) object. So the file object has its entry list until it is
closed. The entry list will be updated when the file position is zero and
the list has become old. This decision is made in aufs automatically.

The dynamically allocated memory block for the names of entries has a
unit of 512 bytes (by default) and stores the names contiguously (no
padding). Another block for each entry is handled by kmem_cache too.
While building dir blocks, aufs creates a hash list and judges whether
each entry is whiteouted by its upper branch or already listed.
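
The merge itself can be modelled as follows. This is a Python sketch
with plain lists instead of dir blocks and sets instead of the hash
list; the ".wh." prefix used to recognise a whiteout is an assumption
of this example.

```python
WH_PREFIX = ".wh."  # assumption: a whiteout is marked by a name prefix


def merged_readdir(branch_listings):
    """Merge per-branch name lists, given from the upper branch down."""
    seen = set()        # names already listed from an upper branch
    whiteouted = set()  # names hidden by a whiteout on an upper branch
    result = []
    for names in branch_listings:
        new_whiteouts = set()
        for name in names:
            if name.startswith(WH_PREFIX):
                # A whiteout is never shown itself; remember what it hides.
                new_whiteouts.add(name[len(WH_PREFIX):])
                continue
            if name in seen or name in whiteouted:
                continue
            seen.add(name)
            result.append(name)
        # Whiteouts on this branch affect the lower branches only.
        whiteouted |= new_whiteouts
    return result
```

For example, merged_readdir([["a", ".wh.b"], ["b", "c", "a"],
["b", "d"]]) lists a, c and d: "b" is hidden on both lower branches and
the second "a" is a duplicate.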

Some people may say it can be a security hole or invite a DoS attack,
since the opened and once readdir-ed dir (file object) holds its entry
list and puts pressure on system memory. But I'd say it is similar
to files under /proc or /sys. The virtual files in them also hold a
memory page (generally) while they are opened. When an idea to reduce
memory for them is introduced, it will be applied to aufs too.


Workqueue
----------------------------------------------------------------------
Aufs sometimes requires privileged access to a branch, for instance
in a copy-up/down operation. Suppose a user process is going to make
changes to a file which exists on the lower readonly branch only, and
one of its ancestor directories may not be writable by the user
process. Here aufs copies up the file with its ancestors, and they may
require privilege to set their owner/group/mode/etc.
This is a typical case of the application character of aufs (see
Introduction).

Aufs uses a workqueue synchronously for this case. It creates its own
workqueue. The workqueue is a kernel thread and has privilege. Aufs
passes the request to call mkdir or write (for example), and waits for
its completion. This approach solves the problem of signal handlers
simply.
If aufs didn't adopt the workqueue and changed the privilege of the
process instead, and if the mkdir/write call raised SIGXFSZ or another
signal, then the user process might gain a privilege or the generated
core file would be owned by the superuser. But I have a plan to switch to
the new credential approach which will be introduced in linux-2.6.29.
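
The "pass the request and wait" pattern can be shown with an ordinary
thread. This Python sketch only models the control flow: a Python
thread has no extra privilege, and the class and method names are
invented for the example.

```python
import queue
import threading


class SyncWorkqueue:
    """A single worker thread that runs requests on behalf of callers."""

    def __init__(self):
        self._requests = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def _run(self):
        while True:
            func, args, done, box = self._requests.get()
            try:
                box["result"] = func(*args)
            except Exception as exc:
                box["error"] = exc
            finally:
                done.set()  # wake up the waiting caller

    def call_and_wait(self, func, *args):
        """Queue func(*args) for the worker and block until it has run."""
        done = threading.Event()
        box = {}
        self._requests.put((func, args, done, box))
        done.wait()
        if "error" in box:
            raise box["error"]
        return box["result"]
```

The caller never runs the function itself, so nothing about the
calling context changes; only the result (or the error) comes back.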

Aufs also uses the system global workqueue ("events" kernel thread)
for asynchronous tasks, such as handling inotify, re-creating a
whiteout base, etc. This is unrelated to privilege.
Most aufs operations try acquiring a rw_semaphore for the aufs
superblock at the beginning, and at the same time wait for the completion
of all queued asynchronous tasks.


Whiteout
----------------------------------------------------------------------
The whiteout in aufs is very similar to Unionfs's. It is represented
by its filename. UnionMount takes the approach of a file mode, but I am
afraid several utilities (find(1) or something) will have to support it.

Basically the whiteout represents "logical deletion" which stops aufs
from looking up further, but it also represents "dir is opaque" which
also stops lookup.

In aufs, rmdir(2) and rename(2) for a dir use the whiteout in a
different way.
In order to make the several steps within a single systemcall
revertible, aufs adopts the approach of renaming a directory to a
temporary unique whiteouted name.
For example, in rename(2) of a dir where the target dir already exists,
aufs renames the target dir to a temporary unique whiteouted name before
the actual rename on a branch, and then handles the other actions (make
it opaque, update the attributes, etc). If an error happens in these
actions, aufs simply renames the whiteouted name back and returns an
error. If all of them succeed, aufs registers a function with the system
global workqueue to remove the whiteouted unique temporary name
completely and asynchronously.
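
The rename-aside-and-revert idea can be tried on ordinary directories.
In this Python sketch the ".wh." prefix, the uuid-based temporary name
and the function names are assumptions; it also undoes the actual
rename on failure and removes the temporary name immediately, whereas
aufs defers the removal to the workqueue.

```python
import os
import shutil
import tempfile
import uuid

WH_PREFIX = ".wh."  # assumption: whiteouted names carry this prefix


def replace_dir(src, dst, post_action):
    """Rename dir src over the existing dir dst in a revertible way."""
    tmp = os.path.join(os.path.dirname(dst), WH_PREFIX + uuid.uuid4().hex)
    os.rename(dst, tmp)          # 1. move the target aside
    try:
        os.rename(src, dst)      # 2. the actual rename
        try:
            post_action(dst)     # 3. other actions (opaque, attributes, ...)
        except Exception:
            os.rename(dst, src)  # undo step 2
            raise
    except Exception:
        os.rename(tmp, dst)      # undo step 1: rename the target back
        raise
    shutil.rmtree(tmp)           # success: drop the old target for good


def demo():
    with tempfile.TemporaryDirectory() as top:
        src, dst = os.path.join(top, "src"), os.path.join(top, "dst")
        os.mkdir(src)
        os.mkdir(dst)
        open(os.path.join(src, "new"), "w").close()
        open(os.path.join(dst, "old"), "w").close()

        def fail(path):
            raise OSError("simulated failure")

        try:
            replace_dir(src, dst, fail)
        except OSError:
            pass
        after_failure = (sorted(os.listdir(top)), os.listdir(dst))

        replace_dir(src, dst, lambda path: None)
        after_success = (sorted(os.listdir(top)), os.listdir(dst))
        return after_failure, after_success
```

After the failed attempt both directories are back in place with their
original content; after the successful one only dst remains and it
holds the content of the former src.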


Copy-up
----------------------------------------------------------------------
It is a well-known feature or concept.
When a user modifies a file on a readonly branch, aufs performs "copy-up"
internally and makes the change to the new file on the upper writable
branch.
When the triggering systemcall does not update the timestamps of the
parent dir, aufs reverts them after copy-up.


* [RFC 3/8] Aufs2: lookup
  2009-02-23  7:31 [RFC 0/8] Aufs2 documents hooanon05
  2009-02-23  7:33 ` [RFC 1/8] Aufs2: introduction hooanon05
  2009-02-23  7:34 ` [RFC 2/8] Aufs2: structure hooanon05
@ 2009-02-23  7:35 ` hooanon05
  2009-02-23  7:36 ` [RFC 4/8] Aufs2: branch hooanon05
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: hooanon05 @ 2009-02-23  7:35 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel


> This is my second trial to ask incorporating aufs into mainline.


Lookup in a Branch
----------------------------------------------------------------------
Since aufs has the character of a sub-VFS (see Introduction), it performs
lookup for branches as VFS does. It may be heavy work. Generally
speaking, struct nameidata is a big structure and includes much
information. But almost all lookup operations in aufs are the simplest
case, i.e. looking up only an entry directly connected to its parent.
Digging down the directory hierarchy is unnecessary.

VFS has a function lookup_one_len() for that use, but it is not usable
for a branch filesystem which requires struct nameidata. So aufs
implements a simple lookup wrapper function. When a branch filesystem
allows NULL as nameidata, it calls lookup_one_len(). Otherwise it builds
a simplest nameidata and calls lookup_hash().
Here aufs applies "a principle in NFSD", ie. if the filesystem supports
NFS-export, then it has to support NULL as a nameidata parameter for
->create(), ->lookup() and ->d_revalidate(). So the lookup wrapper in
aufs tests if ->s_export_op in the branch is NULL or not.

When a branch is a remote filesystem, aufs trusts its ->d_revalidate().
For d_revalidate, aufs implements three levels of revalidate tests. See
"Revalidate Dentry and UDBA" for details.


Loopback Mount
----------------------------------------------------------------------
Basically aufs supports any type of filesystem and block device for a
branch (actually there are some exceptions). But it is prohibited to add
a loopback mounted one whose backend file exists in a filesystem which is
already added to aufs. The reason is to protect aufs from a recursive
lookup. If it were allowed, the aufs lookup operation might re-enter a
lookup for the loopback mounted branch in the same context, and cause
a deadlock.


Revalidate Dentry and UDBA (User's Direct Branch Access)
----------------------------------------------------------------------
Generally VFS helpers re-validate a dentry as a part of lookup.
0. digging down the directory hierarchy.
1. lock the parent dir by its i_mutex.
2. lookup the final (child) entry.
3. revalidate it.
4. call the actual operation (create, unlink, etc.)
5. unlock the parent dir

If the filesystem implements its ->d_revalidate() (step 3), then it is
called. Aufs does implement it and checks that the dentry on a branch is
still valid.
But it is not enough, because aufs has to release the lock for the
parent dir on a branch at the end of ->lookup() (step 2) and
->d_revalidate() (step 3) while the i_mutex of the aufs dir is still
held by VFS.
If the file on a branch is changed directly, i.e. bypassing aufs, after
aufs released the lock, then the subsequent operation may cause
an unpleasant result.

This situation is a result of the VFS architecture, in which ->lookup()
and ->d_revalidate() are separated. But I am not saying it is wrong. It
is a good design from VFS's point of view. It is just not suitable for
the sub-VFS character of aufs.

Aufs supports such a case by three levels of revalidation, which are
selectable by the user.
1. Simple Revalidate
   In addition to the native flow in VFS, confirm the child-parent
   relationship on the branch just after locking the parent dir on the
   branch in the "actual operation" (step 4). When this validation
   fails, aufs returns EBUSY. ->d_revalidate() (step 3) in aufs still
   checks the validity of the dentry on branches.
2. Monitor Changes Internally by Inotify
   In addition to the above, in the "actual operation" (step 4) aufs
   looks up the dentry on the branch again, and returns EBUSY if it finds
   a different dentry.
   Additionally, aufs sets an inotify watch for every dir on branches
   while it is in cache. When an event is notified, aufs registers a
   function to the kernel 'events' thread by schedule_work(). The
   function sets some special status on the cached aufs dentry and inode
   private data. If they are not cached, then aufs has nothing to
   do. When the same file is accessed through aufs (step 0-3) later,
   aufs will detect the status and refresh all necessary data.
   In this mode, aufs has to ignore the events which are fired by aufs
   itself.
3. No Extra Validation
   This is the simplest mode: it doesn't add any additional revalidation
   test, and skips the revalidation in step 4. It is useful and improves
   aufs performance when the system surely hides the aufs branches from
   users, by over-mounting something (or another method).
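
The difference between the three levels in step 4 can be modelled
roughly as follows. This Python sketch is my own reading of the text
above, not aufs code: a branch is reduced to a dict mapping (parent,
name) to an inode number, the level names and the function precheck()
are invented, and the inotify watch itself is not modelled.

```python
import errno
from collections import namedtuple
from enum import Enum

# What aufs remembers about an entry on a branch (hypothetical layout).
Cached = namedtuple("Cached", "parent name ino")


class Udba(Enum):
    SIMPLE = 1   # level 1: simple revalidate
    INOTIFY = 2  # level 2: level 1 plus re-lookup
    NONE = 3     # level 3: no extra validation


def precheck(level, cached, branch):
    """Return 0, or -EBUSY when the branch no longer matches the cache.

    `branch` maps (parent, name) -> inode number and stands for the
    current state of the branch, possibly changed behind aufs's back.
    """
    if level is Udba.NONE:
        return 0
    # Level 1: is the cached object still a child of the locked parent?
    parents = {p for (p, _name), ino in branch.items() if ino == cached.ino}
    if cached.parent not in parents:
        return -errno.EBUSY
    if level is Udba.INOTIFY:
        # Level 2: look the name up again and compare with the cache.
        if branch.get((cached.parent, cached.name)) != cached.ino:
            return -errno.EBUSY
    return 0
```

In this model a direct rename inside the same dir passes level 1 but
is caught by the re-lookup of level 2, while a move to another dir is
caught by level 1 already. Level 3 accepts both.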


* [RFC 4/8] Aufs2: branch
  2009-02-23  7:31 [RFC 0/8] Aufs2 documents hooanon05
                   ` (2 preceding siblings ...)
  2009-02-23  7:35 ` [RFC 3/8] Aufs2: lookup hooanon05
@ 2009-02-23  7:36 ` hooanon05
  2009-02-23  7:36 ` [RFC 5/8] Aufs2: wbr_policy hooanon05
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: hooanon05 @ 2009-02-23  7:36 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel


> This is my second trial to ask incorporating aufs into mainline.


Branch Manipulation

Since aufs supports dynamic branch manipulation, i.e. adding/removing a
branch and changing its permission/attribute, there is a lot of work to do.


Add a Branch
----------------------------------------------------------------------
o Confirm the adding dir exists outside of aufs, including loopback
  mount.
- and other various attributes...
o Initialize the xino file and whiteout bases if necessary.
  See struct.txt.

o Check the owner/group/mode of the directory
  When the owner/group/mode of the directory being added differs from
  that of the existing branch, aufs issues a warning because it may pose
  a security risk.
  For example, when an upper writable branch has a world-writable empty
  top directory, a malicious user can create any file on the writable
  branch directly, as if it had been copied up and modified manually. If
  something like /etc/{passwd,shadow} exists on the lower readonly branch
  but not on the upper writable branch, and the writable branch is
  world-writable, then a malicious user may create /etc/passwd on the
  writable branch directly, and the tampered file will be valid in aufs.
  I am afraid it can be a security issue, but aufs can do nothing about
  it except produce a warning.
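
The comparison described above can be approximated from user space. This
is only a sketch in Python, not aufs code: the helper names
branch_attr_mismatch() and is_world_writable(), and the temporary "ro"
and "rw" directories standing in for branches, are made up for the
illustration.

```python
import os
import stat
import tempfile


def branch_attr_mismatch(existing_dir, adding_dir):
    """Return which of owner/group/mode differ between two directories."""
    a, b = os.stat(existing_dir), os.stat(adding_dir)
    diffs = []
    if a.st_uid != b.st_uid:
        diffs.append("owner")
    if a.st_gid != b.st_gid:
        diffs.append("group")
    if stat.S_IMODE(a.st_mode) != stat.S_IMODE(b.st_mode):
        diffs.append("mode")
    return diffs


def is_world_writable(path):
    """True when 'other' has write permission on the directory."""
    return bool(os.stat(path).st_mode & stat.S_IWOTH)


with tempfile.TemporaryDirectory() as top:
    ro = os.path.join(top, "ro")   # stand-in for the existing branch
    rw = os.path.join(top, "rw")   # stand-in for the branch being added
    os.mkdir(ro)
    os.mkdir(rw)
    os.chmod(ro, 0o755)
    os.chmod(rw, 0o777)            # world-writable top directory
    mismatch = branch_attr_mismatch(ro, rw)
    risky = is_world_writable(rw)
    if mismatch:
        print("warning: new branch differs in", ", ".join(mismatch))
```

Like aufs, the sketch only reports the difference. It does not try to
decide whether the difference is intended.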


Delete a Branch
----------------------------------------------------------------------
o Confirm the branch being deleted is not busy
  Generally speaking, there is one merit in adopting the "remount"
  interface to manipulate branches: it discards caches. When deleting a
  branch, aufs checks the dentries and inodes which are still cached
  (and connected). If there are any, then they are all in use. An inode
  can stay alive alone, without its corresponding dentry (for example,
  in the inotify case).

  For a cached one, aufs checks whether an entry with the same name
  exists on the other branches.
  If the cached one is a directory, then, because aufs provides a merged
  view to users, aufs can still show the dir to users as long as one dir
  is left on any branch. In this case, the branch can be removed from
  aufs. Otherwise aufs rejects deleting the branch.

  If any file on the branch being deleted is opened by aufs, then aufs
  rejects the deletion.
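
The busy test above can be modelled with a toy cache. This is a sketch
under assumptions, not the aufs implementation: CachedEntry and
can_delete_branch() are hypothetical names, and branches are identified
by small integers.

```python
from dataclasses import dataclass, field


@dataclass
class CachedEntry:
    name: str
    is_dir: bool
    branches: set                                # branches holding this name
    open_on: set = field(default_factory=set)    # branches with an open file


def can_delete_branch(bindex, cache):
    """Apply the rules from the text to every cached entry."""
    for entry in cache:
        if bindex in entry.open_on:
            return False          # a file on the branch is opened by aufs
        if bindex not in entry.branches:
            continue              # entry does not involve this branch
        if entry.is_dir and (entry.branches - {bindex}):
            continue              # the dir is still visible via another branch
        return False              # cached and only reachable via this branch
    return True


cache = [
    CachedEntry("dirA", is_dir=True, branches={0, 1}),
    CachedEntry("fileB", is_dir=False, branches={0}),
]
print(can_delete_branch(1, cache))   # dirA survives on branch 0
print(can_delete_branch(0, cache))   # fileB is cached and lives on branch 0
```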


Modify the Permission of a Branch
----------------------------------------------------------------------
o Re-initialize or remove the xino file and whiteout bases if necessary.
  See struct.txt.

o rw --> ro: Confirm the branch being modified is not busy
  Aufs rejects the request if any of these conditions is true.
  - a file on the branch is mmap-ed.
  - a regular file on the branch is opened for write and there is no
    entry with the same name on an upper branch.
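
The two conditions can be written down as a small predicate. This is a
sketch only: OpenFile and can_make_readonly() are hypothetical names, and
"upper branch" is assumed to mean any branch with a smaller index.

```python
from dataclasses import dataclass


@dataclass
class OpenFile:
    name: str
    bindex: int                        # branch the opened file lives on
    writable: bool = False
    mmapped: bool = False
    exists_on: frozenset = frozenset() # branches where the name exists


def can_make_readonly(bindex, open_files):
    """Return False if the rw --> ro request has to be rejected."""
    for f in open_files:
        if f.bindex != bindex:
            continue
        if f.mmapped:
            return False               # a file on the branch is mmap-ed
        has_upper = any(b < bindex for b in f.exists_on)
        if f.writable and not has_upper:
            return False               # opened for write, no upper entry
    return True


files = [OpenFile("log", bindex=1, writable=True, exists_on=frozenset({1}))]
print(can_make_readonly(1, files))
```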


* [RFC 5/8] Aufs2: wbr_policy
  2009-02-23  7:31 [RFC 0/8] Aufs2 documents hooanon05
                   ` (3 preceding siblings ...)
  2009-02-23  7:36 ` [RFC 4/8] Aufs2: branch hooanon05
@ 2009-02-23  7:36 ` hooanon05
  2009-02-23  7:37 ` [RFC 6/8] Aufs2: fmode_exec hooanon05
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: hooanon05 @ 2009-02-23  7:36 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel


> This is my second trial to ask incorporating aufs into mainline.


Policies to Select One among Multiple Writable Branches
----------------------------------------------------------------------
When there is more than one writable branch, aufs has to decide the
target branch for file creation or copy-up. By default, the highest
writable branch which has the parent (or ancestor) dir of the target
file is chosen (top-down-parent policy).
At users' request, aufs implements some other policies to select the
writable branch: for file creation, the round-robin and most-free-space
policies; for copy-up, the top-down-parent, bottom-up-parent and
bottom-up policies.

As expected, the round-robin policy selects the branches in circular
order. When you have two writable branches and create 10 new files, 5
files will be created on each branch. The mkdir(2) systemcall is an
exception. When you create 10 new directories, all of them will be
created on the same branch.
The most-free-space policy selects the one which has the most free
space among the writable branches. The amount of free space is checked
by aufs internally, and users can specify the time interval of the check.
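
As an illustration of the two creation policies for regular files, here
is a sketch in Python. It is not aufs code: the function names, the
branch paths and the free-space numbers are invented, and the mkdir(2)
exception is not modelled.

```python
import itertools


def round_robin(writable_branches):
    """Return an iterator which hands out the branches in circular order."""
    return itertools.cycle(writable_branches)


def most_free_space(free_bytes):
    """Pick the writable branch with the largest amount of free space."""
    return max(free_bytes, key=free_bytes.get)


picker = round_robin(["/rw0", "/rw1"])
placed = [next(picker) for _ in range(10)]   # 10 new files
print(placed.count("/rw0"), placed.count("/rw1"))

print(most_free_space({"/rw0": 100, "/rw1": 300}))
```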

The policies for copy-up are simpler:
- top-down-parent is equivalent to the policy of the same name for
  creation,
- bottom-up-parent selects the nearest writable branch above the
  copyup-source where the parent dir exists,
- bottom-up selects the nearest writable branch above the copyup-source,
  regardless of the existence of the parent dir.
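
The three copy-up policies differ only in the search direction and in
whether the parent dir is required. The sketch below makes two
assumptions which are not stated in the text: branches are listed from
the top (index 0) downwards, and only branches above the copyup-source
are searched. All names are hypothetical.

```python
def top_down_parent(branches, src):
    """Highest writable branch above src which has the parent dir."""
    for i in range(src):
        if branches[i]["writable"] and branches[i]["has_parent"]:
            return i
    return None


def bottom_up_parent(branches, src):
    """Nearest writable branch above src which has the parent dir."""
    for i in range(src - 1, -1, -1):
        if branches[i]["writable"] and branches[i]["has_parent"]:
            return i
    return None


def bottom_up(branches, src):
    """Nearest writable branch above src, parent dir or not."""
    for i in range(src - 1, -1, -1):
        if branches[i]["writable"]:
            return i
    return None


branches = [
    {"path": "/rw0", "writable": True, "has_parent": True},
    {"path": "/rw1", "writable": True, "has_parent": True},
    {"path": "/rw2", "writable": True, "has_parent": False},
    {"path": "/ro", "writable": False, "has_parent": True},  # copyup-source
]
src = 3
print(top_down_parent(branches, src),
      bottom_up_parent(branches, src),
      bottom_up(branches, src))
```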

There are some rules or exceptions in applying these policies.
- If there is a readonly branch above the policy-selected branch and
  the parent dir is marked as opaque (a variation of whiteout), or the
  target (being created) file is whiteouted on the upper readonly
  branch, then the result of the policy is ignored and the target file
  will be created on the nearest writable branch above that readonly
  branch.
- If there is a writable branch above the policy-selected branch and
  the parent dir is marked as opaque or the target file is whiteouted
  on the branch, then the result of the policy is ignored and the target
  file will be created on the highest one among the upper writable
  branches which have the diropq or whiteout. In case of whiteout, aufs
  removes it as usual.
- The link(2) and rename(2) systemcalls are exceptions in every policy.
  They try to select the branch where the source exists whenever
  possible, since copying up a large file takes a long time. If that is
  not possible, i.e. the branch where the source exists is readonly,
  then they follow the copyup policy.
- There is an exception for rename(2) when the target exists.
  If the rename target exists, aufs compares the indices of the branches
  where the source and the target exist and selects the higher
  one. If the selected branch is readonly, then aufs follows the
  copyup policy.
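
The last rule, rename(2) onto an existing target, is small enough to
sketch. Assumptions for the illustration: a smaller index means a higher
branch, and copyup_policy is a hypothetical callable standing in for
whichever copy-up policy is configured.

```python
def rename_branch(src_idx, dst_idx, writable, copyup_policy):
    """Choose the branch for rename(2) when the target already exists."""
    chosen = min(src_idx, dst_idx)     # the higher of the two branches
    if writable[chosen]:
        return chosen
    return copyup_policy(chosen)       # readonly: fall back to copy-up


writable = [True, True, False]         # branch 2 is readonly
print(rename_branch(2, 1, writable, lambda idx: 0))
print(rename_branch(2, 2, writable, lambda idx: 0))
```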


* [RFC 6/8] Aufs2: fmode_exec
  2009-02-23  7:31 [RFC 0/8] Aufs2 documents hooanon05
                   ` (4 preceding siblings ...)
  2009-02-23  7:36 ` [RFC 5/8] Aufs2: wbr_policy hooanon05
@ 2009-02-23  7:37 ` hooanon05
  2009-02-23  7:37 ` [RFC 7/8] Aufs2: mmap hooanon05
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: hooanon05 @ 2009-02-23  7:37 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel


> This is my second trial to ask incorporating aufs into mainline.


FMODE_EXEC and deny_write()
----------------------------------------------------------------------
Generally, Unix prevents a file which is being executed from being
written to.
In linux it is implemented by deny_write() and allow_write().
When a file is executed by the exec() family, open_exec() (and
sys_uselib()) opens the file and calls deny_write(). If the file is
aufs's virtual one, this has no meaning. The file for which deny_write()
is really necessary is the file on a branch. But the FMODE_EXEC flag is
not passed to the ->open() operation. So aufs adopts a dirty trick.

- in order to get FMODE_EXEC, aufs ->lookup() and ->d_revalidate() set
  nd->intent.open.file->private_data to nd->intent.open.flags temporarily.
- in aufs ->open(), when FMODE_EXEC is set in file->private_data, it
  calls deny_write() for the file on a branch.
- when the aufs file is released, allow_write() for the file on a branch
  is called.
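
To show why the deny/allow pair has to be applied to the file on a
branch, here is a simplified user-space model of the write counter which
the kernel keeps per inode (i_writecount). The Python names only mirror
the kernel helpers. This is a sketch, not kernel or aufs code, and the
real functions return error codes such as ETXTBSY instead of booleans.

```python
class Inode:
    def __init__(self):
        # > 0: number of writers, < 0: number of deny-write holders
        self.writecount = 0


def get_write_access(inode):
    """Open for write; refused while someone denies writing."""
    if inode.writecount < 0:
        return False
    inode.writecount += 1
    return True


def deny_write_access(inode):
    """Called when the file is executed; refused while writers exist."""
    if inode.writecount > 0:
        return False
    inode.writecount -= 1
    return True


def allow_write_access(inode):
    """Undo one deny_write_access(), e.g. when the aufs file is released."""
    inode.writecount += 1


branch_inode = Inode()                               # the file on a branch
exec_ok = deny_write_access(branch_inode)            # aufs ->open(), FMODE_EXEC
write_during_exec = get_write_access(branch_inode)   # a direct writer
allow_write_access(branch_inode)                     # aufs file released
write_after_release = get_write_access(branch_inode)
print(exec_ok, write_during_exec, write_after_release)
```

If the counter were kept only on the aufs virtual inode, a writer opening
the file on the branch directly would not be stopped.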


* [RFC 7/8] Aufs2: mmap
  2009-02-23  7:31 [RFC 0/8] Aufs2 documents hooanon05
                   ` (5 preceding siblings ...)
  2009-02-23  7:37 ` [RFC 6/8] Aufs2: fmode_exec hooanon05
@ 2009-02-23  7:37 ` hooanon05
  2009-02-23  9:18   ` Tomas M
  2009-02-23  7:38 ` [RFC 8/8] Aufs2: plan hooanon05
  2009-02-25 17:50 ` [RFC 0/8] Aufs2 documents David P. Quigley
  8 siblings, 1 reply; 42+ messages in thread
From: hooanon05 @ 2009-02-23  7:37 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel


> This is my second trial to ask incorporating aufs into mainline.


mmap(2) -- File Memory Mapping
----------------------------------------------------------------------
In aufs, the file-mapped pages are shared between the file on a branch
and the virtual one in aufs by overriding vm_operation, particularly
->fault().

In aufs_mmap(),
- get and store vm_ops of the real file on a branch.
- map the file of aufs by generic_file_mmap() and set aufs's vm
  operations.

In aufs_fault(),
- get the file of aufs from the passed vma, sleep if needed.
- get the real file on a branch from the aufs file.
- a race may happen, for instance in a multithreaded library, so some
  locking is implemented.
- call ->fault() in the previously stored vm_ops, with vm_file set to
  the real file on a branch.
- restore vm_file and wake up anyone else who went to sleep.

When a branch is added to or deleted from aufs, a file with the same
name may be unveiled, and its contents will be replaced by the new one
when a process calls read(2) through a previously opened file.
(Some users may not want to refresh the filedata. For such users, I
have a plan to implement a mount option 'refrof' which decides whether
to refresh the opened files or not. See plan.txt too.)
In this case, an already mapped file will not be updated, since the
contents are already a part of the process and they should not be changed
by aufs branch manipulation. Of course, if the branch being deleted has a
busy file, it cannot be deleted from the union.

Unionfs used to take an approach in which the memory pages mapped to
filedata are copied from the lower (real) file into Unionfs's virtual
one and handled by address_space operations. Recently Unionfs changed
to the approach which aufs has adopted since Jul 2006.


* [RFC 8/8] Aufs2: plan
  2009-02-23  7:31 [RFC 0/8] Aufs2 documents hooanon05
                   ` (6 preceding siblings ...)
  2009-02-23  7:37 ` [RFC 7/8] Aufs2: mmap hooanon05
@ 2009-02-23  7:38 ` hooanon05
  2009-02-25 17:50 ` [RFC 0/8] Aufs2 documents David P. Quigley
  8 siblings, 0 replies; 42+ messages in thread
From: hooanon05 @ 2009-02-23  7:38 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel


> This is my second trial to ask incorporating aufs into mainline.


Plan

Restoring some features which were implemented in aufs1.
They were dropped in aufs2 in order to make the source files simpler and
easier to review.


Export Aufs via NFS
----------------------------------------------------------------------
Here is the approach I adopted in aufs1.
- like xino/xib, add a new file 'xigen' which stores aufs inode
  generation.
- iget_locked(): initialize aufs inode generation for a new inode, and
  store it in xigen file.
- destroy_inode(): increment aufs inode generation and store it in xigen
  file. it is necessary even if it is not unlinked, because any data of
  inode may be changed by UDBA.
- encode_fh(): for a root dir, simply return FILEID_ROOT. otherwise
  build file handle by
  + branch id (4 bytes)
  + superblock generation (4 bytes)
  + inode number (4 or 8 bytes)
  + parent dir inode number (4 or 8 bytes)
  + inode generation (4 bytes)
  + return value of exportfs_encode_fh() for the parent on a branch (4
    bytes)
  + file handle for a branch (by exportfs_encode_fh())
- fh_to_dentry():
  + find the index of a branch from its id in the handle, and check that
    it still exists in aufs.
  + 1st level: get the inode number from the handle and search for it in
    the cache.
  + 2nd level: if not found, get the parent inode number from the handle
    and search for it in the cache. Then open the parent dir, find the
    matching inode number by vfs_readdir() and get its name, and call
    lookup_one_len() for the target dentry.
  + 3rd level: if the parent dir is not cached, call
    exportfs_decode_fh() for a branch and get the parent on a branch,
    build a pathname of it, convert it into a pathname in aufs, and call
    path_lookup(). Now aufs has a parent dir dentry, and handles it as
    in the 2nd level.
  + to open the dir, aufs needs struct vfsmount. aufs keeps a vfsmount
    for every branch, but not for itself. To get it, (currently) aufs
    searches the current->nsproxy->mnt_ns list. It may not be a good
    idea, but I could not find another approach.
  + test the generation of the obtained inode.
- every inode operation: they may get EBUSY due to UDBA. in this case,
  convert it into ESTALE for NFSD.
- readdir(): call lockdep_on/off() because filldir in NFSD calls
  lookup_one_len(), vfs_getattr(), encode_fh() and others.
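
To make the handle layout in encode_fh() concrete, here is a sketch which
packs the listed fields in the listed order with Python's struct module.
The byte order (little-endian), the absence of padding, the field names
and the sample values are all assumptions made for the illustration. The
text above only gives the fields and their sizes.

```python
import struct


def _header_format(ino_bytes):
    ino_fmt = {4: "I", 8: "Q"}[ino_bytes]   # 4 or 8 byte inode numbers
    # branch id, sb generation, ino, parent ino, ino generation,
    # return value of exportfs_encode_fh() for the parent on a branch
    return "<II" + ino_fmt * 2 + "II"


def encode_fh(branch_id, sb_gen, ino, parent_ino, ino_gen,
              branch_fh_type, branch_fh, ino_bytes=8):
    """Build the aufs part of the handle and append the branch handle."""
    header = struct.pack(_header_format(ino_bytes), branch_id, sb_gen,
                         ino, parent_ino, ino_gen, branch_fh_type)
    return header + bytes(branch_fh)


def decode_fh(handle, ino_bytes=8):
    """Split a handle into the header fields and the branch handle."""
    fmt = _header_format(ino_bytes)
    size = struct.calcsize(fmt)
    return struct.unpack(fmt, handle[:size]), handle[size:]


fh = encode_fh(1, 7, 999, 500, 3, 1, b"\x6f\x00\x00\x00")
fields, branch_part = decode_fh(fh)
print(len(fh), fields)
```

fh_to_dentry() would start from such fields: the branch id to find the
branch, then the inode number and the parent inode number for the 1st and
2nd level lookups.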


Test Only the Highest One for the Directory Permission (dirperm1 option)
----------------------------------------------------------------------
Let's try a case study.
- aufs has two branches, upper readwrite and lower readonly.
  /au = /rw + /ro
- "dirA" exists under /ro, but not under /rw, and its mode is 0700.
- a user invokes "chmod a+rx /au/dirA"
- does "dirA" then become world readable?

In this case, /ro/dirA is still 0700 since it exists on a readonly branch,
or it may be a natively readonly filesystem. If aufs respects the lower
branch, it should not respond to readdir requests from other users. But
the user allowed it by chmod. Should aufs really refuse to show the
entries under /ro/dirA?

To be honest, I don't have the best solution for this case. So I
implemented the 'dirperm1' and 'nodirperm1' options in aufs1, and left
the choice to users.
When dirperm1 is specified, aufs checks only the highest one for the
directory permission, and shows the entries. Otherwise, as usual, aufs
checks every dir existing on all branches and rejects the request.
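
The difference between the two options can be sketched for the case
study. Assumptions for the illustration: the chmod left a 0755 copy of
dirA on /rw while /ro/dirA stays 0700, only the r-x bits for "other" are
tested, and other_may_list() is a made-up name.

```python
def other_may_list(dir_modes, dirperm1):
    """dir_modes lists the mode of the dir on each branch, top first."""
    need = 0o005                                   # r-x for "other"
    to_check = dir_modes[:1] if dirperm1 else dir_modes
    return all((mode & need) == need for mode in to_check)


dir_a = [0o755, 0o700]     # /rw/dirA after chmod, /ro/dirA unchanged
print(other_may_list(dir_a, dirperm1=True))
print(other_may_list(dir_a, dirperm1=False))
```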

As a side effect, the dirperm1 option improves the performance of aufs
because the number of permission checks is reduced.


Show Whiteout Mode (shwh)
----------------------------------------------------------------------
Generally aufs hides the name of whiteouts. But in some cases, to show
them is very useful for users. For instance, creating a new middle layer
(branch) by merging existing layers.

(borrowing aufs1 HOW-TO from a user, Michael Towers)
When you have three branches,
- Bottom: 'system', squashfs (underlying base system), read-only
- Middle: 'mods', squashfs, read-only
- Top: 'overlay', ram (tmpfs), read-write

The top layer is loaded at boot time and saved at shutdown, to preserve
the changes made to the system during the session.
When larger changes have been made, or smaller changes have accumulated,
the size of the saved top layer data grows. At this point, it would be
nice to be able to merge the two overlay branches ('mods' and 'overlay')
and rewrite the 'mods' squashfs, clearing the top layer and thus
restoring save and load speed.

This merging is simplified by the use of another aufs mount, of just the
two overlay branches using the 'shwh' option.
# mount -t aufs -o ro,shwh,br:/livesys/overlay=ro+wh:/livesys/mods=rr+wh \
	aufs /livesys/merge_union

A merged view of these two branches is then available at
/livesys/merge_union, and the new feature is that the whiteouts are
visible!
Note that in 'shwh' mode the aufs mount must be 'ro', which will disable
writing to all branches. Also the default mode for all branches is 'ro'.
It is now possible to save the combined contents of the two overlay
branches to a new squashfs, e.g.:
# mksquashfs /livesys/merge_union /path/to/newmods.squash

This new squashfs archive can be stored on the boot device and the
initramfs will use it to replace the old one at the next boot.


Being Another Aufs's Readonly Branch (robr)
----------------------------------------------------------------------
Aufs1 allows aufs to be another aufs's readonly branch.
This feature was developed at a user's request. But it may not be used
currently.


Copy-up on Open (coo=)
----------------------------------------------------------------------
By default the internal copy-up is executed only when it is really
necessary. It is not done when a file is opened for writing, but when
write(2) is called. Users who have many (over 100) branches want to know
and analyse when and which files are copied up. Inserting a new upper
branch which contains only such files may improve the performance of aufs.

Aufs1 implemented "coo=none | leaf | all" option.


Refresh the Opened File (refrof)
----------------------------------------------------------------------
This option is implemented in aufs1 but incomplete.

When a user reads from a file, he generally expects to get its latest
filedata. If the file is removed and a new file with the same name is
created, the content he gets is unchanged, i.e. the unlinked filedata.

Let's try a case study again.
- aufs has two branches.
  /au = /rw + /ro
- "fileA" exists under /ro, but not under /rw.
- user opened "/au/fileA".
- he or someone else inserts a branch (/new) between /rw and /ro.
  /au = /rw + /new + /ro
- the new branch has "fileA".
- user reads from the opened "fileA"
- which filedata should aufs return, from /ro or /new?

Some people say it has to be "from /ro", because that is the Unix
semantics. Others say it should be "from /new", because the file is not
removed and it is equivalent to the case where someone else modifies the
file.

Here again I don't have a best and final answer. I got an idea to
implement the 'refrof' and 'norefrof' options. When 'refrof' (REFResh the
Opened File) is specified (the default), aufs returns the filedata from
/new.
Otherwise from /ro.
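
The choice in the case study can be sketched as follows. This is only an
illustration of the planned option, not aufs code: branch_for_read() is a
hypothetical name, and with 'norefrof' the data is assumed to keep coming
from the branch where the file was opened (/ro in the case study).

```python
def branch_for_read(opened_on, branches_with_file, refrof):
    """branches_with_file lists, top first, the branches holding the name."""
    if refrof:
        return branches_with_file[0]   # topmost file after the change
    return opened_on                   # keep the file which was opened


# /au = /rw + /new + /ro, and "fileA" now exists on /new and /ro
holders = ["/new", "/ro"]
print(branch_for_read("/ro", holders, refrof=True))
print(branch_for_read("/ro", holders, refrof=False))
```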


* Re: [RFC 2/8] Aufs2: structure
  2009-02-23  7:34 ` [RFC 2/8] Aufs2: structure hooanon05
@ 2009-02-23  9:13   ` Tomas M
  2009-02-23  9:22     ` Tomas M
  2009-02-23 14:23     ` [RFC 2/8] Aufs2: structure hooanon05
  0 siblings, 2 replies; 42+ messages in thread
From: Tomas M @ 2009-02-23  9:13 UTC (permalink / raw)
  To: hooanon05; +Cc: linux-fsdevel, linux-kernel

> The struct of a xino file is simple, just a sequence of aufs inode
> numbers which is indexed by the lower inode number.
> In the above sample, assume the inode number of /ro/fileA is i111 and
> aufs assigns the inode number i999 for fileA. Then aufs writes 999 as
> 4(8) bytes at 111 * 4(8) bytes offset in the xino file.

I think it is worth mentioning that the xino file, if I understand it correctly, is a 'sparse file', that means it is full of 'holes' and doesn't consume as much disk space as it might appear.

In my opinion, the current xino-file approach is not much usable on filesystems which do not support sparse files (for example, if you wish to union two vfats), since some 'seeks' would probably write a lot of nulls. But I am not any kernel developer so I don't even know if there exists any filesystem which would be unable to support sparse files (except the mentioned VFAT, of course).


Tomas M
slax.org



* Re: [RFC 7/8] Aufs2: mmap
  2009-02-23  7:37 ` [RFC 7/8] Aufs2: mmap hooanon05
@ 2009-02-23  9:18   ` Tomas M
  2009-02-23 14:39     ` hooanon05
  0 siblings, 1 reply; 42+ messages in thread
From: Tomas M @ 2009-02-23  9:18 UTC (permalink / raw)
  To: hooanon05; +Cc: linux-fsdevel, linux-kernel

> In Unionfs, it took an approach which the memory pages mapped to
> filedata are copied from the lower (real) file into the Unionfs's
> virtual one and handles it by address_space operations. Recently Unionfs
> changed it to this approach which aufs adopted since Jul 2006.

I think there are much more areas where people around unionfs actually get the good ideas from AUFS. But it appears most projects already switched from StonyBrook's unionfs to AUFS, since AUFS proves to be better for many years.


Tomas M
slax.org


* Re: [RFC 2/8] Aufs2: structure
  2009-02-23  9:13   ` Tomas M
@ 2009-02-23  9:22     ` Tomas M
  2009-02-24  8:13       ` New filesystem for Linux kernel Tomas M
  2009-02-23 14:23     ` [RFC 2/8] Aufs2: structure hooanon05
  1 sibling, 1 reply; 42+ messages in thread
From: Tomas M @ 2009-02-23  9:22 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

> The struct of a xino file is simple, just a sequence of aufs inode
> numbers which is indexed by the lower inode number.
> In the above sample, assume the inode number of /ro/fileA is i111 and
> aufs assigns the inode number i999 for fileA. Then aufs writes 999 as
> 4(8) bytes at 111 * 4(8) bytes offset in the xino file.

I think it is worth mentioning that the xino file, if I understand it correctly, is a 'sparse file', that means it is full of 'holes' and doesn't consume as much disk space as it might appear.

In my opinion, the current xino-file approach is not much usable on filesystems which do not support sparse files (for example, if you wish to union two vfats), since some 'seeks' would probably write a lot of nulls. But I am not any kernel developer so I don't even know if there exists any filesystem which would be unable to support sparse files (except the mentioned VFAT, of course).


Tomas M
slax.org


* Re: [RFC 2/8] Aufs2: structure
  2009-02-23  9:13   ` Tomas M
  2009-02-23  9:22     ` Tomas M
@ 2009-02-23 14:23     ` hooanon05
  1 sibling, 0 replies; 42+ messages in thread
From: hooanon05 @ 2009-02-23 14:23 UTC (permalink / raw)
  To: Tomas M; +Cc: linux-fsdevel, linux-kernel


Tomas M:
> > The struct of a xino file is simple, just a sequence of aufs inode
> > numbers which is indexed by the lower inode number.
> > In the above sample, assume the inode number of /ro/fileA is i111 and
> > aufs assigns the inode number i999 for fileA. Then aufs writes 999 as
> > 4(8) bytes at 111 * 4(8) bytes offset in the xino file.
> 
> I think it is worth mentioning that the xino file, if I understand it correctly, is a 'sparse file', that means it is full of 'holes' and doesn't consume as much disk space as it might appear.

That is right.
Thank you for pointing out.


> In my opinion, the current xino-file approach is not much usable on filesystems which do not support sparse files (for example, if you wish to union two vfats), since some 'seeks' would probably write a lot of nulls. But I am not any kernel developer so I don't even know if there exists any filesystem which would be unable to support sparse files (except the mentioned VFAT, of course).

Aufs creates the xino files on the first writable branch. If that
consumes disk space for holes, then the aufs mount option 'xino=<path>'
may be useful.

I will update my local documents.
Thank you.


J. R. Okajima


* Re: [RFC 7/8] Aufs2: mmap
  2009-02-23  9:18   ` Tomas M
@ 2009-02-23 14:39     ` hooanon05
  0 siblings, 0 replies; 42+ messages in thread
From: hooanon05 @ 2009-02-23 14:39 UTC (permalink / raw)
  To: Tomas M; +Cc: linux-fsdevel, linux-kernel


Tomas M:
> > In Unionfs, it took an approach which the memory pages mapped to
> > filedata are copied from the lower (real) file into the Unionfs's
> > virtual one and handles it by address_space operations. Recently Unionfs
> > changed it to this approach which aufs adopted since Jul 2006.
> 
> I think there are much more areas where people around unionfs actually get the good ideas from AUFS. But it appears most projects already switched from StonyBrook's unionfs to AUFS, since AUFS proves to be better for many years.

Yes, AUFS has been used by many people for many years.
And several people asked me to put it into mainline. Unfortunately I am
not the one who decides it. I can only post it and ask.


J. R. Okajima


* New filesystem for Linux kernel
  2009-02-23  9:22     ` Tomas M
@ 2009-02-24  8:13       ` Tomas M
  2009-02-24 11:52         ` Miklos Szeredi
  2009-02-24 14:15         ` Theodore Tso
  0 siblings, 2 replies; 42+ messages in thread
From: Tomas M @ 2009-02-24  8:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

An overview of aufs2 has been submitted to this list.
I noticed zero response at all. Nobody cares?

I suggest to remove unionfs from Andrew's -mm tree and replace it by aufs2!
Tell me why this should not happen...

I write this in the hope that a debate will start...
Thank you

Tomas M
slax.org


* Re: New filesystem for Linux kernel
  2009-02-24  8:13       ` New filesystem for Linux kernel Tomas M
@ 2009-02-24 11:52         ` Miklos Szeredi
  2009-02-24 13:18           ` hooanon05
  2009-02-24 14:15         ` Theodore Tso
  1 sibling, 1 reply; 42+ messages in thread
From: Miklos Szeredi @ 2009-02-24 11:52 UTC (permalink / raw)
  To: tomas; +Cc: akpm, linux-fsdevel, linux-kernel

On Tue, 24 Feb 2009, Tomas M wrote:
> An overview of aufs2 has been submitted to this list.
> I noticed zero response at all. Nobody cares?
> 
> I suggest to remove unionfs from Andrew's -mm tree and replace it by aufs2!

Perhaps you should ask Andrew?

> Tell me why this should not happen...

I think the biggest problem is too many features.

   > git diff master...aufs2 | diffstat
    ...
    73 files changed, 23527 insertions(+), 7 deletions(-)
                      ^^^^^^
This is an unreviewable amount of code, it would make AUFS one of the
biggest filesystems on linux.

The first step would be to separate out the very core functionality,
which should be a couple thousands of lines max.  And when that has
been accepted and stabilised, then you can start adding fancy
features.

Miklos


* Re: New filesystem for Linux kernel
  2009-02-24 11:52         ` Miklos Szeredi
@ 2009-02-24 13:18           ` hooanon05
  2009-02-24 13:45             ` Tarkan Erimer
  2009-02-24 14:50             ` Miklos Szeredi
  0 siblings, 2 replies; 42+ messages in thread
From: hooanon05 @ 2009-02-24 13:18 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: tomas, akpm, linux-fsdevel, linux-kernel


Miklos Szeredi:
> I think the biggest problem is too many features.
> 
>    > git diff master...aufs2 | diffstat
>     ...
>     73 files changed, 23527 insertions(+), 7 deletions(-)
>                       ^^^^^^
> This is an unreviewable amount of code, it would make AUFS one of the
> biggest filesystems on linux.
> 
> The first step would be to separate out the very core functionality,
> which should be a couple thousands of lines max.  And when that has
> been accepted and stabilised, then you can start adding fancy
> features.

I have to admit aufs is big, but actually, as I wrote in the documents,
aufs2 has already dropped several features. And I believe what remains
is the core feature set. If aufs2 drops some more features, then both
users and reviewers will say it doesn't work in this case or in that
case. I don't think you would like to review code which is unusable in
the real world.
For those who want to begin with the aufs2 principles (or basic
architecture), I wrote and posted these documents.

Actually ocfs2 and xfs are much bigger.
If they have been reviewed, I'd ask you to review aufs2 too.
$ cd linux/fs
$ for i in *; do test -d $i && echo -n $i && find $i -type f | xargs wc -l | tail -1; done | tr -s '[[:blank:]]' | sort -n -k 2 | tail
nfs 30363 total
ntfs 31346 total
jfs 33056 total
cifs 34059 total
ubifs 34721 total
btrfs 43417 total
aufs 46325 total
nls 54855 total
ocfs2 71294 total
xfs 102144 total


J. R. Okajima


* Re: New filesystem for Linux kernel
  2009-02-24 13:18           ` hooanon05
@ 2009-02-24 13:45             ` Tarkan Erimer
  2009-02-24 13:57               ` hooanon05
  2009-02-24 14:50             ` Miklos Szeredi
  1 sibling, 1 reply; 42+ messages in thread
From: Tarkan Erimer @ 2009-02-24 13:45 UTC (permalink / raw)
  To: hooanon05; +Cc: Miklos Szeredi, tomas, akpm, linux-fsdevel, linux-kernel

hooanon05@yahoo.co.jp wrote:
> Miklos Szeredi:
>   
>> I think the biggest problem is too many features.
>>
>>    > git diff master...aufs2 | diffstat
>>     ...
>>     73 files changed, 23527 insertions(+), 7 deletions(-)
>>                       ^^^^^^
>> This is an unreviewable amount of code, it would make AUFS one of the
>> biggest filesystems on linux.
>>
>> The first step would be to separate out the very core functionality,
>> which should be a couple thousands of lines max.  And when that has
>> been accepted and stabilised, then you can start adding fancy
>> features.
>>     
>
> I have to admit aufs is big, but actually, as I wrote in the documents,
> aufs2 has already dropped several features. And I believe it is the core
> feature set. If aufs2 drops some more features, then both of users and
> reviewers will say it doesn't work in this case, in that case. I don't
> think you would like to review such unusable code in real world.
> For those who wants to begin with aufs2 principle (or basic
> architecture), I described and posted these documents.
>
>   
I think there is a misunderstanding or confusion about merging the code
into the mainline. You needn't drop functionality/features to make your
code small enough to be reviewed by mergers. You have to separate your
code (core FS functions, features etc.) into small pieces, so that
mergers will look first at your core FS functionality code to see
whether it *breaks* or *touches* any *other areas of the kernel*. If
everything goes well, your core FS functionality code will be merged
into the mainline. After that, you can send your feature/functionality
code piece by piece.





* Re: New filesystem for Linux kernel
  2009-02-24 13:45             ` Tarkan Erimer
@ 2009-02-24 13:57               ` hooanon05
  2009-02-24 14:16                 ` Tarkan Erimer
  0 siblings, 1 reply; 42+ messages in thread
From: hooanon05 @ 2009-02-24 13:57 UTC (permalink / raw)
  To: Tarkan Erimer; +Cc: Miklos Szeredi, tomas, akpm, linux-fsdevel, linux-kernel


Tarkan Erimer:
> I think there is a misunderstanding or confusion about merging the code 
> into the mainline. You needn't to drop some functionality/features to 
> make your code small and to make it reviewed by mergers. You have to 
> separate (Core FS functions, features etc.) your code into small pieces. 
> So that, mergers will look, for first, your core FS functionality code 
> to see that it *_breaks_* or *_touches_* any _*other areas of the 
> kernel*_. If everything goes well, your main Core FS functionality code 
> will be merged into the mainline. After that, you can send your 
> feature/functionality codes one by one.

Tarkan, thank you for your advice.
Actually I broke functions and files into small pieces. Additionally I
applied CONFIG_AUFS_* conditions in fs/aufs/Makefile. When you review
the aufs source files, I'd suggest reading them in this order.

aufs-y := module.o sbinfo.o super.o branch.o xino.o sysaufs.o opts.o \
	wkq.o vfsub.o dcsub.o \
	cpup.o whout.o plink.o wbr_policy.o \
	dinfo.o dentry.o \
	finfo.o file.o f_op.o \
	dir.o vdir.o \
	iinfo.o inode.o i_op.o i_op_add.o i_op_del.o i_op_ren.o \
	ioctl.o
aufs-$(CONFIG_SYSFS) += sysfs.o
aufs-$(CONFIG_AUFS_BDEV_LOOP) += loop.o
aufs-$(CONFIG_AUFS_HINOTIFY) += hinotify.o
aufs-$(CONFIG_AUFS_DEBUG) += debug.o
aufs-$(CONFIG_AUFS_MAGIC_SYSRQ) += sysrq.o

If I am still misunderstanding, please let me know.

Thank you
J. R. Okajima


* Re: New filesystem for Linux kernel
  2009-02-24  8:13       ` New filesystem for Linux kernel Tomas M
  2009-02-24 11:52         ` Miklos Szeredi
@ 2009-02-24 14:15         ` Theodore Tso
  2009-02-24 15:18           ` David P. Quigley
                             ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: Theodore Tso @ 2009-02-24 14:15 UTC (permalink / raw)
  To: Tomas M; +Cc: linux-fsdevel, linux-kernel

On Tue, Feb 24, 2009 at 09:13:08AM +0100, Tomas M wrote:
> An overview of aufs2 has been submitted to this list.
> I noticed zero response at all. Nobody cares?
> 
> I suggest to remove unionfs from Andrew's -mm tree and replace it by aufs2!
> Tell me why this should not happen...

Um, you need to tell us why aufs2 is better than Unionfs.  The burden
of proof rests on your shoulders.  The code which is displacing
existing code needs to give a justification about why it is better
than the code which it is displacing, not the other way around.

> I write this in the hope that a debate will start...

As a debate judge might say, you haven't even made your prima facie
case yet.

						- Ted


* Re: New filesystem for Linux kernel
  2009-02-24 13:57               ` hooanon05
@ 2009-02-24 14:16                 ` Tarkan Erimer
  0 siblings, 0 replies; 42+ messages in thread
From: Tarkan Erimer @ 2009-02-24 14:16 UTC (permalink / raw)
  To: hooanon05; +Cc: Miklos Szeredi, tomas, akpm, linux-fsdevel, linux-kernel

hooanon05@yahoo.co.jp wrote:
> Tarkan, thank you for your advice.
> Actually I broke functions and files into small pieces. Additionally I
> applied CONFIG_AUFS_* conditions in fs/aufs/Makefile. When you review
> the aufs source files, I'd suggest reading them in this order.
>
> aufs-y := module.o sbinfo.o super.o branch.o xino.o sysaufs.o opts.o \
> 	wkq.o vfsub.o dcsub.o \
> 	cpup.o whout.o plink.o wbr_policy.o \
> 	dinfo.o dentry.o \
> 	finfo.o file.o f_op.o \
> 	dir.o vdir.o \
> 	iinfo.o inode.o i_op.o i_op_add.o i_op_del.o i_op_ren.o \
> 	ioctl.o
> aufs-$(CONFIG_SYSFS) += sysfs.o
> aufs-$(CONFIG_AUFS_BDEV_LOOP) += loop.o
> aufs-$(CONFIG_AUFS_HINOTIFY) += hinotify.o
> aufs-$(CONFIG_AUFS_DEBUG) += debug.o
> aufs-$(CONFIG_AUFS_MAGIC_SYSRQ) += sysrq.o
>
> If I am still misunderstanding, please let me know.
>   
Ok then. Good luck at your merging into mainline.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: New filesystem for Linux kernel
  2009-02-24 13:18           ` hooanon05
  2009-02-24 13:45             ` Tarkan Erimer
@ 2009-02-24 14:50             ` Miklos Szeredi
  2009-02-24 16:26               ` hooanon05
  2009-02-26  5:51               ` hooanon05
  1 sibling, 2 replies; 42+ messages in thread
From: Miklos Szeredi @ 2009-02-24 14:50 UTC (permalink / raw)
  To: hooanon05; +Cc: miklos, tomas, akpm, linux-fsdevel, linux-kernel

On Tue, 24 Feb 2009, hooanon05@yahoo.co.j wrote:
> I have to admit aufs is big, but actually, as I wrote in the documents,
> aufs2 has already dropped several features. And I believe it is the core
> feature set. If aufs2 drops some more features, then both of users and
> reviewers will say it doesn't work in this case, in that case. I don't
> think you would like to review such unusable code in real world.

It's always easier to review something with fewer features, even if
that feature set is too small for real world use.

Maybe it has all got to get into mainline to be really useful, but
still, splitting it up by functionality (not just files) helps
reviewers a lot.

Let's see this feature list:

> 1. Features
> ----------------------------------------
> - unite several directories into a single virtual filesystem. The member
>   directory is called as a branch.

Sounds like a core functionality :)

> - you can specify the permission flags to the branch, which are 'readonly',
>   'readwrite' and 'whiteout-able.'

The simplest version is with all branches read-only.  That gets rid of
a _huge_ amount of complexity, yet it's still useful in some
situations.  It also deals with a lot of the basic infrastructure
needed for stacking.

> - by upper writable branch, internal copyup and whiteout, files/dirs on
>   readonly branch are modifiable logically.

Right.  The second most simple version is all branches read-only except the
top one.
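
A minimal sketch of that second configuration (read-only lower branches
plus one writable top branch), assuming a toy model where each branch is
a dict from path to file contents. The class and method names are
hypothetical, and the whiteout is modeled as an in-memory set purely for
illustration; this is not how aufs or unionfs implements it.

```python
class Union:
    """Toy union: one writable top branch over read-only lower branches."""

    def __init__(self, top, lowers):
        self.top = top            # writable branch: path -> bytes
        self.lowers = lowers      # read-only branches, highest priority first
        self.whiteouts = set()    # paths hidden from the lower branches

    def read(self, path):
        if path in self.top:
            return self.top[path]
        if path in self.whiteouts:
            raise FileNotFoundError(path)
        for branch in self.lowers:
            if path in branch:
                return branch[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        if path not in self.top:
            # copy-up: the whole file is copied to the top branch first
            self.top[path] = self.read(path)
        self.top[path] += data

    def unlink(self, path):
        self.read(path)                       # raises if it does not exist
        self.top.pop(path, None)
        if any(path in branch for branch in self.lowers):
            # whiteout: hide the lower copy without touching it
            self.whiteouts.add(path)


lower = {"etc/motd": b"hello"}
union = Union({}, [lower])
union.append("etc/motd", b" world")
print(union.read("etc/motd"))   # the modified copy lives in the top branch
print(lower["etc/motd"])        # the read-only branch is unchanged
```

Note that the copy-up duplicates the entire file even for a small append,
which is the cost the "deltas only" idea is meant to avoid.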

And that's when one starts thinking about whether unioning is really
the right solution.  Instead this could be implemented with a special
filesystem format that only contains deltas to the data, metadata and
directory tree.  It would be much more space efficient, could easily
handle renames, hard links etc, without all the hacks that
unionfs/aufs do.

Has this been discussed somewhere?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: New filesystem for Linux kernel
  2009-02-24 14:15         ` Theodore Tso
@ 2009-02-24 15:18           ` David P. Quigley
  2009-02-24 15:41             ` hooanon05
  2009-02-25  7:31             ` Tomas M
  2009-02-25  8:12           ` Tomas M
  2009-02-26 14:31           ` Amit Kucheria
  2 siblings, 2 replies; 42+ messages in thread
From: David P. Quigley @ 2009-02-24 15:18 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Tomas M, linux-fsdevel, linux-kernel

On Tue, 2009-02-24 at 09:15 -0500, Theodore Tso wrote:
> On Tue, Feb 24, 2009 at 09:13:08AM +0100, Tomas M wrote:
> > An overview of aufs2 has been submitted to this list.
> > I noticed zero response at all. Nobody cares?
> > 
> > I suggest to remove unionfs from Andrew's -mm tree and replace it by aufs2!
> > Tell me why this should not happen...
> 
> Um, you need to tell us why aufs2 is better than Unionfs.  The burden
> of proof rests on your shoulders.  The code which is displacing
> existing code needs to give a justification about why it is better
> than the code which it is displacing, not the other way around.
> 
> > I write this in the hope that a debate will start...
> 
> As a debate judge might say, you haven't even made your prima facie
> case yet.
> 
> 						- Ted

The argument that the AUFS2 people have been using for a while is that
aufs2 is more stable than unionfs. I used to work on the Stony Brook
version of Unionfs and I must admit that the 1.0 version, for which I did
the initial 2.6 kernel port, was buggy. However, after I left they rewrote
large chunks of the code for a 2.0 release. I don't have much knowledge
of this code base but I am still subscribed to the Unionfs mailing list
and I don't see too many bug reports heading that way.

In reality though the debate isn't Unionfs vs AUFS2 because many of the
kernel people including Christoph have voiced their opinion that they
don't want to see a file system solution to union mounts. There have
been patch sets trying to introduce union mounts and they seem to have
gone quiet for a while as one of the main points of debate was how and
where to do duplicate elimination.

That being said there is no reason that both unionfs and aufs2 can't
live in mm. However, just because either of them is in mm doesn't mean
that they will get mainlined. Reiser4 has been in there for ages and I
don't think anyone sees that getting in any time soon.

As far as review of AUFS2 goes I used to idle in a few freenode channels
a while back and there was a discussion about AUFS2 the first time it
hit LKML. The AUFS people tout that the unionfs dev team has
incorporated some of their ideas into unionfs and that is true. However,
one of the issues raised about AUFS back then was that they had all
sorts of sorted hacks that were unacceptable for use in mainline. I
don't have my irc logs since I am at work and they are at home but I can
try to dig up the specific problems when I get home later today.

I think AUFS2 really needs a solid review but as other people have said
it isn't broken up and organized in a way that makes it a task someone
wants to do. Earlier in the thread someone said that the AUFS team needs
to slim it down to just core features and get those mainlined. The
Unionfs team did this when we were moving for mainline inclusion and I
think that is one of the reasons why people jumped ship and moved to
AUFS. Since we were focusing on the mainline inclusion effort and
releasing Unionfs with a smaller feature set, it ended up dropping some
functionality that people may have seen as core while the kernel
community and we didn't.

Just my $0.02 and in no way shape or form the opinion of my employer or
the US govt and definitely not a request to have work done in this
space.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: New filesystem for Linux kernel
  2009-02-24 15:18           ` David P. Quigley
@ 2009-02-24 15:41             ` hooanon05
  2009-02-25 15:53               ` David P. Quigley
  2009-02-25  7:31             ` Tomas M
  1 sibling, 1 reply; 42+ messages in thread
From: hooanon05 @ 2009-02-24 15:41 UTC (permalink / raw)
  To: David P. Quigley; +Cc: Theodore Tso, Tomas M, linux-fsdevel, linux-kernel


Hi David,
I'd be glad if you still remember me.

"David P. Quigley":
> In reality though the debate isn't Unionfs vs AUFS2 because many of the
> kernel people including Christoph have voiced their opinion that they
> don't want to see a file system solution to union mounts. There have
> been patch sets trying to introduce union mounts and they seem to have
> gone quiet for a while as one of the main points of debate was how and
> where to do duplicate elimination.

As I wrote in aufs2 documents, UnionMount may eliminate some
duplication, but it won't be able to share branches.
----------------------------------------------------------------------
UnionMount's approach can be kept small, but it may be hard to share
branches between several UnionMounts, since the whiteout in it is
implemented in the inode on the branch filesystem and is always
shared. According to Bharata's post, readdir does not seem to be
finished yet.
----------------------------------------------------------------------

Anyway, "kernel people including Christoph have voiced their opinion
that they don't want to see a file system solution to union mounts" is
very important. If it is true, I should ask for the reason first. Could you
tell me the source of this information? A URL of a mail archive or something?


> As far as review of AUFS2 goes I used to idle in a few freenode channels
> a while back and there was a discussion about AUFS2 the first time it
> hit LKML. The AUFS people tout that the unionfs dev team has
> incorporated some of their ideas into unionfs and that is true. However,
> one of the issues raised about AUFS back then was that they had all
> sorts of sorted hacks that were unacceptable for use in mainline. I

I am afraid you are confusing aufs1 and aufs2.
And I hope that such "unacceptable" things are dropped from aufs2.


> don't have my irc logs since I am at work and they are at home but I can
> try to dig up the specific problems when I get home later today.

Ok, I'll wait.


J. R. Okajima

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: New filesystem for Linux kernel
  2009-02-24 14:50             ` Miklos Szeredi
@ 2009-02-24 16:26               ` hooanon05
  2009-02-25 10:28                 ` Miklos Szeredi
  2009-02-26  5:51               ` hooanon05
  1 sibling, 1 reply; 42+ messages in thread
From: hooanon05 @ 2009-02-24 16:26 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: tomas, akpm, linux-fsdevel, linux-kernel


Miklos Szeredi:
> It's always easier to review something with fewer features, even if
> that feature set is too small for real world use.

Generally I agree with you.


> The simplest version is with all branches read-only.  That gets rid of
> a _huge_ amount of complexity, yet it's still useful in some
> situations.  It also deals with a lot of the basic infrastructure
> needed for stacking.

If you really think it is a better way to get merged into mainline, then
I'll try to implement such a version.


> And that's when one starts thinking about whether unioning is really
> the right solution.  Instead this could be implemented with a special
> filesystem format that only contains deltas to the data, metadata and
> directory tree.  It would be much more space efficient, could easily
> handle renames, hard links etc, without all the hacks that
> unionfs/aufs do.

It sounds like an ODF (on disk format) version of unionfs (while it
seems to be inactive).
When it comes to implementation, I don't think it is easier to maintain
deltas of file data and metadata. Since aufs has a writable branch in it,
it is better and easier to maintain data in a branch fs.
If you think there should not be any writable branch in aufs, and all
"write" goes to a new filesystem format, then it is equivalent to a
writable branch, isn't it?
If you say "just a part of write" goes to a new fs, then I don't think
we can support several essential features, for instance mmap.


J. R. Okajima

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: New filesystem for Linux kernel
  2009-02-24 15:18           ` David P. Quigley
  2009-02-24 15:41             ` hooanon05
@ 2009-02-25  7:31             ` Tomas M
  2009-02-25  9:33               ` David Newall
  1 sibling, 1 reply; 42+ messages in thread
From: Tomas M @ 2009-02-25  7:31 UTC (permalink / raw)
  To: David P. Quigley; +Cc: Theodore Tso, linux-fsdevel, linux-kernel

David P. Quigley wrote:
> Earlier in the thread someone said that the AUFS team needs
> to slim it down to just core features and get those mainlined. The
> Unionfs team did this when we were moving for mainline inclusion and I
> think that is one of the reasons why people jumped ship and moved to
> AUFS.

I can speak just for myself, and in my opinion it wasn't it.

At least Knoppix and Slax switched to AUFS even before the release of unionfs 2.0.
The switch was not due to missing features in unionfs, it was due to huge instability: for example, simple filesystem operations froze the whole computer, mmap support was completely missing or broken (until unionfs people incorporated the idea from aufs), and so on. No similar problem ever happened with aufs, which is used in Slax by hundreds of thousands of users.

In general, I need a union filesystem and I do not care whether that is unionfs or aufs. But since I have had very bad experience with unionfs (and I am not alone), and very good experience with aufs, along with all the hundreds of thousands of Slax users, I wish aufs to be mainlined, because the code simply works and has done so since its initial release, for many years.

Tomas M
slax.org


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: New filesystem for Linux kernel
  2009-02-24 14:15         ` Theodore Tso
  2009-02-24 15:18           ` David P. Quigley
@ 2009-02-25  8:12           ` Tomas M
  2009-02-26 14:31           ` Amit Kucheria
  2 siblings, 0 replies; 42+ messages in thread
From: Tomas M @ 2009-02-25  8:12 UTC (permalink / raw)
  To: Theodore Tso, Tomas M, linux-fsdevel, linux-kernel

>> I suggest to remove unionfs from Andrew's -mm tree and replace it by aufs2!
>> Tell me why this should not happen...
> 
> Um, you need to tell us why aufs2 is better than Unionfs.  The burden
> of proof rests on your shoulders.

I understand and agree.

At the time Slax switched from unionfs to aufs, there were the following issues:
(unionfs people are welcome to comment on what's improved in newest unionfs 2.x)

1) Better branch manipulation in aufs
- unionfs used ioctls to manipulate branches, aufs used remount. I do not quite understand which one was better, but remount didn't need any special binary to manipulate branches so that was better for me.

I guess unionfs switched to remount too; not because it is better, but because it was needed just to get mainlined. From my point of view, they do many things just because they desperately need to be mainlined, even if they would like to do it differently (the ioctl branch manipulation is just an example)

2) More branches in one aufs union
- unionfs supported (if I remember correctly) only 128 branches. Aufs supported 1024 branches, and there was an option to support 32767 branches (I didn't test it though).

3) Fully working mmap support in aufs
- unionfs wasn't even able to mount a loop file from the union. At the time of my switch, there was no mmap support in unionfs at all. I think they have since implemented it, re-using aufs's method.

4) Persistent inode support in aufs
- I'm not sure if I remember this correctly. In unionfs, when a branch was added, I believe that all the inode numbers changed. That had a fatal impact on many programs, since it is not supposed to happen. Aufs maintains persistent inode numbers just perfectly.

5) Better stability in aufs
- it is always said that this is a subjective issue. In my opinion, it's pretty objective. Unionfs didn't work well in the past, and they keep fixing race conditions all the time, even now in -mm tree. Aufs simply worked, and it still simply works. I do not remember any single system freeze with aufs, but I remember a lot of freezes with unionfs.

6) Branch information through /sys instead of /proc/mounts
- imagine you added 1000 branches, what happens with your /proc/mounts with unionfs? In aufs, it's just OK.
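
To illustrate point 4, here is a minimal sketch of how a translation
table can keep union inode numbers stable when a branch is added. It
assumes each branch has a fixed identifier that does not depend on its
position in the union; the class and its method are hypothetical and do
not describe the actual aufs or unionfs code.

```python
import itertools


class InoTable:
    """Maps (branch id, inode number on the branch) to a union inode number."""

    def __init__(self):
        self._next = itertools.count(1)   # next unused union inode number
        self._map = {}

    def union_ino(self, branch_id, branch_ino):
        key = (branch_id, branch_ino)
        if key not in self._map:
            # first time this file is seen: assign a number and remember it
            self._map[key] = next(self._next)
        return self._map[key]


table = InoTable()
before = table.union_ino(branch_id=7, branch_ino=100)
table.union_ino(branch_id=8, branch_ino=100)    # a branch is added later
after = table.union_ino(branch_id=7, branch_ino=100)
print(before == after)   # the number seen by programs did not change
```

The point of keying on a stable branch identifier is that adding or
reordering branches never invalidates numbers already handed out.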


Tomas M
slax.org



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: New filesystem for Linux kernel
  2009-02-25  7:31             ` Tomas M
@ 2009-02-25  9:33               ` David Newall
  0 siblings, 0 replies; 42+ messages in thread
From: David Newall @ 2009-02-25  9:33 UTC (permalink / raw)
  To: Tomas M; +Cc: David P. Quigley, Theodore Tso, linux-fsdevel, linux-kernel

Tomas M wrote:
> In general, I need a union filesystem and I do not care whether that is unionfs or aufs.

Agreed. Well said.

> I wish aufs to be mainlined; because the code simply works, since its initial release, for many years.

Perhaps; but consider Okajima-san's rationale, "Aufs2 is a refined
version of old aufs1: - to be reviewed easily and widely. - to make the
source files simpler and smaller by dropping several original
features."  Do not assume the code being proposed is the same as, or of
the same quality as, that from AUFS1.


If we really believe in choice, then unless AUFS2 obviously breaks
things, it's hard to argue against its inclusion.  Thus can
best-of-breed emerge.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: New filesystem for Linux kernel
  2009-02-24 16:26               ` hooanon05
@ 2009-02-25 10:28                 ` Miklos Szeredi
  2009-02-26  4:09                   ` hooanon05
  0 siblings, 1 reply; 42+ messages in thread
From: Miklos Szeredi @ 2009-02-25 10:28 UTC (permalink / raw)
  To: hooanon05; +Cc: miklos, tomas, akpm, linux-fsdevel, linux-kernel

On Wed, 25 Feb 2009, hooanon05@yahoo.co.j wrote:
> > The simplest version is with all branches read-only.  That gets rid of
> > a _huge_ amount of complexity, yet it's still useful in some
> > situations.  It also deals with a lot of the basic infrastructure
> > needed for stacking.
> 
> If you really think it is a better way to get merged into mainline, then
> I'll try to implement such a version.

I'd personally be more motivated to review <2000 line chunks (where
each step adds new functionality and makes sense in itself), than a
20000 line filesystem all in one.

> > And that's when one starts thinking about whether unioning is really
> > the right solution.  Instead this could be implemented with a special
> > filesystem format that only contains deltas to the data, metadata and
> > directory tree.  It would be much more space efficient, could easily
> > handle renames, hard links etc, without all the hacks that
> > unionfs/aufs do.
> 
> It sounds like an ODF (on disk format) version of unionfs (while it
> seems to be inactive).
> When it comes to implementation, I don't think it is easier to maintain
> deltas of file data and metadata. Since aufs has a writable branch in it,
> it is better and easier to maintain data in a branch fs.

Perhaps it's easier, but copy-up is a very inefficient operation, both
in disk space and in time.  My personal opinion is that a "delta"
filesystem would be cleaner and more useful than a writable union.
Writable union filesystems need many hacks to make them useful, such
as copy-up, whiteouts, inode number tables, virtual hard links, etc.

But that's just a thought, I haven't gone too deeply into this.

> If you think there should not be any writable branch in aufs, and all
> "write" goes to a new filesystem format, then it is equivalent to a
> writable branch, isn't it?

Yes, it should be equivalent.

> If you say "just a part of write" goes to a new fs, then I don't think
> we can support several essential features, for instance mmap.

It should be possible to support mmap.

Miklos

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: New filesystem for Linux kernel
  2009-02-24 15:41             ` hooanon05
@ 2009-02-25 15:53               ` David P. Quigley
  2009-02-26  4:21                 ` hooanon05
  0 siblings, 1 reply; 42+ messages in thread
From: David P. Quigley @ 2009-02-25 15:53 UTC (permalink / raw)
  To: hooanon05; +Cc: Theodore Tso, Tomas M, linux-fsdevel, linux-kernel

On Wed, 2009-02-25 at 00:41 +0900, hooanon05@yahoo.co.jp wrote:
> Hi David,
> I'd be glad if you still remember me.
> 
> "David P. Quigley":
> > In reality though the debate isn't Unionfs vs AUFS2 because many of the
> > kernel people including Christoph have voiced their opinion that they
> > don't want to see a file system solution to union mounts. There have
> > been patch sets trying to introduce union mounts and they seem to have
> > gone quiet for a while as one of the main points of debate was how and
> > where to do duplicate elimination.
> 
> As I wrote in aufs2 documents, UnionMount may eliminate some
> duplication, but it won't be able to share branches.
> ----------------------------------------------------------------------
> UnionMount's approach can be kept small, but it may be hard to share
> branches between several UnionMounts, since the whiteout in it is
> implemented in the inode on the branch filesystem and is always
> shared. According to Bharata's post, readdir does not seem to be
> finished yet.
> ----------------------------------------------------------------------
> 
> Anyway, "kernel people including Christoph have voiced their opinion
> that they don't want to see a file system solution to union mounts" is
> very important. If it is true, I should ask for the reason first. Could you
> tell me the source of this information? A URL of a mail archive or something?

I'm having a hard time finding a long descriptive email from Christoph
outlining why but from our original unionfs posting to lkml this was
what I found. There have been subsequent comments from various people to
this effect as well.

http://marc.info/?l=linux-kernel&m=116833670823190&w=2

"And unionfs is the wrong thing do use for this.  Unioning is a complex
namespace operation and needs to be implemented in the VFS or at least
needs a lot of help from the VFS.  Getting namespace cache coherency
and especially locking right is impossible with out that."

I'd suggest getting the VFS maintainers to chime in on your code. If
their opinion on this has changed then you are in much better shape for
getting AUFS2 merged.

> 
> 
> > As far as review of AUFS2 goes I used to idle in a few freenode channels
> > a while back and there was a discussion about AUFS2 the first time it
> > hit LKML. The AUFS people tout that the unionfs dev team has
> > incorporated some of their ideas into unionfs and that is true. However,
> > one of the issues raised about AUFS back then was that they had all
> > sorts of sorted hacks that were unacceptable for use in mainline. I
> 
> I am afraid you are confusing aufs1 and aufs2.
> And I hope that such "unacceptable" things are dropped from aufs2.

I believe I am and since that is the case then aufs2 really needs a
solid review before it is merged. Silence isn't tacit acceptance that
the code is good to go. It usually means that people aren't looking at
it.

> 
> 
> > don't have my irc logs since I am at work and they are at home but I can
> > try to dig up the specific problems when I get home later today.
> 
> Ok, I'll wait.

This may sound like a copout but unfortunately it seems my logs were on
my hard drive that died a few months back. Regardless though since you
did a major rewrite for AUFS2 those comments could possibly no longer be
valid. Regardless since there was a major rewrite since your last review
several people should review the code base.

> 
> 
> J. R. Okajima

All that said good luck with this effort.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC 0/8] Aufs2 documents
  2009-02-23  7:31 [RFC 0/8] Aufs2 documents hooanon05
                   ` (7 preceding siblings ...)
  2009-02-23  7:38 ` [RFC 8/8] Aufs2: plan hooanon05
@ 2009-02-25 17:50 ` David P. Quigley
  2009-02-25 19:07   ` Matthew Wilcox
  2009-02-26  4:54   ` hooanon05
  8 siblings, 2 replies; 42+ messages in thread
From: David P. Quigley @ 2009-02-25 17:50 UTC (permalink / raw)
  To: hooanon05; +Cc: linux-fsdevel, linux-kernel

I think it would be useful to see the source code for AUFS2 posted to
LKML. One of the questions I have, which doesn't seem to be addressed
in these documents, is how robust your xattr support is and whether you
are making the appropriate LSM calls to make this usable with SELinux and
Smack. Also, from a labeling perspective you have a very interesting
question of which label you select when unifying directories. If you
have a/foo and b/foo each with different labels, which do you choose?
Based on the history of Union type file systems I would suspect the
answer is whichever branch is listed first.

Dave


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC 0/8] Aufs2 documents
  2009-02-25 17:50 ` [RFC 0/8] Aufs2 documents David P. Quigley
@ 2009-02-25 19:07   ` Matthew Wilcox
  2009-02-26  4:54   ` hooanon05
  1 sibling, 0 replies; 42+ messages in thread
From: Matthew Wilcox @ 2009-02-25 19:07 UTC (permalink / raw)
  To: David P. Quigley; +Cc: hooanon05, linux-fsdevel, linux-kernel

On Wed, Feb 25, 2009 at 12:50:54PM -0500, David P. Quigley wrote:
> I think it would be useful to see the source code for AUFS2 posted to
> LKML. One of the questions I have, which doesn't seem to be addressed
> in these documents, is how robust your xattr support is and whether you
> are making the appropriate LSM calls to make this usable with SELinux and
> Smack. Also, from a labeling perspective you have a very interesting
> question of which label you select when unifying directories. If you
> have a/foo and b/foo each with different labels, which do you choose?
> Based on the history of Union type file systems I would suspect the
> answer is whichever branch is listed first.

That would provide an interesting way to bypass security protections on
a directory.  I suspect it should deny access if *any* of the unioned
directories would deny access.
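
A minimal sketch of the two policies being contrasted here, assuming each
branch that contains the entry has already produced its own allow/deny
decision. Both function names are hypothetical; neither is taken from an
existing union filesystem or LSM.

```python
def first_branch_policy(decisions):
    """Follow only the first (topmost) branch that contains the entry."""
    return decisions[0]


def deny_if_any_policy(decisions):
    """Allow access only if every unioned directory would allow it."""
    return all(decisions)


# a/foo allows access, b/foo would deny it; branch "a" is listed first.
decisions = [True, False]
print(first_branch_policy(decisions))   # access slips past b's label
print(deny_if_any_policy(decisions))    # access is refused
```

The first policy lets a permissive top branch mask a restrictive lower
one, which is the bypass described above.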

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: New filesystem for Linux kernel
  2009-02-25 10:28                 ` Miklos Szeredi
@ 2009-02-26  4:09                   ` hooanon05
  0 siblings, 0 replies; 42+ messages in thread
From: hooanon05 @ 2009-02-26  4:09 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: tomas, akpm, linux-fsdevel, linux-kernel


Miklos Szeredi:
> I'd personally be more motivated to review <2000 line chunks (where
> each step adds new functionality and makes sense in itself), than a
> 20000 line filesystem all in one.

That is reasonable. :-)


> Perhaps it's easier, but copy-up is a very inefficient operation, both
> in disk space and in time.  My personal opinion is that a "delta"

I have to agree again.


> But that's just a thought, I haven't gone too deeply into this.
> 
> > If you say "just a part of write" goes to a new fs, then I don't think
> > we can support several essential features, for instance mmap.
> 
> It should be possible to support mmap.

In an easy way?
I know you already wrote that it is just a thought, but if you have an idea
for supporting mmap of a file which is distributed across multiple
filesystems, please let me know.


J. R. Okajima

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: New filesystem for Linux kernel
  2009-02-25 15:53               ` David P. Quigley
@ 2009-02-26  4:21                 ` hooanon05
  0 siblings, 0 replies; 42+ messages in thread
From: hooanon05 @ 2009-02-26  4:21 UTC (permalink / raw)
  To: David P. Quigley; +Cc: Theodore Tso, Tomas M, linux-fsdevel, linux-kernel


Thank you for searching.

"David P. Quigley":
> "And unionfs is the wrong thing do use for this.  Unioning is a complex
> namespace operation and needs to be implemented in the VFS or at least
> needs a lot of help from the VFS.  Getting namespace cache coherency
> and especially locking right is impossible with out that."
> 
> I'd suggest getting the VFS maintainers to chime in on your code. If
> their opinion on this has changed then you are in much better shape for
> getting AUFS2 merged.

It may not be appropriate to ask you about "especially locking right" in
detail. But if it means what I am guessing, this description may be the
answer.

(from [RFC 3/8] Aufs2: lookup)
Revalidate Dentry and UDBA (User's Direct Branch Access)
----------------------------------------------------------------------
Generally VFS helpers re-validate a dentry as a part of lookup.
0. digging down the directory hierarchy.
1. lock the parent dir by its i_mutex.
2. lookup the final (child) entry.
3. revalidate it.
4. call the actual operation (create, unlink, etc.)
5. unlock the parent dir

If the filesystem implements its ->d_revalidate() (step 3), then it is
called. Actually aufs implements it and checks that the dentry on a branch
is still valid.
But that is not enough, because aufs has to release the lock for the
parent dir on a branch at the end of ->lookup() (step 2) and
->d_revalidate() (step 3) while the i_mutex of the aufs dir is still
held by VFS.
If the file on a branch is changed directly, i.e. bypassing aufs, after
aufs released the lock, then the subsequent operation may cause
some unpleasant result.

This situation is a result of the VFS architecture, in which ->lookup() and
->d_revalidate() are separate. But I would never say it is wrong. It is a
good design from the VFS's point of view. It is just not suitable for the
sub-VFS character of aufs.

Aufs supports such cases with three levels of revalidation, selectable
by the user.
1. Simple Revalidate
   In addition to the native flow in VFS, confirm the child-parent
   relationship on the branch just after locking the parent dir on the
   branch in the "actual operation" (step 4). When this validation
   fails, aufs returns EBUSY. ->d_revalidate() (step 3) in aufs still
   checks the validity of the dentry on branches.
2. Monitor Changes Internally by Inotify
   In addition to the above, in the "actual operation" (step 4) aufs
   looks up the dentry on the branch again, and returns EBUSY if it finds
   a different dentry.
   Additionally, aufs sets an inotify watch on every dir on branches
   while it is in cache. When an event is notified, aufs registers a
   function to the kernel 'events' thread by schedule_work(). The
   function sets some special status on the cached aufs dentry and inode
   private data. If they are not cached, then aufs has nothing to
   do. When the same file is accessed through aufs (step 0-3) later,
   aufs will detect the status and refresh all necessary data.
   In this mode, aufs has to ignore the events which are fired by aufs
   itself.
3. No Extra Validation
   This is the simplest level: it doesn't add any additional revalidation
   test, and skips the revalidation in step 4. It is useful and improves
   aufs performance when the system surely hides the aufs branches from
   users, by over-mounting something (or another method).
----------------------------------------------------------------------
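
The three levels above can be sketched roughly as follows. This is a toy
user-space model, not aufs code: the names Udba, BranchDir, CachedDentry,
watch() and unlink() are all hypothetical, a branch directory is just a
dict from name to inode number, and the filtering of events fired by aufs
itself is left out.

```python
import errno
from enum import Enum


class Udba(Enum):
    REVAL = 1    # 1. Simple Revalidate
    NOTIFY = 2   # 2. Monitor Changes Internally by Inotify
    NONE = 3     # 3. No Extra Validation


class BranchDir:
    """Toy directory on a branch: name -> inode number."""

    def __init__(self, entries):
        self.entries = dict(entries)
        self.watchers = []

    def direct_change(self, name, ino):
        """A change made directly on the branch, bypassing the union."""
        self.entries[name] = ino
        for notify in self.watchers:
            notify(name)


class CachedDentry:
    def __init__(self, name, ino):
        self.name = name
        self.ino = ino
        self.stale = False   # set by the watch, refreshed on next access


def watch(parent, dentry):
    """Level 2 only: mark the cached dentry stale when the branch changes."""
    def on_event(name):
        if name == dentry.name:
            dentry.stale = True
    parent.watchers.append(on_event)


def unlink(level, parent, dentry):
    """Step 4, the actual operation. Returns 0 or a negative errno."""
    if level is not Udba.NONE:
        # re-check the child on the branch after locking the parent dir
        if parent.entries.get(dentry.name) != dentry.ino:
            return -errno.EBUSY
    parent.entries.pop(dentry.name, None)
    return 0


parent = BranchDir({"a": 10})
dentry = CachedDentry("a", 10)
parent.direct_change("a", 11)                  # someone bypasses the union
print(unlink(Udba.REVAL, parent, dentry))      # refused with -EBUSY
print(unlink(Udba.NONE, parent, dentry))       # proceeds without checking
```

With no extra validation the stale dentry is acted on blindly, which is
why that level is only reasonable when the branches are hidden from users.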


> This may sound like a copout but unfortunately it seems my logs were on
> my hard drive that died a few months back. Regardless though since you
> did a major rewrite for AUFS2 those comments could possibly no longer be
> valid. Regardless since there was a major rewrite since your last review
> several people should review the code base.

I have no objection about reviewing, entirely agreed.
Because I could guess it is hard work to read 40k lines, I posted
documents which describe design first.


J. R. Okajima

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC 0/8] Aufs2 documents
  2009-02-25 17:50 ` [RFC 0/8] Aufs2 documents David P. Quigley
  2009-02-25 19:07   ` Matthew Wilcox
@ 2009-02-26  4:54   ` hooanon05
  2009-02-26 17:20     ` David P. Quigley
  1 sibling, 1 reply; 42+ messages in thread
From: hooanon05 @ 2009-02-26  4:54 UTC (permalink / raw)
  To: David P. Quigley; +Cc: linux-fsdevel, linux-kernel


"David P. Quigley":
> I think it would be useful to see the source code for AUFS2 posted to
> LKML. One of the questions I have not which doesn't seem to be addressed
> in these documents is how robust is your xattr support and are you
> making the appropriate LSM calls to make this usable with SELinux and
> Smack. Also from a labeling perspective you have a very interesting
> question of which label do you select when unifying directories. If you
> have a/foo and b/foo each with different labels which do you choose.
> Based on the history of Union type file systems I would suspect the
> answer is whichever branch is listed first. 

Aufs doesn't support xattr currently because I haven't decided how to
support it yet.
As far as I know, the implementation of xattr and its key/value pairs is
filesystem dependent. For instance,
- there are two branches (rw and ro) in aufs and their filesystem types
  differ from each other.
- an application issues getxattr() or listxattr() and makes sure
  "key.brabra" exists (or is set).
- and then it issues setxattr() for "key.brabra".
- aufs will copy up the file and try setxattr() on the upper one.
- I am afraid it may happen that "key.brabra" is not supported by the
  upper filesystem and aufs returns an error.
- from the users' point of view, this behaviour must be very strange.

So I am considering supporting xattr at several levels.
- support a minimum common set of keys only (if such a set exists)
  Here "minimum common set" means a group of keys which are surely
  supported by all filesystems. Aufs will filter out other keys.
- create a new internal status flag
  This flag is set when the types of all branches are the same. When the
  flag is set, aufs will handle xattr by simply redirecting.
- create a new aufs mount option
  The option will select between the two behaviours (above).
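
A minimal sketch of these levels, as a userspace model and not aufs
code. It assumes that "supported keys" can be approximated by xattr
namespace prefixes (user, security, ...), which is a simplification, and
all names (common_namespaces, list_xattr, choose_mode) and the branch
data are hypothetical.

```python
def namespace_of(key):
    """Return the namespace prefix of an xattr key, e.g. 'user'."""
    return key.split(".", 1)[0]


def common_namespaces(branches):
    """Namespaces accepted by every branch (the 'minimum common set')."""
    sets = [set(b["namespaces"]) for b in branches]
    return set.intersection(*sets) if sets else set()


def same_fs_type(branches):
    """Model of the internal status flag: all branch types are the same."""
    return len({b["fstype"] for b in branches}) <= 1


def choose_mode(branches, mount_option=None):
    """An explicit mount option wins; otherwise fall back on the flag."""
    if mount_option is not None:
        return mount_option
    return "redirect" if same_fs_type(branches) else "common"


def list_xattr(branches, keys_on_file, mode):
    """Keys shown to the application under the selected behaviour."""
    if mode == "redirect":
        return sorted(keys_on_file)            # pass everything through
    allowed = common_namespaces(branches)
    return sorted(k for k in keys_on_file if namespace_of(k) in allowed)


# Illustrative data only.
branches = [
    {"fstype": "ext3", "namespaces": {"user", "security"}},
    {"fstype": "tmpfs", "namespaces": {"security"}},
]
visible = list_xattr(branches, ["user.comment", "security.selinux"],
                     choose_mode(branches))
```

With the "common" behaviour the application never sees a key that a
later copy-up could not store, at the price of hiding keys that the
lower branch does have.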

Unfortunately I could not understand what "label" means.
Is it a volume label used at mount time, like a UUID?


J. R. Okajima


* Re: New filesystem for Linux kernel
  2009-02-24 14:50             ` Miklos Szeredi
  2009-02-24 16:26               ` hooanon05
@ 2009-02-26  5:51               ` hooanon05
  2009-02-26  5:55                 ` hooanon05
  1 sibling, 1 reply; 42+ messages in thread
From: hooanon05 @ 2009-02-26  5:51 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: tomas, akpm, linux-fsdevel, linux-kernel


Miklos Szeredi:
> The simplest version is with all branches read-only.  That gets rid of
> a _huge_ amount of complexity, yet it's still useful in some
> situations.  It also deals with a lot of the basic infrastructure
> needed for stacking.

I made it.
Since you already pulled the aufs2-2.6.git tree, please try or review
the aufs2-tmp-ro branch. Most of my local test scripts are useless for
a read-only mount. But I picked out some and they all passed.

$ wc -l fs/aufs/*.[ch] | tail -1
 12677 total

About 50% of lines are deleted.


J. R. Okajima


* Re: New filesystem for Linux kernel
  2009-02-26  5:51               ` hooanon05
@ 2009-02-26  5:55                 ` hooanon05
  0 siblings, 0 replies; 42+ messages in thread
From: hooanon05 @ 2009-02-26  5:55 UTC (permalink / raw)
  To: Miklos Szeredi, tomas, akpm, linux-fsdevel, linux-kernel


I tested it on linux-2.6.28.

J. R. Okajima

> Miklos Szeredi:
> > The simplest version is with all branches read-only.  That gets rid of
> > a _huge_ amount of complexity, yet it's still useful in some
> > situations.  It also deals with a lot of the basic infrastructure
> > needed for stacking.
> 
> I made it.
> Since you already pulled the aufs2-2.6.git tree, please try or review
> the aufs2-tmp-ro branch. Most of my local test scripts are useless for
> a read-only mount. But I picked out some and they all passed.
> 
> $ wc -l fs/aufs/*.[ch] | tail -1
>  12677 total
> 
> About 50% of lines are deleted.


* Re: New filesystem for Linux kernel
  2009-02-24 14:15         ` Theodore Tso
  2009-02-24 15:18           ` David P. Quigley
  2009-02-25  8:12           ` Tomas M
@ 2009-02-26 14:31           ` Amit Kucheria
  2 siblings, 0 replies; 42+ messages in thread
From: Amit Kucheria @ 2009-02-26 14:31 UTC (permalink / raw)
  To: Theodore Tso, Tomas M, linux-fsdevel, linux-kernel

On Tue, Feb 24, 2009 at 4:15 PM, Theodore Tso <tytso@mit.edu> wrote:
> On Tue, Feb 24, 2009 at 09:13:08AM +0100, Tomas M wrote:
>> An overview of aufs2 has been submitted to this list.
>> I noticed zero response at all. Nobody cares?
>>
>> I suggest to remove unionfs from Andrew's -mm tree and replace it by aufs2!
>> Tell me why this should not happen...
>
> Um, you need to tell us why aufs2 is better than Unionfs.  The burden
> of proof rests on your shoulders.  The code which is displacing
> existing code needs to give a justification about why it is better
> than the code which is displacing, not the other way around.

Does it really have to displace unionfs? Why can't it be merged in
(after proper review) alongside unionfs?

Ubuntu moved to aufs for some of the same reasons that Tomas has
outlined elsewhere in this thread. Unionfs required some hand-holding
every time we upgraded to a new kernel while aufs has not given us
those problems. And yes, unionfs races have given us several sleepless
nights before releases.

Regards,
Amit (with Ubuntu Kernel Developer hat on)


* Re: [RFC 0/8] Aufs2 documents
  2009-02-26  4:54   ` hooanon05
@ 2009-02-26 17:20     ` David P. Quigley
  2009-02-27 14:27       ` hooanon05
  0 siblings, 1 reply; 42+ messages in thread
From: David P. Quigley @ 2009-02-26 17:20 UTC (permalink / raw)
  To: hooanon05; +Cc: linux-fsdevel, linux-kernel

On Thu, 2009-02-26 at 13:54 +0900, hooanon05@yahoo.co.jp wrote:
> "David P. Quigley":
> > I think it would be useful to see the source code for AUFS2 posted to
> > LKML. One of the questions I have not which doesn't seem to be addressed
> > in these documents is how robust is your xattr support and are you
> > making the appropriate LSM calls to make this usable with SELinux and
> > Smack. Also from a labeling perspective you have a very interesting
> > question of which label do you select when unifying directories. If you
> > have a/foo and b/foo each with different labels which do you choose.
> > Based on the history of Union type file systems I would suspect the
> > answer is whichever branch is listed first. 
> 
> Aufs doesn't support xattr currently because I haven't decided how to
> support it yet.
> As far as I know, the implementation of xattr and its key/value pairs is
> filesystem dependent. For instance,
> - there are two branches (rw and ro) in aufs and their filesystem types
>   differ from each other.
> - an application issues getxattr() or listxattr() and makes sure
>   "key.brabra" exists (or is set).
> - and then it issues setxattr() for "key.brabra".
> - aufs will copy up the file and try setxattr() on the upper one.
> - I am afraid it may happen that "key.brabra" is not supported by the
>   upper filesystem and aufs returns an error.
> - from the users' point of view, this behaviour must be very strange.

This is correct. Seeing the xattr then not being able to set it is not
necessarily wrong behavior but if the error returned is along the lines
of key doesn't exist then that can be confusing. For example with
SELinux it is possible to see security.selinux but then to get an
EOPNOTSUPP on trying to set that key on certain file systems. Doing a
quick test on ext3 with setfattr on an xattr that doesn't exist you get
an EOPNOTSUPP back. Considering things such as ACLs and SELinux labels
are stored in xattrs it seems that failing a copyup on EOPNOTSUPP is a
very reasonable thing to do.


> 
> So I am considering supporting xattr at several levels.
> - support a minimum common set of keys only (if such a set exists)
>   Here "minimum common set" means a group of keys which are surely
>   supported by all filesystems. Aufs will filter out other keys.
> - create a new internal status flag
>   This flag is set when the types of all branches are the same. When the
>   flag is set, aufs will handle xattr by simply redirecting.
> - create a new aufs mount option
>   The option will select between the two behaviours (above).

So I don't think this is a good way of going about it. The idea of
having some flag which indicates just relay to the lower filesystems if
they are all the same completely ignores that you may have several file
systems which all support the required namespaces. One example I can
think of is a thin client where you get the main image from an NFSv4
server hopefully with the labeled nfs support built in (shameless plug)
so that the security.selinux and security.smack xattrs work properly.
You can then union that with either a persistent file system
(ext{2,3,4}) or a non-persistent one like tmpfs. Both of these support
the security namespace: ext* will store it on disk, while tmpfs will only
update the in-core inode security state. Regardless you have a situation
where it is acceptable to use setxattr on the nfs4, ext*, and tmpfs
branches.

So I think it might be reasonable to say that if you can't copyup the
xattr, fail with EOPNOTSUPP and leave it up to the administrator to
configure his system in a way that allows him to copyup files properly.
If he knows he has a store with a lot of user.* xattrs on a read only
branch he should make sure his read write branch supports user xattrs. 
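
A minimal sketch of that policy as a userspace model, not aufs code.
The Branch class and copy_up() are hypothetical names, branches are
in-memory dicts, and the namespace check is an assumed simplification of
what a real filesystem would accept. The point is only that a copy-up
either carries every xattr along or fails with EOPNOTSUPP and leaves
nothing behind on the read-write branch.

```python
import errno


class Branch:
    """Hypothetical in-memory branch: path -> {xattr key: value}."""

    def __init__(self, namespaces):
        self.namespaces = set(namespaces)
        self.files = {}

    def setxattr(self, path, key, value):
        # Refuse keys from a namespace this branch does not support.
        if key.split(".", 1)[0] not in self.namespaces:
            raise OSError(errno.EOPNOTSUPP,
                          "xattr namespace not supported", path)
        self.files.setdefault(path, {})[key] = value


def copy_up(path, lower, upper):
    """Copy a file's xattrs to the upper branch, all or nothing."""
    upper.files[path] = {}
    try:
        for key, value in lower.files[path].items():
            upper.setxattr(path, key, value)
    except OSError:
        # Do not leave a copy that silently lost security information.
        del upper.files[path]
        raise


# Illustrative data only.
lower = Branch({"user", "security"})
lower.files["foo"] = {"security.selinux": b"ctx", "user.note": b"x"}
```

An administrator who keeps user.* xattrs on the read-only branch would
then see the failure at the first write and could fix the configuration,
instead of discovering later that the attributes had been dropped.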

From a security perspective though there is the question of what label
you give the file that you just created in the copyup process. If you
have the union of an iso (iso9660_t) and an ext3 file system (possibly
any type), what type do you give the new file? iso9660_t doesn't really
belong on an ext3 file system. You might need to create a new LSM hook
to ask the security module what to label files that are copied up.

> 
> Unfortunately I could not understand what "label" means.
> Is it a volume label used at mount time, like a UUID?

Unfortunately label seems to be an overloaded term. I should have used
the term security label. So as of 2.6.30 there will be 3 LSMs in
mainline. Two of them, SELinux and Smack, are label-based MAC
implementations. This means that objects in the system (files, pipes,
symlinks, directories) are assigned a security label. In SELinux this is
a multi-field string which contains a user, role, type, and potentially
a Multi-Level Security component (think Secret, Top Secret). I'm not as
familiar with Smack labels and where they are stored but it also assigns
a security label to objects. SELinux stores these labels in an xattr
which is keyed with security.selinux. Like I mentioned above it is
possible that security.selinux can be persistent like in the case of
ext* or just goes away on unmount like in the case of tmpfs.

If you have more questions about this feel free to ask. I don't have
time to actually do work in this space but I can answer whatever
questions you have.

Dave



* Re: [RFC 0/8] Aufs2 documents
  2009-02-26 17:20     ` David P. Quigley
@ 2009-02-27 14:27       ` hooanon05
  2009-02-27 18:17         ` David P. Quigley
  0 siblings, 1 reply; 42+ messages in thread
From: hooanon05 @ 2009-02-27 14:27 UTC (permalink / raw)
  To: David P. Quigley; +Cc: linux-fsdevel, linux-kernel


"David P. Quigley":
> an EOPNOTSUPP back. Considering things such as ACLs and SELinux labels
> are stored in xattrs it seems that failing a copyup on EOPNOTSUPP is a
> very reasonable thing to do.

Do you mean ... ?
- if aufs and its lower branch fs support xattr but its upper branch
  doesn't, then some copy-ups will fail.
- that is the user's choice.


> > So I am considering supporting xattr at several levels.
> > - support a minimum common set of keys only (if such a set exists)
> >   Here "minimum common set" means a group of keys which are surely
> >   supported by all filesystems. Aufs will filter out other keys.
> > - create a new internal status flag
> >   This flag is set when the types of all branches are the same. When the
> >   flag is set, aufs will handle xattr by simply redirecting.
> > - create a new aufs mount option
> >   The option will select between the two behaviours (above).
> 
> So I don't think this is a good way of going about it. The idea of
> having some flag which indicates just relay to the lower filesystems if
> they are all the same completely ignores that you may have several file
> systems which all support the required namespaces. One example I can

When all branch filesystems support the required xattr even if their
filesystem types differ, the user can specify the mount option (the third
level above) and all xattrs will be handled. When any of the xattrs is
not supported by the upper branch fs, then copyup will fail.
Do I make myself clear, or do I misunderstand you?


> If you have more questions about this feel free to ask. I don't have
> time to actually do work in this space but I can answer whatever
> questions you have.

I am afraid I don't fully understand what you wrote.
According to linux/Documentation/Smack.txt, "xattr support is not
strictly required". But for selinux (or another security mechanism),
xattr is necessary as you wrote.
Please tell me a URL where I can learn about security labels or
types. In particular, I don't know what the "iso9660_t" type is.
And do you believe the lack of xattr support is critical for aufs to
be merged?


J. R. Okajima


* Re: [RFC 0/8] Aufs2 documents
  2009-02-27 14:27       ` hooanon05
@ 2009-02-27 18:17         ` David P. Quigley
  2009-02-28  8:04           ` hooanon05
  0 siblings, 1 reply; 42+ messages in thread
From: David P. Quigley @ 2009-02-27 18:17 UTC (permalink / raw)
  To: hooanon05; +Cc: linux-fsdevel, linux-kernel

On Fri, 2009-02-27 at 23:27 +0900, hooanon05@yahoo.co.jp wrote:
> "David P. Quigley":
> > an EOPNOTSUPP back. Considering things such as ACLs and SELinux labels
> > are stored in xattrs it seems that failing a copyup on EOPNOTSUPP is a
> > very reasonable thing to do.
> 
> Do you mean ... ?
> - if aufs and its lower branch fs support xattr but its upper branch
>   doesn't, then some copy-ups will fail.
> - that is the user's choice.

I mean that in the event that an xattr can't be copied up to the next
read-write branch the copyup should fail. Otherwise you get the
situation where you can drop very important security information just by
touching a file on a read-only branch. If someone wants to guarantee
that they will always be able to copyup a given xattr it should be their
responsibility to ensure that every file system can handle it.

> 
> 
> > > So I am considering supporting xattr at several levels.
> > > - support a minimum common set of keys only (if such a set exists)
> > >   Here "minimum common set" means a group of keys which are surely
> > >   supported by all filesystems. Aufs will filter out other keys.
> > > - create a new internal status flag
> > >   This flag is set when the types of all branches are the same. When the
> > >   flag is set, aufs will handle xattr by simply redirecting.
> > > - create a new aufs mount option
> > >   The option will select between the two behaviours (above).
> > 
> > So I don't think this is a good way of going about it. The idea of
> > having some flag which indicates just relay to the lower filesystems if
> > they are all the same completely ignores that you may have several file
> > systems which all support the required namespaces. One example I can
> 
> When all branch filesystems support the required xattr even if their
> filesystem types differ, the user can specify the mount option (the third
> level above) and all xattrs will be handled. When any of the xattrs is
> not supported by the upper branch fs, then copyup will fail.
> Do I make myself clear, or do I misunderstand you?

I can see an "I know better" mount option being useful for this. I
understand what you're trying to do now.

> 
> 
> > If you have more questions about this feel free to ask. I don't have
> > time to actually do work in this space but I can answer whatever
> > questions you have.
> 
> I am afraid I don't fully understand what you wrote.
> According to linux/Documentation/Smack.txt, "xattr support is not
> strictly required". But for selinux (or another security mechanism),
> xattr is necessary as you wrote.
> Please tell me a URL where I can learn about security labels or
> types. In particular, I don't know what the "iso9660_t" type is.
> And do you believe the lack of xattr support is critical for aufs to
> be merged?

I don't know how many people are interested in using SELinux and Unionfs
so I can't say if it's critical for merging however I think it is
reasonable to expect any new file system to work with the security
mechanisms already in the kernel. Without at least basic xattr support
SELinux will have to fall back on assigning mount wide labels to any
aufs mount even if all the underlying file systems have security labels.

For information on SELinux there are the official papers on nsa.gov but
I also found this reference in a LWN article a while back [1]. It
contains a series of notes that the person took while learning SELinux
and they are well formatted. I haven't read them all thoroughly but a
quick glance over the initial concepts document seems to be accurate and
what you are looking for.

The point I was trying to make with iso9660_t is that since the xattr
interface is what is used for security labels regardless of whether the
underlying file system actually supports xattrs there are other places
that the security label may come from. Since an iso9660 file system does
not support extended attributes SELinux has a fixed type for all content
on that file system. When someone asks for security.selinux on an
iso9660 file system it goes to the inode, takes the security context in
there, and converts it into the string you see even though there is no
real xattr support.
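
From userspace this can be observed with an ordinary getxattr call. The
helper below is a minimal sketch and its name, read_security_label, is
hypothetical; os.getxattr() is the real CPython wrapper (Linux only).
Whether a label comes back, and what it contains, depends on the LSM,
the policy and the file system, so every failure is reported as "no
label" here.

```python
import os


def read_security_label(path, key="security.selinux"):
    """Return the security label exposed for path, or None."""
    getxattr = getattr(os, "getxattr", None)   # absent on non-Linux builds
    if getxattr is None:
        return None
    try:
        raw = getxattr(path, key)
    except OSError:
        # For example ENODATA or EOPNOTSUPP: nothing is exposed here.
        return None
    # The value may carry a trailing NUL byte; strip it if present.
    return raw.rstrip(b"\0").decode("utf-8", errors="replace")


label = read_security_label("/")
```

The caller cannot tell from the result whether the label was stored on
disk or synthesized from the in-core inode, which is the property
described above for iso9660.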


[1] http://equivocation.org/selinux

Dave



* Re: [RFC 0/8] Aufs2 documents
  2009-02-27 18:17         ` David P. Quigley
@ 2009-02-28  8:04           ` hooanon05
  0 siblings, 0 replies; 42+ messages in thread
From: hooanon05 @ 2009-02-28  8:04 UTC (permalink / raw)
  To: David P. Quigley; +Cc: linux-fsdevel, linux-kernel


"David P. Quigley":
> I mean that in the event that an xattr can't be copied up to the next
> read-write branch the copyup should fail. Otherwise you get the
	:::
> I can see an "I know better" mount option being useful for this. I
> understand what you're trying to do now.

Then I don't see a difference between what you wrote and my idea.
I think I should state first that aufs doesn't support xattr currently.


> The point I was trying to make with iso9660_t is that since the xattr
> interface is what is used for security labels regardless of whether the
> underlying file system actually supports xattrs there are other places
> that the security label may come from. Since an iso9660 file system does
> not support extended attributes SELinux has a fixed type for all content
> on that file system. When someone asks for security.selinux on an
> iso9660 file system it goes to the inode, takes the security context in
> there, and converts it into the string you see even though there is no
> real xattr support.

For example, on a system which adopts selinux, if a user copies a file
manually from iso9660 (or another fs which doesn't support xattr) to an
xattr-supported fs, then he will not face a security risk.
Is that correct?

And in the reverse case, copying a file manually from an xattr-supported
fs to a non-supported fs may drop security info. Since this is a manual
operation, the user should know the risk. Right?
In that case, what will security.selinux be for the copied file?
Will it be set to the default value which is defined by a global policy
or something?
If that is true, then what will happen when the global policy has the
highest security level? Will the user not face a security risk either?
Or is such a policy useless in the real world because its usability is
too low?

While these questions may sound silly, I just want to know in detail how
the protection will be damaged. I still don't know much about
selinux.
I understand that the case you want to clarify is the one where aufs
copies a file internally instead of manually. But will you tell me how an
internal file copy compares with a manual copy?

Anyway I guess aufs should support xattr in the future. I just want to
settle its priority, in particular whether it has to be implemented
before review/merge.


J. R. Okajima


Thread overview: 42+ messages
2009-02-23  7:31 [RFC 0/8] Aufs2 documents hooanon05
2009-02-23  7:33 ` [RFC 1/8] Aufs2: introduction hooanon05
2009-02-23  7:34 ` [RFC 2/8] Aufs2: structure hooanon05
2009-02-23  9:13   ` Tomas M
2009-02-23  9:22     ` Tomas M
2009-02-24  8:13       ` New filesystem for Linux kernel Tomas M
2009-02-24 11:52         ` Miklos Szeredi
2009-02-24 13:18           ` hooanon05
2009-02-24 13:45             ` Tarkan Erimer
2009-02-24 13:57               ` hooanon05
2009-02-24 14:16                 ` Tarkan Erimer
2009-02-24 14:50             ` Miklos Szeredi
2009-02-24 16:26               ` hooanon05
2009-02-25 10:28                 ` Miklos Szeredi
2009-02-26  4:09                   ` hooanon05
2009-02-26  5:51               ` hooanon05
2009-02-26  5:55                 ` hooanon05
2009-02-24 14:15         ` Theodore Tso
2009-02-24 15:18           ` David P. Quigley
2009-02-24 15:41             ` hooanon05
2009-02-25 15:53               ` David P. Quigley
2009-02-26  4:21                 ` hooanon05
2009-02-25  7:31             ` Tomas M
2009-02-25  9:33               ` David Newall
2009-02-25  8:12           ` Tomas M
2009-02-26 14:31           ` Amit Kucheria
2009-02-23 14:23     ` [RFC 2/8] Aufs2: structure hooanon05
2009-02-23  7:35 ` [RFC 3/8] Aufs2: lookup hooanon05
2009-02-23  7:36 ` [RFC 4/8] Aufs2: branch hooanon05
2009-02-23  7:36 ` [RFC 5/8] Aufs2: wbr_policy hooanon05
2009-02-23  7:37 ` [RFC 6/8] Aufs2: fmode_exec hooanon05
2009-02-23  7:37 ` [RFC 7/8] Aufs2: mmap hooanon05
2009-02-23  9:18   ` Tomas M
2009-02-23 14:39     ` hooanon05
2009-02-23  7:38 ` [RFC 8/8] Aufs2: plan hooanon05
2009-02-25 17:50 ` [RFC 0/8] Aufs2 documents David P. Quigley
2009-02-25 19:07   ` Matthew Wilcox
2009-02-26  4:54   ` hooanon05
2009-02-26 17:20     ` David P. Quigley
2009-02-27 14:27       ` hooanon05
2009-02-27 18:17         ` David P. Quigley
2009-02-28  8:04           ` hooanon05
