* [Ocfs2-devel] [RFC] The reflink(2) system call v4.
@ 2009-05-11 20:40 ` Joel Becker
0 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-11 20:40 UTC (permalink / raw)
To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Thu, May 07, 2009 at 08:10:18PM -0700, Joel Becker wrote:
> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
> > You certainly did not address:
> >
> > - desire for one single system call to handle both
> > owner preservation and create with current owner.
>
> Nope, and I don't intend to. reflink() is a snapshotting call,
> not a kitchen sink.
I've been thinking about this all weekend. The current state
doesn't make me happy.
Now, what concerns me here is the interface to userspace. The
system call itself. I don't care if we implement it via one vfs_foo()
or 10 nor how many iops we end up with. We can and will modify those as
we find better ideas. But I want reflink(2) to have a semantic that is
easily understood and intuitive.
When I initially designed reflink(), I hadn't thought about the
ownership and permission implications of snapshotting. I was having too
much fun reflinking files around. In that iteration, anyone could
reflink a file. But a true snapshot needs ownership, permissions, acls,
and other security attributes (in all, I'm gonna call that the "security
context") as well. So I defined reflink() as such. This meant
requiring privileges, but lost some of the flexibility of the call. I
call that a loss.
What I'm not going to do is add optional behaviors to the system
call. It should be pretty obvious what it does, or we're doing it
wrong. The 'flags' field of reflinkat(2) is for AT_* flags.
When I decided on requiring privileges, I thought that degrading
without privileges was too confusing. I was wrong. I want reflink() to
fit into the pantheon of file system operations in a way that makes
sense alongside the others, and this isn't it.
Here's v4 of reflink(). If you have the privileges, you get the
full snapshot. If you don't, you must have read access, and then you
get the entire snapshot (data and extended attributes) except that the
security context is reinitialized. That's it. It fits with most of the
other ops, and it's a clean degradation.
I add a flag to iops->reflink() so that the filesystem knows what
to do with the security context. That's the only change visible outside
of vfs_reflink().
Security folks, check my work. Everyone else, let me know if
this satisfies.
Joel
From 1ebf4c2cf36d38b22de025b03753497466e18941 Mon Sep 17 00:00:00 2001
From: Joel Becker <joel.becker@oracle.com>
Date: Sat, 2 May 2009 22:48:59 -0700
Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.
The userspace-visible idea of the operation is:
int reflink(const char *oldpath, const char *newpath);
int reflinkat(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath, int flags);
The kernel only implements reflinkat(2). reflink(3) is a trivial
wrapper around reflinkat(2).
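As a rough illustration of how trivial that wrapper is, the sketch below forwards to a reflinkat-style function with AT_FDCWD for both directory descriptors and no flags. Because reflinkat(2) exists only as a proposal in this patch, the sketch takes the implementation as a function pointer instead of issuing a real system call; reflinkat_fn, reflink_with(), fake_reflinkat(), and last_call are all hypothetical names used for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>

/* Fallback in case <fcntl.h> hides AT_FDCWD under strict feature macros. */
#ifndef AT_FDCWD
#define AT_FDCWD (-100)
#endif

/* Hypothetical type for whatever ends up providing reflinkat(2). */
typedef int (*reflinkat_fn)(int olddirfd, const char *oldpath,
			    int newdirfd, const char *newpath, int flags);

/* reflink(oldpath, newpath) expressed in terms of reflinkat(). */
static int reflink_with(reflinkat_fn do_reflinkat,
			const char *oldpath, const char *newpath)
{
	assert(do_reflinkat != NULL);
	return do_reflinkat(AT_FDCWD, oldpath, AT_FDCWD, newpath, 0);
}

/* Stand-in implementation that only records its arguments. */
static struct {
	int olddirfd, newdirfd, flags;
	const char *oldpath, *newpath;
} last_call;

static int fake_reflinkat(int olddirfd, const char *oldpath,
			  int newdirfd, const char *newpath, int flags)
{
	last_call.olddirfd = olddirfd;
	last_call.oldpath = oldpath;
	last_call.newdirfd = newdirfd;
	last_call.newpath = newpath;
	last_call.flags = flags;
	return 0;
}
```

Calling reflink_with(fake_reflinkat, "old.img", "snap.img") leaves both directory descriptors set to AT_FDCWD and flags set to 0 in last_call, which mirrors how link(2) relates to linkat(2).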
The reflink() system call creates reference-counted links. It creates
a new file that shares the data extents of the source file in a
copy-on-write fashion. Its calling semantics are identical to link(2)
and linkat(2). Once complete, programs see the new file as a completely
separate entry.
reflink() attempts to preserve ownership, permissions, and security
contexts in order to create a full snapshot. Preserving those
attributes requires ownership or CAP_CHOWN. A caller without those
privileges will see the security context of the new file initialized to
their default.
In the VFS, ->reflink() is an inode_operation with almost the same
arguments as ->link(); an additional argument tells the filesystem to
copy over or reinitialize the security context on the new file.
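A toy model may help show what the extra argument asks of a filesystem. The sketch below is not kernel code: struct toy_inode, struct toy_creds, and toy_reflink() are hypothetical, and the choice of 0666 masked by the umask for the reinitialized permissions is purely an assumption, since the patch only says the security context is initialized as for any new file.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical, heavily simplified stand-in for an inode. */
struct toy_inode {
	unsigned long ino;
	unsigned int uid, gid, mode;	/* mode holds permission bits only */
	unsigned int nlink;
	time_t mtime, ctime;
	int extents_id;			/* identifies the shared data extents */
};

/* Hypothetical credentials of the caller. */
struct toy_creds {
	unsigned int fsuid, fsgid, umask_bits;
};

static void toy_reflink(const struct toy_inode *src, struct toy_inode *dst,
			unsigned long new_ino, time_t now,
			const struct toy_creds *creds, bool preserve_security)
{
	assert(src != NULL && dst != NULL && creds != NULL);

	*dst = *src;		/* start identical: shared data, mtime, attributes */
	dst->ino = new_ino;	/* the new file is a distinct inode */
	dst->nlink = 1;		/* link count of the new file is one */
	dst->ctime = now;	/* ctime reflects creation of the new file */

	if (!preserve_security) {
		/* Assumption: behave like a file freshly created by the caller. */
		dst->uid = creds->fsuid;
		dst->gid = creds->fsgid;
		dst->mode = 0666 & ~creds->umask_bits;
	}
}
```

With preserve_security set, the new inode keeps the source's owner and mode; without it, only ownership and permissions change, while the shared data and the mtime still match the source.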
A new LSM hook, security_inode_reflink(), is added. None of the
existing LSM hooks appeared to fit.
XXX: Currently only adds the x86_32 linkage. The rest of the
architectures belong here too.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
Documentation/filesystems/reflink.txt | 165 +++++++++++++++++++++++++++++++++
Documentation/filesystems/vfs.txt | 4 +
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/kernel/syscall_table_32.S | 1 +
fs/namei.c | 113 ++++++++++++++++++++++
include/linux/fs.h | 2 +
include/linux/security.h | 16 +++
include/linux/syscalls.h | 2 +
security/capability.c | 6 +
security/security.c | 7 ++
10 files changed, 317 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/reflink.txt
diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
new file mode 100644
index 0000000..aa7380f
--- /dev/null
+++ b/Documentation/filesystems/reflink.txt
@@ -0,0 +1,165 @@
+reflink(2)
+==========
+
+
+INTRODUCTION
+------------
+
+A reflink is a reference-counted link. The reflink(2) operation is
+analogous to the link(2) operation, except that instead of two directory
+entries pointing to the same inode, there are two identical inodes
+pointing to the same data. Writes do not modify the shared data; they
+use copy-on-write (CoW). Thus, after the reflink has been created, the
+inodes can diverge without impacting each other.
+
+
+SYNOPSIS
+--------
+
+The reflink(2) call looks just like link(2):
+
+ int reflink(const char *oldpath, const char *newpath);
+
+The actual system call is reflinkat(2):
+
+ int reflinkat(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath, int flags);
+
+For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
+The reflink(2) call won't be implemented by the kernel, because it's a
+trivial wrapper around reflinkat(2).
+
+
+DESCRIPTION
+-----------
+
+One way of viewing reflink is to look at the level of sharing. A
+symbolic link does its sharing at the directory entry level; many names
+end up pointing at the same directory entry. Hard links are one step
+down. Multiple directory entries are sharing one inode. Reflinks are
+down one more level: multiple inodes share the same data extents.
+
+When you symlink a file, you can then access it via the symlink or the
+real directory entry, and for the most part they look identical. When
+accessing more than one name for a hard link, the object returned looks
+identical. Similarly, a newly created reflink is identical to its
+source in almost every way and can be treated as such. This includes
+ownership, permissions, security context, and data. The only things
+that are different are the inode number, the link count, and the ctime.
+
+A reflink is a snapshot of the source file at the time it is created.
+
+Once created, though, a reflink can be modified like any other normal
+file without affecting the source file. Changes to trivial fields like
+permissions, owner, or times are guaranteed not to trigger CoW of file
+data and will not return any error that wouldn't happen on a truly
+distinct file. Changes to the file's data will trigger CoW of the data
+affected - the actual CoW granularity is up to the filesystem, from
+exact bytes up to the entire file. ocfs2, for example, will copy out an
+entire extent or 1MB, whichever is smaller.
+
+Preserving the security context of the source file obviously requires
+the privilege to do so. Callers that do not own the source file and do
+not have CAP_CHOWN will get a new reflink with all non-security
+attributes preserved; the security context of the new reflink will be
+that of a file newly created by that user.
+
+Partial reflinks are not allowed. The new inode will only appear in the
+directory structure after it is fully formed. This prevents a crash or
+lack of space from creating a partial reflink.
+
+If a filesystem does not support reflinks, the kernel and libc MUST NOT
+fake it. Callers are expecting to get snapshots, and faking it will
+violate that trust.
+
+The userspace view is as follows. When reflink(2) returns, opening
+oldpath and newpath returns identical-looking files, just like link(2).
+After that, oldpath and newpath behave as distinct files, and
+modifications to one have no impact on the other.
+
+
+RESTRICTIONS
+------------
+
+Just as the sharing gets lower as you move from symlink() -> link() ->
+reflink(), the restrictions on the call get tighter. A symlink doesn't
+require any access permissions other than being able to create its
+inode. It can cross filesystems and mount points, and it can point to
+any type of file. A hard link requires both source and target to be on
+the same filesystem under the same mount point, and that the source not
+be a directory. Like hard links and symlinks, a reflink cannot be
+created if newpath exists.
+
+Reflinks add one big restriction on top of hard links: only the owner
+or someone with elevated privileges (CAP_CHOWN) can preserve the
+security context (permissions, ownership, ACLs, etc) across a reflink.
+A reflink is a point-in-time snapshot of a file. Without the
+appropriate privilege, the caller will see their own default security
+context applied to the file.
+
+A caller without the privileges to preserve the security context must
+have read access to reflink a file.
+
+
+SHARING
+-------
+
+A reflink creates a new inode. It shares all data extents of the source
+file; this includes file data and extended attribute data. All of the
+sharing is in a CoW fashion, and any modification of the data will break
+the sharing.
+
+For some filesystems, certain data structures are not in allocated
+storage extents. Creating a reflink might make a copy of these structures.
+An example is ext3's ability to store small extended attributes inside
+the ext3 inode. Since a reflink is creating a new inode, those extended
+attributes are merely copied to the new inode.
+
+
+EXCEPTIONS
+----------
+
+All file attributes and extended attributes of the new file must be
+identical to the source file with the following exceptions:
+
+- The new file must have a new inode number. This allows POSIX
+ programs to treat the source and new files as separate objects. From
+ the view of the POSIX application, the files are distinct. The
+ sharing is invisible outside of the filesystem's internal structures.
+- The ctime of the source file only changes if the source's metadata
+ must be changed to accommodate the copy-on-write linkage. The ctime
+ of the new file is set to represent its creation.
+- The link count of the source file is unchanged, and the link count of
+ the new file is one.
+- If the caller lacks the privileges to preserve the security context,
+ the file will have its security context initialized as would any new
+ file.
+
+The mtime of the source file is unmodified, and the mtime of the new
+file is set identical to the source file. This reflects that the data
+is unchanged.
+
+
+INODE OPERATION
+---------------
+
+Filesystems implement the ->reflink() inode operation. It has almost
+the same prototype as ->link():
+
+ int (*reflink)(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *new_dentry, int preserve_security);
+
+When the filesystem is called, the VFS has already checked the
+permissions and mountpoint of the operation. It has determined whether
+the security context should be preserved or reinitialized, as specified
+by the preserve_security argument. The filesystem just needs to create
+the new inode identical to the old one with the exceptions noted above,
+link up the shared data extents, and then link the new inode into dir.
+
+
+FOLLOWING SYMBOLIC LINKS
+------------------------
+
+reflink() dereferences symbolic links in the same manner that link(2)
+does. The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..01cd810 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -333,6 +333,7 @@ struct inode_operations {
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
+ int (*reflink) (struct dentry *,struct inode *,struct dentry *,int);
};
Again, all methods are called without any locks being held, unless
@@ -431,6 +432,9 @@ otherwise noted.
truncate_range: a method provided by the underlying filesystem to truncate a
range of blocks , i.e. punch a hole somewhere in a file.
+ reflink: called by the reflink(2) system call. Only required if you want
+ to support reflinks. For further information, see
+ Documentation/filesystems/reflink.txt.
The Address Space Object
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..c368563 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
#define __NR_inotify_init1 332
#define __NR_preadv 333
#define __NR_pwritev 334
+#define __NR_reflinkat 335
#ifdef __KERNEL__
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..d11c200 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
.long sys_inotify_init1
.long sys_preadv
.long sys_pwritev
+ .long sys_reflinkat /* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..34a6ce5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,118 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
}
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
+{
+ struct inode *inode = old_dentry->d_inode;
+ int error;
+ int preserve_security = 1;
+
+ if (!inode)
+ return -ENOENT;
+
+ /*
+ * If the caller has the rights, reflink() will preserve the
+ * security context of the source inode.
+ */
+ if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
+ preserve_security = 0;
+ if ((current_fsuid() != inode->i_uid) &&
+ !in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
+ preserve_security = 0;
+
+ /*
+ * If the caller doesn't have the right to preserve the security
+ * context, the caller is only getting the data and extended
+ * attributes. They need read permission on the file.
+ */
+ if (!preserve_security) {
+ error = inode_permission(inode, MAY_READ);
+ if (error)
+ return error;
+ }
+
+ error = may_create(dir, new_dentry);
+ if (error)
+ return error;
+
+ if (dir->i_sb != inode->i_sb)
+ return -EXDEV;
+
+ /*
+ * A reflink to an append-only or immutable file cannot be created.
+ */
+ if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+ return -EPERM;
+ if (!dir->i_op->reflink)
+ return -EPERM;
+ if (S_ISDIR(inode->i_mode))
+ return -EPERM;
+
+ error = security_inode_reflink(old_dentry, dir);
+ if (error)
+ return error;
+
+ mutex_lock(&inode->i_mutex);
+ vfs_dq_init(dir);
+ error = dir->i_op->reflink(old_dentry, dir, new_dentry,
+ preserve_security);
+ mutex_unlock(&inode->i_mutex);
+ if (!error)
+ fsnotify_create(dir, new_dentry);
+ return error;
+}
+
+SYSCALL_DEFINE5(reflinkat, int, olddfd, const char __user *, oldname,
+ int, newdfd, const char __user *, newname, int, flags)
+{
+ struct dentry *new_dentry;
+ struct nameidata nd;
+ struct path old_path;
+ int error;
+ char *to;
+
+ if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+ return -EINVAL;
+
+ error = user_path_at(olddfd, oldname,
+ flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+ &old_path);
+ if (error)
+ return error;
+
+ error = user_path_parent(newdfd, newname, &nd, &to);
+ if (error)
+ goto out;
+ error = -EXDEV;
+ if (old_path.mnt != nd.path.mnt)
+ goto out_release;
+ new_dentry = lookup_create(&nd, 0);
+ error = PTR_ERR(new_dentry);
+ if (IS_ERR(new_dentry))
+ goto out_unlock;
+ error = mnt_want_write(nd.path.mnt);
+ if (error)
+ goto out_dput;
+ error = security_path_link(old_path.dentry, &nd.path, new_dentry);
+ if (error)
+ goto out_drop_write;
+ error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode, new_dentry);
+out_drop_write:
+ mnt_drop_write(nd.path.mnt);
+out_dput:
+ dput(new_dentry);
+out_unlock:
+ mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+ path_put(&nd.path);
+ putname(to);
+out:
+ path_put(&old_path);
+
+ return error;
+}
+
+
/*
* The worst of all namespace operations - renaming directory. "Perverted"
* doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +3002,7 @@ EXPORT_SYMBOL(unlock_rename);
EXPORT_SYMBOL(vfs_create);
EXPORT_SYMBOL(vfs_follow_link);
EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
EXPORT_SYMBOL(vfs_mkdir);
EXPORT_SYMBOL(vfs_mknod);
EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..0a5c807 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *);
/*
* VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
loff_t len);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
+ int (*reflink) (struct dentry *,struct inode *,struct dentry *,int);
};
struct seq_file;
diff --git a/include/linux/security.h b/include/linux/security.h
index d5fd616..ea9cd93 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -528,6 +528,14 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
* @inode contains a pointer to the inode.
* @secid contains a pointer to the location where result will be saved.
* In case of failure, @secid will be set to zero.
+ * @inode_reflink:
+ * Check permission before creating a new reference-counted link to
+ * a file.
+ * @old_dentry contains the dentry structure for an existing link to
+ * the file.
+ * @dir contains the inode structure of the parent directory of the
+ * new reflink.
+ * Return 0 if permission is granted.
*
* Security hooks for file operations
*
@@ -1415,6 +1423,7 @@ struct security_operations {
int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
int (*inode_symlink) (struct inode *dir,
struct dentry *dentry, const char *old_name);
+ int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir);
int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
@@ -1675,6 +1684,7 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
int security_inode_unlink(struct inode *dir, struct dentry *dentry);
int security_inode_symlink(struct inode *dir, struct dentry *dentry,
const char *old_name);
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir);
int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
@@ -2056,6 +2066,12 @@ static inline int security_inode_symlink(struct inode *dir,
return 0;
}
+static inline int security_inode_reflink(struct dentry *old_dentry,
+ struct inode *dir)
+{
+ return 0;
+}
+
static inline int security_inode_mkdir(struct inode *dir,
struct dentry *dentry,
int mode)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..35a8743 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,8 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
int newdfd, const char __user * newname);
asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
int newdfd, const char __user *newname, int flags);
+asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
+ int newdfd, const char __user *newname, int flags);
asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
int newdfd, const char __user * newname);
asmlinkage long sys_futimesat(int dfd, char __user *filename,
diff --git a/security/capability.c b/security/capability.c
index 21b6cea..3dcc4cc 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -172,6 +172,11 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
return 0;
}
+static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode)
+{
+ return 0;
+}
+
static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
int mask)
{
@@ -905,6 +910,7 @@ void security_fixup_ops(struct security_operations *ops)
set_to_cap_if_null(ops, inode_link);
set_to_cap_if_null(ops, inode_unlink);
set_to_cap_if_null(ops, inode_symlink);
+ set_to_cap_if_null(ops, inode_reflink);
set_to_cap_if_null(ops, inode_mkdir);
set_to_cap_if_null(ops, inode_rmdir);
set_to_cap_if_null(ops, inode_mknod);
diff --git a/security/security.c b/security/security.c
index 5284255..70d0ac3 100644
--- a/security/security.c
+++ b/security/security.c
@@ -470,6 +470,13 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
return security_ops->inode_symlink(dir, dentry, old_name);
}
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir)
+{
+ if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+ return 0;
+ return security_ops->inode_reflink(old_dentry, dir);
+}
+
int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
if (unlikely(IS_PRIVATE(dir)))
--
1.6.1.3
--
"Three o'clock is always too late or too early for anything you
want to do."
- Jean-Paul Sartre
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127
* Re: [RFC] The reflink(2) system call v4.
2009-05-11 20:40 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-11 22:27 ` James Morris
-1 siblings, 0 replies; 304+ messages in thread
From: James Morris @ 2009-05-11 22:27 UTC (permalink / raw)
To: Joel Becker
Cc: jim owens, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Mon, 11 May 2009, Joel Becker wrote:
> and other security attributes (in all, I'm gonna call that the "security
> context") as well. So I defined reflink() as such. This meant
"security context" is an term associated with SELinux, so you may want to
use something like "security attributes" or "security state" to avoid
confusing people.
> + error = security_inode_reflink(old_dentry, dir);
> + if (error)
> + return error;
We'll need the new_dentry now, to set up new security state before the
dentry is instantiated.
e.g. SELinux will need to perform some checks on the operation, then
calculate a new security context for the new file.
- James
--
James Morris
<jmorris@namei.org>
* Re: [RFC] The reflink(2) system call v4.
2009-05-11 22:27 ` [Ocfs2-devel] " James Morris
@ 2009-05-11 22:34 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-11 22:34 UTC (permalink / raw)
To: James Morris
Cc: jim owens, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Tue, May 12, 2009 at 08:27:17AM +1000, James Morris wrote:
> On Mon, 11 May 2009, Joel Becker wrote:
>
> > and other security attributes (in all, I'm gonna call that the "security
> > context") as well. So I defined reflink() as such. This meant
>
> "security context" is an term associated with SELinux, so you may want to
> use something like "security attributes" or "security state" to avoid
> confusing people.
Ok, I wondered if my brain had picked that out from somewhere.
> > + error = security_inode_reflink(old_dentry, dir);
> > + if (error)
> > + return error;
>
> We'll need the new_dentry now, to set up new security state before the
> dentry is instantiated.
>
> e.g. SELinux will need to perform some checks on the operation, then
> calculate a new security context for the new file.
Do I need to pass in preserve_security as well so SELinux knows
what the ownership check determined?
Joel
--
"Copy from one, it's plagiarism; copy from two, it's research."
- Wilson Mizner
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
* Re: [RFC] The reflink(2) system call v4.
2009-05-11 22:34 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-12 1:12 ` James Morris
-1 siblings, 0 replies; 304+ messages in thread
From: James Morris @ 2009-05-12 1:12 UTC (permalink / raw)
To: Joel Becker
Cc: jim owens, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Mon, 11 May 2009, Joel Becker wrote:
> > e.g. SELinux will need to perform some checks on the operation, then
> > calculate a new security context for the new file.
>
> Do I need to pass in preserve_security as well so SELinux knows
> what the ownership check determined?
Not for SELinux -- its security attributes are orthogonal to DAC, and it
will perform its own checks on them.
Other LSMs should operate similarly (there is also the CAP_CHOWN check
which the LSM may hook), although if not, the flag can be added later if
required.
- James
--
James Morris
<jmorris@namei.org>
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 1:12 ` [Ocfs2-devel] " James Morris
@ 2009-05-12 12:18 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-12 12:18 UTC (permalink / raw)
To: James Morris
Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Tue, 2009-05-12 at 11:12 +1000, James Morris wrote:
> On Mon, 11 May 2009, Joel Becker wrote:
>
> > > e.g. SELinux will need to perform some checks on the operation, then
> > > calculate a new security context for the new file.
> >
> > Do I need to pass in preserve_security as well so SELinux knows
> > what the ownership check determined?
>
> Not for SELinux -- its security attributes are orthogonal to DAC, and it
> will perform its own checks on them.
Is preserve_security supposed to also control the preservation of the
SELinux security attribute (security.selinux extended attribute)? I'd
expect that either we preserve all the security-relevant attributes or
none of them. And if that is the case, then SELinux has to know about
preserve_security in order to know what the security context of the new
inode will be.
Also, if you are going to automatically degrade reflink(2) behavior
based on the owner_or_cap test, then you ought to allow the same to be
true if the security module vetoes the attempt to preserve attributes.
Either DAC or MAC logic may say that security attributes cannot be
preserved. Your current logic will only allow graceful degradation in
the DAC case, but the MAC case will remain a hard failure.
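One possible reading of this suggestion, as a hedged sketch: treat a MAC refusal to preserve attributes the same way as a DAC refusal, so that both degrade to a reinitialized reflink, and keep a hard failure only for cases where the operation itself is denied. Everything here is hypothetical, including struct reflink_checks and decide_preserve(); in particular, whether read access should be the gate for the degraded case under MAC is an assumption carried over from the DAC logic in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the checks made before calling ->reflink(). */
struct reflink_checks {
	bool dac_allows_preserve;	/* owner or CAP_CHOWN */
	bool mac_allows_preserve;	/* security module permits copying attributes */
	bool mac_allows_reflink;	/* security module permits the operation at all */
	bool can_read;			/* caller has read access to the source */
};

/* Returns 1 to preserve, 0 to reinitialize, or a negative errno. */
static int decide_preserve(const struct reflink_checks *k)
{
	assert(k != NULL);

	/* A veto on the operation itself is still a hard failure. */
	if (!k->mac_allows_reflink)
		return -EACCES;

	/* Preserve only when both DAC and MAC agree. */
	if (k->dac_allows_preserve && k->mac_allows_preserve)
		return 1;

	/* Otherwise degrade, whichever side objected. */
	if (k->can_read)
		return 0;

	return -EACCES;
}
```

Under this reading, an owner whose security module objects to copying the attributes would end up with a reinitialized reflink instead of an error, which is the symmetry between the DAC and MAC cases being asked for.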
> Other LSMs should operate similarly (there is also the CAP_CHOWN check
> which the LSM may hook), although if not, the flag can be added later if
> required.
>
>
> - James
--
Stephen Smalley
National Security Agency
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 12:18 ` [Ocfs2-devel] " Stephen Smalley
@ 2009-05-12 17:22 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-12 17:22 UTC (permalink / raw)
To: Stephen Smalley
Cc: James Morris, linux-fsdevel, linux-security-module, mtk.manpages,
jim owens, ocfs2-devel, viro
On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> On Tue, 2009-05-12 at 11:12 +1000, James Morris wrote:
> > On Mon, 11 May 2009, Joel Becker wrote:
> >
> > > > e.g. SELinux will need to perform some checks on the operation, then
> > > > calculate a new security context for the new file.
> > >
> > > Do I need to pass in preserve_security as well so SELinux knows
> > > what the ownership check determined?
> >
> > Not for SELinux -- its security attributes are orthogonal to DAC, and it
> > will perform its own checks on them.
>
> Is preserve_security supposed to also control the preservation of the
> SELinux security attribute (security.selinux extended attribute)? I'd
> expect that either we preserve all the security-relevant attributes or
> none of them. And if that is the case, then SELinux has to know about
> preserve_security in order to know what the security context of the new
> inode will be.
Thank you Stephen, you read my mind. In the ocfs2 case, we're
expecting to just reflink the extended attribute structures verbatim in
the preserve_security case. So we would be ignoring whatever was set on
the new_dentry by security_inode_reflink(). This gets us the best CoW
sharing of the xattr extents, but I want to make sure that's "safe" in
the preserve_security case.
> Also, if you are going to automatically degrade reflink(2) behavior
> based on the owner_or_cap test, then you ought to allow the same to be
> true if the security module vetoes the attempt to preserve attributes.
> Either DAC or MAC logic may say that security attributes cannot be
> preserved. Your current logic will only allow graceful degradation in
> the DAC case, but the MAC case will remain a hard failure.
I did not think of this, and it's a very good point as well. I'm
not sure how to have the return value of security_inode_reflink()
distinguish between "disallow the reflink" and "disallow
preserve_security". But since !preserve_security requires read access
only, perhaps we move security_inode_reflink up higher and say:
	error = security_inode_reflink(old_dentry, dir);
	if (error)
		preserve_security = 0;
Here security_inode_reflink() does not need new_dentry, because it isn't
setting a security context. If it's ok with the reflink, we'll be
copying the extended attribute. If it's not OK, it falls through to the
inode_permission(inode, MAY_READ) check, which will check for plain old
read access.
What do we think?
Joel
--
"Under capitalism, man exploits man. Under Communism, it's just
the opposite."
- John Kenneth Galbraith
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 17:22 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-12 17:32 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-12 17:32 UTC (permalink / raw)
To: Joel Becker
Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
> On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> > On Tue, 2009-05-12 at 11:12 +1000, James Morris wrote:
> > > On Mon, 11 May 2009, Joel Becker wrote:
> > >
> > > > > e.g. SELinux will need to perform some checks on the operation, then
> > > > > calculate a new security context for the new file.
> > > >
> > > > Do I need to pass in preserve_security as well so SELinux knows
> > > > what the ownership check determined?
> > >
> > > Not for SELinux -- its security attributes are orthogonal to DAC, and it
> > > will perform its own checks on them.
> >
> > Is preserve_security supposed to also control the preservation of the
> > SELinux security attribute (security.selinux extended attribute)? I'd
> > expect that either we preserve all the security-relevant attributes or
> > none of them. And if that is the case, then SELinux has to know about
> > preserve_security in order to know what the security context of the new
> > inode will be.
>
> Thank you Stephen, you read my mind. In the ocfs2 case, we're
> expecting to just reflink the extended attribute structures verbatim in
> the preserve_security case.
And in the preserve_security==0 case, you'll be calling
security_inode_init_security() in order to get the attribute name/value
pair to assign to the new inode just as in the normal file creation
case?
> So we would be ignoring whatever was set on
> the new_dentry by security_inode_reflink(). This gets us the best CoW
> sharing of the xattr extents, but I want to make sure that's "safe" in
> the preserve_security case.
security_inode_reflink() can't handle the initialization regardless, as
the inode doesn't yet exist at that point.
> > Also, if you are going to automatically degrade reflink(2) behavior
> > based on the owner_or_cap test, then you ought to allow the same to be
> > true if the security module vetoes the attempt to preserve attributes.
> > Either DAC or MAC logic may say that security attributes cannot be
> > preserved. Your current logic will only allow graceful degradation in
> > the DAC case, but the MAC case will remain a hard failure.
>
> I did not think of this, and it's a very good point as well. I'm
> not sure how to have the return value of security_inode_reflink()
> distinguish between "disallow the reflink" and "disallow
> preserve_security". But since !preserve_security requires read access
> only, perhaps we move security_inode_reflink up higher and say:
>
> 	error = security_inode_reflink(old_dentry, dir);
> 	if (error)
> 		preserve_security = 0;
>
> Here security_inode_reflink() does not need new_dentry, because it isn't
> setting a security context. If it's ok with the reflink, we'll be
> copying the extended attribute. If it's not OK, it falls through to the
> inode_permission(inode, MAY_READ) check, which will check for plain old
> read access.
> What do we think?
I'd rather have two hooks, one to allow the security module to override
preserve_security and one to allow the security module to deny the
operation altogether. The former hook only needs to be called if
preserve_security is not already cleared by the DAC logic. The latter
hook needs to know the final verdict on preserve_security in order to
determine the right set of checks to apply, which isn't necessarily
limited to only checking read access.
But we don't need the new_dentry regardless.
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 17:32 ` [Ocfs2-devel] " Stephen Smalley
@ 2009-05-12 18:03 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-12 18:03 UTC (permalink / raw)
To: Stephen Smalley
Cc: James Morris, linux-fsdevel, linux-security-module, mtk.manpages,
jim owens, ocfs2-devel, viro
On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote:
> On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
> > On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> > > Is preserve_security supposed to also control the preservation of the
> > > SELinux security attribute (security.selinux extended attribute)? I'd
> > > expect that either we preserve all the security-relevant attributes or
> > > none of them. And if that is the case, then SELinux has to know about
> > > preserve_security in order to know what the security context of the new
> > > inode will be.
> >
> > Thank you Stephen, you read my mind. In the ocfs2 case, we're
> > expecting to just reflink the extended attribute structures verbatim in
> > the preserve_security case.
>
> And in the preserve_security==0 case, you'll be calling
> security_inode_init_security() in order to get the attribute name/value
> pair to assign to the new inode just as in the normal file creation
> case?
Oh, absolutely.
As an aside, do inodes ever have more than one security.*
attribute? It would appear that security_inode_init_security() just
returns one attribute, but what if I had a system running under SMACK
and then changed to SELinux? Would my (existing) inode then have
security.smack and security.selinux attributes?
> > > Also, if you are going to automatically degrade reflink(2) behavior
> > > based on the owner_or_cap test, then you ought to allow the same to be
> > > true if the security module vetoes the attempt to preserve attributes.
> > > Either DAC or MAC logic may say that security attributes cannot be
> > > preserved. Your current logic will only allow graceful degradation in
> > > the DAC case, but the MAC case will remain a hard failure.
> >
> > I did not think of this, and it's a very good point as well. I'm
> > not sure how to have the return value of security_inode_reflink()
> > distinguish between "disallow the reflink" and "disallow
> > preserve_security". But since !preserve_security requires read access
> > only, perhaps we move security_inode_reflink up higher and say:
> >
> > 	error = security_inode_reflink(old_dentry, dir);
> > 	if (error)
> > 		preserve_security = 0;
> >
> > Here security_inode_reflink() does not need new_dentry, because it isn't
> > setting a security context. If it's ok with the reflink, we'll be
> > copying the extended attribute. If it's not OK, it falls through to the
> > inode_permission(inode, MAY_READ) check, which will check for plain old
> > read access.
> > What do we think?
>
> I'd rather have two hooks, one to allow the security module to override
> preserve_security and one to allow the security module to deny the
> operation altogether. The former hook only needs to be called if
> preserve_security is not already cleared by the DAC logic. The latter
> hook needs to know the final verdict on preserve_security in order to
> determine the right set of checks to apply, which isn't necessarily
> limited to only checking read access.
Ok, is that two hooks or one hook with specific error returns?
I don't care, it's up to the LSM group. I just can't come up with a
good distinguishing set of names if it's two hooks :-)
Joel
--
Life's Little Instruction Book #157
"Take time to smell the roses."
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 18:03 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-12 18:04 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-12 18:04 UTC (permalink / raw)
To: Joel Becker
Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote:
> On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote:
> > On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
> > > On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> > > > Is preserve_security supposed to also control the preservation of the
> > > > SELinux security attribute (security.selinux extended attribute)? I'd
> > > > expect that either we preserve all the security-relevant attributes or
> > > > none of them. And if that is the case, then SELinux has to know about
> > > > preserve_security in order to know what the security context of the new
> > > > inode will be.
> > >
> > > Thank you Stephen, you read my mind. In the ocfs2 case, we're
> > > expecting to just reflink the extended attribute structures verbatim in
> > > the preserve_security case.
> >
> > And in the preserve_security==0 case, you'll be calling
> > security_inode_init_security() in order to get the attribute name/value
> > pair to assign to the new inode just as in the normal file creation
> > case?
>
> Oh, absolutely.
> As an aside, do inodes ever have more than one security.*
> attribute? It would appear that security_inode_init_security() just
> returns one attribute, but what if I had a system running under SMACK
> and then changed to SELinux? Would my (existing) inode then have
> security.smack and security.selinux attributes?
No, there would be no security.selinux attribute and the file would be
treated as having a well-defined 'unlabeled' attribute by SELinux. Not
something you have to worry about.
> > > > Also, if you are going to automatically degrade reflink(2) behavior
> > > > based on the owner_or_cap test, then you ought to allow the same to be
> > > > true if the security module vetoes the attempt to preserve attributes.
> > > > Either DAC or MAC logic may say that security attributes cannot be
> > > > preserved. Your current logic will only allow graceful degradation in
> > > > the DAC case, but the MAC case will remain a hard failure.
> > >
> > > I did not think of this, and it's a very good point as well. I'm
> > > not sure how to have the return value of security_inode_reflink()
> > > distinguish between "disallow the reflink" and "disallow
> > > preserve_security". But since !preserve_security requires read access
> > > only, perhaps we move security_inode_reflink up higher and say:
> > >
> > > 	error = security_inode_reflink(old_dentry, dir);
> > > 	if (error)
> > > 		preserve_security = 0;
> > >
> > > Here security_inode_reflink() does not need new_dentry, because it isn't
> > > setting a security context. If it's ok with the reflink, we'll be
> > > copying the extended attribute. If it's not OK, it falls through to the
> > > inode_permission(inode, MAY_READ) check, which will check for plain old
> > > read access.
> > > What do we think?
> >
> > I'd rather have two hooks, one to allow the security module to override
> > preserve_security and one to allow the security module to deny the
> > operation altogether. The former hook only needs to be called if
> > preserve_security is not already cleared by the DAC logic. The latter
> > hook needs to know the final verdict on preserve_security in order to
> > determine the right set of checks to apply, which isn't necessarily
> > limited to only checking read access.
>
> Ok, is that two hooks or one hook with specific error returns?
> I don't care, it's up to the LSM group. I just can't come up with a
> good distinguishing set of names if it's two hooks :-)
I suppose you could coalesce them into a single hook, a la:
	error = security_inode_reflink(old_dentry, dir, &preserve_security);
	if (error)
		return (error);
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 18:04 ` [Ocfs2-devel] " Stephen Smalley
@ 2009-05-12 18:28 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-12 18:28 UTC (permalink / raw)
To: Stephen Smalley
Cc: James Morris, linux-fsdevel, linux-security-module, mtk.manpages,
jim owens, ocfs2-devel, viro
On Tue, May 12, 2009 at 02:04:53PM -0400, Stephen Smalley wrote:
> On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote:
> > As an aside, do inodes ever have more than one security.*
> > attribute? It would appear that security_inode_init_security() just
> > returns one attribute, but what if I had a system running under SMACK
> > and then changed to SELinux? Would my (existing) inode then have
> > security.smack and security.selinux attributes?
>
> No, there would be no security.selinux attribute and the file would be
> treated as having a well-defined 'unlabeled' attribute by SELinux. Not
> something you have to worry about.
Even if I've run restorecon? Basically, I'm trying to understand
if, in the !preserve_security case, ocfs2 can just do "link up the
existing xattrs, then set whatever we got from
security_inode_init_security()", or if we have to go through and delete
all security.* attributes before installing the result of
security_inode_init_security().
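The second option in that question (strip every security.* attribute before installing the new one) can be sketched in userspace as a filter over attribute names. This is only an illustration of the idea: `drop_security_xattrs()` and `is_security_xattr()` are hypothetical helpers, the name list stands in for whatever ocfs2 actually walks, and whether the stripping is needed at all is exactly what is being asked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All security attributes share this name prefix. */
#define MODEL_SECURITY_PREFIX "security."

/* Returns nonzero if the attribute name lives in the security namespace. */
static int is_security_xattr(const char *name)
{
	return strncmp(name, MODEL_SECURITY_PREFIX,
		       strlen(MODEL_SECURITY_PREFIX)) == 0;
}

/*
 * Compacts the array in place, keeping only non-security attribute
 * names in their original order. Returns the number of names kept.
 */
size_t drop_security_xattrs(const char **names, size_t count)
{
	size_t kept = 0;
	size_t i;

	assert(names != NULL || count == 0);

	for (i = 0; i < count; i++) {
		if (!is_security_xattr(names[i]))
			names[kept++] = names[i];
	}

	return kept;
}
```

After such a pass, the attribute returned by security_inode_init_security() would be the only security.* entry on the new inode.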
> > > I'd rather have two hooks, one to allow the security module to override
> > > preserve_security and one to allow the security module to deny the
> > > operation altogether. The former hook only needs to be called if
> > > preserve_security is not already cleared by the DAC logic. The latter
> > > hook needs to know the final verdict on preserve_security in order to
> > > determine the right set of checks to apply, which isn't necessarily
> > > limited to only checking read access.
> >
> > Ok, is that two hooks or one hook with specific error returns?
> > I don't care, it's up to the LSM group. I just can't come up with a
> > good distinguishing set of names if it's two hooks :-)
>
> I suppose you could coalesce them into a single hook ala:
> error = security_inode_reflink(old_dentry, dir, &preserve_security);
> if (error)
> return (error);
Whatever fits in with the LSM convention. That's more important
than one-hook-vs-two.
Joel
--
"Gone to plant a weeping willow
On the bank's green edge it will roll, roll, roll.
Sing a lullaby beside the waters.
Lovers come and go, the river roll, roll, rolls."
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 18:28 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-12 18:37 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-12 18:37 UTC (permalink / raw)
To: Joel Becker
Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Tue, 2009-05-12 at 11:28 -0700, Joel Becker wrote:
> On Tue, May 12, 2009 at 02:04:53PM -0400, Stephen Smalley wrote:
> > On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote:
> > > As an aside, do inodes ever have more than one security.*
> > > attribute? It would appear that security_inode_init_security() just
> > > returns one attribute, but what if I had a system running under SMACK
> > > and then changed to SELinux? Would my (existing) inode then have
> > > security.smack and security.selinux attributes?
> >
> > No, there would be no security.selinux attribute and the file would be
> > treated as having a well-defined 'unlabeled' attribute by SELinux. Not
> > something you have to worry about.
>
> Even if I've run restorecon? Basically, I'm trying to understand
> if, in the !preserve_security case, ocfs2 can just do "link up the
> existing xattrs, then set whatever we got from
> security_inode_init_security()", or if we have to go through and delete
> all security.* attributes before installing the result of
> security_inode_init_security().
Likely a better example would be file capabilities
(security.capability), as you might be using those simultaneously with
SELinux (security.selinux).
security_inode_init_security() is only going to return security.selinux,
as new files don't get any file capabilities assigned by default. I
guess you would want to delete security.capability from the reflink if
preserve_security==0.
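The rule suggested above can be sketched in userspace C. This is only a model under stated assumptions: struct xattr, is_security_xattr() and reflink_xattrs() are hypothetical names, and init_attr stands in for whatever security_inode_init_security() would hand back. It is not the ocfs2 implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory xattr; a real filesystem keeps these on disk. */
struct xattr {
	const char *name;
	const char *value;
};

/* Nonzero if the name lives in the security.* namespace. */
int is_security_xattr(const char *name)
{
	return strncmp(name, "security.", 9) == 0;
}

/*
 * Build the xattr set for a reflink target.
 *
 * preserve_security != 0: copy every source attribute verbatim.
 * preserve_security == 0: skip every security.* attribute from the
 * source (security.selinux, security.capability, ...), then append the
 * one attribute a freshly created file would get. init_attr models the
 * result of security_inode_init_security() and may be NULL.
 *
 * Returns the number of entries written to dst, or -1 if dst is too small.
 */
int reflink_xattrs(const struct xattr *src, size_t nsrc,
		   struct xattr *dst, size_t dst_cap,
		   int preserve_security, const struct xattr *init_attr)
{
	size_t n = 0;

	for (size_t i = 0; i < nsrc; i++) {
		if (!preserve_security && is_security_xattr(src[i].name))
			continue;
		if (n == dst_cap)
			return -1;
		dst[n++] = src[i];
	}

	if (!preserve_security && init_attr) {
		if (n == dst_cap)
			return -1;
		dst[n++] = *init_attr;
	}

	return (int)n;
}
```

Dropping the whole security.* namespace rather than only security.capability is a choice made for this sketch; it answers the earlier "delete all security.* attributes" question conservatively.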
> > > > I'd rather have two hooks, one to allow the security module to override
> > > > preserve_security and one to allow the security module to deny the
> > > > operation altogether. The former hook only needs to be called if
> > > > preserve_security is not already cleared by the DAC logic. The latter
> > > > hook needs to know the final verdict on preserve_security in order to
> > > > determine the right set of checks to apply, which isn't necessarily
> > > > limited to only checking read access.
> > >
> > > Ok, is that two hooks or one hook with specific error returns?
> > > I don't care, it's up to the LSM group. I just can't come up with a
> > > good distinguishing set of names if it's two hooks :-)
> >
> > I suppose you could coalesce them into a single hook ala:
> > error = security_inode_reflink(old_dentry, dir, &preserve_security);
> > if (error)
> > return (error);
>
> Whatever fits in with the LSM convention. That's more important
> than one-hook-vs-two.
I think that the above example fits with the LSM convention.
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 18:04 ` [Ocfs2-devel] " Stephen Smalley
@ 2009-05-14 18:06 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-14 18:06 UTC (permalink / raw)
To: Joel Becker
Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Tue, 2009-05-12 at 14:04 -0400, Stephen Smalley wrote:
> On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote:
> > On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote:
> > > On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
> > > > On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> > > > > Is preserve_security supposed to also control the preservation of the
> > > > > SELinux security attribute (security.selinux extended attribute)? I'd
> > > > > expect that either we preserve all the security-relevant attributes or
> > > > > none of them. And if that is the case, then SELinux has to know about
> > > > > preserve_security in order to know what the security context of the new
> > > > > inode will be.
> > > >
> > > > Thank you Stephen, you read my mind. In the ocfs2 case, we're
> > > > expecting to just reflink the extended attribute structures verbatim in
> > > > the preserve_security case.
> > >
> > > And in the preserve_security==0 case, you'll be calling
> > > security_inode_init_security() in order to get the attribute name/value
> > > pair to assign to the new inode just as in the normal file creation
> > > case?
> >
> > Oh, absolutely.
> > As an aside, do inodes ever have more than one security.*
> > attribute? It would appear that security_inode_init_security() just
> > returns one attribute, but what if I had a system running under SMACK
> > and then changed to SELinux? Would my (existing) inode then have
> > security.smack and security.selinux attributes?
>
> No, there would be no security.selinux attribute and the file would be
> treated as having a well-defined 'unlabeled' attribute by SELinux. Not
> something you have to worry about.
>
> > > > > Also, if you are going to automatically degrade reflink(2) behavior
> > > > > based on the owner_or_cap test, then you ought to allow the same to be
> > > > > true if the security module vetoes the attempt to preserve attributes.
> > > > > Either DAC or MAC logic may say that security attributes cannot be
> > > > > preserved. Your current logic will only allow graceful degradation in
> > > > > the DAC case, but the MAC case will remain a hard failure.
> > > >
> > > > I did not think of this, and it's a very good point as well. I'm
> > > > not sure how to have the return value of security_inode_reflink()
> > > > distinguish between "disallow the reflink" and "disallow
> > > > preserve_security". But since !preserve_security requires read access
> > > > only, perhaps we move security_inode_reflink up higher and say:
> > > >
> > > > error = security_inode_reflink(old_dentry, dir);
> > > > if (error)
> > > > preserve_security = 0;
> > > >
> > > > Here security_inode_reflink() does not need new_dentry, because it isn't
> > > > setting a security context. If it's ok with the reflink, we'll be
> > > > copying the extended attribute. If it's not OK, it falls through to the
> > > > inode_permission(inode, MAY_READ) check, which will check for plain old
> > > > read access.
> > > > What do we think?
> > >
> > > I'd rather have two hooks, one to allow the security module to override
> > > preserve_security and one to allow the security module to deny the
> > > operation altogether. The former hook only needs to be called if
> > > preserve_security is not already cleared by the DAC logic. The latter
> > > hook needs to know the final verdict on preserve_security in order to
> > > determine the right set of checks to apply, which isn't necessarily
> > > limited to only checking read access.
> >
> > Ok, is that two hooks or one hook with specific error returns?
> > I don't care, it's up to the LSM group. I just can't come up with a
> > good distinguishing set of names if it's two hooks :-)
>
> I suppose you could coalesce them into a single hook ala:
> error = security_inode_reflink(old_dentry, dir, &preserve_security);
> if (error)
> return (error);
On second thought (agreeing with Andy about making the interface
explicit wrt preserve_security), I don't expect us to ever override
preserve_security from SELinux, so you can just pass it in by value.
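As a rough illustration of the by-value variant, here is a userspace model of the permission path: DAC decides preserve_security first, the hook sees the final verdict and can only allow or deny, and read access is checked in the non-preserving case. Every name here (struct caller, model_security_inode_reflink, model_reflink_permission) is hypothetical, and the ordering of the checks is an assumption of this sketch, not something the thread settles.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state the kernel would consult. */
struct caller {
	bool owner_or_cap;      /* DAC: owner of the file or suitably capable */
	bool may_read;          /* DAC: inode_permission(inode, MAY_READ) passes */
	bool lsm_allows_link;   /* MAC verdict for a preserving (link-like) reflink */
	bool lsm_allows_create; /* MAC verdict for a create-like reflink */
};

/*
 * Model of the hook. preserve_security is passed by value: the module
 * can veto the operation but cannot change the verdict.
 */
int model_security_inode_reflink(const struct caller *c, bool preserve_security)
{
	if (preserve_security)
		return c->lsm_allows_link ? 0 : -EACCES;
	return c->lsm_allows_create ? 0 : -EACCES;
}

/*
 * Model of the caller side. On success, *preserved reports whether the
 * security context would be carried over to the new inode.
 */
int model_reflink_permission(const struct caller *c, bool *preserved)
{
	/* DAC logic: without ownership or capability, degrade gracefully. */
	bool preserve_security = c->owner_or_cap;
	int error;

	error = model_security_inode_reflink(c, preserve_security);
	if (error)
		return error;

	/* Non-preserving reflink still needs plain read access. */
	if (!preserve_security && !c->may_read)
		return -EACCES;

	*preserved = preserve_security;
	return 0;
}
```

With a pointer argument the hook could clear preserve_security itself; passing by value keeps the degradation decision in one place (the DAC test) and leaves the module with a plain allow-or-deny answer.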
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-14 18:06 ` [Ocfs2-devel] " Stephen Smalley
@ 2009-05-14 18:25 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-14 18:25 UTC (permalink / raw)
To: Joel Becker
Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Thu, 2009-05-14 at 14:06 -0400, Stephen Smalley wrote:
> On Tue, 2009-05-12 at 14:04 -0400, Stephen Smalley wrote:
> > On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote:
> > > On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote:
> > > > On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
> > > > > On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> > > > > > Is preserve_security supposed to also control the preservation of the
> > > > > > SELinux security attribute (security.selinux extended attribute)? I'd
> > > > > > expect that either we preserve all the security-relevant attributes or
> > > > > > none of them. And if that is the case, then SELinux has to know about
> > > > > > preserve_security in order to know what the security context of the new
> > > > > > inode will be.
> > > > >
> > > > > Thank you Stephen, you read my mind. In the ocfs2 case, we're
> > > > > expecting to just reflink the extended attribute structures verbatim in
> > > > > the preserve_security case.
> > > >
> > > > And in the preserve_security==0 case, you'll be calling
> > > > security_inode_init_security() in order to get the attribute name/value
> > > > pair to assign to the new inode just as in the normal file creation
> > > > case?
> > >
> > > Oh, absolutely.
> > > As an aside, do inodes ever have more than one security.*
> > > attribute? It would appear that security_inode_init_security() just
> > > returns one attribute, but what if I had a system running under SMACK
> > > and then changed to SELinux? Would my (existing) inode then have
> > > security.smack and security.selinux attributes?
> >
> > No, there would be no security.selinux attribute and the file would be
> > treated as having a well-defined 'unlabeled' attribute by SELinux. Not
> > something you have to worry about.
> >
> > > > > > Also, if you are going to automatically degrade reflink(2) behavior
> > > > > > based on the owner_or_cap test, then you ought to allow the same to be
> > > > > > true if the security module vetoes the attempt to preserve attributes.
> > > > > > Either DAC or MAC logic may say that security attributes cannot be
> > > > > > preserved. Your current logic will only allow graceful degradation in
> > > > > > the DAC case, but the MAC case will remain a hard failure.
> > > > >
> > > > > I did not think of this, and it's a very good point as well. I'm
> > > > > not sure how to have the return value of security_inode_reflink()
> > > > > distinguish between "disallow the reflink" and "disallow
> > > > > preserve_security". But since !preserve_security requires read access
> > > > > only, perhaps we move security_inode_reflink up higher and say:
> > > > >
> > > > > error = security_inode_reflink(old_dentry, dir);
> > > > > if (error)
> > > > > preserve_security = 0;
> > > > >
> > > > > Here security_inode_reflink() does not need new_dentry, because it isn't
> > > > > setting a security context. If it's ok with the reflink, we'll be
> > > > > copying the extended attribute. If it's not OK, it falls through to the
> > > > > inode_permission(inode, MAY_READ) check, which will check for plain old
> > > > > read access.
> > > > > What do we think?
> > > >
> > > > I'd rather have two hooks, one to allow the security module to override
> > > > preserve_security and one to allow the security module to deny the
> > > > operation altogether. The former hook only needs to be called if
> > > > preserve_security is not already cleared by the DAC logic. The latter
> > > > hook needs to know the final verdict on preserve_security in order to
> > > > determine the right set of checks to apply, which isn't necessarily
> > > > limited to only checking read access.
> > >
> > > Ok, is that two hooks or one hook with specific error returns?
> > > I don't care, it's up to the LSM group. I just can't come up with a
> > > good distinguishing set of names if it's two hooks :-)
> >
> > I suppose you could coalesce them into a single hook ala:
> > error = security_inode_reflink(old_dentry, dir, &preserve_security);
> > if (error)
> > return (error);
>
> On second thought (agreeing with Andy about making the interface
> explicit wrt preserve_security), I don't expect us to ever override
> preserve_security from SELinux, so you can just pass it in by value.
And you can likely make preserve_security a simple bool (set from some
caller-provided flag) rather than an int. At which point the SELinux
wiring for the new hook would be something like this:
If we are preserving security attributes on the reflink, then treat it
like creating a link to an existing file; else treat it like creating a
new file. Read access will also be checked in the non-preserving case
by virtue of the separate inode_permission call.
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 2fcad7c..20ef414 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2667,6 +2667,17 @@ static int selinux_inode_symlink(struct inode *dir, struct dentry *dentry, const
 	return may_create(dir, dentry, SECCLASS_LNK_FILE);
 }
 
+static int selinux_inode_reflink(struct dentry *dentry, struct inode *dir,
+				 bool preserve_security)
+{
+	struct inode_security_struct *isec = dentry->d_inode->i_security;
+
+	if (preserve_security)
+		return may_link(dir, dentry, MAY_LINK);
+	else
+		return may_create(dir, dentry, isec->sclass);
+}
+
 static int selinux_inode_mkdir(struct inode *dir, struct dentry *dentry, int mask)
 {
 	return may_create(dir, dentry, SECCLASS_DIR);
@@ -5357,6 +5368,7 @@ static struct security_operations selinux_ops = {
 	.inode_link =			selinux_inode_link,
 	.inode_unlink =			selinux_inode_unlink,
 	.inode_symlink =		selinux_inode_symlink,
+	.inode_reflink =		selinux_inode_reflink,
 	.inode_mkdir =			selinux_inode_mkdir,
 	.inode_rmdir =			selinux_inode_rmdir,
 	.inode_mknod =			selinux_inode_mknod,
--
Stephen Smalley
National Security Agency
^ permalink raw reply related [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-14 18:25 ` [Ocfs2-devel] " Stephen Smalley
@ 2009-05-14 23:25 ` James Morris
-1 siblings, 0 replies; 304+ messages in thread
From: James Morris @ 2009-05-14 23:25 UTC (permalink / raw)
To: Stephen Smalley
Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Thu, 14 May 2009, Stephen Smalley wrote:
> And you can likely make preserve_security a simple bool (set from some
> caller-provided flag) rather than an int. At which point the SELinux
> wiring for the new hook would be something like this:
>
> If we are preserving security attributes on the reflink, then treat it
> like creating a link to an existing file;
Do we also need to somewhat consider it like a new file? e.g. in the case
of create_sid being set (if different to the existing security attribute),
I believe we need to fail the operation because security attributes are
not preserved, and also decide which error code to return (the user may be
confused if it's EACCES -- EINVAL might be better). Similar for reflinks
on a context mounted file system, although create_sid needs to be checked
during inode instantiation (unless we, say, add a preserve_sid flag
which overrides create_sid and is cleared upon use).
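To make the create_sid case concrete, a minimal userspace sketch of the check I have in mind follows. sid_t, the use of 0 for "no create_sid set", and model_check_create_sid() are all assumptions made for illustration; this is not existing SELinux code, and the choice of EINVAL is only the suggestion above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical SID type; 0 is used here to mean "no create_sid set". */
typedef unsigned int sid_t;

/*
 * A preserving reflink promises the new inode keeps the existing label.
 * If the task has asked for a different label on newly created files,
 * both requests cannot be honored, so the operation is refused.
 */
int model_check_create_sid(bool preserve_security, sid_t create_sid,
			   sid_t existing_sid)
{
	if (!preserve_security)
		return 0;	/* create-like path: create_sid applies as usual */

	if (create_sid != 0 && create_sid != existing_sid)
		return -EINVAL;	/* conflicting request, not an access denial */

	return 0;
}
```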
- James
--
James Morris
<jmorris@namei.org>
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-14 23:25 ` [Ocfs2-devel] " James Morris
@ 2009-05-15 11:54 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-15 11:54 UTC (permalink / raw)
To: James Morris
Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Fri, 2009-05-15 at 09:25 +1000, James Morris wrote:
> On Thu, 14 May 2009, Stephen Smalley wrote:
>
> > And you can likely make preserve_security a simple bool (set from some
> > caller-provided flag) rather than an int. At which point the SELinux
> > wiring for the new hook would be something like this:
> >
> > If we are preserving security attributes on the reflink, then treat it
> > like creating a link to an existing file;
>
> Do we also need to somewhat consider it like a new file? e.g. in the case
> of create_sid being set (if different to the existing security attribute),
> I believe we need to fail the operation because security attributes are
> not preserved, and also decide which error code to return (the user may be
> confused if it's EACCES -- EINVAL might be better). Similar for reflinks
> on a context mounted file system, although create_sid needs to be checked
> during inode instantiation (unless we, say, set a preserve_sid flag
> which overrides create_sid and is cleared upon use).
The create_sid is not relevant in the preserve_security==1 case; the
filesystem will always preserve the security context from the original
inode on the new inode in that case. The create_sid won't ever be used
in that case, as it only gets applied if the filesystem calls
security_inode_init_security() to obtain the attribute (name, value)
pair for a new inode, and the filesystem will only do that in the
preserve_security==0 case.
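
As a rough userspace illustration of the point above (not kernel code), the choice of label for the new inode can be sketched as follows. The struct, the function name, and the string labels are all hypothetical stand-ins; create_label models the task's create_sid and default_label models whatever the policy would otherwise compute for a new file.

```c
#include <string.h>

/* Hypothetical stand-in for an inode's security label; not a kernel type. */
struct sketch_inode {
    char label[64];
};

/*
 * Pick the label for the inode created by a reflink.
 * create_label models the task's create_sid (NULL when unset);
 * default_label models the label policy would compute for a new file.
 */
void sketch_label_new_inode(struct sketch_inode *new_inode,
                            const struct sketch_inode *orig,
                            int preserve_security,
                            const char *create_label,
                            const char *default_label)
{
    const char *src;

    if (preserve_security)
        src = orig->label;     /* copied verbatim; create_sid is never consulted */
    else if (create_label)
        src = create_label;    /* normal creation path honours create_sid */
    else
        src = default_label;

    strncpy(new_inode->label, src, sizeof(new_inode->label) - 1);
    new_inode->label[sizeof(new_inode->label) - 1] = '\0';
}
```

The only branch that looks at create_label is the preserve_security==0 one, which is the behaviour described above.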
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-15 11:54 ` [Ocfs2-devel] " Stephen Smalley
@ 2009-05-15 13:35 ` James Morris
-1 siblings, 0 replies; 304+ messages in thread
From: James Morris @ 2009-05-15 13:35 UTC (permalink / raw)
To: Stephen Smalley
Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Fri, 15 May 2009, Stephen Smalley wrote:
> The create_sid is not relevant in the preserve_security==1 case; the
> filesystem will always preserve the security context from the original
> inode on the new inode in that case. The create_sid won't ever be used
> in that case, as it only gets applied if the filesystem calls
> security_inode_init_security() to obtain the attribute (name, value)
> pair for a new inode, and the filesystem will only do that in the
> preserve_security==0 case.
Ok. Does this break the idea of create_sid, though? i.e. it will be
ignored when a new file is created via reflink(), potentially allowing DAC
to determine whether MAC labeling policy is enforced, and is also not
consistent with the way fsuid is handled.
- James
--
James Morris
<jmorris@namei.org>
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-15 13:35 ` [Ocfs2-devel] " James Morris
@ 2009-05-15 15:44 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-15 15:44 UTC (permalink / raw)
To: James Morris
Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Fri, 2009-05-15 at 23:35 +1000, James Morris wrote:
> On Fri, 15 May 2009, Stephen Smalley wrote:
>
> > The create_sid is not relevant in the preserve_security==1 case; the
> > filesystem will always preserve the security context from the original
> > inode on the new inode in that case. The create_sid won't ever be used
> > in that case, as it only gets applied if the filesystem calls
> > security_inode_init_security() to obtain the attribute (name, value)
> > pair for a new inode, and the filesystem will only do that in the
> > preserve_security==0 case.
>
> Ok. Does this break the idea of create_sid, though? i.e. it will be
> ignored when a new file is created via reflink(), potentially allowing DAC
> to determine whether MAC labeling policy is enforced, and is also not
> consistent with the way fsuid is handled.
I think it is consistent with the planned uid handling for reflink (if
preserve_security==1, then the new inode gets the uid of the original
inode; else the new inode gets the fsuid of the creating process).
create_sid is a "discretionary" mechanism - the application supplies the
value via setfscreatecon(3), subject to a policy check (the file create
check). Applications only expect the create_sid to be applied on normal
file creations (and even there, it may not happen due to context mounts
or filesystems that do not support labeling), so we aren't bound to that
behavior for reflink.
The MAC policy is enforced based on the permission checks, not the
create_sid, so the only question is whether it is sufficient to check
link permission for reflink(2) in the attribute-preserving case or
whether we should add a new permission for it. We don't want to reuse
the create permission for reflink(2) in the attribute-preserving case
due to the difference in semantics between a reflink and a normal file
creation. The result of a reflink(2) will look identical to the result
of a link(2) except that it will have its own inode and thus a different
inode number, link count, and ctime.
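
To make the two cases concrete, here is a small userspace sketch of the owner and permission-check selection described above. The enum and struct names are hypothetical and are not SELinux's real permission constants. The sketch assumes the link check is used in the preserving case, which is exactly the point still open for discussion.

```c
/* Hypothetical names for illustration; not SELinux's real permission constants. */
enum sketch_perm { SKETCH_PERM_LINK, SKETCH_PERM_CREATE };

struct sketch_reflink_plan {
    unsigned int uid;         /* owner of the new inode */
    enum sketch_perm check;   /* which MAC check would apply */
};

struct sketch_reflink_plan sketch_plan_reflink(unsigned int orig_uid,
                                               unsigned int fsuid,
                                               int preserve_security)
{
    struct sketch_reflink_plan plan;

    if (preserve_security) {
        plan.uid = orig_uid;              /* owner copied from the original inode */
        plan.check = SKETCH_PERM_LINK;    /* assumption: link check, not create */
    } else {
        plan.uid = fsuid;                 /* owner is the creating process */
        plan.check = SKETCH_PERM_CREATE;  /* treated like a normal file creation */
    }
    return plan;
}
```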
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 18:03 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-13 1:47 ` Casey Schaufler
-1 siblings, 0 replies; 304+ messages in thread
From: Casey Schaufler @ 2009-05-13 1:47 UTC (permalink / raw)
To: Stephen Smalley, James Morris, jim owens, ocfs2-devel, viro,
mtk.manpages, linux-se
Joel Becker wrote:
> On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote:
>
>> On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
>>
>>> On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
>>>
>>>> Is preserve_security supposed to also control the preservation of the
>>>> SELinux security attribute (security.selinux extended attribute)? I'd
>>>> expect that either we preserve all the security-relevant attributes or
>>>> none of them. And if that is the case, then SELinux has to know about
>>>> preserve_security in order to know what the security context of the new
>>>> inode will be.
>>>>
>>> Thank you Stephen, you read my mind. In the ocfs2 case, we're
>>> expecting to just reflink the extended attribute structures verbatim in
>>> the preserve_security case.
>>>
>> And in the preserve_security==0 case, you'll be calling
>> security_inode_init_security() in order to get the attribute name/value
>> pair to assign to the new inode just as in the normal file creation
>> case?
>>
>
> Oh, absolutely.
> As an aside, do inodes ever have more than one security.*
> attribute?
ACLs, capability sets and Smack labels can all exist on a file at
the same time. I know of at least one effort underway to create a
multiple-label LSM.
> It would appear that security_inode_init_security() just
> returns one attribute, but what if I had a system running under SMACK
> and then changed to SELinux?
The Smack attribute would hang around, it would just be unused.
> Would my (existing) inode then have
> security.smack and security.selinux attributes?
>
Yup. It happens all the time. Whenever someone converts a Fedora
system to Smack they end up with a filesystem full of unused selinux
labels. It does no harm.
>
>>>> Also, if you are going to automatically degrade reflink(2) behavior
>>>> based on the owner_or_cap test, then you ought to allow the same to be
>>>> true if the security module vetoes the attempt to preserve attributes.
>>>> Either DAC or MAC logic may say that security attributes cannot be
>>>> preserved. Your current logic will only allow graceful degradation in
>>>> the DAC case, but the MAC case will remain a hard failure.
>>>>
>>> I did not think of this, and it's a very good point as well. I'm
>>> not sure how to have the return value of security_inode_reflink()
>>> distinguish between "disallow the reflink" and "disallow
>>> preserve_security". But since !preserve_security requires read access
>>> only, perhaps we move security_inode_reflink up higher and say:
>>>
>>> error = security_inode_reflink(old_dentry, dir);
>>> if (error)
>>> preserve_security = 0;
>>>
>>> Here security_inode_reflink() does not need new_dentry, because it isn't
>>> setting a security context. If it's ok with the reflink, we'll be
>>> copying the extended attribute. If it's not OK, it falls through to the
>>> inode_permission(inode, MAY_READ) check, which will check for plain old
>>> read access.
>>> What do we think?
>>>
>> I'd rather have two hooks, one to allow the security module to override
>> preserve_security and one to allow the security module to deny the
>> operation altogether. The former hook only needs to be called if
>> preserve_security is not already cleared by the DAC logic. The latter
>> hook needs to know the final verdict on preserve_security in order to
>> determine the right set of checks to apply, which isn't necessarily
>> limited to only checking read access.
>>
>
> Ok, is that two hooks or one hook with specific error returns?
> I don't care, it's up to the LSM group. I just can't come up with a
> good distinguishing set of names if it's two hooks :-)
>
> Joel
>
>
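
For reference, the two-hook arrangement quoted above might look roughly like the following userspace sketch. Everything here is hypothetical: the policy struct, the hook names, and the use of -EACCES are placeholders chosen for illustration, not an existing LSM interface.

```c
#include <errno.h>

/* Hypothetical policy answers standing in for a security module. */
struct sketch_policy {
    int allow_preserve;            /* may attributes be preserved at all? */
    int allow_preserving_reflink;  /* may a preserving reflink proceed? */
    int allow_plain_reflink;       /* may a non-preserving reflink proceed? */
};

/* Hook 1 (hypothetical): lets the module override preserve_security. */
int sketch_may_preserve(const struct sketch_policy *p)
{
    return p->allow_preserve ? 0 : -EACCES;
}

/* Hook 2 (hypothetical): lets the module deny the operation altogether. */
int sketch_may_reflink(const struct sketch_policy *p, int preserve_security)
{
    int allowed = preserve_security ? p->allow_preserving_reflink
                                    : p->allow_plain_reflink;
    return allowed ? 0 : -EACCES;
}

/*
 * Returns a negative errno if the reflink is denied; otherwise returns 0
 * and stores the final verdict on preservation in *preserve_out.
 */
int sketch_reflink_checks(const struct sketch_policy *p,
                          int dac_allows_preserve, int *preserve_out)
{
    int preserve = dac_allows_preserve;
    int error;

    /* Hook 1 is only consulted if DAC has not already cleared the flag. */
    if (preserve && sketch_may_preserve(p) != 0)
        preserve = 0;   /* degrade gracefully instead of failing */

    /* Hook 2 sees the final verdict and may veto the whole operation. */
    error = sketch_may_reflink(p, preserve);
    if (error)
        return error;

    *preserve_out = preserve;
    return 0;
}
```

With this split, a MAC veto on preservation degrades the same way the DAC owner_or_cap test does, while the second hook remains a hard failure.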
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-13 1:47 ` [Ocfs2-devel] " Casey Schaufler
@ 2009-05-13 16:43 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-13 16:43 UTC (permalink / raw)
To: Casey Schaufler
Cc: James Morris, linux-fsdevel, linux-security-module, mtk.manpages,
jim owens, Stephen Smalley, ocfs2-devel, viro
On Tue, May 12, 2009 at 06:47:04PM -0700, Casey Schaufler wrote:
> Joel Becker wrote:
> > Oh, absolutely.
> > As an aside, do inodes ever have more than one security.*
> > attribute?
>
> ACLs, capability sets and Smack labels can all exist on a file at
> the same time. I know of at least one effort underway to create a
> multiple-label LSM.
So ACLs and cap sets live under security.*? That's good.
> > Would my (existing) inode then have
> > security.smack and security.selinux attributes?
> >
>
> Yup. It happens all the time. Whenever someone converts a Fedora
> system to Smack they end up with a filesystem full of unused selinux
> labels. It does no harm.
At that runtime, sure. But with reflink(), we may be reflinking
someone else's inode, and if we have to drop its security state, we
should clean the unused labels just in case they go back to selinux (or
back to smack, etc). But if they are all under security.*, it's easy to
do.
Thanks!
Joel
--
Life's Little Instruction Book #173
"Be kinder than necessary."
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-13 16:43 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-13 17:23 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-13 17:23 UTC (permalink / raw)
To: Joel Becker
Cc: Casey Schaufler, James Morris, jim owens, ocfs2-devel, viro,
mtk.manpages, linux-security-module, linux-fsdevel
On Wed, 2009-05-13 at 09:43 -0700, Joel Becker wrote:
> On Tue, May 12, 2009 at 06:47:04PM -0700, Casey Schaufler wrote:
> > Joel Becker wrote:
> > > Oh, absolutely.
> > > As an aside, do inodes ever have more than one security.*
> > > attribute?
> >
> > ACLs, capability sets and Smack labels can all exist on a file at
> > the same time. I know of at least one effort underway to create a
> > multiple-label LSM.
>
> So ACLs and cap sets live under security.*? That's good.
File capabilities live under security.*, but ACLs predate the security
namespace and live in the system namespace as
"system.posix_acl_access" (and if a directory, there is also a
"system.posix_acl_default" attribute that specifies the default ACL for
new files in that directory).
In the preserve_security==0 case, you'd want to:
- drop all attributes under security.* on the new inode,
- set (security.<name>, value) to the name:value pair provided by
security_inode_init_security(),
- set system.posix_acl_access to the default ACL associated with the
parent directory (the "system.posix_acl_default" attribute on the
parent).
The latter two steps are what is already done in the new inode creation
code path, so you hopefully can just reuse that code.
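
The three steps above can be sketched in userspace C as a filter over an attribute list. This is only an illustration under stated assumptions: the struct and function names are hypothetical, attribute values are plain strings, the source inode's own access ACL is assumed to be dropped, and the real inheritance of a default ACL (which also interacts with the file mode) is reduced to a straight copy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name/value pair; real xattr values are binary blobs. */
struct sketch_xattr {
    const char *name;
    const char *value;
};

/*
 * Build the attribute set for a reflink target when security state is
 * not preserved. dst must have room for n_src + 2 entries.
 * Returns the number of entries written to dst.
 */
size_t sketch_fresh_xattrs(const struct sketch_xattr *src, size_t n_src,
                           struct sketch_xattr label,
                           const char *parent_default_acl,
                           struct sketch_xattr *dst)
{
    size_t i, n = 0;

    for (i = 0; i < n_src; i++) {
        if (strncmp(src[i].name, "security.", 9) == 0)
            continue;   /* step 1: drop every security.* attribute */
        if (strcmp(src[i].name, "system.posix_acl_access") == 0)
            continue;   /* assumption: the old access ACL is not carried over */
        dst[n++] = src[i];
    }

    dst[n++] = label;   /* step 2: pair supplied by the security module */

    if (parent_default_acl) {   /* step 3: take the parent's default ACL */
        dst[n].name = "system.posix_acl_access";
        dst[n].value = parent_default_acl;
        n++;
    }
    return n;
}
```

Matching on the "security." prefix is what removes stale labels left behind by a previously active security module.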
> > > Would my (existing) inode then have
> > > security.smack and security.selinux attributes?
> > >
> >
> > Yup. It happens all the time. Whenever someone converts a Fedora
> > system to Smack they end up with a filesystem full of unused selinux
> > labels. It does no harm.
>
> At that runtime, sure. But with reflink(), we may be reflinking
> someone else's inode, and if we have to drop its security state, we
> should clean the unused labels just in case they go back to selinux (or
> back to smack, etc). But if they are all under security.*, it's easy to
> do.
>
> Thanks!
> Joel
>
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-13 17:23 ` [Ocfs2-devel] " Stephen Smalley
@ 2009-05-13 18:27 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-13 18:27 UTC (permalink / raw)
To: Stephen Smalley
Cc: James Morris, jim owens, linux-security-module, mtk.manpages,
Casey Schaufler, linux-fsdevel, ocfs2-devel, viro
On Wed, May 13, 2009 at 01:23:58PM -0400, Stephen Smalley wrote:
> File capabilities live under security.*, but ACLs predate the security
> namespace and live in the system namespace as
> "system.posix_acl_access" (and if a directory, there is also a
> "system.posix_acl_default" attribute that specifies the default ACL for
> new files in that directory).
>
> In the preserve_security==0 case, you'd want to:
> - drop all attributes under security.* on the new inode,
> - set (security.<name>, value) to the name:value pair provided by
> security_inode_init_security(),
> - set system.posix_acl_access to the default ACL associated with the
> parent directory (the "system.posix_acl_default" attribute on the
> parent).
>
> The latter two steps are what is already done in the new inode creation
> code path, so you hopefully can just reuse that code.
I am absolutely expecting to reuse that code. I was just
trying to make sure I didn't miss any steps prior to the normal
new-inode stuff. Thanks.
Joel
--
The zen have a saying:
"When you learn how to listen, ANYONE can be your teacher."
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-11 22:27 ` [Ocfs2-devel] " James Morris
@ 2009-05-12 12:01 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-12 12:01 UTC (permalink / raw)
To: James Morris
Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Tue, 2009-05-12 at 08:27 +1000, James Morris wrote:
> On Mon, 11 May 2009, Joel Becker wrote:
>
> > and other security attributes (in all, I'm gonna call that the "security
> > context") as well. So I defined reflink() as such. This meant
>
> "security context" is a term associated with SELinux, so you may want to
> use something like "security attributes" or "security state" to avoid
> confusing people.
>
> > + error = security_inode_reflink(old_dentry, dir);
> > + if (error)
> > + return error;
>
> We'll need the new_dentry now, to set up new security state before the
> dentry is instantiated.
I don't think the inode exists yet for the new_dentry (not until after
the call to i_op->reflink), and thus we cannot set up the new inode
state at the point of security_inode_reflink(). We will need the
filesystem to call into the security module to get the right security
attribute name/value pair when creating the new inode, just as with
normal inode creation, unless it is preserving the name/value pair from
the original. The security_inode_init_security() hook is for that
purpose - you can see its usage in existing filesystems when creating
new inodes.
> e.g. SELinux will need to perform some checks on the operation, then
> calculate a new security context for the new file.
>
>
> - James
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-11 20:40 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-11 23:11 ` jim owens
-1 siblings, 0 replies; 304+ messages in thread
From: jim owens @ 2009-05-11 23:11 UTC (permalink / raw)
To: joel.becker, linux-fsdevel
Cc: jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module
Joel Becker wrote:
> Here's v4 of reflink(). If you have the privileges, you get the
> full snapshot. If you don't, you must have read access, and then you
> get the entire snapshot (data and extended attributes) except that the
> security context is reinitialized. That's it. It fits with most of the
> other ops, and it's a clean degradation.
I really like this. It has a nice clean user operational definition
and gives them all the snap/cowfile features. And if they had the
privilege to do the reflink(), they can just chattr away :)
jim
> + /*
> + * If the caller has the rights, reflink() will preserve the
> + * security context of the source inode.
> + */
> + if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
> + preserve_security = 0;
> + if ((current_fsuid() != inode->i_uid) &&
> + !in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
> + preserve_security = 0;
I have not done a code review, but that appears to be an
editing cut-and-paste duplication.
^ permalink raw reply [flat|nested] 304+ messages in thread
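[ Editorial sketch, not part of the thread. ] One way to see why the second test looks like a leftover: its condition is the first one with an extra "and not in the group" term, so it can never clear preserve_security in a case the first test did not already catch. The model below checks that over all eight input combinations. It is a plain userspace stand-in, assuming booleans in place of current_fsuid(), in_group_p() and capable(CAP_CHOWN); the function names are hypothetical, and which of the two tests the author meant to keep is not stated in this part of the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* The hunk as posted in v4: both tests, applied one after the other. */
static bool preserve_as_posted(bool is_owner, bool in_group, bool cap_chown)
{
    bool preserve_security = true;

    if (!is_owner && !cap_chown)
        preserve_security = false;
    if (!is_owner && !in_group && !cap_chown)
        preserve_security = false;
    return preserve_security;
}

/* Only the first test kept: owner or CAP_CHOWN preserves. */
static bool preserve_owner_only(bool is_owner, bool cap_chown)
{
    return is_owner || cap_chown;
}

/* Walk all 2^3 combinations and assert that both versions agree. */
static void verify_checks_agree(void)
{
    for (int bits = 0; bits < 8; bits++) {
        bool is_owner = (bits & 1) != 0;
        bool in_group = (bits & 2) != 0;
        bool cap_chown = (bits & 4) != 0;

        assert(preserve_as_posted(is_owner, in_group, cap_chown) ==
               preserve_owner_only(is_owner, cap_chown));
    }
}
```

In other words, group membership has no effect on the outcome of the posted hunk.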
* Re: [RFC] The reflink(2) system call v4.
2009-05-11 23:11 ` [Ocfs2-devel] " jim owens
@ 2009-05-11 23:42 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-11 23:42 UTC (permalink / raw)
To: jim owens
Cc: jmorris, linux-security-module, mtk.manpages, linux-fsdevel,
ocfs2-devel, viro
On Mon, May 11, 2009 at 07:11:00PM -0400, jim owens wrote:
> Joel Becker wrote:
>> Here's v4 of reflink(). If you have the privileges, you get the
>> full snapshot. If you don't, you must have read access, and then you
>> get the entire snapshot (data and extended attributes) except that the
>> security context is reinitialized. That's it. It fits with most of the
>> other ops, and it's a clean degradation.
>
> I really like this. It has a nice clean user operational definition
> and gives them all the snap/cowfile features. And if they had the
> privilege to do the reflink(), they can just chattr away :)
>
> jim
>
>> + /*
>> + * If the caller has the rights, reflink() will preserve the
>> + * security context of the source inode.
>> + */
>> + if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
>> + preserve_security = 0;
>> + if ((current_fsuid() != inode->i_uid) &&
>> + !in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
>> + preserve_security = 0;
>
> I have not done a code review, but that appears to be an
> editing cut-and-paste duplication.
Oh, good catch.
Joel
--
"In the long run...we'll all be dead."
-Unknown
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-11 20:40 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-12 11:31 ` Jörn Engel
-1 siblings, 0 replies; 304+ messages in thread
From: Jörn Engel @ 2009-05-12 11:31 UTC (permalink / raw)
To: Joel Becker
Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Mon, 11 May 2009 13:40:11 -0700, Joel Becker wrote:
>
> Here's v4 of reflink(). If you have the privileges, you get the
> full snapshot. If you don't, you must have read access, and then you
> get the entire snapshot (data and extended attributes) except that the
> security context is reinitialized. That's it. It fits with most of the
> other ops, and it's a clean degradation.
Let me see if I understand this correctly. File "/tmp/foo" belongs to
Joel, file "/tmp/bar" belongs to Joern. Everyone has read access to
those files. Now if you reflink them to your home directory, both files
belong to you. If I reflink them to my home directory, both files
belong to me. And if root reflinks them to /root, one file belongs to
Joel, the other to Joern. Is that correct?
Because if it is, I would call that behaviour rather confusing. A
system call that behaves differently depending on who calls it - or
on whether the binary is installed suid root - is something I would like
to avoid.
Jörn
--
A surrounded army must be given a way out.
-- Sun Tzu
^ permalink raw reply [flat|nested] 304+ messages in thread
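[ Editorial sketch, not part of the thread. ] The /tmp/foo and /tmp/bar scenario can be worked through with a small model of the v4 rule as described in the proposal: ownership is preserved when the caller owns the source or has CAP_CHOWN, and otherwise the new file belongs to the caller. Everything below is an assumption made for illustration: the uid values, the struct, and the function names are hypothetical, and root is modelled simply as uid 0 with CAP_CHOWN.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int model_uid;   /* stand-in for a kernel uid */

struct caller {
    model_uid fsuid;
    bool cap_chown;
};

/* Owner of the new reflink under the v4 semantics. */
static model_uid reflink_owner(struct caller c, model_uid src_uid)
{
    bool preserve = (c.fsuid == src_uid) || c.cap_chown;

    return preserve ? src_uid : c.fsuid;
}

/* The three cases from the mail: Joel, Joern, and root reflink both files. */
static void check_scenario(void)
{
    const model_uid joel = 1000, joern = 1001;  /* made-up uids */
    struct caller as_joel = { joel, false };
    struct caller as_joern = { joern, false };
    struct caller as_root = { 0, true };

    /* /tmp/foo is owned by joel, /tmp/bar by joern. */
    assert(reflink_owner(as_joel, joel) == joel);
    assert(reflink_owner(as_joel, joern) == joel);
    assert(reflink_owner(as_joern, joel) == joern);
    assert(reflink_owner(as_joern, joern) == joern);
    assert(reflink_owner(as_root, joel) == joel);
    assert(reflink_owner(as_root, joern) == joern);
}
```

The model only tracks the owner; permissions, ACLs and LSM labels follow the same preserve-or-reinitialize switch in the proposal.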
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 11:31 ` [Ocfs2-devel] " Jörn Engel
@ 2009-05-12 13:12 ` jim owens
-1 siblings, 0 replies; 304+ messages in thread
From: jim owens @ 2009-05-12 13:12 UTC (permalink / raw)
To: Jörn Engel
Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
Jörn Engel wrote:
> On Mon, 11 May 2009 13:40:11 -0700, Joel Becker wrote:
>> Here's v4 of reflink(). If you have the privileges, you get the
>> full snapshot. If you don't, you must have read access, and then you
>> get the entire snapshot (data and extended attributes) except that the
>> security context is reinitialized. That's it. It fits with most of the
>> other ops, and it's a clean degradation.
>
> Let me see if I understand this correctly. File "/tmp/foo" belongs to
> Joel, file "/tmp/bar" belongs to Joern. Everyone has read access to
> those files. Now if you reflink them to your home directory, both files
> belong to you. If I reflink them to my home directory, both files
> belong to me. And if root reflinks them to /root, one file belongs to
> Joel, the other to Joern. Is that correct?
yes
> Because if it is, I would call that behaviour rather confusing. A
> system call that behaves differently depending on who calls it - or
> on whether the binary is installed suid root - is something I would like
> to avoid.
Avoiding that just gives us other confusing operations unless
you have a really good alternative.
This design is very elegant, I wish I had thought of it :)
It passes the test that 99% of the time for any user (including
root), "it just works the way I want it to". In my experience,
root and setuid programs really don't want to take ownership,
they want to replicate it.
The behavior matches "cp -p" or "tar -x" and yes those are not
system calls but so what. What matters is the documentation is
clear about what happens and the most useful result occurs.
jim
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 13:12 ` [Ocfs2-devel] " jim owens
@ 2009-05-12 20:24 ` Jamie Lokier
-1 siblings, 0 replies; 304+ messages in thread
From: Jamie Lokier @ 2009-05-12 20:24 UTC (permalink / raw)
To: jim owens
Cc: Jörn Engel, Joel Becker, jmorris, ocfs2-devel, viro,
mtk.manpages, linux-security-module, linux-fsdevel
jim owens wrote:
> It passes the test that 99% of the time for any user (including
> root), "it just works the way I want it to". In my experience,
> root and setuid programs really don't want to take ownership,
> they want to replicate it.
Unfortunately in the other 1%, as I've explained in detail in another
mail, it's a lot of work and sometimes impossible for a program to set
the attributes to be those of a new file.
Whereas an explicit choice between snapshot attributes and new-file
attributes never causes problems, because it's trivial to provide the
automatic "-p" switch by trying one then the other.
To human-optimise, make your reflink _program_ do that.
Humans don't call system calls themselves :-)
> The behavior matches "cp -p" or "tar -x"
Actually it doesn't, but even if it did, not having any way to turn
off the "-p" would be just as annoying as if you couldn't do that with "cp".
If you like root to have "cp -p", put it in /root/.bashrc :-)
-- Jamie
^ permalink raw reply [flat|nested] 304+ messages in thread
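[ Editorial sketch, not part of the thread. ] The "try one then the other" fallback Jamie describes could look roughly like this in a userspace tool. Neither explicit call exists, so both are passed in as function pointers; reflink_fn, reflink_auto() and the stub_* helpers are hypothetical names, and the assumption that the strict variant fails with -EPERM when the caller lacks the privilege is borrowed from the suggestion made elsewhere in the thread rather than from any posted patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Shared shape of the two hypothetical calls: 0 on success, -errno on failure. */
typedef int (*reflink_fn)(const char *oldpath, const char *newpath);

/* Prefer a full snapshot; fall back to new-file attributes only on -EPERM. */
static int reflink_auto(reflink_fn snapshot, reflink_fn as_new_file,
                        const char *oldpath, const char *newpath)
{
    int err;

    assert(snapshot != NULL && as_new_file != NULL);

    err = snapshot(oldpath, newpath);
    if (err == -EPERM)
        err = as_new_file(oldpath, newpath);
    return err;
}

/* Stand-ins so the fallback logic can be exercised without a kernel. */
static int stub_ok(const char *o, const char *n)
{
    (void)o; (void)n;
    return 0;
}

static int stub_denied(const char *o, const char *n)
{
    (void)o; (void)n;
    return -EPERM;
}

static int stub_nospace(const char *o, const char *n)
{
    (void)o; (void)n;
    return -ENOSPC;
}
```

Errors other than -EPERM are returned as-is, so a failure such as running out of space is not hidden by the fallback.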
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 13:12 ` [Ocfs2-devel] " jim owens
@ 2009-05-14 18:43 ` Jörn Engel
-1 siblings, 0 replies; 304+ messages in thread
From: Jörn Engel @ 2009-05-14 18:43 UTC (permalink / raw)
To: jim owens
Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
[ Delayed response - mailserver was dead. ]
On Tue, 12 May 2009 09:12:17 -0400, jim owens wrote:
>
> >Because if it is, I would call that behaviour rather confusing. A
> >system call that behaves differently depending on who calls it - or
> >on whether the binary is installed suid root - is something I would like
> >to avoid.
>
> Avoiding that just gives us other confusing operations unless
> you have a really good alternative.
>
> This design is very elegant, I wish I had thought of it :)
>
> It passes the test that 99% of the time for any user (including
> root), "it just works the way I want it to". In my experience,
> root and setuid programs really don't want to take ownership,
> they want to replicate it.
>
> The behavior matches "cp -p" or "tar -x" and yes those are not
> system calls but so what. What matters is the documentation is
> clear about what happens and the most useful result occurs.
If what you want is copyfile(2), this is a poor design because it
usually does what you want and sometimes doesn't. If what you want is
reflink(2), this may be acceptable. Not sure. I personally would
prefer to get -EPERM or something instead of altered behaviour.
So you can count me in with the people that propose two separate system
calls.
Jörn
--
They laughed at Galileo. They laughed at Copernicus. They laughed at
Columbus. But remember, they also laughed at Bozo the Clown.
-- unknown
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-11 20:40 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-12 15:04 ` Sage Weil
-1 siblings, 0 replies; 304+ messages in thread
From: Sage Weil @ 2009-05-12 15:04 UTC (permalink / raw)
To: Joel Becker
Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Mon, 11 May 2009, Joel Becker wrote:
> Here's v4 of reflink(). If you have the privileges, you get the
> full snapshot. If you don't, you must have read access, and then you
> get the entire snapshot (data and extended attributes) except that the
> security context is reinitialized. That's it. It fits with most of the
> other ops, and it's a clean degradation.
What would a 'cp' without '-p' be expected to do here when it has the
privileges? Call reflink(2), then explicitly clear out any copied
security attributes, and otherwise jump through hoops to make the newly
created file look like it should? Should it check whether it has the
privileges and act accordingly (_can_ it even do that
reliably/atomically?), or unconditionally verify that the attributes look
like a new file's should?
To me, a simple 'cp' type operation (assuming it gets wired up the way it
could) seems like at least as common a use case as a 'snapshot'
operation. I know that's not your main goal here, but I don't
understand the resistance to two syscalls. Mixing the two might give you
the right answer in many cases, but certainly not all, and it makes for
confusing application interface semantics that we won't be able to change
down the line.
sage
> I add a flag to iops->reflink() so that the filesystem knows what
> to do with the security context. That's the only change visible outside
> of vfs_reflink().
> Security folks, check my work. Everyone else, let me know if
> this satisfies.
>
> Joel
>
> From 1ebf4c2cf36d38b22de025b03753497466e18941 Mon Sep 17 00:00:00 2001
> From: Joel Becker <joel.becker@oracle.com>
> Date: Sat, 2 May 2009 22:48:59 -0700
> Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.
>
> The userspace-visible idea of the operation is:
>
> int reflink(const char *oldpath, const char *newpath);
> int reflinkat(int olddirfd, const char *oldpath,
> int newdirfd, const char *newpath, int flags);
>
> The kernel only implements reflinkat(2). reflink(3) is a trivial
> wrapper around reflinkat(2).
>
> The reflink() system call creates reference-counted links. It creates
> a new file that shares the data extents of the source file in a
> copy-on-write fashion. Its calling semantics are identical to link(2)
> and linkat(2). Once complete, programs see the new file as a completely
> separate entry.
>
> reflink() attempts to preserve ownership, permissions, and security
> contexts in order to create a full snapshot. Preserving those
> attributes requires ownership or CAP_CHOWN. A caller without those
> privileges will see the security context of the new file initialized to
> their default.
>
> In the VFS, ->reflink() is an inode_operation with almost the same
> arguments as ->link(); an additional argument tells the filesystem to
> copy over or reinitialize the security context on the new file.
>
> A new LSM hook, security_inode_reflink(), is added. None of the
> existing LSM hooks appeared to fit.
>
> XXX: Currently only adds the x86_32 linkage. The rest of the
> architectures belong here too.
>
> Signed-off-by: Joel Becker <joel.becker@oracle.com>
> ---
> Documentation/filesystems/reflink.txt | 165 +++++++++++++++++++++++++++++++++
> Documentation/filesystems/vfs.txt | 4 +
> arch/x86/include/asm/unistd_32.h | 1 +
> arch/x86/kernel/syscall_table_32.S | 1 +
> fs/namei.c | 113 ++++++++++++++++++++++
> include/linux/fs.h | 2 +
> include/linux/security.h | 16 +++
> include/linux/syscalls.h | 2 +
> security/capability.c | 6 +
> security/security.c | 7 ++
> 10 files changed, 317 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/filesystems/reflink.txt
>
> diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
> new file mode 100644
> index 0000000..aa7380f
> --- /dev/null
> +++ b/Documentation/filesystems/reflink.txt
> @@ -0,0 +1,165 @@
> +reflink(2)
> +==========
> +
> +
> +INTRODUCTION
> +------------
> +
> +A reflink is a reference-counted link. The reflink(2) operation is
> +analogous to the link(2) operation, except that instead of two directory
> +entries pointing to the same inode, there are two identical inodes
> +pointing to the same data. Writes do not modify the shared data; they
> +use copy-on-write (CoW). Thus, after the reflink has been created, the
> +inodes can diverge without impacting each other.
> +
> +
> +SYNOPSIS
> +--------
> +
> +The reflink(2) call looks just like link(2):
> +
> + int reflink(const char *oldpath, const char *newpath);
> +
> +The actual system call is reflinkat(2):
> +
> + int reflinkat(int olddirfd, const char *oldpath,
> + int newdirfd, const char *newpath, int flags);
> +
> +For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
> +The reflink(2) call won't be implemented by the kernel, because it's a
> +trivial wrapper around reflinkat(2).
> +
> +
> +DESCRIPTION
> +-----------
> +
> +One way of viewing reflink is to look at the level of sharing. A
> +symbolic link does its sharing at the directory entry level; many names
> +end up pointing at the same directory entry. Hard links are one step
> +down. Multiple directory entries are sharing one inode. Reflinks are
> +down one more level: multiple inodes share the same data extents.
> +
> +When you symlink a file, you can then access it via the symlink or the
> +real directory entry, and for the most part they look identical. When
> +accessing more than one name for a hard link, the object returned looks
> +identical. Similarly, a newly created reflink is identical to its
> +source in almost every way and can be treated as such. This includes
> +ownership, permissions, security context, and data. The only things
> +that are different are the inode number, the link count, and the ctime.
> +
> +A reflink is a snapshot of the source file at the time it is created.
> +
> +Once created, though, a reflink can be modified like any other normal
> +file without affecting the source file. Changes to trivial fields like
> +permissions, owner, or times are guaranteed not to trigger CoW of file
> +data and will not return any error that wouldn't happen on a truly
> +distinct file. Changes to the file's data will trigger CoW of the data
> +affected - the actual CoW granularity is up to the filesystem, from
> +exact bytes up to the entire file. ocfs2, for example, will copy out an
> +entire extent or 1MB, whichever is smaller.
> +
> +Preserving the security context of the source file obviously requires
> +the privilege to do so. Callers that do not own the source file and do
> +not have CAP_CHOWN will get a new reflink with all non-security
> +attributes preserved; the security context of the new reflink will be
> +as a newly created file by that user.
> +
> +Partial reflinks are not allowed. The new inode will only appear in the
> +directory structure after it is fully formed. This prevents a crash or
> +lack of space from creating a partial reflink.
> +
> +If a filesystem does not support reflinks, the kernel and libc MUST NOT
> +fake it. Callers are expecting to get snapshots, and faking it will
> +violate that trust.
> +
> +The userspace view is as follows. When reflink(2) returns, opening
> +oldpath and newpath returns identical-looking files, just like link(2).
> +After that, oldpath and newpath behave as distinct files, and
> +modifications to one have no impact on the other.
> +
> +
> +RESTRICTIONS
> +------------
> +
> +Just as the sharing gets lower as you move from symlink() -> link() ->
> +reflink(), the restrictions on the call get tighter. A symlink doesn't
> +require any access permissions other than being able to create its
> +inode. It can cross filesystems and mount points, and it can point to
> +any type of file. A hard link requires both source and target to be on
> +the same filesystem under the same mount point, and that the source not
> +be a directory. Like hard links and symlinks, a reflink cannot be
> +created if newpath exists.
> +
> +Reflinks add one big restriction on top of hard links: only the owner
> +or someone with elevated privileges (CAP_CHOWN) can preserve the
> +security context (permissions, ownership, ACLs, etc) across a reflink.
> +A reflink is a point-in-time snapshot of a file. Without the
> +appropriate privilege, the caller will see their own default security
> +context applied to the file.
> +
> +A caller without the privileges to preserve the security context must
> +have read access to reflink a file.
> +
> +
> +SHARING
> +-------
> +
> +A reflink creates a new inode. It shares all data extents of the source
> +file; this includes file data and extended attribute data. All of the
> +sharing is in a CoW fashion, and any modification of the data will break
> +the sharing.
> +
> +For some filesystems, certain data structures are not in allocated
> +storage extents. Creating a reflink might make a copy of these extents.
> +An example is ext3's ability to store small extended attributes inside
> +the ext3 inode. Since a reflink is creating a new inode, those extended
> +attributes are merely copied to the new inode.
> +
> +
> +EXCEPTIONS
> +----------
> +
> +All file attributes and extended attributes of the new file must be
> +identical to the source file with the following exceptions:
> +
> +- The new file must have a new inode number. This allows POSIX
> + programs to treat the source and new files as separate objects. From
> + the view of the POSIX application, the files are distinct. The
> + sharing is invisible outside of the filesystem's internal structures.
> +- The ctime of the source file only changes if the source's metadata
> + must be changed to accommodate the copy-on-write linkage. The ctime
> + of the new file is set to represent its creation.
> +- The link count of the source file is unchanged, and the link count of
> + the new file is one.
> +- If the caller lacks the privileges to preserve the security context,
> + the file will have its security context initialized as would any new
> + file.
> +
> +The mtime of the source file is unmodified, and the mtime of the new
> +file is set identical to the source file. This reflects that the data
> +is unchanged.
> +
> +
> +INODE OPERATION
> +---------------
> +
> +Filesystems implement the ->reflink() inode operation. It has almost
> +the same prototype as ->link():
> +
> + int (*reflink)(struct dentry *old_dentry, struct inode *dir,
> + struct dentry *new_dentry, int preserve_security);
> +
> +When the filesystem is called, the VFS has already checked the
> +permissions and mountpoint of the operation. It has determined whether
> +the security context should be preserved or reinitialized, as specified
> +by the preserve_security argument. The filesystem just needs to create
> +the new inode identical to the old one with the exceptions noted above,
> +link up the shared data extents, and then link the new inode into dir.
> +
> +
> +FOLLOWING SYMBOLIC LINKS
> +------------------------
> +
> +reflink() dereferences symbolic links in the same manner that link(2)
> +does. The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
> +
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index f49eecf..01cd810 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -333,6 +333,7 @@ struct inode_operations {
> ssize_t (*listxattr) (struct dentry *, char *, size_t);
> int (*removexattr) (struct dentry *, const char *);
> void (*truncate_range)(struct inode *, loff_t, loff_t);
> + int (*reflink) (struct dentry *,struct inode *,struct dentry *,int);
> };
>
> Again, all methods are called without any locks being held, unless
> @@ -431,6 +432,9 @@ otherwise noted.
>
> truncate_range: a method provided by the underlying filesystem to truncate a
> range of blocks , i.e. punch a hole somewhere in a file.
> + reflink: called by the reflink(2) system call. Only required if you want
> + to support reflinks. For further information, see
> + Documentation/filesystems/reflink.txt.
>
>
> The Address Space Object
> diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
> index 6e72d74..c368563 100644
> --- a/arch/x86/include/asm/unistd_32.h
> +++ b/arch/x86/include/asm/unistd_32.h
> @@ -340,6 +340,7 @@
> #define __NR_inotify_init1 332
> #define __NR_preadv 333
> #define __NR_pwritev 334
> +#define __NR_reflinkat 335
>
> #ifdef __KERNEL__
>
> diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
> index ff5c873..d11c200 100644
> --- a/arch/x86/kernel/syscall_table_32.S
> +++ b/arch/x86/kernel/syscall_table_32.S
> @@ -334,3 +334,4 @@ ENTRY(sys_call_table)
> .long sys_inotify_init1
> .long sys_preadv
> .long sys_pwritev
> + .long sys_reflinkat /* 335 */
> diff --git a/fs/namei.c b/fs/namei.c
> index 78f253c..34a6ce5 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2486,6 +2486,118 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
> return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
> }
>
> +int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
> +{
> + struct inode *inode = old_dentry->d_inode;
> + int error;
> + int preserve_security = 1;
> +
> + if (!inode)
> + return -ENOENT;
> +
> + /*
> + * If the caller has the rights, reflink() will preserve the
> + * security context of the source inode.
> + */
> + if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
> + preserve_security = 0;
> + if ((current_fsuid() != inode->i_uid) &&
> + !in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
> + preserve_security = 0;
> +
> + /*
> + * If the caller doesn't have the right to preserve the security
> + * context, the caller is only getting the data and extended
> + * attributes. They need read permission on the file.
> + */
> + if (!preserve_security) {
> + error = inode_permission(inode, MAY_READ);
> + if (error)
> + return error;
> + }
> +
> + error = may_create(dir, new_dentry);
> + if (error)
> + return error;
> +
> + if (dir->i_sb != inode->i_sb)
> + return -EXDEV;
> +
> + /*
> + * A reflink to an append-only or immutable file cannot be created.
> + */
> + if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
> + return -EPERM;
> + if (!dir->i_op->reflink)
> + return -EPERM;
> + if (S_ISDIR(inode->i_mode))
> + return -EPERM;
> +
> + error = security_inode_reflink(old_dentry, dir);
> + if (error)
> + return error;
> +
> + mutex_lock(&inode->i_mutex);
> + vfs_dq_init(dir);
> + error = dir->i_op->reflink(old_dentry, dir, new_dentry,
> + preserve_security);
> + mutex_unlock(&inode->i_mutex);
> + if (!error)
> + fsnotify_create(dir, new_dentry);
> + return error;
> +}
> +
> +SYSCALL_DEFINE5(reflinkat, int, olddfd, const char __user *, oldname,
> + int, newdfd, const char __user *, newname, int, flags)
> +{
> + struct dentry *new_dentry;
> + struct nameidata nd;
> + struct path old_path;
> + int error;
> + char *to;
> +
> + if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
> + return -EINVAL;
> +
> + error = user_path_at(olddfd, oldname,
> + flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
> + &old_path);
> + if (error)
> + return error;
> +
> + error = user_path_parent(newdfd, newname, &nd, &to);
> + if (error)
> + goto out;
> + error = -EXDEV;
> + if (old_path.mnt != nd.path.mnt)
> + goto out_release;
> + new_dentry = lookup_create(&nd, 0);
> + error = PTR_ERR(new_dentry);
> + if (IS_ERR(new_dentry))
> + goto out_unlock;
> + error = mnt_want_write(nd.path.mnt);
> + if (error)
> + goto out_dput;
> + error = security_path_link(old_path.dentry, &nd.path, new_dentry);
> + if (error)
> + goto out_drop_write;
> + error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode, new_dentry);
> +out_drop_write:
> + mnt_drop_write(nd.path.mnt);
> +out_dput:
> + dput(new_dentry);
> +out_unlock:
> + mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
> +out_release:
> + path_put(&nd.path);
> + putname(to);
> +out:
> + path_put(&old_path);
> +
> + return error;
> +}
> +
> +
> /*
> * The worst of all namespace operations - renaming directory. "Perverted"
> * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
> @@ -2890,6 +3002,7 @@ EXPORT_SYMBOL(unlock_rename);
> EXPORT_SYMBOL(vfs_create);
> EXPORT_SYMBOL(vfs_follow_link);
> EXPORT_SYMBOL(vfs_link);
> +EXPORT_SYMBOL(vfs_reflink);
> EXPORT_SYMBOL(vfs_mkdir);
> EXPORT_SYMBOL(vfs_mknod);
> EXPORT_SYMBOL(generic_permission);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 5bed436..0a5c807 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
> extern int vfs_rmdir(struct inode *, struct dentry *);
> extern int vfs_unlink(struct inode *, struct dentry *);
> extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
> +extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *);
>
> /*
> * VFS dentry helper functions.
> @@ -1537,6 +1538,7 @@ struct inode_operations {
> loff_t len);
> int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
> u64 len);
> + int (*reflink) (struct dentry *,struct inode *,struct dentry *,int);
> };
>
> struct seq_file;
> diff --git a/include/linux/security.h b/include/linux/security.h
> index d5fd616..ea9cd93 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -528,6 +528,14 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
> * @inode contains a pointer to the inode.
> * @secid contains a pointer to the location where result will be saved.
> * In case of failure, @secid will be set to zero.
> + * @inode_reflink:
> + * Check permission before creating a new reference-counted link to
> + * a file.
> + * @old_dentry contains the dentry structure for an existing link to
> + * the file.
> + * @dir contains the inode structure of the parent directory of the
> + * new reflink.
> + * Return 0 if permission is granted.
> *
> * Security hooks for file operations
> *
> @@ -1415,6 +1423,7 @@ struct security_operations {
> int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
> int (*inode_symlink) (struct inode *dir,
> struct dentry *dentry, const char *old_name);
> + int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir);
> int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
> int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
> int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
> @@ -1675,6 +1684,7 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
> int security_inode_unlink(struct inode *dir, struct dentry *dentry);
> int security_inode_symlink(struct inode *dir, struct dentry *dentry,
> const char *old_name);
> +int security_inode_reflink(struct dentry *old_dentry, struct inode *dir);
> int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
> int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
> int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
> @@ -2056,6 +2066,12 @@ static inline int security_inode_symlink(struct inode *dir,
> return 0;
> }
>
> +static inline int security_inode_reflink(struct dentry *old_dentry,
> + struct inode *dir)
> +{
> + return 0;
> +}
> +
> static inline int security_inode_mkdir(struct inode *dir,
> struct dentry *dentry,
> int mode)
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 40617c1..35a8743 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -692,6 +692,8 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
> int newdfd, const char __user * newname);
> asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
> int newdfd, const char __user *newname, int flags);
> +asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
> + int newdfd, const char __user *newname, int flags);
> asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
> int newdfd, const char __user * newname);
> asmlinkage long sys_futimesat(int dfd, char __user *filename,
> diff --git a/security/capability.c b/security/capability.c
> index 21b6cea..3dcc4cc 100644
> --- a/security/capability.c
> +++ b/security/capability.c
> @@ -172,6 +172,11 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
> return 0;
> }
>
> +static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode)
> +{
> + return 0;
> +}
> +
> static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
> int mask)
> {
> @@ -905,6 +910,7 @@ void security_fixup_ops(struct security_operations *ops)
> set_to_cap_if_null(ops, inode_link);
> set_to_cap_if_null(ops, inode_unlink);
> set_to_cap_if_null(ops, inode_symlink);
> + set_to_cap_if_null(ops, inode_reflink);
> set_to_cap_if_null(ops, inode_mkdir);
> set_to_cap_if_null(ops, inode_rmdir);
> set_to_cap_if_null(ops, inode_mknod);
> diff --git a/security/security.c b/security/security.c
> index 5284255..70d0ac3 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -470,6 +470,13 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
> return security_ops->inode_symlink(dir, dentry, old_name);
> }
>
> +int security_inode_reflink(struct dentry *old_dentry, struct inode *dir)
> +{
> + if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
> + return 0;
> + return security_ops->inode_reflink(old_dentry, dir);
> +}
> +
> int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
> {
> if (unlikely(IS_PRIVATE(dir)))
> --
> 1.6.1.3
>
>
> --
>
> "Three o'clock is always too late or too early for anything you
> want to do."
> - Jean-Paul Sartre
>
> Joel Becker
> Principal Software Developer
> Oracle
> E-mail: joel.becker@oracle.com
> Phone: (650) 506-8127
* [Ocfs2-devel] [RFC] The reflink(2) system call v4.
@ 2009-05-12 15:04 ` Sage Weil
0 siblings, 0 replies; 304+ messages in thread
From: Sage Weil @ 2009-05-12 15:04 UTC (permalink / raw)
To: Joel Becker
Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Mon, 11 May 2009, Joel Becker wrote:
> Here's v4 of reflink(). If you have the privileges, you get the
> full snapshot. If you don't, you must have read access, and then you
> get the entire snapshot (data and extended attributes) except that the
> security context is reinitialized. That's it. It fits with most of the
> other ops, and it's a clean degradation.
What would a 'cp' without '-p' be expected to do here when it has the
privileges? Call reflink(2), then explicitly clear out any copied
security attributes, and otherwise jump through hoops to make the newly
created file look like it should? Should it check whether it has the
privileges and act accordingly (_can_ it even do that
reliably/atomically?), or unconditionally verify that the attributes
look like a new file's should?
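As a rough illustration of the hoops in question, here is a minimal
userspace sketch, assuming a privileged 'cp' has already obtained a full
snapshot and now wants the result to look like a freshly created file.
reset_to_new_file_context() is a hypothetical helper, not part of the
patch; it covers only owner, group, and mode bits, and leaves ACLs and
other security xattrs untouched.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): make an existing file look
 * like one the caller just created with open(O_CREAT, 0666).
 * Only owner, group, and permission bits are handled here.
 */
int reset_to_new_file_context(const char *path)
{
	/* umask() can only be read by setting it, so set and restore. */
	mode_t mask = umask(0);
	umask(mask);

	/* Ownership first: chown() may clear setuid/setgid bits. */
	if (chown(path, geteuid(), getegid()) != 0)
		return -1;

	/* Then the default creation mode, filtered through the umask. */
	if (chmod(path, 0666 & ~mask) != 0)
		return -1;

	return 0;
}
```

Each step is a separate system call on a path that is already visible
in the namespace, so the sequence is not atomic; other processes can
observe the file with its snapshot attributes in the meantime.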
To me, a simple 'cp' type operation (assuming it gets wired up the way it
could) seems like at least as common a use case as a 'snapshot'
operation. I know that's not your main goal here, but I don't
understand the resistance to two syscalls. Mixing the two might give you
the right answer in many cases, but certainly not all, and it makes for
confusing application interface semantics that we won't be able to change
down the line.
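To make the mixed semantics concrete, the outcome of a single call under
the v4 rules can be modeled as a small pure function. This is only a
sketch of the ownership, CAP_CHOWN, and read checks in the quoted
vfs_reflink(), not kernel code; the enum and function names are
hypothetical, and the other checks (may_create, EXDEV, immutable files,
the LSM hook) are left out.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical names, used only for this illustration. */
enum reflink_outcome {
	REFLINK_FULL_SNAPSHOT,	/* security context preserved */
	REFLINK_REINIT_CONTEXT,	/* data and xattrs shared, context reset */
	REFLINK_DENIED		/* caller may not reflink the file */
};

/*
 * Model of the v4 decision: the owner or a CAP_CHOWN holder gets the
 * full snapshot; anyone else needs read access and gets a new file
 * with a reinitialized security context.
 */
enum reflink_outcome model_reflink_outcome(uid_t fsuid, uid_t owner,
					   bool has_cap_chown, bool can_read)
{
	bool preserve_security = (fsuid == owner) || has_cap_chown;

	if (preserve_security)
		return REFLINK_FULL_SNAPSHOT;

	return can_read ? REFLINK_REINIT_CONTEXT : REFLINK_DENIED;
}
```

The same call with the same arguments lands in a different branch
depending on who runs it, which is the ambiguity a 'cp'-style caller
would have to work around.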
sage
> I add a flag to iops->reflink() so that the filesystem knows what
> to do with the security context. That's the only change visible outside
> of vfs_reflink().
> Security folks, check my work. Everyone else, let me know if
> this satisfies.
>
> Joel
>
> >From 1ebf4c2cf36d38b22de025b03753497466e18941 Mon Sep 17 00:00:00 2001
> From: Joel Becker <joel.becker@oracle.com>
> Date: Sat, 2 May 2009 22:48:59 -0700
> Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.
>
> The userspace-visible idea of the operation is:
>
> int reflink(const char *oldpath, const char *newpath);
> int reflinkat(int olddirfd, const char *oldpath,
> int newdirfd, const char *newpath, int flags);
>
> The kernel only implements reflinkat(2). reflink(3) is a trivial
> wrapper around reflinkat(2).
>
> The reflink() system call creates reference-counted links. It creates
> a new file that shares the data extents of the source file in a
> copy-on-write fashion. Its calling semantics are identical to link(2)
> and linkat(2). Once complete, programs see the new file as a completely
> separate entry.
>
> reflink() attempts to preserve ownership, permissions, and security
> contexts in order to create a full snapshot. Preserving those
> attributes requires ownership or CAP_CHOWN. A caller without those
> privileges will see the security context of the new file initialized to
> their default.
>
> In the VFS, ->reflink() is an inode_operation with almost the same
> arguments as ->link(); an additional argument tells the filesystem to
> copy over or reinitialize the security context on the new file.
>
> A new LSM hook, security_inode_reflink(), is added. None of the
> existing LSM hooks appeared to fit.
>
> XXX: Currently only adds the x86_32 linkage. The rest of the
> architectures belong here too.
>
> Signed-off-by: Joel Becker <joel.becker@oracle.com>
> ---
> Documentation/filesystems/reflink.txt | 165 +++++++++++++++++++++++++++++++++
> Documentation/filesystems/vfs.txt | 4 +
> arch/x86/include/asm/unistd_32.h | 1 +
> arch/x86/kernel/syscall_table_32.S | 1 +
> fs/namei.c | 113 ++++++++++++++++++++++
> include/linux/fs.h | 2 +
> include/linux/security.h | 16 +++
> include/linux/syscalls.h | 2 +
> security/capability.c | 6 +
> security/security.c | 7 ++
> 10 files changed, 317 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/filesystems/reflink.txt
>
> diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
> new file mode 100644
> index 0000000..aa7380f
> --- /dev/null
> +++ b/Documentation/filesystems/reflink.txt
> @@ -0,0 +1,165 @@
> +reflink(2)
> +==========
> +
> +
> +INTRODUCTION
> +------------
> +
> +A reflink is a reference-counted link. The reflink(2) operation is
> +analogous to the link(2) operation, except that instead of two directory
> +entries pointing to the same inode, there are two identical inodes
> +pointing to the same data. Writes do not modify the shared data; they
> +use copy-on-write (CoW). Thus, after the reflink has been created, the
> +inodes can diverge without impacting each other.
> +
> +
> +SYNOPSIS
> +--------
> +
> +The reflink(2) call looks just like link(2):
> +
> + int reflink(const char *oldpath, const char *newpath);
> +
> +The actual system call is reflinkat(2):
> +
> + int reflinkat(int olddirfd, const char *oldpath,
> + int newdirfd, const char *newpath, int flags);
> +
> +For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
> +The reflink(2) call won't be implemented by the kernel, because it's a
> +trivial wrapper around reflinkat(2).
> +
> +
> +DESCRIPTION
> +-----------
> +
> +One way of viewing reflink is to look at the level of sharing. A
> +symbolic link does its sharing at the directory entry level; many names
> +end up pointing at the same directory entry. Hard links are one step
> +down. Multiple directory entries are sharing one inode. Reflinks are
> +down one more level: multiple inodes share the same data extents.
> +
> +When you symlink a file, you can then access it via the symlink or the
> +real directory entry, and for the most part they look identical. When
> +accessing more than one name for a hard link, the object returned looks
> +identical. Similarly, a newly created reflink is identical to its
> +source in almost every way and can be treated as such. This includes
> +ownership, permissions, security context, and data. The only things
> +that are different are the inode number, the link count, and the ctime.
> +
> +A reflink is a snapshot of the source file at the time it is created.
> +
> +Once created, though, a reflink can be modified like any other normal
> +file without affecting the source file. Changes to trivial fields like
> +permissions, owner, or times are guaranteed not to trigger CoW of file
> +data and will not return any error that wouldn't happen on a truly
> +distinct file. Changes to the file's data will trigger CoW of the data
> +affected - the actual CoW granularity is up to the filesystem, from
> +exact bytes up to the entire file. ocfs2, for example, will copy out an
> +entire extent or 1MB, whichever is smaller.
> +
> +Preserving the security context of the source file obviously requires
> +the privilege to do so. Callers that do not own the source file and do
> +not have CAP_CHOWN will get a new reflink with all non-security
> +attributes preserved; the security context of the new reflink will be
> +as a newly created file by that user.
> +
> +Partial reflinks are not allowed. The new inode will only appear in the
> +directory structure after it is fully formed. This prevents a crash or
> +lack of space from creating a partial reflink.
> +
> +If a filesystem does not support reflinks, the kernel and libc MUST NOT
> +fake it. Callers are expecting to get snapshots, and faking it will
> +violate that trust.
> +
> +The userspace view is as follows. When reflink(2) returns, opening
> +oldpath and newpath returns identical-looking files, just like link(2).
> +After that, oldpath and newpath behave as distinct files, and
> +modifications to one have no impact on the other.
> +
> +
> +RESTRICTIONS
> +------------
> +
> +Just as the sharing gets lower as you move from symlink() -> link() ->
> +reflink(), the restrictions on the call get tighter. A symlink doesn't
> +require any access permissions other than being able to create its
> +inode. It can cross filesystems and mount points, and it can point to
> +any type of file. A hard link requires both source and target to be on
> +the same filesystem under the same mount point, and that the source not
> +be a directory. Like hard links and symlinks, a reflink cannot be
> +created if newpath exists.
> +
> +Reflinks add one big restriction on top of hard links: only the owner
> +or someone with elevated privileges (CAP_CHOWN) can preserve the
> +security context (permissions, ownership, ACLs, etc) across a reflink.
> +A reflink is a point-in-time snapshot of a file. Without the
> +appropriate privilege, the caller will see their own default security
> +context applied to the file.
> +
> +A caller without the privileges to preserve the security context must
> +have read access to reflink a file.
> +
> +
> +SHARING
> +-------
> +
> +A reflink creates a new inode. It shares all data extents of the source
> +file; this includes file data and extended attribute data. All of the
> +sharing is in a CoW fashion, and any modification of the data will break
> +the sharing.
> +
> +For some filesystems, certain data structures are not in allocated
> +storage extents. Creating a reflink might make a copy of these structures.
> +An example is ext3's ability to store small extended attributes inside
> +the ext3 inode. Since a reflink is creating a new inode, those extended
> +attributes are merely copied to the new inode.
> +
> +
> +EXCEPTIONS
> +----------
> +
> +All file attributes and extended attributes of the new file must be
> +identical to the source file with the following exceptions:
> +
> +- The new file must have a new inode number. This allows POSIX
> + programs to treat the source and new files as separate objects. From
> + the view of the POSIX application, the files are distinct. The
> + sharing is invisible outside of the filesystem's internal structures.
> +- The ctime of the source file only changes if the source's metadata
> + must be changed to accommodate the copy-on-write linkage. The ctime
> + of the new file is set to represent its creation.
> +- The link count of the source file is unchanged, and the link count of
> + the new file is one.
> +- If the caller lacks the privileges to preserve the security context,
> + the file will have its security context initialized as would any new
> + file.
> +
> +The mtime of the source file is unmodified, and the mtime of the new
> +file is set identical to the source file. This reflects that the data
> +is unchanged.
> +
> +
> +INODE OPERATION
> +---------------
> +
> +Filesystems implement the ->reflink() inode operation. It has almost
> +the same prototype as ->link():
> +
> + int (*reflink)(struct dentry *old_dentry, struct inode *dir,
> + struct dentry *new_dentry, int preserve_security);
> +
> +When the filesystem is called, the VFS has already checked the
> +permissions and mountpoint of the operation. It has determined whether
> +the security context should be preserved or reinitialized, as specified
> +by the preserve_security argument. The filesystem just needs to create
> +the new inode identical to the old one with the exceptions noted above,
> +link up the shared data extents, and then link the new inode into dir.
> +
> +
> +FOLLOWING SYMBOLIC LINKS
> +------------------------
> +
> +reflink() dereferences symbolic links in the same manner that link(2)
> +does. The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
> +
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index f49eecf..01cd810 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -333,6 +333,7 @@ struct inode_operations {
> ssize_t (*listxattr) (struct dentry *, char *, size_t);
> int (*removexattr) (struct dentry *, const char *);
> void (*truncate_range)(struct inode *, loff_t, loff_t);
> + int (*reflink) (struct dentry *,struct inode *,struct dentry *,int);
> };
>
> Again, all methods are called without any locks being held, unless
> @@ -431,6 +432,9 @@ otherwise noted.
>
> truncate_range: a method provided by the underlying filesystem to truncate a
> range of blocks , i.e. punch a hole somewhere in a file.
> + reflink: called by the reflink(2) system call. Only required if you want
> + to support reflinks. For further information, see
> + Documentation/filesystems/reflink.txt.
>
>
> The Address Space Object
> diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
> index 6e72d74..c368563 100644
> --- a/arch/x86/include/asm/unistd_32.h
> +++ b/arch/x86/include/asm/unistd_32.h
> @@ -340,6 +340,7 @@
> #define __NR_inotify_init1 332
> #define __NR_preadv 333
> #define __NR_pwritev 334
> +#define __NR_reflinkat 335
>
> #ifdef __KERNEL__
>
> diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
> index ff5c873..d11c200 100644
> --- a/arch/x86/kernel/syscall_table_32.S
> +++ b/arch/x86/kernel/syscall_table_32.S
> @@ -334,3 +334,4 @@ ENTRY(sys_call_table)
> .long sys_inotify_init1
> .long sys_preadv
> .long sys_pwritev
> + .long sys_reflinkat /* 335 */
> diff --git a/fs/namei.c b/fs/namei.c
> index 78f253c..34a6ce5 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2486,6 +2486,118 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
> return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
> }
>
> +int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
> +{
> + struct inode *inode = old_dentry->d_inode;
> + int error;
> + int preserve_security = 1;
> +
> + if (!inode)
> + return -ENOENT;
> +
> + /*
> + * If the caller has the rights, reflink() will preserve the
> + * security context of the source inode.
> + */
> + if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
> + preserve_security = 0;
> + if ((current_fsuid() != inode->i_uid) &&
> + !in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
> + preserve_security = 0;
> +
> + /*
> + * If the caller doesn't have the right to preserve the security
> + * context, the caller is only getting the data and extended
> + * attributes. They need read permission on the file.
> + */
> + if (!preserve_security) {
> + error = inode_permission(inode, MAY_READ);
> + if (error)
> + return error;
> + }
> +
> + error = may_create(dir, new_dentry);
> + if (error)
> + return error;
> +
> + if (dir->i_sb != inode->i_sb)
> + return -EXDEV;
> +
> + /*
> + * A reflink to an append-only or immutable file cannot be created.
> + */
> + if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
> + return -EPERM;
> + if (!dir->i_op->reflink)
> + return -EPERM;
> + if (S_ISDIR(inode->i_mode))
> + return -EPERM;
> +
> + error = security_inode_reflink(old_dentry, dir);
> + if (error)
> + return error;
> +
> + mutex_lock(&inode->i_mutex);
> + vfs_dq_init(dir);
> + error = dir->i_op->reflink(old_dentry, dir, new_dentry,
> + preserve_security);
> + mutex_unlock(&inode->i_mutex);
> + if (!error)
> + fsnotify_create(dir, new_dentry);
> + return error;
> +}
> +
> +SYSCALL_DEFINE5(reflinkat, int, olddfd, const char __user *, oldname,
> + int, newdfd, const char __user *, newname, int, flags)
> +{
> + struct dentry *new_dentry;
> + struct nameidata nd;
> + struct path old_path;
> + int error;
> + char *to;
> +
> + if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
> + return -EINVAL;
> +
> + error = user_path_at(olddfd, oldname,
> + flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
> + &old_path);
> + if (error)
> + return error;
> +
> + error = user_path_parent(newdfd, newname, &nd, &to);
> + if (error)
> + goto out;
> + error = -EXDEV;
> + if (old_path.mnt != nd.path.mnt)
> + goto out_release;
> + new_dentry = lookup_create(&nd, 0);
> + error = PTR_ERR(new_dentry);
> + if (IS_ERR(new_dentry))
> + goto out_unlock;
> + error = mnt_want_write(nd.path.mnt);
> + if (error)
> + goto out_dput;
> + error = security_path_link(old_path.dentry, &nd.path, new_dentry);
> + if (error)
> + goto out_drop_write;
> + error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode, new_dentry);
> +out_drop_write:
> + mnt_drop_write(nd.path.mnt);
> +out_dput:
> + dput(new_dentry);
> +out_unlock:
> + mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
> +out_release:
> + path_put(&nd.path);
> + putname(to);
> +out:
> + path_put(&old_path);
> +
> + return error;
> +}
> +
> +
> /*
> * The worst of all namespace operations - renaming directory. "Perverted"
> * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
> @@ -2890,6 +3002,7 @@ EXPORT_SYMBOL(unlock_rename);
> EXPORT_SYMBOL(vfs_create);
> EXPORT_SYMBOL(vfs_follow_link);
> EXPORT_SYMBOL(vfs_link);
> +EXPORT_SYMBOL(vfs_reflink);
> EXPORT_SYMBOL(vfs_mkdir);
> EXPORT_SYMBOL(vfs_mknod);
> EXPORT_SYMBOL(generic_permission);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 5bed436..0a5c807 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
> extern int vfs_rmdir(struct inode *, struct dentry *);
> extern int vfs_unlink(struct inode *, struct dentry *);
> extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
> +extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *);
>
> /*
> * VFS dentry helper functions.
> @@ -1537,6 +1538,7 @@ struct inode_operations {
> loff_t len);
> int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
> u64 len);
> + int (*reflink) (struct dentry *,struct inode *,struct dentry *,int);
> };
>
> struct seq_file;
> diff --git a/include/linux/security.h b/include/linux/security.h
> index d5fd616..ea9cd93 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -528,6 +528,14 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
> * @inode contains a pointer to the inode.
> * @secid contains a pointer to the location where result will be saved.
> * In case of failure, @secid will be set to zero.
> + * @inode_reflink:
> + * Check permission before creating a new reference-counted link to
> + * a file.
> + * @old_dentry contains the dentry structure for an existing link to
> + * the file.
> + * @dir contains the inode structure of the parent directory of the
> + * new reflink.
> + * Return 0 if permission is granted.
> *
> * Security hooks for file operations
> *
> @@ -1415,6 +1423,7 @@ struct security_operations {
> int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
> int (*inode_symlink) (struct inode *dir,
> struct dentry *dentry, const char *old_name);
> + int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir);
> int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
> int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
> int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
> @@ -1675,6 +1684,7 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
> int security_inode_unlink(struct inode *dir, struct dentry *dentry);
> int security_inode_symlink(struct inode *dir, struct dentry *dentry,
> const char *old_name);
> +int security_inode_reflink(struct dentry *old_dentry, struct inode *dir);
> int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
> int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
> int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
> @@ -2056,6 +2066,12 @@ static inline int security_inode_symlink(struct inode *dir,
> return 0;
> }
>
> +static inline int security_inode_reflink(struct dentry *old_dentry,
> + struct inode *dir)
> +{
> + return 0;
> +}
> +
> static inline int security_inode_mkdir(struct inode *dir,
> struct dentry *dentry,
> int mode)
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 40617c1..35a8743 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -692,6 +692,8 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
> int newdfd, const char __user * newname);
> asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
> int newdfd, const char __user *newname, int flags);
> +asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
> + int newdfd, const char __user *newname, int flags);
> asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
> int newdfd, const char __user * newname);
> asmlinkage long sys_futimesat(int dfd, char __user *filename,
> diff --git a/security/capability.c b/security/capability.c
> index 21b6cea..3dcc4cc 100644
> --- a/security/capability.c
> +++ b/security/capability.c
> @@ -172,6 +172,11 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
> return 0;
> }
>
> +static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode)
> +{
> + return 0;
> +}
> +
> static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
> int mask)
> {
> @@ -905,6 +910,7 @@ void security_fixup_ops(struct security_operations *ops)
> set_to_cap_if_null(ops, inode_link);
> set_to_cap_if_null(ops, inode_unlink);
> set_to_cap_if_null(ops, inode_symlink);
> + set_to_cap_if_null(ops, inode_reflink);
> set_to_cap_if_null(ops, inode_mkdir);
> set_to_cap_if_null(ops, inode_rmdir);
> set_to_cap_if_null(ops, inode_mknod);
> diff --git a/security/security.c b/security/security.c
> index 5284255..70d0ac3 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -470,6 +470,13 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
> return security_ops->inode_symlink(dir, dentry, old_name);
> }
>
> +int security_inode_reflink(struct dentry *old_dentry, struct inode *dir)
> +{
> + if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
> + return 0;
> + return security_ops->inode_reflink(old_dentry, dir);
> +}
> +
> int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
> {
> if (unlikely(IS_PRIVATE(dir)))
> --
> 1.6.1.3
>
>
> --
>
> "Three o'clock is always too late or too early for anything you
> want to do."
> - Jean-Paul Sartre
>
> Joel Becker
> Principal Software Developer
> Oracle
> E-mail: joel.becker at oracle.com
> Phone: (650) 506-8127
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 15:04 ` [Ocfs2-devel] " Sage Weil
@ 2009-05-12 15:23 ` jim owens
-1 siblings, 0 replies; 304+ messages in thread
From: jim owens @ 2009-05-12 15:23 UTC (permalink / raw)
To: Sage Weil
Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
Sage Weil wrote:
> On Mon, 11 May 2009, Joel Becker wrote:
>> Here's v4 of reflink(). If you have the privileges, you get the
>> full snapshot. If you don't, you must have read access, and then you
>> get the entire snapshot (data and extended attributes) except that the
>> security context is reinitialized. That's it. It fits with most of the
>> other ops, and it's a clean degradation.
>
> What would a 'cp' without '-p' be expected to do here when it has the
> privileges? Call reflink(2), then explicitly clear out any copied
> security attributes, ensure that any copied attributes are removed, and
> otherwise jump through hoops to make the newly created file look like it
> should? Should it check whether it has the privileges and act accordingly
> (_can_ it even do that reliably/atomically?), or unconditionally verify
> the attributes look like a new file's should?
I don't understand what you think is hard about cp doing the
"if not preserve then update attributes". It does not have to check
the reflink() attr result, it just assigns the expected new attributes.
Only the -p snapshot needs atomicity.
> To me, a simple 'cp' type operation (assuming it gets wired up the way it
> could) seems like at least as common a use case as a 'snapshot'
I don't think changing "cp" is a good idea since users have a
long history that cp means make a data copy, not cow. Adding
a new flag is IMO not as good as a new utility, particularly
since we cannot do directories.
jim
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 15:23 ` [Ocfs2-devel] " jim owens
@ 2009-05-12 16:16 ` Sage Weil
-1 siblings, 0 replies; 304+ messages in thread
From: Sage Weil @ 2009-05-12 16:16 UTC (permalink / raw)
To: jim owens
Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Tue, 12 May 2009, jim owens wrote:
> Sage Weil wrote:
> > On Mon, 11 May 2009, Joel Becker wrote:
> > > Here's v4 of reflink(). If you have the privileges, you get the
> > > full snapshot. If you don't, you must have read access, and then you
> > > get the entire snapshot (data and extended attributes) except that the
> > > security context is reinitialized. That's it. It fits with most of the
> > > other ops, and it's a clean degradation.
> >
> > What would a 'cp' without '-p' be expected to do here when it has the
> > privileges? Call reflink(2), then explicitly clear out any copied security
> > attributes, ensure that any copied attributes are removed, and otherwise jump
> > through hoops to make the newly created file look like it should? Should it
> > check whether it has the privileges and act accordingly (_can_ it even do
> > that reliably/atomically?), or unconditionally verify the attributes look
> > like a new file's should?
>
> I don't understand what you think is hard about cp doing the
> "if not preserve then update attributes". It does not have to check
> the reflink() attr result, it just assigns the expected new attributes.
I assume it's possible, but not being familiar with how the SELinux etc
attributes look, my guess is that any tool that wants to cow file data
to a new file (even if root) would need to do something like
        reflink(src, dst)
        chown(dst, getuid(), getgid())
        listxattr and removexattr each, or just delete the selinux/whatever attributes
        create generic 'new file' selinux/whatever attributes, if needed
The chown bit isn't even right, since it doesn't follow the directory
sticky bit rules. And is there some generic way to assign an existing
file 'new file'-like security attributes? It's a mess.
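As a rough illustration of those fixups, here is a minimal C sketch built on the existing chown(2), listxattr(2) and removexattr(2) calls. The helper names strip_xattrs() and reset_to_caller() are made up for this example, and the reflink() step is left out because the call is only a proposal here. It keeps the plain chown() even though, as noted above, that does not follow the directory rules, and it makes no attempt at the last step (recreating 'new file' labels).

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/*
 * Try to remove every extended attribute visible on `path`.
 * Returns the number of attributes that could not be removed (0 means the
 * file is clean), or -1 with errno set if the list could not be read.
 */
static int strip_xattrs(const char *path)
{
    ssize_t len = listxattr(path, NULL, 0);    /* size query only */
    if (len < 0)
        return errno == ENOTSUP ? 0 : -1;      /* no xattr support: nothing to do */
    if (len == 0)
        return 0;

    char *names = malloc((size_t)len);
    if (!names)
        return -1;

    len = listxattr(path, names, (size_t)len);
    if (len < 0) {
        int saved = errno;
        free(names);
        errno = saved;
        return -1;
    }

    int stuck = 0;
    /* The buffer holds NUL-terminated names back to back. */
    for (char *p = names; p < names + len; p += strlen(p) + 1) {
        if (removexattr(path, p) < 0 && errno != ENODATA)
            stuck++;    /* e.g. a security label the caller may not drop */
    }
    free(names);
    return stuck;
}

/*
 * Hypothetical fixup after a preserving reflink: take ownership of `dst`
 * and drop whatever attributes came along with it.
 */
static int reset_to_caller(const char *dst)
{
    if (chown(dst, getuid(), getgid()) < 0)
        return -1;
    return strip_xattrs(dst);
}
```

A non-zero return from reset_to_caller() means some attribute refused to go away, which is exactly the case a generic tool has no portable answer for.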
> Only the -p snapshot needs atomicity.
My point is that the process creating the cow file should unconditionally
do the above checks (and needed fixups) because it can't atomically verify
the attribute copy won't happen and make the reflink call.
> > To me, a simple 'cp' type operation (assuming it gets wired up the way it
> > could) seems like at least as common a use case as a 'snapshot'
>
> I don't think changing "cp" is a good idea since users have a
> long history that cp means make a data copy, not cow. Adding
> a new flag is IMO not as good as a new utility, particularly
> since we cannot do directories.
Maybe not, but that's a separate question from the interface issue. We
shouldn't preclude the possibility of creating tools that preserve attributes
(or warn if they can't) and tools that simply want to cow data to a new
file. AFAICS reflink(2) as proposed doesn't quite let you do either one
without extra hackery to compensate for its dual-mode behavior. If this
thread has demonstrated anything, it's that some users want snapshot-like
semantics (cp -p) and some want cowfile()-like semantics (cp). What is
the benefit of combining the two into a single call? If I want
snapshot-like semantics, I would rather get -EPERM if I lack the necessary
permissions than silently get an approximation. Then I can at least issue
a warning to the user. If I really want to gracefully 'degrade', I can
always do something like
        err = reflink(src, dst);
        if (err == -EPERM) {
                err = cowfile(src, dst);
                if (!err)
                        printf("warning: failed to preserve all file attributes\n");
        }
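To make that fallback concrete, here is a self-contained C sketch of the same pattern. Neither call exists, so both are passed in as function pointers; the type link_fn, the wrapper snapshot_or_cow() and the two stubs are assumptions made only for this example. They follow the convention of the snippet above: 0 on success, a negative errno value on failure.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical shape shared by a strict reflink() and a cowfile(). */
typedef int (*link_fn)(const char *src, const char *dst);

/*
 * Try an exact snapshot first; fall back to a plain cow copy only when the
 * snapshot is refused for lack of privilege.
 * Returns 0 for an exact snapshot, 1 for a degraded copy (with a warning),
 * or a negative errno value if nothing could be created.
 */
static int snapshot_or_cow(link_fn strict_reflink, link_fn cowfile,
                           const char *src, const char *dst)
{
    int err = strict_reflink(src, dst);
    if (err != -EPERM)
        return err;                 /* success, or an error we cannot fix */

    err = cowfile(src, dst);
    if (err)
        return err;

    fprintf(stderr, "warning: failed to preserve all file attributes\n");
    return 1;
}

/* Stand-ins so the sketch can be exercised without either syscall. */
static int stub_ok(const char *src, const char *dst)
{
    (void)src;
    (void)dst;
    return 0;
}

static int stub_eperm(const char *src, const char *dst)
{
    (void)src;
    (void)dst;
    return -EPERM;
}
```

The point of the shape is that the caller decides whether degrading is acceptable; a tool that must not degrade simply calls the strict function and reports -EPERM.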
sage
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 16:16 ` [Ocfs2-devel] " Sage Weil
@ 2009-05-12 17:45 ` jim owens
-1 siblings, 0 replies; 304+ messages in thread
From: jim owens @ 2009-05-12 17:45 UTC (permalink / raw)
To: Sage Weil
Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
Sage Weil wrote:
> Maybe not, but that's a separate question from the interface issue. We
> shouldn't preclude the possibility creating tools that preserve attributes
> (or warn if they can't) and tools that simply want to cow data to a new
> file. AFAICS reflink(2) as proposed doesn't quite let you do either one
> without extra hackery to compensate for its dual-mode behavior. If this
> thread has demonstrated anything, it's that some users want snapshot-like
> semantics (cp -p) and some want cowfile()-like semantics (cp). What is
> the benefit of combining the two into a single call? If I want
> snapshot-like semantics, I would rather get -EPERM if I lack the necessary
> permissions than silently get an approximation.
I'm not fighting against two syscalls but the reason I like
the V4 definition is the opposite of knowing I failed to snapshot.
It is really because in my experience as both root on multi-user
systems and basic untrusted user, when root copies something from
a user there are only two desired outcomes:
1) cp -p
2) cp, chown "someone" , chgrp "somegroup", chmod "new rights"
The common mistake is wanting #1 and forgetting the -p so it
then produces an error and has to be fixed.
Using root's default attributes is almost never desired.
So with this reflink() definition, normal users get their own
attributes and root automatically gets preserve but can change
them later.
IMO this is optimized for humans, and I don't really know
of any privileged daemon things that are setuid and
want to not preserve attributes. Do you have examples?
jim
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 17:45 ` [Ocfs2-devel] " jim owens
@ 2009-05-12 20:29 ` Jamie Lokier
-1 siblings, 0 replies; 304+ messages in thread
From: Jamie Lokier @ 2009-05-12 20:29 UTC (permalink / raw)
To: jim owens
Cc: Sage Weil, Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
jim owens wrote:
> Using root's default attributes is almost never desired.
                                     ^^^^^^
Exactly. When it is desired, it shouldn't be impossible :-)
Setting attributes to those of a new file outside the kernel requires
parsing /proc/mounts and knowing filesystem-type-specific things,
among other things. Ugly stuff - should never be written. Don't make
such ugly stuff be written (and fail when /proc isn't mounted).
There is also the principle of least surprise... Shell scripts which
behave differently for root - that's asking for trouble.
-- Jamie
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 15:04 ` [Ocfs2-devel] " Sage Weil
@ 2009-05-12 17:28 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-12 17:28 UTC (permalink / raw)
To: Sage Weil
Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Tue, May 12, 2009 at 08:04:21AM -0700, Sage Weil wrote:
> To me, a simple 'cp' type operation (assuming it gets wired up the way it
> could) seems like at least as common a use case as a 'snapshot'
> operation. I know that's not your main goal here, but I don't
> understand the resistance to two syscalls. Mixing the two might give you
> the right answer in many cases, but certainly not all, and it makes for
> confusing application interface semantics that we won't be able to change
> down the line.
I'm not against two syscalls, but I'm not writing copyfile()
here, just reflink(). Someone clearly could write copyfile() later and
link into some of the same underlying mechanisms.
It's important to distinguish the semantics, though, and that's
why I'm doing one thing. For example, reflink() is a snapshot (a
"reference-counted link") and has behaviors based on that. libc should
never fake it, because the callers expect those behaviors. Whereas
copyfile() would be fakeable in libc with a read/write cycle on
filesystems that don't support it. Things like that.
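For illustration only, the read/write fallback could look roughly like the C sketch below. The name copy_by_readwrite() is hypothetical and this is not a proposed copyfile(); it only shows why such a call can be faked in libc. The new file is created the ordinary way, so it gets the caller's ownership and default permissions, and no blocks are shared with the source.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy the data of `src` into a newly created `dst` with a plain
 * read/write loop. Returns 0 on success, -1 with errno set on failure.
 */
static int copy_by_readwrite(const char *src, const char *dst)
{
    char buf[65536];
    int saved;

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;

    /* O_EXCL: like link(2), refuse to overwrite an existing name. */
    int out = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0666);
    if (out < 0) {
        saved = errno;
        close(in);
        errno = saved;
        return -1;
    }

    int ret = 0;
    while (ret == 0) {
        ssize_t n = read(in, buf, sizeof(buf));
        if (n == 0)
            break;                      /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            ret = -1;
            break;
        }
        ssize_t off = 0;
        while (off < n) {               /* write() may be short */
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                ret = -1;
                break;
            }
            off += w;
        }
    }

    saved = errno;
    close(in);
    if (close(out) < 0 && ret == 0) {
        saved = errno;
        ret = -1;
    }
    if (ret < 0)
        unlink(dst);                    /* do not leave a partial copy behind */
    errno = saved;
    return ret;
}
```

Nothing here is atomic or point-in-time, which is the difference from a snapshot: the source can change while the loop runs.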
Heck, I think you could use reflink() to create a copyfile() in
libc that uses no additional syscall. But you couldn't use copyfile()
to create reflink().
Joel
--
"Lately I've been talking in my sleep.
Can't imagine what I'd have to say.
Except my world will be right
When love comes back my way."
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-12 17:28 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-13 4:30 ` Sage Weil
-1 siblings, 0 replies; 304+ messages in thread
From: Sage Weil @ 2009-05-13 4:30 UTC (permalink / raw)
To: Joel Becker
Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Tue, 12 May 2009, Joel Becker wrote:
> I'm not against two syscalls, but I'm not writing copyfile()
> here, just reflink(). Someone clearly could write copyfile() later and
> link into some of the same underlying mechanisms.
Ok, good.
> It's important to distinguish the semantics, though, and that's
> why I'm doing one thing. For example, reflink() is a snapshot (a
> "reference-counted link") and has behaviors based on that. libc should
> never fake it, because the callers expect those behaviors. Whereas
> copyfile() would be fakeable in libc with a read/write cycle on
> filesystems that don't support it. Things like that.
> Heck, I think you could use reflink() to create a copyfile() in
> libc that uses no additional syscall. But you couldn't use copyfile()
> to create reflink().
Right, except that you _could_ implement the degraded (no CAP_CHOWN)
reflink() behavior with a hypothetical copyfile().
I just think you should be sure that reflink() has _exactly_ the snapshot
semantics that make sense, without compromises that try to capture some or
all of copyfile() as well. Assuming that a copyfile() type syscall also
existed, would you really want reflink() to silently degrade to something
that can be implemented via copyfile() when you lack CAP_CHOWN?
With the proposed reflink(), we might end up with a final API that looks
something like:
        cowfile(src, dst, flags) - cow data and/or xattrs from src to dst
        reflink(src, dst)        - snapshot src to dst, or if !CAP_CHOWN, cowfile() instead
A simpler reflink() would make that degradation non-mandatory, and
trivially implemented in userspace by those who want it.
sage
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-11 20:40 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-14 3:57 ` Andy Lutomirski
-1 siblings, 0 replies; 304+ messages in thread
From: Andy Lutomirski @ 2009-05-14 3:57 UTC (permalink / raw)
To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
Joel Becker wrote:
> +
> +Preserving the security context of the source file obviously requires
> +the privilege to do so. Callers that do not own the source file and do
> +not have CAP_CHOWN will get a new reflink with all non-security
> +attributes preserved; the security context of the new reflink will be
> +as a newly created file by that user.
> +
There are plenty of syscalls that require some privilege and fail if the
caller doesn't have it. But I can think of only one syscall that does
*something different* depending on who called it: setuid.
Please search the web and marvel at the disasters caused by setuid's
magical caller-dependent behavior (the sendmail bug is probably the most
famous [1]). This proposal for reflink is just asking for bugs where an
attacker gets some otherwise privileged program to call reflink but to
somehow lack the privileges (CAP_CHOWN, selinux rights, or whatever) to
copy security attributes, thus exposing a link with the wrong permissions.
Would it really be that hard to have two syscalls, or a flag, or
whatever, where one of them preserves all security attributes and
*fails* if the caller isn't allowed to do that and the other one makes
the caller own the new link?
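To spell out the difference, here is a toy C model of the two behaviours. It is not kernel code. The flag name REFLINK_PRESERVE_SECURITY and all function names are invented for this sketch; the privilege test (owner of the source, or CAP_CHOWN) is taken from the quoted documentation above.

```c
#include <errno.h>
#include <stdbool.h>

enum reflink_outcome {
    REFLINK_PRESERVED = 0,      /* security context copied from the source */
    REFLINK_REINITIALIZED = 1   /* security context as for a newly created file */
};

/* Hypothetical flag: the caller asks for the security context to be kept. */
#define REFLINK_PRESERVE_SECURITY 0x1

/* Per the quoted text: the owner of the source, or a CAP_CHOWN holder. */
static bool may_preserve(bool owns_source, bool has_cap_chown)
{
    return owns_source || has_cap_chown;
}

/* v4 as proposed: the result silently depends on who is calling. */
static int v4_outcome(bool owns_source, bool has_cap_chown)
{
    return may_preserve(owns_source, has_cap_chown)
        ? REFLINK_PRESERVED : REFLINK_REINITIALIZED;
}

/* Explicit variant: the caller states what it wants and gets that or -EPERM. */
static int explicit_outcome(int flags, bool owns_source, bool has_cap_chown)
{
    if (!(flags & REFLINK_PRESERVE_SECURITY))
        return REFLINK_REINITIALIZED;
    return may_preserve(owns_source, has_cap_chown)
        ? REFLINK_PRESERVED : -EPERM;
}
```

With the explicit variant, a privileged program that somehow lost its privilege gets an error instead of a link with different attributes than it expected.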
[1] http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf
--Andy
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-14 3:57 ` [Ocfs2-devel] " Andy Lutomirski
@ 2009-05-14 18:12 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-14 18:12 UTC (permalink / raw)
To: Andy Lutomirski
Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
On Wed, 2009-05-13 at 23:57 -0400, Andy Lutomirski wrote:
> Joel Becker wrote:
> > +
> > +Preserving the security context of the source file obviously requires
> > +the privilege to do so. Callers that do not own the source file and do
> > +not have CAP_CHOWN will get a new reflink with all non-security
> > +attributes preserved; the security context of the new reflink will be
> > +as a newly created file by that user.
> > +
>
> There are plenty of syscalls that require some privilege and fail if the
> caller doesn't have it. But I can think of only one syscall that does
> *something different* depending on who called it: setuid.
>
> Please search the web and marvel at the disasters caused by setuid's
> magical caller-dependent behavior (the sendmail bug is probably the most
> famous [1]). This proposal for reflink is just asking for bugs where an
> attacker gets some otherwise privileged program to call reflink but to
> somehow lack the privileges (CAP_CHOWN, selinux rights, or whatever) to
> copy security attributes, thus exposing a link with the wrong permissions.
>
> Would it really be that hard to have two syscalls, or a flag, or
> whatever, where one of them preserves all security attributes and
> *fails* if the caller isn't allowed to do that and the other one makes
> the caller own the new link?
>
>
> [1] http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf
Yes, I agree - the selection of whether or not to preserve the security
attributes should be an explicit part of the kernel interface. Then the
application still has the freedom to fall back on the non-preserving
form of the call if that is truly what it wants.
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-14 18:12 ` [Ocfs2-devel] " Stephen Smalley
@ 2009-05-14 22:00 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-14 22:00 UTC (permalink / raw)
To: Stephen Smalley
Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro,
mtk.manpages, linux-security-module, linux-fsdevel
On Thu, May 14, 2009 at 02:12:45PM -0400, Stephen Smalley wrote:
> On Wed, 2009-05-13 at 23:57 -0400, Andy Lutomirski wrote:
> > Joel Becker wrote:
> > > +
> > > +Preserving the security context of the source file obviously requires
> > > +the privilege to do so. Callers that do not own the source file and do
> > > +not have CAP_CHOWN will get a new reflink with all non-security
> > > +attributes preserved; the security context of the new reflink will be
> > > +as a newly created file by that user.
> > > +
> >
> > There are plenty of syscalls that require some privilege and fail if the
> > caller doesn't have it. But I can think of only one syscall that does
> > *something different* depending on who called it: setuid.
> >
> > Please search the web and marvel at the disasters caused by setuid's
> > magical caller-dependent behavior (the sendmail bug is probably the most
> > famous [1]). This proposal for reflink is just asking for bugs where an
> > attacker gets some otherwise privileged program to call reflink but to
> > somehow lack the privileges (CAP_CHOWN, selinux rights, or whatever) to
> > copy security attributes, thus exposing a link with the wrong permissions.
> >
> > Would it really be that hard to have two syscalls, or a flag, or
> > whatever, where one of them preserves all security attributes and
> > *fails* if the caller isn't allowed to do that and the other one makes
> > the caller own the new link?
> >
> >
> > [1] http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf
>
> Yes, I agree - the selection of whether or not to preserve the security
> attributes should be an explicit part of the kernel interface. Then the
> application still has the freedom to fall back on the non-preserving
> form of the call if that is truly what it wants.
Here's my problem. Every single shell script now has to do:
    ln -r source target
    [ $? != 0 ] && ln -r --no-perms source target
Every single program now has to do:
    if (reflink(source, target) && errno == EPERM)
        reflinkat(AT_FDCWD, source, AT_FDCWD, target, 0, REFLINK_NOPERMS);
Because the 99% user wants a real snapshot, and doesn't want to have to
think about it. They could, of course, code up their own permission
checks to see which variant of reflink to call, but it's still useless
(to them) boilerplate.
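For reference, the fallback can be written once as a helper instead of at every call site. The following is a minimal sketch, assuming two primitives with a hypothetical shared signature (`reflink_fn`); the `stub_*` functions are stand-ins for demonstration and are not real system calls.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical signature shared by both primitives:
 * 0 on success, -1 with errno set on failure. */
typedef int (*reflink_fn)(const char *source, const char *target);

/* Try the preserving form first; fall back only on EPERM.
 * *fell_back tells the caller which form actually succeeded. */
int reflink_with_fallback(reflink_fn preserve, reflink_fn noperms,
                          const char *source, const char *target,
                          int *fell_back)
{
    assert(preserve && noperms && fell_back);
    *fell_back = 0;
    if (preserve(source, target) == 0)
        return 0;
    if (errno != EPERM)
        return -1;      /* other errors: a fallback would not help */
    *fell_back = 1;
    return noperms(source, target);
}

/* Stand-ins for demonstration only. */
int stub_denied(const char *s, const char *t)
{
    (void)s; (void)t;
    errno = EPERM;
    return -1;
}

int stub_missing(const char *s, const char *t)
{
    (void)s; (void)t;
    errno = ENOENT;
    return -1;
}

int stub_ok(const char *s, const char *t)
{
    (void)s; (void)t;
    return 0;
}
```

The `fell_back` out-parameter is a design choice of this sketch: it lets a caller that cares find out that the attributes were not preserved.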
Also, if the 'common' user has to use the reflinkat() call?
We've lost.
Finally, how is this safer? Don't get me wrong, I do respect
the concern - that's why I originally went with your proposal of
is_owner_or_cap(). But the fact is that if you've hijacked a process
with enough privileges, you *can* make the full reflink, and if your
hijacked process doesn't but does have read access, you *can* make the
NOPERMS reflink. So doing it with the userspace code above is identical
to the kernel code, except that every userspace program has to handle it
themselves.
Joel
--
"Vote early and vote often."
- Al Capone
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-14 22:00 ` [Ocfs2-devel] " Joel Becker
(?)
@ 2009-05-15 1:20 ` Jamie Lokier
-1 siblings, 0 replies; 304+ messages in thread
From: Jamie Lokier @ 2009-05-15 1:20 UTC (permalink / raw)
To: Stephen Smalley, Andy Lutomirski, jim owens, jmorris,
ocfs2-devel, viro, mtk.manpages
Joel Becker wrote:
> Here's my problem. Every single shell script now has to do:
>
> ln -r source target
> [ $? != 0 ] && ln -r --no-perms source target
No, they'll obviously do
    ln -Rr source target
It is not a burden to type that.
(Where -R == your -r --no-perms, and -R -r together means try -R then -r).
> Every single program now has to do:
>
> if (reflink(source, target) && errno == EPERM)
> reflinkat(AT_FDCWD, source, AT_FDCWD, target, 0, REFLINK_NOPERMS);
Yes if that's what they want.
> Because the 99% user wants a real snapshot,
A quick poll based on emails in these threads says >50% doesn't want a real snapshot :-)
But even at 99%, what about the other 1%?
As I've explained, it is _impossible_ for userspace to do the "ln -r" thing
itself in some conditions given your system call.
> and doesn't want to have to think about it.
The problem with the "automatic" switch is that it isn't obvious, so
people will make mistaken assumptions when using it.
If they _want_ the automatic switch, then a few moments of thought
doesn't matter. Make it easy if you care: like "ln -Rr" in scripts
and a flag REFLINK_PERMS_IF_ALLOWED in the system call.
This is especially so with reflink(), because the userspace code needed if
you _didn't_ want the automatic change is tricky to write (and
extremely difficult to get right), so authors will either not bother,
or do it badly.
And test suites for programs using reflink() will pass nicely, yet the
code may still be broken because ordinary users can't test the "other
user's files" cases.
> They could, of course, code up their own permission
> checks to see which variant of reflink to call, but it's still useless
> (to them) boilerplate.
Why wouldn't you just do the two calls? It's much easier. But even
that goes away with REFLINK_PERMS_IF_ALLOWED (and conversely
REFLINK_PERMS_STRICT).
(Note it's not just permissions - it's also timestamps, group,
xattrs. The flag names could reflect that).
> Also, if the 'common' user has to use the reflinkat() call? We've lost.
Provide a reflink() call in libc. Problem solved.
Heck, provide separate reflink() and cowlink() calls in libc if you
don't like a flag.
> Finally, how is this safer? Don't get me wrong, I do respect
> the concern - that's why I originally went with your proposal of
> is_owner_or_cap(). But the fact is that if you've hijacked a process
> with enough privileges, you *can* make the full reflink, and if your
> hijacked process doesn't but does have read access, you *can* make the
> NOPERMS reflink.
If you can trick a process into unexpected behaviour, it doesn't mean
you can make it do just anything. It means you can trick specific
checks and assumptions that the program makes into being wrong,
because you made something behave in a way the authors didn't expect.
Building on that, sometimes the trick is enough to make a backdoor.
Which is why file system calls should behave in a simple way that
doesn't surprise anyone.
> So doing it with the userspace code above is identical
> to the kernel code, except that every userspace program has to handle it
> themselves.
No, because not every userspace program _wants_ that behaviour.
So you have these problems if it's forced in the kernel:
- Userspace programs that _don't want_ a "full reflink" but have the
privilege to do so. Sometimes they can't do the chmod/etc. to
fix the attributes after _at all_ (think setgid-directories
among other things - it's *hard* to simulate that in userspace
and never quite right).
- Sometimes fixing up afterwards would be a security race
condition - the temporary unwanted permissions can be
looser than the process wants to expose in the new directory.
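The fix-up race has a familiar analogue in ordinary file creation, which can be shown with existing POSIX calls. This is only an analogy (reflink itself is not involved), and the function names `create_then_tighten()` and `create_tight()` are made up for the sketch.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Racy: the file exists with mode 0644 (minus umask) until fchmod()
 * runs, so another process can open it for reading in between. */
int create_then_tighten(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;
    /* window: the looser permissions are visible here */
    if (fchmod(fd, 0600) < 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    return fd;
}

/* Not racy: the restrictive mode is part of the creating call,
 * so the file is never visible with looser permissions. */
int create_tight(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}
```

Both functions end with the same mode bits, but only the second never exposes the file with looser ones. A call that always copies the source's attributes forces callers who want something else into the first pattern.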
What I'm seeing is that for the benefit of saving exactly one line in
some userspace programs - a line which is quite helpful in showing
what the program intends - it will cost about 1000 lines of code
(which is still slightly broken) in other userspace programs, and I
can think of a number of those programs already. Not pretty.
If you don't like the two calls, just add a flag which means try one
then the other. Then it's clear what the app is requesting, and
invites authors to decide what behaviour they want, trivially.
-- Jamie
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-14 22:00 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-15 12:01 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-15 12:01 UTC (permalink / raw)
To: Joel Becker
Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro,
mtk.manpages, linux-security-module, linux-fsdevel
On Thu, 2009-05-14 at 15:00 -0700, Joel Becker wrote:
> On Thu, May 14, 2009 at 02:12:45PM -0400, Stephen Smalley wrote:
> > On Wed, 2009-05-13 at 23:57 -0400, Andy Lutomirski wrote:
> > > Joel Becker wrote:
> > > > +
> > > > +Preserving the security context of the source file obviously requires
> > > > +the privilege to do so. Callers that do not own the source file and do
> > > > +not have CAP_CHOWN will get a new reflink with all non-security
> > > > +attributes preserved; the security context of the new reflink will be
> > > > +as a newly created file by that user.
> > > > +
> > >
> > > There are plenty of syscalls that require some privilege and fail if the
> > > caller doesn't have it. But I can think of only one syscall that does
> > > *something different* depending on who called it: setuid.
> > >
> > > Please search the web and marvel at the disasters caused by setuid's
> > > magical caller-dependent behavior (the sendmail bug is probably the most
> > > famous [1]). This proposal for reflink is just asking for bugs where an
> > > attacker gets some otherwise privileged program to call reflink but to
> > > somehow lack the privileges (CAP_CHOWN, selinux rights, or whatever) to
> > > copy security attributes, thus exposing a link with the wrong permissions.
> > >
> > > Would it really be that hard to have two syscalls, or a flag, or
> > > whatever, where one of them preserves all security attributes and
> > > *fails* if the caller isn't allowed to do that and the other one makes
> > > the caller own the new link?
> > >
> > >
> > > [1] http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf
> >
> > Yes, I agree - the selection of whether or not to preserve the security
> > attributes should be an explicit part of the kernel interface. Then the
> > application still has the freedom to fall back on the non-preserving
> > form of the call if that is truly what it wants.
>
> Here's my problem. Every single shell script now has to do:
>
> ln -r source target
> [ $? != 0 ] && ln -r --no-perms source target
>
> Every single program now has to do:
>
> if (reflink(source, target) && errno == EPERM)
> reflinkat(AT_FDCWD, source, AT_FDCWD, target, 0, REFLINK_NOPERMS);
>
> Because the 99% user wants a real snapshot, and doesn't want to have to
> think about it. They could, of course, code up their own permission
> checks to see which variant of reflink to call, but it's still useless
> (to them) boilerplate.
> Also, if the 'common' user has to use the reflinkat() call?
> We've lost.
I think Jamie covered the fact that you can provide a user interface and
library functions that provide the "simpler" interface on top of this
interface, but not vice versa.
> Finally, how is this safer? Don't get me wrong, I do respect
> the concern - that's why I originally went with your proposal of
> is_owner_or_cap(). But the fact is that if you've hijacked a process
> with enough privileges, you *can* make the full reflink, and if your
> hijacked process doesn't but does have read access, you *can* make the
> NOPERMS reflink. So doing it with the userspace code above is identical
> to the kernel code, except that every userspace program has to handle it
> themselves.
As Jamie said, we aren't talking about injecting arbitrary code into the
process. The failure scenario is quite similar to the setuid() one:
arrange conditions such that the process lacks sufficient privileges to
preserve attributes, and when it calls reflink(2) expecting to preserve
the attributes, it will get no indication that they weren't preserved.
At which point the data may be unwittingly exposed beyond its original
constraints.
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-15 12:01 ` [Ocfs2-devel] " Stephen Smalley
@ 2009-05-15 15:22 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-15 15:22 UTC (permalink / raw)
To: Stephen Smalley
Cc: Andy Lutomirski, jmorris, linux-fsdevel, linux-security-module,
mtk.manpages, jim owens, ocfs2-devel, viro
On Fri, May 15, 2009 at 08:01:45AM -0400, Stephen Smalley wrote:
> > Finally, how is this safer? Don't get me wrong, I do respect
> > the concern - that's why I originally went with your proposal of
> > is_owner_or_cap(). But the fact is that if you've hijacked a process
> > with enough privileges, you *can* make the full reflink, and if your
> > hijacked process doesn't but does have read access, you *can* make the
> > NOPERMS reflink. So doing it with the userspace code above is identical
> > to the kernel code, except that every userspace program has to handle it
> > themselves.
>
> As Jamie said, we aren't talking about injecting arbitrary code into the
> process. The failure scenario is quite similar to the setuid() one:
> arrange conditions such that the process lacks sufficient privileges to
> preserve attributes, and when it calls reflink(2) expecting to preserve
> the attributes, it will get no indication that they weren't preserved.
> At which point the data may be unwittingly exposed beyond its original
> constraints.
I wasn't being specific to injected code. Assume we have a
deliberate flag to reflinkat(2). Then we provide reflink(3) in
userspace that does the fallback, keeping it out of the kernel. Doesn't
that have the exact same problem?
Joel
--
"Same dancers in the same old shoes.
You get too careful with the steps you choose.
You don't care about winning but you don't want to lose
After the thrill is gone."
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-15 15:22 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-15 15:55 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-15 15:55 UTC (permalink / raw)
To: Joel Becker
Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro,
mtk.manpages, linux-security-module, linux-fsdevel
On Fri, 2009-05-15 at 08:22 -0700, Joel Becker wrote:
> On Fri, May 15, 2009 at 08:01:45AM -0400, Stephen Smalley wrote:
> > > Finally, how is this safer? Don't get me wrong, I do respect
> > > the concern - that's why I originally went with your proposal of
> > > is_owner_or_cap(). But the fact is that if you've hijacked a process
> > > with enough privileges, you *can* make the full reflink, and if your
> > > hijacked process doesn't but does have read access, you *can* make the
> > > NOPERMS reflink. So doing it with the userspace code above is identical
> > > to the kernel code, except that every userspace program has to handle it
> > > themselves.
> >
> > As Jamie said, we aren't talking about injecting arbitrary code into the
> > process. The failure scenario is quite similar to the setuid() one:
> > arrange conditions such that the process lacks sufficient privileges to
> > preserve attributes, and when it calls reflink(2) expecting to preserve
> > the attributes, it will get no indication that they weren't preserved.
> > At which point the data may be unwittingly exposed beyond its original
> > constraints.
>
> I wasn't being specific to injected code. Assume we have a
> deliberate flag to reflinkat(2). Then we provide reflink(3) in
> userspace that does the fallback, keeping it out of the kernel. Doesn't
> that have the exact same problem?
You wouldn't always do the fallback in reflink(3), but instead provide a
helper interface that would perform the fallback for applications that
want that behavior.
Consider a program that wants to always preserve attributes on the
reflinks it creates. If the interface allows the program to explicitly
request that behavior and returns an error when the request cannot be
honored, then the program knows that upon a successful return, the
attributes were in fact preserved. If the interface instead silently
selects a behavior based on the current privileges of the process and
gives no indication to the caller as to what behavior was selected, then
the opportunity for error is great.
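To make the difference concrete, here is a minimal C sketch. Both functions are hypothetical stand-ins, not the proposed system call, and the `outcome` struct models on-disk state that a real caller would not be handed back. It also assumes EPERM as the error for a request that cannot be honored.

```c
#include <errno.h>
#include <stdbool.h>

/* Models what ended up on disk; a real caller never sees this directly. */
struct outcome {
    bool attrs_preserved;
};

/* Hypothetical "silent" interface: chooses a behavior from the caller's
 * privileges and reports success either way. */
static int reflink_silent(bool privileged, struct outcome *out)
{
    out->attrs_preserved = privileged;
    return 0;
}

/* Hypothetical "explicit" interface: the caller states its intent, and a
 * preserve request that cannot be honored fails (EPERM is an assumption). */
static int reflink_explicit(bool privileged, bool preserve, struct outcome *out)
{
    if (preserve && !privileged) {
        errno = EPERM;
        return -1;
    }
    out->attrs_preserved = preserve;
    return 0;
}
```

With the silent variant, an unprivileged caller gets 0 back even though nothing was preserved. With the explicit variant, a 0 return on a preserve request means the attributes really were kept.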
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-15 15:55 ` [Ocfs2-devel] " Stephen Smalley
@ 2009-05-15 16:42 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-15 16:42 UTC (permalink / raw)
To: Stephen Smalley
Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro,
mtk.manpages, linux-security-module, linux-fsdevel
On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote:
> > I wasn't being specific to injected code. Assume we have a
> > deliberate flag to reflinkat(2). Then we provide reflink(3) in
> > userspace that does the fallback, keeping it out of the kernel. Doesn't
> > that have the exact same problem?
>
> You wouldn't always do the fallback in reflink(3), but instead provide a
> helper interface that would perform the fallback for applications that
> want that behavior.
But isn't that reflink(3)? And the application that wants to
know uses reflinkat(2)?
>
> Consider a program that wants to always preserve attributes on the
> reflinks it creates. If the interface allows the program to explicitly
> request that behavior and returns an error when the request cannot be
> honored, then the program knows that upon a successful return, the
> attributes were in fact preserved. If the interface instead silently
> selects a behavior based on the current privileges of the process and
> gives no indication to the caller as to what behavior was selected, then
> the opportunity for error is great.
I get that. I'm looking at what the programming interface is.
What's the standard function for "I want the fallback behavior" called?
What's the standard function for "I want to preserve security" called?
"int reflink(oldpath, newpath)" has to pick one of the behaviors. Which
is it?
Joel
--
Life's Little Instruction Book #69
"Whistle"
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 304+ messages in thread
* [Ocfs2-devel] [RFC] The reflink(2) system call v4.
2009-05-15 16:42 ` [Ocfs2-devel] " Joel Becker
(?)
@ 2009-05-15 17:01 ` Shaya Potter
-1 siblings, 0 replies; 304+ messages in thread
From: Shaya Potter @ 2009-05-15 17:01 UTC (permalink / raw)
To: ocfs2-devel
Joel Becker wrote:
> On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote:
>>> I wasn't being specific to injected code. Assume we have a
>>> deliberate flag to reflinkat(2). Then we provide reflink(3) in
>>> userspace that does the fallback, keeping it out of the kernel. Doesn't
>>> that have the exact same problem?
>> You wouldn't always do the fallback in reflink(3), but instead provide a
>> helper interface that would perform the fallback for applications that
>> want that behavior.
>
> But isn't that reflink(3)? And the application that wants to
> know uses reflinkat(2)?
>> Consider a program that wants to always preserve attributes on the
>> reflinks it creates. If the interface allows the program to explicitly
>> request that behavior and returns an error when the request cannot be
>> honored, then the program knows that upon a successful return, the
>> attributes were in fact preserved. If the interface instead silently
>> selects a behavior based on the current privileges of the process and
>> gives no indication to the caller as to what behavior was selected, then
>> the opportunity for error is great.
>
> I get that. I'm looking at what the programming interface is.
> What's the standard function for "I want the fallback behavior" called?
> What's the standard function for "I want preserve security" called?
> "int reflink(oldpath, newpath)" has to pick one of the behaviors. Which
> is it?
whenever there's hidden fallback behavior that changes the security
semantics, you will cause programming errors.
the only correct way to code an application that wants the fallback
functionality is
	if (initial_behavior()) {
		if (fallback_behavior()) {
			/* some sort of error */
		}
	}
as that way the application knows what occurred. if that logic is
wrapped in a single function, you would have to do something like
	ret = initial_and_fallback();
	if (ret == 0) {
		fallback = 0;
	} else if (ret == 1) {
		fallback = 1;
	} else {
		/* some sort of error */
	}
which is much more prone to error.
at the end of the day, a single function that has hidden fallback
behavior does not really save lines of code in a well written
application. it does however make it easier to write a poorly written
application that can cause security problems.
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
2009-05-15 16:42 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-15 20:53 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-15 20:53 UTC (permalink / raw)
To: Stephen Smalley, Andy Lutomirski, jim owens, jmorris,
ocfs2-devel, viro, mtk.manpages
On Fri, May 15, 2009 at 09:42:09AM -0700, Joel Becker wrote:
> On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote:
> > Consider a program that wants to always preserve attributes on the
> > reflinks it creates. If the interface allows the program to explicitly
> > request that behavior and returns an error when the request cannot be
> > honored, then the program knows that upon a successful return, the
> > attributes were in fact preserved. If the interface instead silently
> > selects a behavior based on the current privileges of the process and
> > gives no indication to the caller as to what behavior was selected, then
> > the opportunity for error is great.
>
> I get that. I'm looking at what the programming interface is.
> What's the standard function for "I want the fallback behavior" called?
> What's the standard function for "I want preserve security" called?
> "int reflink(oldpath, newpath)" has to pick one of the behaviors. Which
> is it?
Ok, I've been casting about how to solve the concern and provide
a decent interface. I'm not about to give up on either. I think,
though, that we do have to let the application signal its intent to the
system. And if we're doing that, let's add a little flexibility.
I think the interface will be this (ignoring the reflinkat(2)
bit for now):
int reflink(const char *oldpath, const char *newpath, int preserve);
- Data and xattrs are reflinked always.
- 'preserve' is a bitfield describing which attributes to keep across the
reflink:
* REFLINK_ATTR_OWNER - Keeps uid/gid the same. Requires ownership or
CAP_CHOWN.
* REFLINK_ATTR_SECURITY - Keeps the security state (SELinux/SMACK/etc)
the same. This requires REFLINK_ATTR_OWNER (the security state makes
no sense if the ownership changes). If not set, the filesystem wipes
all security.* xattrs and reinitializes with
security_inode_init_security() just like a new file.
* REFLINK_ATTR_MODE - Keeps the mode bits the same. Requires ownership
or CAP_FOWNER.
* REFLINK_ATTR_ACL - Keeps the ACLs the same. Requires
REFLINK_ATTR_MODE, as ACLs have to get adjusted when the mode
changes, and so you can't keep them the same if the mode wasn't
preserved. If not set, the filesystem reinits the ACLs as for a new
file.
- REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.
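As a rough sketch of the dependency rules in the list above, the check below rejects SECURITY without OWNER and ACL without MODE. The individual bit values and the helper name `reflink_preserve_valid()` are assumptions for illustration. The proposal only fixes REFLINK_ATTR_NONE as 0 and REFLINK_ATTR_ALL as ~0.

```c
#include <stdbool.h>

/* Bit positions are assumed; only NONE (0) and ALL (~0) come from the proposal. */
#define REFLINK_ATTR_NONE     0
#define REFLINK_ATTR_OWNER    (1 << 0)
#define REFLINK_ATTR_SECURITY (1 << 1)
#define REFLINK_ATTR_MODE     (1 << 2)
#define REFLINK_ATTR_ACL      (1 << 3)
#define REFLINK_ATTR_ALL      (~0)

/* Hypothetical helper: returns true when the preserve mask obeys the
 * stated dependencies between flags. */
static bool reflink_preserve_valid(int preserve)
{
    /* Security state makes no sense if ownership changes. */
    if ((preserve & REFLINK_ATTR_SECURITY) && !(preserve & REFLINK_ATTR_OWNER))
        return false;
    /* ACLs cannot stay the same unless the mode is preserved too. */
    if ((preserve & REFLINK_ATTR_ACL) && !(preserve & REFLINK_ATTR_MODE))
        return false;
    return true;
}
```

Because REFLINK_ATTR_ALL sets every bit, it satisfies both dependencies by construction, and REFLINK_ATTR_NONE trivially does as well.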
That's all the relevant attributes. The timestamps behave as
already described (ctime is now, mtime matches the source), which is the
only sane behavior for this sort of thing.
So, a copy program would reflink(source, target,
REFLINK_ATTR_NONE), a snapshot program would reflink(source, target,
REFLINK_ATTR_ALL), and someone wanting the fallback behavior can do it
easily.
In the kernel, security_inode_reflink() gets passed the preserve
bits. It's responsible for determining whether REFLINK_ATTR_SECURITY is
allowed (vfs_reflink() will already have asserted REFLINK_ATTR_OWNER).
It may do other checks on the reflink and the preserve bits, that's up
to the LSM.
For scripting, we add the '-p' and '-P' options to "ln -r":
- ln -r == reflink(source, target, REFLINK_ATTR_NONE);
- ln -r -P == reflink(source, target, REFLINK_ATTR_ALL);
- ln -r -p == the fallback behavior. This is like cp(1), where "cp -p"
is best-effort.
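A minimal sketch of the "ln -r -p" fallback done in userspace, assuming the proposed three-argument signature and assuming that a refused preserve request fails with EPERM (the thread does not name the errno). reflink() is passed in as a function pointer because it does not exist yet. `reflink_best_effort()` and the stub are hypothetical names.

```c
#include <errno.h>

#define REFLINK_ATTR_NONE 0
#define REFLINK_ATTR_ALL  (~0)

/* Shape of the proposed call; supplied by the caller since no libc has it. */
typedef int (*reflink_fn)(const char *oldpath, const char *newpath, int preserve);

/* Hypothetical helper: try to preserve everything, fall back to nothing.
 * On success, *applied records which mask took effect so the caller is
 * never left guessing. Returns 0 on success, -1 on failure. */
static int reflink_best_effort(reflink_fn do_reflink, const char *oldpath,
                               const char *newpath, int *applied)
{
    if (do_reflink(oldpath, newpath, REFLINK_ATTR_ALL) == 0) {
        *applied = REFLINK_ATTR_ALL;
        return 0;
    }
    if (errno != EPERM)
        return -1;              /* unrelated failure: do not retry */
    if (do_reflink(oldpath, newpath, REFLINK_ATTR_NONE) != 0)
        return -1;
    *applied = REFLINK_ATTR_NONE;
    return 0;
}

/* Stand-in for an unprivileged caller: any preserve request gets EPERM. */
static int reflink_unprivileged_stub(const char *oldpath, const char *newpath,
                                     int preserve)
{
    (void)oldpath;
    (void)newpath;
    if (preserve != REFLINK_ATTR_NONE) {
        errno = EPERM;
        return -1;
    }
    return 0;
}
```

The mask is reported through an out-parameter rather than the return value because REFLINK_ATTR_ALL is ~0, which would collide with a -1 error return.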
Does this make everyone happy?
Joel
--
"In the beginning, the universe was created. This has made a lot
of people very angry, and is generally considered to have been a
bad move."
- Douglas Adams
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
2009-05-15 20:53 ` Joel Becker
@ 2009-05-18 9:17 ` Jörn Engel
-1 siblings, 0 replies; 304+ messages in thread
From: Jörn Engel @ 2009-05-18 9:17 UTC (permalink / raw)
To: Joel Becker
Cc: Stephen Smalley, Andy Lutomirski, jim owens, jmorris,
ocfs2-devel, viro, mtk.manpages, linux-security-module,
linux-fsdevel
On Fri, 15 May 2009 13:53:35 -0700, Joel Becker wrote:
>
> Does this make everyone happy?
Provided the only fallback is to return an error code and let userspace
decide what to do, I'm a happy camper.
Not sure how many of the REFLINK_ATTR_* flags will actually be used,
apart from ALL and NONE. But I don't mind having them.
Jörn
--
People will accept your ideas much more readily if you tell them
that Benjamin Franklin said it first.
-- unknown
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
2009-05-15 20:53 ` Joel Becker
@ 2009-05-18 13:02 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-18 13:02 UTC (permalink / raw)
To: Joel Becker
Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro,
mtk.manpages, linux-security-module, linux-fsdevel
On Fri, 2009-05-15 at 13:53 -0700, Joel Becker wrote:
> On Fri, May 15, 2009 at 09:42:09AM -0700, Joel Becker wrote:
> > On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote:
> > > Consider a program that wants to always preserve attributes on the
> > > reflinks it creates. If the interface allows the program to explicitly
> > > request that behavior and returns an error when the request cannot be
> > > honored, then the program knows that upon a successful return, the
> > > attributes were in fact preserved. If the interface instead silently
> > > selects a behavior based on the current privileges of the process and
> > > gives no indication to the caller as to what behavior was selected, then
> > > the opportunity for error is great.
> >
> > I get that. I'm looking at what the programming interface is.
> > What's the standard function for "I want the fallback behavior" called?
> > What's the standard function for "I want preserve security" called?
> > "int reflink(oldpath, newpath)" has to pick one of the behaviors. Which
> > is it?
>
> Ok, I've been casting about how to solve the concern and provide
> a decent interface. I'm not about to give up on either. I think,
> though, that we do have to let the application signal its intent to the
> system. And if we're doing that, let's add a little flexibility.
> I think the interface will be this (ignoring the reflinkat(2)
> bit for now):
>
> int reflink(const char *oldpath, const char *newpath, int preserve);
>
> - Data and xattrs are reflinked always.
> - 'preserve' is a bitfield describing which attributes to keep across the
> reflink:
> * REFLINK_ATTR_OWNER - Keeps uid/gid the same. Requires ownership or
> CAP_CHOWN.
> * REFLINK_ATTR_SECURITY - Keeps the security state (SELinux/SMACK/etc)
> the same. This requires REFLINK_ATTR_OWNER (the security state makes
> no sense if the ownership changes). If not set, the filesystem wipes
> all security.* xattrs and reinitializes with
> security_inode_init_security() just like a new file.
> * REFLINK_ATTR_MODE - Keeps the mode bits the same. Requires ownership
> or CAP_FOWNER.
> * REFLINK_ATTR_ACL - Keeps the ACLs the same. Requires
> REFLINK_ATTR_MODE, as ACLs have to get adjusted when the mode
> changes, and so you can't keep them the same if the mode wasn't
> preserved. If not set, the filesystem reinits the ACLs as for a new
> file.
> - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.
>
> That's all the relevant attributes. The timestamps behave as
> already described (ctime is now, mtime matches the source), which is the
> only sane behavior for this sort of thing.
> So, a copy program would reflink(source, target,
> REFLINK_ATTR_NONE), a snapshot program would reflink(source, target,
> REFLINK_ATTR_ALL), and someone wanting the fallback behavior can do it
> easily.
> In the kernel, security_inode_reflink() gets passed the preserve
> bits. It's responsible for determining whether REFLINK_ATTR_SECURITY is
> allowed (vfs_reflink() will already have asserted REFLINK_ATTR_OWNER).
> It may do other checks on the reflink and the preserve bits, that's up
> to the LSM.
> For scripting, we add the '-p' and '-P' options to "ln -r":
>
> - ln -r == reflink(source, target, REFLINK_ATTR_NONE);
> - ln -r -P == reflink(source, target, REFLINK_ATTR_ALL);
> - ln -r -p == the fallback behavior. This is like cp(1), where "cp -p"
> is best-effort.
>
> Does this make everyone happy?
For simplicity and robustness, I would only support the none or all
flags, i.e. preserve can be a simple bool. I don't think you really
want to deal with the individual flags, and I don't see a use case for
them.
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
2009-05-18 13:02 ` Stephen Smalley
@ 2009-05-18 14:33 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-18 14:33 UTC (permalink / raw)
To: Joel Becker
Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro,
mtk.manpages, linux-security-module, linux-fsdevel
On Mon, 2009-05-18 at 09:02 -0400, Stephen Smalley wrote:
> On Fri, 2009-05-15 at 13:53 -0700, Joel Becker wrote:
> > On Fri, May 15, 2009 at 09:42:09AM -0700, Joel Becker wrote:
> > > On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote:
> > > > Consider a program that wants to always preserve attributes on the
> > > > reflinks it creates. If the interface allows the program to explicitly
> > > > request that behavior and returns an error when the request cannot be
> > > > honored, then the program knows that upon a successful return, the
> > > > attributes were in fact preserved. If the interface instead silently
> > > > selects a behavior based on the current privileges of the process and
> > > > gives no indication to the caller as to what behavior was selected, then
> > > > the opportunity for error is great.
> > >
> > > I get that. I'm looking at what the programming interface is.
> > > What's the standard function for "I want the fallback behavior" called?
> > > What's the standard function for "I want preserve security" called?
> > > "int reflink(oldpath, newpath)" has to pick one of the behaviors. Which
> > > is it?
> >
> > Ok, I've been casting about how to solve the concern and provide
> > a decent interface. I'm not about to give up on either. I think,
> > though, that we do have to let the application signal its intent to the
> > system. And if we're doing that, let's add a little flexibility.
> > I think the interface will be this (ignoring the reflinkat(2)
> > bit for now):
> >
> > int reflink(const char *oldpath, const char *newpath, int preserve);
> >
> > - Data and xattrs are reflinked always.
> > > - 'preserve' is a bitfield describing which attributes to keep across the
> > reflink:
> > * REFLINK_ATTR_OWNER - Keeps uid/gid the same. Requires ownership or
> > CAP_CHOWN.
> > * REFLINK_ATTR_SECURITY - Keeps the security state (SELinux/SMACK/etc)
> > the same. This requires REFLINK_ATTR_OWNER (the security state makes
> > no sense if the ownership changes). If not set, the filesystem wipes
> > all security.* xattrs and reinitializes with
> > security_inode_init_security() just like a new file.
> > * REFLINK_ATTR_MODE - Keeps the mode bits the same. Requires ownership
> > or CAP_FOWNER.
> > * REFLINK_ATTR_ACL - Keeps the ACLs the same. Requires
> > REFLINK_ATTR_MODE, as ACLs have to get adjusted when the mode
> > changes, and so you can't keep them the same if the mode wasn't
> > preserved. If not set, the filesystem reinits the ACLs as for a new
> > file.
> > - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.
> >
> > That's all the relevant attributes. The timestamps behave as
> > already described (ctime is now, mtime matches the source), which is the
> > only sane behavior for this sort of thing.
> > So, a copy program would reflink(source, target,
> > REFLINK_ATTR_NONE), a snapshot program would reflink(source, target,
> > REFLINK_ATTR_ALL), and someone wanting the fallback behavior can do it
> > easily.
> > In the kernel, security_inode_reflink() gets passed the preserve
> > bits. It's responsible for determining whether REFLINK_ATTR_SECURITY is
> > allowed (vfs_reflink() will already have asserted REFLINK_ATTR_OWNER).
> > It may do other checks on the reflink and the preserve bits, that's up
> > to the LSM.
> > > For scripting, we add the '-p' and '-P' options to "ln -r":
> >
> > - ln -r == reflink(source, target, REFLINK_ATTR_NONE);
> > - ln -r -P == reflink(source, target, REFLINK_ATTR_ALL);
> > - ln -r -p == the fallback behavior. This is like cp(1), where "cp -p"
> > is best-effort.
> >
> > Does this make everyone happy?
>
> For simplicity and robustness, I would only support the none or all
> flags, i.e. preserve can be a simple bool. I don't think you really
> want to deal with the individual flags, and I don't see a use case for
> them.
Or possibly only distinguish preserve-dac from preserve-mac, e.g.
REFLINK_ATTR_NONE (preserve none),
REFLINK_ATTR_DAC (preserve uid, gid, mode, and ACLs ala cp -p)
REFLINK_ATTR_MAC (preserve MAC security label ala cp -c)
REFLINK_ATTR_ALL (preserve all)
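
A minimal sketch of how these four values might be laid out, assuming DAC and MAC each get one bit and "all" is simply their union. The numeric values and the helper reflink_preserve_name() are hypothetical; the thread only names the constants.

```c
#include <stddef.h>

/* Hypothetical bit values; the thread does not assign numbers to these names. */
#define REFLINK_ATTR_NONE 0x0
#define REFLINK_ATTR_DAC  0x1	/* uid, gid, mode, and ACLs */
#define REFLINK_ATTR_MAC  0x2	/* MAC security label */
#define REFLINK_ATTR_ALL  (REFLINK_ATTR_DAC | REFLINK_ATTR_MAC)

/*
 * Hypothetical helper: name a preserve value, or return NULL if it is
 * not one of the four choices listed above.
 */
static const char *reflink_preserve_name(int preserve)
{
	switch (preserve) {
	case REFLINK_ATTR_NONE:
		return "none";
	case REFLINK_ATTR_DAC:
		return "dac";
	case REFLINK_ATTR_MAC:
		return "mac";
	case REFLINK_ATTR_ALL:
		return "all";
	default:
		return NULL;
	}
}
```

With this layout a caller could request DAC-only preservation without needing any MAC-related privilege, and anything other than the four listed values is easy to reject.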
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
2009-05-18 14:33 ` Stephen Smalley
@ 2009-05-18 17:15 ` Stephen Smalley
-1 siblings, 0 replies; 304+ messages in thread
From: Stephen Smalley @ 2009-05-18 17:15 UTC (permalink / raw)
To: Joel Becker
Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro,
mtk.manpages, linux-security-module, linux-fsdevel
On Mon, 2009-05-18 at 10:33 -0400, Stephen Smalley wrote:
> On Mon, 2009-05-18 at 09:02 -0400, Stephen Smalley wrote:
> > On Fri, 2009-05-15 at 13:53 -0700, Joel Becker wrote:
> > > On Fri, May 15, 2009 at 09:42:09AM -0700, Joel Becker wrote:
> > > > On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote:
> > > > > Consider a program that wants to always preserve attributes on the
> > > > > reflinks it creates. If the interface allows the program to explicitly
> > > > > request that behavior and returns an error when the request cannot be
> > > > > honored, then the program knows that upon a successful return, the
> > > > > attributes were in fact preserved. If the interface instead silently
> > > > > selects a behavior based on the current privileges of the process and
> > > > > gives no indication to the caller as to what behavior was selected, then
> > > > > the opportunity for error is great.
> > > >
> > > > I get that. I'm looking at what the programming interface is.
> > > > What's the standard function for "I want the fallback behavior" called?
> > > > What's the standard function for "I want preserve security" called?
> > > > "int reflink(oldpath, newpath)" has to pick one of the behaviors. Which
> > > > is it?
> > >
> > > Ok, I've been casting about how to solve the concern and provide
> > > a decent interface. I'm not about to give up on either. I think,
> > > though, that we do have to let the application signal its intent to the
> > > system. And if we're doing that, let's add a little flexibility.
> > > I think the interface will be this (ignoring the reflinkat(2)
> > > bit for now):
> > >
> > > int reflink(const char *oldpath, const char *newpath, int preserve);
> > >
> > > - Data and xattrs are reflinked always.
> > > > - 'preserve' is a bitfield describing which attributes to keep across the
> > > reflink:
> > > * REFLINK_ATTR_OWNER - Keeps uid/gid the same. Requires ownership or
> > > CAP_CHOWN.
> > > * REFLINK_ATTR_SECURITY - Keeps the security state (SELinux/SMACK/etc)
> > > the same. This requires REFLINK_ATTR_OWNER (the security state makes
> > > no sense if the ownership changes). If not set, the filesystem wipes
> > > all security.* xattrs and reinitializes with
> > > security_inode_init_security() just like a new file.
> > > * REFLINK_ATTR_MODE - Keeps the mode bits the same. Requires ownership
> > > or CAP_FOWNER.
> > > * REFLINK_ATTR_ACL - Keeps the ACLs the same. Requires
> > > REFLINK_ATTR_MODE, as ACLs have to get adjusted when the mode
> > > changes, and so you can't keep them the same if the mode wasn't
> > > preserved. If not set, the filesystem reinits the ACLs as for a new
> > > file.
> > > - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.
> > >
> > > That's all the relevant attributes. The timestamps behave as
> > > already described (ctime is now, mtime matches the source), which is the
> > > only sane behavior for this sort of thing.
> > > So, a copy program would reflink(source, target,
> > > REFLINK_ATTR_NONE), a snapshot program would reflink(source, target,
> > > REFLINK_ATTR_ALL), and someone wanting the fallback behavior can do it
> > > easily.
> > > In the kernel, security_inode_reflink() gets passed the preserve
> > > bits. It's responsible for determining whether REFLINK_ATTR_SECURITY is
> > > allowed (vfs_reflink() will already have asserted REFLINK_ATTR_OWNER).
> > > It may do other checks on the reflink and the preserve bits, that's up
> > > to the LSM.
> > > > For scripting, we add the '-p' and '-P' options to "ln -r":
> > >
> > > - ln -r == reflink(source, target, REFLINK_ATTR_NONE);
> > > - ln -r -P == reflink(source, target, REFLINK_ATTR_ALL);
> > > - ln -r -p == the fallback behavior. This is like cp(1), where "cp -p"
> > > is best-effort.
> > >
> > > Does this make everyone happy?
> >
> > For simplicity and robustness, I would only support the none or all
> > flags, i.e. preserve can be a simple bool. I don't think you really
> > want to deal with the individual flags, and I don't see a use case for
> > them.
>
> Or possibly only distinguish preserve-dac from preserve-mac, e.g.
> REFLINK_ATTR_NONE (preserve none),
> REFLINK_ATTR_DAC (preserve uid, gid, mode, and ACLs ala cp -p)
> REFLINK_ATTR_MAC (preserve MAC security label ala cp -c)
> REFLINK_ATTR_ALL (preserve all)
Even this distinction doesn't seem worthwhile and could get complicated,
e.g. security.capability is an alternative to using the setuid mode bit,
and thus logically would fall into the same class as the owner and mode.
I'd just limit reflink() to preserving none or all of the security
attributes.
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
2009-05-18 13:02 ` Stephen Smalley
@ 2009-05-18 18:26 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-18 18:26 UTC (permalink / raw)
To: Stephen Smalley
Cc: Andy Lutomirski, jmorris, linux-fsdevel, linux-security-module,
mtk.manpages, jim owens, ocfs2-devel, viro
On Mon, May 18, 2009 at 09:02:39AM -0400, Stephen Smalley wrote:
> For simplicity and robustness, I would only support the none or all
> flags, i.e. preserve can be a simple bool. I don't think you really
> want to deal with the individual flags, and I don't see a use case for
> them.
The simple use case I can think of is "I want a snapshot, but I
don't have rights to copy the MAC context". Or "I want to own it, but I
want to keep all the ACLs for other users".
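
As a rough sketch, those two use cases could be written as masks under the bitfield proposal quoted earlier in the thread, together with its two dependency rules (SECURITY requires OWNER, ACL requires MODE). The bit values, the -EINVAL return, and the names reflink_check_preserve(), snapshot_without_mac and own_keep_acls are assumptions for illustration only.

```c
#include <errno.h>

/* Hypothetical bit values; the proposal names the flags but not their numbers. */
#define REFLINK_ATTR_OWNER    0x1
#define REFLINK_ATTR_SECURITY 0x2
#define REFLINK_ATTR_MODE     0x4
#define REFLINK_ATTR_ACL      0x8
#define REFLINK_ATTR_NONE     0
#define REFLINK_ATTR_ALL      (~0)

/*
 * Hypothetical check of the dependency rules from the proposal.
 * Returns 0 if the mask is consistent, -EINVAL (an assumed error code)
 * otherwise.
 */
static int reflink_check_preserve(int preserve)
{
	/* The security state makes no sense if the ownership changes. */
	if ((preserve & REFLINK_ATTR_SECURITY) && !(preserve & REFLINK_ATTR_OWNER))
		return -EINVAL;
	/* ACLs cannot be kept if the mode was not preserved. */
	if ((preserve & REFLINK_ATTR_ACL) && !(preserve & REFLINK_ATTR_MODE))
		return -EINVAL;
	return 0;
}

/* "I want a snapshot, but I don't have rights to copy the MAC context." */
static const int snapshot_without_mac = REFLINK_ATTR_ALL & ~REFLINK_ATTR_SECURITY;

/* "I want to own it, but I want to keep all the ACLs for other users." */
static const int own_keep_acls = REFLINK_ATTR_MODE | REFLINK_ATTR_ACL;
```

Both masks pass the check, while asking for ACLs without the mode, or for the security state without ownership, would be refused.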
Basically, if I'm adding another int argument to reflinkat(2), I
wanted to consider the future. Maybe define it as 1 or 0, and leave the
use of the other bits for future possibilities? If we're lucky, of
course, we never need future changes.
Joel
--
"There is a country in Europe where multiple-choice tests are
illegal."
- Sigfried Hulzer
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
2009-05-18 18:26 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-19 16:32 ` Sage Weil
-1 siblings, 0 replies; 304+ messages in thread
From: Sage Weil @ 2009-05-19 16:32 UTC (permalink / raw)
To: Joel Becker
Cc: Stephen Smalley, Andy Lutomirski, jim owens, jmorris,
ocfs2-devel, viro, mtk.manpages, linux-security-module,
linux-fsdevel
Hi Joel,
This version (with whatever flag simplifications are deemed appropriate)
looks pretty good to me!
The only other thing I would like to see is a flag that makes copying the
xattrs optional. That's straying toward kitchen sink territory, but it
seems like a natural enough interface once you're cherry-picking what to
preserve in the reflink. (Since you can always remove unwanted xattrs
later, of course, it's certainly not a show-stopper.)
Thanks!
sage
^ permalink raw reply [flat|nested] 304+ messages in thread
* [Ocfs2-devel] [RFC] The reflink(2) system call v4.
2009-05-15 20:53 ` Joel Becker
` (2 preceding siblings ...)
(?)
@ 2009-05-19 19:20 ` Jonathan Corbet
2009-05-19 19:32 ` Joel Becker
-1 siblings, 1 reply; 304+ messages in thread
From: Jonathan Corbet @ 2009-05-19 19:20 UTC (permalink / raw)
To: ocfs2-devel
One tiny little thing that crossed my mind as I was looking at this...
> - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.
That, I think, could lead to unexpected results if different flags
(perhaps controlling different aspects of behavior altogether) are
added in the future. Might it make more sense for REFLINK_ATTR_ALL to
be something like 0xffff, with the current implementation insisting
that all other bits are zero? That would leave room for expansion of
the set of things covered by the "preserve all" semantics while,
simultaneously, allowing the addition of different types of flags
entirely.
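
A minimal sketch of that idea, assuming REFLINK_ATTR_ALL is 0xffff as suggested and that bits outside it are rejected for now. The function name reflink_check_reserved() and the -EINVAL return are hypothetical.

```c
#include <errno.h>

/* Value suggested above; the low 16 bits form the "preserve all" space. */
#define REFLINK_ATTR_ALL 0xffffu

/*
 * Hypothetical check: accept any combination of bits inside the
 * "preserve all" space and insist that all other bits are zero.
 * Returns 0 on success, -EINVAL (an assumed error code) otherwise.
 */
static int reflink_check_reserved(unsigned int preserve)
{
	if (preserve & ~REFLINK_ATTR_ALL)
		return -EINVAL;
	return 0;
}
```

New preserve attributes could later take unused bits below 0x10000 and be picked up by REFLINK_ATTR_ALL automatically, while a flag of a different kind would take a higher bit and fail cleanly on a kernel that does not know it.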
jon
^ permalink raw reply [flat|nested] 304+ messages in thread
* [Ocfs2-devel] [RFC] The reflink(2) system call v4.
2009-05-19 19:20 ` Jonathan Corbet
@ 2009-05-19 19:32 ` Joel Becker
2009-05-19 19:41 ` Jonathan Corbet
0 siblings, 1 reply; 304+ messages in thread
From: Joel Becker @ 2009-05-19 19:32 UTC (permalink / raw)
To: ocfs2-devel
On Tue, May 19, 2009 at 01:20:57PM -0600, Jonathan Corbet wrote:
> One tiny little thing that crossed my mind as I was looking at this...
>
> > - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.
>
> That, I think, could lead to unexpected results if different flags
> (perhaps controlling different aspects of behavior altogether) are
> added in the future. Might it make more sense for REFLINK_ATTR_ALL to
> be something like 0xffff, with the current implementation insisting
> that all other bits are zero? That would leave room for expansion of
> the set of things covered by the "preserve all" semantics while,
> simultaneously, allowing the addition of different types of flags
> entirely.
I considered that, but really a process specifying
REFLINK_ATTR_ALL wants a complete snapshot. So if we add things to our
inodes later, and then you have an old program asking for "a complete
snapshot", it won't get it. It'll get a partial snapshot, missing the
things we added later.
Conversely, a newer program that knows about the new things will
get an error on an older kernel when it asks for the complete snapshot.
You'll note I called this 'preserve', not 'flags'. It's not a
set of behavioral flags, it's a mask of attributes to preserve.
Joel
--
"Here's something to think about: How come you never see a headline
like ``Psychic Wins Lottery''?"
- Jay Leno
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
2009-05-19 19:32 ` Joel Becker
@ 2009-05-19 19:41 ` Jonathan Corbet
0 siblings, 0 replies; 304+ messages in thread
From: Jonathan Corbet @ 2009-05-19 19:41 UTC (permalink / raw)
To: Joel Becker
Cc: Stephen Smalley, Andy Lutomirski, jim owens, jmorris,
ocfs2-devel, viro, mtk.manpages, linux-security-module,
linux-fsdevel
On Tue, 19 May 2009 12:32:44 -0700
Joel Becker <Joel.Becker@oracle.com> wrote:
> I considered that, but really a process specifying
> REFLINK_ATTR_ALL wants a complete snapshot. So if we add things to our
> inodes later, and then you have an old program asking for "a complete
> snapshot", it won't get it. It'll get a partial snapshot, missing the
> things we added later.
> Conversely, a newer program that knows about the new things will
> get an error on an older kernel when it asks for the complete snapshot.
Yep, that's why I'd suggested carving out a set of bits rather larger
than the ones specified now. That would allow any future flags to be
included in the REFLINK_ATTR_ALL "space" if that seemed like the right
thing to do. It would be forward and backward compatible.
Anything added outside that bit range would, presumably, be a more
significant change which should not carry forward or backward
automatically.
> You'll note I called this 'preserve', not 'flags'. It's not a
> set of behavioral flags, it's a mask of attributes to preserve.
Understood, but that may not stop somebody else from trying to extend
the API in different directions in the future. It seems like a way to
make life easier for that person when the time comes.
Just a thought, anyway; not something I'd make a fuss about.
jon
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
2009-05-15 20:53 ` Joel Becker
` (3 preceding siblings ...)
(?)
@ 2009-05-19 19:33 ` Jonathan Corbet
2009-05-19 20:15 ` Jamie Lokier
-1 siblings, 1 reply; 304+ messages in thread
From: Jonathan Corbet @ 2009-05-19 19:33 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-security-module
One tiny little thing that crossed my mind as I was looking at this...
> - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.
That, I think, could lead to unexpected results if different flags
(perhaps controlling different aspects of behavior altogether) are
added in the future. Might it make more sense for REFLINK_ATTR_ALL to
be something like 0xffff, with the current implementation insisting
that all other bits are zero? That would leave room for expansion of
the set of things covered by the "preserve all" semantics while,
simultaneously, allowing the addition of different types of flags
entirely.
jon
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
2009-05-19 19:33 ` Jonathan Corbet
@ 2009-05-19 20:15 ` Jamie Lokier
0 siblings, 0 replies; 304+ messages in thread
From: Jamie Lokier @ 2009-05-19 20:15 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-fsdevel, linux-security-module
Jonathan Corbet wrote:
> One tiny little thing that crossed my mind as I was looking at this...
>
> > - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.
>
> That, I think, could lead to unexpected results if different flags
> (perhaps controlling different aspects of behavior altogether) are
> added in the future. Might it make more sense for REFLINK_ATTR_ALL to
> be something like 0xffff, with the current implementation insisting
> that all other bits are zero? That would leave room for expansion of
> the set of things covered by the "preserve all" semantics while,
> simultaneously, allowing the addition of different types of flags
> entirely.
I think it's far better if REFLINK_ATTR_ALL is simply its own 1-bit
flag, meaning exactly what you think it means: In the kernel, it sets
all the attribute flags.
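
A rough sketch of what that could look like, assuming four attribute bits plus a separate bit for "all". The bit values, REFLINK_ATTR_KNOWN, and reflink_expand_preserve() are hypothetical names for illustration.

```c
/* Hypothetical bit values; only the flag names come from the thread. */
#define REFLINK_ATTR_OWNER    0x01u
#define REFLINK_ATTR_SECURITY 0x02u
#define REFLINK_ATTR_MODE     0x04u
#define REFLINK_ATTR_ACL      0x08u
#define REFLINK_ATTR_ALL      0x10u	/* its own bit, not a mask */

/* Every attribute this (hypothetical) kernel knows how to preserve. */
#define REFLINK_ATTR_KNOWN (REFLINK_ATTR_OWNER | REFLINK_ATTR_SECURITY | \
			    REFLINK_ATTR_MODE | REFLINK_ATTR_ACL)

/*
 * Hypothetical kernel-side expansion: if the caller set the ALL bit,
 * replace the request with every attribute flag this kernel knows;
 * otherwise pass the caller's mask through unchanged.
 */
static unsigned int reflink_expand_preserve(unsigned int preserve)
{
	if (preserve & REFLINK_ATTR_ALL)
		return REFLINK_ATTR_KNOWN;
	return preserve;
}
```

Because the expansion happens in the kernel, an old binary that passes REFLINK_ATTR_ALL would pick up any attribute added later simply by running on a newer kernel.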
It's possible to choose a bit-mask now, but there's no particular
reason that 16 bits is the right size, and it's ugly if it turns out
you need a hack for a backward-compatible 17th attribute sometime.
(It can be done, it's just ugly).
(I'd also add REFLINK_ATTR_ATOMIC, because you might want the
attributes copied but don't care about atomicity, and some filesystems
might be able to do one without the other. I'm thinking of SMB/CIFS here.)
By the way, there is work going on towards a "selective stat()" call,
which takes a set of bits for which attributes are to be returned. Is
it worth converging on some common flags to select attributes?
-- Jamie
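Jamie's alternative could be sketched roughly as below. The individual
attribute bits (REFLINK_ATTR_OWNER, REFLINK_ATTR_MODE, REFLINK_ATTR_ACL),
their positions, and the helper reflink_expand_attrs() are all
hypothetical; only the idea of a single REFLINK_ATTR_ALL bit that the
kernel expands comes from the message above.

```c
#include <assert.h>

/* Hypothetical per-attribute bits; the thread does not define these. */
#define REFLINK_ATTR_OWNER (1u << 0)
#define REFLINK_ATTR_MODE  (1u << 1)
#define REFLINK_ATTR_ACL   (1u << 2)

/* A single request bit meaning "whatever attributes this kernel knows". */
#define REFLINK_ATTR_ALL   (1u << 31)

/* Everything the (imaginary) running kernel can preserve. */
#define REFLINK_ATTR_KNOWN \
	(REFLINK_ATTR_OWNER | REFLINK_ATTR_MODE | REFLINK_ATTR_ACL)

/*
 * Expand the ALL request bit into the concrete attribute bits.
 * Other bits pass through unchanged.
 */
static unsigned int reflink_expand_attrs(unsigned int preserve)
{
	if (preserve & REFLINK_ATTR_ALL) {
		preserve &= ~REFLINK_ATTR_ALL;
		preserve |= REFLINK_ATTR_KNOWN;
	}
	return preserve;
}
```

Adding a new attribute then only means extending REFLINK_ATTR_KNOWN;
callers that pass REFLINK_ATTR_ALL pick it up without recompiling.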
* [Ocfs2-devel] [RFC] The reflink(2) system call v4. - Question for suitability
2009-05-11 20:40 ` [Ocfs2-devel] " Joel Becker
` (5 preceding siblings ...)
(?)
@ 2009-05-25 7:44 ` Mihail Daskalov
2009-05-25 20:42 ` Joel Becker
-1 siblings, 1 reply; 304+ messages in thread
From: Mihail Daskalov @ 2009-05-25 7:44 UTC (permalink / raw)
To: ocfs2-devel
Hi Joel,
I would like to ask an additional question about the reflink semantics.
What I want to know is what the behaviour will be if you have a file
with pending modifications and you do a reflink of it. Will the reflink
call be atomic in terms of snapshotting?
The context in which I am asking this is the following:
Suppose that we have an Oracle database running in a file system that
supports reflink.
Will it be possible and safe to use reflink for backup purposes ? e.g.
1) execute "alter tablespace XXX begin backup" sql command
2) do a reflink of all datafiles to datafiles.reflinked (e.g. for X
in (datafiles) do ; reflink X X.reflinked ; done ).
3) execute "alter tablespace XXX end backup"
4) backup the datafiles.reflinked with some tool
?
I understand that Oracle datafile hot backup will not be affected even
if part of the pending modifications are accepted to the file, and some
are not - this is the way oracle hot backup is designed to work - so
this is not a really good example, but it probably illustrates what I
mean. There might be other cases that need some stability for the file.
Another related issue is which process would get an error if there is
insufficient space to store data for two different snapshots - will that
be the writer to the original file, or the reader of the snapshot? I
guess the writer will get ENOSPC. But I ask this question anyway...
Will this be configurable (e.g. through some attribute - e.g.
GUARANTEE_SPACE_FOR_ORIGINAL_FILE, or ALLOW_ERRORS_READING_SNAPSHOT) ?
Regards,
Mihail Daskalov
-----Original Message-----
From: ocfs2-devel-bounces@oss.oracle.com
[mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Joel Becker
Sent: Monday, May 11, 2009 11:40 PM
To: jim owens; jmorris at namei.org; ocfs2-devel at oss.oracle.com;
viro at zeniv.linux.org.uk; mtk.manpages at gmail.com;
linux-security-module at vger.kernel.org; linux-fsdevel at vger.kernel.org
Subject: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
On Thu, May 07, 2009 at 08:10:18PM -0700, Joel Becker wrote:
> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
> > You certainly did not address:
> >
> > - desire for one single system call to handle both
> > owner preservation and create with current owner.
>
> Nope, and I don't intend to. reflink() is a snapshotting call,
> not a kitchen sink.
I've been thinking about this all weekend. The current state
doesn't make me happy.
Now, what concerns me here is the interface to userspace. The
system call itself. I don't care if we implement it via one vfs_foo()
or 10 nor how many iops we end up with. We can and will modify those as
we find better ideas. But I want reflink(2) to have a semantic that is
easily understood and intuitive.
When I initially designed reflink(), I hadn't thought about the
ownership and permission implications of snapshotting. I was having too
much fun reflinking files around. In that iteration, anyone could
reflink a file. But a true snapshot needs ownership, permissions, acls,
and other security attributes (in all, I'm gonna call that the "security
context") as well. So I defined reflink() as such. This meant
requiring privileges, but lost some of the flexibility of the call. I
call that a loss.
What I'm not going to do is add optional behaviors to the system
call. It should be pretty obvious what it does, or we're doing it
wrong. The 'flags' field of reflinkat(2) is for AT_* flags.
When I decided on requiring privileges, I thought that degrading
without privileges was too confusing. I was wrong. I want reflink() to
fit into the pantheon of file system operations in a way that makes
sense alongside the others, and this isn't it.
Here's v4 of reflink(). If you have the privileges, you get the
full snapshot. If you don't, you must have read access, and then you
get the entire snapshot (data and extended attributes) except that the
security context is reinitialized. That's it. It fits with most of the
other ops, and it's a clean degradation.
I add a flag to iops->reflink() so that the filesystem knows what
to do with the security context. That's the only change visible outside
of vfs_reflink().
Security folks, check my work. Everyone else, let me know if
this satisfies.
Joel
From 1ebf4c2cf36d38b22de025b03753497466e18941 Mon Sep 17 00:00:00 2001
From: Joel Becker <joel.becker@oracle.com>
Date: Sat, 2 May 2009 22:48:59 -0700
Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.
The userspace-visible idea of the operation is:
int reflink(const char *oldpath, const char *newpath);
int reflinkat(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath, int flags);
The kernel only implements reflinkat(2). reflink(3) is a trivial
wrapper around reflinkat(2).
The reflink() system call creates reference-counted links. It creates
a new file that shares the data extents of the source file in a
copy-on-write fashion. Its calling semantics are identical to link(2)
and linkat(2). Once complete, programs see the new file as a completely
separate entry.
reflink() attempts to preserve ownership, permissions, and security
contexts in order to create a full snapshot. Preserving those
attributes requires ownership or CAP_CHOWN. A caller without those
privileges will see the security context of the new file initialized to
their default.
In the VFS, ->reflink() is an inode_operation with almost the same
arguments as ->link(); an additional argument tells the filesystem to
copy over or reinitialize the security context on the new file.
A new LSM hook, security_inode_reflink(), is added. None of the
existing LSM hooks appeared to fit.
XXX: Currently only adds the x86_32 linkage. The rest of the
architectures belong here too.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
Documentation/filesystems/reflink.txt | 165 +++++++++++++++++++++++++++++++++
Documentation/filesystems/vfs.txt | 4 +
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/kernel/syscall_table_32.S | 1 +
fs/namei.c | 113 ++++++++++++++++++++++
include/linux/fs.h | 2 +
include/linux/security.h | 16 +++
include/linux/syscalls.h | 2 +
security/capability.c | 6 +
security/security.c | 7 ++
10 files changed, 317 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/reflink.txt
diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
new file mode 100644
index 0000000..aa7380f
--- /dev/null
+++ b/Documentation/filesystems/reflink.txt
@@ -0,0 +1,165 @@
+reflink(2)
+==========
+
+
+INTRODUCTION
+------------
+
+A reflink is a reference-counted link. The reflink(2) operation is
+analogous to the link(2) operation, except that instead of two directory
+entries pointing to the same inode, there are two identical inodes
+pointing to the same data. Writes do not modify the shared data; they
+use copy-on-write (CoW). Thus, after the reflink has been created, the
+inodes can diverge without impacting each other.
+
+
+SYNOPSIS
+--------
+
+The reflink(2) call looks just like link(2):
+
+ int reflink(const char *oldpath, const char *newpath);
+
+The actual system call is reflinkat(2):
+
+ int reflinkat(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath, int flags);
+
+For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
+The reflink(2) call won't be implemented by the kernel, because it's a
+trivial wrapper around reflinkat(2).
+
+
+DESCRIPTION
+-----------
+
+One way of viewing reflink is to look at the level of sharing. A
+symbolic link does its sharing at the directory entry level; many names
+end up pointing at the same directory entry. Hard links are one step
+down. Multiple directory entries are sharing one inode. Reflinks are
+down one more level: multiple inodes share the same data extents.
+
+When you symlink a file, you can then access it via the symlink or the
+real directory entry, and for the most part they look identical. When
+accessing more than one name for a hard link, the object returned looks
+identical. Similarly, a newly created reflink is identical to its
+source in almost every way and can be treated as such. This includes
+ownership, permissions, security context, and data. The only things
+that are different are the inode number, the link count, and the ctime.
+
+A reflink is a snapshot of the source file at the time it is created.
+
+Once created, though, a reflink can be modified like any other normal
+file without affecting the source file. Changes to trivial fields like
+permissions, owner, or times are guaranteed not to trigger CoW of file
+data and will not return any error that wouldn't happen on a truly
+distinct file. Changes to the file's data will trigger CoW of the data
+affected - the actual CoW granularity is up to the filesystem, from
+exact bytes up to the entire file. ocfs2, for example, will copy out an
+entire extent or 1MB, whichever is smaller.
+
+Preserving the security context of the source file obviously requires
+the privilege to do so. Callers that do not own the source file and do
+not have CAP_CHOWN will get a new reflink with all non-security
+attributes preserved; the security context of the new reflink will be
+as a newly created file by that user.
+
+Partial reflinks are not allowed. The new inode will only appear in the
+directory structure after it is fully formed. This prevents a crash or
+lack of space from creating a partial reflink.
+
+If a filesystem does not support reflinks, the kernel and libc MUST NOT
+fake it. Callers are expecting to get snapshots, and faking it will
+violate that trust.
+
+The userspace view is as follows. When reflink(2) returns, opening
+oldpath and newpath returns identical-looking files, just like link(2).
+After that, oldpath and newpath behave as distinct files, and
+modifications to one have no impact on the other.
+
+
+RESTRICTIONS
+------------
+
+Just as the sharing gets lower as you move from symlink() -> link() ->
+reflink(), the restrictions on the call get tighter. A symlink doesn't
+require any access permissions other than being able to create its
+inode. It can cross filesystems and mount points, and it can point to
+any type of file. A hard link requires both source and target to be on
+the same filesystem under the same mount point, and that the source not
+be a directory. Like hard links and symlinks, a reflink cannot be
+created if newpath exists.
+
+Reflinks add one big restriction on top of hard links: only the owner
+or someone with elevated privileges (CAP_CHOWN) can preserve the
+security context (permissions, ownership, ACLs, etc) across a reflink.
+A reflink is a point-in-time snapshot of a file. Without the
+appropriate privilege, the caller will see their own default security
+context applied to the file.
+
+A caller without the privileges to preserve the security context must
+have read access to reflink a file.
+
+
+SHARING
+-------
+
+A reflink creates a new inode. It shares all data extents of the source
+file; this includes file data and extended attribute data. All of the
+sharing is in a CoW fashion, and any modification of the data will break
+the sharing.
+
+For some filesystems, certain data structures are not in allocated
+storage extents. Creating a reflink might make a copy of these extents.
+An example is ext3's ability to store small extended attributes inside
+the ext3 inode. Since a reflink is creating a new inode, those extended
+attributes are merely copied to the new inode.
+
+
+EXCEPTIONS
+----------
+
+All file attributes and extended attributes of the new file must be
+identical to the source file with the following exceptions:
+
+- The new file must have a new inode number. This allows POSIX
+ programs to treat the source and new files as separate objects. From
+ the view of the POSIX application, the files are distinct. The
+ sharing is invisible outside of the filesystem's internal structures.
+- The ctime of the source file only changes if the source's metadata
+ must be changed to accommodate the copy-on-write linkage. The ctime
+ of the new file is set to represent its creation.
+- The link count of the source file is unchanged, and the link count of
+ the new file is one.
+- If the caller lacks the privileges to preserve the security context,
+ the file will have its security context initialized as would any new
+ file.
+
+The mtime of the source file is unmodified, and the mtime of the new
+file is set identical to the source file. This reflects that the data
+is unchanged.
+
+
+INODE OPERATION
+---------------
+
+Filesystems implement the ->reflink() inode operation. It has almost
+the same prototype as ->link():
+
+ int (*reflink)(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *new_dentry, int preserve_security);
+
+When the filesystem is called, the VFS has already checked the
+permissions and mountpoint of the operation. It has determined whether
+the security context should be preserved or reinitialized, as specified
+by the preserve_security argument. The filesystem just needs to create
+the new inode identical to the old one with the exceptions noted above,
+link up the shared data extents, and then link the new inode into dir.
+
+
+FOLLOWING SYMBOLIC LINKS
+------------------------
+
+reflink() dereferences symbolic links in the same manner that link(2)
+does. The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..01cd810 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -333,6 +333,7 @@ struct inode_operations {
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
+ int (*reflink) (struct dentry *,struct inode *,struct dentry *,int);
};
Again, all methods are called without any locks being held, unless
@@ -431,6 +432,9 @@ otherwise noted.
truncate_range: a method provided by the underlying filesystem to truncate a
range of blocks , i.e. punch a hole somewhere in a file.
+ reflink: called by the reflink(2) system call. Only required if you want
+ to support reflinks. For further information, see
+ Documentation/filesystems/reflink.txt.
The Address Space Object
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..c368563 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
#define __NR_inotify_init1 332
#define __NR_preadv 333
#define __NR_pwritev 334
+#define __NR_reflinkat 335
#ifdef __KERNEL__
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..d11c200 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
.long sys_inotify_init1
.long sys_preadv
.long sys_pwritev
+ .long sys_reflinkat /* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..34a6ce5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,118 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
}
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
+{
+ struct inode *inode = old_dentry->d_inode;
+ int error;
+ int preserve_security = 1;
+
+ if (!inode)
+ return -ENOENT;
+
+ /*
+ * If the caller has the rights, reflink() will preserve the
+ * security context of the source inode.
+ */
+ if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
+ preserve_security = 0;
+ if ((current_fsuid() != inode->i_uid) &&
+ !in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
+ preserve_security = 0;
+
+ /*
+ * If the caller doesn't have the right to preserve the security
+ * context, the caller is only getting the data and extended
+ * attributes. They need read permission on the file.
+ */
+ if (!preserve_security) {
+ error = inode_permission(inode, MAY_READ);
+ if (error)
+ return error;
+ }
+
+ error = may_create(dir, new_dentry);
+ if (error)
+ return error;
+
+ if (dir->i_sb != inode->i_sb)
+ return -EXDEV;
+
+ /*
+ * A reflink to an append-only or immutable file cannot be created.
+ */
+ if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+ return -EPERM;
+ if (!dir->i_op->reflink)
+ return -EPERM;
+ if (S_ISDIR(inode->i_mode))
+ return -EPERM;
+
+ error = security_inode_reflink(old_dentry, dir);
+ if (error)
+ return error;
+
+ mutex_lock(&inode->i_mutex);
+ vfs_dq_init(dir);
+ error = dir->i_op->reflink(old_dentry, dir, new_dentry,
+ preserve_security);
+ mutex_unlock(&inode->i_mutex);
+ if (!error)
+ fsnotify_create(dir, new_dentry);
+ return error;
+}
+
+SYSCALL_DEFINE5(reflinkat, int, olddfd, const char __user *, oldname,
+ int, newdfd, const char __user *, newname, int, flags)
+{
+ struct dentry *new_dentry;
+ struct nameidata nd;
+ struct path old_path;
+ int error;
+ char *to;
+
+ if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+ return -EINVAL;
+
+ error = user_path_at(olddfd, oldname,
+ flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+ &old_path);
+ if (error)
+ return error;
+
+ error = user_path_parent(newdfd, newname, &nd, &to);
+ if (error)
+ goto out;
+ error = -EXDEV;
+ if (old_path.mnt != nd.path.mnt)
+ goto out_release;
+ new_dentry = lookup_create(&nd, 0);
+ error = PTR_ERR(new_dentry);
+ if (IS_ERR(new_dentry))
+ goto out_unlock;
+ error = mnt_want_write(nd.path.mnt);
+ if (error)
+ goto out_dput;
+ error = security_path_link(old_path.dentry, &nd.path, new_dentry);
+ if (error)
+ goto out_drop_write;
+ error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode, new_dentry);
+out_drop_write:
+ mnt_drop_write(nd.path.mnt);
+out_dput:
+ dput(new_dentry);
+out_unlock:
+ mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+ path_put(&nd.path);
+ putname(to);
+out:
+ path_put(&old_path);
+
+ return error;
+}
+
+
/*
* The worst of all namespace operations - renaming directory. "Perverted"
* doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +3002,7 @@ EXPORT_SYMBOL(unlock_rename);
EXPORT_SYMBOL(vfs_create);
EXPORT_SYMBOL(vfs_follow_link);
EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
EXPORT_SYMBOL(vfs_mkdir);
EXPORT_SYMBOL(vfs_mknod);
EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..0a5c807 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *);
/*
* VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
loff_t len);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
+ int (*reflink) (struct dentry *,struct inode *,struct dentry *,int);
};
struct seq_file;
diff --git a/include/linux/security.h b/include/linux/security.h
index d5fd616..ea9cd93 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -528,6 +528,14 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
* @inode contains a pointer to the inode.
* @secid contains a pointer to the location where result will be saved.
* In case of failure, @secid will be set to zero.
+ * @inode_reflink:
+ * Check permission before creating a new reference-counted link to
+ * a file.
+ * @old_dentry contains the dentry structure for an existing link to
+ * the file.
+ * @dir contains the inode structure of the parent directory of the
+ * new reflink.
+ * Return 0 if permission is granted.
*
* Security hooks for file operations
*
@@ -1415,6 +1423,7 @@ struct security_operations {
int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
int (*inode_symlink) (struct inode *dir,
struct dentry *dentry, const char *old_name);
+ int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir);
int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
@@ -1675,6 +1684,7 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
int security_inode_unlink(struct inode *dir, struct dentry *dentry);
int security_inode_symlink(struct inode *dir, struct dentry *dentry,
const char *old_name);
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir);
int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
@@ -2056,6 +2066,12 @@ static inline int security_inode_symlink(struct inode *dir,
return 0;
}
+static inline int security_inode_reflink(struct dentry *old_dentry,
+ struct inode *dir)
+{
+ return 0;
+}
+
static inline int security_inode_mkdir(struct inode *dir,
struct dentry *dentry,
int mode)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..35a8743 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,8 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
int newdfd, const char __user * newname);
asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
int newdfd, const char __user *newname, int flags);
+asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
+ int newdfd, const char __user *newname, int flags);
asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
int newdfd, const char __user * newname);
asmlinkage long sys_futimesat(int dfd, char __user *filename,
diff --git a/security/capability.c b/security/capability.c
index 21b6cea..3dcc4cc 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -172,6 +172,11 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
return 0;
}
+static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode)
+{
+ return 0;
+}
+
static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
int mask)
{
@@ -905,6 +910,7 @@ void security_fixup_ops(struct security_operations *ops)
set_to_cap_if_null(ops, inode_link);
set_to_cap_if_null(ops, inode_unlink);
set_to_cap_if_null(ops, inode_symlink);
+ set_to_cap_if_null(ops, inode_reflink);
set_to_cap_if_null(ops, inode_mkdir);
set_to_cap_if_null(ops, inode_rmdir);
set_to_cap_if_null(ops, inode_mknod);
diff --git a/security/security.c b/security/security.c
index 5284255..70d0ac3 100644
--- a/security/security.c
+++ b/security/security.c
@@ -470,6 +470,13 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
return security_ops->inode_symlink(dir, dentry, old_name);
}
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir)
+{
+ if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+ return 0;
+ return security_ops->inode_reflink(old_dentry, dir);
+}
+
int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
if (unlikely(IS_PRIVATE(dir)))
--
1.6.1.3
--
"Three o'clock is always too late or too early for anything you
want to do."
- Jean-Paul Sartre
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127
* [Ocfs2-devel] [RFC] The reflink(2) system call v4. - Question for suitability
2009-05-25 7:44 ` [Ocfs2-devel] [RFC] The reflink(2) system call v4. - Question for suitability Mihail Daskalov
@ 2009-05-25 20:42 ` Joel Becker
0 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-25 20:42 UTC (permalink / raw)
To: ocfs2-devel
On Mon, May 25, 2009 at 10:44:11AM +0300, Mihail Daskalov wrote:
> I would like to ask an additional question about the reflink semantics.
> What I want to know is what the behaviour will be if you have a file
> with pending modifications and you do a reflink of it. Will the reflink
> call be atomic in terms of snapshotting?
A reflink is atomic - it's a snapshot. Now, how that interacts
with a userspace process is up to that process.
>
> The context I am asking this is the following:
> Suppose that we have an Oracle database running in a file system that
> supports reflink.
> Will it be possible and safe to use reflink for backup purposes ? e.g.
> 1) execute "alter tablespace XXX begin backup" sql command
> 2) do a reflink of all datafiles to datafiles.reflinked (e.g. for X
> in (datafiles) do ; reflink X X.reflinked ; done ).
> 3) execute "alter tablespace XXX end backup"
> 4) backup the datafiles.reflinked with some tool
> ?
That will work, absolutely.
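As a rough illustration, step 2 of the quoted procedure could be
sketched in C as below. reflink(2) exists only as the proposal in this
thread, so the sketch takes the call as a function pointer. The names
reflink_fn, snapshot_datafiles(), reflink_dry_run(), and NEWPATH_MAX are
hypothetical, and the dry-run stand-in only prints what it would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NEWPATH_MAX 4096	/* assumed upper bound on a path name */

/* Hypothetical stand-in type for the proposed reflink(old, new) call. */
typedef int (*reflink_fn)(const char *oldpath, const char *newpath);

/*
 * Reflink every datafile X to "X.reflinked". Stops at the first
 * failure and returns how many files were snapshotted.
 */
static size_t snapshot_datafiles(const char *const *files, size_t nfiles,
				 reflink_fn do_reflink)
{
	char newpath[NEWPATH_MAX];
	size_t i;

	assert(do_reflink != NULL);
	for (i = 0; i < nfiles; i++) {
		int len = snprintf(newpath, sizeof(newpath),
				   "%s.reflinked", files[i]);

		if (len < 0 || (size_t)len >= sizeof(newpath))
			break;		/* target name would be truncated */
		if (do_reflink(files[i], newpath) != 0)
			break;		/* caller inspects errno */
	}
	return i;
}

/* Dry-run stand-in: reports the operation instead of performing it. */
static int dry_run_calls;

static int reflink_dry_run(const char *oldpath, const char *newpath)
{
	printf("reflink %s -> %s\n", oldpath, newpath);
	dry_run_calls++;
	return 0;
}
```

The database-side steps (begin backup, end backup) stay outside this
loop; a return value smaller than nfiles means the backup set is
incomplete and should not be used.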
> I understand that Oracle datafile hot backup will not be affected even
> if part of the pending modifications are accepted to the file, and some
> are not - this is the way oracle hot backup is designed to work - so
> this is not a really good example, but it probably illustrates what I
> mean. There might be other cases that need some stability for the file.
As long as the software can guarantee a state for the duration
of the reflink call, they'll get their snapshot. The reflink op
actually locks out the file, makes the snap, then unlocks it. But from
userspace you can only see the duration of the system call.
> Another related issue is which process would get an error if there is
> insufficient space to store data for two different snapshots - will that
> be the writer to the original file, or the reader of the snapshot? I
> guess the writer will get ENOSPC. But I ask this question anyway...
> Will this be configurable (e.g. through some attribute - e.g.
> GUARANTEE_SPACE_FOR_ORIGINAL_FILE, or ALLOW_ERRORS_READING_SNAPSHOT) ?
Hmm? If there isn't enough space to create the new reflink
(the shallow metadata thereof), the snapper gets ENOSPC. Once a
reflink is created, no reader should ever get an error that wouldn't
happen outside of a reflink. The data is always there.
If someone writes to a snapshot, and CoW ensues, that CoW can,
indeed, get ENOSPC. But that's the normal behavior of CoW, and
consistent with all other implementors. The only way to guarantee space
would be to copy all the data, and then you don't have CoW, do you?
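The space behaviour described above can be modelled with a toy
refcounted block pool. This is only a sketch under simplifying
assumptions: files are one block long, reflink metadata costs nothing,
and every name (toy_file, toy_reflink(), toy_write(), NBLOCKS) is
invented for the illustration rather than taken from ocfs2 or the patch.

```c
#include <assert.h>
#include <errno.h>

#define NBLOCKS 4		/* toy volume size, in blocks */

static int refcount[NBLOCKS];	/* 0 = free, N = shared by N files */

struct toy_file {
	int block;		/* the single data block of this file */
};

/* Find a free block, or fail the way a full filesystem would. */
static int alloc_block(void)
{
	for (int i = 0; i < NBLOCKS; i++) {
		if (refcount[i] == 0) {
			refcount[i] = 1;
			return i;
		}
	}
	return -ENOSPC;
}

static int toy_create(struct toy_file *f)
{
	int b = alloc_block();

	if (b < 0)
		return b;
	f->block = b;
	return 0;
}

/* Reflink: share the source block; no data block is allocated. */
static void toy_reflink(const struct toy_file *src, struct toy_file *dst)
{
	assert(src->block >= 0 && src->block < NBLOCKS);
	dst->block = src->block;
	refcount[src->block]++;
}

/* Write: a shared block must be copied out first, which can fail. */
static int toy_write(struct toy_file *f)
{
	if (refcount[f->block] > 1) {
		int nb = alloc_block();

		if (nb < 0)
			return nb;	/* writer sees -ENOSPC */
		refcount[f->block]--;
		f->block = nb;
	}
	return 0;
}
```

In this model a reflink on a full volume still succeeds and both files
stay readable; only the later write that needs a private copy fails.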
Joel
--
"Under capitalism, man exploits man. Under Communism, it's just
the opposite."
- John Kenneth Galbraith
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127
* [Ocfs2-devel] [RFC] The reflink(2) system call v5.
2009-05-11 20:40 ` [Ocfs2-devel] " Joel Becker
` (6 preceding siblings ...)
(?)
@ 2009-05-28 0:24 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-05-28 0:24 UTC (permalink / raw)
To: ocfs2-devel
Here's v5 of reflink(). It adds a 'preserve' argument to the
call. This argument may currently be one of REFLINK_ATTR_PRESERVE and
REFLINK_ATTR_NONE. _ATTR_PRESERVE takes a full snapshot, and fails if
the caller lacks the privileges. _ATTR_NONE links up the data extents
(data and xattrs) in a CoW fashion, but otherwise initializes the new
inode as a new file (new security state, acls, ownership, etc). I took
everyone's advice and dropped attribute-specific flags for a single
_ATTR_PRESERVE.
Inside the kernel, the iop and security op get 'bool preserve'
to tell them what to do.
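To make the v5 semantics concrete, here is a small userspace model of
the decision, written as a sketch rather than kernel code. The numeric
values of REFLINK_ATTR_NONE and REFLINK_ATTR_PRESERVE, struct caller,
reflink_may_proceed(), and the use of -EACCES for a missing read
permission are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed values; this message does not state them. */
#define REFLINK_ATTR_NONE     0
#define REFLINK_ATTR_PRESERVE 1

/* Hypothetical summary of what the caller is allowed to do. */
struct caller {
	bool owns_source;	/* fsuid matches the source inode */
	bool has_cap_chown;	/* CAP_CHOWN is held */
	bool can_read_source;	/* read permission on the source */
};

/*
 * Returns 0 and sets *preserve_out (the 'bool preserve' handed to the
 * iop and security op) or a negative errno if the call must fail.
 */
static int reflink_may_proceed(const struct caller *c, int preserve,
			       bool *preserve_out)
{
	switch (preserve) {
	case REFLINK_ATTR_PRESERVE:
		/* Full snapshot: fail rather than degrade. */
		if (!c->owns_source && !c->has_cap_chown)
			return -EPERM;
		*preserve_out = true;
		return 0;
	case REFLINK_ATTR_NONE:
		/* Data-only reflink: read access is required. */
		if (!c->can_read_source)
			return -EACCES;
		*preserve_out = false;
		return 0;
	default:
		return -EINVAL;
	}
}
```

The difference from v4 is visible in the first case: an unprivileged
REFLINK_ATTR_PRESERVE request is refused instead of being quietly
turned into a reflink with a fresh security state.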
Joel
From d3c4ed0cb3f5af75f2adf92346e7a3f23870cd16 Mon Sep 17 00:00:00 2001
From: Joel Becker <joel.becker@oracle.com>
Date: Sat, 2 May 2009 22:48:59 -0700
Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.
The userspace-visible idea of the operation is:
int reflink(const char *oldpath, const char *newpath, int preserve);
int reflinkat(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath,
int preserve, int flags);
The kernel only implements reflinkat(2). reflink(3) is a trivial
wrapper around reflinkat(2).
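A sketch of what that wrapper might look like, modelled on how
sys_link() forwards to sys_linkat() with AT_FDCWD. No libc provides
reflinkat(), so the function of that name below is a stand-in stub that
fails with ENOSYS; it is an assumption for the example, not the real
system call.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>	/* AT_FDCWD */

#ifndef AT_FDCWD
#define AT_FDCWD -100	/* Linux value, used if <fcntl.h> hides it */
#endif

/*
 * Stand-in for the proposed system call. A real wrapper would invoke
 * the kernel here; this stub only reports that nothing implements it.
 */
static int reflinkat(int olddirfd, const char *oldpath,
		     int newdirfd, const char *newpath,
		     int preserve, int flags)
{
	(void)olddirfd;
	(void)oldpath;
	(void)newdirfd;
	(void)newpath;
	(void)preserve;
	(void)flags;
	errno = ENOSYS;
	return -1;
}

/* reflink(3): both paths relative to the cwd, no AT_* flags. */
static int reflink(const char *oldpath, const char *newpath, int preserve)
{
	return reflinkat(AT_FDCWD, oldpath, AT_FDCWD, newpath, preserve, 0);
}
```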
The reflink() system call creates reference-counted links. It creates
a new file that shares the data extents of the source file in a
copy-on-write fashion. Its calling semantics are identical to link(2)
and linkat(2). Once complete, programs see the new file as a completely
separate entry.
reflink() attempts to preserve ownership, permissions, and all other
security state in order to create a full snapshot. A caller requests
this by passing REFLINK_ATTR_PRESERVE as the 'preserve' argument.
Preserving those attributes requires ownership or CAP_CHOWN. A caller
without those privileges will get EPERM. An unprivileged caller can
specify REFLINK_ATTR_NONE. They will acquire the data extent sharing
but will see the file's security state and attributes initialized as a
new file. The unprivileged reflink requires read access.
In the VFS, ->reflink() is an inode_operation with almost the same
arguments as ->link(); an additional argument tells the filesystem to
copy over or reinitialize the security state on the new file.
A new LSM hook, security_inode_reflink(), is added. None of the
existing LSM hooks appeared to fit.
This only adds the x86 linkage. The trend appears to be for other
architectures to add their own linkage.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
Documentation/filesystems/reflink.txt | 174 +++++++++++++++++++++++++++++++++
Documentation/filesystems/vfs.txt | 4 +
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/include/asm/unistd_64.h | 2 +
arch/x86/kernel/syscall_table_32.S | 1 +
fs/namei.c | 124 +++++++++++++++++++++++
include/linux/fcntl.h | 8 ++
include/linux/fs.h | 2 +
include/linux/security.h | 23 +++++
include/linux/syscalls.h | 3 +
security/capability.c | 7 ++
security/security.c | 8 ++
13 files changed, 358 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/reflink.txt
diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
new file mode 100644
index 0000000..7effe33
--- /dev/null
+++ b/Documentation/filesystems/reflink.txt
@@ -0,0 +1,174 @@
+reflink(2)
+==========
+
+
+INTRODUCTION
+------------
+
+A reflink is a reference-counted link. The reflink(2) operation is
+analogous to the link(2) operation, except that instead of two directory
+entries pointing to the same inode, there are two identical inodes
+pointing to the same data. Writes do not modify the shared data; they
+use copy-on-write (CoW). Thus, after the reflink has been created, the
+inodes can diverge without impacting each other.
+
+
+SYNOPSIS
+--------
+
+The reflink(2) call looks almost like link(2):
+
+ int reflink(const char *oldpath, const char *newpath, int preserve);
+
+The actual system call is reflinkat(2):
+
+ int reflinkat(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath,
+ int preserve, int flags);
+
+For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
+The reflink(2) call won't be implemented by the kernel, because it's a
+trivial wrapper around reflinkat(2).
+
+
+DESCRIPTION
+-----------
+
+One way of viewing reflink is to look at the level of sharing. A
+symbolic link does its sharing at the directory entry level; many names
+end up pointing at the same directory entry. Hard links are one step
+down. Multiple directory entries are sharing one inode. Reflinks are
+down one more level: multiple inodes share the same data extents.
+
+When you symlink a file, you can then access it via the symlink or the
+real directory entry, and for the most part they look identical. When
+accessing more than one name for a hard link, the object returned looks
+identical. Similarly, a newly created reflink is identical to its
+source in almost every way and can be treated as such. This includes
+ownership, permissions, security state, and data. The only things
+that are different are the inode number, the link count, and the ctime.
+
+A reflink is a snapshot of the source file at the time it is created.
+
+Once created, though, a reflink can be modified like any other normal
+file without affecting the source file. Changes to trivial fields like
+permissions, owner, or times are guaranteed not to trigger CoW of file
+data and will not return any error that wouldn't happen on a truly
+distinct file. Changes to the file's data will trigger CoW of the data
+affected - the actual CoW granularity is up to the filesystem, from
+exact bytes up to the entire file. ocfs2, for example, will copy out an
+entire extent or 1MB, whichever is smaller.
+
+Preserving the security state of the source file obviously requires
+the privilege to do so. Because of this, the reflink(2) call has the
+preserve argument. If it is set to REFLINK_ATTR_PRESERVE, the security
+state and file attributes will match the source as described above.
+Callers that do not own the source file and do not have CAP_CHOWN will
+see reflink(2) fail with EPERM. If preserve is set to
+REFLINK_ATTR_NONE, the new reflink will still share all the data extents
+of the source file, including extended attributes. The security state
+and attributes of the new reflink will be as a newly created file by
+that user. With REFLINK_ATTR_NONE, the caller must have read access to
+the source file.
+
+Partial reflinks are not allowed. The new inode will only appear in the
+directory structure after it is fully formed. This prevents a crash or
+lack of space from creating a partial reflink.
+
+If a filesystem does not support reflinks, the kernel and libc MUST NOT
+fake it. Callers are expecting to get snapshots, and faking it will
+violate that trust.
+
+The userspace view is as follows. When reflink(2) returns, opening
+oldpath and newpath returns identical-looking files, just like link(2).
+After that, oldpath and newpath behave as distinct files, and
+modifications to one have no impact on the other.
+
+
+RESTRICTIONS
+------------
+
+Just as the sharing gets lower as you move from symlink() -> link() ->
+reflink(), the restrictions on the call get tighter. A symlink doesn't
+require any access permissions other than being able to create its
+inode. It can cross filesystems and mount points, and it can point to
+any type of file. A hard link requires both source and target to be on
+the same filesystem under the same mount point, and that the source not
+be a directory. A reflink tightens that to regular files only. Like
+hard links and symlinks, a reflink cannot be created if newpath exists.
+
+Reflinks add one big restriction on top of hard links: only the owner
+or someone with elevated privileges (CAP_CHOWN) can preserve the
+security state (permissions, ownership, ACLs, etc) across a reflink.
+A reflink is a point-in-time snapshot of a file. Without the
+appropriate privilege, the caller specifying REFLINK_ATTR_PRESERVE
+will receive EPERM.
+
+A caller specifying REFLINK_ATTR_NONE must have read access to reflink a
+file.
+
+
+SHARING
+-------
+
+A reflink creates a new inode. It shares all data extents of the source
+file; this includes file data and extended attribute data. All of the
+sharing is in a CoW fashion, and any modification of the data will break
+the sharing.
+
+For some filesystems, certain data structures are not in allocated
+storage extents. Creating a reflink might make a copy of these extents.
+An example is ext3's ability to store small extended attributes inside
+the ext3 inode. Since a reflink is creating a new inode, those extended
+attributes are merely copied to the new inode.
+
+
+EXCEPTIONS
+----------
+
+When REFLINK_ATTR_PRESERVE is specified, all file attributes and
+extended attributes of the new file must be identical to the source file
+with the following exceptions:
+
+- The new file must have a new inode number. This allows POSIX
+ programs to treat the source and new files as separate objects. From
+ the view of the POSIX application, the files are distinct. The
+ sharing is invisible outside of the filesystem's internal structures.
+- The ctime of the source file only changes if the source's metadata
+ must be changed to accommodate the copy-on-write linkage. The ctime
+ of the new file is set to represent its creation.
+- The link count of the source file is unchanged, and the link count of
+ the new file is one.
+
+The mtime of the source file is unmodified, and the mtime of the new
+file is set identical to the source file. This reflects that the data
+is unchanged.
+
+If REFLINK_ATTR_NONE is specified, all data extents will be reflinked,
+but file attributes and security state will be as any new file.
+
+
+INODE OPERATION
+---------------
+
+Filesystems implement the ->reflink() inode operation. It has almost
+the same prototype as ->link():
+
+ int (*reflink)(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *new_dentry, bool preserve);
+
+When the filesystem is called, the VFS has already checked the
+permissions and mountpoint of the operation. It has determined whether
+the file attributes and security state should be preserved or
+reinitialized, as specified by the preserve argument. The filesystem
+just needs to create the new inode identical to the old one with the
+exceptions noted above, link up the shared data extents, and then link
+the new inode into dir.
+
+
+FOLLOWING SYMBOLIC LINKS
+------------------------
+
+reflink() dereferences symbolic links in the same manner that link(2)
+does. The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..0620d73 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -333,6 +333,7 @@ struct inode_operations {
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
+ int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
};
Again, all methods are called without any locks being held, unless
@@ -431,6 +432,9 @@ otherwise noted.
truncate_range: a method provided by the underlying filesystem to truncate a
range of blocks , i.e. punch a hole somewhere in a file.
+ reflink: called by the reflink(2) system call. Only required if you want
+ to support reflinks. For further information, see
+ Documentation/filesystems/reflink.txt.
The Address Space Object
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a505202..ca832b4 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -830,4 +830,5 @@ ia32_sys_call_table:
.quad sys_inotify_init1
.quad compat_sys_preadv
.quad compat_sys_pwritev
+ .quad sys_reflinkat /* 335 */
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..c368563 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
#define __NR_inotify_init1 332
#define __NR_preadv 333
#define __NR_pwritev 334
+#define __NR_reflinkat 335
#ifdef __KERNEL__
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index f818294..b20f68c 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -657,6 +657,8 @@ __SYSCALL(__NR_inotify_init1, sys_inotify_init1)
__SYSCALL(__NR_preadv, sys_preadv)
#define __NR_pwritev 296
__SYSCALL(__NR_pwritev, sys_pwritev)
+#define __NR_reflinkat 297
+__SYSCALL(__NR_reflinkat, sys_reflinkat)
#ifndef __NO_STUBS
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..d11c200 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
.long sys_inotify_init1
.long sys_preadv
.long sys_pwritev
+ .long sys_reflinkat /* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..55f5c80 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,129 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
}
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *new_dentry, bool preserve)
+{
+ struct inode *inode = old_dentry->d_inode;
+ int error;
+
+ if (!inode)
+ return -ENOENT;
+
+ error = may_create(dir, new_dentry);
+ if (error)
+ return error;
+
+ if (dir->i_sb != inode->i_sb)
+ return -EXDEV;
+
+ /*
+ * A reflink to an append-only or immutable file cannot be created.
+ */
+ if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+ return -EPERM;
+ if (!dir->i_op->reflink)
+ return -EPERM;
+
+ /*
+ * Only regular files can be reflinked; if a user tries to
+ * reflink a block device, do they expect copy-on-write of the
+ * entire device?
+ */
+ if (!S_ISREG(inode->i_mode))
+ return -EPERM;
+
+ /*
+ * If the caller wants to preserve ownership, they require the
+ * rights to do so.
+ */
+ if (preserve) {
+ if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
+ return -EPERM;
+ if (!in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
+ return -EPERM;
+ }
+
+ error = security_inode_reflink(old_dentry, dir, preserve);
+ if (error)
+ return error;
+
+ /*
+ * If the caller is modifying any aspect of the attributes, they
+ * are not creating a snapshot. They need read permission on the
+ * file.
+ */
+ if (!preserve) {
+ error = inode_permission(inode, MAY_READ);
+ if (error)
+ return error;
+ }
+
+ mutex_lock(&inode->i_mutex);
+ vfs_dq_init(dir);
+ error = dir->i_op->reflink(old_dentry, dir, new_dentry, preserve);
+ mutex_unlock(&inode->i_mutex);
+ if (!error)
+ fsnotify_create(dir, new_dentry);
+ return error;
+}
+
+SYSCALL_DEFINE6(reflinkat, int, olddfd, const char __user *, oldname,
+ int, newdfd, const char __user *, newname, int, preserve,
+ int, flags)
+{
+ struct dentry *new_dentry;
+ struct nameidata nd;
+ struct path old_path;
+ int error;
+ char *to;
+
+ if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+ return -EINVAL;
+
+ if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0)
+ return -EINVAL;
+
+ error = user_path_at(olddfd, oldname,
+ flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+ &old_path);
+ if (error)
+ return error;
+
+ error = user_path_parent(newdfd, newname, &nd, &to);
+ if (error)
+ goto out;
+ error = -EXDEV;
+ if (old_path.mnt != nd.path.mnt)
+ goto out_release;
+ new_dentry = lookup_create(&nd, 0);
+ error = PTR_ERR(new_dentry);
+ if (IS_ERR(new_dentry))
+ goto out_unlock;
+ error = mnt_want_write(nd.path.mnt);
+ if (error)
+ goto out_dput;
+ error = security_path_link(old_path.dentry, &nd.path, new_dentry);
+ if (error)
+ goto out_drop_write;
+ error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode,
+ new_dentry, preserve);
+out_drop_write:
+ mnt_drop_write(nd.path.mnt);
+out_dput:
+ dput(new_dentry);
+out_unlock:
+ mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+ path_put(&nd.path);
+ putname(to);
+out:
+ path_put(&old_path);
+
+ return error;
+}
+
+
/*
* The worst of all namespace operations - renaming directory. "Perverted"
* doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +3013,7 @@ EXPORT_SYMBOL(unlock_rename);
EXPORT_SYMBOL(vfs_create);
EXPORT_SYMBOL(vfs_follow_link);
EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
EXPORT_SYMBOL(vfs_mkdir);
EXPORT_SYMBOL(vfs_mknod);
EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 8603740..96dc2f0 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -40,6 +40,14 @@
unlinking file. */
#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */
+/*
+ * A reflink call may preserve the file's attributes in toto or not at
+ * all.
+ */
+#define REFLINK_ATTR_PRESERVE 0x00000001
+#define REFLINK_ATTR_NONE 0
+
+
#ifdef __KERNEL__
#ifndef force_o_largefile
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..c6f9cb0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *, bool);
/*
* VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
loff_t len);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
+ int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
};
struct seq_file;
diff --git a/include/linux/security.h b/include/linux/security.h
index d5fd616..2f1f520 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -528,6 +528,18 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
* @inode contains a pointer to the inode.
* @secid contains a pointer to the location where result will be saved.
* In case of failure, @secid will be set to zero.
+ * @inode_reflink:
+ * Check permission before creating a new reference-counted link to
+ * a file.
+ * @old_dentry contains the dentry structure for an existing link to
+ * the file.
+ * @dir contains the inode structure of the parent directory of the
+ * new reflink.
+ * @preserve specifies whether the caller wishes to preserve the
+ * file's attributes. If true, the caller wishes to clone the file's
+ * attributes exactly. If false, the caller expects to reflink the
+ * data extents but reset the attributes.
+ * Return 0 if permission is granted.
*
* Security hooks for file operations
*
@@ -1415,6 +1427,8 @@ struct security_operations {
int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
int (*inode_symlink) (struct inode *dir,
struct dentry *dentry, const char *old_name);
+ int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir,
+ bool preserve);
int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
@@ -1675,6 +1689,8 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
int security_inode_unlink(struct inode *dir, struct dentry *dentry);
int security_inode_symlink(struct inode *dir, struct dentry *dentry,
const char *old_name);
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+ bool preserve);
int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
@@ -2056,6 +2072,13 @@ static inline int security_inode_symlink(struct inode *dir,
return 0;
}
+static inline int security_inode_reflink(struct dentry *old_dentry,
+ struct inode *dir,
+ bool preserve)
+{
+ return 0;
+}
+
static inline int security_inode_mkdir(struct inode *dir,
struct dentry *dentry,
int mode)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..a11f228 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,9 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
int newdfd, const char __user * newname);
asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
int newdfd, const char __user *newname, int flags);
+asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
+ int newdfd, const char __user *newname,
+ int preserve, int flags);
asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
int newdfd, const char __user * newname);
asmlinkage long sys_futimesat(int dfd, char __user *filename,
diff --git a/security/capability.c b/security/capability.c
index 21b6cea..8047b7c 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -172,6 +172,12 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
return 0;
}
+static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode,
+ bool preserve)
+{
+ return 0;
+}
+
static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
int mask)
{
@@ -905,6 +911,7 @@ void security_fixup_ops(struct security_operations *ops)
set_to_cap_if_null(ops, inode_link);
set_to_cap_if_null(ops, inode_unlink);
set_to_cap_if_null(ops, inode_symlink);
+ set_to_cap_if_null(ops, inode_reflink);
set_to_cap_if_null(ops, inode_mkdir);
set_to_cap_if_null(ops, inode_rmdir);
set_to_cap_if_null(ops, inode_mknod);
diff --git a/security/security.c b/security/security.c
index 5284255..e2b12f9 100644
--- a/security/security.c
+++ b/security/security.c
@@ -470,6 +470,14 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
return security_ops->inode_symlink(dir, dentry, old_name);
}
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+ bool preserve)
+{
+ if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+ return 0;
+ return security_ops->inode_reflink(old_dentry, dir, preserve);
+}
+
int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
if (unlikely(IS_PRIVATE(dir)))
--
1.6.3
--
"Anything that is too stupid to be spoken is sung."
- Voltaire
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127
* [RFC] The reflink(2) system call v5.
From: Joel Becker @ 2009-05-28 0:24 UTC (permalink / raw)
To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
Here's v5 of reflink(). It adds a 'preserve' argument to the
call. This argument may currently be one of REFLINK_ATTR_PRESERVE and
REFLINK_ATTR_NONE. _ATTR_PRESERVE takes a full snapshot, and fails if
the caller lacks the privileges. _ATTR_NONE links up the data extents
(data and xattrs) in a CoW fashion, but otherwise initializes the new
inode as a new file (new security state, acls, ownership, etc). I took
everyone's advice and dropped attribute-specific flags for a single
_ATTR_PRESERVE.
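As a rough userspace model of that split (not kernel code), the privilege rules can be sketched as below. struct caller and reflink_permission() are hypothetical names invented for this illustration. The checks mirror what vfs_reflink() does in the patch; returning -EACCES when read access is missing is an assumption based on what inode_permission() conventionally returns.

```c
#include <errno.h>
#include <stdbool.h>

/* Flag values as defined by this patch in include/linux/fcntl.h. */
#define REFLINK_ATTR_NONE     0
#define REFLINK_ATTR_PRESERVE 0x00000001

/* Hypothetical, simplified stand-in for the caller's credentials. */
struct caller {
	unsigned int fsuid;	/* models current_fsuid() */
	bool in_file_group;	/* models in_group_p(inode->i_gid) */
	bool cap_chown;		/* models capable(CAP_CHOWN) */
	bool may_read;		/* models inode_permission(inode, MAY_READ) == 0 */
};

/*
 * Returns 0 if the modelled caller may reflink a file owned by file_uid,
 * or a negative errno otherwise.
 */
static int reflink_permission(const struct caller *c, unsigned int file_uid,
			      int preserve)
{
	if (preserve == REFLINK_ATTR_PRESERVE) {
		/* A full snapshot needs ownership or CAP_CHOWN. */
		if (c->fsuid != file_uid && !c->cap_chown)
			return -EPERM;
		if (!c->in_file_group && !c->cap_chown)
			return -EPERM;
		return 0;
	}

	/* REFLINK_ATTR_NONE: only read access to the source is needed. */
	if (!c->may_read)
		return -EACCES;	/* assumed errno, see the note above */
	return 0;
}
```

In this model a non-owner without CAP_CHOWN is refused a preserving reflink but can still get the shared data extents with REFLINK_ATTR_NONE, provided the source is readable.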
Inside the kernel, the iop and security op get 'bool preserve'
to tell them what to do.
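A minimal sketch of how the userspace 'int preserve' could be validated and turned into that bool, modelled on the checks at the top of the reflinkat() entry point in the patch. reflinkat_check_args() is a hypothetical helper name; the patch performs these checks inline. The fallback value for AT_SYMLINK_FOLLOW is the one shown in include/linux/fcntl.h.

```c
#include <errno.h>
#include <stdbool.h>

/* Flag values as defined by this patch in include/linux/fcntl.h. */
#define REFLINK_ATTR_NONE     0
#define REFLINK_ATTR_PRESERVE 0x00000001

#ifndef AT_SYMLINK_FOLLOW
#define AT_SYMLINK_FOLLOW 0x400	/* Follow symbolic links. */
#endif

/*
 * Reject unknown bits in either argument, then reduce 'preserve' to the
 * bool that the inode operation and the security hook receive.
 * Returns 0 on success or -EINVAL; *preserve_out is only written on success.
 */
static int reflinkat_check_args(int preserve, int flags, bool *preserve_out)
{
	if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
		return -EINVAL;

	if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0)
		return -EINVAL;

	*preserve_out = (preserve & REFLINK_ATTR_PRESERVE) != 0;
	return 0;
}
```

Keeping 'preserve' separate from 'flags' leaves the flags argument for AT_* values only, as with linkat(2).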
Joel
From d3c4ed0cb3f5af75f2adf92346e7a3f23870cd16 Mon Sep 17 00:00:00 2001
From: Joel Becker <joel.becker@oracle.com>
Date: Sat, 2 May 2009 22:48:59 -0700
Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.
The userspace-visible idea of the operation is:
int reflink(const char *oldpath, const char *newpath, int preserve);
int reflinkat(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath,
int preserve, int flags);
The kernel only implements reflinkat(2). reflink(3) is a trivial
wrapper around reflinkat(2).
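The wrapper could look roughly like the sketch below, assuming it passes AT_FDCWD and no flags in the same way link(2) forwards to linkat(2). Because reflinkat(2) exists only in this patch, the sketch takes the system call as a function pointer; reflink_with(), reflinkat_fn and the recording stub are hypothetical names used only for illustration.

```c
#include <fcntl.h>	/* AT_FDCWD */

#ifndef AT_FDCWD
#define AT_FDCWD -100	/* fallback for strict build environments */
#endif

/* Shape of reflinkat(2) as proposed in this patch. */
typedef int (*reflinkat_fn)(int olddirfd, const char *oldpath,
			    int newdirfd, const char *newpath,
			    int preserve, int flags);

/* reflink() as a trivial wrapper: both paths relative to the cwd, no flags. */
static int reflink_with(reflinkat_fn do_reflinkat, const char *oldpath,
			const char *newpath, int preserve)
{
	return do_reflinkat(AT_FDCWD, oldpath, AT_FDCWD, newpath, preserve, 0);
}

/* Stand-in for the real system call: records its arguments and succeeds. */
static struct {
	int olddirfd, newdirfd, preserve, flags;
} last_call;

static int stub_reflinkat(int olddirfd, const char *oldpath,
			  int newdirfd, const char *newpath,
			  int preserve, int flags)
{
	(void)oldpath;
	(void)newpath;
	last_call.olddirfd = olddirfd;
	last_call.newdirfd = newdirfd;
	last_call.preserve = preserve;
	last_call.flags = flags;
	return 0;
}
```

A real libc wrapper would invoke the system call directly instead of going through a function pointer; the indirection here only keeps the sketch testable on kernels without reflinkat(2).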
The reflink() system call creates reference-counted links. It creates
a new file that shares the data extents of the source file in a
copy-on-write fashion. Its calling semantics are identical to link(2)
and linkat(2). Once complete, programs see the new file as a completely
separate entry.
reflink() attempts to preserve ownership, permissions, and all other
security state in order to create a full snapshot. A caller requests
this by passing REFLINK_ATTR_PRESERVE as the 'preserve' argument.
Preserving those attributes requires ownership or CAP_CHOWN. A caller
without those privileges will get EPERM. An unprivileged caller can
specify REFLINK_ATTR_NONE. They will get the data extent sharing
but will see the file's security state and attributes initialized as a
new file. The unprivileged reflink requires read access.
In the VFS, ->reflink() is an inode_operation with almost the same
arguments as ->link(); an additional argument tells the filesystem to
copy over or reinitialize the security state on the new file.
A new LSM hook, security_inode_reflink(), is added. None of the
existing LSM hooks appeared to fit.
This only adds the x86 linkage. The trend appears to be for other
architectures to add their own linkage.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
Documentation/filesystems/reflink.txt | 174 +++++++++++++++++++++++++++++++++
Documentation/filesystems/vfs.txt | 4 +
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/include/asm/unistd_64.h | 2 +
arch/x86/kernel/syscall_table_32.S | 1 +
fs/namei.c | 124 +++++++++++++++++++++++
include/linux/fcntl.h | 8 ++
include/linux/fs.h | 2 +
include/linux/security.h | 23 +++++
include/linux/syscalls.h | 3 +
security/capability.c | 7 ++
security/security.c | 8 ++
13 files changed, 358 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/reflink.txt
diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
new file mode 100644
index 0000000..7effe33
--- /dev/null
+++ b/Documentation/filesystems/reflink.txt
@@ -0,0 +1,174 @@
+reflink(2)
+==========
+
+
+INTRODUCTION
+------------
+
+A reflink is a reference-counted link. The reflink(2) operation is
+analogous to the link(2) operation, except that instead of two directory
+entries pointing to the same inode, there are two identical inodes
+pointing to the same data. Writes do not modify the shared data; they
+use copy-on-write (CoW). Thus, after the reflink has been created, the
+inodes can diverge without impacting each other.
+
+
+SYNOPSIS
+--------
+
+The reflink(2) call looks almost like link(2):
+
+ int reflink(const char *oldpath, const char *newpath, int preserve);
+
+The actual system call is reflinkat(2):
+
+ int reflinkat(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath,
+ int preserve, int flags);
+
+For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
+The reflink(2) call won't be implemented by the kernel, because it's a
+trivial wrapper around reflinkat(2).
+
+
+DESCRIPTION
+-----------
+
+One way of viewing reflink is to look at the level of sharing. A
+symbolic link does its sharing at the directory entry level; many names
+end up pointing at the same directory entry. Hard links are one step
+down. Multiple directory entries are sharing one inode. Reflinks are
+down one more level: multiple inodes share the same data extents.
+
+When you symlink a file, you can then access it via the symlink or the
+real directory entry, and for the most part they look identical. When
+accessing more than one name for a hard link, the object returned looks
+identical. Similarly, a newly created reflink is identical to its
+source in almost every way and can be treated as such. This includes
+ownership, permissions, security state, and data. The only things
+that are different are the inode number, the link count, and the ctime.
+
+A reflink is a snapshot of the source file at the time it is created.
+
+Once created, though, a reflink can be modified like any other normal
+file without affecting the source file. Changes to trivial fields like
+permissions, owner, or times are guaranteed not to trigger CoW of file
+data and will not return any error that wouldn't happen on a truly
+distinct file. Changes to the file's data will trigger CoW of the data
+affected - the actual CoW granularity is up to the filesystem, from
+exact bytes up to the entire file. ocfs2, for example, will copy out an
+entire extent or 1MB, whichever is smaller.
+
+Preserving the security state of the source file obviously requires
+the privilege to do so. Because of this, the reflink(2) call has the
+preserve argument. If it is set to REFLINK_ATTR_PRESERVE, the security
+state and file attributes will match the source as described above.
+Callers that do not own the source file and do not have CAP_CHOWN will
+see reflink(2) fail with EPERM. If preserve is set to
+REFLINK_ATTR_NONE, the new reflink will still share all the data extents
+of the source file, including extended attributes. The security state
+and attributes of the new reflink will be as a newly created file by
+that user. With REFLINK_ATTR_NONE, the caller must have read access to
+the source file.
+
+Partial reflinks are not allowed. The new inode will only appear in the
+directory structure after it is fully formed. This prevents a crash or
+lack of space from creating a partial reflink.
+
+If a filesystem does not support reflinks, the kernel and libc MUST NOT
+fake it. Callers are expecting to get snapshots, and faking it will
+violate that trust.
+
+The userspace view is as follows. When reflink(2) returns, opening
+oldpath and newpath returns identical-looking files, just like link(2).
+After that, oldpath and newpath behave as distinct files, and
+modifications to one have no impact on the other.
+
+
+RESTRICTIONS
+------------
+
+Just as the sharing gets lower as you move from symlink() -> link() ->
+reflink(), the restrictions on the call get tighter. A symlink doesn't
+require any access permissions other than being able to create its
+inode. It can cross filesystems and mount points, and it can point to
+any type of file. A hard link requires both source and target to be on
+the same filesystem under the same mount point, and that the source not
+be a directory. A reflink tightens that to regular files only. Like
+hard links and symlinks, a reflink cannot be created if newpath exists.
+
+Reflinks add one big restriction on top of hard links: only the owner
+or someone with elevated privileges (CAP_CHOWN) can preserve the
+security state (permissions, ownership, ACLs, etc) across a reflink.
+A reflink is a point-in-time snapshot of a file. Without the
+appropriate privilege, the caller specifying REFLINK_ATTR_PRESERVE
+will receive EPERM.
+
+A caller specifying REFLINK_ATTR_NONE must have read access to reflink a
+file.
+
+
+SHARING
+-------
+
+A reflink creates a new inode. It shares all data extents of the source
+file; this includes file data and extended attribute data. All of the
+sharing is in a CoW fashion, and any modification of the data will break
+the sharing.
+
+For some filesystems, certain data structures are not in allocated
+storage extents. Creating a reflink might make a copy of these extents.
+An example is ext3's ability to store small extended attributes inside
+the ext3 inode. Since a reflink is creating a new inode, those extended
+attributes are merely copied to the new inode.
+
+
+EXCEPTIONS
+----------
+
+When REFLINK_ATTR_PRESERVE is specified, all file attributes and
+extended attributes of the new file must be identical to the source file
+with the following exceptions:
+
+- The new file must have a new inode number. This allows POSIX
+ programs to treat the source and new files as separate objects. From
+ the view of the POSIX application, the files are distinct. The
+ sharing is invisible outside of the filesystem's internal structures.
+- The ctime of the source file only changes if the source's metadata
+ must be changed to accommodate the copy-on-write linkage. The ctime
+ of the new file is set to represent its creation.
+- The link count of the source file is unchanged, and the link count of
+ the new file is one.
+
+The mtime of the source file is unmodified, and the mtime of the new
+file is set identical to the source file. This reflects that the data
+is unchanged.
+
+If REFLINK_ATTR_NONE is specified, all data extents will be reflinked,
+but file attributes and security state will be as any new file.
+
+
+INODE OPERATION
+---------------
+
+Filesystems implement the ->reflink() inode operation. It has almost
+the same prototype as ->link():
+
+ int (*reflink)(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *new_dentry, bool preserve);
+
+When the filesystem is called, the VFS has already checked the
+permissions and mountpoint of the operation. It has determined whether
+the file attributes and security state should be preserved or
+reinitialized, as specified by the preserve argument. The filesystem
+just needs to create the new inode identical to the old one with the
+exceptions noted above, link up the shared data extents, and then link
+the new inode into dir.
+
+
+FOLLOWING SYMBOLIC LINKS
+------------------------
+
+reflink() dereferences symbolic links in the same manner that link(2)
+does. The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..0620d73 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -333,6 +333,7 @@ struct inode_operations {
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
+ int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
};
Again, all methods are called without any locks being held, unless
@@ -431,6 +432,9 @@ otherwise noted.
truncate_range: a method provided by the underlying filesystem to truncate a
range of blocks , i.e. punch a hole somewhere in a file.
+ reflink: called by the reflink(2) system call. Only required if you want
+ to support reflinks. For further information, see
+ Documentation/filesystems/reflink.txt.
The Address Space Object
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a505202..ca832b4 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -830,4 +830,5 @@ ia32_sys_call_table:
.quad sys_inotify_init1
.quad compat_sys_preadv
.quad compat_sys_pwritev
+ .quad sys_reflinkat /* 335 */
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..c368563 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
#define __NR_inotify_init1 332
#define __NR_preadv 333
#define __NR_pwritev 334
+#define __NR_reflinkat 335
#ifdef __KERNEL__
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index f818294..b20f68c 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -657,6 +657,8 @@ __SYSCALL(__NR_inotify_init1, sys_inotify_init1)
__SYSCALL(__NR_preadv, sys_preadv)
#define __NR_pwritev 296
__SYSCALL(__NR_pwritev, sys_pwritev)
+#define __NR_reflinkat 297
+__SYSCALL(__NR_reflinkat, sys_reflinkat)
#ifndef __NO_STUBS
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..d11c200 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
.long sys_inotify_init1
.long sys_preadv
.long sys_pwritev
+ .long sys_reflinkat /* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..55f5c80 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,129 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
}
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *new_dentry, bool preserve)
+{
+ struct inode *inode = old_dentry->d_inode;
+ int error;
+
+ if (!inode)
+ return -ENOENT;
+
+ error = may_create(dir, new_dentry);
+ if (error)
+ return error;
+
+ if (dir->i_sb != inode->i_sb)
+ return -EXDEV;
+
+ /*
+ * A reflink to an append-only or immutable file cannot be created.
+ */
+ if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+ return -EPERM;
+ if (!dir->i_op->reflink)
+ return -EPERM;
+
+ /*
+ * Only regular files can be reflinked; if a user tries to
+ * reflink a block device, do they expect copy-on-write of the
+ * entire device?
+ */
+ if (!S_ISREG(inode->i_mode))
+ return -EPERM;
+
+ /*
+ * If the caller wants to preserve ownership, they require the
+ * rights to do so.
+ */
+ if (preserve) {
+ if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
+ return -EPERM;
+ if (!in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
+ return -EPERM;
+ }
+
+ error = security_inode_reflink(old_dentry, dir, preserve);
+ if (error)
+ return error;
+
+ /*
+ * If the caller is modifying any aspect of the attributes, they
+ * are not creating a snapshot. They need read permission on the
+ * file.
+ */
+ if (!preserve) {
+ error = inode_permission(inode, MAY_READ);
+ if (error)
+ return error;
+ }
+
+ mutex_lock(&inode->i_mutex);
+ vfs_dq_init(dir);
+ error = dir->i_op->reflink(old_dentry, dir, new_dentry, preserve);
+ mutex_unlock(&inode->i_mutex);
+ if (!error)
+ fsnotify_create(dir, new_dentry);
+ return error;
+}
+
+SYSCALL_DEFINE6(reflinkat, int, olddfd, const char __user *, oldname,
+ int, newdfd, const char __user *, newname, int, preserve,
+ int, flags)
+{
+ struct dentry *new_dentry;
+ struct nameidata nd;
+ struct path old_path;
+ int error;
+ char *to;
+
+ if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+ return -EINVAL;
+
+ if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0)
+ return -EINVAL;
+
+ error = user_path_at(olddfd, oldname,
+ flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+ &old_path);
+ if (error)
+ return error;
+
+ error = user_path_parent(newdfd, newname, &nd, &to);
+ if (error)
+ goto out;
+ error = -EXDEV;
+ if (old_path.mnt != nd.path.mnt)
+ goto out_release;
+ new_dentry = lookup_create(&nd, 0);
+ error = PTR_ERR(new_dentry);
+ if (IS_ERR(new_dentry))
+ goto out_unlock;
+ error = mnt_want_write(nd.path.mnt);
+ if (error)
+ goto out_dput;
+ error = security_path_link(old_path.dentry, &nd.path, new_dentry);
+ if (error)
+ goto out_drop_write;
+ error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode,
+ new_dentry, preserve);
+out_drop_write:
+ mnt_drop_write(nd.path.mnt);
+out_dput:
+ dput(new_dentry);
+out_unlock:
+ mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+ path_put(&nd.path);
+ putname(to);
+out:
+ path_put(&old_path);
+
+ return error;
+}
+
+
/*
* The worst of all namespace operations - renaming directory. "Perverted"
* doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +3013,7 @@ EXPORT_SYMBOL(unlock_rename);
EXPORT_SYMBOL(vfs_create);
EXPORT_SYMBOL(vfs_follow_link);
EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
EXPORT_SYMBOL(vfs_mkdir);
EXPORT_SYMBOL(vfs_mknod);
EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 8603740..96dc2f0 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -40,6 +40,14 @@
unlinking file. */
#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */
+/*
+ * A reflink call may preserve the file's attributes in toto or not at
+ * all.
+ */
+#define REFLINK_ATTR_PRESERVE 0x00000001
+#define REFLINK_ATTR_NONE 0
+
+
#ifdef __KERNEL__
#ifndef force_o_largefile
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..c6f9cb0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *, bool);
/*
* VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
loff_t len);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
+ int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
};
struct seq_file;
diff --git a/include/linux/security.h b/include/linux/security.h
index d5fd616..2f1f520 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -528,6 +528,18 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
* @inode contains a pointer to the inode.
* @secid contains a pointer to the location where result will be saved.
* In case of failure, @secid will be set to zero.
+ * @inode_reflink:
+ * Check permission before creating a new reference-counted link to
+ * a file.
+ * @old_dentry contains the dentry structure for an existing link to
+ * the file.
+ * @dir contains the inode structure of the parent directory of the
+ * new reflink.
+ * @preserve specifies whether the caller wishes to preserve the
+ * file's attributes. If true, the caller wishes to clone the file's
+ * attributes exactly. If false, the caller expects to reflink the
+ * data extents but reset the attributes.
+ * Return 0 if permission is granted.
*
* Security hooks for file operations
*
@@ -1415,6 +1427,8 @@ struct security_operations {
int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
int (*inode_symlink) (struct inode *dir,
struct dentry *dentry, const char *old_name);
+ int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir,
+ bool preserve);
int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
@@ -1675,6 +1689,8 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
int security_inode_unlink(struct inode *dir, struct dentry *dentry);
int security_inode_symlink(struct inode *dir, struct dentry *dentry,
const char *old_name);
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+ bool preserve);
int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
@@ -2056,6 +2072,13 @@ static inline int security_inode_symlink(struct inode *dir,
return 0;
}
+static inline int security_inode_reflink(struct dentry *old_dentry,
+ struct inode *dir,
+ bool preserve)
+{
+ return 0;
+}
+
static inline int security_inode_mkdir(struct inode *dir,
struct dentry *dentry,
int mode)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..a11f228 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,9 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
int newdfd, const char __user * newname);
asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
int newdfd, const char __user *newname, int flags);
+asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
+ int newdfd, const char __user *newname,
+ int preserve, int flags);
asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
int newdfd, const char __user * newname);
asmlinkage long sys_futimesat(int dfd, char __user *filename,
diff --git a/security/capability.c b/security/capability.c
index 21b6cea..8047b7c 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -172,6 +172,12 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
return 0;
}
+static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode,
+ bool preserve)
+{
+ return 0;
+}
+
static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
int mask)
{
@@ -905,6 +911,7 @@ void security_fixup_ops(struct security_operations *ops)
set_to_cap_if_null(ops, inode_link);
set_to_cap_if_null(ops, inode_unlink);
set_to_cap_if_null(ops, inode_symlink);
+ set_to_cap_if_null(ops, inode_reflink);
set_to_cap_if_null(ops, inode_mkdir);
set_to_cap_if_null(ops, inode_rmdir);
set_to_cap_if_null(ops, inode_mknod);
diff --git a/security/security.c b/security/security.c
index 5284255..e2b12f9 100644
--- a/security/security.c
+++ b/security/security.c
@@ -470,6 +470,14 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
return security_ops->inode_symlink(dir, dentry, old_name);
}
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+ bool preserve)
+{
+ if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+ return 0;
+ return security_ops->inode_reflink(old_dentry, dir, preserve);
+}
+
int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
if (unlikely(IS_PRIVATE(dir)))
--
1.6.3
--
"Anything that is too stupid to be spoken is sung."
- Voltaire
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
* [RFC] The reflink(2) system call v5.
2009-05-11 20:40 ` [Ocfs2-devel] " Joel Becker
` (8 preceding siblings ...)
(?)
@ 2009-09-14 22:24 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-09-14 22:24 UTC (permalink / raw)
To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel
[This is a resend of the v5 patch sent on May 25th. Jim, Al, can I get
acks please.]
Here's v5 of reflink(). It adds a 'preserve' argument to the
call. This argument is currently either REFLINK_ATTR_PRESERVE or
REFLINK_ATTR_NONE. _ATTR_PRESERVE takes a full snapshot, and fails if
the caller lacks the privileges. _ATTR_NONE links up the data extents
(data and xattrs) in a CoW fashion, but otherwise initializes the new
inode as a new file (new security state, acls, ownership, etc). I took
everyone's advice and dropped attribute-specific flags for a single
_ATTR_PRESERVE.
Inside the kernel, the iop and security op get 'bool preserve'
to tell them what to do.
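
As a rough illustration of the two accepted 'preserve' values, the argument checks at the top of the reflinkat(2) entry point in the patch can be modelled in plain userspace C. This is only a sketch: reflinkat_check_args() is a hypothetical helper name, not something the patch adds, and the constants are copied from the patch's include/linux/fcntl.h hunk.

```c
#include <errno.h>
#include <fcntl.h>   /* AT_SYMLINK_FOLLOW on most libcs */

/* Values copied from the patch's include/linux/fcntl.h hunk. */
#ifndef REFLINK_ATTR_PRESERVE
#define REFLINK_ATTR_PRESERVE 0x00000001
#define REFLINK_ATTR_NONE     0
#endif

/* Fallback in case the libc headers do not expose it. */
#ifndef AT_SYMLINK_FOLLOW
#define AT_SYMLINK_FOLLOW 0x400
#endif

/*
 * Hypothetical helper: mirrors the two mask checks the patch performs
 * before doing any path lookup.  Returns 0 or a negative errno value,
 * in the kernel style.
 */
static int reflinkat_check_args(int preserve, int flags)
{
    /* Only AT_SYMLINK_FOLLOW is accepted in 'flags'. */
    if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
        return -EINVAL;

    /* 'preserve' is REFLINK_ATTR_NONE (0) or REFLINK_ATTR_PRESERVE (1). */
    if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0)
        return -EINVAL;

    return 0;
}
```

Any other bit in either argument is rejected with EINVAL, which leaves room to define further values later without changing the call signature.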
Joel
From d3c4ed0cb3f5af75f2adf92346e7a3f23870cd16 Mon Sep 17 00:00:00 2001
From: Joel Becker <joel.becker@oracle.com>
Date: Sat, 2 May 2009 22:48:59 -0700
Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.
The userspace-visible form of the operation is:
int reflink(const char *oldpath, const char *newpath, int preserve);
int reflinkat(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath,
int preserve, int flags);
The kernel only implements reflinkat(2). reflink(3) is a trivial
wrapper around reflinkat(2).
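
A minimal sketch of what such a wrapper might look like in userspace follows. It assumes __NR_reflinkat is supplied by headers from a kernel carrying this patch; without it the sketch simply fails with ENOSYS. The number is deliberately not hard-coded, since on kernels without the patch that syscall slot may belong to a different call.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>        /* AT_FDCWD */
#include <unistd.h>       /* syscall() */
#include <sys/syscall.h>  /* __NR_* numbers, if the headers define them */

/* Values copied from the patch's include/linux/fcntl.h hunk. */
#ifndef REFLINK_ATTR_PRESERVE
#define REFLINK_ATTR_PRESERVE 0x00000001
#define REFLINK_ATTR_NONE     0
#endif

/*
 * Thin wrapper over the proposed system call.  Assumption: the syscall
 * number only exists when building against patched kernel headers.
 */
static int reflinkat(int olddirfd, const char *oldpath,
                     int newdirfd, const char *newpath,
                     int preserve, int flags)
{
#ifdef __NR_reflinkat
    return (int)syscall(__NR_reflinkat, olddirfd, oldpath,
                        newdirfd, newpath, preserve, flags);
#else
    (void)olddirfd; (void)oldpath;
    (void)newdirfd; (void)newpath;
    (void)preserve; (void)flags;
    errno = ENOSYS;   /* no patched headers: report "not implemented" */
    return -1;
#endif
}

/* reflink() is reflinkat() relative to the current directory, no flags. */
static int reflink(const char *oldpath, const char *newpath, int preserve)
{
    return reflinkat(AT_FDCWD, oldpath, AT_FDCWD, newpath, preserve, 0);
}
```

A caller would use it much like link(2), for example reflink("data.img", "data.snap", REFLINK_ATTR_PRESERVE), and check errno on a -1 return.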
The reflink() system call creates reference-counted links. It creates
a new file that shares the data extents of the source file in a
copy-on-write fashion. Its calling semantics are identical to link(2)
and linkat(2). Once complete, programs see the new file as a completely
separate entry.
reflink() attempts to preserve ownership, permissions, and all other
security state in order to create a full snapshot. A caller requests
this by passing REFLINK_ATTR_PRESERVE as the 'preserve' argument.
Preserving those attributes requires ownership or CAP_CHOWN. A caller
without those privileges will get EPERM. An unprivileged caller can
specify REFLINK_ATTR_NONE. They will acquire the data extent sharing
but will see the file's security state and attributes initialized as a
new file. The unprivileged reflink requires read access.
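
The privilege rules above can be modelled as a small pure function. This is a sketch only: struct reflink_caller and reflink_may() are hypothetical names, the LSM hook is left out, and the group test follows the in_group_p() check in the patch's vfs_reflink(), which is stricter than "ownership or CAP_CHOWN" alone.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of what the VFS learns about the caller. */
struct reflink_caller {
    bool owns_file;      /* fsuid matches the source file's owner   */
    bool in_file_group;  /* caller is a member of the file's group  */
    bool cap_chown;      /* caller holds CAP_CHOWN                  */
    bool may_read;       /* caller has read permission on the file  */
};

/*
 * Model of the permission decision.  Returns 0 if the reflink may
 * proceed, or a negative errno value in the kernel style.
 */
static int reflink_may(const struct reflink_caller *c, bool preserve)
{
    if (preserve) {
        /* A full snapshot needs the right to recreate uid and gid. */
        if (!c->owns_file && !c->cap_chown)
            return -EPERM;
        if (!c->in_file_group && !c->cap_chown)
            return -EPERM;
        return 0;
    }

    /* Without preservation the caller only needs to read the data. */
    return c->may_read ? 0 : -EACCES;
}
```

Under this model a caller who merely has read access can still share the data extents with REFLINK_ATTR_NONE, while REFLINK_ATTR_PRESERVE is refused with EPERM.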
In the VFS, ->reflink() is an inode_operation with almost the same
arguments as ->link(); an additional argument tells the filesystem to
copy over or reinitialize the security state on the new file.
A new LSM hook, security_inode_reflink(), is added. None of the
existing LSM hooks appeared to fit.
This only adds the x86 linkage. The trend appears to be for other
architectures to add their own linkage.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
Documentation/filesystems/reflink.txt | 174 +++++++++++++++++++++++++++++++++
Documentation/filesystems/vfs.txt | 4 +
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/include/asm/unistd_64.h | 2 +
arch/x86/kernel/syscall_table_32.S | 1 +
fs/namei.c | 124 +++++++++++++++++++++++
include/linux/fcntl.h | 8 ++
include/linux/fs.h | 2 +
include/linux/security.h | 23 +++++
include/linux/syscalls.h | 3 +
security/capability.c | 7 ++
security/security.c | 8 ++
13 files changed, 358 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/reflink.txt
diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
new file mode 100644
index 0000000..7effe33
--- /dev/null
+++ b/Documentation/filesystems/reflink.txt
@@ -0,0 +1,174 @@
+reflink(2)
+==========
+
+
+INTRODUCTION
+------------
+
+A reflink is a reference-counted link. The reflink(2) operation is
+analogous to the link(2) operation, except that instead of two directory
+entries pointing to the same inode, there are two identical inodes
+pointing to the same data. Writes do not modify the shared data; they
+use copy-on-write (CoW). Thus, after the reflink has been created, the
+inodes can diverge without impacting each other.
+
+
+SYNOPSIS
+--------
+
+The reflink(2) call looks almost like link(2):
+
+ int reflink(const char *oldpath, const char *newpath, int preserve);
+
+The actual system call is reflinkat(2):
+
+ int reflinkat(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath,
+ int preserve, int flags);
+
+For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
+The reflink(2) call won't be implemented by the kernel, because it's a
+trivial wrapper around reflinkat(2).
+
+
+DESCRIPTION
+-----------
+
+One way of viewing reflink is to look at the level of sharing. A
+symbolic link does its sharing at the directory entry level; many names
+end up pointing at the same directory entry. Hard links are one step
+down. Multiple directory entries are sharing one inode. Reflinks are
+down one more level: multiple inodes share the same data extents.
+
+When you symlink a file, you can then access it via the symlink or the
+real directory entry, and for the most part they look identical. When
+accessing more than one name for a hard link, the object returned looks
+identical. Similarly, a newly created reflink is identical to its
+source in almost every way and can be treated as such. This includes
+ownership, permissions, security state, and data. The only things
+that are different are the inode number, the link count, and the ctime.
+
+A reflink is a snapshot of the source file at the time it is created.
+
+Once created, though, a reflink can be modified like any other normal
+file without affecting the source file. Changes to trivial fields like
+permissions, owner, or times are guaranteed not to trigger CoW of file
+data and will not return any error that wouldn't happen on a truly
+distinct file. Changes to the file's data will trigger CoW of the data
+affected - the actual CoW granularity is up to the filesystem, from
+exact bytes up to the entire file. ocfs2, for example, will copy out an
+entire extent or 1MB, whichever is smaller.
+
+Preserving the security state of the source file obviously requires
+the privilege to do so. Because of this, the reflink(2) call has the
+preserve argument. If it is set to REFLINK_ATTR_PRESERVE, the security
+state and file attributes will match the source as described above.
+Callers that do not own the source file and do not have CAP_CHOWN will
+see reflink(2) fail with EPERM. If preserve is set to
+REFLINK_ATTR_NONE, the new reflink will still share all the data extents
+of the source file, including extended attributes. The security state
+and attributes of the new reflink will be as a newly created file by
+that user. With REFLINK_ATTR_NONE, the caller must have read access to
+the source file.
+
+Partial reflinks are not allowed. The new inode will only appear in the
+directory structure after it is fully formed. This prevents a crash or
+lack of space from creating a partial reflink.
+
+If a filesystem does not support reflinks, the kernel and libc MUST NOT
+fake it. Callers are expecting to get snapshots, and faking it will
+violate that trust.
+
+The userspace view is as follows. When reflink(2) returns, opening
+oldpath and newpath returns identical-looking files, just like link(2).
+After that, oldpath and newpath behave as distinct files, and
+modifications to one have no impact on the other.
+
+
+RESTRICTIONS
+------------
+
+Just as the sharing gets lower as you move from symlink() -> link() ->
+reflink(), the restrictions on the call get tighter. A symlink doesn't
+require any access permissions other than being able to create its
+inode. It can cross filesystems and mount points, and it can point to
+any type of file. A hard link requires both source and target to be on
+the same filesystem under the same mount point, and that the source not
+be a directory. A reflink tightens that to regular files only. Like
+hard links and symlinks, a reflink cannot be created if newpath exists.
+
+Reflinks add one big restriction on top of hard links: only the owner
+or someone with elevated privileges (CAP_CHOWN) can preserve the
+security state (permissions, ownership, ACLs, etc) across a reflink.
+A reflink is a point-in-time snapshot of a file. Without the
+appropriate privilege, the caller specifying REFLINK_ATTR_PRESERVE
+will receive EPERM.
+
+A caller specifying REFLINK_ATTR_NONE must have read access to reflink a
+file.
+
+
+SHARING
+-------
+
+A reflink creates a new inode. It shares all data extents of the source
+file; this includes file data and extended attribute data. All of the
+sharing is in a CoW fashion, and any modification of the data will break
+the sharing.
+
+For some filesystems, certain data structures are not in allocated
+storage extents. Creating a reflink might make a copy of these extents.
+An example is ext3's ability to store small extended attributes inside
+the ext3 inode. Since a reflink is creating a new inode, those extended
+attributes are merely copied to the new inode.
+
+
+EXCEPTIONS
+----------
+
+When REFLINK_ATTR_PRESERVE is specified, all file attributes and
+extended attributes of the new file must be identical to the source file
+with the following exceptions:
+
+- The new file must have a new inode number. This allows POSIX
+ programs to treat the source and new files as separate objects. From
+ the view of the POSIX application, the files are distinct. The
+ sharing is invisible outside of the filesystem's internal structures.
+- The ctime of the source file only changes if the source's metadata
+ must be changed to accommodate the copy-on-write linkage. The ctime
+ of the new file is set to represent its creation.
+- The link count of the source file is unchanged, and the link count of
+ the new file is one.
+
+The mtime of the source file is unmodified, and the mtime of the new
+file is set identical to the source file. This reflects that the data
+is unchanged.
+
+If REFLINK_ATTR_NONE is specified, all data extents will be reflinked,
+but file attributes and security state will be as any new file.
+
+
+INODE OPERATION
+---------------
+
+Filesystems implement the ->reflink() inode operation. It has almost
+the same prototype as ->link():
+
+ int (*reflink)(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *new_dentry, bool preserve);
+
+When the filesystem is called, the VFS has already checked the
+permissions and mountpoint of the operation. It has determined whether
+the file attributes and security state should be preserved or
+reinitialized, as specified by the preserve argument. The filesystem
+just needs to create the new inode identical to the old one with the
+exceptions noted above, link up the shared data extents, and then link
+the new inode into dir.
+
+
+FOLLOWING SYMBOLIC LINKS
+------------------------
+
+reflink() dereferences symbolic links in the same manner that link(2)
+does. The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..0620d73 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -333,6 +333,7 @@ struct inode_operations {
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
+ int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
};
Again, all methods are called without any locks being held, unless
@@ -431,6 +432,9 @@ otherwise noted.
truncate_range: a method provided by the underlying filesystem to truncate a
range of blocks , i.e. punch a hole somewhere in a file.
+ reflink: called by the reflink(2) system call. Only required if you want
+ to support reflinks. For further information, see
+ Documentation/filesystems/reflink.txt.
The Address Space Object
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a505202..ca832b4 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -830,4 +830,5 @@ ia32_sys_call_table:
.quad sys_inotify_init1
.quad compat_sys_preadv
.quad compat_sys_pwritev
+ .quad sys_reflinkat /* 335 */
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..c368563 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
#define __NR_inotify_init1 332
#define __NR_preadv 333
#define __NR_pwritev 334
+#define __NR_reflinkat 335
#ifdef __KERNEL__
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index f818294..b20f68c 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -657,6 +657,8 @@ __SYSCALL(__NR_inotify_init1, sys_inotify_init1)
__SYSCALL(__NR_preadv, sys_preadv)
#define __NR_pwritev 296
__SYSCALL(__NR_pwritev, sys_pwritev)
+#define __NR_reflinkat 297
+__SYSCALL(__NR_reflinkat, sys_reflinkat)
#ifndef __NO_STUBS
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..d11c200 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
.long sys_inotify_init1
.long sys_preadv
.long sys_pwritev
+ .long sys_reflinkat /* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..55f5c80 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,129 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
}
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *new_dentry, bool preserve)
+{
+ struct inode *inode = old_dentry->d_inode;
+ int error;
+
+ if (!inode)
+ return -ENOENT;
+
+ error = may_create(dir, new_dentry);
+ if (error)
+ return error;
+
+ if (dir->i_sb != inode->i_sb)
+ return -EXDEV;
+
+ /*
+ * A reflink to an append-only or immutable file cannot be created.
+ */
+ if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+ return -EPERM;
+ if (!dir->i_op->reflink)
+ return -EPERM;
+
+ /*
+ * Only regular files can be reflinked; if a user tries to
+ * reflink a block device, do they expect copy-on-write of the
+ * entire device?
+ */
+ if (!S_ISREG(inode->i_mode))
+ return -EPERM;
+
+ /*
+ * If the caller wants to preserve ownership, they require the
+ * rights to do so.
+ */
+ if (preserve) {
+ if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
+ return -EPERM;
+ if (!in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
+ return -EPERM;
+ }
+
+ error = security_inode_reflink(old_dentry, dir, preserve);
+ if (error)
+ return error;
+
+ /*
+ * If the caller is modifying any aspect of the attributes, they
+ * are not creating a snapshot. They need read permission on the
+ * file.
+ */
+ if (!preserve) {
+ error = inode_permission(inode, MAY_READ);
+ if (error)
+ return error;
+ }
+
+ mutex_lock(&inode->i_mutex);
+ vfs_dq_init(dir);
+ error = dir->i_op->reflink(old_dentry, dir, new_dentry, preserve);
+ mutex_unlock(&inode->i_mutex);
+ if (!error)
+ fsnotify_create(dir, new_dentry);
+ return error;
+}
+
+SYSCALL_DEFINE6(reflinkat, int, olddfd, const char __user *, oldname,
+ int, newdfd, const char __user *, newname, int, preserve,
+ int, flags)
+{
+ struct dentry *new_dentry;
+ struct nameidata nd;
+ struct path old_path;
+ int error;
+ char *to;
+
+ if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+ return -EINVAL;
+
+ if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0)
+ return -EINVAL;
+
+ error = user_path_at(olddfd, oldname,
+ flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+ &old_path);
+ if (error)
+ return error;
+
+ error = user_path_parent(newdfd, newname, &nd, &to);
+ if (error)
+ goto out;
+ error = -EXDEV;
+ if (old_path.mnt != nd.path.mnt)
+ goto out_release;
+ new_dentry = lookup_create(&nd, 0);
+ error = PTR_ERR(new_dentry);
+ if (IS_ERR(new_dentry))
+ goto out_unlock;
+ error = mnt_want_write(nd.path.mnt);
+ if (error)
+ goto out_dput;
+ error = security_path_link(old_path.dentry, &nd.path, new_dentry);
+ if (error)
+ goto out_drop_write;
+ error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode,
+ new_dentry, preserve);
+out_drop_write:
+ mnt_drop_write(nd.path.mnt);
+out_dput:
+ dput(new_dentry);
+out_unlock:
+ mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+ path_put(&nd.path);
+ putname(to);
+out:
+ path_put(&old_path);
+
+ return error;
+}
+
+
/*
* The worst of all namespace operations - renaming directory. "Perverted"
* doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +3013,7 @@ EXPORT_SYMBOL(unlock_rename);
EXPORT_SYMBOL(vfs_create);
EXPORT_SYMBOL(vfs_follow_link);
EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
EXPORT_SYMBOL(vfs_mkdir);
EXPORT_SYMBOL(vfs_mknod);
EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 8603740..96dc2f0 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -40,6 +40,14 @@
unlinking file. */
#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */
+/*
+ * A reflink call may preserve the file's attributes in toto or not at
+ * all.
+ */
+#define REFLINK_ATTR_PRESERVE 0x00000001
+#define REFLINK_ATTR_NONE 0
+
+
#ifdef __KERNEL__
#ifndef force_o_largefile
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..c6f9cb0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *, bool);
/*
* VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
loff_t len);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
+ int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
};
struct seq_file;
diff --git a/include/linux/security.h b/include/linux/security.h
index d5fd616..2f1f520 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -528,6 +528,18 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
* @inode contains a pointer to the inode.
* @secid contains a pointer to the location where result will be saved.
* In case of failure, @secid will be set to zero.
+ * @inode_reflink:
+ * Check permission before creating a new reference-counted link to
+ * a file.
+ * @old_dentry contains the dentry structure for an existing link to
+ * the file.
+ * @dir contains the inode structure of the parent directory of the
+ * new reflink.
+ * @preserve specifies whether the caller wishes to preserve the
+ * file's attributes. If true, the caller wishes to clone the file's
+ * attributes exactly. If false, the caller expects to reflink the
+ * data extents but reset the attributes.
+ * Return 0 if permission is granted.
*
* Security hooks for file operations
*
@@ -1415,6 +1427,8 @@ struct security_operations {
int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
int (*inode_symlink) (struct inode *dir,
struct dentry *dentry, const char *old_name);
+ int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir,
+ bool preserve);
int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
@@ -1675,6 +1689,8 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
int security_inode_unlink(struct inode *dir, struct dentry *dentry);
int security_inode_symlink(struct inode *dir, struct dentry *dentry,
const char *old_name);
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+ bool preserve);
int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
@@ -2056,6 +2072,13 @@ static inline int security_inode_symlink(struct inode *dir,
return 0;
}
+static inline int security_inode_reflink(struct dentry *old_dentry,
+ struct inode *dir,
+ bool preserve)
+{
+ return 0;
+}
+
static inline int security_inode_mkdir(struct inode *dir,
struct dentry *dentry,
int mode)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..a11f228 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,9 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
int newdfd, const char __user * newname);
asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
int newdfd, const char __user *newname, int flags);
+asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
+ int newdfd, const char __user *newname,
+ int preserve, int flags);
asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
int newdfd, const char __user * newname);
asmlinkage long sys_futimesat(int dfd, char __user *filename,
diff --git a/security/capability.c b/security/capability.c
index 21b6cea..8047b7c 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -172,6 +172,12 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
return 0;
}
+static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode,
+ bool preserve)
+{
+ return 0;
+}
+
static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
int mask)
{
@@ -905,6 +911,7 @@ void security_fixup_ops(struct security_operations *ops)
set_to_cap_if_null(ops, inode_link);
set_to_cap_if_null(ops, inode_unlink);
set_to_cap_if_null(ops, inode_symlink);
+ set_to_cap_if_null(ops, inode_reflink);
set_to_cap_if_null(ops, inode_mkdir);
set_to_cap_if_null(ops, inode_rmdir);
set_to_cap_if_null(ops, inode_mknod);
diff --git a/security/security.c b/security/security.c
index 5284255..e2b12f9 100644
--- a/security/security.c
+++ b/security/security.c
@@ -470,6 +470,14 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
return security_ops->inode_symlink(dir, dentry, old_name);
}
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+ bool preserve)
+{
+ if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+ return 0;
+ return security_ops->inode_reflink(old_dentry, dir, preserve);
+}
+
int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
if (unlikely(IS_PRIVATE(dir)))
--
1.6.3
--
"Anything that is too stupid to be spoken is sung."
- Voltaire
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
* [RFC] The reflink(2) system call v5.
2009-05-11 20:40 ` [Ocfs2-devel] " Joel Becker
@ 2009-09-14 22:24 ` Joel Becker
-1 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-09-14 22:24 UTC (permalink / raw)
To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel, linux-kernel
[This is a resend of the v5 patch sent on May 25th. Jim, Al, can I get
acks please.]
Here's v5 of reflink(). It adds a 'preserve' argument to the
call. This argument may currently be one of REFLINK_ATTR_PRESERVE and
REFLINK_ATTR_NONE. _ATTR_PRESERVE takes a full snapshot, and fails if
the caller lacks the privileges. _ATTR_NONE links up the data extents
(data and xattrs) in a CoW fashion, but otherwise initializes the new
inode as a new file (new security state, acls, ownership, etc). I took
everyone's advice and dropped attribute-specific flags for a single
_ATTR_PRESERVE.
Inside the kernel, the iop and security op get 'bool preserve'
to tell them what to do.
Joel
From d3c4ed0cb3f5af75f2adf92346e7a3f23870cd16 Mon Sep 17 00:00:00 2001
From: Joel Becker <joel.becker@oracle.com>
Date: Sat, 2 May 2009 22:48:59 -0700
Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.
The userspace-visible idea of the operation is:
int reflink(const char *oldpath, const char *newpath, int preserve);
int reflinkat(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath,
int preserve, int flags);
The kernel only implements reflinkat(2). reflink(3) is a trivial
wrapper around reflinkat(2).
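To illustrate the wrapper relationship, here is a minimal userspace sketch. It is an assumption about how a libc might provide reflink(), not something this patch adds. The reflinkat() below is a hypothetical stand-in stub that always fails with ENOSYS, so the sketch compiles and runs on systems without the proposed call. The wrapper passes AT_FDCWD for both directory descriptors and 0 for flags, mirroring how sys_link() calls sys_linkat() in fs/namei.c.

```c
#include <errno.h>
#include <fcntl.h>	/* AT_FDCWD on most systems */

#ifndef AT_FDCWD
#define AT_FDCWD	-100	/* Linux value, used only if fcntl.h hides it */
#endif

/* Values copied from the include/linux/fcntl.h hunk of this patch. */
#define REFLINK_ATTR_NONE	0
#define REFLINK_ATTR_PRESERVE	0x00000001

/*
 * Hypothetical stand-in for the proposed system call. A real libc would
 * trap into the kernel here; this stub only reports that the call is
 * unavailable, and deliberately does not fall back to copying the file.
 */
static int reflinkat(int olddirfd, const char *oldpath,
		     int newdirfd, const char *newpath,
		     int preserve, int flags)
{
	(void)olddirfd;
	(void)oldpath;
	(void)newdirfd;
	(void)newpath;
	(void)preserve;
	(void)flags;
	errno = ENOSYS;
	return -1;
}

/* reflink() resolves both paths relative to the current directory. */
static int reflink(const char *oldpath, const char *newpath, int preserve)
{
	return reflinkat(AT_FDCWD, oldpath, AT_FDCWD, newpath, preserve, 0);
}
```

A caller would write reflink("a", "b", REFLINK_ATTR_NONE) and check for -1 with errno set, exactly as it would for link(2). Failing with ENOSYS rather than copying matches the requirement in reflink.txt that unsupported reflinks must not be faked.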
The reflink() system call creates reference-counted links. It creates
a new file that shares the data extents of the source file in a
copy-on-write fashion. Its calling semantics are identical to link(2)
and linkat(2). Once complete, programs see the new file as a completely
separate entry.
reflink() attempts to preserve ownership, permissions, and all other
security state in order to create a full snapshot. A caller requests
this by passing REFLINK_ATTR_PRESERVE as the 'preserve' argument.
Preserving those attributes requires ownership or CAP_CHOWN. A caller
without those privileges will get EPERM. An unprivileged caller can
specify REFLINK_ATTR_NONE. They will acquire the data extent sharing
but will see the file's security state and attributes initialized as a
new file. The unprivileged reflink requires read access.
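The privilege rules can be modelled in userspace as a small decision function. This is a sketch that follows the order of checks in vfs_reflink() and sys_reflinkat() in the fs/namei.c hunk; it leaves out the LSM hook and every other check. struct reflink_caller and reflink_check() are hypothetical names invented for the illustration, and returning -EACCES for a failed read check is an assumption about what inode_permission() would report.

```c
#include <errno.h>
#include <stdbool.h>

/* Values copied from the include/linux/fcntl.h hunk of this patch. */
#define REFLINK_ATTR_NONE	0
#define REFLINK_ATTR_PRESERVE	0x00000001

/* Hypothetical summary of the calling process, for illustration only. */
struct reflink_caller {
	unsigned int fsuid;	/* compared with the owner of the source */
	bool in_file_group;	/* stands in for in_group_p(inode->i_gid) */
	bool cap_chown;		/* stands in for capable(CAP_CHOWN) */
	bool may_read;		/* stands in for a passing MAY_READ check */
};

/* Returns 0 if the reflink may proceed, or a negative errno value. */
static int reflink_check(const struct reflink_caller *c,
			 unsigned int file_uid, int preserve)
{
	/* Unknown bits in 'preserve' are rejected up front. */
	if (preserve & ~REFLINK_ATTR_PRESERVE)
		return -EINVAL;

	if (preserve) {
		/* A full snapshot needs owner and group match, or CAP_CHOWN. */
		if (c->fsuid != file_uid && !c->cap_chown)
			return -EPERM;
		if (!c->in_file_group && !c->cap_chown)
			return -EPERM;
		return 0;
	}

	/* Without preservation, only read access to the source is needed. */
	return c->may_read ? 0 : -EACCES;
}
```

In this model a caller who does not own the file gets -EPERM with REFLINK_ATTR_PRESERVE but succeeds with REFLINK_ATTR_NONE as long as the file is readable. Note that the model, like the patch, applies the read check only on the REFLINK_ATTR_NONE path.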
In the VFS, ->reflink() is an inode_operation with almost the same
arguments as ->link(); an additional argument tells the filesystem to
copy over or reinitialize the security state on the new file.
A new LSM hook, security_inode_reflink(), is added. None of the
existing LSM hooks appeared to fit.
This only adds the x86 linkage. The trend appears to be for other
architectures to add their own linkage.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
Documentation/filesystems/reflink.txt | 174 +++++++++++++++++++++++++++++++++
Documentation/filesystems/vfs.txt | 4 +
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/include/asm/unistd_64.h | 2 +
arch/x86/kernel/syscall_table_32.S | 1 +
fs/namei.c | 124 +++++++++++++++++++++++
include/linux/fcntl.h | 8 ++
include/linux/fs.h | 2 +
include/linux/security.h | 23 +++++
include/linux/syscalls.h | 3 +
security/capability.c | 7 ++
security/security.c | 8 ++
13 files changed, 358 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/reflink.txt
diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
new file mode 100644
index 0000000..7effe33
--- /dev/null
+++ b/Documentation/filesystems/reflink.txt
@@ -0,0 +1,174 @@
+reflink(2)
+==========
+
+
+INTRODUCTION
+------------
+
+A reflink is a reference-counted link. The reflink(2) operation is
+analogous to the link(2) operation, except that instead of two directory
+entries pointing to the same inode, there are two identical inodes
+pointing to the same data. Writes do not modify the shared data; they
+use copy-on-write (CoW). Thus, after the reflink has been created, the
+inodes can diverge without impacting each other.
+
+
+SYNOPSIS
+--------
+
+The reflink(2) call looks almost like link(2):
+
+ int reflink(const char *oldpath, const char *newpath, int preserve);
+
+The actual system call is reflinkat(2):
+
+ int reflinkat(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath,
+ int preserve, int flags);
+
+For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
+The reflink(2) call won't be implemented by the kernel, because it's a
+trivial wrapper around reflinkat(2).
+
+
+DESCRIPTION
+-----------
+
+One way of viewing reflink is to look at the level of sharing. A
+symbolic link does its sharing at the directory entry level; many names
+end up pointing at the same directory entry. Hard links are one step
+down. Multiple directory entries are sharing one inode. Reflinks are
+down one more level: multiple inodes share the same data extents.
+
+When you symlink a file, you can then access it via the symlink or the
+real directory entry, and for the most part they look identical. When
+accessing more than one name for a hard link, the object returned looks
+identical. Similarly, a newly created reflink is identical to its
+source in almost every way and can be treated as such. This includes
+ownership, permissions, security state, and data. The only things
+that are different are the inode number, the link count, and the ctime.
+
+A reflink is a snapshot of the source file at the time it is created.
+
+Once created, though, a reflink can be modified like any other normal
+file without affecting the source file. Changes to trivial fields like
+permissions, owner, or times are guaranteed not to trigger CoW of file
+data and will not return any error that wouldn't happen on a truly
+distinct file. Changes to the file's data will trigger CoW of the data
+affected - the actual CoW granularity is up to the filesystem, from
+exact bytes up to the entire file. ocfs2, for example, will copy out an
+entire extent or 1MB, whichever is smaller.
+
+Preserving the security state of the source file obviously requires
+the privilege to do so. Because of this, the reflink(2) call has the
+preserve argument. If it is set to REFLINK_ATTR_PRESERVE, the security
+state and file attributes will match the source as described above.
+Callers that do not own the source file and do not have CAP_CHOWN will
+see reflink(2) fail with EPERM. If preserve is set to
+REFLINK_ATTR_NONE, the new reflink will still share all the data extents
+of the source file, including extended attributes. The security state
+and attributes of the new reflink will be as a newly created file by
+that user. With REFLINK_ATTR_NONE, the caller must have read access to
+the source file.
+
+Partial reflinks are not allowed. The new inode will only appear in the
+directory structure after it is fully formed. This prevents a crash or
+lack of space from creating a partial reflink.
+
+If a filesystem does not support reflinks, the kernel and libc MUST NOT
+fake it. Callers are expecting to get snapshots, and faking it will
+violate that trust.
+
+The userspace view is as follows. When reflink(2) returns, opening
+oldpath and newpath returns identical-looking files, just like link(2).
+After that, oldpath and newpath behave as distinct files, and
+modifications to one have no impact on the other.
+
+
+RESTRICTIONS
+------------
+
+Just as the sharing gets lower as you move from symlink() -> link() ->
+reflink(), the restrictions on the call get tighter. A symlink doesn't
+require any access permissions other than being able to create its
+inode. It can cross filesystems and mount points, and it can point to
+any type of file. A hard link requires both source and target to be on
+the same filesystem under the same mount point, and that the source not
+be a directory. A reflink tightens that to regular files only. Like
+hard links and symlinks, a reflink cannot be created if newpath exists.
+
+Reflinks add one big restriction on top of hard links: only the owner
+or someone with elevated privileges (CAP_CHOWN) can preserve the
+security state (permissions, ownership, ACLs, etc) across a reflink.
+A reflink is a point-in-time snapshot of a file. Without the
+appropriate privilege, the caller specifying REFLINK_ATTR_PRESERVE
+will receive EPERM.
+
+A caller specifying REFLINK_ATTR_NONE must have read access to reflink a
+file.
+
+
+SHARING
+-------
+
+A reflink creates a new inode. It shares all data extents of the source
+file; this includes file data and extended attribute data. All of the
+sharing is in a CoW fashion, and any modification of the data will break
+the sharing.
+
+For some filesystems, certain data structures are not in allocated
+storage extents. Creating a reflink might make a copy of these structures.
+An example is ext3's ability to store small extended attributes inside
+the ext3 inode. Since a reflink is creating a new inode, those extended
+attributes are merely copied to the new inode.
+
+
+EXCEPTIONS
+----------
+
+When REFLINK_ATTR_PRESERVE is specified, all file attributes and
+extended attributes of the new file must be identical to the source file
+with the following exceptions:
+
+- The new file must have a new inode number. This allows POSIX
+ programs to treat the source and new files as separate objects. From
+ the view of the POSIX application, the files are distinct. The
+ sharing is invisible outside of the filesystem's internal structures.
+- The ctime of the source file only changes if the source's metadata
+ must be changed to accommodate the copy-on-write linkage. The ctime
+ of the new file is set to represent its creation.
+- The link count of the source file is unchanged, and the link count of
+ the new file is one.
+
+The mtime of the source file is unmodified, and the mtime of the new
+file is set identical to the source file. This reflects that the data
+is unchanged.
+
+If REFLINK_ATTR_NONE is specified, all data extents will be reflinked,
+but file attributes and security state will be as any new file.
+
+
+INODE OPERATION
+---------------
+
+Filesystems implement the ->reflink() inode operation. It has almost
+the same prototype as ->link():
+
+ int (*reflink)(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *new_dentry, bool preserve);
+
+When the filesystem is called, the VFS has already checked the
+permissions and mountpoint of the operation. It has determined whether
+the file attributes and security state should be preserved or
+reinitialized, as specified by the preserve argument. The filesystem
+just needs to create the new inode identical to the old one with the
+exceptions noted above, link up the shared data extents, and then link
+the new inode into dir.
+
+
+FOLLOWING SYMBOLIC LINKS
+------------------------
+
+reflink() dereferences symbolic links in the same manner that link(2)
+does. The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..0620d73 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -333,6 +333,7 @@ struct inode_operations {
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
+ int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
};
Again, all methods are called without any locks being held, unless
@@ -431,6 +432,9 @@ otherwise noted.
truncate_range: a method provided by the underlying filesystem to truncate a
range of blocks , i.e. punch a hole somewhere in a file.
+ reflink: called by the reflink(2) system call. Only required if you want
+ to support reflinks. For further information, see
+ Documentation/filesystems/reflink.txt.
The Address Space Object
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a505202..ca832b4 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -830,4 +830,5 @@ ia32_sys_call_table:
.quad sys_inotify_init1
.quad compat_sys_preadv
.quad compat_sys_pwritev
+ .quad sys_reflinkat /* 335 */
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..c368563 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
#define __NR_inotify_init1 332
#define __NR_preadv 333
#define __NR_pwritev 334
+#define __NR_reflinkat 335
#ifdef __KERNEL__
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index f818294..b20f68c 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -657,6 +657,8 @@ __SYSCALL(__NR_inotify_init1, sys_inotify_init1)
__SYSCALL(__NR_preadv, sys_preadv)
#define __NR_pwritev 296
__SYSCALL(__NR_pwritev, sys_pwritev)
+#define __NR_reflinkat 297
+__SYSCALL(__NR_reflinkat, sys_reflinkat)
#ifndef __NO_STUBS
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..d11c200 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
.long sys_inotify_init1
.long sys_preadv
.long sys_pwritev
+ .long sys_reflinkat /* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..55f5c80 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,129 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
}
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *new_dentry, bool preserve)
+{
+ struct inode *inode = old_dentry->d_inode;
+ int error;
+
+ if (!inode)
+ return -ENOENT;
+
+ error = may_create(dir, new_dentry);
+ if (error)
+ return error;
+
+ if (dir->i_sb != inode->i_sb)
+ return -EXDEV;
+
+ /*
+ * A reflink to an append-only or immutable file cannot be created.
+ */
+ if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+ return -EPERM;
+ if (!dir->i_op->reflink)
+ return -EPERM;
+
+ /*
+ * Only regular files can be reflinked; if a user tries to
+ * reflink a block device, do they expect copy-on-write of the
+ * entire device?
+ */
+ if (!S_ISREG(inode->i_mode))
+ return -EPERM;
+
+ /*
+ * If the caller wants to preserve ownership, they require the
+ * rights to do so.
+ */
+ if (preserve) {
+ if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
+ return -EPERM;
+ if (!in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
+ return -EPERM;
+ }
+
+ error = security_inode_reflink(old_dentry, dir, preserve);
+ if (error)
+ return error;
+
+ /*
+ * If the caller is modifying any aspect of the attributes, they
+ * are not creating a snapshot. They need read permission on the
+ * file.
+ */
+ if (!preserve) {
+ error = inode_permission(inode, MAY_READ);
+ if (error)
+ return error;
+ }
+
+ mutex_lock(&inode->i_mutex);
+ vfs_dq_init(dir);
+ error = dir->i_op->reflink(old_dentry, dir, new_dentry, preserve);
+ mutex_unlock(&inode->i_mutex);
+ if (!error)
+ fsnotify_create(dir, new_dentry);
+ return error;
+}
+
+SYSCALL_DEFINE6(reflinkat, int, olddfd, const char __user *, oldname,
+ int, newdfd, const char __user *, newname, int, preserve,
+ int, flags)
+{
+ struct dentry *new_dentry;
+ struct nameidata nd;
+ struct path old_path;
+ int error;
+ char *to;
+
+ if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+ return -EINVAL;
+
+ if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0)
+ return -EINVAL;
+
+ error = user_path_at(olddfd, oldname,
+ flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+ &old_path);
+ if (error)
+ return error;
+
+ error = user_path_parent(newdfd, newname, &nd, &to);
+ if (error)
+ goto out;
+ error = -EXDEV;
+ if (old_path.mnt != nd.path.mnt)
+ goto out_release;
+ new_dentry = lookup_create(&nd, 0);
+ error = PTR_ERR(new_dentry);
+ if (IS_ERR(new_dentry))
+ goto out_unlock;
+ error = mnt_want_write(nd.path.mnt);
+ if (error)
+ goto out_dput;
+ error = security_path_link(old_path.dentry, &nd.path, new_dentry);
+ if (error)
+ goto out_drop_write;
+ error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode,
+ new_dentry, preserve);
+out_drop_write:
+ mnt_drop_write(nd.path.mnt);
+out_dput:
+ dput(new_dentry);
+out_unlock:
+ mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+ path_put(&nd.path);
+ putname(to);
+out:
+ path_put(&old_path);
+
+ return error;
+}
+
+
/*
* The worst of all namespace operations - renaming directory. "Perverted"
* doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +3013,7 @@ EXPORT_SYMBOL(unlock_rename);
EXPORT_SYMBOL(vfs_create);
EXPORT_SYMBOL(vfs_follow_link);
EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
EXPORT_SYMBOL(vfs_mkdir);
EXPORT_SYMBOL(vfs_mknod);
EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 8603740..96dc2f0 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -40,6 +40,14 @@
unlinking file. */
#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */
+/*
+ * A reflink call may preserve the file's attributes in toto or not at
+ * all.
+ */
+#define REFLINK_ATTR_PRESERVE 0x00000001
+#define REFLINK_ATTR_NONE 0
+
+
#ifdef __KERNEL__
#ifndef force_o_largefile
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..c6f9cb0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *, bool);
/*
* VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
loff_t len);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
+ int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
};
struct seq_file;
diff --git a/include/linux/security.h b/include/linux/security.h
index d5fd616..2f1f520 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -528,6 +528,18 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
* @inode contains a pointer to the inode.
* @secid contains a pointer to the location where result will be saved.
* In case of failure, @secid will be set to zero.
+ * @inode_reflink:
+ * Check permission before creating a new reference-counted link to
+ * a file.
+ * @old_dentry contains the dentry structure for an existing link to
+ * the file.
+ * @dir contains the inode structure of the parent directory of the
+ * new reflink.
+ * @preserve specifies whether the caller wishes to preserve the
+ * file's attributes. If true, the caller wishes to clone the file's
+ * attributes exactly. If false, the caller expects to reflink the
+ * data extents but reset the attributes.
+ * Return 0 if permission is granted.
*
* Security hooks for file operations
*
@@ -1415,6 +1427,8 @@ struct security_operations {
int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
int (*inode_symlink) (struct inode *dir,
struct dentry *dentry, const char *old_name);
+ int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir,
+ bool preserve);
int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
@@ -1675,6 +1689,8 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
int security_inode_unlink(struct inode *dir, struct dentry *dentry);
int security_inode_symlink(struct inode *dir, struct dentry *dentry,
const char *old_name);
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+ bool preserve);
int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
@@ -2056,6 +2072,13 @@ static inline int security_inode_symlink(struct inode *dir,
return 0;
}
+static inline int security_inode_reflink(struct dentry *old_dentry,
+ struct inode *dir,
+ bool preserve)
+{
+ return 0;
+}
+
static inline int security_inode_mkdir(struct inode *dir,
struct dentry *dentry,
int mode)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..a11f228 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,9 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
int newdfd, const char __user * newname);
asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
int newdfd, const char __user *newname, int flags);
+asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
+ int newdfd, const char __user *newname,
+ int preserve, int flags);
asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
int newdfd, const char __user * newname);
asmlinkage long sys_futimesat(int dfd, char __user *filename,
diff --git a/security/capability.c b/security/capability.c
index 21b6cea..8047b7c 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -172,6 +172,12 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
return 0;
}
+static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode,
+ bool preserve)
+{
+ return 0;
+}
+
static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
int mask)
{
@@ -905,6 +911,7 @@ void security_fixup_ops(struct security_operations *ops)
set_to_cap_if_null(ops, inode_link);
set_to_cap_if_null(ops, inode_unlink);
set_to_cap_if_null(ops, inode_symlink);
+ set_to_cap_if_null(ops, inode_reflink);
set_to_cap_if_null(ops, inode_mkdir);
set_to_cap_if_null(ops, inode_rmdir);
set_to_cap_if_null(ops, inode_mknod);
diff --git a/security/security.c b/security/security.c
index 5284255..e2b12f9 100644
--- a/security/security.c
+++ b/security/security.c
@@ -470,6 +470,14 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
return security_ops->inode_symlink(dir, dentry, old_name);
}
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+ bool preserve)
+{
+ if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+ return 0;
+ return security_ops->inode_reflink(old_dentry, dir, preserve);
+}
+
int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
if (unlikely(IS_PRIVATE(dir)))
--
1.6.3
--
"Anything that is too stupid to be spoken is sung."
- Voltaire
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
* [Ocfs2-devel] [RFC] The reflink(2) system call v5.
@ 2009-09-14 22:24 ` Joel Becker
0 siblings, 0 replies; 304+ messages in thread
From: Joel Becker @ 2009-09-14 22:24 UTC (permalink / raw)
To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
linux-security-module, linux-fsdevel, linux-kernel
[This is a resend of the v5 patch sent on May 25th. Jim, Al, can I get
acks please.]
Here's v5 of reflink(). It adds a 'preserve' argument to the
call. This argument may currently be one of REFLINK_ATTR_PRESERVE and
REFLINK_ATTR_NONE. _ATTR_PRESERVE takes a full snapshot, and fails if
the caller lacks the privileges. _ATTR_NONE links up the data extents
(data and xattrs) in a CoW fashion, but otherwise initializes the new
inode as a new file (new security state, acls, ownership, etc). I took
everyone's advice and dropped attribute-specific flags for a single
_ATTR_PRESERVE.
Inside the kernel, the iop and security op get 'bool preserve'
to tell them what to do.
Joel
From d3c4ed0cb3f5af75f2adf92346e7a3f23870cd16 Mon Sep 17 00:00:00 2001
From: Joel Becker <joel.becker@oracle.com>
Date: Sat, 2 May 2009 22:48:59 -0700
Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.
The userspace-visible idea of the operation is:
int reflink(const char *oldpath, const char *newpath, int preserve);
int reflinkat(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath,
int preserve, int flags);
The kernel only implements reflinkat(2). reflink(3) is a trivial
wrapper around reflinkat(2).
The reflink() system call creates reference-counted links. It creates
a new file that shares the data extents of the source file in a
copy-on-write fashion. Its calling semantics are identical to link(2)
and linkat(2). Once complete, programs see the new file as a completely
separate entry.
reflink() attempts to preserve ownership, permissions, and all other
security state in order to create a full snapshot. A caller requests
this by passing REFLINK_ATTR_PRESERVE as the 'preserve' argument.
Preserving those attributes requires ownership or CAP_CHOWN. A caller
without those privileges will get EPERM. An unprivileged caller can
specify REFLINK_ATTR_NONE. They will acquire the data extent sharing
but will see the file's security state and attributes initialized as a
new file. The unprivileged reflink requires read access.
In the VFS, ->reflink() is an inode_operation with almost the same
arguments as ->link(); an additional argument tells the filesystem to
copy over or reinitialize the security state on the new file.
A new LSM hook, security_inode_reflink(), is added. None of the
existing LSM hooks appeared to fit.
This only adds the x86 linkage. The trend appears to be for other
architectures to add their own linkage.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
Documentation/filesystems/reflink.txt | 174 +++++++++++++++++++++++++++++++++
Documentation/filesystems/vfs.txt | 4 +
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/include/asm/unistd_64.h | 2 +
arch/x86/kernel/syscall_table_32.S | 1 +
fs/namei.c | 124 +++++++++++++++++++++++
include/linux/fcntl.h | 8 ++
include/linux/fs.h | 2 +
include/linux/security.h | 23 +++++
include/linux/syscalls.h | 3 +
security/capability.c | 7 ++
security/security.c | 8 ++
13 files changed, 358 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/reflink.txt
diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
new file mode 100644
index 0000000..7effe33
--- /dev/null
+++ b/Documentation/filesystems/reflink.txt
@@ -0,0 +1,174 @@
+reflink(2)
+==========
+
+
+INTRODUCTION
+------------
+
+A reflink is a reference-counted link. The reflink(2) operation is
+analogous to the link(2) operation, except that instead of two directory
+entries pointing to the same inode, there are two identical inodes
+pointing to the same data. Writes do not modify the shared data; they
+use copy-on-write (CoW). Thus, after the reflink has been created, the
+inodes can diverge without impacting each other.
+
+
+SYNOPSIS
+--------
+
+The reflink(2) call looks almost like link(2):
+
+ int reflink(const char *oldpath, const char *newpath, int preserve);
+
+The actual system call is reflinkat(2):
+
+ int reflinkat(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath,
+ int preserve, int flags);
+
+For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
+The reflink(2) call won't be implemented by the kernel, because it's a
+trivial wrapper around reflinkat(2).
+
+
+DESCRIPTION
+-----------
+
+One way of viewing reflink is to look at the level of sharing. A
+symbolic link does its sharing at the directory entry level; many names
+end up pointing at the same directory entry. Hard links are one step
+down. Multiple directory entries are sharing one inode. Reflinks are
+down one more level: multiple inodes share the same data extents.
+
+When you symlink a file, you can then access it via the symlink or the
+real directory entry, and for the most part they look identical. When
+accessing more than one name for a hard link, the object returned looks
+identical. Similarly, a newly created reflink is identical to its
+source in almost every way and can be treated as such. This includes
+ownership, permissions, security state, and data. The only things
+that are different are the inode number, the link count, and the ctime.
+
+A reflink is a snapshot of the source file at the time it is created.
+
+Once created, though, a reflink can be modified like any other normal
+file without affecting the source file. Changes to trivial fields like
+permissions, owner, or times are guaranteed not to trigger CoW of file
+data and will not return any error that wouldn't happen on a truly
+distinct file. Changes to the file's data will trigger CoW of the data
+affected - the actual CoW granularity is up to the filesystem, from
+exact bytes up to the entire file. ocfs2, for example, will copy out an
+entire extent or 1MB, whichever is smaller.
+
+Preserving the security state of the source file obviously requires
+the privilege to do so. Because of this, the reflink(2) call has the
+preserve argument. If it is set to REFLINK_ATTR_PRESERVE, the security
+state and file attributes will match the source as described above.
+Callers that do not own the source file and do not have CAP_CHOWN will
+see reflink(2) fail with EPERM. If preserve is set to
+REFLINK_ATTR_NONE, the new reflink will still share all the data extents
+of the source file, including extended attributes. The security state
+and attributes of the new reflink will be as a newly created file by
+that user. With REFLINK_ATTR_NONE, the caller must have read access to
+the source file.
+
+Partial reflinks are not allowed. The new inode will only appear in the
+directory structure after it is fully formed. This prevents a crash or
+lack of space from creating a partial reflink.
+
+If a filesystem does not support reflinks, the kernel and libc MUST NOT
+fake it. Callers are expecting to get snapshots, and faking it will
+violate that trust.
+
+The userspace view is as follows. When reflink(2) returns, opening
+oldpath and newpath returns identical-looking files, just like link(2).
+After that, oldpath and newpath behave as distinct files, and
+modifications to one have no impact on the other.
+
+
+RESTRICTIONS
+------------
+
+Just as the sharing gets lower as you move from symlink() -> link() ->
+reflink(), the restrictions on the call get tighter. A symlink doesn't
+require any access permissions other than being able to create its
+inode. It can cross filesystems and mount points, and it can point to
+any type of file. A hard link requires both source and target to be on
+the same filesystem under the same mount point, and that the source not
+be a directory. A reflink tightens that to regular files only. Like
+hard links and symlinks, a reflink cannot be created if newpath exists.
+
+Reflinks add one big restriction on top of hard links: only the owner
+or someone with elevated privileges (CAP_CHOWN) can preserve the
+security state (permissions, ownership, ACLs, etc) across a reflink.
+A reflink is a point-in-time snapshot of a file. Without the
+appropriate privilege, the caller specifying REFLINK_ATTR_PRESERVE
+will receive EPERM.
+
+A caller specifying REFLINK_ATTR_NONE must have read access to reflink a
+file.
+
+
+SHARING
+-------
+
+A reflink creates a new inode. It shares all data extents of the source
+file; this includes file data and extended attribute data. All of the
+sharing is in a CoW fashion, and any modification of the data will break
+the sharing.
+
+For some filesystems, certain data structures are not in allocated
+storage extents.  Creating a reflink might make a copy of these structures.
+An example is ext3's ability to store small extended attributes inside
+the ext3 inode. Since a reflink is creating a new inode, those extended
+attributes are merely copied to the new inode.
+
+
+EXCEPTIONS
+----------
+
+When REFLINK_ATTR_PRESERVE is specified, all file attributes and
+extended attributes of the new file must be identical to the source file
+with the following exceptions:
+
+- The new file must have a new inode number. This allows POSIX
+ programs to treat the source and new files as separate objects. From
+ the view of the POSIX application, the files are distinct. The
+ sharing is invisible outside of the filesystem's internal structures.
+- The ctime of the source file only changes if the source's metadata
+ must be changed to accommodate the copy-on-write linkage. The ctime
+ of the new file is set to represent its creation.
+- The link count of the source file is unchanged, and the link count of
+ the new file is one.
+
+The mtime of the source file is unmodified, and the mtime of the new
+file is set identical to the source file. This reflects that the data
+is unchanged.
+
+If REFLINK_ATTR_NONE is specified, all data extents will be reflinked,
+but file attributes and security state will be as any new file.
+
+
+INODE OPERATION
+---------------
+
+Filesystems implement the ->reflink() inode operation. It has almost
+the same prototype as ->link():
+
+ int (*reflink)(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *new_dentry, bool preserve);
+
+When the filesystem is called, the VFS has already checked the
+permissions and mountpoint of the operation. It has determined whether
+the file attributes and security state should be preserved or
+reinitialized, as specified by the preserve argument. The filesystem
+just needs to create the new inode identical to the old one with the
+exceptions noted above, link up the shared data extents, and then link
+the new inode into dir.
+
+
+FOLLOWING SYMBOLIC LINKS
+------------------------
+
+reflink() dereferences symbolic links in the same manner that link(2)
+does. The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..0620d73 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -333,6 +333,7 @@ struct inode_operations {
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
+ int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
};
Again, all methods are called without any locks being held, unless
@@ -431,6 +432,9 @@ otherwise noted.
truncate_range: a method provided by the underlying filesystem to truncate a
range of blocks , i.e. punch a hole somewhere in a file.
+ reflink: called by the reflink(2) system call. Only required if you want
+ to support reflinks. For further information, see
+ Documentation/filesystems/reflink.txt.
The Address Space Object
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a505202..ca832b4 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -830,4 +830,5 @@ ia32_sys_call_table:
.quad sys_inotify_init1
.quad compat_sys_preadv
.quad compat_sys_pwritev
+ .quad sys_reflinkat /* 335 */
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..c368563 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
#define __NR_inotify_init1 332
#define __NR_preadv 333
#define __NR_pwritev 334
+#define __NR_reflinkat 335
#ifdef __KERNEL__
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index f818294..b20f68c 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -657,6 +657,8 @@ __SYSCALL(__NR_inotify_init1, sys_inotify_init1)
__SYSCALL(__NR_preadv, sys_preadv)
#define __NR_pwritev 296
__SYSCALL(__NR_pwritev, sys_pwritev)
+#define __NR_reflinkat 297
+__SYSCALL(__NR_reflinkat, sys_reflinkat)
#ifndef __NO_STUBS
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..d11c200 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
.long sys_inotify_init1
.long sys_preadv
.long sys_pwritev
+ .long sys_reflinkat /* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..55f5c80 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,129 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
}
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *new_dentry, bool preserve)
+{
+ struct inode *inode = old_dentry->d_inode;
+ int error;
+
+ if (!inode)
+ return -ENOENT;
+
+ error = may_create(dir, new_dentry);
+ if (error)
+ return error;
+
+ if (dir->i_sb != inode->i_sb)
+ return -EXDEV;
+
+ /*
+ * A reflink to an append-only or immutable file cannot be created.
+ */
+ if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+ return -EPERM;
+ if (!dir->i_op->reflink)
+ return -EPERM;
+
+ /*
+ * Only regular files can be reflinked; if a user tries to
+ * reflink a block device, do they expect copy-on-write of the
+ * entire device?
+ */
+ if (!S_ISREG(inode->i_mode))
+ return -EPERM;
+
+ /*
+ * If the caller wants to preserve ownership, they require the
+ * rights to do so.
+ */
+ if (preserve) {
+ if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
+ return -EPERM;
+ if (!in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
+ return -EPERM;
+ }
+
+ error = security_inode_reflink(old_dentry, dir, preserve);
+ if (error)
+ return error;
+
+ /*
+ * If the caller is modifying any aspect of the attributes, they
+ * are not creating a snapshot. They need read permission on the
+ * file.
+ */
+ if (!preserve) {
+ error = inode_permission(inode, MAY_READ);
+ if (error)
+ return error;
+ }
+
+ mutex_lock(&inode->i_mutex);
+ vfs_dq_init(dir);
+ error = dir->i_op->reflink(old_dentry, dir, new_dentry, preserve);
+ mutex_unlock(&inode->i_mutex);
+ if (!error)
+ fsnotify_create(dir, new_dentry);
+ return error;
+}
+
+SYSCALL_DEFINE6(reflinkat, int, olddfd, const char __user *, oldname,
+ int, newdfd, const char __user *, newname, int, preserve,
+ int, flags)
+{
+ struct dentry *new_dentry;
+ struct nameidata nd;
+ struct path old_path;
+ int error;
+ char *to;
+
+ if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+ return -EINVAL;
+
+ if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0)
+ return -EINVAL;
+
+ error = user_path_at(olddfd, oldname,
+ flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+ &old_path);
+ if (error)
+ return error;
+
+ error = user_path_parent(newdfd, newname, &nd, &to);
+ if (error)
+ goto out;
+ error = -EXDEV;
+ if (old_path.mnt != nd.path.mnt)
+ goto out_release;
+ new_dentry = lookup_create(&nd, 0);
+ error = PTR_ERR(new_dentry);
+ if (IS_ERR(new_dentry))
+ goto out_unlock;
+ error = mnt_want_write(nd.path.mnt);
+ if (error)
+ goto out_dput;
+ error = security_path_link(old_path.dentry, &nd.path, new_dentry);
+ if (error)
+ goto out_drop_write;
+ error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode,
+ new_dentry, preserve);
+out_drop_write:
+ mnt_drop_write(nd.path.mnt);
+out_dput:
+ dput(new_dentry);
+out_unlock:
+ mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+ path_put(&nd.path);
+ putname(to);
+out:
+ path_put(&old_path);
+
+ return error;
+}
+
+
/*
* The worst of all namespace operations - renaming directory. "Perverted"
* doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +3013,7 @@ EXPORT_SYMBOL(unlock_rename);
EXPORT_SYMBOL(vfs_create);
EXPORT_SYMBOL(vfs_follow_link);
EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
EXPORT_SYMBOL(vfs_mkdir);
EXPORT_SYMBOL(vfs_mknod);
EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 8603740..96dc2f0 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -40,6 +40,14 @@
unlinking file. */
#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */
+/*
+ * A reflink call may preserve the file's attributes in toto or not at
+ * all.
+ */
+#define REFLINK_ATTR_PRESERVE 0x00000001
+#define REFLINK_ATTR_NONE 0
+
+
#ifdef __KERNEL__
#ifndef force_o_largefile
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..c6f9cb0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *, bool);
/*
* VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
loff_t len);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
+ int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
};
struct seq_file;
diff --git a/include/linux/security.h b/include/linux/security.h
index d5fd616..2f1f520 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -528,6 +528,18 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
* @inode contains a pointer to the inode.
* @secid contains a pointer to the location where result will be saved.
* In case of failure, @secid will be set to zero.
+ * @inode_reflink:
+ * Check permission before creating a new reference-counted link to
+ * a file.
+ * @old_dentry contains the dentry structure for an existing link to
+ * the file.
+ * @dir contains the inode structure of the parent directory of the
+ * new reflink.
+ * @preserve specifies whether the caller wishes to preserve the
+ * file's attributes. If true, the caller wishes to clone the file's
+ * attributes exactly. If false, the caller expects to reflink the
+ * data extents but reset the attributes.
+ * Return 0 if permission is granted.
*
* Security hooks for file operations
*
@@ -1415,6 +1427,8 @@ struct security_operations {
int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
int (*inode_symlink) (struct inode *dir,
struct dentry *dentry, const char *old_name);
+ int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir,
+ bool preserve);
int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
@@ -1675,6 +1689,8 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
int security_inode_unlink(struct inode *dir, struct dentry *dentry);
int security_inode_symlink(struct inode *dir, struct dentry *dentry,
const char *old_name);
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+ bool preserve);
int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
@@ -2056,6 +2072,13 @@ static inline int security_inode_symlink(struct inode *dir,
return 0;
}
+static inline int security_inode_reflink(struct dentry *old_dentry,
+ struct inode *dir,
+ bool preserve)
+{
+ return 0;
+}
+
static inline int security_inode_mkdir(struct inode *dir,
struct dentry *dentry,
int mode)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..a11f228 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,9 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
int newdfd, const char __user * newname);
asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
int newdfd, const char __user *newname, int flags);
+asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
+ int newdfd, const char __user *newname,
+ int preserve, int flags);
asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
int newdfd, const char __user * newname);
asmlinkage long sys_futimesat(int dfd, char __user *filename,
diff --git a/security/capability.c b/security/capability.c
index 21b6cea..8047b7c 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -172,6 +172,12 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
return 0;
}
+static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode,
+ bool preserve)
+{
+ return 0;
+}
+
static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
int mask)
{
@@ -905,6 +911,7 @@ void security_fixup_ops(struct security_operations *ops)
set_to_cap_if_null(ops, inode_link);
set_to_cap_if_null(ops, inode_unlink);
set_to_cap_if_null(ops, inode_symlink);
+ set_to_cap_if_null(ops, inode_reflink);
set_to_cap_if_null(ops, inode_mkdir);
set_to_cap_if_null(ops, inode_rmdir);
set_to_cap_if_null(ops, inode_mknod);
diff --git a/security/security.c b/security/security.c
index 5284255..e2b12f9 100644
--- a/security/security.c
+++ b/security/security.c
@@ -470,6 +470,14 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
return security_ops->inode_symlink(dir, dentry, old_name);
}
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+ bool preserve)
+{
+ if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+ return 0;
+ return security_ops->inode_reflink(old_dentry, dir, preserve);
+}
+
int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
if (unlikely(IS_PRIVATE(dir)))
--
1.6.3
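
For anyone who wants to reason about the permission rules without
booting a patched kernel, here is a rough userspace model of the checks
in vfs_reflink() and the flag validation in reflinkat().  It is only a
sketch.  The struct names, field names, and reflink_check_model() are
made up for illustration and do not exist in the patch.  The model skips
may_create(), the -EXDEV checks, the missing ->reflink() case, and the
LSM hook.  It also assumes a failed inode_permission(MAY_READ) surfaces
as -EACCES, which is the usual result but is not stated above.

```c
#include <errno.h>
#include <stdbool.h>

/* Values copied from the include/linux/fcntl.h hunk of the patch. */
#define REFLINK_ATTR_NONE     0
#define REFLINK_ATTR_PRESERVE 0x00000001

/* Hypothetical stand-in for the caller state the kernel consults. */
struct caller_model {
	unsigned int fsuid;	/* models current_fsuid() */
	bool in_file_group;	/* models in_group_p(inode->i_gid) */
	bool has_cap_chown;	/* models capable(CAP_CHOWN) */
	bool can_read;		/* models inode_permission(inode, MAY_READ) == 0 */
};

/* Hypothetical stand-in for the source inode. */
struct file_model {
	unsigned int uid;		/* models inode->i_uid */
	bool is_regular;		/* models S_ISREG(inode->i_mode) */
	bool append_or_immutable;	/* models IS_APPEND() || IS_IMMUTABLE() */
};

/* Returns 0 if the modeled checks pass, or a negative errno. */
static int reflink_check_model(const struct caller_model *c,
			       const struct file_model *f, int preserve)
{
	/* reflinkat() rejects any bit other than REFLINK_ATTR_PRESERVE. */
	if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0)
		return -EINVAL;

	/* Append-only and immutable files cannot be reflinked. */
	if (f->append_or_immutable)
		return -EPERM;

	/* Only regular files can be reflinked. */
	if (!f->is_regular)
		return -EPERM;

	if (preserve) {
		/* Preserving the security state needs owner + group, or CAP_CHOWN. */
		if (c->fsuid != f->uid && !c->has_cap_chown)
			return -EPERM;
		if (!c->in_file_group && !c->has_cap_chown)
			return -EPERM;
	} else if (!c->can_read) {
		/* Without preserve, the caller needs read access (assumed -EACCES). */
		return -EACCES;
	}

	return 0;
}
```

Note that in this model, as in the patch, a caller asking for
REFLINK_ATTR_PRESERVE is never checked for read access; ownership or
CAP_CHOWN is what gates the snapshot case.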
--
"Anything that is too stupid to be spoken is sung."
- Voltaire
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127