* [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
From: David Drysdale @ 2014-06-30 10:28 UTC
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Hi all,

The last couple of versions of FreeBSD (9.x/10.x) have included the
Capsicum security framework [1], which allows security-aware
applications to sandbox themselves in a very fine-grained way.  For
example, OpenSSH now (>= 6.5) uses Capsicum in its FreeBSD version to
restrict sshd's credentials checking process, to reduce the chances of
credential leakage.

It would be good to have equivalent functionality in Linux, so I've been
working on getting the Capsicum framework running in the kernel, and I'd
appreciate some feedback/opinions on the general design approach.

I'm attaching a corresponding draft patchset for reference, but
hopefully this cover email describes the significant features well
enough to save everyone having to look through the code details.  (It
does mean this is a long email though -- apologies for that.)


1) Capsicum Capabilities
------------------------

The most significant aspect of Capsicum is associating *rights* with
(some) file descriptors, so that the kernel only allows operations on an
FD if the rights permit it.  This allows userspace applications to
sandbox themselves by tightly constraining what's allowed with both
inputs and outputs; for example, tcpdump might restrict itself so it can
only read from the network FD, and only write to stdout.

  [Capsicum also includes 'capability mode', which locks down the
  available syscalls so the rights restrictions can't just be bypassed
  by opening new file descriptors; I'll describe that separately later.]

The kernel thus needs to police the rights checks for these file
descriptors (referred to as 'Capsicum capabilities', completely
different than POSIX.1e capabilities), and the best place to do this is
at the points where a file descriptor from userspace is converted to a
struct file * within the kernel.

  [Policing the rights checks anywhere else, for example at the system
  call boundary, isn't a good idea because it opens up the possibility
  of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
  changed (as openat/close/dup2 are allowed in capability mode) between
  the 'check' at syscall entry and the 'use' at fget() invocation.]

However, this does lead to quite an invasive change to the kernel --
every invocation of fget() or similar functions (fdget(),
sockfd_lookup(), user_path_at(),...) needs to be annotated with the
rights associated with the specific operations that will be performed on
the struct file.  There are ~100 such invocations that need annotation.

My current implementation approach is to use varargs variants of the
fget() functions that include the required rights, varargs-macroed so
that the only impact in a non-Capsicum build is the need to cope with an
ERR_PTR on failure rather than just NULL:

  #ifdef CONFIG_SECURITY_CAPSICUM
  #define fgetr(fd, ...)	_fgetr((fd), __VA_ARGS__, CAP_LIST_END)
  /* + Other variants... */
  #else
  #define fgetr(fd, ...)	(fget(fd) ?: ERR_PTR(-EBADF))
  /* + Other variants... */
  #endif

For example, an existing chunk of code like:

  SYSCALL_DEFINE1(fchdir, unsigned int, fd)
  {
  	struct fd f = fdget_raw(fd);
  	struct inode *inode;
  	int error = -EBADF;

  	error = -EBADF;
  	if (!f.file)
  		goto out;
  ...

might become:

  SYSCALL_DEFINE1(fchdir, unsigned int, fd)
  {
  	struct fd f = fdgetr_raw(fd, CAP_FCHDIR);
  	struct inode *inode;
  	int error = -EBADF;

  	if (IS_ERR(f.file)) {
  		error = PTR_ERR(f.file);
  		goto out;
  	}
  ...

In a Capsicum build the fdgetr_raw() function performs rights checks
(and potentially returns a new errno as ERR_PTR(-ENOTCAPABLE)), whereas
in a non-Capsicum build the only change is that fdgetr_raw() returns
ERR_PTR(-EBADF) where fdget_raw() would have returned just NULL.
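
As a rough userspace illustration of the varargs-plus-terminator pattern
described above (this is not the patchset's code: struct model_file,
model_fgetr(), the MODEL_* right values and MODEL_LIST_END are all names
invented for this sketch), the macro appends a terminator so that the
callee knows where the list of required rights ends:

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rights bits and list terminator, for illustration only. */
#define MODEL_READ	(UINT64_C(1) << 0)
#define MODEL_WRITE	(UINT64_C(1) << 1)
#define MODEL_FCHDIR	(UINT64_C(1) << 2)
#define MODEL_LIST_END	UINT64_C(0)

/* Stand-in for a file object carrying the rights it has been limited to. */
struct model_file {
	uint64_t rights;
};

/* Walk the terminated list; return f only if every listed right is held. */
static struct model_file *_model_fgetr(struct model_file *f, ...)
{
	va_list ap;
	uint64_t need;
	struct model_file *result = f;

	va_start(ap, f);
	while ((need = va_arg(ap, uint64_t)) != MODEL_LIST_END) {
		if ((f->rights & need) != need)
			result = NULL;	/* the patchset reports an ERR_PTR here */
	}
	va_end(ap);
	return result;
}

/* The macro hides the terminator from callers, in the style of fgetr(). */
#define model_fgetr(f, ...) _model_fgetr((f), __VA_ARGS__, MODEL_LIST_END)
```

With a file limited to MODEL_READ|MODEL_FCHDIR, model_fgetr(&f, MODEL_READ,
MODEL_FCHDIR) hands back &f, while model_fgetr(&f, MODEL_WRITE) yields
NULL.  The sketch returns NULL only to stay self-contained; it says
nothing about how the real _fgetr() is implemented.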


2) Capsicum Capabilities Data Structure
---------------------------------------

Internally, the rights associated with a Capsicum capability FD are
stored in a special struct file wrapper.  For a normal file, the rights
check inside fget() falls through, but for a capability wrapper the
rights in the wrapper are checked and (if capable) the underlying
wrapped struct file is returned.

  [This is approximately the implementation that was present in FreeBSD
  9.x.  For FreeBSD 10.x, the wrapper file was removed and the rights
  associated with a file descriptor are now stored in the fdtable. As
  that impacts memory use for all processes, whether Capsicum users or
  not, I've stuck with the FreeBSD 9.x approach.]
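
A minimal userspace model of the wrapper idea described in this section
might look like the following.  It is only a sketch under assumptions:
struct wrap_file and wrap_lookup() are invented names, the layout is not
the patchset's, and EPERM stands in for ENOTCAPABLE because userspace
headers do not define the latter.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: a file is either ordinary or a wrapper around one. */
struct wrap_file {
	struct wrap_file *wrapped;	/* non-NULL only for a capability wrapper */
	uint64_t rights;		/* meaningful only when wrapped != NULL */
};

/*
 * Return the file to operate on, or NULL with *err set to an errno value.
 * 'need' is the set of rights the caller's operation requires.
 */
static struct wrap_file *wrap_lookup(struct wrap_file *f, uint64_t need,
				     int *err)
{
	*err = 0;
	if (f == NULL) {
		*err = EBADF;
		return NULL;
	}
	if (f->wrapped == NULL)
		return f;		/* normal file: the check falls through */
	if ((f->rights & need) != need) {
		*err = EPERM;		/* stand-in for ENOTCAPABLE */
		return NULL;
	}
	return f->wrapped;		/* capable: hand back the underlying file */
}
```

In this model a lookup on an ordinary file always succeeds, whereas a
lookup through a wrapper succeeds only when the wrapper's rights cover
everything the operation needs, and then returns the wrapped file rather
than the wrapper itself.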


3) New LSM Hooks
----------------

To actually perform the checking and unwrapping, I've added a couple of
new LSM hooks:
 - .file_lookup(), which allows modification of the result of fget().
 - .file_install(), which allows for the wrapping of a newly-created file
   when that file was created from a Capsicum capability (e.g. via
   openat(2) or accept(2)).

However, I'm not sure that adding the functionality via new LSM hooks is
appropriate, because I don't think Capsicum should be a fully-fledged
LSM:
 - Capsicum doesn't use any of the existing LSM hooks, so (say) AppArmor
   and Capsicum use a disjoint set of hooks.
 - Capsicum needs to co-exist with the existing LSMs, and given the
   current disjoint use, can do so without revisiting the general
   problem of LSM stacking.

Of course, if in future an LSM wanted to use one of these new hooks,
it would have to deal with Capsicum being the "fallback" implementation
of the hook -- i.e. the stacking/interaction problem would show up
again.  So maybe it would be better to avoid the LSM infrastructure
altogether?


4) New System Calls
-------------------

To allow userspace applications to access the Capsicum capability
functionality, I'm proposing two new system calls: cap_rights_limit(2)
and cap_rights_get(2).  I guess these could potentially be implemented
elsewhere (e.g. as fcntl(2) operations?) but the changes seem
significant enough that new syscalls are warranted.

  [FreeBSD 10.x actually includes six new syscalls for manipulating the
  rights associated with a Capsicum capability -- the capability rights
  can police that only specific fcntl(2) or ioctl(2) commands are
  allowed, and FreeBSD sets these with distinct syscalls.]


5) New openat(2) O_BENEATH_ONLY Flag
------------------------------------

For Capsicum capabilities that are directory file descriptors, the
Capsicum framework only allows openat(cap_dfd, path, ...) operations to
work for files that are beneath the specified directory (and even that
only when the directory FD has the CAP_LOOKUP right), rejecting paths
that start with "/" or include "..".

As this seemed like functionality that might be more generally useful,
I've implemented it independently as a new O_BENEATH_ONLY flag for
openat(2).  The Capsicum code then always triggers the use of that flag
when the dfd is a Capsicum capability.
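
To make the path rule concrete, here is a purely lexical userspace
approximation of it.  This is a sketch, not the kernel implementation:
path_is_beneath() is a name invented here, and because it only inspects
the string it cannot catch a symlink whose target escapes the directory,
which the kernel-side check is intended to reject as well.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Reject absolute paths and any ".." component.  Names that merely
 * contain dots (e.g. "a..b" or "..c") are ordinary components and pass.
 */
static bool path_is_beneath(const char *path)
{
	if (path[0] == '/')
		return false;
	while (*path) {
		size_t len = strcspn(path, "/");	/* length of this component */

		if (len == 2 && path[0] == '.' && path[1] == '.')
			return false;
		path += len;
		while (*path == '/')			/* skip separator(s) */
			path++;
	}
	return true;
}
```

Under this rule "subdir/bottomfile" is accepted, while "/etc/passwd",
"../topfile" and "subdir/../topfile" are all refused.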


6) Patchset Notes
-----------------

I've appended the draft patchset (against v3.15) for the implementation
of Capsicum capabilities, in case anyone wants to dive into the details.

However, I should point out that it might include some code that hasn't
been compiled -- I attempted to change every fget() invocation I could
find, even if it was for a build that I can't perform (but I have built
allyesconfig on x86 & ARM).


Regards,

David Drysdale


[1] http://www.cl.cam.ac.uk/research/security/capsicum/papers/2010usenix-security-capsicum-website.pdf
[2] http://www.watson.org/~robert/2007woot/


David Drysdale (11):
  fs: add O_BENEATH_ONLY flag to openat(2)
  selftests: Add test of O_BENEATH_ONLY & openat(2)
  capsicum: rights values and structure definitions
  capsicum: implement fgetr() and friends
  capsicum: convert callers to use fgetr() etc
  capsicum: implement sockfd_lookupr()
  capsicum: convert callers to use sockfd_lookupr() etc
  capsicum: add new LSM hooks on FD/file conversion
  capsicum: implementations of new LSM hooks
  capsicum: invocation of new LSM hooks
  capsicum: add syscalls to limit FD rights

 Documentation/security/capsicum.txt             | 102 ++++++
 arch/alpha/include/uapi/asm/fcntl.h             |   1 +
 arch/alpha/kernel/osf_sys.c                     |   6 +-
 arch/ia64/kernel/perfmon.c                      |  54 ++--
 arch/parisc/hpux/fs.c                           |   6 +-
 arch/parisc/include/uapi/asm/fcntl.h            |   1 +
 arch/powerpc/kvm/powerpc.c                      |   4 +-
 arch/powerpc/platforms/cell/spu_syscalls.c      |  15 +-
 arch/powerpc/platforms/cell/spufs/coredump.c    |   2 +
 arch/sparc/include/uapi/asm/fcntl.h             |   1 +
 arch/x86/syscalls/syscall_64.tbl                |   2 +
 drivers/base/dma-buf.c                          |   6 +-
 drivers/block/loop.c                            |  14 +-
 drivers/block/nbd.c                             |   5 +-
 drivers/infiniband/core/ucma.c                  |   4 +-
 drivers/infiniband/core/uverbs_cmd.c            |   6 +-
 drivers/infiniband/core/uverbs_main.c           |   4 +-
 drivers/infiniband/hw/usnic/usnic_transport.c   |   2 +-
 drivers/md/md.c                                 |   8 +-
 drivers/scsi/iscsi_tcp.c                        |   2 +-
 drivers/staging/android/sync.c                  |   2 +-
 drivers/staging/lustre/lustre/llite/file.c      |   6 +-
 drivers/staging/lustre/lustre/lmv/lmv_obd.c     |   7 +-
 drivers/staging/lustre/lustre/mdc/lproc_mdc.c   |   8 +-
 drivers/staging/lustre/lustre/mdc/mdc_request.c |   4 +-
 drivers/staging/usbip/stub_dev.c                |   2 +-
 drivers/staging/usbip/vhci_sysfs.c              |   2 +-
 drivers/vfio/pci/vfio_pci.c                     |   6 +-
 drivers/vfio/pci/vfio_pci_intrs.c               |   6 +-
 drivers/vfio/vfio.c                             |   6 +-
 drivers/vhost/net.c                             |   8 +-
 drivers/video/fbdev/msm/mdp.c                   |   4 +-
 fs/aio.c                                        |  37 ++-
 fs/autofs4/dev-ioctl.c                          |  16 +-
 fs/autofs4/inode.c                              |   4 +-
 fs/btrfs/ioctl.c                                |  20 +-
 fs/btrfs/send.c                                 |   7 +-
 fs/cifs/ioctl.c                                 |   6 +-
 fs/coda/inode.c                                 |   4 +-
 fs/coda/psdev.c                                 |   2 +-
 fs/compat.c                                     |  18 +-
 fs/compat_ioctl.c                               |  14 +-
 fs/eventfd.c                                    |  17 +-
 fs/eventpoll.c                                  |  19 +-
 fs/ext4/ioctl.c                                 |   6 +-
 fs/fcntl.c                                      | 106 ++++++-
 fs/fhandle.c                                    |   6 +-
 fs/file.c                                       | 130 ++++++++
 fs/fuse/inode.c                                 |  10 +-
 fs/ioctl.c                                      |  13 +-
 fs/locks.c                                      |  10 +-
 fs/namei.c                                      | 307 ++++++++++++++----
 fs/ncpfs/inode.c                                |   5 +-
 fs/notify/dnotify/dnotify.c                     |   2 +
 fs/notify/fanotify/fanotify_user.c              |  16 +-
 fs/notify/inotify/inotify_user.c                |  12 +-
 fs/ocfs2/cluster/heartbeat.c                    |   8 +-
 fs/open.c                                       |  46 +--
 fs/proc/fd.c                                    |  16 +-
 fs/proc/namespaces.c                            |   6 +-
 fs/read_write.c                                 | 113 ++++---
 fs/readdir.c                                    |  18 +-
 fs/select.c                                     |  11 +-
 fs/signalfd.c                                   |   6 +-
 fs/splice.c                                     |  34 +-
 fs/stat.c                                       |  10 +-
 fs/statfs.c                                     |   8 +-
 fs/sync.c                                       |  21 +-
 fs/timerfd.c                                    |  40 ++-
 fs/utimes.c                                     |  10 +-
 fs/xattr.c                                      |  26 +-
 fs/xfs/xfs_ioctl.c                              |  14 +-
 include/linux/capsicum.h                        |  57 ++++
 include/linux/file.h                            | 136 ++++++++
 include/linux/namei.h                           |  10 +
 include/linux/net.h                             |  16 +
 include/linux/security.h                        |  48 +++
 include/linux/syscalls.h                        |  12 +
 include/uapi/asm-generic/errno.h                |   3 +
 include/uapi/asm-generic/fcntl.h                |   4 +
 include/uapi/linux/Kbuild                       |   1 +
 include/uapi/linux/capsicum.h                   | 343 ++++++++++++++++++++
 ipc/mqueue.c                                    |  30 +-
 kernel/events/core.c                            |  14 +-
 kernel/module.c                                 |  10 +-
 kernel/sys.c                                    |   6 +-
 kernel/sys_ni.c                                 |   4 +
 kernel/taskstats.c                              |   4 +-
 kernel/time/posix-clock.c                       |  27 +-
 mm/fadvise.c                                    |   7 +-
 mm/internal.h                                   |  19 ++
 mm/memcontrol.c                                 |  12 +-
 mm/mmap.c                                       |   7 +-
 mm/nommu.c                                      |   9 +-
 mm/readahead.c                                  |   6 +-
 net/9p/trans_fd.c                               |  10 +-
 net/bluetooth/bnep/sock.c                       |   2 +-
 net/bluetooth/cmtp/sock.c                       |   2 +-
 net/bluetooth/hidp/sock.c                       |   4 +-
 net/compat.c                                    |   4 +-
 net/l2tp/l2tp_core.c                            |  11 +-
 net/l2tp/l2tp_core.h                            |   2 +
 net/sched/sch_atm.c                             |   2 +-
 net/socket.c                                    | 207 +++++++++---
 net/sunrpc/svcsock.c                            |   4 +-
 security/Kconfig                                |  15 +
 security/Makefile                               |   2 +-
 security/capability.c                           |  17 +-
 security/capsicum-rights.c                      | 201 ++++++++++++
 security/capsicum-rights.h                      |  10 +
 security/capsicum.c                             | 403 ++++++++++++++++++++++++
 security/security.c                             |  13 +
 sound/core/pcm_native.c                         |  10 +-
 tools/testing/selftests/openat/.gitignore       |   3 +
 tools/testing/selftests/openat/Makefile         |  24 ++
 tools/testing/selftests/openat/openat.c         | 146 +++++++++
 virt/kvm/eventfd.c                              |   6 +-
 virt/kvm/vfio.c                                 |  12 +-
 118 files changed, 2840 insertions(+), 535 deletions(-)
 create mode 100644 Documentation/security/capsicum.txt
 create mode 100644 include/linux/capsicum.h
 create mode 100644 include/uapi/linux/capsicum.h
 create mode 100644 security/capsicum-rights.c
 create mode 100644 security/capsicum-rights.h
 create mode 100644 security/capsicum.c
 create mode 100644 tools/testing/selftests/openat/.gitignore
 create mode 100644 tools/testing/selftests/openat/Makefile
 create mode 100644 tools/testing/selftests/openat/openat.c

--
2.0.0.526.g5318336



* [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
From: David Drysdale @ 2014-06-30 10:28 UTC
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Add a new O_BENEATH_ONLY flag for openat(2) which restricts the
provided path, rejecting (with -EACCES) paths that are not beneath
the provided dfd.  In particular, reject:
 - paths that contain .. components
 - paths that begin with /
 - symlinks that have paths as above.

Signed-off-by: David Drysdale <drysdale@google.com>
---
 arch/alpha/include/uapi/asm/fcntl.h  |  1 +
 arch/parisc/include/uapi/asm/fcntl.h |  1 +
 arch/sparc/include/uapi/asm/fcntl.h  |  1 +
 fs/fcntl.c                           |  5 +++--
 fs/namei.c                           | 43 ++++++++++++++++++++++++------------
 fs/open.c                            |  4 +++-
 include/linux/namei.h                |  1 +
 include/uapi/asm-generic/fcntl.h     |  4 ++++
 8 files changed, 43 insertions(+), 17 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
index 09f49a6b87d1..b3e0b00ff9ed 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -33,6 +33,7 @@
 
 #define O_PATH		040000000
 #define __O_TMPFILE	0100000000
+#define O_BENEATH_ONLY	0200000000	/* no / or .. in openat path */
 
 #define F_GETLK		7
 #define F_SETLK		8
diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
index 34a46cbc76ed..da4447775f87 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -21,6 +21,7 @@
 
 #define O_PATH		020000000
 #define __O_TMPFILE	040000000
+#define O_BENEATH_ONLY	0100000000	/* no / or .. in openat path */
 
 #define F_GETLK64	8
 #define F_SETLK64	9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index 7e8ace5bf760..9f2635197cf0 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -36,6 +36,7 @@
 
 #define O_PATH		0x1000000
 #define __O_TMPFILE	0x2000000
+#define O_BENEATH_ONLY	0x4000000	/* no / or .. in openat path */
 
 #define F_GETOWN	5	/*  for sockets. */
 #define F_SETOWN	6	/*  for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 72c82f69b01b..79f9b09fa46b 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -742,14 +742,15 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
+	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
 		O_RDONLY	| O_WRONLY	| O_RDWR	|
 		O_CREAT		| O_EXCL	| O_NOCTTY	|
 		O_TRUNC		| O_APPEND	| /* O_NONBLOCK	| */
 		__O_SYNC	| O_DSYNC	| FASYNC	|
 		O_DIRECT	| O_LARGEFILE	| O_DIRECTORY	|
 		O_NOFOLLOW	| O_NOATIME	| O_CLOEXEC	|
-		__FMODE_EXEC	| O_PATH	| __O_TMPFILE
+		__FMODE_EXEC	| O_PATH	| __O_TMPFILE	|
+		O_BENEATH_ONLY
 		));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
diff --git a/fs/namei.c b/fs/namei.c
index 80168273396b..e6b72531dfc7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -646,7 +646,7 @@ static __always_inline void set_root(struct nameidata *nd)
 		get_fs_root(current->fs, &nd->root);
 }
 
-static int link_path_walk(const char *, struct nameidata *);
+static int link_path_walk(const char *, struct nameidata *, unsigned int);
 
 static __always_inline void set_root_rcu(struct nameidata *nd)
 {
@@ -819,7 +819,8 @@ static int may_linkat(struct path *link)
 }
 
 static __always_inline int
-follow_link(struct path *link, struct nameidata *nd, void **p)
+follow_link(struct path *link, struct nameidata *nd, unsigned int flags,
+	    void **p)
 {
 	struct dentry *dentry = link->dentry;
 	int error;
@@ -866,7 +867,7 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
 			nd->flags |= LOOKUP_JUMPED;
 		}
 		nd->inode = nd->path.dentry->d_inode;
-		error = link_path_walk(s, nd);
+		error = link_path_walk(s, nd, flags);
 		if (unlikely(error))
 			put_link(nd, link, *p);
 	}
@@ -1573,7 +1574,8 @@ out_err:
  * Without that kind of total limit, nasty chains of consecutive
  * symlinks can cause almost arbitrarily long lookups.
  */
-static inline int nested_symlink(struct path *path, struct nameidata *nd)
+static inline int nested_symlink(struct path *path, struct nameidata *nd,
+				 unsigned int flags)
 {
 	int res;
 
@@ -1591,7 +1593,7 @@ static inline int nested_symlink(struct path *path, struct nameidata *nd)
 		struct path link = *path;
 		void *cookie;
 
-		res = follow_link(&link, nd, &cookie);
+		res = follow_link(&link, nd, flags, &cookie);
 		if (res)
 			break;
 		res = walk_component(nd, path, LOOKUP_FOLLOW);
@@ -1730,13 +1732,19 @@ static inline unsigned long hash_name(const char *name, unsigned int *hashp)
  * Returns 0 and nd will have valid dentry and mnt on success.
  * Returns error and drops reference to input namei data on failure.
  */
-static int link_path_walk(const char *name, struct nameidata *nd)
+static int link_path_walk(const char *name, struct nameidata *nd,
+			  unsigned int flags)
 {
 	struct path next;
 	int err;
 	
-	while (*name=='/')
+	while (*name == '/') {
+		if (flags & LOOKUP_BENEATH_ONLY) {
+			err = -EACCES;
+			goto exit;
+		}
 		name++;
+	}
 	if (!*name)
 		return 0;
 
@@ -1758,6 +1766,10 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 		if (name[0] == '.') switch (len) {
 			case 2:
 				if (name[1] == '.') {
+					if (flags & LOOKUP_BENEATH_ONLY) {
+						err = -EACCES;
+						goto exit;
+					}
 					type = LAST_DOTDOT;
 					nd->flags |= LOOKUP_JUMPED;
 				}
@@ -1797,7 +1809,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 			return err;
 
 		if (err) {
-			err = nested_symlink(&next, nd);
+			err = nested_symlink(&next, nd, flags);
 			if (err)
 				return err;
 		}
@@ -1806,6 +1818,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 			break;
 		}
 	}
+exit:
 	terminate_walk(nd);
 	return err;
 }
@@ -1844,6 +1857,8 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 
 	nd->m_seq = read_seqbegin(&mount_lock);
 	if (*name=='/') {
+		if (flags & LOOKUP_BENEATH_ONLY)
+			return -EACCES;
 		if (flags & LOOKUP_RCU) {
 			rcu_read_lock();
 			set_root_rcu(nd);
@@ -1937,7 +1952,7 @@ static int path_lookupat(int dfd, const char *name,
 		return err;
 
 	current->total_link_count = 0;
-	err = link_path_walk(name, nd);
+	err = link_path_walk(name, nd, flags);
 
 	if (!err && !(flags & LOOKUP_PARENT)) {
 		err = lookup_last(nd, &path);
@@ -1948,7 +1963,7 @@ static int path_lookupat(int dfd, const char *name,
 			if (unlikely(err))
 				break;
 			nd->flags |= LOOKUP_PARENT;
-			err = follow_link(&link, nd, &cookie);
+			err = follow_link(&link, nd, flags, &cookie);
 			if (err)
 				break;
 			err = lookup_last(nd, &path);
@@ -2287,7 +2302,7 @@ path_mountpoint(int dfd, const char *name, struct path *path, unsigned int flags
 		return err;
 
 	current->total_link_count = 0;
-	err = link_path_walk(name, &nd);
+	err = link_path_walk(name, &nd, flags);
 	if (err)
 		goto out;
 
@@ -2299,7 +2314,7 @@ path_mountpoint(int dfd, const char *name, struct path *path, unsigned int flags
 		if (unlikely(err))
 			break;
 		nd.flags |= LOOKUP_PARENT;
-		err = follow_link(&link, &nd, &cookie);
+		err = follow_link(&link, &nd, flags, &cookie);
 		if (err)
 			break;
 		err = mountpoint_last(&nd, path);
@@ -3185,7 +3200,7 @@ static struct file *path_openat(int dfd, struct filename *pathname,
 		goto out;
 
 	current->total_link_count = 0;
-	error = link_path_walk(pathname->name, nd);
+	error = link_path_walk(pathname->name, nd, flags);
 	if (unlikely(error))
 		goto out;
 
@@ -3204,7 +3219,7 @@ static struct file *path_openat(int dfd, struct filename *pathname,
 			break;
 		nd->flags |= LOOKUP_PARENT;
 		nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
-		error = follow_link(&link, nd, &cookie);
+		error = follow_link(&link, nd, flags, &cookie);
 		if (unlikely(error))
 			break;
 		error = do_last(nd, &path, file, op, &opened, pathname);
diff --git a/fs/open.c b/fs/open.c
index 9d64679cec73..f26c492f3698 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -869,7 +869,7 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		 * If we have O_PATH in the open flag. Then we
 		 * cannot have anything other than the below set of flags
 		 */
-		flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
+		flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH | O_BENEATH_ONLY;
 		acc_mode = 0;
 	} else {
 		acc_mode = MAY_OPEN | ACC_MODE(flags);
@@ -900,6 +900,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		lookup_flags |= LOOKUP_DIRECTORY;
 	if (!(flags & O_NOFOLLOW))
 		lookup_flags |= LOOKUP_FOLLOW;
+	if (flags & O_BENEATH_ONLY)
+		lookup_flags |= LOOKUP_BENEATH_ONLY;
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 492de72560fa..cd56c50109fc 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -39,6 +39,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_FOLLOW		0x0001
 #define LOOKUP_DIRECTORY	0x0002
 #define LOOKUP_AUTOMOUNT	0x0004
+#define LOOKUP_BENEATH_ONLY	0x0008
 
 #define LOOKUP_PARENT		0x0010
 #define LOOKUP_REVAL		0x0020
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 7543b3e51331..e662821c4bc2 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -92,6 +92,10 @@
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
 
+#ifndef O_BENEATH_ONLY
+#define O_BENEATH_ONLY	040000000	/* no / or .. in openat path */
+#endif
+
 #ifndef O_NDELAY
 #define O_NDELAY	O_NONBLOCK
 #endif
-- 
2.0.0.526.g5318336



* [PATCH 02/11] selftests: Add test of O_BENEATH_ONLY & openat(2)
From: David Drysdale @ 2014-06-30 10:28 UTC
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Add simple tests of openat(2) variations, including examples that
check the new O_BENEATH_ONLY flag.

Signed-off-by: David Drysdale <drysdale@google.com>
---
 tools/testing/selftests/openat/.gitignore |   3 +
 tools/testing/selftests/openat/Makefile   |  24 +++++
 tools/testing/selftests/openat/openat.c   | 146 ++++++++++++++++++++++++++++++
 3 files changed, 173 insertions(+)
 create mode 100644 tools/testing/selftests/openat/.gitignore
 create mode 100644 tools/testing/selftests/openat/Makefile
 create mode 100644 tools/testing/selftests/openat/openat.c

diff --git a/tools/testing/selftests/openat/.gitignore b/tools/testing/selftests/openat/.gitignore
new file mode 100644
index 000000000000..0a2446e89ad5
--- /dev/null
+++ b/tools/testing/selftests/openat/.gitignore
@@ -0,0 +1,3 @@
+openat
+subdir
+topfile
\ No newline at end of file
diff --git a/tools/testing/selftests/openat/Makefile b/tools/testing/selftests/openat/Makefile
new file mode 100644
index 000000000000..dc28ce943edf
--- /dev/null
+++ b/tools/testing/selftests/openat/Makefile
@@ -0,0 +1,24 @@
+CC = $(CROSS_COMPILE)gcc
+CFLAGS = -Wall
+BINARIES = openat
+DEPS = subdir topfile subdir/bottomfile subdir/symlinkup subdir/symlinkout
+all: $(BINARIES) $(DEPS)
+
+subdir:
+	mkdir -p subdir
+topfile:
+	echo 0123456789 > $@
+subdir/bottomfile: | subdir
+	echo 0123456789 > $@
+subdir/symlinkup: | subdir
+	ln -s ../topfile $@
+subdir/symlinkout: | subdir
+	ln -s /etc/passwd $@
+%: %.c
+	$(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+	./openat
+
+clean:
+	rm -rf $(BINARIES) $(DEPS)
diff --git a/tools/testing/selftests/openat/openat.c b/tools/testing/selftests/openat/openat.c
new file mode 100644
index 000000000000..6171af6001c7
--- /dev/null
+++ b/tools/testing/selftests/openat/openat.c
@@ -0,0 +1,146 @@
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+
+#include <linux/fcntl.h>
+
+/* Bypass glibc */
+static int openat_(int dirfd, const char *pathname, int flags)
+{
+	return syscall(__NR_openat, dirfd, pathname, flags);
+}
+
+static int openat_or_die(int dfd, const char *path, int flags)
+{
+	int fd = openat_(dfd, path, flags);
+	if (fd < 0) {
+		printf("Failed to openat(%d, '%s'); "
+			"check prerequisites are available\n", dfd, path);
+		exit(1);
+	}
+	return fd;
+}
+
+static int check_openat(int dfd, const char *path, int flags)
+{
+	int rc;
+	int fd;
+	char buffer[4];
+
+	errno = 0;
+	printf("Check success of openat(%d, '%s', %x)... ",
+	       dfd, path?:"(null)", flags);
+	fd = openat_(dfd, path, flags);
+	if (fd < 0) {
+		printf("[FAIL]: openat() failed, rc=%d errno=%d (%s)\n",
+			fd, errno, strerror(errno));
+		return 1;
+	}
+	errno = 0;
+	rc = read(fd, buffer, sizeof(buffer));
+	if (rc < 0) {
+		printf("[FAIL]: read() failed, rc=%d errno=%d (%s)\n",
+			rc, errno, strerror(errno));
+		return 1;
+	}
+	close(fd);
+	printf("[OK]\n");
+	return 0;
+}
+
+#define check_openat_fail(dfd, path, flags, errno)	\
+	_check_openat_fail(dfd, path, flags, errno, #errno)
+static int _check_openat_fail(int dfd, const char *path, int flags,
+			      int expected_errno, const char *errno_str)
+{
+	errno = 0;
+	printf("Check failure of openat(%d, '%s', %x) with %s... ",
+		dfd, path?:"(null)", flags, errno_str);
+	int rc = openat_(dfd, path, flags);
+	if (rc >= 0) {
+		printf("[FAIL] (unexpected success from openat(2))\n");
+		close(rc);
+		return 1;
+	}
+	if (errno != expected_errno) {
+		printf("[FAIL] (expected errno %d (%s) not %d (%s))\n",
+			expected_errno, strerror(expected_errno),
+			errno, strerror(errno));
+		return 1;
+	}
+	printf("[OK]\n");
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	int fail = 0;
+	int dot_dfd = openat_or_die(AT_FDCWD, ".", O_RDONLY);
+	int subdir_dfd = openat_or_die(AT_FDCWD, "subdir", O_RDONLY);
+	int file_fd = openat_or_die(AT_FDCWD, "topfile", O_RDONLY);
+
+	/* Sanity check normal behavior */
+	fail |= check_openat(AT_FDCWD, "topfile", O_RDONLY);
+	fail |= check_openat(AT_FDCWD, "subdir/bottomfile", O_RDONLY);
+
+	fail |= check_openat(dot_dfd, "topfile", O_RDONLY);
+	fail |= check_openat(dot_dfd, "subdir/bottomfile", O_RDONLY);
+	fail |= check_openat(dot_dfd, "subdir/../topfile", O_RDONLY);
+
+	fail |= check_openat(subdir_dfd, "../topfile", O_RDONLY);
+	fail |= check_openat(subdir_dfd, "bottomfile", O_RDONLY);
+	fail |= check_openat(subdir_dfd, "../subdir/bottomfile", O_RDONLY);
+	fail |= check_openat(subdir_dfd, "symlinkup", O_RDONLY);
+	fail |= check_openat(subdir_dfd, "symlinkout", O_RDONLY);
+
+	fail |= check_openat(AT_FDCWD, "/etc/passwd", O_RDONLY);
+	fail |= check_openat(dot_dfd, "/etc/passwd", O_RDONLY);
+	fail |= check_openat(subdir_dfd, "/etc/passwd", O_RDONLY);
+
+	fail |= check_openat_fail(AT_FDCWD, "bogus", O_RDONLY, ENOENT);
+	fail |= check_openat_fail(dot_dfd, "bogus", O_RDONLY, ENOENT);
+	fail |= check_openat_fail(999, "bogus", O_RDONLY, EBADF);
+	fail |= check_openat_fail(file_fd, "bogus", O_RDONLY, ENOTDIR);
+
+#ifdef O_BENEATH_ONLY
+	/* Test out O_BENEATH_ONLY */
+	fail |= check_openat(AT_FDCWD, "topfile", O_RDONLY|O_BENEATH_ONLY);
+	fail |= check_openat(AT_FDCWD, "subdir/bottomfile",
+			     O_RDONLY|O_BENEATH_ONLY);
+
+	fail |= check_openat(dot_dfd, "topfile", O_RDONLY|O_BENEATH_ONLY);
+	fail |= check_openat(dot_dfd, "subdir/bottomfile",
+			     O_RDONLY|O_BENEATH_ONLY);
+	fail |= check_openat(subdir_dfd, "bottomfile", O_RDONLY|O_BENEATH_ONLY);
+
+	/* Can't open paths with ".." in them */
+	fail |= check_openat_fail(dot_dfd, "subdir/../topfile",
+				O_RDONLY|O_BENEATH_ONLY, EACCES);
+	fail |= check_openat_fail(subdir_dfd, "../topfile",
+				  O_RDONLY|O_BENEATH_ONLY, EACCES);
+	fail |= check_openat_fail(subdir_dfd, "../subdir/bottomfile",
+				O_RDONLY|O_BENEATH_ONLY, EACCES);
+
+	/* Can't open paths starting with "/" */
+	fail |= check_openat_fail(AT_FDCWD, "/etc/passwd",
+				  O_RDONLY|O_BENEATH_ONLY, EACCES);
+	fail |= check_openat_fail(dot_dfd, "/etc/passwd",
+				  O_RDONLY|O_BENEATH_ONLY, EACCES);
+	fail |= check_openat_fail(subdir_dfd, "/etc/passwd",
+				  O_RDONLY|O_BENEATH_ONLY, EACCES);
+	/* Can't sneak around constraints with symlinks */
+	fail |= check_openat_fail(subdir_dfd, "symlinkup",
+				  O_RDONLY|O_BENEATH_ONLY, EACCES);
+	fail |= check_openat_fail(subdir_dfd, "symlinkout",
+				  O_RDONLY|O_BENEATH_ONLY, EACCES);
+#else
+	printf("Skipping O_BENEATH_ONLY tests due to missing #define\n");
+#endif
+
+	return fail ? -1 : 0;
+}
-- 
2.0.0.526.g5318336
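
As a rough illustration of the path rule that the O_BENEATH_ONLY cases
above exercise, here is a small userspace sketch that applies the rule to
the path string alone: reject a leading "/" and any ".." component.  The
name path_is_beneath() is made up for this note and is not part of the
patchset.  A string check like this cannot see symlinks, so it says
nothing about the symlinkup/symlinkout cases, which can only be caught
during the real path walk.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patchset): a purely lexical
 * approximation of the O_BENEATH_ONLY rule.  Returns false for an
 * absolute path or for any path containing a ".." component.
 */
static bool path_is_beneath(const char *path)
{
	const char *p = path;

	if (path[0] == '/')
		return false;		/* absolute path */

	while (*p) {
		/* Length of the current component, up to the next '/'. */
		size_t len = strcspn(p, "/");

		if (len == 2 && p[0] == '.' && p[1] == '.')
			return false;	/* ".." component */

		p += len;
		while (*p == '/')	/* skip one or more separators */
			p++;
	}
	return true;
}
```

With the test's paths, path_is_beneath("subdir/bottomfile") is true, while
"subdir/../topfile", "../topfile" and "/etc/passwd" are all rejected.  A
name that merely starts with two dots, such as "subdir/..hidden", is
still accepted because only a component that is exactly ".." counts.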


* [PATCH 03/11] capsicum: rights values and structure definitions
@ 2014-06-30 10:28   ` David Drysdale
  0 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Define (in include/uapi/linux/capsicum.h) values for primary
rights associated with Capsicum capability file descriptors.

Also define the structure that primary rights reside in (struct
cap_rights), and the complete compound rights structure (struct
capsicum_rights).

 - Primary rights describe the main operations that can be
   performed on a file.
 - Secondary rights allow for specific fcntl() and ioctl()
   operations to be policed.

Add functions to manipulate these rights structures.

This change is adapted from the FreeBSD 10.x implementation of
Capsicum, with the aim of preserving compatibility between the
two implementations as closely as possible.

Signed-off-by: David Drysdale <drysdale@google.com>
---
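To make the encoding of primary rights a little more concrete, here is a
userspace sketch of how a right value carries its own array index.  The
CAPRIGHT() and CAPIDXBIT() macros and the three CAP_* values are copied
from include/uapi/linux/capsicum.h in this patch; everything else
(struct rights_sketch, right_to_index(), rights_clear(), rights_add(),
rights_has()) is a hypothetical stand-in written for this note, and is
an assumption about how such helpers could look rather than the code in
security/capsicum-rights.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Copied from include/uapi/linux/capsicum.h in this patch. */
#define CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))
#define CAPIDXBIT(right)	((int)(((right) >> 57) & 0x1F))

#define CAP_READ	CAPRIGHT(0, 0x0000000000000001ULL)
#define CAP_WRITE	CAPRIGHT(0, 0x0000000000000002ULL)
#define CAP_IOCTL	CAPRIGHT(1, 0x0000000000000080ULL)

/* Hypothetical stand-in for struct cap_rights at version 0 (2 elements). */
struct rights_sketch {
	uint64_t cr_rights[2];
};

/* Map the one-hot index field of a right to an array index, or -1. */
static int right_to_index(uint64_t right)
{
	switch (CAPIDXBIT(right)) {
	case 0x01:
		return 0;
	case 0x02:
		return 1;
	default:
		return -1;	/* no index bit, or an index we don't hold */
	}
}

/* Empty set: only the index marker bits are present in each element. */
static void rights_clear(struct rights_sketch *r)
{
	r->cr_rights[0] = CAPRIGHT(0, 0ULL);
	r->cr_rights[1] = CAPRIGHT(1, 0ULL);
}

/* OR a right into the element selected by its index field. */
static bool rights_add(struct rights_sketch *r, uint64_t right)
{
	int i = right_to_index(right);

	if (i < 0)
		return false;
	r->cr_rights[i] |= right;
	return true;
}

/* A right is held if all of its bits are set in the matching element. */
static bool rights_has(const struct rights_sketch *r, uint64_t right)
{
	int i = right_to_index(right);

	return i >= 0 && (r->cr_rights[i] & right) == right;
}
```

After rights_clear() followed by rights_add() of CAP_READ and CAP_IOCTL,
rights_has() reports CAP_READ and CAP_IOCTL as held and CAP_WRITE as
missing; CAP_READ lands in cr_rights[0] and CAP_IOCTL in cr_rights[1].
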
 Documentation/security/capsicum.txt | 102 +++++++++++
 include/linux/capsicum.h            |  50 ++++++
 include/uapi/linux/Kbuild           |   1 +
 include/uapi/linux/capsicum.h       | 343 ++++++++++++++++++++++++++++++++++++
 security/Kconfig                    |  15 ++
 security/Makefile                   |   2 +-
 security/capsicum-rights.c          | 201 +++++++++++++++++++++
 security/capsicum-rights.h          |  10 ++
 8 files changed, 723 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/security/capsicum.txt
 create mode 100644 include/linux/capsicum.h
 create mode 100644 include/uapi/linux/capsicum.h
 create mode 100644 security/capsicum-rights.c
 create mode 100644 security/capsicum-rights.h
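
The cap_rights_init()/cap_rights_set() macros in include/linux/capsicum.h
below append CAP_LIST_END to a variable argument list so that the callee
knows where to stop.  The following userspace sketch shows that sentinel
pattern in isolation.  All names in it (struct mask_sketch,
mask_init_list(), mask_init(), LIST_END) are invented for this note, and
it flattens everything into one 64-bit word instead of the indexed
cr_rights[] array, so treat it as an illustration of the calling
convention only.

```c
#include <stdarg.h>
#include <stdint.h>

#define LIST_END	0ULL	/* plays the role of CAP_LIST_END */

/* Hypothetical accumulator, much simpler than struct capsicum_rights. */
struct mask_sketch {
	uint64_t bits;
};

/*
 * Reset the mask, then OR in every argument up to the LIST_END sentinel.
 * Every variadic argument must have type unsigned long long (use ULL
 * constants); passing a plain int would be undefined behaviour.
 */
static struct mask_sketch *mask_init_list(struct mask_sketch *m, ...)
{
	va_list ap;
	unsigned long long right;

	m->bits = 0;
	va_start(ap, m);
	while ((right = va_arg(ap, unsigned long long)) != LIST_END)
		m->bits |= right;
	va_end(ap);
	return m;
}

/* The wrapper macro supplies the sentinel so callers cannot forget it. */
#define mask_init(m, ...)	mask_init_list((m), __VA_ARGS__, LIST_END)
```

A call such as mask_init(&m, 0x1ULL, 0x4ULL) leaves m.bits equal to 0x5
and returns &m, which lets the result be passed straight on to another
function.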

diff --git a/Documentation/security/capsicum.txt b/Documentation/security/capsicum.txt
new file mode 100644
index 000000000000..27e950828359
--- /dev/null
+++ b/Documentation/security/capsicum.txt
@@ -0,0 +1,102 @@
+Capsicum Object Capabilities
+============================
+
+Capsicum is a lightweight OS capability and sandbox framework, which allows
+security-aware userspace applications to sandbox parts of their own code in a
+highly granular way, reducing the attack surface in the event of subversion.
+
+Originally developed at the University of Cambridge Computer Laboratory, and
+initially implemented in FreeBSD 9.x, Capsicum extends the POSIX API, providing
+several new OS primitives to support object-capability security on UNIX-like
+operating systems.
+
+Note that Capsicum capability file descriptors are radically different to the
+POSIX.1e capabilities that are already available in Linux:
+ - POSIX.1e capabilities subdivide the root user's authority into different
+   areas of functionality.
+ - Capsicum capabilities restrict individual file descriptors so that
+   only operations permitted by that particular FD's rights are allowed.
+
+
+Overview
+--------
+
+Capability-based security is a security model where objects can only be
+accessed via capabilities, which are unforgeable tokens of authority that only
+give rights to perform certain operations.
+
+Capsicum is a pragmatic blend of capability-based security with standard
+UNIX/POSIX system semantics.  A Capsicum capability is a file descriptor that
+has an associated rights bitmask, and the kernel polices operations using that
+file descriptor, failing operations with insufficient rights.
+
+
+Capability Data Structure
+-------------------------
+
+Internally, a capability is a particular kind of struct file that wraps an
+underlying normal file.  The private data for the wrapper indicates the
+wrapped file, and holds the rights information for the capability.
+
+
+FD to File Conversion
+---------------------
+
+The primary policing of Capsicum capabilities occurs when a user-provided file
+descriptor is converted to a struct file object, normally using one of the
+fgetr() family of functions.
+
+All such operations in the kernel are annotated with information about the
+operations that are going to be performed on the retrieved struct file.  For
+example, a file that is retrieved for a read operation has its fgetr() call
+annotated with CAP_READ, indicating that any capability FD that reaches this
+point needs to include the CAP_READ right to progress further.  If the
+appropriate right is not available, -ENOTCAPABLE is returned.
+
+This change is the most significant change to the kernel, as it affects all
+FD-to-file conversions.  However, for a non-Capsicum build of the kernel the
+impact is minimal as the additional rights parameters to fgetr*() are macroed
+out.
+
+
+Path Traversal
+--------------
+
+Capsicum does allow new files to be accessed beneath a directory for which the
+application has a suitable capability FD (one including the CAP_LOOKUP right),
+using the openat(2) system call.  To prevent escape from the directory, path
+traversals are policed for "/" and ".." components.
+
+
+LSM Interactions
+----------------
+
+The annotation of all fget() calls with intended file operations, expressed
+as combinations of Capsicum rights values, is implemented as mainline kernel
+modifications.
+
+The remainder of the Capsicum functionality is via Linux Security Module (LSM)
+hooks, with Capsicum providing the default implementation when the active LSM
+does not override.  (If the active LSM does choose to override the Capsicum
+implementation, it should ensure that the Capsicum functionality is unaffected,
+by combining the results of the Capsicum implementation with its own.)
+
+The additional hooks added for (and implemented by) Capsicum are:
+ - file_lookup: Allow modification of the result of an fget() operation, so that
+   a rights check can be performed and the normal file underlying a capability
+   can be returned.
+ - file_install: Allow modification of a file that is about to be installed
+   into the file descriptor table, in cases where the new file is derived
+   from another file (that may be a Capsicum capability and so have rights
+   associated with it).
+
+
+New System Calls
+----------------
+
+Capsicum implements the following new system calls:
+ - cap_rights_limit: restrict the rights associated with a file descriptor, thus
+   turning it into a capability FD; internally this is implemented by wrapping
+   the original struct file with a capability file (security/capsicum.c)
+ - cap_rights_get: return the rights associated with a capability FD
+   (security/capsicum.c)
diff --git a/include/linux/capsicum.h b/include/linux/capsicum.h
new file mode 100644
index 000000000000..74f79756097a
--- /dev/null
+++ b/include/linux/capsicum.h
@@ -0,0 +1,50 @@
+#ifndef _LINUX_CAPSICUM_H
+#define _LINUX_CAPSICUM_H
+
+#include <stdarg.h>
+#include <uapi/linux/capsicum.h>
+
+struct file;
+/* Complete rights structure (primary and subrights). */
+struct capsicum_rights {
+	struct cap_rights primary;
+	unsigned int fcntls;  /* Only valid if CAP_FCNTL set in primary. */
+	int nioctls;  /* -1=>all; only valid if CAP_IOCTL set in primary */
+	unsigned int *ioctls;
+};
+
+#define CAP_LIST_END	0ULL
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+/* Rights manipulation functions */
+#define cap_rights_init(rights, ...) \
+	_cap_rights_init((rights), __VA_ARGS__, CAP_LIST_END)
+#define cap_rights_set(rights, ...) \
+	_cap_rights_set((rights), __VA_ARGS__, CAP_LIST_END)
+struct capsicum_rights *_cap_rights_init(struct capsicum_rights *rights, ...);
+struct capsicum_rights *_cap_rights_set(struct capsicum_rights *rights, ...);
+struct capsicum_rights *cap_rights_vinit(struct capsicum_rights *rights,
+					 va_list ap);
+struct capsicum_rights *cap_rights_vset(struct capsicum_rights *rights,
+					va_list ap);
+struct capsicum_rights *cap_rights_set_all(struct capsicum_rights *rights);
+bool cap_rights_is_all(const struct capsicum_rights *rights);
+
+#else
+
+#define cap_rights_init(rights, ...) _cap_rights_noop(rights)
+#define cap_rights_set(rights, ...) _cap_rights_noop(rights)
+#define cap_rights_set_all(rights) _cap_rights_noop(rights)
+static inline struct capsicum_rights *
+_cap_rights_noop(struct capsicum_rights *rights)
+{
+	return rights;
+}
+static inline bool cap_rights_is_all(const struct capsicum_rights *rights)
+{
+	return true;
+}
+
+#endif
+
+#endif /* _LINUX_CAPSICUM_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 6929571b79b0..57410bbee2f6 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -73,6 +73,7 @@ header-y += btrfs.h
 header-y += can.h
 header-y += capability.h
 header-y += capi.h
+header-y += capsicum.h
 header-y += cciss_defs.h
 header-y += cciss_ioctl.h
 header-y += cdrom.h
diff --git a/include/uapi/linux/capsicum.h b/include/uapi/linux/capsicum.h
new file mode 100644
index 000000000000..a39ac86fa183
--- /dev/null
+++ b/include/uapi/linux/capsicum.h
@@ -0,0 +1,343 @@
+#ifndef _UAPI_LINUX_CAPSICUM_H
+#define _UAPI_LINUX_CAPSICUM_H
+
+/*-
+ * Copyright (c) 2008-2010 Robert N. M. Watson
+ * Copyright (c) 2012 FreeBSD Foundation
+ * All rights reserved.
+ *
+ * This software was developed at the University of Cambridge Computer
+ * Laboratory with support from a grant from Google, Inc.
+ *
+ * Portions of this software were developed by Pawel Jakub Dawidek under
+ * sponsorship from the FreeBSD Foundation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * Definitions for Capsicum capabilities facility.
+ */
+#include <linux/types.h>
+
+/*
+ * The top two bits in the first element of the cr_rights[] array contain the
+ * total number of elements in the array minus 2. This means if those two bits
+ * are equal to 0, we have 2 array elements.
+ * The top two bits in all remaining array elements should be 0.
+ * The next five bits contain the array index. Only one bit is set, and its
+ * position within this five-bit range defines the array index. This means
+ * there can be at most five array elements.
+ */
+#define CAP_RIGHTS_VERSION_00	0
+/*
+#define CAP_RIGHTS_VERSION_01	1
+#define CAP_RIGHTS_VERSION_02	2
+#define CAP_RIGHTS_VERSION_03	3
+*/
+#define CAP_RIGHTS_VERSION	CAP_RIGHTS_VERSION_00
+
+/* Primary rights */
+struct cap_rights {
+	__u64	cr_rights[CAP_RIGHTS_VERSION + 2];
+};
+
+#define CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))
+
+/*
+ * Possible rights on capabilities.
+ *
+ * Notes:
+ * Some system calls don't require a capability in order to perform an
+ * operation on an fd.  These include: close, dup, dup2.
+ *
+ * sendfile is authorized using CAP_READ on the file and CAP_WRITE on the
+ * socket.
+ *
+ * mmap() and aio*() system calls will need special attention as they may
+ * involve reads or writes depending a great deal on context.
+ */
+
+/* INDEX 0 */
+
+/*
+ * General file I/O.
+ */
+/* Allows for openat(O_RDONLY), read(2), readv(2). */
+#define CAP_READ		CAPRIGHT(0, 0x0000000000000001ULL)
+/* Allows for openat(O_WRONLY | O_APPEND), write(2), writev(2). */
+#define CAP_WRITE		CAPRIGHT(0, 0x0000000000000002ULL)
+/* Allows for lseek(fd, 0, SEEK_CUR). */
+#define CAP_SEEK_TELL		CAPRIGHT(0, 0x0000000000000004ULL)
+/* Allows for lseek(2). */
+#define CAP_SEEK		(CAP_SEEK_TELL | 0x0000000000000008ULL)
+/* Allows for aio_read(2), pread(2), preadv(2). */
+#define CAP_PREAD		(CAP_SEEK | CAP_READ)
+/*
+ * Allows for aio_write(2), openat(O_WRONLY) (without O_APPEND), pwrite(2),
+ * pwritev(2).
+ */
+#define CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
+/* Allows for mmap(PROT_NONE). */
+#define CAP_MMAP		CAPRIGHT(0, 0x0000000000000010ULL)
+/* Allows for mmap(PROT_READ). */
+#define CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
+/* Allows for mmap(PROT_WRITE). */
+#define CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
+/* Allows for mmap(PROT_EXEC). */
+#define CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000020ULL)
+/* Allows for mmap(PROT_READ | PROT_WRITE). */
+#define CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
+/* Allows for mmap(PROT_READ | PROT_EXEC). */
+#define CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
+/* Allows for mmap(PROT_WRITE | PROT_EXEC). */
+#define CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
+/* Allows for mmap(PROT_READ | PROT_WRITE | PROT_EXEC). */
+#define CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
+/* Allows for openat(O_CREAT). */
+#define CAP_CREATE		CAPRIGHT(0, 0x0000000000000040ULL)
+/* Allows for openat(O_EXEC) and fexecve(2) in turn. */
+#define CAP_FEXECVE		CAPRIGHT(0, 0x0000000000000080ULL)
+/* Allows for openat(O_SYNC), openat(O_FSYNC), fsync(2), aio_fsync(2). */
+#define CAP_FSYNC		CAPRIGHT(0, 0x0000000000000100ULL)
+/* Allows for openat(O_TRUNC), ftruncate(2). */
+#define CAP_FTRUNCATE		CAPRIGHT(0, 0x0000000000000200ULL)
+
+/* Lookups - used to constrain *at() calls. */
+#define CAP_LOOKUP		CAPRIGHT(0, 0x0000000000000400ULL)
+
+/* VFS methods. */
+/* Allows for fchdir(2). */
+#define CAP_FCHDIR		CAPRIGHT(0, 0x0000000000000800ULL)
+/* Allows for fchflags(2). */
+#define CAP_FCHFLAGS		CAPRIGHT(0, 0x0000000000001000ULL)
+/* Allows for fchflags(2) and chflagsat(2). */
+#define CAP_CHFLAGSAT		(CAP_FCHFLAGS | CAP_LOOKUP)
+/* Allows for fchmod(2). */
+#define CAP_FCHMOD		CAPRIGHT(0, 0x0000000000002000ULL)
+/* Allows for fchmod(2) and fchmodat(2). */
+#define CAP_FCHMODAT		(CAP_FCHMOD | CAP_LOOKUP)
+/* Allows for fchown(2). */
+#define CAP_FCHOWN		CAPRIGHT(0, 0x0000000000004000ULL)
+/* Allows for fchown(2) and fchownat(2). */
+#define CAP_FCHOWNAT		(CAP_FCHOWN | CAP_LOOKUP)
+/* Allows for fcntl(2). */
+#define CAP_FCNTL		CAPRIGHT(0, 0x0000000000008000ULL)
+/*
+ * Allows for flock(2), openat(O_SHLOCK), openat(O_EXLOCK),
+ * fcntl(F_SETLK_REMOTE), fcntl(F_SETLKW), fcntl(F_SETLK), fcntl(F_GETLK).
+ */
+#define CAP_FLOCK		CAPRIGHT(0, 0x0000000000010000ULL)
+/* Allows for fpathconf(2). */
+#define CAP_FPATHCONF		CAPRIGHT(0, 0x0000000000020000ULL)
+/* Allows for UFS background-fsck operations. */
+#define CAP_FSCK		CAPRIGHT(0, 0x0000000000040000ULL)
+/* Allows for fstat(2). */
+#define CAP_FSTAT		CAPRIGHT(0, 0x0000000000080000ULL)
+/* Allows for fstat(2), fstatat(2) and faccessat(2). */
+#define CAP_FSTATAT		(CAP_FSTAT | CAP_LOOKUP)
+/* Allows for fstatfs(2). */
+#define CAP_FSTATFS		CAPRIGHT(0, 0x0000000000100000ULL)
+/* Allows for futimes(2). */
+#define CAP_FUTIMES		CAPRIGHT(0, 0x0000000000200000ULL)
+/* Allows for futimes(2) and futimesat(2). */
+#define CAP_FUTIMESAT		(CAP_FUTIMES | CAP_LOOKUP)
+/* Allows for linkat(2) and renameat(2) (destination directory descriptor). */
+#define CAP_LINKAT		(CAP_LOOKUP | 0x0000000000400000ULL)
+/* Allows for mkdirat(2). */
+#define CAP_MKDIRAT		(CAP_LOOKUP | 0x0000000000800000ULL)
+/* Allows for mkfifoat(2). */
+#define CAP_MKFIFOAT		(CAP_LOOKUP | 0x0000000001000000ULL)
+/* Allows for mknodat(2). */
+#define CAP_MKNODAT		(CAP_LOOKUP | 0x0000000002000000ULL)
+/* Allows for renameat(2). */
+#define CAP_RENAMEAT		(CAP_LOOKUP | 0x0000000004000000ULL)
+/* Allows for symlinkat(2). */
+#define CAP_SYMLINKAT		(CAP_LOOKUP | 0x0000000008000000ULL)
+/*
+ * Allows for unlinkat(2) and renameat(2) if destination object exists and
+ * will be removed.
+ */
+#define CAP_UNLINKAT		(CAP_LOOKUP | 0x0000000010000000ULL)
+
+/* Socket operations. */
+/* Allows for accept(2) and accept4(2). */
+#define CAP_ACCEPT		CAPRIGHT(0, 0x0000000020000000ULL)
+/* Allows for bind(2). */
+#define CAP_BIND		CAPRIGHT(0, 0x0000000040000000ULL)
+/* Allows for connect(2). */
+#define CAP_CONNECT		CAPRIGHT(0, 0x0000000080000000ULL)
+/* Allows for getpeername(2). */
+#define CAP_GETPEERNAME	CAPRIGHT(0, 0x0000000100000000ULL)
+/* Allows for getsockname(2). */
+#define CAP_GETSOCKNAME	CAPRIGHT(0, 0x0000000200000000ULL)
+/* Allows for getsockopt(2). */
+#define CAP_GETSOCKOPT		CAPRIGHT(0, 0x0000000400000000ULL)
+/* Allows for listen(2). */
+#define CAP_LISTEN		CAPRIGHT(0, 0x0000000800000000ULL)
+/* Allows for sctp_peeloff(2). */
+#define CAP_PEELOFF		CAPRIGHT(0, 0x0000001000000000ULL)
+#define CAP_RECV		CAP_READ
+#define CAP_SEND		CAP_WRITE
+/* Allows for setsockopt(2). */
+#define CAP_SETSOCKOPT		CAPRIGHT(0, 0x0000002000000000ULL)
+/* Allows for shutdown(2). */
+#define CAP_SHUTDOWN		CAPRIGHT(0, 0x0000004000000000ULL)
+
+/* Allows for bindat(2) on a directory descriptor. */
+#define CAP_BINDAT		(CAP_LOOKUP | 0x0000008000000000ULL)
+/* Allows for connectat(2) on a directory descriptor. */
+#define CAP_CONNECTAT		(CAP_LOOKUP | 0x0000010000000000ULL)
+
+#define CAP_SOCK_CLIENT \
+	(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
+	 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
+#define CAP_SOCK_SERVER \
+	(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
+	 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
+	 CAP_SETSOCKOPT | CAP_SHUTDOWN)
+
+/* All used bits for index 0. */
+#define CAP_ALL0		CAPRIGHT(0, 0x0000007FFFFFFFFFULL)
+
+/* Available bits for index 0. */
+#define CAP_UNUSED0_40		CAPRIGHT(0, 0x0000008000000000ULL)
+/* ... */
+#define CAP_UNUSED0_57		CAPRIGHT(0, 0x0100000000000000ULL)
+
+/* INDEX 1 */
+
+/* Mandatory Access Control. */
+/* Allows for mac_get_fd(3). */
+#define CAP_MAC_GET		CAPRIGHT(1, 0x0000000000000001ULL)
+/* Allows for mac_set_fd(3). */
+#define CAP_MAC_SET		CAPRIGHT(1, 0x0000000000000002ULL)
+
+/* Methods on semaphores. */
+#define CAP_SEM_GETVALUE	CAPRIGHT(1, 0x0000000000000004ULL)
+#define CAP_SEM_POST		CAPRIGHT(1, 0x0000000000000008ULL)
+#define CAP_SEM_WAIT		CAPRIGHT(1, 0x0000000000000010ULL)
+
+/* Allows select(2) and poll(2) on descriptor. */
+#define CAP_EVENT		CAPRIGHT(1, 0x0000000000000020ULL)
+/* Allows for kevent(2) on kqueue descriptor with eventlist != NULL. */
+#define CAP_KQUEUE_EVENT	CAPRIGHT(1, 0x0000000000000040ULL)
+
+/* Strange and powerful rights that should not be given lightly. */
+/* Allows for ioctl(2). */
+#define CAP_IOCTL		CAPRIGHT(1, 0x0000000000000080ULL)
+#define CAP_TTYHOOK		CAPRIGHT(1, 0x0000000000000100ULL)
+
+/* Process management via process descriptors. */
+/* Allows for pdgetpid(2). */
+#define CAP_PDGETPID		CAPRIGHT(1, 0x0000000000000200ULL)
+/* Allows for pdwait4(2). */
+#define CAP_PDWAIT		CAPRIGHT(1, 0x0000000000000400ULL)
+/* Allows for pdkill(2). */
+#define CAP_PDKILL		CAPRIGHT(1, 0x0000000000000800ULL)
+
+/* Extended attributes. */
+/* Allows for extattr_delete_fd(2). */
+#define CAP_EXTATTR_DELETE	CAPRIGHT(1, 0x0000000000001000ULL)
+/* Allows for extattr_get_fd(2). */
+#define CAP_EXTATTR_GET	CAPRIGHT(1, 0x0000000000002000ULL)
+/* Allows for extattr_list_fd(2). */
+#define CAP_EXTATTR_LIST	CAPRIGHT(1, 0x0000000000004000ULL)
+/* Allows for extattr_set_fd(2). */
+#define CAP_EXTATTR_SET	CAPRIGHT(1, 0x0000000000008000ULL)
+
+/* Access Control Lists. */
+/* Allows for acl_valid_fd_np(3). */
+#define CAP_ACL_CHECK		CAPRIGHT(1, 0x0000000000010000ULL)
+/* Allows for acl_delete_fd_np(3). */
+#define CAP_ACL_DELETE		CAPRIGHT(1, 0x0000000000020000ULL)
+/* Allows for acl_get_fd(3) and acl_get_fd_np(3). */
+#define CAP_ACL_GET		CAPRIGHT(1, 0x0000000000040000ULL)
+/* Allows for acl_set_fd(3) and acl_set_fd_np(3). */
+#define CAP_ACL_SET		CAPRIGHT(1, 0x0000000000080000ULL)
+
+/* Allows for kevent(2) on kqueue descriptor with changelist != NULL. */
+#define CAP_KQUEUE_CHANGE	CAPRIGHT(1, 0x0000000000100000ULL)
+
+#define CAP_KQUEUE		(CAP_KQUEUE_EVENT | CAP_KQUEUE_CHANGE)
+
+/* Modify signalfd signal mask. */
+#define CAP_FSIGNAL             CAPRIGHT(1, 0x0000000000200000ULL)
+
+/* Modify epollfd set of FDs/events */
+#define CAP_EPOLL_CTL           CAPRIGHT(1, 0x0000000000400000ULL)
+
+/* Modify things monitored by inotify/fanotify FD */
+#define CAP_NOTIFY              CAPRIGHT(1, 0x0000000000800000ULL)
+
+/* Allow entry to a namespace associated with a file descriptor */
+#define CAP_SETNS               CAPRIGHT(1, 0x0000000001000000ULL)
+
+/* Allow performance monitoring operations */
+#define CAP_PERFMON             CAPRIGHT(1, 0x0000000002000000ULL)
+
+/* All used bits for index 1. */
+#define CAP_ALL1		CAPRIGHT(1, 0x0000000003FFFFFFULL)
+
+/* Available bits for index 1. */
+#define CAP_UNUSED1_27		CAPRIGHT(1, 0x0000000004000000ULL)
+/* ... */
+#define CAP_UNUSED1_57		CAPRIGHT(1, 0x0100000000000000ULL)
+
+/* Backward compatibility. */
+#define CAP_POLL_EVENT		CAP_EVENT
+
+#define CAP_SET_ALL(rights)		do {				\
+	(rights)->cr_rights[0] =					\
+	    ((__u64)CAP_RIGHTS_VERSION << 62) | CAP_ALL0;		\
+	(rights)->cr_rights[1] = CAP_ALL1;				\
+} while (0)
+
+#define CAP_SET_NONE(rights)	do {					\
+	(rights)->cr_rights[0] =					\
+	    ((__u64)CAP_RIGHTS_VERSION << 62) | CAPRIGHT(0, 0ULL);	\
+	(rights)->cr_rights[1] = CAPRIGHT(1, 0ULL);			\
+} while (0)
+
+#define CAP_IS_ALL(rights)						\
+	(((rights)->cr_rights[0] ==					\
+	  (((__u64)CAP_RIGHTS_VERSION << 62) | CAP_ALL0)) &&	\
+	 ((rights)->cr_rights[1] == CAP_ALL1))
+
+#define CAPRVER(right)		((int)((right) >> 62))
+#define CAPVER(rights)		CAPRVER((rights)->cr_rights[0])
+#define CAPARSIZE(rights)	(CAPVER(rights) + 2)
+#define CAPIDXBIT(right)	((int)(((right) >> 57) & 0x1F))
+
+/*
+ * Allowed fcntl(2) commands.
+ */
+#define CAP_FCNTL_GETFL	(1 << F_GETFL)
+#define CAP_FCNTL_SETFL	(1 << F_SETFL)
+#define CAP_FCNTL_GETOWN	(1 << F_GETOWN)
+#define CAP_FCNTL_SETOWN	(1 << F_SETOWN)
+#define CAP_FCNTL_ALL		(CAP_FCNTL_GETFL | CAP_FCNTL_SETFL | \
+				 CAP_FCNTL_GETOWN | CAP_FCNTL_SETOWN)
+
+#define CAP_IOCTLS_ALL		SSIZE_MAX
+
+#endif /* _UAPI_LINUX_CAPSICUM_H */
diff --git a/security/Kconfig b/security/Kconfig
index beb86b500adf..006020864612 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -117,6 +117,21 @@ config LSM_MMAP_MIN_ADDR
 	  this low address space will need the permission specific to the
 	  systems running LSM.
 
+config SECURITY_CAPSICUM
+	bool "Capsicum capabilities"
+	default y
+	depends on SECURITY
+	depends on SECURITY_PATH
+	depends on SECCOMP
+	help
+	  Enable the Capsicum capability framework, which implements security
+	  primitives that support fine-grained capabilities on file
+	  descriptors; see Documentation/security/capsicum.txt for more
+	  details.
+
+	  If you are unsure as to whether this is required, answer N.
+
+
 source security/selinux/Kconfig
 source security/smack/Kconfig
 source security/tomoyo/Kconfig
diff --git a/security/Makefile b/security/Makefile
index 05f1c934d74b..c5e1363ae136 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -14,7 +14,7 @@ obj-y					+= commoncap.o
 obj-$(CONFIG_MMU)			+= min_addr.o
 
 # Object file lists
-obj-$(CONFIG_SECURITY)			+= security.o capability.o
+obj-$(CONFIG_SECURITY)			+= security.o capability.o capsicum-rights.o
 obj-$(CONFIG_SECURITYFS)		+= inode.o
 obj-$(CONFIG_SECURITY_SELINUX)		+= selinux/
 obj-$(CONFIG_SECURITY_SMACK)		+= smack/
diff --git a/security/capsicum-rights.c b/security/capsicum-rights.c
new file mode 100644
index 000000000000..0a5695fa0e61
--- /dev/null
+++ b/security/capsicum-rights.c
@@ -0,0 +1,201 @@
+/*-
+ * Copyright (c) 2013 FreeBSD Foundation
+ * All rights reserved.
+ *
+ * This software was developed by Pawel Jakub Dawidek under sponsorship from
+ * the FreeBSD Foundation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <stdarg.h>
+#include <linux/capsicum.h>
+#include <linux/slab.h>
+#include <linux/fcntl.h>
+#include <linux/bug.h>
+
+#include "capsicum-rights.h"
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+#define CAPARSIZE_MIN	(CAP_RIGHTS_VERSION_00 + 2)
+#define CAPARSIZE_MAX	(CAP_RIGHTS_VERSION + 2)
+
+/*
+ * -1 indicates invalid index value, otherwise log2(v), i.e.:
+ * 0x001 -> 0, 0x002 -> 1, 0x004 -> 2, 0x008 -> 3, 0x010 -> 4, rest -> -1
+ */
+static const int bit2idx[] = {
+	-1, 0, 1, -1, 2, -1, -1, -1, 3, -1, -1, -1, -1, -1, -1, -1,
+	4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
+};
+
+static inline int right_to_index(__u64 right)
+{
+	return bit2idx[CAPIDXBIT(right)];
+}
+
+static inline bool has_right(const struct capsicum_rights *rights, u64 right)
+{
+	int idx = right_to_index(right);
+	return (rights->primary.cr_rights[idx] & right) == right;
+}
+
+struct capsicum_rights *
+cap_rights_vset(struct capsicum_rights *rights, va_list ap)
+{
+	u64 right;
+	int i, n;
+
+	n = CAPARSIZE(&rights->primary);
+	BUG_ON(n < CAPARSIZE_MIN || n > CAPARSIZE_MAX);
+
+	while (true) {
+		right = va_arg(ap, u64);
+		if (right == 0)
+			break;
+		BUG_ON(CAPRVER(right) != 0);
+		i = right_to_index(right);
+		BUG_ON(i < 0 || i >= n);
+		BUG_ON(CAPIDXBIT(rights->primary.cr_rights[i]) !=
+		       CAPIDXBIT(right));
+		rights->primary.cr_rights[i] |= right;
+	}
+	return rights;
+}
+EXPORT_SYMBOL(cap_rights_vset);
+
+struct capsicum_rights *
+cap_rights_vinit(struct capsicum_rights *rights, va_list ap)
+{
+	CAP_SET_NONE(&rights->primary);
+	rights->nioctls = 0;
+	rights->ioctls = NULL;
+	rights->fcntls = 0;
+	cap_rights_vset(rights, ap);
+	return rights;
+}
+EXPORT_SYMBOL(cap_rights_vinit);
+
+bool cap_rights_regularize(struct capsicum_rights *rights)
+{
+	bool changed = false;
+	if (!has_right(rights, CAP_FCNTL) && rights->fcntls != 0x00) {
+		changed = true;
+		rights->fcntls = 0x00;
+	}
+	if (!has_right(rights, CAP_IOCTL) && (rights->nioctls != 0)) {
+		changed = true;
+		kfree(rights->ioctls);
+		rights->nioctls = 0;
+		rights->ioctls = NULL;
+	}
+	return changed;
+}
+
+struct capsicum_rights *_cap_rights_init(struct capsicum_rights *rights, ...)
+{
+	va_list ap;
+	va_start(ap, rights);
+	cap_rights_vinit(rights, ap);
+	va_end(ap);
+	return rights;
+}
+EXPORT_SYMBOL(_cap_rights_init);
+
+struct capsicum_rights *_cap_rights_set(struct capsicum_rights *rights, ...)
+{
+	va_list ap;
+	va_start(ap, rights);
+	cap_rights_vset(rights, ap);
+	va_end(ap);
+	return rights;
+}
+EXPORT_SYMBOL(_cap_rights_set);
+
+struct capsicum_rights *cap_rights_set_all(struct capsicum_rights *rights)
+{
+	CAP_SET_ALL(&rights->primary);
+	rights->nioctls = -1;
+	rights->ioctls = NULL;
+	rights->fcntls = CAP_FCNTL_ALL;
+	return rights;
+}
+EXPORT_SYMBOL(cap_rights_set_all);
+
+static bool cap_rights_ioctls_contains(const struct capsicum_rights *big,
+				       const struct capsicum_rights *little)
+{
+	int i, j;
+
+	if (big->nioctls == -1)
+		return true;
+	if (little->nioctls == -1 || big->nioctls < little->nioctls)
+		return false;
+	for (i = 0; i < little->nioctls; i++) {
+		for (j = 0; j < big->nioctls; j++) {
+			if (little->ioctls[i] == big->ioctls[j])
+				break;
+		}
+		if (j == big->nioctls)
+			return false;
+	}
+	return true;
+}
+
+static bool cap_rights_primary_contains(const struct cap_rights *big,
+					const struct cap_rights *little)
+{
+	unsigned int i, n;
+
+	BUG_ON(CAPVER(big) != CAP_RIGHTS_VERSION_00);
+	BUG_ON(CAPVER(little) != CAP_RIGHTS_VERSION_00);
+
+	n = CAPARSIZE(big);
+	BUG_ON(n < CAPARSIZE_MIN || n > CAPARSIZE_MAX);
+
+	for (i = 0; i < n; i++) {
+		if ((big->cr_rights[i] & little->cr_rights[i]) !=
+		    little->cr_rights[i]) {
+			return false;
+		}
+	}
+	return true;
+}
+
+bool cap_rights_contains(const struct capsicum_rights *big,
+			const struct capsicum_rights *little)
+{
+	return cap_rights_primary_contains(&big->primary,
+					   &little->primary) &&
+	       ((big->fcntls & little->fcntls) == little->fcntls) &&
+	       cap_rights_ioctls_contains(big, little);
+}
+
+bool cap_rights_is_all(const struct capsicum_rights *rights)
+{
+	return CAP_IS_ALL(&rights->primary) &&
+	       rights->fcntls == CAP_FCNTL_ALL &&
+	       rights->nioctls == -1;
+}
+EXPORT_SYMBOL(cap_rights_is_all);
+
+#endif  /* CONFIG_SECURITY_CAPSICUM */
diff --git a/security/capsicum-rights.h b/security/capsicum-rights.h
new file mode 100644
index 000000000000..b7143e3d65b7
--- /dev/null
+++ b/security/capsicum-rights.h
@@ -0,0 +1,10 @@
+#ifndef _CAPSICUM_RIGHTS_H
+#define _CAPSICUM_RIGHTS_H
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+bool cap_rights_regularize(struct capsicum_rights *rights);
+bool cap_rights_contains(const struct capsicum_rights *big,
+			 const struct capsicum_rights *little);
+#endif
+
+#endif /* _CAPSICUM_RIGHTS_H */
-- 
2.0.0.526.g5318336



* [PATCH 03/11] capsicum: rights values and structure definitions
@ 2014-06-30 10:28   ` David Drysdale
  0 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Define (in include/uapi/linux/capsicum.h) values for primary
rights associated with Capsicum capability file descriptors.

Also define the structure that primary rights reside in (struct
cap_rights), and the complete compound rights structure (struct
capsicum_rights).

 - Primary rights describe the main operations that can be
   performed on a file.
 - Secondary rights allow for specific fcntl() and ioctl()
   operations to be policed.
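
As an illustration of how a primary right is laid out, here is a minimal
userspace sketch.  The CAPRIGHT(), CAPRVER() and CAPIDXBIT() macros and
the two right values are copied from include/uapi/linux/capsicum.h in
this patch; right_index() is a hypothetical helper written only for this
example (the patch itself uses a lookup table), and uint64_t stands in
for __u64.

```c
#include <stdint.h>

/* Copied from the patch's include/uapi/linux/capsicum.h. */
#define CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))
#define CAPRVER(right)		((int)((right) >> 62))
#define CAPIDXBIT(right)	((int)(((right) >> 57) & 0x1F))

#define CAP_READ	CAPRIGHT(0, 0x0000000000000001ULL)
#define CAP_IOCTL	CAPRIGHT(1, 0x0000000000000080ULL)

/*
 * Hypothetical helper (not part of the patch): map the one-hot index
 * field in bits 57..61 to an array index, or return -1 if the field
 * does not have exactly one bit set.
 */
static int right_index(uint64_t right)
{
	int field = CAPIDXBIT(right);
	int idx = 0;

	if (field == 0 || (field & (field - 1)) != 0)
		return -1;
	while ((field >> idx) != 1)
		idx++;
	return idx;
}
```

With these definitions CAP_READ is 0x0200000000000001 and belongs in
cr_rights[0], while CAP_IOCTL belongs in cr_rights[1].  OR-ing rights
from different array elements yields an index field with two bits set,
which the helper rejects.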

Add functions to manipulate these rights structures.
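
One of these functions, cap_rights_contains(), checks whether one
compound rights structure is a subset of another.  The following is a
rough userspace sketch of that kind of check, not the kernel code:
struct rights_sketch and both function names are hypothetical stand-ins,
and the sketch assumes that a request for all ioctls (nioctls == -1) is
never contained in a finite ioctl list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NWORDS	2	/* array size for rights version 0 */

/* Hypothetical userspace stand-in for struct capsicum_rights. */
struct rights_sketch {
	uint64_t primary[SKETCH_NWORDS];
	unsigned int fcntls;
	int nioctls;			/* -1 means "all ioctls" */
	const unsigned int *ioctls;
};

/* True if every ioctl allowed by 'little' is also allowed by 'big'. */
static bool ioctls_contained(const struct rights_sketch *big,
			     const struct rights_sketch *little)
{
	int i, j;

	if (big->nioctls == -1)
		return true;
	if (little->nioctls == -1)	/* assumption: "all" exceeds any list */
		return false;
	for (i = 0; i < little->nioctls; i++) {
		for (j = 0; j < big->nioctls; j++)
			if (little->ioctls[i] == big->ioctls[j])
				break;
		if (j == big->nioctls)
			return false;
	}
	return true;
}

/* True if 'little' is a subset of 'big' in all three components. */
static bool rights_contained(const struct rights_sketch *big,
			     const struct rights_sketch *little)
{
	size_t i;

	for (i = 0; i < SKETCH_NWORDS; i++)
		if ((big->primary[i] & little->primary[i]) !=
		    little->primary[i])
			return false;
	return (big->fcntls & little->fcntls) == little->fcntls &&
	       ioctls_contained(big, little);
}
```

The primary and fcntl components are plain bitmask subset tests; only
the ioctl component needs a list comparison, which is quadratic but
acceptable for the short lists expected here.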

This change is adapted from the FreeBSD 10.x implementation of
Capsicum, with the aim of preserving compatibility between the
two implementations as closely as possible.

Signed-off-by: David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 Documentation/security/capsicum.txt | 102 +++++++++++
 include/linux/capsicum.h            |  50 ++++++
 include/uapi/linux/Kbuild           |   1 +
 include/uapi/linux/capsicum.h       | 343 ++++++++++++++++++++++++++++++++++++
 security/Kconfig                    |  15 ++
 security/Makefile                   |   2 +-
 security/capsicum-rights.c          | 201 +++++++++++++++++++++
 security/capsicum-rights.h          |  10 ++
 8 files changed, 723 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/security/capsicum.txt
 create mode 100644 include/linux/capsicum.h
 create mode 100644 include/uapi/linux/capsicum.h
 create mode 100644 security/capsicum-rights.c
 create mode 100644 security/capsicum-rights.h

diff --git a/Documentation/security/capsicum.txt b/Documentation/security/capsicum.txt
new file mode 100644
index 000000000000..27e950828359
--- /dev/null
+++ b/Documentation/security/capsicum.txt
@@ -0,0 +1,102 @@
+Capsicum Object Capabilities
+============================
+
+Capsicum is a lightweight OS capability and sandbox framework, which allows
+security-aware userspace applications to sandbox parts of their own code in a
+highly granular way, reducing the attack surface in the event of subversion.
+
+Originally developed at the University of Cambridge Computer Laboratory, and
+initially implemented in FreeBSD 9.x, Capsicum extends the POSIX API, providing
+several new OS primitives to support object-capability security on UNIX-like
+operating systems.
+
+Note that Capsicum capability file descriptors are radically different to the
+POSIX.1e capabilities that are already available in Linux:
+ - POSIX.1e capabilities subdivide the root user's authority into different
+   areas of functionality.
+ - Capsicum capabilities restrict individual file descriptors so that
+   only operations permitted by that particular FD's rights are allowed.
+
+
+Overview
+--------
+
+Capability-based security is a security model where objects can only be
+accessed via capabilities, which are unforgeable tokens of authority that only
+give rights to perform certain operations.
+
+Capsicum is a pragmatic blend of capability-based security with standard
+UNIX/POSIX system semantics.  A Capsicum capability is a file descriptor that
+has an associated rights bitmask, and the kernel polices operations using that
+file descriptor, failing operations with insufficient rights.
+
+
+Capability Data Structure
+-------------------------
+
+Internally, a capability is a particular kind of struct file that wraps an
+underlying normal file.  The private data for the wrapper indicates the
+wrapped file, and holds the rights information for the capability.
+
+
+FD to File Conversion
+---------------------
+
+The primary policing of Capsicum capabilities occurs when a user-provided file
+descriptor is converted to a struct file object, normally using one of the
+fgetr() family of functions.
+
+All such operations in the kernel are annotated with information about the
+operations that are going to be performed on the retrieved struct file.  For
+example, a file that is retrieved for a read operation has its fgetr() call
+annotated with CAP_READ, indicating that any capability FD that reaches this
+point needs to include the CAP_READ right to progress further.  If the
+appropriate right is not available, -ENOTCAPABLE is returned.
+
+This change is the most significant change to the kernel, as it affects all
+FD-to-file conversions.  However, for a non-Capsicum build of the kernel the
+impact is minimal as the additional rights parameters to fgetr*() are
+compiled out by macros.
+
+
+Path Traversal
+--------------
+
+Capsicum does allow new files to be accessed beneath a directory for which the
+application has a suitable capability FD (one including the CAP_LOOKUP right),
+using the openat(2) system call.  To prevent escape from the directory, path
+traversals are policed for "/" and ".." components.
+
+
+LSM Interactions
+----------------
+
+The annotation of all fget() calls with intended file operations, expressed
+as combinations of Capsicum rights values, is implemented as mainline kernel
+modifications.
+
+The remainder of the Capsicum functionality is via Linux Security Module (LSM)
+hooks, with Capsicum providing the default implementation when the active LSM
+does not override.  (If the active LSM does choose to override the Capsicum
+implementation, it should ensure that the Capsicum functionality is unaffected,
+by combining the results of the Capsicum implementation with its own.)
+
+The additional hooks added for (and implemented by) Capsicum are:
+ - file_lookup: Allow modification of the result of an fget() operation, so that
+   a rights check can be performed and the normal file underlying a capability
+   can be returned.
+ - file_install: Allow modification of a file that is about to be installed
+   into the file descriptor table, in cases where the new file is derived
+   from another file (that may be a Capsicum capability and so have rights
+   associated with it).
+
+
+New System Calls
+----------------
+
+Capsicum implements the following new system calls:
+ - cap_rights_limit: restrict the rights associated with a file descriptor,
+   turning it into a capability FD; internally this is implemented by wrapping
+   the original struct file with a capability file (security/capsicum.c)
+ - cap_rights_get: return the rights associated with a capability FD
+   (security/capsicum.c)
diff --git a/include/linux/capsicum.h b/include/linux/capsicum.h
new file mode 100644
index 000000000000..74f79756097a
--- /dev/null
+++ b/include/linux/capsicum.h
@@ -0,0 +1,50 @@
+#ifndef _LINUX_CAPSICUM_H
+#define _LINUX_CAPSICUM_H
+
+#include <stdarg.h>
+#include <uapi/linux/capsicum.h>
+
+struct file;
+/* Complete rights structure (primary and subrights). */
+struct capsicum_rights {
+	struct cap_rights primary;
+	unsigned int fcntls;  /* Only valid if CAP_FCNTL set in primary. */
+	int nioctls;  /* -1=>all; only valid if CAP_IOCTL set in primary */
+	unsigned int *ioctls;
+};
+
+#define CAP_LIST_END	0ULL
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+/* Rights manipulation functions */
+#define cap_rights_init(rights, ...) \
+	_cap_rights_init((rights), __VA_ARGS__, CAP_LIST_END)
+#define cap_rights_set(rights, ...) \
+	_cap_rights_set((rights), __VA_ARGS__, CAP_LIST_END)
+struct capsicum_rights *_cap_rights_init(struct capsicum_rights *rights, ...);
+struct capsicum_rights *_cap_rights_set(struct capsicum_rights *rights, ...);
+struct capsicum_rights *cap_rights_vinit(struct capsicum_rights *rights,
+					 va_list ap);
+struct capsicum_rights *cap_rights_vset(struct capsicum_rights *rights,
+					va_list ap);
+struct capsicum_rights *cap_rights_set_all(struct capsicum_rights *rights);
+bool cap_rights_is_all(const struct capsicum_rights *rights);
+
+#else
+
+#define cap_rights_init(rights, ...) _cap_rights_noop(rights)
+#define cap_rights_set(rights, ...) _cap_rights_noop(rights)
+#define cap_rights_set_all(rights) _cap_rights_noop(rights)
+static inline struct capsicum_rights *
+_cap_rights_noop(struct capsicum_rights *rights)
+{
+	return rights;
+}
+static inline bool cap_rights_is_all(const struct capsicum_rights *rights)
+{
+	return true;
+}
+
+#endif
+
+#endif /* _LINUX_CAPSICUM_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 6929571b79b0..57410bbee2f6 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -73,6 +73,7 @@ header-y += btrfs.h
 header-y += can.h
 header-y += capability.h
 header-y += capi.h
+header-y += capsicum.h
 header-y += cciss_defs.h
 header-y += cciss_ioctl.h
 header-y += cdrom.h
diff --git a/include/uapi/linux/capsicum.h b/include/uapi/linux/capsicum.h
new file mode 100644
index 000000000000..a39ac86fa183
--- /dev/null
+++ b/include/uapi/linux/capsicum.h
@@ -0,0 +1,343 @@
+#ifndef _UAPI_LINUX_CAPSICUM_H
+#define _UAPI_LINUX_CAPSICUM_H
+
+/*-
+ * Copyright (c) 2008-2010 Robert N. M. Watson
+ * Copyright (c) 2012 FreeBSD Foundation
+ * All rights reserved.
+ *
+ * This software was developed at the University of Cambridge Computer
+ * Laboratory with support from a grant from Google, Inc.
+ *
+ * Portions of this software were developed by Pawel Jakub Dawidek under
+ * sponsorship from the FreeBSD Foundation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * Definitions for Capsicum capabilities facility.
+ */
+#include <linux/types.h>
+
+/*
+ * The top two bits in the first element of the cr_rights[] array contain the
+ * total number of elements in the array minus 2. This means that if those two
+ * bits are equal to 0, we have 2 array elements.
+ * The top two bits in all remaining array elements should be 0.
+ * The next five bits contain the array index. Exactly one bit is set, and its
+ * position within this five-bit range defines the array index. This means
+ * there can be at most five array elements.
+ */
+#define CAP_RIGHTS_VERSION_00	0
+/*
+#define CAP_RIGHTS_VERSION_01	1
+#define CAP_RIGHTS_VERSION_02	2
+#define CAP_RIGHTS_VERSION_03	3
+*/
+#define CAP_RIGHTS_VERSION	CAP_RIGHTS_VERSION_00
+
+/* Primary rights */
+struct cap_rights {
+	__u64	cr_rights[CAP_RIGHTS_VERSION + 2];
+};
+
+#define CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))
+
+/*
+ * Possible rights on capabilities.
+ *
+ * Notes:
+ * Some system calls don't require a capability in order to perform an
+ * operation on an fd.  These include: close, dup, dup2.
+ *
+ * sendfile is authorized using CAP_READ on the file and CAP_WRITE on the
+ * socket.
+ *
+ * mmap() and aio*() system calls will need special attention as they may
+ * involve reads or writes depending a great deal on context.
+ */
+
+/* INDEX 0 */
+
+/*
+ * General file I/O.
+ */
+/* Allows for openat(O_RDONLY), read(2), readv(2). */
+#define CAP_READ		CAPRIGHT(0, 0x0000000000000001ULL)
+/* Allows for openat(O_WRONLY | O_APPEND), write(2), writev(2). */
+#define CAP_WRITE		CAPRIGHT(0, 0x0000000000000002ULL)
+/* Allows for lseek(fd, 0, SEEK_CUR). */
+#define CAP_SEEK_TELL		CAPRIGHT(0, 0x0000000000000004ULL)
+/* Allows for lseek(2). */
+#define CAP_SEEK		(CAP_SEEK_TELL | 0x0000000000000008ULL)
+/* Allows for aio_read(2), pread(2), preadv(2). */
+#define CAP_PREAD		(CAP_SEEK | CAP_READ)
+/*
+ * Allows for aio_write(2), openat(O_WRONLY) (without O_APPEND), pwrite(2),
+ * pwritev(2).
+ */
+#define CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
+/* Allows for mmap(PROT_NONE). */
+#define CAP_MMAP		CAPRIGHT(0, 0x0000000000000010ULL)
+/* Allows for mmap(PROT_READ). */
+#define CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
+/* Allows for mmap(PROT_WRITE). */
+#define CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
+/* Allows for mmap(PROT_EXEC). */
+#define CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000020ULL)
+/* Allows for mmap(PROT_READ | PROT_WRITE). */
+#define CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
+/* Allows for mmap(PROT_READ | PROT_EXEC). */
+#define CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
+/* Allows for mmap(PROT_WRITE | PROT_EXEC). */
+#define CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
+/* Allows for mmap(PROT_READ | PROT_WRITE | PROT_EXEC). */
+#define CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
+/* Allows for openat(O_CREAT). */
+#define CAP_CREATE		CAPRIGHT(0, 0x0000000000000040ULL)
+/* Allows for openat(O_EXEC) and fexecve(2) in turn. */
+#define CAP_FEXECVE		CAPRIGHT(0, 0x0000000000000080ULL)
+/* Allows for openat(O_SYNC), openat(O_FSYNC), fsync(2), aio_fsync(2). */
+#define CAP_FSYNC		CAPRIGHT(0, 0x0000000000000100ULL)
+/* Allows for openat(O_TRUNC), ftruncate(2). */
+#define CAP_FTRUNCATE		CAPRIGHT(0, 0x0000000000000200ULL)
+
+/* Lookups - used to constrain *at() calls. */
+#define CAP_LOOKUP		CAPRIGHT(0, 0x0000000000000400ULL)
+
+/* VFS methods. */
+/* Allows for fchdir(2). */
+#define CAP_FCHDIR		CAPRIGHT(0, 0x0000000000000800ULL)
+/* Allows for fchflags(2). */
+#define CAP_FCHFLAGS		CAPRIGHT(0, 0x0000000000001000ULL)
+/* Allows for fchflags(2) and chflagsat(2). */
+#define CAP_CHFLAGSAT		(CAP_FCHFLAGS | CAP_LOOKUP)
+/* Allows for fchmod(2). */
+#define CAP_FCHMOD		CAPRIGHT(0, 0x0000000000002000ULL)
+/* Allows for fchmod(2) and fchmodat(2). */
+#define CAP_FCHMODAT		(CAP_FCHMOD | CAP_LOOKUP)
+/* Allows for fchown(2). */
+#define CAP_FCHOWN		CAPRIGHT(0, 0x0000000000004000ULL)
+/* Allows for fchown(2) and fchownat(2). */
+#define CAP_FCHOWNAT		(CAP_FCHOWN | CAP_LOOKUP)
+/* Allows for fcntl(2). */
+#define CAP_FCNTL		CAPRIGHT(0, 0x0000000000008000ULL)
+/*
+ * Allows for flock(2), openat(O_SHLOCK), openat(O_EXLOCK),
+ * fcntl(F_SETLK_REMOTE), fcntl(F_SETLKW), fcntl(F_SETLK), fcntl(F_GETLK).
+ */
+#define CAP_FLOCK		CAPRIGHT(0, 0x0000000000010000ULL)
+/* Allows for fpathconf(2). */
+#define CAP_FPATHCONF		CAPRIGHT(0, 0x0000000000020000ULL)
+/* Allows for UFS background-fsck operations. */
+#define CAP_FSCK		CAPRIGHT(0, 0x0000000000040000ULL)
+/* Allows for fstat(2). */
+#define CAP_FSTAT		CAPRIGHT(0, 0x0000000000080000ULL)
+/* Allows for fstat(2), fstatat(2) and faccessat(2). */
+#define CAP_FSTATAT		(CAP_FSTAT | CAP_LOOKUP)
+/* Allows for fstatfs(2). */
+#define CAP_FSTATFS		CAPRIGHT(0, 0x0000000000100000ULL)
+/* Allows for futimes(2). */
+#define CAP_FUTIMES		CAPRIGHT(0, 0x0000000000200000ULL)
+/* Allows for futimes(2) and futimesat(2). */
+#define CAP_FUTIMESAT		(CAP_FUTIMES | CAP_LOOKUP)
+/* Allows for linkat(2) and renameat(2) (destination directory descriptor). */
+#define CAP_LINKAT		(CAP_LOOKUP | 0x0000000000400000ULL)
+/* Allows for mkdirat(2). */
+#define CAP_MKDIRAT		(CAP_LOOKUP | 0x0000000000800000ULL)
+/* Allows for mkfifoat(2). */
+#define CAP_MKFIFOAT		(CAP_LOOKUP | 0x0000000001000000ULL)
+/* Allows for mknodat(2). */
+#define CAP_MKNODAT		(CAP_LOOKUP | 0x0000000002000000ULL)
+/* Allows for renameat(2). */
+#define CAP_RENAMEAT		(CAP_LOOKUP | 0x0000000004000000ULL)
+/* Allows for symlinkat(2). */
+#define CAP_SYMLINKAT		(CAP_LOOKUP | 0x0000000008000000ULL)
+/*
+ * Allows for unlinkat(2) and renameat(2) if destination object exists and
+ * will be removed.
+ */
+#define CAP_UNLINKAT		(CAP_LOOKUP | 0x0000000010000000ULL)
+
+/* Socket operations. */
+/* Allows for accept(2) and accept4(2). */
+#define CAP_ACCEPT		CAPRIGHT(0, 0x0000000020000000ULL)
+/* Allows for bind(2). */
+#define CAP_BIND		CAPRIGHT(0, 0x0000000040000000ULL)
+/* Allows for connect(2). */
+#define CAP_CONNECT		CAPRIGHT(0, 0x0000000080000000ULL)
+/* Allows for getpeername(2). */
+#define CAP_GETPEERNAME	CAPRIGHT(0, 0x0000000100000000ULL)
+/* Allows for getsockname(2). */
+#define CAP_GETSOCKNAME	CAPRIGHT(0, 0x0000000200000000ULL)
+/* Allows for getsockopt(2). */
+#define CAP_GETSOCKOPT		CAPRIGHT(0, 0x0000000400000000ULL)
+/* Allows for listen(2). */
+#define CAP_LISTEN		CAPRIGHT(0, 0x0000000800000000ULL)
+/* Allows for sctp_peeloff(2). */
+#define CAP_PEELOFF		CAPRIGHT(0, 0x0000001000000000ULL)
+#define CAP_RECV		CAP_READ
+#define CAP_SEND		CAP_WRITE
+/* Allows for setsockopt(2). */
+#define CAP_SETSOCKOPT		CAPRIGHT(0, 0x0000002000000000ULL)
+/* Allows for shutdown(2). */
+#define CAP_SHUTDOWN		CAPRIGHT(0, 0x0000004000000000ULL)
+
+/* Allows for bindat(2) on a directory descriptor. */
+#define CAP_BINDAT		(CAP_LOOKUP | 0x0000008000000000ULL)
+/* Allows for connectat(2) on a directory descriptor. */
+#define CAP_CONNECTAT		(CAP_LOOKUP | 0x0000010000000000ULL)
+
+#define CAP_SOCK_CLIENT \
+	(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
+	 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
+#define CAP_SOCK_SERVER \
+	(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
+	 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
+	 CAP_SETSOCKOPT | CAP_SHUTDOWN)
+
+/* All used bits for index 0. */
+#define CAP_ALL0		CAPRIGHT(0, 0x000001FFFFFFFFFFULL)
+
+/* Available bits for index 0. */
+#define CAP_UNUSED0_42		CAPRIGHT(0, 0x0000020000000000ULL)
+/* ... */
+#define CAP_UNUSED0_57		CAPRIGHT(0, 0x0100000000000000ULL)
+
+/* INDEX 1 */
+
+/* Mandatory Access Control. */
+/* Allows for mac_get_fd(3). */
+#define CAP_MAC_GET		CAPRIGHT(1, 0x0000000000000001ULL)
+/* Allows for mac_set_fd(3). */
+#define CAP_MAC_SET		CAPRIGHT(1, 0x0000000000000002ULL)
+
+/* Methods on semaphores. */
+#define CAP_SEM_GETVALUE	CAPRIGHT(1, 0x0000000000000004ULL)
+#define CAP_SEM_POST		CAPRIGHT(1, 0x0000000000000008ULL)
+#define CAP_SEM_WAIT		CAPRIGHT(1, 0x0000000000000010ULL)
+
+/* Allows select(2) and poll(2) on descriptor. */
+#define CAP_EVENT		CAPRIGHT(1, 0x0000000000000020ULL)
+/* Allows for kevent(2) on kqueue descriptor with eventlist != NULL. */
+#define CAP_KQUEUE_EVENT	CAPRIGHT(1, 0x0000000000000040ULL)
+
+/* Strange and powerful rights that should not be given lightly. */
+/* Allows for ioctl(2). */
+#define CAP_IOCTL		CAPRIGHT(1, 0x0000000000000080ULL)
+#define CAP_TTYHOOK		CAPRIGHT(1, 0x0000000000000100ULL)
+
+/* Process management via process descriptors. */
+/* Allows for pdgetpid(2). */
+#define CAP_PDGETPID		CAPRIGHT(1, 0x0000000000000200ULL)
+/* Allows for pdwait4(2). */
+#define CAP_PDWAIT		CAPRIGHT(1, 0x0000000000000400ULL)
+/* Allows for pdkill(2). */
+#define CAP_PDKILL		CAPRIGHT(1, 0x0000000000000800ULL)
+
+/* Extended attributes. */
+/* Allows for extattr_delete_fd(2). */
+#define CAP_EXTATTR_DELETE	CAPRIGHT(1, 0x0000000000001000ULL)
+/* Allows for extattr_get_fd(2). */
+#define CAP_EXTATTR_GET	CAPRIGHT(1, 0x0000000000002000ULL)
+/* Allows for extattr_list_fd(2). */
+#define CAP_EXTATTR_LIST	CAPRIGHT(1, 0x0000000000004000ULL)
+/* Allows for extattr_set_fd(2). */
+#define CAP_EXTATTR_SET	CAPRIGHT(1, 0x0000000000008000ULL)
+
+/* Access Control Lists. */
+/* Allows for acl_valid_fd_np(3). */
+#define CAP_ACL_CHECK		CAPRIGHT(1, 0x0000000000010000ULL)
+/* Allows for acl_delete_fd_np(3). */
+#define CAP_ACL_DELETE		CAPRIGHT(1, 0x0000000000020000ULL)
+/* Allows for acl_get_fd(3) and acl_get_fd_np(3). */
+#define CAP_ACL_GET		CAPRIGHT(1, 0x0000000000040000ULL)
+/* Allows for acl_set_fd(3) and acl_set_fd_np(3). */
+#define CAP_ACL_SET		CAPRIGHT(1, 0x0000000000080000ULL)
+
+/* Allows for kevent(2) on kqueue descriptor with changelist != NULL. */
+#define CAP_KQUEUE_CHANGE	CAPRIGHT(1, 0x0000000000100000ULL)
+
+#define CAP_KQUEUE		(CAP_KQUEUE_EVENT | CAP_KQUEUE_CHANGE)
+
+/* Modify signalfd signal mask. */
+#define CAP_FSIGNAL             CAPRIGHT(1, 0x0000000000200000ULL)
+
+/* Modify epollfd set of FDs/events */
+#define CAP_EPOLL_CTL           CAPRIGHT(1, 0x0000000000400000ULL)
+
+/* Modify things monitored by inotify/fanotify FD */
+#define CAP_NOTIFY              CAPRIGHT(1, 0x0000000000800000ULL)
+
+/* Allow entry to a namespace associated with a file descriptor */
+#define CAP_SETNS               CAPRIGHT(1, 0x0000000001000000ULL)
+
+/* Allow performance monitoring operations */
+#define CAP_PERFMON             CAPRIGHT(1, 0x0000000002000000ULL)
+
+/* All used bits for index 1. */
+#define CAP_ALL1		CAPRIGHT(1, 0x0000000003FFFFFFULL)
+
+/* Available bits for index 1. */
+#define CAP_UNUSED1_27		CAPRIGHT(1, 0x0000000004000000ULL)
+/* ... */
+#define CAP_UNUSED1_57		CAPRIGHT(1, 0x0100000000000000ULL)
+
+/* Backward compatibility. */
+#define CAP_POLL_EVENT		CAP_EVENT
+
+#define CAP_SET_ALL(rights)		do {				\
+	(rights)->cr_rights[0] =					\
+	    ((__u64)CAP_RIGHTS_VERSION << 62) | CAP_ALL0;		\
+	(rights)->cr_rights[1] = CAP_ALL1;				\
+} while (0)
+
+#define CAP_SET_NONE(rights)	do {					\
+	(rights)->cr_rights[0] =					\
+	    ((__u64)CAP_RIGHTS_VERSION << 62) | CAPRIGHT(0, 0ULL);	\
+	(rights)->cr_rights[1] = CAPRIGHT(1, 0ULL);			\
+} while (0)
+
+#define CAP_IS_ALL(rights)						\
+	(((rights)->cr_rights[0] ==					\
+	  (((__u64)CAP_RIGHTS_VERSION << 62) | CAP_ALL0)) &&	\
+	 ((rights)->cr_rights[1] == CAP_ALL1))
+
+#define CAPRVER(right)		((int)((right) >> 62))
+#define CAPVER(rights)		CAPRVER((rights)->cr_rights[0])
+#define CAPARSIZE(rights)	(CAPVER(rights) + 2)
+#define CAPIDXBIT(right)	((int)(((right) >> 57) & 0x1F))
+
+/*
+ * Allowed fcntl(2) commands.
+ */
+#define CAP_FCNTL_GETFL	(1 << F_GETFL)
+#define CAP_FCNTL_SETFL	(1 << F_SETFL)
+#define CAP_FCNTL_GETOWN	(1 << F_GETOWN)
+#define CAP_FCNTL_SETOWN	(1 << F_SETOWN)
+#define CAP_FCNTL_ALL		(CAP_FCNTL_GETFL | CAP_FCNTL_SETFL | \
+				 CAP_FCNTL_GETOWN | CAP_FCNTL_SETOWN)
+
+#define CAP_IOCTLS_ALL		SSIZE_MAX
+
+#endif /* _UAPI_LINUX_CAPSICUM_H */
diff --git a/security/Kconfig b/security/Kconfig
index beb86b500adf..006020864612 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -117,6 +117,21 @@ config LSM_MMAP_MIN_ADDR
 	  this low address space will need the permission specific to the
 	  systems running LSM.
 
+config SECURITY_CAPSICUM
+	bool "Capsicum capabilities"
+	default n
+	depends on SECURITY
+	depends on SECURITY_PATH
+	depends on SECCOMP
+	help
+	  Enable the Capsicum capability framework, which implements security
+	  primitives that support fine-grained capabilities on file
+	  descriptors; see Documentation/security/capsicum.txt for more
+	  details.
+
+	  If you are unsure as to whether this is required, answer N.
+
+
 source security/selinux/Kconfig
 source security/smack/Kconfig
 source security/tomoyo/Kconfig
diff --git a/security/Makefile b/security/Makefile
index 05f1c934d74b..c5e1363ae136 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -14,7 +14,7 @@ obj-y					+= commoncap.o
 obj-$(CONFIG_MMU)			+= min_addr.o
 
 # Object file lists
-obj-$(CONFIG_SECURITY)			+= security.o capability.o
+obj-$(CONFIG_SECURITY)			+= security.o capability.o capsicum-rights.o
 obj-$(CONFIG_SECURITYFS)		+= inode.o
 obj-$(CONFIG_SECURITY_SELINUX)		+= selinux/
 obj-$(CONFIG_SECURITY_SMACK)		+= smack/
diff --git a/security/capsicum-rights.c b/security/capsicum-rights.c
new file mode 100644
index 000000000000..0a5695fa0e61
--- /dev/null
+++ b/security/capsicum-rights.c
@@ -0,0 +1,201 @@
+/*-
+ * Copyright (c) 2013 FreeBSD Foundation
+ * All rights reserved.
+ *
+ * This software was developed by Pawel Jakub Dawidek under sponsorship from
+ * the FreeBSD Foundation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <stdarg.h>
+#include <linux/capsicum.h>
+#include <linux/slab.h>
+#include <linux/fcntl.h>
+#include <linux/bug.h>
+
+#include "capsicum-rights.h"
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+#define CAPARSIZE_MIN	(CAP_RIGHTS_VERSION_00 + 2)
+#define CAPARSIZE_MAX	(CAP_RIGHTS_VERSION + 2)
+
+/*
+ * -1 indicates invalid index value, otherwise log2(v), i.e.:
+ * 0x001 -> 0, 0x002 -> 1, 0x004 -> 2, 0x008 -> 3, 0x010 -> 4, rest -> -1
+ */
+static const int bit2idx[] = {
+	-1, 0, 1, -1, 2, -1, -1, -1, 3, -1, -1, -1, -1, -1, -1, -1,
+	4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
+};
+
+static inline int right_to_index(__u64 right)
+{
+	return bit2idx[CAPIDXBIT(right)];
+}
+
+static inline bool has_right(const struct capsicum_rights *rights, u64 right)
+{
+	int idx = right_to_index(right);
+	return (rights->primary.cr_rights[idx] & right) == right;
+}
+
+struct capsicum_rights *
+cap_rights_vset(struct capsicum_rights *rights, va_list ap)
+{
+	u64 right;
+	int i, n;
+
+	n = CAPARSIZE(&rights->primary);
+	BUG_ON(n < CAPARSIZE_MIN || n > CAPARSIZE_MAX);
+
+	while (true) {
+		right = va_arg(ap, u64);
+		if (right == 0)
+			break;
+		BUG_ON(CAPRVER(right) != 0);
+		i = right_to_index(right);
+		BUG_ON(i < 0 || i >= n);
+		BUG_ON(CAPIDXBIT(rights->primary.cr_rights[i]) !=
+		       CAPIDXBIT(right));
+		rights->primary.cr_rights[i] |= right;
+	}
+	return rights;
+}
+EXPORT_SYMBOL(cap_rights_vset);
+
+struct capsicum_rights *
+cap_rights_vinit(struct capsicum_rights *rights, va_list ap)
+{
+	CAP_SET_NONE(&rights->primary);
+	rights->nioctls = 0;
+	rights->ioctls = NULL;
+	rights->fcntls = 0;
+	cap_rights_vset(rights, ap);
+	return rights;
+}
+EXPORT_SYMBOL(cap_rights_vinit);
+
+bool cap_rights_regularize(struct capsicum_rights *rights)
+{
+	bool changed = false;
+	if (!has_right(rights, CAP_FCNTL) && rights->fcntls != 0x00) {
+		changed = true;
+		rights->fcntls = 0x00;
+	}
+	if (!has_right(rights, CAP_IOCTL) && (rights->nioctls != 0)) {
+		changed = true;
+		kfree(rights->ioctls);
+		rights->nioctls = 0;
+		rights->ioctls = NULL;
+	}
+	return changed;
+}
+
+struct capsicum_rights *_cap_rights_init(struct capsicum_rights *rights, ...)
+{
+	va_list ap;
+	va_start(ap, rights);
+	cap_rights_vinit(rights, ap);
+	va_end(ap);
+	return rights;
+}
+EXPORT_SYMBOL(_cap_rights_init);
+
+struct capsicum_rights *_cap_rights_set(struct capsicum_rights *rights, ...)
+{
+	va_list ap;
+	va_start(ap, rights);
+	cap_rights_vset(rights, ap);
+	va_end(ap);
+	return rights;
+}
+EXPORT_SYMBOL(_cap_rights_set);
+
+struct capsicum_rights *cap_rights_set_all(struct capsicum_rights *rights)
+{
+	CAP_SET_ALL(&rights->primary);
+	rights->nioctls = -1;
+	rights->ioctls = NULL;
+	rights->fcntls = CAP_FCNTL_ALL;
+	return rights;
+}
+EXPORT_SYMBOL(cap_rights_set_all);
+
+static bool cap_rights_ioctls_contains(const struct capsicum_rights *big,
+				       const struct capsicum_rights *little)
+{
+	int i, j;
+
+	if (big->nioctls == -1)
+		return true;
+	if (little->nioctls == -1 || big->nioctls < little->nioctls)
+		return false;
+	for (i = 0; i < little->nioctls; i++) {
+		for (j = 0; j < big->nioctls; j++) {
+			if (little->ioctls[i] == big->ioctls[j])
+				break;
+		}
+		if (j == big->nioctls)
+			return false;
+	}
+	return true;
+}
+
+static bool cap_rights_primary_contains(const struct cap_rights *big,
+					const struct cap_rights *little)
+{
+	unsigned int i, n;
+
+	BUG_ON(CAPVER(big) != CAP_RIGHTS_VERSION_00);
+	BUG_ON(CAPVER(little) != CAP_RIGHTS_VERSION_00);
+
+	n = CAPARSIZE(big);
+	BUG_ON(n < CAPARSIZE_MIN || n > CAPARSIZE_MAX);
+
+	for (i = 0; i < n; i++) {
+		if ((big->cr_rights[i] & little->cr_rights[i]) !=
+		    little->cr_rights[i]) {
+			return false;
+		}
+	}
+	return true;
+}
+
+bool cap_rights_contains(const struct capsicum_rights *big,
+			const struct capsicum_rights *little)
+{
+	return cap_rights_primary_contains(&big->primary,
+					   &little->primary) &&
+	       ((big->fcntls & little->fcntls) == little->fcntls) &&
+	       cap_rights_ioctls_contains(big, little);
+}
+
+bool cap_rights_is_all(const struct capsicum_rights *rights)
+{
+	return CAP_IS_ALL(&rights->primary) &&
+	       rights->fcntls == CAP_FCNTL_ALL &&
+	       rights->nioctls == -1;
+}
+EXPORT_SYMBOL(cap_rights_is_all);
+
+#endif  /* CONFIG_SECURITY_CAPSICUM */
diff --git a/security/capsicum-rights.h b/security/capsicum-rights.h
new file mode 100644
index 000000000000..b7143e3d65b7
--- /dev/null
+++ b/security/capsicum-rights.h
@@ -0,0 +1,10 @@
+#ifndef _CAPSICUM_RIGHTS_H
+#define _CAPSICUM_RIGHTS_H
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+bool cap_rights_regularize(struct capsicum_rights *rights);
+bool cap_rights_contains(const struct capsicum_rights *big,
+			 const struct capsicum_rights *little);
+#endif
+
+#endif /* _CAPSICUM_RIGHTS_H */
-- 
2.0.0.526.g5318336
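For readers unfamiliar with the rights encoding used in capsicum-rights.c above, here is a minimal userspace sketch of how a right is mapped to an array slot and how containment is tested. It assumes a FreeBSD-style layout in which bits 57..61 of each 64-bit right form a one-hot index selecting the cr_rights[] element; the SK_* macros and sk_* functions are hypothetical stand-ins for illustration and are not the macros defined by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout (modelled on FreeBSD's capsicum.h, not taken from this
 * patch): bits 57..61 hold a one-hot index, the low bits hold the right. */
#define SK_IDXBIT(right)	((int)(((right) >> 57) & 0x1f))
#define SK_RIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))
#define SK_NSLOTS		2

/* Same table as bit2idx[] above: log2 of a one-hot value, else -1. */
static const int sk_bit2idx[32] = {
	-1, 0, 1, -1, 2, -1, -1, -1, 3, -1, -1, -1, -1, -1, -1, -1,
	4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
};

/* Slot that a right lives in, or -1 if its index field is not one-hot. */
static int sk_right_to_index(uint64_t right)
{
	return sk_bit2idx[SK_IDXBIT(right)];
}

/* Set one right; returns false (rather than BUG_ON) for a bad index. */
static bool sk_set(uint64_t slots[SK_NSLOTS], uint64_t right)
{
	int i = sk_right_to_index(right);

	if (i < 0 || i >= SK_NSLOTS)
		return false;
	slots[i] |= right;
	return true;
}

/* True if every bit set in little is also set in big, slot by slot. */
static bool sk_contains(const uint64_t big[SK_NSLOTS],
			const uint64_t little[SK_NSLOTS])
{
	for (int i = 0; i < SK_NSLOTS; i++)
		if ((big[i] & little[i]) != little[i])
			return false;
	return true;
}
```

Because the index bits are part of each right, OR-ing a right into its slot and masking it back out compare the index field as well as the permission bit, which is why a single AND per slot is enough for the containment test.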


* [PATCH 04/11] capsicum: implement fgetr() and friends
@ 2014-06-30 10:28   ` David Drysdale
  0 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Add variants of fget() and related functions where the caller
indicates the operations that will be performed on the file.

If CONFIG_SECURITY_CAPSICUM is defined, these variants build a
struct capsicum_rights instance holding the rights associated
with the file operations; this will allow a future hook to check
whether a rights-restricted file has those specific rights
available.

If CONFIG_SECURITY_CAPSICUM is not defined, these variants expand
to the underlying fget() function, with one difference: failures
are returned as an ERR_PTR value rather than just NULL.

Signed-off-by: David Drysdale <drysdale@google.com>
---
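To illustrate the calling-convention change described in the commit message, the following is a userspace sketch (not kernel code) of why converted callers must switch from a NULL test to an IS_ERR() test. The sk_* helpers are hypothetical stand-ins that imitate the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() from <linux/err.h>, and sk_fget() is a toy lookup over a two-entry table.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_ERRNO 4095

/* Userspace imitations of the kernel's ERR_PTR helpers (assumption:
 * error codes are encoded in the top 4095 values of the address space). */
static void *sk_err_ptr(long err) { return (void *)(intptr_t)err; }
static long sk_ptr_err(const void *p) { return (long)(intptr_t)p; }
static bool sk_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SK_MAX_ERRNO;
}

struct sk_file { int id; };
static struct sk_file sk_table[2] = { { 10 }, { 11 } };

/* Old convention: NULL on failure, no error code available. */
static struct sk_file *sk_fget(unsigned int fd)
{
	return fd < 2 ? &sk_table[fd] : NULL;
}

/* New convention: never NULL; failures carry an errno value. */
static struct sk_file *sk_fgetr(unsigned int fd)
{
	struct sk_file *f = sk_fget(fd);

	return f ? f : sk_err_ptr(-EBADF);
}

/* A converted caller: test for an error pointer, propagate its errno. */
static long sk_use_fd(unsigned int fd)
{
	struct sk_file *f = sk_fgetr(fd);

	if (sk_is_err(f))
		return sk_ptr_err(f);
	return f->id;
}
```

A caller that kept the old `if (!f)` test would treat the error pointer as a valid file, so each call site converted to the new variants needs its failure check updated at the same time.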
 fs/file.c             | 130 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/namei.c            |  49 ++++++++++++++++--
 fs/read_write.c       |   5 --
 include/linux/file.h  | 136 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/namei.h |   9 ++++
 5 files changed, 321 insertions(+), 8 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 8f294cfac697..562cc82ba442 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -13,6 +13,7 @@
 #include <linux/mmzone.h>
 #include <linux/time.h>
 #include <linux/sched.h>
+#include <linux/security.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/file.h>
@@ -722,6 +723,135 @@ unsigned long __fdget_pos(unsigned int fd)
 	return v;
 }
 
+#ifdef CONFIG_SECURITY_CAPSICUM
+/*
+ * The LSM might want to change the return value of fget() and friends.
+ * This function is called with the intended return value, and fget()
+ * will /actually/ return whatever is returned from here. We call an
+ * LSM hook, and return what it returns. We adjust the reference counter
+ * if necessary.
+ */
+static struct file *unwrap_file(struct file *orig,
+				const struct capsicum_rights *required_rights,
+				const struct capsicum_rights **actual_rights,
+				bool update_refcnt)
+{
+	struct file *f;
+
+	if (orig == NULL)
+		return ERR_PTR(-EBADF);
+	if (IS_ERR(orig))
+		return orig;
+	f = orig;  /* TODO: pass to an LSM hook here */
+	if (f != orig && update_refcnt) {
+		/* We're not returning the original, and the calling code
+		 * has already incremented the refcount on it, so we need to
+		 * release that reference and obtain a reference to the new
+		 * return value, if any.
+		 */
+		if (!IS_ERR(f) && !atomic_long_inc_not_zero(&f->f_count))
+			f = ERR_PTR(-EBADF);
+		atomic_long_dec(&orig->f_count);
+	}
+
+	return f;
+}
+
+struct file *fget_rights(unsigned int fd, const struct capsicum_rights *rights)
+{
+	return unwrap_file(fget(fd), rights, NULL, true);
+}
+EXPORT_SYMBOL(fget_rights);
+
+struct file *fget_raw_rights(unsigned int fd,
+			     const struct capsicum_rights *rights)
+{
+	return unwrap_file(fget_raw(fd), rights, NULL, true);
+}
+EXPORT_SYMBOL(fget_raw_rights);
+
+struct fd fdget_rights(unsigned int fd, const struct capsicum_rights *rights)
+{
+	struct fd f = fdget(fd);
+	f.file = unwrap_file(f.file, rights, NULL, (f.flags & FDPUT_FPUT));
+	return f;
+}
+EXPORT_SYMBOL(fdget_rights);
+
+struct fd fdget_raw_rights(unsigned int fd,
+			   const struct capsicum_rights **actual_rights,
+			   const struct capsicum_rights *rights)
+{
+	struct fd f = fdget_raw(fd);
+	f.file = unwrap_file(f.file, rights, actual_rights,
+			     (f.flags & FDPUT_FPUT));
+	return f;
+}
+EXPORT_SYMBOL(fdget_raw_rights);
+
+struct file *_fgetr(unsigned int fd, ...)
+{
+	struct capsicum_rights rights;
+	struct file *f;
+	va_list ap;
+	va_start(ap, fd);
+	f = fget_rights(fd, cap_rights_vinit(&rights, ap));
+	va_end(ap);
+	return f;
+}
+EXPORT_SYMBOL(_fgetr);
+
+struct file *_fgetr_raw(unsigned int fd, ...)
+{
+	struct capsicum_rights rights;
+	struct file *f;
+	va_list ap;
+	va_start(ap, fd);
+	f = fget_raw_rights(fd, cap_rights_vinit(&rights, ap));
+	va_end(ap);
+	return f;
+}
+EXPORT_SYMBOL(_fgetr_raw);
+
+struct fd _fdgetr(unsigned int fd, ...)
+{
+	struct fd f;
+	struct capsicum_rights rights;
+	va_list ap;
+	va_start(ap, fd);
+	f = fdget_rights(fd, cap_rights_vinit(&rights, ap));
+	va_end(ap);
+	return f;
+}
+EXPORT_SYMBOL(_fdgetr);
+
+struct fd _fdgetr_raw(unsigned int fd, ...)
+{
+	struct fd f;
+	struct capsicum_rights rights;
+	va_list ap;
+	va_start(ap, fd);
+	f = fdget_raw_rights(fd, NULL, cap_rights_vinit(&rights, ap));
+	va_end(ap);
+	return f;
+}
+EXPORT_SYMBOL(_fdgetr_raw);
+
+struct fd _fdgetr_pos(unsigned int fd, ...)
+{
+	struct fd f;
+	struct capsicum_rights rights;
+	va_list ap;
+	f = __to_fd(__fdget_pos(fd));
+	va_start(ap, fd);
+	f.file = unwrap_file(f.file, cap_rights_vinit(&rights, ap), NULL,
+			     (f.flags & FDPUT_FPUT));
+	va_end(ap);
+	return f;
+}
+EXPORT_SYMBOL(_fdgetr_pos);
+#endif
+
 /*
  * We only lock f_pos if we have threads or if the file might be
  * shared with another process. In both cases we'll have an elevated
diff --git a/fs/namei.c b/fs/namei.c
index e6b72531dfc7..c93f7993960e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -646,6 +646,19 @@ static __always_inline void set_root(struct nameidata *nd)
 		get_fs_root(current->fs, &nd->root);
 }
 
+/*
+ * Retrieval of files against a directory file descriptor requires
+ * CAP_LOOKUP. As this is common in this file, set up the required rights once
+ * and for all.
+ */
+static struct capsicum_rights lookup_rights;
+static int __init init_lookup_rights(void)
+{
+	cap_rights_init(&lookup_rights, CAP_LOOKUP);
+	return 0;
+}
+fs_initcall(init_lookup_rights);
+
 static int link_path_walk(const char *, struct nameidata *, unsigned int);
 
 static __always_inline void set_root_rcu(struct nameidata *nd)
@@ -2135,8 +2148,12 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
 }
 EXPORT_SYMBOL(lookup_one_len);
 
-int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
-		 struct path *path, int *empty)
+static int user_path_at_empty_rights(int dfd,
+				const char __user *name,
+				unsigned flags,
+				struct path *path,
+				int *empty,
+				const struct capsicum_rights *rights)
 {
 	struct nameidata nd;
 	struct filename *tmp = getname_flags(name, flags, empty);
@@ -2153,13 +2170,39 @@ int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
 	return err;
 }
 
+int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
+		 struct path *path, int *empty)
+{
+	return user_path_at_empty_rights(dfd, name, flags, path, empty,
+					 &lookup_rights);
+}
+
 int user_path_at(int dfd, const char __user *name, unsigned flags,
 		 struct path *path)
 {
-	return user_path_at_empty(dfd, name, flags, path, NULL);
+	return user_path_at_empty_rights(dfd, name, flags, path, NULL,
+					 &lookup_rights);
 }
 EXPORT_SYMBOL(user_path_at);
 
+#ifdef CONFIG_SECURITY_CAPSICUM
+int _user_path_atr(int dfd,
+		   const char __user *name,
+		   unsigned flags,
+		   struct path *path,
+		   ...)
+{
+	struct capsicum_rights rights;
+	int rc;
+	va_list ap;
+	va_start(ap, path);
+	rc = user_path_at_empty_rights(dfd, name, flags, path, NULL,
+				       cap_rights_vinit(&rights, ap));
+	va_end(ap);
+	return rc;
+}
+#endif
+
 /*
  * NB: most callers don't do anything directly with the reference to the
  *     to struct filename, but the nd->last pointer points into the name string
diff --git a/fs/read_write.c b/fs/read_write.c
index 31c6efa43183..bd4cc3770b42 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -264,11 +264,6 @@ loff_t vfs_llseek(struct file *file, loff_t offset, int whence)
 }
 EXPORT_SYMBOL(vfs_llseek);
 
-static inline struct fd fdget_pos(int fd)
-{
-	return __to_fd(__fdget_pos(fd));
-}
-
 static inline void fdput_pos(struct fd f)
 {
 	if (f.flags & FDPUT_POS_UNLOCK)
diff --git a/include/linux/file.h b/include/linux/file.h
index 4d69123377a2..22952e26ab19 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -8,6 +8,8 @@
 #include <linux/compiler.h>
 #include <linux/types.h>
 #include <linux/posix_types.h>
+#include <linux/err.h>
+#include <linux/capsicum.h>
 
 struct file;
 
@@ -39,6 +41,21 @@ static inline void fdput(struct fd fd)
 		fput(fd.file);
 }
 
+/*
+ * The base functions for converting a file descriptor to a struct file are:
+ *  - fget() always increments refcount, doesn't work on O_PATH files.
+ *  - fget_raw() always increments refcount, and does work on O_PATH files.
+ *  - fdget() only increments refcount if needed, doesn't work on O_PATH files.
+ *  - fdget_raw() only increments refcount if needed, works on O_PATH files.
+ *  - fdget_pos() as fdget(), but also locks the file position lock (for
+ *    operations that POSIX requires to be atomic w.r.t. file position).
+ * These functions return NULL on failure, and return the actual entry in the
+ * fdtable (which may be a wrapper if the file is a Capsicum capability).
+ *
+ * These functions should normally only be used when a file is being
+ * transferred (e.g. dup(2)) or manipulated as-is; normal users should stick
+ * to the fgetr() variants below.
+ */
 extern struct file *fget(unsigned int fd);
 extern struct file *fget_raw(unsigned int fd);
 extern unsigned long __fdget(unsigned int fd);
@@ -60,6 +77,125 @@ static inline struct fd fdget_raw(unsigned int fd)
 	return __to_fd(__fdget_raw(fd));
 }
 
+static inline struct fd fdget_pos(unsigned int fd)
+{
+	return __to_fd(__fdget_pos(fd));
+}
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+/*
+ * The full unwrapping variant functions are:
+ *  - fget_rights()
+ *  - fget_raw_rights()
+ *  - fdget_rights()
+ *  - fdget_raw_rights()
+ * These versions have the same behavior as the equivalent base functions, but:
+ *  - They also take a struct capsicum_rights argument describing the details
+ *    of the operations to be performed on the file.
+ *  - They remove any Capsicum capability wrapper for the file, returning the
+ *    normal underlying file.
+ *  - They return an ERR_PTR on failure (typically with either -EBADF for an
+ *    unrecognized FD, or -ENOTCAPABLE for a Capsicum capability FD that does
+ *    not have the requisite rights).
+ *
+ * The fdget_raw_rights() function also optionally returns the actual Capsicum
+ * rights associated with the file descriptor; the caller should only access
+ * this structure while it holds a reference to the file.
+ *
+ * These functions should normally only be used:
+ *  - when the operation being performed on the file requires more detailed
+ *    specification (in particular: the ioctl(2) or fcntl(2) command invoked)
+ *  - (for fdget_raw_rights()) when a new file descriptor will be created from
+ *    this file descriptor, and so should potentially inherit its rights (if
+ *    it is a Capsicum capability file descriptor).
+ * Otherwise users should stick to the simpler fgetr() variants below.
+ */
+extern struct file *fget_rights(unsigned int fd,
+				const struct capsicum_rights *rights);
+extern struct file *fget_raw_rights(unsigned int fd,
+				    const struct capsicum_rights *rights);
+extern struct fd fdget_rights(unsigned int fd,
+			      const struct capsicum_rights *rights);
+extern struct fd fdget_raw_rights(unsigned int fd,
+				  const struct capsicum_rights **actual_rights,
+				  const struct capsicum_rights *rights);
+
+/*
+ * The simple unwrapping variant functions are:
+ *  - fgetr()
+ *  - fgetr_raw()
+ *  - fdgetr()
+ *  - fdgetr_raw()
+ *  - fdgetr_pos()
+ * These versions have the same behavior as the equivalent base functions, but:
+ *  - They also take variable arguments indicating the operations to be
+ *    performed on the file.
+ *  - They remove any Capsicum capability wrapper for the file, returning the
+ *    normal underlying file.
+ *  - They return an ERR_PTR on failure (typically with either -EBADF for an
+ *    unrecognized FD, or -ENOTCAPABLE for a Capsicum capability FD that does
+ *    not have the requisite rights).
+ *
+ * These functions should normally be used for FD->file conversion.
+ */
+#define fgetr(fd, ...)		_fgetr((fd), __VA_ARGS__, CAP_LIST_END)
+#define fgetr_raw(fd, ...)	_fgetr_raw((fd), __VA_ARGS__, CAP_LIST_END)
+#define fdgetr(fd, ...)	_fdgetr((fd), __VA_ARGS__, CAP_LIST_END)
+#define fdgetr_raw(fd, ...)	_fdgetr_raw((fd), __VA_ARGS__, CAP_LIST_END)
+#define fdgetr_pos(fd, ...)	_fdgetr_pos((fd), __VA_ARGS__, CAP_LIST_END)
+extern struct file *_fgetr(unsigned int fd, ...);
+extern struct file *_fgetr_raw(unsigned int fd, ...);
+extern struct fd _fdgetr(unsigned int fd, ...);
+extern struct fd _fdgetr_raw(unsigned int fd, ...);
+extern struct fd _fdgetr_pos(unsigned int fd, ...);
+
+#else
+/*
+ * In a non-Capsicum build, all rights-checking fget() variants fall back to the
+ * normal versions (but still return errors as ERR_PTR values not just NULL).
+ */
+static inline struct file *fget_rights(unsigned int fd,
+				       const struct capsicum_rights *rights)
+{
+	return fget(fd) ?: ERR_PTR(-EBADF);
+}
+static inline struct file *fget_raw_rights(unsigned int fd,
+					   const struct capsicum_rights *rights)
+{
+	return fget_raw(fd) ?: ERR_PTR(-EBADF);
+}
+static inline struct fd fdget_rights(unsigned int fd,
+				     const struct capsicum_rights *rights)
+{
+	struct fd f = fdget(fd);
+	if (f.file == NULL)
+		f.file = ERR_PTR(-EBADF);
+	return f;
+}
+static inline struct fd
+fdget_raw_rights(unsigned int fd,
+		 const struct capsicum_rights **actual_rights,
+		 const struct capsicum_rights *rights)
+{
+	struct fd f = fdget_raw(fd);
+	if (f.file == NULL)
+		f.file = ERR_PTR(-EBADF);
+	return f;
+}
+
+#define fgetr(fd, ...)		(fget(fd) ?: ERR_PTR(-EBADF))
+#define fgetr_raw(fd, ...)	(fget_raw(fd) ?: ERR_PTR(-EBADF))
+#define fdgetr(fd, ...)	fdget_rights((fd), NULL)
+#define fdgetr_raw(fd, ...)	fdget_raw_rights((fd), NULL, NULL)
+static inline struct fd fdgetr_pos(unsigned int fd, ...)
+{
+	struct fd f = fdget_pos(fd);
+	if (f.file == NULL)
+		f.file = ERR_PTR(-EBADF);
+	return f;
+}
+#endif
+
 extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
 extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
 extern void set_close_on_exec(unsigned int fd, int flag);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index cd56c50109fc..ce6f2fe11bcd 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -59,6 +59,15 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 
 extern int user_path_at(int, const char __user *, unsigned, struct path *);
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
+#ifdef CONFIG_SECURITY_CAPSICUM
+extern int _user_path_atr(int, const char __user *, unsigned,
+			  struct path *, ...);
+#define user_path_atr(f, n, x, p, ...) \
+	_user_path_atr((f), (n), (x), (p), __VA_ARGS__, 0ULL)
+#else
+#define user_path_atr(f, n, x, p, ...) \
+	user_path_at((f), (n), (x), (p))
+#endif
 
 #define user_path(name, path) user_path_at(AT_FDCWD, name, LOOKUP_FOLLOW, path)
 #define user_lpath(name, path) user_path_at(AT_FDCWD, name, 0, path)
-- 
2.0.0.526.g5318336
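The fgetr()-style macros in the patch above append a terminator to the caller's list of rights so that the variadic helper knows where to stop. A minimal userspace sketch of that pattern follows. SK_LIST_END, sk_collect(), sk_collect_v() and the SK_* right values are hypothetical names for illustration only, and the terminator is assumed to be a 64-bit zero, as the 0ULL passed by user_path_atr() suggests.

```c
#include <assert.h>
#include <stdarg.h>

#define SK_LIST_END	0ULL		/* assumed terminator value */
#define SK_READ		0x1ULL		/* illustrative right bits only */
#define SK_WRITE	0x2ULL
#define SK_SEEK		0x4ULL

/* Callers use the macro; it appends the terminator for them. */
#define sk_collect(tag, ...) \
	sk_collect_v((tag), __VA_ARGS__, SK_LIST_END)

/* OR together every 64-bit right up to (not including) the terminator. */
static unsigned long long sk_collect_v(int tag, ...)
{
	unsigned long long mask = 0, right;
	va_list ap;

	(void)tag;	/* stands in for the fd argument of _fgetr() */
	va_start(ap, tag);
	while ((right = va_arg(ap, unsigned long long)) != SK_LIST_END)
		mask |= right;
	va_end(ap);
	return mask;
}
```

Every argument has to be passed as a 64-bit value, since va_arg() would read a plain int with the wrong width. The `__VA_ARGS__, END` form also means that at least one right must be supplied: an empty list would leave a stray comma in the expansion.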



* [PATCH 04/11] capsicum: implement fgetr() and friends
@ 2014-06-30 10:28   ` David Drysdale
  0 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api-u79uwXL29TY76Z2rM5mHXA, David Drysdale

Add variants of fget() and related functions where the caller
indicates the operations that will be performed on the file.

If CONFIG_SECURITY_CAPSICUM is defined, these variants build a
struct capsicum_rights instance holding the rights associated
with the file operations; this will allow a future hook to check
whether a rights-restricted file has those specific rights
available.

If CONFIG_SECURITY_CAPSICUM is not defined, these variants expand
to the underlying fget() function, with one difference: failures
are returned as an ERR_PTR value rather than just NULL.

Signed-off-by: David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 fs/file.c             | 130 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/namei.c            |  49 ++++++++++++++++--
 fs/read_write.c       |   5 --
 include/linux/file.h  | 136 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/namei.h |   9 ++++
 5 files changed, 321 insertions(+), 8 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 8f294cfac697..562cc82ba442 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -13,6 +13,7 @@
 #include <linux/mmzone.h>
 #include <linux/time.h>
 #include <linux/sched.h>
+#include <linux/security.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/file.h>
@@ -722,6 +723,135 @@ unsigned long __fdget_pos(unsigned int fd)
 	return v;
 }
 
+#ifdef CONFIG_SECURITY_CAPSICUM
+/*
+ * The LSM might want to change the return value of fget() and friends.
+ * This function is called with the intended return value, and fget()
+ * will /actually/ return whatever is returned from here. We call an
+ * LSM hook, and return what it returns. We adjust the reference counter
+ * if necessary.
+ */
+static struct file *unwrap_file(struct file *orig,
+				const struct capsicum_rights *required_rights,
+				const struct capsicum_rights **actual_rights,
+				bool update_refcnt)
+{
+	struct file *f;
+
+	if (orig == NULL)
+		return ERR_PTR(-EBADF);
+	if (IS_ERR(orig))
+		return orig;
+	f = orig;  /* TODO: pass to an LSM hook here */
+	if (f != orig && update_refcnt) {
+		/* We're not returning the original, and the calling code
+		 * has already incremented the refcount on it, so we need to
+		 * release that reference and obtain a reference to the new
+		 * return value, if any.
+		 */
+		if (!IS_ERR(f) && !atomic_long_inc_not_zero(&f->f_count))
+			f = ERR_PTR(-EBADF);
+		atomic_long_dec(&orig->f_count);
+	}
+
+	return f;
+}
+
+struct file *fget_rights(unsigned int fd, const struct capsicum_rights *rights)
+{
+	return unwrap_file(fget(fd), rights, NULL, true);
+}
+EXPORT_SYMBOL(fget_rights);
+
+struct file *fget_raw_rights(unsigned int fd,
+			     const struct capsicum_rights *rights)
+{
+	return unwrap_file(fget_raw(fd), rights, NULL, true);
+}
+EXPORT_SYMBOL(fget_raw_rights);
+
+struct fd fdget_rights(unsigned int fd, const struct capsicum_rights *rights)
+{
+	struct fd f = fdget(fd);
+	f.file = unwrap_file(f.file, rights, NULL, (f.flags & FDPUT_FPUT));
+	return f;
+}
+EXPORT_SYMBOL(fdget_rights);
+
+struct fd fdget_raw_rights(unsigned int fd,
+			   const struct capsicum_rights **actual_rights,
+			   const struct capsicum_rights *rights)
+{
+	struct fd f = fdget_raw(fd);
+	f.file = unwrap_file(f.file, rights, actual_rights,
+			     (f.flags & FDPUT_FPUT));
+	return f;
+}
+EXPORT_SYMBOL(fdget_raw_rights);
+
+struct file *_fgetr(unsigned int fd, ...)
+{
+	struct capsicum_rights rights;
+	struct file *f;
+	va_list ap;
+	va_start(ap, fd);
+	f = fget_rights(fd, cap_rights_vinit(&rights, ap));
+	va_end(ap);
+	return f;
+}
+EXPORT_SYMBOL(_fgetr);
+
+struct file *_fgetr_raw(unsigned int fd, ...)
+{
+	struct capsicum_rights rights;
+	struct file *f;
+	va_list ap;
+	va_start(ap, fd);
+	f = fget_raw_rights(fd, cap_rights_vinit(&rights, ap));
+	va_end(ap);
+	return f;
+}
+EXPORT_SYMBOL(_fgetr_raw);
+
+struct fd _fdgetr(unsigned int fd, ...)
+{
+	struct fd f;
+	struct capsicum_rights rights;
+	va_list ap;
+	va_start(ap, fd);
+	f = fdget_rights(fd, cap_rights_vinit(&rights, ap));
+	va_end(ap);
+	return f;
+}
+EXPORT_SYMBOL(_fdgetr);
+
+struct fd _fdgetr_raw(unsigned int fd, ...)
+{
+	struct fd f;
+	struct capsicum_rights rights;
+	va_list ap;
+	va_start(ap, fd);
+	f = fdget_raw_rights(fd, NULL, cap_rights_vinit(&rights, ap));
+	va_end(ap);
+	return f;
+}
+EXPORT_SYMBOL(_fdgetr_raw);
+
+struct fd _fdgetr_pos(unsigned int fd, ...)
+{
+	struct fd f;
+	struct capsicum_rights rights;
+	va_list ap;
+	f = __to_fd(__fdget_pos(fd));
+	va_start(ap, fd);
+	f.file = unwrap_file(f.file, cap_rights_vinit(&rights, ap), NULL,
+			     (f.flags & FDPUT_FPUT));
+	va_end(ap);
+	return f;
+}
+EXPORT_SYMBOL(_fdgetr_pos);
+#endif
+
 /*
  * We only lock f_pos if we have threads or if the file might be
  * shared with another process. In both cases we'll have an elevated
diff --git a/fs/namei.c b/fs/namei.c
index e6b72531dfc7..c93f7993960e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -646,6 +646,19 @@ static __always_inline void set_root(struct nameidata *nd)
 		get_fs_root(current->fs, &nd->root);
 }
 
+/*
+ * Retrieval of files against a directory file descriptor requires
+ * CAP_LOOKUP. As this is common in this file, set up the required rights once
+ * and for all.
+ */
+static struct capsicum_rights lookup_rights;
+static int __init init_lookup_rights(void)
+{
+	cap_rights_init(&lookup_rights, CAP_LOOKUP);
+	return 0;
+}
+fs_initcall(init_lookup_rights);
+
 static int link_path_walk(const char *, struct nameidata *, unsigned int);
 
 static __always_inline void set_root_rcu(struct nameidata *nd)
@@ -2135,8 +2148,12 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
 }
 EXPORT_SYMBOL(lookup_one_len);
 
-int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
-		 struct path *path, int *empty)
+static int user_path_at_empty_rights(int dfd,
+				const char __user *name,
+				unsigned flags,
+				struct path *path,
+				int *empty,
+				const struct capsicum_rights *rights)
 {
 	struct nameidata nd;
 	struct filename *tmp = getname_flags(name, flags, empty);
@@ -2153,13 +2170,39 @@ int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
 	return err;
 }
 
+int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
+		 struct path *path, int *empty)
+{
+	return user_path_at_empty_rights(dfd, name, flags, path, empty,
+					 &lookup_rights);
+}
+
 int user_path_at(int dfd, const char __user *name, unsigned flags,
 		 struct path *path)
 {
-	return user_path_at_empty(dfd, name, flags, path, NULL);
+	return user_path_at_empty_rights(dfd, name, flags, path, NULL,
+					 &lookup_rights);
 }
 EXPORT_SYMBOL(user_path_at);
 
+#ifdef CONFIG_SECURITY_CAPSICUM
+int _user_path_atr(int dfd,
+		   const char __user *name,
+		   unsigned flags,
+		   struct path *path,
+		   ...)
+{
+	struct capsicum_rights rights;
+	int rc;
+	va_list ap;
+	va_start(ap, path);
+	rc = user_path_at_empty_rights(dfd, name, flags, path, NULL,
+				       cap_rights_vinit(&rights, ap));
+	va_end(ap);
+	return rc;
+}
+#endif
+
 /*
  * NB: most callers don't do anything directly with the reference to the
  *     to struct filename, but the nd->last pointer points into the name string
diff --git a/fs/read_write.c b/fs/read_write.c
index 31c6efa43183..bd4cc3770b42 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -264,11 +264,6 @@ loff_t vfs_llseek(struct file *file, loff_t offset, int whence)
 }
 EXPORT_SYMBOL(vfs_llseek);
 
-static inline struct fd fdget_pos(int fd)
-{
-	return __to_fd(__fdget_pos(fd));
-}
-
 static inline void fdput_pos(struct fd f)
 {
 	if (f.flags & FDPUT_POS_UNLOCK)
diff --git a/include/linux/file.h b/include/linux/file.h
index 4d69123377a2..22952e26ab19 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -8,6 +8,8 @@
 #include <linux/compiler.h>
 #include <linux/types.h>
 #include <linux/posix_types.h>
+#include <linux/err.h>
+#include <linux/capsicum.h>
 
 struct file;
 
@@ -39,6 +41,21 @@ static inline void fdput(struct fd fd)
 		fput(fd.file);
 }
 
+/*
+ * The base functions for converting a file descriptor to a struct file are:
+ *  - fget() always increments refcount, doesn't work on O_PATH files.
+ *  - fget_raw() always increments refcount, and does work on O_PATH files.
+ *  - fdget() only increments refcount if needed, doesn't work on O_PATH files.
+ *  - fdget_raw() only increments refcount if needed, works on O_PATH files.
+ *  - fdget_pos() as fdget(), but also locks the file position lock (for
+ *    operations that POSIX requires to be atomic w.r.t. file position).
+ * These functions return NULL on failure, and return the actual entry in the
+ * fdtable (which may be a wrapper if the file is a Capsicum capability).
+ *
+ * These functions should normally only be used when a file is being
+ * transferred (e.g. dup(2)) or manipulated as-is; normal users should stick
+ * to the fgetr() variants below.
+ */
 extern struct file *fget(unsigned int fd);
 extern struct file *fget_raw(unsigned int fd);
 extern unsigned long __fdget(unsigned int fd);
@@ -60,6 +77,125 @@ static inline struct fd fdget_raw(unsigned int fd)
 	return __to_fd(__fdget_raw(fd));
 }
 
+static inline struct fd fdget_pos(unsigned int fd)
+{
+	return __to_fd(__fdget_pos(fd));
+}
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+/*
+ * The full unwrapping variant functions are:
+ *  - fget_rights()
+ *  - fget_raw_rights()
+ *  - fdget_rights()
+ *  - fdget_raw_rights()
+ * These versions have the same behavior as the equivalent base functions, but:
+ *  - They also take a struct capsicum_rights argument describing the details
+ *    of the operations to be performed on the file.
+ *  - They remove any Capsicum capability wrapper for the file, returning the
+ *    normal underlying file.
+ *  - They return an ERR_PTR on failure (typically with either -EBADF for an
+ *    unrecognized FD, or -ENOTCAPABLE for a Capsicum capability FD that does
+ *    not have the requisite rights).
+ *
+ * The fdget_raw_rights() function also optionally returns the actual Capsicum
+ * rights associated with the file descriptor; the caller should only access
+ * this structure while it holds a reference to the file.
+ *
+ * These functions should normally only be used:
+ *  - when the operation being performed on the file requires more detailed
+ *    specification (in particular: the ioctl(2) or fcntl(2) command invoked)
+ *  - (for fdget_raw_rights()) when a new file descriptor will be created from
+ *    this file descriptor, and so should potentially inherit its rights (if
+ *    it is a Capsicum capability file descriptor).
+ * Otherwise users should stick to the simpler fgetr() variants below.
+ */
+extern struct file *fget_rights(unsigned int fd,
+				const struct capsicum_rights *rights);
+extern struct file *fget_raw_rights(unsigned int fd,
+				    const struct capsicum_rights *rights);
+extern struct fd fdget_rights(unsigned int fd,
+			      const struct capsicum_rights *rights);
+extern struct fd fdget_raw_rights(unsigned int fd,
+				  const struct capsicum_rights **actual_rights,
+				  const struct capsicum_rights *rights);
+
+/*
+ * The simple unwrapping variant functions are:
+ *  - fgetr()
+ *  - fgetr_raw()
+ *  - fdgetr()
+ *  - fdgetr_raw()
+ *  - fdgetr_pos()
+ * These versions have the same behavior as the equivalent base functions, but:
+ *  - They also take variable arguments indicating the operations to be
+ *    performed on the file.
+ *  - They remove any Capsicum capability wrapper for the file, returning the
+ *    normal underlying file.
+ *  - They return an ERR_PTR on failure (typically with either -EBADF for an
+ *    unrecognized FD, or -ENOTCAPABLE for a Capsicum capability FD that does
+ *    not have the requisite rights).
+ *
+ * These functions should normally be used for FD->file conversion.
+ */
+#define fgetr(fd, ...)		_fgetr((fd), __VA_ARGS__, CAP_LIST_END)
+#define fgetr_raw(fd, ...)	_fgetr_raw((fd), __VA_ARGS__, CAP_LIST_END)
+#define fdgetr(fd, ...)	_fdgetr((fd), __VA_ARGS__, CAP_LIST_END)
+#define fdgetr_raw(fd, ...)	_fdgetr_raw((fd), __VA_ARGS__, CAP_LIST_END)
+#define fdgetr_pos(fd, ...)	_fdgetr_pos((fd), __VA_ARGS__, CAP_LIST_END)
+extern struct file *_fgetr(unsigned int fd, ...);
+extern struct file *_fgetr_raw(unsigned int fd, ...);
+extern struct fd _fdgetr(unsigned int fd, ...);
+extern struct fd _fdgetr_raw(unsigned int fd, ...);
+extern struct fd _fdgetr_pos(unsigned int fd, ...);
+
+#else
+/*
+ * In a non-Capsicum build, all rights-checking fget() variants fall back to the
+ * normal versions (but still return errors as ERR_PTR values not just NULL).
+ */
+static inline struct file *fget_rights(unsigned int fd,
+				       const struct capsicum_rights *rights)
+{
+	return fget(fd) ?: ERR_PTR(-EBADF);
+}
+static inline struct file *fget_raw_rights(unsigned int fd,
+					   const struct capsicum_rights *rights)
+{
+	return fget_raw(fd) ?: ERR_PTR(-EBADF);
+}
+static inline struct fd fdget_rights(unsigned int fd,
+				     const struct capsicum_rights *rights)
+{
+	struct fd f = fdget(fd);
+	if (f.file == NULL)
+		f.file = ERR_PTR(-EBADF);
+	return f;
+}
+static inline struct fd
+fdget_raw_rights(unsigned int fd,
+		 const struct capsicum_rights **actual_rights,
+		 const struct capsicum_rights *rights)
+{
+	struct fd f = fdget_raw(fd);
+	if (f.file == NULL)
+		f.file = ERR_PTR(-EBADF);
+	return f;
+}
+
+#define fgetr(fd, ...)		(fget(fd) ?: ERR_PTR(-EBADF))
+#define fgetr_raw(fd, ...)	(fget_raw(fd) ?: ERR_PTR(-EBADF))
+#define fdgetr(fd, ...)	fdget_rights((fd), NULL)
+#define fdgetr_raw(fd, ...)	fdget_raw_rights((fd), NULL, NULL)
+static inline struct fd fdgetr_pos(unsigned int fd, ...)
+{
+	struct fd f = fdget_pos(fd);
+	if (f.file == NULL)
+		f.file = ERR_PTR(-EBADF);
+	return f;
+}
+#endif
+
 extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
 extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
 extern void set_close_on_exec(unsigned int fd, int flag);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index cd56c50109fc..ce6f2fe11bcd 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -59,6 +59,15 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 
 extern int user_path_at(int, const char __user *, unsigned, struct path *);
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
+#ifdef CONFIG_SECURITY_CAPSICUM
+extern int _user_path_atr(int, const char __user *, unsigned,
+			  struct path *, ...);
+#define user_path_atr(f, n, x, p, ...) \
+	_user_path_atr((f), (n), (x), (p), __VA_ARGS__, 0ULL)
+#else
+#define user_path_atr(f, n, x, p, ...) \
+	user_path_at((f), (n), (x), (p))
+#endif
 
 #define user_path(name, path) user_path_at(AT_FDCWD, name, LOOKUP_FOLLOW, path)
 #define user_lpath(name, path) user_path_at(AT_FDCWD, name, 0, path)
-- 
2.0.0.526.g5318336

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 05/11] capsicum: convert callers to use fgetr() etc
  2014-06-30 10:28 [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) David Drysdale
                   ` (3 preceding siblings ...)
  2014-06-30 10:28   ` David Drysdale
@ 2014-06-30 10:28 ` David Drysdale
  2014-06-30 10:28 ` [PATCH 06/11] capsicum: implement sockfd_lookupr() David Drysdale
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Convert places that use fget()-like functions to use the
equivalent fgetr() variant instead.

Annotate each such call with an indication of what operations will
be performed on the retrieved struct file, to allow future policing
of rights associated with file descriptors.

Also change each call site to cope with an ERR_PTR return from
fgetr() rather than a plain NULL failure from fget().

Signed-off-by: David Drysdale <drysdale@google.com>
---
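
For reviewers skimming the conversions below, here is a rough userspace
model of the call-site pattern described above. It is only a sketch under
stated assumptions: the model_* names, the MODEL_CAP_* bit values and the
MODEL_ENOTCAPABLE number are all made up for illustration, and the three
err.h helpers are simplified stand-ins for the kernel ones. It shows a
sentinel-terminated rights list being collected, and a caller that
propagates the error pointer rather than assuming -EBADF.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(intptr_t err)
{
	assert(err < 0 && err >= -MAX_ERRNO);	/* model-only sanity check */
	return (void *)err;
}
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical values: the real encodings live elsewhere in the series. */
#define MODEL_ENOTCAPABLE	134
#define MODEL_CAP_LIST_END	0ULL
#define MODEL_CAP_READ		(1ULL << 0)
#define MODEL_CAP_WRITE		(1ULL << 1)

struct model_file { const char *name; };

struct model_fd_entry {
	struct model_file *file;	/* NULL means the slot is unused */
	unsigned long long rights;	/* only meaningful if wrapped */
	int wrapped;			/* non-zero: capability-style FD */
};

static struct model_file plain_file = { "plain" };
static struct model_file capped_file = { "read-only capability" };

/* fd 0: ordinary file; fd 1: capability restricted to reading. */
static struct model_fd_entry model_fd_table[] = {
	{ &plain_file, 0, 0 },
	{ &capped_file, MODEL_CAP_READ, 1 },
};

/* Collect the requested rights up to the sentinel, then check them. */
static struct model_file *model_fgetr_va(unsigned int fd, ...)
{
	unsigned long long needed = 0, right;
	size_t nfds = sizeof(model_fd_table) / sizeof(model_fd_table[0]);
	va_list ap;

	va_start(ap, fd);
	while ((right = va_arg(ap, unsigned long long)) != MODEL_CAP_LIST_END)
		needed |= right;
	va_end(ap);

	if (fd >= nfds || !model_fd_table[fd].file)
		return ERR_PTR(-EBADF);
	if (model_fd_table[fd].wrapped && (needed & ~model_fd_table[fd].rights))
		return ERR_PTR(-MODEL_ENOTCAPABLE);
	return model_fd_table[fd].file;
}
/* The macro appends the sentinel so callers never have to. */
#define model_fgetr(fd, ...) \
	model_fgetr_va((fd), __VA_ARGS__, MODEL_CAP_LIST_END)

/* Converted call sites: test IS_ERR() and pass the errno through. */
static long model_do_read(unsigned int fd)
{
	struct model_file *file = model_fgetr(fd, MODEL_CAP_READ);

	if (IS_ERR(file))
		return PTR_ERR(file);
	return 0;
}

static long model_do_write(unsigned int fd)
{
	struct model_file *file = model_fgetr(fd, MODEL_CAP_WRITE);

	if (IS_ERR(file))
		return PTR_ERR(file);
	return 0;
}
```

In this model, writing through the read-only capability fails with the
"not capable" errno, while an unused descriptor still fails with -EBADF.
A call site that kept the old "if (!file)" test would treat both failures
as success, which is why every converted caller needs the IS_ERR() check.
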
 arch/alpha/kernel/osf_sys.c                     |   6 +-
 arch/ia64/kernel/perfmon.c                      |  54 +++++++-----
 arch/parisc/hpux/fs.c                           |   6 +-
 arch/powerpc/kvm/powerpc.c                      |   4 +-
 arch/powerpc/platforms/cell/spu_syscalls.c      |  15 ++--
 drivers/base/dma-buf.c                          |   6 +-
 drivers/block/loop.c                            |  14 +--
 drivers/block/nbd.c                             |   2 +-
 drivers/infiniband/core/ucma.c                  |   4 +-
 drivers/infiniband/core/uverbs_cmd.c            |   6 +-
 drivers/infiniband/core/uverbs_main.c           |   4 +-
 drivers/infiniband/hw/usnic/usnic_transport.c   |   2 +-
 drivers/md/md.c                                 |   8 +-
 drivers/staging/android/sync.c                  |   2 +-
 drivers/staging/lustre/lustre/llite/file.c      |   6 +-
 drivers/staging/lustre/lustre/lmv/lmv_obd.c     |   7 +-
 drivers/staging/lustre/lustre/mdc/lproc_mdc.c   |   8 +-
 drivers/staging/lustre/lustre/mdc/mdc_request.c |   4 +-
 drivers/vfio/pci/vfio_pci.c                     |   6 +-
 drivers/vfio/pci/vfio_pci_intrs.c               |   6 +-
 drivers/vfio/vfio.c                             |   6 +-
 drivers/vhost/net.c                             |   6 +-
 drivers/video/fbdev/msm/mdp.c                   |   4 +-
 fs/aio.c                                        |  37 +++++++-
 fs/autofs4/dev-ioctl.c                          |  16 ++--
 fs/autofs4/inode.c                              |   4 +-
 fs/btrfs/ioctl.c                                |  20 +++--
 fs/btrfs/send.c                                 |   7 +-
 fs/cifs/ioctl.c                                 |   6 +-
 fs/coda/inode.c                                 |   4 +-
 fs/coda/psdev.c                                 |   2 +-
 fs/compat.c                                     |  18 ++--
 fs/compat_ioctl.c                               |  14 ++-
 fs/eventfd.c                                    |  17 ++--
 fs/eventpoll.c                                  |  19 +++--
 fs/ext4/ioctl.c                                 |   6 +-
 fs/fcntl.c                                      | 101 ++++++++++++++++++++--
 fs/fhandle.c                                    |   6 +-
 fs/fuse/inode.c                                 |  10 ++-
 fs/ioctl.c                                      |  13 ++-
 fs/locks.c                                      |   8 +-
 fs/notify/fanotify/fanotify_user.c              |  16 ++--
 fs/notify/inotify/inotify_user.c                |  12 +--
 fs/ocfs2/cluster/heartbeat.c                    |   8 +-
 fs/open.c                                       |  42 +++++----
 fs/proc/namespaces.c                            |   6 +-
 fs/read_write.c                                 | 108 ++++++++++++++----------
 fs/readdir.c                                    |  18 ++--
 fs/select.c                                     |  11 ++-
 fs/signalfd.c                                   |   6 +-
 fs/splice.c                                     |  34 +++++---
 fs/stat.c                                       |  10 ++-
 fs/statfs.c                                     |   8 +-
 fs/sync.c                                       |  21 +++--
 fs/timerfd.c                                    |  40 +++++++--
 fs/utimes.c                                     |  10 ++-
 fs/xattr.c                                      |  26 +++---
 fs/xfs/xfs_ioctl.c                              |  14 +--
 ipc/mqueue.c                                    |  30 +++----
 kernel/events/core.c                            |  14 +--
 kernel/module.c                                 |  10 ++-
 kernel/sys.c                                    |   6 +-
 kernel/taskstats.c                              |   4 +-
 kernel/time/posix-clock.c                       |  27 +++---
 mm/fadvise.c                                    |   7 +-
 mm/internal.h                                   |  19 +++++
 mm/memcontrol.c                                 |  12 +--
 mm/mmap.c                                       |   7 +-
 mm/nommu.c                                      |   9 +-
 mm/readahead.c                                  |   6 +-
 net/9p/trans_fd.c                               |  10 +--
 sound/core/pcm_native.c                         |  10 ++-
 virt/kvm/eventfd.c                              |   6 +-
 virt/kvm/vfio.c                                 |  12 +--
 74 files changed, 686 insertions(+), 387 deletions(-)

diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index 1402fcc11c2c..8f2d9597096b 100644
--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -146,7 +146,7 @@ SYSCALL_DEFINE4(osf_getdirentries, unsigned int, fd,
 		long __user *, basep)
 {
 	int error;
-	struct fd arg = fdget(fd);
+	struct fd arg = fdgetr(fd, CAP_READ);
 	struct osf_dirent_callback buf = {
 		.ctx.actor = osf_filldir,
 		.dirent = dirent,
@@ -154,8 +154,8 @@ SYSCALL_DEFINE4(osf_getdirentries, unsigned int, fd,
 		.count = count
 	};
 
-	if (!arg.file)
-		return -EBADF;
+	if (IS_ERR(arg.file))
+		return PTR_ERR(arg.file);
 
 	error = iterate_dir(arg.file, &buf.ctx);
 	if (error >= 0)
diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index d841c4bd6864..d81ff6523ca9 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -471,6 +471,7 @@ typedef struct {
 	int		cmd_flags;
 	unsigned int	cmd_narg;
 	size_t		cmd_argsize;
+	u64		cmd_right;
 	int		(*cmd_getsize)(void *arg, size_t *sz);
 } pfm_cmd_desc_t;
 
@@ -4620,31 +4621,40 @@ pfm_exit_thread(struct task_struct *task)
 /*
  * functions MUST be listed in the increasing order of their index (see permfon.h)
  */
-#define PFM_CMD(name, flags, arg_count, arg_type, getsz) { name, #name, flags, arg_count, sizeof(arg_type), getsz }
-#define PFM_CMD_S(name, flags) { name, #name, flags, 0, 0, NULL }
+#define PFM_CMD(name, flags, arg_count, arg_type, right, getsz) \
+	{ name, #name, flags, arg_count, sizeof(arg_type), right, getsz }
+#define PFM_CMD_S(name, flags, right) \
+	{ name, #name, flags, 0, 0, right, NULL }
 #define PFM_CMD_PCLRWS	(PFM_CMD_FD|PFM_CMD_ARG_RW|PFM_CMD_STOP)
 #define PFM_CMD_PCLRW	(PFM_CMD_FD|PFM_CMD_ARG_RW)
-#define PFM_CMD_NONE	{ NULL, "no-cmd", 0, 0, 0, NULL}
+#define PFM_CMD_NONE	{ NULL, "no-cmd", 0, 0, 0, 0, NULL}
 
 static pfm_cmd_desc_t pfm_cmd_tab[]={
 /* 0  */PFM_CMD_NONE,
-/* 1  */PFM_CMD(pfm_write_pmcs, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_reg_t, NULL),
-/* 2  */PFM_CMD(pfm_write_pmds, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_reg_t, NULL),
-/* 3  */PFM_CMD(pfm_read_pmds, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_reg_t, NULL),
-/* 4  */PFM_CMD_S(pfm_stop, PFM_CMD_PCLRWS),
-/* 5  */PFM_CMD_S(pfm_start, PFM_CMD_PCLRWS),
+/* 1  */PFM_CMD(pfm_write_pmcs, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_reg_t,
+		CAP_WRITE, NULL),
+/* 2  */PFM_CMD(pfm_write_pmds, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_reg_t,
+		CAP_WRITE, NULL),
+/* 3  */PFM_CMD(pfm_read_pmds, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_reg_t,
+		CAP_READ, NULL),
+/* 4  */PFM_CMD_S(pfm_stop, PFM_CMD_PCLRWS, CAP_PERFMON),
+/* 5  */PFM_CMD_S(pfm_start, PFM_CMD_PCLRWS, CAP_PERFMON),
 /* 6  */PFM_CMD_NONE,
 /* 7  */PFM_CMD_NONE,
-/* 8  */PFM_CMD(pfm_context_create, PFM_CMD_ARG_RW, 1, pfarg_context_t, pfm_ctx_getsize),
+/* 8  */PFM_CMD(pfm_context_create, PFM_CMD_ARG_RW, 1, pfarg_context_t,
+		CAP_PERFMON, pfm_ctx_getsize),
 /* 9  */PFM_CMD_NONE,
-/* 10 */PFM_CMD_S(pfm_restart, PFM_CMD_PCLRW),
+/* 10 */PFM_CMD_S(pfm_restart, PFM_CMD_PCLRW, CAP_PERFMON),
 /* 11 */PFM_CMD_NONE,
-/* 12 */PFM_CMD(pfm_get_features, PFM_CMD_ARG_RW, 1, pfarg_features_t, NULL),
-/* 13 */PFM_CMD(pfm_debug, 0, 1, unsigned int, NULL),
+/* 12 */PFM_CMD(pfm_get_features, PFM_CMD_ARG_RW, 1, pfarg_features_t,
+		CAP_READ, NULL),
+/* 13 */PFM_CMD(pfm_debug, 0, 1, unsigned int, CAP_PERFMON, NULL),
 /* 14 */PFM_CMD_NONE,
-/* 15 */PFM_CMD(pfm_get_pmc_reset, PFM_CMD_ARG_RW, PFM_CMD_ARG_MANY, pfarg_reg_t, NULL),
-/* 16 */PFM_CMD(pfm_context_load, PFM_CMD_PCLRWS, 1, pfarg_load_t, NULL),
-/* 17 */PFM_CMD_S(pfm_context_unload, PFM_CMD_PCLRWS),
+/* 15 */PFM_CMD(pfm_get_pmc_reset, PFM_CMD_ARG_RW, PFM_CMD_ARG_MANY,
+		pfarg_reg_t, CAP_READ, NULL),
+/* 16 */PFM_CMD(pfm_context_load, PFM_CMD_PCLRWS, 1, pfarg_load_t,
+		CAP_READ, NULL),
+/* 17 */PFM_CMD_S(pfm_context_unload, PFM_CMD_PCLRWS, CAP_READ),
 /* 18 */PFM_CMD_NONE,
 /* 19 */PFM_CMD_NONE,
 /* 20 */PFM_CMD_NONE,
@@ -4659,8 +4669,10 @@ static pfm_cmd_desc_t pfm_cmd_tab[]={
 /* 29 */PFM_CMD_NONE,
 /* 30 */PFM_CMD_NONE,
 /* 31 */PFM_CMD_NONE,
-/* 32 */PFM_CMD(pfm_write_ibrs, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_dbreg_t, NULL),
-/* 33 */PFM_CMD(pfm_write_dbrs, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_dbreg_t, NULL)
+/* 32 */PFM_CMD(pfm_write_ibrs, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_dbreg_t,
+		CAP_WRITE, NULL),
+/* 33 */PFM_CMD(pfm_write_dbrs, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_dbreg_t,
+		CAP_WRITE, NULL)
 };
 #define PFM_CMD_COUNT	(sizeof(pfm_cmd_tab)/sizeof(pfm_cmd_desc_t))
 
@@ -4866,13 +4878,13 @@ restart_args:
 
 	if (unlikely((cmd_flags & PFM_CMD_FD) == 0)) goto skip_fd;
 
-	ret = -EBADF;
-
-	f = fdget(fd);
-	if (unlikely(f.file == NULL)) {
+	f = fdgetr(fd, pfm_cmd_tab[cmd].cmd_right);
+	if (unlikely(IS_ERR(f.file))) {
 		DPRINT(("invalid fd %d\n", fd));
+		ret = PTR_ERR(f.file);
 		goto error_args;
 	}
+	ret = -EBADF;
 	if (unlikely(PFM_IS_FILE(f.file) == 0)) {
 		DPRINT(("fd %d not related to perfmon\n", fd));
 		goto error_args;
diff --git a/arch/parisc/hpux/fs.c b/arch/parisc/hpux/fs.c
index 2bedafea3d94..8a48b5d4bb15 100644
--- a/arch/parisc/hpux/fs.c
+++ b/arch/parisc/hpux/fs.c
@@ -105,9 +105,9 @@ int hpux_getdents(unsigned int fd, struct hpux_dirent __user *dirent, unsigned i
 	};
 	int error;
 
-	arg = fdget(fd);
-	if (!arg.file)
-		return -EBADF;
+	arg = fdgetr(fd, CAP_READ);
+	if (IS_ERR(arg.file))
+		return PTR_ERR(arg.file);
 
 	error = iterate_dir(arg.file, &buf.ctx);
 	if (error >= 0)
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 3cf541a53e2a..f279d852dc96 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -891,7 +891,7 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		struct kvm_device *dev;
 
 		r = -EBADF;
-		f = fdget(cap->args[0]);
+		f = fdgetr(cap->args[0], CAP_FSTAT);
-		if (!f.file)
+		if (IS_ERR(f.file))
 			break;
 
@@ -910,7 +910,7 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		struct kvm_device *dev;
 
 		r = -EBADF;
-		f = fdget(cap->args[0]);
+		f = fdgetr(cap->args[0], CAP_FSTAT);
-		if (!f.file)
+		if (IS_ERR(f.file))
 			break;
 
diff --git a/arch/powerpc/platforms/cell/spu_syscalls.c b/arch/powerpc/platforms/cell/spu_syscalls.c
index 38e0a1a5cec3..150c83ad8ed5 100644
--- a/arch/powerpc/platforms/cell/spu_syscalls.c
+++ b/arch/powerpc/platforms/cell/spu_syscalls.c
@@ -77,11 +77,13 @@ SYSCALL_DEFINE4(spu_create, const char __user *, name, unsigned int, flags,
 		return -ENOSYS;
 
 	if (flags & SPU_CREATE_AFFINITY_SPU) {
-		struct fd neighbor = fdget(neighbor_fd);
-		ret = -EBADF;
-		if (neighbor.file) {
+		struct fd neighbor = fdgetr(neighbor_fd, CAP_READ, CAP_WRITE,
+					    CAP_MAPEXEC);
+		if (!IS_ERR(neighbor.file)) {
 			ret = calls->create_thread(name, flags, mode, neighbor.file);
 			fdput(neighbor);
+		} else {
+			ret = PTR_ERR(neighbor.file);
 		}
 	} else
 		ret = calls->create_thread(name, flags, mode, NULL);
@@ -100,11 +102,12 @@ asmlinkage long sys_spu_run(int fd, __u32 __user *unpc, __u32 __user *ustatus)
 	if (!calls)
 		return -ENOSYS;
 
-	ret = -EBADF;
-	arg = fdget(fd);
-	if (arg.file) {
+	arg = fdgetr(fd, CAP_READ, CAP_WRITE, CAP_MAPEXEC);
+	if (!IS_ERR(arg.file)) {
 		ret = calls->spu_run(arg.file, unpc, ustatus);
 		fdput(arg);
+	} else {
+		ret = PTR_ERR(arg.file);
 	}
 
 	spufs_calls_put(calls);
diff --git a/drivers/base/dma-buf.c b/drivers/base/dma-buf.c
index ea77701deda4..46650bccaa6a 100644
--- a/drivers/base/dma-buf.c
+++ b/drivers/base/dma-buf.c
@@ -216,10 +216,10 @@ struct dma_buf *dma_buf_get(int fd)
 {
 	struct file *file;
 
-	file = fget(fd);
+	file = fgetr(fd, CAP_MMAP, CAP_READ, CAP_WRITE);
 
-	if (!file)
-		return ERR_PTR(-EBADF);
+	if (IS_ERR(file))
+		return ERR_CAST(file);
 
 	if (!is_dma_buf_file(file)) {
 		fput(file);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f70a230a2945..d4a707bc3f1b 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -652,10 +652,11 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
 	if (!(lo->lo_flags & LO_FLAGS_READ_ONLY))
 		goto out;
 
-	error = -EBADF;
-	file = fget(arg);
-	if (!file)
+	file = fgetr(arg, CAP_PWRITE, CAP_PREAD, CAP_FSYNC, CAP_FSTAT);
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
 		goto out;
+	}
 
 	inode = file->f_mapping->host;
 	old_file = lo->lo_backing_file;
@@ -834,10 +835,11 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 	/* This is safe, since we have a reference from open(). */
 	__module_get(THIS_MODULE);
 
-	error = -EBADF;
-	file = fget(arg);
-	if (!file)
+	file = fgetr(arg, CAP_PWRITE, CAP_PREAD, CAP_FSYNC, CAP_FSTAT);
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
 		goto out;
+	}
 
 	error = -EBUSY;
 	if (lo->lo_state != Lo_unbound)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 3a70ea2f7cd6..d6f55e3052fb 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -654,7 +654,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 			nbd->disconnect = 0; /* we're connected now */
 			return 0;
 		}
-		return -EINVAL;
+		return err;
 	}
 
 	case NBD_SET_BLKSIZE:
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 56a4b7ca7ee3..b3b0b1aea8aa 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -1406,8 +1406,8 @@ static ssize_t ucma_migrate_id(struct ucma_file *new_file,
 		return -EFAULT;
 
 	/* Get current fd to protect against it being closed */
-	f = fdget(cmd.fd);
-	if (!f.file)
+	f = fdgetr(cmd.fd, CAP_READ, CAP_WRITE, CAP_POLL_EVENT);
+	if (IS_ERR(f.file))
 		return -ENOENT;
 
 	/* Validate current fd and prevent destruction of id. */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index ea6203ee7bcc..06db7ca75e1f 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -719,9 +719,9 @@ ssize_t ib_uverbs_open_xrcd(struct ib_uverbs_file *file,
 
 	if (cmd.fd != -1) {
 		/* search for file descriptor */
-		f = fdget(cmd.fd);
-		if (!f.file) {
-			ret = -EBADF;
+		f = fdgetr(cmd.fd, CAP_FSTAT);
+		if (IS_ERR(f.file)) {
+			ret = PTR_ERR(f.file);
 			goto err_tree_mutex_unlock;
 		}
 
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 08219fb3338b..edaf0693ab12 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -566,9 +566,9 @@ struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
 struct ib_uverbs_event_file *ib_uverbs_lookup_comp_file(int fd)
 {
 	struct ib_uverbs_event_file *ev_file = NULL;
-	struct fd f = fdget(fd);
+	struct fd f = fdgetr(fd, CAP_LIST_END);
 
-	if (!f.file)
+	if (IS_ERR(f.file))
 		return NULL;
 
 	if (f.file->f_op != &uverbs_event_fops)
diff --git a/drivers/infiniband/hw/usnic/usnic_transport.c b/drivers/infiniband/hw/usnic/usnic_transport.c
index ddef6f77a78c..5e2265792b83 100644
--- a/drivers/infiniband/hw/usnic/usnic_transport.c
+++ b/drivers/infiniband/hw/usnic/usnic_transport.c
@@ -134,7 +134,7 @@ struct socket *usnic_transport_get_socket(int sock_fd)
 	char buf[25];
 
 	/* sockfd_lookup will internally do a fget */
-	sock = sockfd_lookup(sock_fd, &err);
+	sock = sockfd_lookupr(sock_fd, &err, CAP_SOCK_SERVER);
 	if (!sock) {
 		usnic_err("Unable to lookup socket for fd %d with err %d\n",
 				sock_fd, err);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2382cfc9bb3f..aad393fa4d19 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5967,12 +5967,14 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
 		struct inode *inode;
 		if (mddev->bitmap)
 			return -EEXIST; /* cannot add when bitmap is present */
-		mddev->bitmap_info.file = fget(fd);
+		mddev->bitmap_info.file = fgetr(fd, CAP_READ);
 
-		if (mddev->bitmap_info.file == NULL) {
+		if (IS_ERR(mddev->bitmap_info.file)) {
+			err = PTR_ERR(mddev->bitmap_info.file);
+			mddev->bitmap_info.file = NULL;
 			printk(KERN_ERR "%s: error: failed to get bitmap file\n",
 			       mdname(mddev));
-			return -EBADF;
+			return err;
 		}
 
 		inode = mddev->bitmap_info.file->f_mapping->host;
diff --git a/drivers/staging/android/sync.c b/drivers/staging/android/sync.c
index 3d05f662110b..a22151248011 100644
--- a/drivers/staging/android/sync.c
+++ b/drivers/staging/android/sync.c
@@ -400,7 +400,7 @@ static void sync_fence_free_pts(struct sync_fence *fence)
 
 struct sync_fence *sync_fence_fdget(int fd)
 {
-	struct file *file = fget(fd);
+	struct file *file = fgetr(fd, CAP_IOCTL);
 
-	if (file == NULL)
+	if (IS_ERR(file))
 		return NULL;
diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
index 8e844a6371e0..5bb26632987d 100644
--- a/drivers/staging/lustre/lustre/llite/file.c
+++ b/drivers/staging/lustre/lustre/llite/file.c
@@ -2245,9 +2245,9 @@ long ll_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		if ((file->f_flags & O_ACCMODE) == 0) /* O_RDONLY */
 			return -EPERM;
 
-		file2 = fget(lsl.sl_fd);
-		if (file2 == NULL)
-			return -EBADF;
+		file2 = fgetr(lsl.sl_fd, CAP_FSTAT);
+		if (IS_ERR(file2))
+			return PTR_ERR(file2);
 
 		rc = -EPERM;
 		if ((file2->f_flags & O_ACCMODE) != 0) /* O_WRONLY or O_RDWR */
diff --git a/drivers/staging/lustre/lustre/lmv/lmv_obd.c b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
index 3ba0a0a1d945..ce32a7dea277 100644
--- a/drivers/staging/lustre/lustre/lmv/lmv_obd.c
+++ b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
@@ -877,10 +877,9 @@ static int lmv_hsm_ct_register(struct lmv_obd *lmv, unsigned int cmd, int len,
 		return -ENOTCONN;
 
 	/* at least one registration done, with no failure */
-	filp = fget(lk->lk_wfd);
-	if (filp == NULL) {
-		return -EBADF;
-	}
+	filp = fgetr(lk->lk_wfd, CAP_READ);
+	if (IS_ERR(filp))
+		return PTR_ERR(filp);
 	rc = libcfs_kkuc_group_add(filp, lk->lk_uid, lk->lk_group, lk->lk_data);
 	if (rc != 0 && filp != NULL)
 		fput(filp);
diff --git a/drivers/staging/lustre/lustre/mdc/lproc_mdc.c b/drivers/staging/lustre/lustre/mdc/lproc_mdc.c
index 2663480a68c5..7350618766f6 100644
--- a/drivers/staging/lustre/lustre/mdc/lproc_mdc.c
+++ b/drivers/staging/lustre/lustre/mdc/lproc_mdc.c
@@ -130,9 +130,12 @@ static ssize_t mdc_kuc_write(struct file *file, const char *buffer,
 	if (fd == 0) {
 		rc = libcfs_kkuc_group_put(KUC_GRP_HSM, lh);
 	} else {
-		struct file *fp = fget(fd);
-
-		rc = libcfs_kkuc_msg_put(fp, lh);
-		fput(fp);
+		struct file *fp = fgetr(fd, CAP_WRITE);
+		if (IS_ERR(fp)) {
+			rc = PTR_ERR(fp);
+		} else {
+			rc = libcfs_kkuc_msg_put(fp, lh);
+			fput(fp);
+		}
 	}
 	OBD_FREE(lh, len);
diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
index bde9f93c149b..c22b103a6643 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
@@ -1606,7 +1606,9 @@ static int mdc_ioc_changelog_send(struct obd_device *obd,
 	cs->cs_obd = obd;
 	cs->cs_startrec = icc->icc_recno;
 	/* matching fput in mdc_changelog_send_thread */
-	cs->cs_fp = fget(icc->icc_id);
+	cs->cs_fp = fgetr(icc->icc_id, CAP_WRITE);
+	if (IS_ERR(cs->cs_fp))
+		cs->cs_fp = NULL;
 	cs->cs_flags = icc->icc_flags;
 
 	/*
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 7ba042498857..4f79e73e9d7a 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -638,9 +638,9 @@ reset_info_exit:
 		 */
 		for (i = 0; i < hdr.count; i++) {
 			struct vfio_group *group;
-			struct fd f = fdget(group_fds[i]);
-			if (!f.file) {
-				ret = -EBADF;
+			struct fd f = fdgetr(group_fds[i], CAP_FSTAT);
+			if (IS_ERR(f.file)) {
+				ret = PTR_ERR(f.file);
 				break;
 			}
 
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 9dd49c9839ac..4591feea9004 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -149,9 +149,9 @@ static int virqfd_enable(struct vfio_pci_device *vdev,
 	INIT_WORK(&virqfd->shutdown, virqfd_shutdown);
 	INIT_WORK(&virqfd->inject, virqfd_inject);
 
-	irqfd = fdget(fd);
-	if (!irqfd.file) {
-		ret = -EBADF;
+	irqfd = fdgetr(fd, CAP_WRITE);
+	if (IS_ERR(irqfd.file)) {
+		ret = PTR_ERR(irqfd.file);
 		goto err_fd;
 	}
 
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 512f479d8a50..f8c71b84981f 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1042,9 +1042,9 @@ static int vfio_group_set_container(struct vfio_group *group, int container_fd)
 	if (atomic_read(&group->container_users))
 		return -EINVAL;
 
-	f = fdget(container_fd);
-	if (!f.file)
-		return -EBADF;
+	f = fdgetr(container_fd, CAP_LIST_END);
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 
 	/* Sanity check, is this really our fd? */
 	if (f.file->f_op != &vfio_fops) {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index be414d2b2b22..6fed594f12d3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -866,11 +866,11 @@ err:
 
 static struct socket *get_tap_socket(int fd)
 {
-	struct file *file = fget(fd);
+	struct file *file = fgetr(fd, CAP_READ, CAP_WRITE);
 	struct socket *sock;
 
-	if (!file)
-		return ERR_PTR(-EBADF);
+	if (IS_ERR(file))
+		return ERR_CAST(file);
 	sock = tun_get_socket(file);
 	if (!IS_ERR(sock))
 		return sock;
diff --git a/drivers/video/fbdev/msm/mdp.c b/drivers/video/fbdev/msm/mdp.c
index 113c7876c855..203fb827ae83 100644
--- a/drivers/video/fbdev/msm/mdp.c
+++ b/drivers/video/fbdev/msm/mdp.c
@@ -257,8 +257,8 @@ int get_img(struct mdp_img *img, struct fb_info *info,
 	    struct file **filep)
 {
 	int ret = 0;
-	struct fd f = fdget(img->memory_id);
-	if (f.file == NULL)
+	struct fd f = fdgetr(img->memory_id, CAP_FSTAT);
+	if (IS_ERR(f.file))
 		return -1;
 
 	if (MAJOR(file_inode(f.file)->i_rdev) == FB_MAJOR) {
diff --git a/fs/aio.c b/fs/aio.c
index a0ed6c7d2cd2..4a5cc7f39753 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1390,10 +1390,38 @@ rw_common:
 	return 0;
 }
 
+static struct capsicum_rights *
+aio_opcode_rights(struct capsicum_rights *rights, int opcode)
+{
+	switch (opcode) {
+	case IOCB_CMD_PREAD:
+	case IOCB_CMD_PREADV:
+		cap_rights_init(rights, CAP_PREAD);
+		break;
+
+	case IOCB_CMD_PWRITE:
+	case IOCB_CMD_PWRITEV:
+		cap_rights_init(rights, CAP_PWRITE);
+		break;
+
+	case IOCB_CMD_FSYNC:
+	case IOCB_CMD_FDSYNC:
+		cap_rights_init(rights, CAP_FSYNC);
+		break;
+
+	default:
+		cap_rights_init(rights, CAP_PREAD, CAP_PWRITE, CAP_POLL_EVENT,
+				CAP_FSYNC);
+		break;
+	}
+	return rights;
+}
+
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 			 struct iocb *iocb, bool compat)
 {
 	struct kiocb *req;
+	struct capsicum_rights rights;
 	ssize_t ret;
 
 	/* enforce forwards compatibility on users */
@@ -1416,9 +1444,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	if (unlikely(!req))
 		return -EAGAIN;
 
-	req->ki_filp = fget(iocb->aio_fildes);
-	if (unlikely(!req->ki_filp)) {
-		ret = -EBADF;
+	req->ki_filp = fget_rights(iocb->aio_fildes,
+				   aio_opcode_rights(&rights,
+						     iocb->aio_lio_opcode));
+	if (unlikely(IS_ERR(req->ki_filp))) {
+		ret = PTR_ERR(req->ki_filp);
+		req->ki_filp = NULL;
 		goto out_put_req;
 	}
 
diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
index 232e03d4780d..460c5be6c3f4 100644
--- a/fs/autofs4/dev-ioctl.c
+++ b/fs/autofs4/dev-ioctl.c
@@ -371,9 +371,9 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp,
 			goto out;
 		}
 
-		pipe = fget(pipefd);
-		if (!pipe) {
-			err = -EBADF;
+		pipe = fgetr(pipefd, CAP_READ, CAP_WRITE, CAP_FSYNC);
+		if (IS_ERR(pipe)) {
+			err = PTR_ERR(pipe);
 			goto out;
 		}
 		if (autofs_prepare_pipe(pipe) < 0) {
@@ -665,11 +665,15 @@ static int _autofs_dev_ioctl(unsigned int command, struct autofs_dev_ioctl __use
 	 */
 	if (cmd != AUTOFS_DEV_IOCTL_OPENMOUNT_CMD &&
 	    cmd != AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD) {
-		fp = fget(param->ioctlfd);
-		if (!fp) {
+		struct capsicum_rights rights;
+		cap_rights_init(&rights, CAP_IOCTL, CAP_FSTAT);
+		rights.nioctls = 1;
+		rights.ioctls = &cmd;
+		fp = fget_rights(param->ioctlfd, &rights);
+		if (IS_ERR(fp)) {
 			if (cmd == AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD)
 				goto cont;
-			err = -EBADF;
+			err = PTR_ERR(fp);
 			goto out;
 		}
 
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index d7bd395ab586..39e7f008734f 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -305,9 +305,9 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
 	sbi->sub_version = AUTOFS_PROTO_SUBVERSION;
 
 	DPRINTK("pipe fd = %d, pgrp = %u", pipefd, pid_nr(sbi->oz_pgrp));
-	pipe = fget(pipefd);
+	pipe = fgetr(pipefd, CAP_WRITE, CAP_FSYNC);
 
-	if (!pipe) {
+	if (IS_ERR(pipe)) {
 		printk("autofs: could not open pipe file descriptor\n");
 		goto fail_dput;
 	}
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2f6d7b13b5bd..939354163c4c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1632,10 +1632,12 @@ static noinline int btrfs_ioctl_snap_create_transid(struct file *file,
 		ret = btrfs_mksubvol(&file->f_path, name, namelen,
 				     NULL, transid, readonly, inherit);
 	} else {
-		struct fd src = fdget(fd);
+		struct fd src = fdgetr(fd, CAP_FSTAT);
 		struct inode *src_inode;
-		if (!src.file) {
-			ret = -EINVAL;
+		if (IS_ERR(src.file)) {
+			ret = PTR_ERR(src.file);
+			if (ret == -EBADF)
+				ret = -EINVAL;
 			goto out_drop_write;
 		}
 
@@ -2879,9 +2881,9 @@ static long btrfs_ioctl_file_extent_same(struct file *file,
 
 	for (i = 0, info = same->info; i < count; i++, info++) {
 		struct inode *dst;
-		struct fd dst_file = fdget(info->fd);
-		if (!dst_file.file) {
-			info->status = -EBADF;
+		struct fd dst_file = fdgetr(info->fd, CAP_FSTAT);
+		if (IS_ERR(dst_file.file)) {
+			info->status = PTR_ERR(dst_file.file);
 			continue;
 		}
 		dst = file_inode(dst_file.file);
@@ -3247,9 +3249,9 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 	if (ret)
 		return ret;
 
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
+	src_file = fdgetr(srcfd, CAP_FSTAT);
+	if (IS_ERR(src_file.file)) {
+		ret = PTR_ERR(src_file.file);
 		goto out_drop_write;
 	}
 
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 484aacac2c89..0d0a8d9c3ddf 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -5571,9 +5571,10 @@ long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_)
 
 	sctx->flags = arg->flags;
 
-	sctx->send_filp = fget(arg->send_fd);
-	if (!sctx->send_filp) {
-		ret = -EBADF;
+	sctx->send_filp = fgetr(arg->send_fd, CAP_PWRITE);
+	if (IS_ERR(sctx->send_filp)) {
+		ret = PTR_ERR(sctx->send_filp);
+		sctx->send_filp = NULL;
 		goto out;
 	}
 
diff --git a/fs/cifs/ioctl.c b/fs/cifs/ioctl.c
index 77492301cc2b..82cae7765004 100644
--- a/fs/cifs/ioctl.c
+++ b/fs/cifs/ioctl.c
@@ -61,9 +61,9 @@ static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file,
 		return rc;
 	}
 
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		rc = -EBADF;
+	src_file = fdgetr(srcfd, CAP_PREAD);
+	if (IS_ERR(src_file.file)) {
+		rc = PTR_ERR(src_file.file);
 		goto out_drop_write;
 	}
 
diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index d9c7751f10ac..a2fc0106043e 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -128,8 +128,8 @@ static int get_device_index(struct coda_mount_data *data)
 		return -1;
 	}
 
-	f = fdget(data->fd);
-	if (!f.file)
+	f = fdgetr(data->fd, CAP_FSTAT);
+	if (IS_ERR(f.file))
 		goto Ebadf;
 	inode = file_inode(f.file);
 	if (!S_ISCHR(inode->i_mode) || imajor(inode) != CODA_PSDEV_MAJOR) {
diff --git a/fs/coda/psdev.c b/fs/coda/psdev.c
index ebc2bae6c289..370768e5e2e1 100644
--- a/fs/coda/psdev.c
+++ b/fs/coda/psdev.c
@@ -186,7 +186,7 @@ static ssize_t coda_psdev_write(struct file *file, const char __user *buf,
 		struct coda_open_by_fd_out *outp =
 			(struct coda_open_by_fd_out *)req->uc_data;
 		if (!outp->oh.result)
-			outp->fh = fget(outp->fd);
+			outp->fh = fgetr(outp->fd, CAP_LIST_END);
 	}
 
         wake_up(&req->uc_sleep);
diff --git a/fs/compat.c b/fs/compat.c
index 66d3d3c6b4b2..58c3992931c9 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -889,14 +889,14 @@ COMPAT_SYSCALL_DEFINE3(old_readdir, unsigned int, fd,
 		struct compat_old_linux_dirent __user *, dirent, unsigned int, count)
 {
 	int error;
-	struct fd f = fdget(fd);
+	struct fd f = fdgetr(fd, CAP_READ);
 	struct compat_readdir_callback buf = {
 		.ctx.actor = compat_fillonedir,
 		.dirent = dirent
 	};
 
-	if (!f.file)
-		return -EBADF;
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 
 	error = iterate_dir(f.file, &buf.ctx);
 	if (buf.result)
@@ -979,9 +979,9 @@ COMPAT_SYSCALL_DEFINE3(getdents, unsigned int, fd,
 	if (!access_ok(VERIFY_WRITE, dirent, count))
 		return -EFAULT;
 
-	f = fdget(fd);
-	if (!f.file)
-		return -EBADF;
+	f = fdgetr(fd, CAP_READ);
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 
 	error = iterate_dir(f.file, &buf.ctx);
 	if (error >= 0)
@@ -1064,9 +1064,9 @@ COMPAT_SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 	if (!access_ok(VERIFY_WRITE, dirent, count))
 		return -EFAULT;
 
-	f = fdget(fd);
-	if (!f.file)
-		return -EBADF;
+	f = fdgetr(fd, CAP_READ);
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 
 	error = iterate_dir(f.file, &buf.ctx);
 	if (error >= 0)
diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index e82289047272..68f3ab88f00f 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -1542,10 +1542,18 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
 		       compat_ulong_t, arg32)
 {
 	unsigned long arg = arg32;
-	struct fd f = fdget(fd);
-	int error = -EBADF;
-	if (!f.file)
+	struct capsicum_rights rights;
+	struct fd f;
+	int error;
+
+	cap_rights_init(&rights, CAP_IOCTL);
+	rights.nioctls = 1;
+	rights.ioctls = &cmd;
+	f = fdget_rights(fd, &rights);
+	if (IS_ERR(f.file)) {
+		error = PTR_ERR(f.file);
 		goto out;
+	}
 
 	/* RED-PEN how should LSM module know it's handling 32bit? */
 	error = security_file_ioctl(f.file, cmd, arg);
diff --git a/fs/eventfd.c b/fs/eventfd.c
index d6a88e7812f3..5ec2d0fbefe2 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -319,16 +319,17 @@ static const struct file_operations eventfd_fops = {
  * Returns a pointer to the eventfd file structure in case of success, or the
  * following error pointer:
  *
- * -EBADF    : Invalid @fd file descriptor.
- * -EINVAL   : The @fd file descriptor is not an eventfd file.
+ * -EBADF       : Invalid @fd file descriptor.
+ * -ENOTCAPABLE : The @fd file descriptor does not have the required rights.
+ * -EINVAL      : The @fd file descriptor is not an eventfd file.
  */
 struct file *eventfd_fget(int fd)
 {
 	struct file *file;
 
-	file = fget(fd);
-	if (!file)
-		return ERR_PTR(-EBADF);
+	file = fgetr(fd, CAP_WRITE);
+	if (IS_ERR(file))
+		return file;
 	if (file->f_op != &eventfd_fops) {
 		fput(file);
 		return ERR_PTR(-EINVAL);
@@ -350,9 +351,9 @@ EXPORT_SYMBOL_GPL(eventfd_fget);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd)
 {
 	struct eventfd_ctx *ctx;
-	struct fd f = fdget(fd);
-	if (!f.file)
-		return ERR_PTR(-EBADF);
+	struct fd f = fdgetr(fd, CAP_WRITE);
+	if (IS_ERR(f.file))
+		return ERR_CAST(f.file);
 	ctx = eventfd_ctx_fileget(f.file);
 	fdput(f);
 	return ctx;
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index af903128891c..53de5ffbd435 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1836,15 +1836,18 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
 		goto error_return;
 
-	error = -EBADF;
-	f = fdget(epfd);
-	if (!f.file)
+	f = fdgetr(epfd, CAP_EPOLL_CTL);
+	if (IS_ERR(f.file)) {
+		error = PTR_ERR(f.file);
 		goto error_return;
+	}
 
 	/* Get the "struct file *" for the target file */
-	tf = fdget(fd);
-	if (!tf.file)
+	tf = fdgetr(fd, CAP_POLL_EVENT);
+	if (IS_ERR(tf.file)) {
+		error = PTR_ERR(tf.file);
 		goto error_fput;
+	}
 
 	/* The target file descriptor must support poll */
 	error = -EPERM;
@@ -1976,9 +1979,9 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
 		return -EFAULT;
 
 	/* Get the "struct file *" for the eventpoll file */
-	f = fdget(epfd);
-	if (!f.file)
-		return -EBADF;
+	f = fdgetr(epfd, CAP_POLL_EVENT);
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 
 	/*
 	 * We have to check that the file structure underneath the fd
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 0f2252ec274d..a26108969d9b 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -419,9 +419,9 @@ group_extend_out:
 			return -EFAULT;
 		me.moved_len = 0;
 
-		donor = fdget(me.donor_fd);
-		if (!donor.file)
-			return -EBADF;
+		donor = fdgetr(me.donor_fd, CAP_PWRITE, CAP_FSTAT);
+		if (IS_ERR(donor.file))
+			return PTR_ERR(donor.file);
 
 		if (!(donor.file->f_mode & FMODE_WRITE)) {
 			err = -EBADF;
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 79f9b09fa46b..8029981462e6 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -355,13 +355,99 @@ static int check_fcntl_cmd(unsigned cmd)
 	return 0;
 }
 
+static bool fcntl_rights(unsigned int cmd, struct capsicum_rights *rights)
+{
+	switch (cmd) {
+	case F_DUPFD:
+	case F_DUPFD_CLOEXEC:
+		/*
+		 * Returning true tells the caller to use the file directly,
+		 * without unwrapping it, so no rights are needed.
+		 */
+		cap_rights_init(rights, 0);
+		return true;
+	case F_GETFD:
+	case F_SETFD:
+		cap_rights_init(rights, 0);
+		return false;
+	case F_GETFL:
+		cap_rights_init(rights, CAP_FCNTL);
+		rights->fcntls = CAP_FCNTL_GETFL;
+		return false;
+	case F_SETFL:
+		cap_rights_init(rights, CAP_FCNTL);
+		rights->fcntls = CAP_FCNTL_SETFL;
+		return false;
+	case F_GETOWN:
+	case F_GETOWN_EX:
+	case F_GETOWNER_UIDS:
+		cap_rights_init(rights, CAP_FCNTL);
+		rights->fcntls = CAP_FCNTL_GETOWN;
+		return false;
+	case F_SETOWN:
+	case F_SETOWN_EX:
+		cap_rights_init(rights, CAP_FCNTL);
+		rights->fcntls = CAP_FCNTL_SETOWN;
+		return false;
+	case F_GETLK:
+	case F_SETLK:
+	case F_SETLKW:
+#if BITS_PER_LONG == 32
+	case F_GETLK64:
+	case F_SETLK64:
+	case F_SETLKW64:
+#endif
+		cap_rights_init(rights, CAP_FLOCK);
+		return false;
+	case F_GETSIG:
+	case F_SETSIG:
+		cap_rights_init(rights, CAP_POLL_EVENT, CAP_FSIGNAL);
+		return false;
+	case F_GETLEASE:
+	case F_SETLEASE:
+		cap_rights_init(rights, CAP_FLOCK, CAP_FSIGNAL);
+		return false;
+	case F_NOTIFY:
+		cap_rights_init(rights, CAP_NOTIFY);
+		return false;
+	case F_SETPIPE_SZ:
+		cap_rights_init(rights, CAP_SETSOCKOPT);
+		return false;
+	case F_GETPIPE_SZ:
+		cap_rights_init(rights, CAP_GETSOCKOPT);
+		return false;
+	default:
+		cap_rights_set_all(rights);
+		return false;
+	}
+}
+
+static inline struct fd fcntl_fdget_raw(unsigned int fd, unsigned int cmd,
+					struct capsicum_rights *rights)
+{
+	struct fd f;
+
+	if (fcntl_rights(cmd, rights)) {
+		/* Use the file directly, don't attempt to unwrap */
+		f = fdget_raw(fd);
+		if (f.file == NULL)
+			f.file = ERR_PTR(-EBADF);
+	} else {
+		f = fdget_raw_rights(fd, NULL, rights);
+	}
+	return f;
+}
+
 SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
-{	
-	struct fd f = fdget_raw(fd);
+{
+	struct capsicum_rights rights;
+	struct fd f = fcntl_fdget_raw(fd, cmd, &rights);
 	long err = -EBADF;
 
-	if (!f.file)
+	if (IS_ERR(f.file)) {
+		err = PTR_ERR(f.file);
 		goto out;
+	}
 
 	if (unlikely(f.file->f_mode & FMODE_PATH)) {
 		if (!check_fcntl_cmd(cmd))
@@ -381,12 +467,15 @@ out:
 #if BITS_PER_LONG == 32
 SYSCALL_DEFINE3(fcntl64, unsigned int, fd, unsigned int, cmd,
 		unsigned long, arg)
-{	
-	struct fd f = fdget_raw(fd);
+{
+	struct capsicum_rights rights;
+	struct fd f = fcntl_fdget_raw(fd, cmd, &rights);
 	long err = -EBADF;
 
-	if (!f.file)
+	if (IS_ERR(f.file)) {
+		err = PTR_ERR(f.file);
 		goto out;
+	}
 
 	if (unlikely(f.file->f_mode & FMODE_PATH)) {
 		if (!check_fcntl_cmd(cmd))
diff --git a/fs/fhandle.c b/fs/fhandle.c
index 999ff5c3cab0..325575a9084d 100644
--- a/fs/fhandle.c
+++ b/fs/fhandle.c
@@ -121,9 +121,9 @@ static struct vfsmount *get_vfsmount_from_fd(int fd)
 		mnt = mntget(fs->pwd.mnt);
 		spin_unlock(&fs->lock);
 	} else {
-		struct fd f = fdget(fd);
-		if (!f.file)
-			return ERR_PTR(-EBADF);
+		struct fd f = fdgetr(fd, CAP_LOOKUP);
+		if (IS_ERR(f.file))
+			return ERR_CAST(f.file);
 		mnt = mntget(f.file->f_path.mnt);
 		fdput(f);
 	}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 754dcf23de8a..4a49dca49c8f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1025,11 +1025,15 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_time_gran = 1;
 	sb->s_export_op = &fuse_export_operations;
 
-	file = fget(d.fd);
-	err = -EINVAL;
-	if (!file)
+	file = fgetr(d.fd, CAP_READ, CAP_WRITE);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		if (err == -EBADF)
+			err = -EINVAL;
 		goto err;
+	}
 
+	err = -EINVAL;
 	if ((file->f_op != &fuse_dev_operations) ||
 	    (file->f_cred->user_ns != &init_user_ns))
 		goto err_fput;
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 8ac3fad36192..07086423983e 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -604,10 +604,15 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 {
 	int error;
-	struct fd f = fdget(fd);
-
-	if (!f.file)
-		return -EBADF;
+	struct capsicum_rights rights;
+	struct fd f;
+	cap_rights_init(&rights, CAP_IOCTL);
+	rights.nioctls = 1;
+	rights.ioctls = &cmd;
+	f = fdget_rights(fd, &rights);
+
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 	error = security_file_ioctl(f.file, cmd, arg);
 	if (!error)
 		error = do_vfs_ioctl(f.file, fd, cmd, arg);
diff --git a/fs/locks.c b/fs/locks.c
index e390bd9ae068..375fac3392b9 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1816,19 +1816,21 @@ EXPORT_SYMBOL(flock_lock_file_wait);
  */
 SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdgetr(fd, CAP_FLOCK);
 	struct file_lock *lock;
 	int can_sleep, unlock;
 	int error;
 
-	error = -EBADF;
-	if (!f.file)
+	if (IS_ERR(f.file)) {
+		error = PTR_ERR(f.file);
 		goto out;
+	}
 
 	can_sleep = !(cmd & LOCK_NB);
 	cmd &= ~LOCK_NB;
 	unlock = (cmd == LOCK_UN);
 
+	error = -EBADF;
 	if (!unlock && !(cmd & LOCK_MAND) &&
 	    !(f.file->f_mode & (FMODE_READ|FMODE_WRITE)))
 		goto out_putf;
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 732648b270dc..e2d80e045b85 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -420,11 +420,12 @@ static int fanotify_find_path(int dfd, const char __user *filename,
 		 dfd, filename, flags);
 
 	if (filename == NULL) {
-		struct fd f = fdget(dfd);
+		struct fd f = fdgetr(dfd, CAP_FSTAT);
 
-		ret = -EBADF;
-		if (!f.file)
+		if (IS_ERR(f.file)) {
+			ret = PTR_ERR(f.file);
 			goto out;
+		}
 
 		ret = -ENOTDIR;
 		if ((flags & FAN_MARK_ONLYDIR) &&
@@ -444,7 +445,8 @@ static int fanotify_find_path(int dfd, const char __user *filename,
 		if (flags & FAN_MARK_ONLYDIR)
 			lookup_flags |= LOOKUP_DIRECTORY;
 
-		ret = user_path_at(dfd, filename, lookup_flags, path);
+		ret = user_path_atr(dfd, filename, lookup_flags, path,
+				    CAP_FSTAT, CAP_LOOKUP);
 		if (ret)
 			goto out;
 	}
@@ -794,9 +796,9 @@ SYSCALL_DEFINE5(fanotify_mark, int, fanotify_fd, unsigned int, flags,
 #endif
 		return -EINVAL;
 
-	f = fdget(fanotify_fd);
-	if (unlikely(!f.file))
-		return -EBADF;
+	f = fdgetr(fanotify_fd, CAP_NOTIFY);
+	if (unlikely(IS_ERR(f.file)))
+		return PTR_ERR(f.file);
 
 	/* verify that this is indeed an fanotify instance */
 	ret = -EINVAL;
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 78a2ca3966c3..5b1e506b53d5 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -711,9 +711,9 @@ SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname,
 	if (unlikely(!(mask & ALL_INOTIFY_BITS)))
 		return -EINVAL;
 
-	f = fdget(fd);
-	if (unlikely(!f.file))
-		return -EBADF;
+	f = fdgetr(fd, CAP_NOTIFY);
+	if (unlikely(IS_ERR(f.file)))
+		return PTR_ERR(f.file);
 
 	/* verify that this is indeed an inotify instance */
 	if (unlikely(f.file->f_op != &inotify_fops)) {
@@ -749,9 +749,9 @@ SYSCALL_DEFINE2(inotify_rm_watch, int, fd, __s32, wd)
 	struct fd f;
 	int ret = 0;
 
-	f = fdget(fd);
-	if (unlikely(!f.file))
-		return -EBADF;
+	f = fdgetr(fd, CAP_NOTIFY);
+	if (unlikely(IS_ERR(f.file)))
+		return PTR_ERR(f.file);
 
 	/* verify that this is indeed an inotify instance */
 	ret = -EINVAL;
diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index bf482dfed14f..615baf7ae747 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -1740,9 +1740,13 @@ static ssize_t o2hb_region_dev_write(struct o2hb_region *reg,
 	if (fd < 0 || fd >= INT_MAX)
 		goto out;
 
-	f = fdget(fd);
-	if (f.file == NULL)
+	f = fdgetr(fd, CAP_FSTAT);
+	if (IS_ERR(f.file)) {
+		ret = PTR_ERR(f.file);
+		if (ret == -EBADF)
+			ret = -EINVAL;
 		goto out;
+	}
 
 	if (reg->hr_blocks == 0 || reg->hr_start_block == 0 ||
 	    reg->hr_block_bytes == 0)
diff --git a/fs/open.c b/fs/open.c
index f26c492f3698..f9c4e2fd7987 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -159,10 +159,11 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
 	error = -EINVAL;
 	if (length < 0)
 		goto out;
-	error = -EBADF;
-	f = fdget(fd);
-	if (!f.file)
+	f = fdgetr(fd, CAP_FTRUNCATE);
+	if (IS_ERR(f.file)) {
+		error = PTR_ERR(f.file);
 		goto out;
+	}
 
 	/* explicitly opened as large or we are on 64-bit box */
 	if (f.file->f_flags & O_LARGEFILE)
@@ -302,12 +303,14 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 
 SYSCALL_DEFINE4(fallocate, int, fd, int, mode, loff_t, offset, loff_t, len)
 {
-	struct fd f = fdget(fd);
-	int error = -EBADF;
+	struct fd f = fdgetr(fd, CAP_WRITE);
+	int error;
 
-	if (f.file) {
+	if (!IS_ERR(f.file)) {
 		error = do_fallocate(f.file, mode, offset, len);
 		fdput(f);
+	} else {
+		error = PTR_ERR(f.file);
 	}
 	return error;
 }
@@ -348,7 +351,7 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
 
 	old_cred = override_creds(override_cred);
 retry:
-	res = user_path_at(dfd, filename, lookup_flags, &path);
+	res = user_path_atr(dfd, filename, lookup_flags, &path, CAP_FSTAT);
 	if (res)
 		goto out;
 
@@ -426,13 +429,14 @@ out:
 
 SYSCALL_DEFINE1(fchdir, unsigned int, fd)
 {
-	struct fd f = fdget_raw(fd);
+	struct fd f = fdgetr_raw(fd, CAP_FCHDIR);
 	struct inode *inode;
 	int error = -EBADF;
 
-	error = -EBADF;
-	if (!f.file)
+	if (IS_ERR(f.file)) {
+		error = PTR_ERR(f.file);
 		goto out;
+	}
 
 	inode = file_inode(f.file);
 
@@ -513,13 +517,15 @@ out_unlock:
 
 SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdgetr(fd, CAP_FCHMOD);
 	int err = -EBADF;
 
-	if (f.file) {
+	if (!IS_ERR(f.file)) {
 		audit_inode(NULL, f.file->f_path.dentry, 0);
 		err = chmod_common(&f.file->f_path, mode);
 		fdput(f);
+	} else {
+		err = PTR_ERR(f.file);
 	}
 	return err;
 }
@@ -530,7 +536,7 @@ SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename, umode_t, mode
 	int error;
 	unsigned int lookup_flags = LOOKUP_FOLLOW;
 retry:
-	error = user_path_at(dfd, filename, lookup_flags, &path);
+	error = user_path_atr(dfd, filename, lookup_flags, &path, CAP_FCHMODAT);
 	if (!error) {
 		error = chmod_common(&path, mode);
 		path_put(&path);
@@ -603,7 +609,7 @@ SYSCALL_DEFINE5(fchownat, int, dfd, const char __user *, filename, uid_t, user,
 	if (flag & AT_EMPTY_PATH)
 		lookup_flags |= LOOKUP_EMPTY;
 retry:
-	error = user_path_at(dfd, filename, lookup_flags, &path);
+	error = user_path_atr(dfd, filename, lookup_flags, &path, CAP_FCHOWNAT);
 	if (error)
 		goto out;
 	error = mnt_want_write(path.mnt);
@@ -634,11 +640,13 @@ SYSCALL_DEFINE3(lchown, const char __user *, filename, uid_t, user, gid_t, group
 
 SYSCALL_DEFINE3(fchown, unsigned int, fd, uid_t, user, gid_t, group)
 {
-	struct fd f = fdget(fd);
-	int error = -EBADF;
+	struct fd f = fdgetr(fd, CAP_FCHOWN);
+	int error;
 
-	if (!f.file)
+	if (IS_ERR(f.file)) {
+		error = PTR_ERR(f.file);
 		goto out;
+	}
 
 	error = mnt_want_write_file(f.file);
 	if (error)
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 89026095f2b5..dc29c4d6f050 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -272,9 +272,9 @@ struct file *proc_ns_fget(int fd)
 {
 	struct file *file;
 
-	file = fget(fd);
-	if (!file)
-		return ERR_PTR(-EBADF);
+	file = fgetr(fd, CAP_SETNS);
+	if (IS_ERR(file))
+		return file;
 
 	if (file->f_op != &ns_file_operations)
 		goto out_invalid;
diff --git a/fs/read_write.c b/fs/read_write.c
index bd4cc3770b42..29404f2245f2 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -274,9 +274,9 @@ static inline void fdput_pos(struct fd f)
 SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
 {
 	off_t retval;
-	struct fd f = fdget_pos(fd);
-	if (!f.file)
-		return -EBADF;
+	struct fd f = fdgetr_pos(fd, CAP_SEEK);
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 
 	retval = -EINVAL;
 	if (whence <= SEEK_MAX) {
@@ -302,11 +302,11 @@ SYSCALL_DEFINE5(llseek, unsigned int, fd, unsigned long, offset_high,
 		unsigned int, whence)
 {
 	int retval;
-	struct fd f = fdget_pos(fd);
+	struct fd f = fdgetr_pos(fd, CAP_SEEK);
 	loff_t offset;
 
-	if (!f.file)
-		return -EBADF;
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 
 	retval = -EINVAL;
 	if (whence > SEEK_MAX)
@@ -505,15 +505,17 @@ static inline void file_pos_write(struct file *file, loff_t pos)
 
 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 {
-	struct fd f = fdget_pos(fd);
-	ssize_t ret = -EBADF;
+	struct fd f = fdgetr_pos(fd, CAP_READ);
+	ssize_t ret;
 
-	if (f.file) {
+	if (!IS_ERR(f.file)) {
 		loff_t pos = file_pos_read(f.file);
 		ret = vfs_read(f.file, buf, count, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
+	} else {
+		ret = PTR_ERR(f.file);
 	}
 	return ret;
 }
@@ -521,15 +523,17 @@ SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 		size_t, count)
 {
-	struct fd f = fdget_pos(fd);
-	ssize_t ret = -EBADF;
+	struct fd f = fdgetr_pos(fd, CAP_WRITE);
+	ssize_t ret;
 
-	if (f.file) {
+	if (!IS_ERR(f.file)) {
 		loff_t pos = file_pos_read(f.file);
 		ret = vfs_write(f.file, buf, count, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
+	} else {
+		ret = PTR_ERR(f.file);
 	}
 
 	return ret;
@@ -544,12 +548,14 @@ SYSCALL_DEFINE4(pread64, unsigned int, fd, char __user *, buf,
 	if (pos < 0)
 		return -EINVAL;
 
-	f = fdget(fd);
-	if (f.file) {
+	f = fdgetr(fd, CAP_PREAD);
+	if (!IS_ERR(f.file)) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
 			ret = vfs_read(f.file, buf, count, &pos);
 		fdput(f);
+	} else {
+		ret = PTR_ERR(f.file);
 	}
 
 	return ret;
@@ -564,12 +570,14 @@ SYSCALL_DEFINE4(pwrite64, unsigned int, fd, const char __user *, buf,
 	if (pos < 0)
 		return -EINVAL;
 
-	f = fdget(fd);
-	if (f.file) {
+	f = fdgetr(fd, CAP_PWRITE);
+	if (!IS_ERR(f.file)) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)  
 			ret = vfs_write(f.file, buf, count, &pos);
 		fdput(f);
+	} else {
+		ret = PTR_ERR(f.file);
 	}
 
 	return ret;
@@ -804,15 +812,17 @@ EXPORT_SYMBOL(vfs_writev);
 SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 		unsigned long, vlen)
 {
-	struct fd f = fdget_pos(fd);
-	ssize_t ret = -EBADF;
+	struct fd f = fdgetr_pos(fd, CAP_READ);
+	ssize_t ret;
 
-	if (f.file) {
+	if (!IS_ERR(f.file)) {
 		loff_t pos = file_pos_read(f.file);
 		ret = vfs_readv(f.file, vec, vlen, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
+	} else {
+		ret = PTR_ERR(f.file);
 	}
 
 	if (ret > 0)
@@ -824,15 +834,17 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 		unsigned long, vlen)
 {
-	struct fd f = fdget_pos(fd);
-	ssize_t ret = -EBADF;
+	struct fd f = fdgetr_pos(fd, CAP_WRITE);
+	ssize_t ret;
 
-	if (f.file) {
+	if (!IS_ERR(f.file)) {
 		loff_t pos = file_pos_read(f.file);
 		ret = vfs_writev(f.file, vec, vlen, &pos);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
+	} else {
+		ret = PTR_ERR(f.file);
 	}
 
 	if (ret > 0)
@@ -852,17 +864,19 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 {
 	loff_t pos = pos_from_hilo(pos_h, pos_l);
 	struct fd f;
-	ssize_t ret = -EBADF;
+	ssize_t ret;
 
 	if (pos < 0)
 		return -EINVAL;
 
-	f = fdget(fd);
-	if (f.file) {
+	f = fdgetr(fd, CAP_PREAD);
+	if (!IS_ERR(f.file)) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
 			ret = vfs_readv(f.file, vec, vlen, &pos);
 		fdput(f);
+	} else {
+		ret = PTR_ERR(f.file);
 	}
 
 	if (ret > 0)
@@ -876,17 +890,19 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 {
 	loff_t pos = pos_from_hilo(pos_h, pos_l);
 	struct fd f;
-	ssize_t ret = -EBADF;
+	ssize_t ret;
 
 	if (pos < 0)
 		return -EINVAL;
 
-	f = fdget(fd);
-	if (f.file) {
+	f = fdgetr(fd, CAP_PWRITE);
+	if (!IS_ERR(f.file)) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)
 			ret = vfs_writev(f.file, vec, vlen, &pos);
 		fdput(f);
+	} else {
+		ret = PTR_ERR(f.file);
 	}
 
 	if (ret > 0)
@@ -975,12 +991,12 @@ COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
 		const struct compat_iovec __user *,vec,
 		compat_ulong_t, vlen)
 {
-	struct fd f = fdget_pos(fd);
+	struct fd f = fdgetr_pos(fd, CAP_READ);
 	ssize_t ret;
 	loff_t pos;
 
-	if (!f.file)
-		return -EBADF;
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 	pos = f.file->f_pos;
 	ret = compat_readv(f.file, vec, vlen, &pos);
 	if (ret >= 0)
@@ -998,9 +1014,9 @@ static long __compat_sys_preadv64(unsigned long fd,
 
 	if (pos < 0)
 		return -EINVAL;
-	f = fdget(fd);
-	if (!f.file)
-		return -EBADF;
+	f = fdgetr(fd, CAP_PREAD);
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 	ret = -ESPIPE;
 	if (f.file->f_mode & FMODE_PREAD)
 		ret = compat_readv(f.file, vec, vlen, &pos);
@@ -1052,12 +1068,12 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
 		const struct compat_iovec __user *, vec,
 		compat_ulong_t, vlen)
 {
-	struct fd f = fdget_pos(fd);
+	struct fd f = fdgetr_pos(fd, CAP_WRITE);
 	ssize_t ret;
 	loff_t pos;
 
-	if (!f.file)
-		return -EBADF;
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 	pos = f.file->f_pos;
 	ret = compat_writev(f.file, vec, vlen, &pos);
 	if (ret >= 0)
@@ -1075,9 +1091,9 @@ static long __compat_sys_pwritev64(unsigned long fd,
 
 	if (pos < 0)
 		return -EINVAL;
-	f = fdget(fd);
-	if (!f.file)
-		return -EBADF;
+	f = fdgetr(fd, CAP_PWRITE);
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 	ret = -ESPIPE;
 	if (f.file->f_mode & FMODE_PWRITE)
 		ret = compat_writev(f.file, vec, vlen, &pos);
@@ -1118,9 +1134,11 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 	 * Get input file, and verify that it is ok..
 	 */
 	retval = -EBADF;
-	in = fdget(in_fd);
-	if (!in.file)
+	in = fdgetr(in_fd, CAP_PREAD);
+	if (IS_ERR(in.file)) {
+		retval = PTR_ERR(in.file);
 		goto out;
+	}
 	if (!(in.file->f_mode & FMODE_READ))
 		goto fput_in;
 	retval = -ESPIPE;
@@ -1140,9 +1158,11 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 	 * Get output file, and verify that it is ok..
 	 */
 	retval = -EBADF;
-	out = fdget(out_fd);
-	if (!out.file)
+	out = fdgetr(out_fd, CAP_WRITE);
+	if (IS_ERR(out.file)) {
+		retval = PTR_ERR(out.file);
 		goto fput_in;
+	}
 	if (!(out.file->f_mode & FMODE_WRITE))
 		goto fput_out;
 	retval = -EINVAL;
diff --git a/fs/readdir.c b/fs/readdir.c
index 5b53d995cae6..fffbc8395236 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -108,14 +108,14 @@ SYSCALL_DEFINE3(old_readdir, unsigned int, fd,
 		struct old_linux_dirent __user *, dirent, unsigned int, count)
 {
 	int error;
-	struct fd f = fdget(fd);
+	struct fd f = fdgetr(fd, CAP_READ);
 	struct readdir_callback buf = {
 		.ctx.actor = fillonedir,
 		.dirent = dirent
 	};
 
-	if (!f.file)
-		return -EBADF;
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 
 	error = iterate_dir(f.file, &buf.ctx);
 	if (buf.result)
@@ -204,9 +204,9 @@ SYSCALL_DEFINE3(getdents, unsigned int, fd,
 	if (!access_ok(VERIFY_WRITE, dirent, count))
 		return -EFAULT;
 
-	f = fdget(fd);
-	if (!f.file)
-		return -EBADF;
+	f = fdgetr(fd, CAP_READ);
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 
 	error = iterate_dir(f.file, &buf.ctx);
 	if (error >= 0)
@@ -284,9 +284,9 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 	if (!access_ok(VERIFY_WRITE, dirent, count))
 		return -EFAULT;
 
-	f = fdget(fd);
-	if (!f.file)
-		return -EBADF;
+	f = fdgetr(fd, CAP_READ);
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 
 	error = iterate_dir(f.file, &buf.ctx);
 	if (error >= 0)
diff --git a/fs/select.c b/fs/select.c
index 467bb1cb3ea5..079bb7e9c126 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -449,8 +449,8 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 					break;
 				if (!(bit & all_bits))
 					continue;
-				f = fdget(i);
-				if (f.file) {
+				f = fdgetr(i, CAP_POLL_EVENT);
+				if (!IS_ERR(f.file)) {
 					const struct file_operations *f_op;
 					f_op = f.file->f_op;
 					mask = DEFAULT_POLLMASK;
@@ -487,6 +487,9 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 					} else if (busy_flag & mask)
 						can_busy_loop = true;
 
+				} else if (PTR_ERR(f.file) != -EBADF) {
+					retval = PTR_ERR(f.file);
+					break;
 				}
 			}
 			if (res_in)
@@ -757,9 +760,9 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait,
 	mask = 0;
 	fd = pollfd->fd;
 	if (fd >= 0) {
-		struct fd f = fdget(fd);
+		struct fd f = fdgetr(fd, CAP_POLL_EVENT);
 		mask = POLLNVAL;
-		if (f.file) {
+		if (!IS_ERR(f.file)) {
 			mask = DEFAULT_POLLMASK;
 			if (f.file->f_op->poll) {
 				pwait->_key = pollfd->events|POLLERR|POLLHUP;
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 424b7b65321f..949fdb4d1ae9 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -288,9 +288,9 @@ SYSCALL_DEFINE4(signalfd4, int, ufd, sigset_t __user *, user_mask,
 		if (ufd < 0)
 			kfree(ctx);
 	} else {
-		struct fd f = fdget(ufd);
-		if (!f.file)
-			return -EBADF;
+		struct fd f = fdgetr(ufd, CAP_FSIGNAL);
+		if (IS_ERR(f.file))
+			return PTR_ERR(f.file);
 		ctx = f.file->private_data;
 		if (f.file->f_op != &signalfd_fops) {
 			fdput(f);
diff --git a/fs/splice.c b/fs/splice.c
index e246954ea48c..279b47455149 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1636,14 +1636,16 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
 		return 0;
 
 	error = -EBADF;
-	f = fdget(fd);
-	if (f.file) {
+	f = fdgetr(fd, CAP_WRITE);
+	if (!IS_ERR(f.file)) {
 		if (f.file->f_mode & FMODE_WRITE)
 			error = vmsplice_to_pipe(f.file, iov, nr_segs, flags);
 		else if (f.file->f_mode & FMODE_READ)
 			error = vmsplice_to_user(f.file, iov, nr_segs, flags);
 
 		fdput(f);
+	} else {
+		error = PTR_ERR(f.file);
 	}
 
 	return error;
@@ -1681,19 +1683,23 @@ SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in,
 		return 0;
 
 	error = -EBADF;
-	in = fdget(fd_in);
-	if (in.file) {
+	in = fdgetr(fd_in, CAP_PREAD);
+	if (!IS_ERR(in.file)) {
 		if (in.file->f_mode & FMODE_READ) {
-			out = fdget(fd_out);
-			if (out.file) {
+			out = fdgetr(fd_out, CAP_PWRITE);
+			if (!IS_ERR(out.file)) {
 				if (out.file->f_mode & FMODE_WRITE)
 					error = do_splice(in.file, off_in,
 							  out.file, off_out,
 							  len, flags);
 				fdput(out);
+			} else {
+				error = PTR_ERR(out.file);
 			}
 		}
 		fdput(in);
+	} else {
+		error = PTR_ERR(in.file);
 	}
 	return error;
 }
@@ -2012,19 +2018,23 @@ SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags)
 		return 0;
 
 	error = -EBADF;
-	in = fdget(fdin);
-	if (in.file) {
+	in = fdgetr(fdin, CAP_READ);
+	if (!IS_ERR(in.file)) {
 		if (in.file->f_mode & FMODE_READ) {
-			struct fd out = fdget(fdout);
-			if (out.file) {
+			struct fd out = fdgetr(fdout, CAP_WRITE);
+			if (!IS_ERR(out.file)) {
 				if (out.file->f_mode & FMODE_WRITE)
 					error = do_tee(in.file, out.file,
 							len, flags);
 				fdput(out);
+			} else {
+				error = PTR_ERR(out.file);
 			}
 		}
- 		fdput(in);
- 	}
+		fdput(in);
+	} else {
+		error = PTR_ERR(in.file);
+	}
 
 	return error;
 }
diff --git a/fs/stat.c b/fs/stat.c
index ae0c3cef9927..f40b3530eab4 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -76,12 +76,14 @@ EXPORT_SYMBOL(vfs_getattr);
 
 int vfs_fstat(unsigned int fd, struct kstat *stat)
 {
-	struct fd f = fdget_raw(fd);
-	int error = -EBADF;
+	struct fd f = fdgetr_raw(fd, CAP_FSTAT);
+	int error;
 
-	if (f.file) {
+	if (!IS_ERR(f.file)) {
 		error = vfs_getattr(&f.file->f_path, stat);
 		fdput(f);
+	} else {
+		error = PTR_ERR(f.file);
 	}
 	return error;
 }
@@ -103,7 +105,7 @@ int vfs_fstatat(int dfd, const char __user *filename, struct kstat *stat,
 	if (flag & AT_EMPTY_PATH)
 		lookup_flags |= LOOKUP_EMPTY;
 retry:
-	error = user_path_at(dfd, filename, lookup_flags, &path);
+	error = user_path_atr(dfd, filename, lookup_flags, &path, CAP_FSTAT);
 	if (error)
 		goto out;
 
diff --git a/fs/statfs.c b/fs/statfs.c
index 083dc0ac9140..f1b60bf1d14c 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -94,11 +94,13 @@ retry:
 
 int fd_statfs(int fd, struct kstatfs *st)
 {
-	struct fd f = fdget_raw(fd);
-	int error = -EBADF;
-	if (f.file) {
+	struct fd f = fdgetr_raw(fd, CAP_FSTATFS);
+	int error;
+	if (!IS_ERR(f.file)) {
 		error = vfs_statfs(&f.file->f_path, st);
 		fdput(f);
+	} else {
+		error = PTR_ERR(f.file);
 	}
 	return error;
 }
diff --git a/fs/sync.c b/fs/sync.c
index b28d1dd10e8b..663afe812600 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -148,12 +148,12 @@ void emergency_sync(void)
  */
 SYSCALL_DEFINE1(syncfs, int, fd)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdgetr(fd, CAP_FSYNC);
 	struct super_block *sb;
 	int ret;
 
-	if (!f.file)
-		return -EBADF;
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 	sb = f.file->f_dentry->d_sb;
 
 	down_read(&sb->s_umount);
@@ -199,12 +199,14 @@ EXPORT_SYMBOL(vfs_fsync);
 
 static int do_fsync(unsigned int fd, int datasync)
 {
-	struct fd f = fdget(fd);
-	int ret = -EBADF;
+	struct fd f = fdgetr(fd, CAP_FSYNC);
+	int ret;
 
-	if (f.file) {
+	if (!IS_ERR(f.file)) {
 		ret = vfs_fsync(f.file, datasync);
 		fdput(f);
+	} else {
+		ret = PTR_ERR(f.file);
 	}
 	return ret;
 }
@@ -310,10 +312,11 @@ SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
 	else
 		endbyte--;		/* inclusive */
 
-	ret = -EBADF;
-	f = fdget(fd);
-	if (!f.file)
+	f = fdgetr(fd, CAP_FSYNC, CAP_SEEK);
+	if (IS_ERR(f.file)) {
+		ret = PTR_ERR(f.file);
 		goto out;
+	}
 
 	i_mode = file_inode(f.file)->i_mode;
 	ret = -ESPIPE;
diff --git a/fs/timerfd.c b/fs/timerfd.c
index 0013142c0475..daf04417e2cf 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -291,6 +291,32 @@ static const struct file_operations timerfd_fops = {
 	.llseek		= noop_llseek,
 };
 
+#ifdef CONFIG_SECURITY_CAPSICUM
+#define timerfd_fgetr(f, p, ...) \
+	_timerfd_fgetr((f), (p), __VA_ARGS__, 0ULL)
+static int _timerfd_fgetr(int fd, struct fd *p, ...)
+{
+	struct capsicum_rights rights;
+	struct fd f;
+	va_list ap;
+
+	va_start(ap, p);
+	f = fdget_rights(fd, cap_rights_vinit(&rights, ap));
+	va_end(ap);
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
+	if (f.file->f_op != &timerfd_fops) {
+		fdput(f);
+		return -EINVAL;
+	}
+	*p = f;
+	return 0;
+}
+
+#else
+
+#define timerfd_fgetr(f, p, ...) \
+	timerfd_fget((f), (p))
 static int timerfd_fget(int fd, struct fd *p)
 {
 	struct fd f = fdget(fd);
@@ -304,6 +330,8 @@ static int timerfd_fget(int fd, struct fd *p)
 	return 0;
 }
 
+#endif
+
 SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
 {
 	int ufd;
@@ -359,7 +387,7 @@ static int do_timerfd_settime(int ufd, int flags,
 	    !timespec_valid(&new->it_interval))
 		return -EINVAL;
 
-	ret = timerfd_fget(ufd, &f);
+	ret = timerfd_fgetr(ufd, &f, CAP_WRITE, (old ? CAP_READ : 0));
 	if (ret)
 		return ret;
 	ctx = f.file->private_data;
@@ -397,8 +425,10 @@ static int do_timerfd_settime(int ufd, int flags,
 			hrtimer_forward_now(&ctx->t.tmr, ctx->tintv);
 	}
 
-	old->it_value = ktime_to_timespec(timerfd_get_remaining(ctx));
-	old->it_interval = ktime_to_timespec(ctx->tintv);
+	if (old) {
+		old->it_value = ktime_to_timespec(timerfd_get_remaining(ctx));
+		old->it_interval = ktime_to_timespec(ctx->tintv);
+	}
 
 	/*
 	 * Re-program the timer to the new value ...
@@ -414,7 +444,7 @@ static int do_timerfd_gettime(int ufd, struct itimerspec *t)
 {
 	struct fd f;
 	struct timerfd_ctx *ctx;
-	int ret = timerfd_fget(ufd, &f);
+	int ret = timerfd_fgetr(ufd, &f, CAP_READ);
 	if (ret)
 		return ret;
 	ctx = f.file->private_data;
@@ -451,7 +481,7 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
 
 	if (copy_from_user(&new, utmr, sizeof(new)))
 		return -EFAULT;
-	ret = do_timerfd_settime(ufd, flags, &new, &old);
+	ret = do_timerfd_settime(ufd, flags, &new, otmr ? &old : NULL);
 	if (ret)
 		return ret;
 	if (otmr && copy_to_user(otmr, &old, sizeof(old)))
diff --git a/fs/utimes.c b/fs/utimes.c
index aa138d64560a..1d451efd6ae2 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -152,10 +152,11 @@ long do_utimes(int dfd, const char __user *filename, struct timespec *times,
 		if (flags & AT_SYMLINK_NOFOLLOW)
 			goto out;
 
-		f = fdget(dfd);
-		error = -EBADF;
-		if (!f.file)
+		f = fdgetr(dfd, CAP_FUTIMES);
+		if (IS_ERR(f.file)) {
+			error = PTR_ERR(f.file);
 			goto out;
+		}
 
 		error = utimes_common(&f.file->f_path, times);
 		fdput(f);
@@ -166,7 +167,8 @@ long do_utimes(int dfd, const char __user *filename, struct timespec *times,
 		if (!(flags & AT_SYMLINK_NOFOLLOW))
 			lookup_flags |= LOOKUP_FOLLOW;
 retry:
-		error = user_path_at(dfd, filename, lookup_flags, &path);
+		error = user_path_atr(dfd, filename, lookup_flags, &path,
+				      CAP_FUTIMESAT);
 		if (error)
 			goto out;
 
diff --git a/fs/xattr.c b/fs/xattr.c
index 3377dff18404..3013dc4cbf27 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -415,12 +415,12 @@ retry:
 SYSCALL_DEFINE5(fsetxattr, int, fd, const char __user *, name,
 		const void __user *,value, size_t, size, int, flags)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdgetr(fd, CAP_EXTATTR_SET);
 	struct dentry *dentry;
-	int error = -EBADF;
+	int error;
 
-	if (!f.file)
-		return error;
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 	dentry = f.file->f_path.dentry;
 	audit_inode(NULL, dentry, 0);
 	error = mnt_want_write_file(f.file);
@@ -522,11 +522,11 @@ retry:
 SYSCALL_DEFINE4(fgetxattr, int, fd, const char __user *, name,
 		void __user *, value, size_t, size)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdgetr(fd, CAP_EXTATTR_GET);
 	ssize_t error = -EBADF;
 
-	if (!f.file)
-		return error;
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 	audit_inode(NULL, f.file->f_path.dentry, 0);
 	error = getxattr(f.file->f_path.dentry, name, value, size);
 	fdput(f);
@@ -611,11 +611,11 @@ retry:
 
 SYSCALL_DEFINE3(flistxattr, int, fd, char __user *, list, size_t, size)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdgetr(fd, CAP_EXTATTR_LIST);
 	ssize_t error = -EBADF;
 
-	if (!f.file)
-		return error;
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 	audit_inode(NULL, f.file->f_path.dentry, 0);
 	error = listxattr(f.file->f_path.dentry, list, size);
 	fdput(f);
@@ -688,12 +688,12 @@ retry:
 
 SYSCALL_DEFINE2(fremovexattr, int, fd, const char __user *, name)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdgetr(fd, CAP_EXTATTR_DELETE);
 	struct dentry *dentry;
 	int error = -EBADF;
 
-	if (!f.file)
-		return error;
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 	dentry = f.file->f_path.dentry;
 	audit_inode(NULL, dentry, 0);
 	error = mnt_want_write_file(f.file);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 0b18776b075e..a034c21be2a0 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -76,9 +76,9 @@ xfs_find_handle(
 	struct xfs_inode	*ip;
 
 	if (cmd == XFS_IOC_FD_TO_HANDLE) {
-		f = fdget(hreq->fd);
-		if (!f.file)
-			return -EBADF;
+		f = fdgetr(hreq->fd, CAP_FSTAT);
+		if (IS_ERR(f.file))
+			return PTR_ERR(f.file);
 		inode = file_inode(f.file);
 	} else {
 		error = user_lpath((const char __user *)hreq->path, &path);
@@ -1449,8 +1449,8 @@ xfs_ioc_swapext(
 	int		error = 0;
 
 	/* Pull information for the target fd */
-	f = fdget((int)sxp->sx_fdtarget);
-	if (!f.file) {
+	f = fdgetr((int)sxp->sx_fdtarget, CAP_READ, CAP_WRITE, CAP_FSTAT);
+	if (IS_ERR(f.file)) {
 		error = XFS_ERROR(EINVAL);
 		goto out;
 	}
@@ -1462,8 +1462,8 @@ xfs_ioc_swapext(
 		goto out_put_file;
 	}
 
-	tmp = fdget((int)sxp->sx_fdtmp);
-	if (!tmp.file) {
+	tmp = fdgetr((int)sxp->sx_fdtmp, CAP_READ, CAP_WRITE, CAP_FSTAT);
+	if (IS_ERR(tmp.file)) {
 		error = XFS_ERROR(EINVAL);
 		goto out_put_file;
 	}
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 4fcf39af1776..f38639676b00 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -978,9 +978,9 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,
 
 	audit_mq_sendrecv(mqdes, msg_len, msg_prio, timeout ? &ts : NULL);
 
-	f = fdget(mqdes);
-	if (unlikely(!f.file)) {
-		ret = -EBADF;
+	f = fdgetr(mqdes, CAP_WRITE);
+	if (unlikely(IS_ERR(f.file))) {
+		ret = PTR_ERR(f.file);
 		goto out;
 	}
 
@@ -1094,9 +1094,9 @@ SYSCALL_DEFINE5(mq_timedreceive, mqd_t, mqdes, char __user *, u_msg_ptr,
 
 	audit_mq_sendrecv(mqdes, msg_len, 0, timeout ? &ts : NULL);
 
-	f = fdget(mqdes);
-	if (unlikely(!f.file)) {
-		ret = -EBADF;
+	f = fdgetr(mqdes, CAP_READ);
+	if (unlikely(IS_ERR(f.file))) {
+		ret = PTR_ERR(f.file);
 		goto out;
 	}
 
@@ -1229,9 +1229,9 @@ SYSCALL_DEFINE2(mq_notify, mqd_t, mqdes,
 			skb_put(nc, NOTIFY_COOKIE_LEN);
 			/* and attach it to the socket */
 retry:
-			f = fdget(notification.sigev_signo);
-			if (!f.file) {
-				ret = -EBADF;
+			f = fdgetr(notification.sigev_signo, CAP_POLL_EVENT);
+			if (IS_ERR(f.file)) {
+				ret = PTR_ERR(f.file);
 				goto out;
 			}
 			sock = netlink_getsockbyfilp(f.file);
@@ -1254,9 +1254,9 @@ retry:
 		}
 	}
 
-	f = fdget(mqdes);
-	if (!f.file) {
-		ret = -EBADF;
+	f = fdgetr(mqdes, CAP_POLL_EVENT);
+	if (IS_ERR(f.file)) {
+		ret = PTR_ERR(f.file);
 		goto out;
 	}
 
@@ -1328,9 +1328,9 @@ SYSCALL_DEFINE3(mq_getsetattr, mqd_t, mqdes,
 			return -EINVAL;
 	}
 
-	f = fdget(mqdes);
-	if (!f.file) {
-		ret = -EBADF;
+	f = fdgetr(mqdes, CAP_POLL_EVENT);
+	if (IS_ERR(f.file)) {
+		ret = PTR_ERR(f.file);
 		goto out;
 	}
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 440eefc67397..43aa1a2cbc84 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -601,11 +601,11 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 {
 	struct perf_cgroup *cgrp;
 	struct cgroup_subsys_state *css;
-	struct fd f = fdget(fd);
+	struct fd f = fdgetr(fd, CAP_FSTAT);
 	int ret = 0;
 
-	if (!f.file)
-		return -EBADF;
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 
 	css = css_tryget_from_dir(f.file->f_dentry, &perf_event_cgrp_subsys);
 	if (IS_ERR(css)) {
@@ -3598,9 +3598,9 @@ static const struct file_operations perf_fops;
 
 static inline int perf_fget_light(int fd, struct fd *p)
 {
-	struct fd f = fdget(fd);
-	if (!f.file)
-		return -EBADF;
+	struct fd f = fdgetr(fd, CAP_WRITE);
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 
 	if (f.file->f_op != &perf_fops) {
 		fdput(f);
@@ -3651,7 +3651,7 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		int ret;
 		if (arg != -1) {
 			struct perf_event *output_event;
-			struct fd output;
+			struct fd output = { .file = NULL };
 			ret = perf_fget_light(arg, &output);
 			if (ret)
 				return ret;
diff --git a/kernel/module.c b/kernel/module.c
index 079c4615607d..6a0a1a28d34a 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -2513,14 +2513,18 @@ static int copy_module_from_user(const void __user *umod, unsigned long len,
 /* Sets info->hdr and info->len. */
 static int copy_module_from_fd(int fd, struct load_info *info)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdgetr(fd, CAP_FEXECVE);
 	int err;
 	struct kstat stat;
 	loff_t pos;
 	ssize_t bytes = 0;
 
-	if (!f.file)
-		return -ENOEXEC;
+	if (IS_ERR(f.file)) {
+		err = PTR_ERR(f.file);
+		if (err == -EBADF)
+			err = -ENOEXEC;
+		return err;
+	}
 
 	err = security_kernel_module_from_file(f.file);
 	if (err)
diff --git a/kernel/sys.c b/kernel/sys.c
index fba0f29401ea..e158bce22c6b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1634,9 +1634,9 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 	struct inode *inode;
 	int err;
 
-	exe = fdget(fd);
-	if (!exe.file)
-		return -EBADF;
+	exe = fdgetr(fd, CAP_FEXECVE);
+	if (IS_ERR(exe.file))
+		return PTR_ERR(exe.file);
 
 	inode = file_inode(exe.file);
 
diff --git a/kernel/taskstats.c b/kernel/taskstats.c
index 13d2f7cd65db..1ad5fd005334 100644
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -437,8 +437,8 @@ static int cgroupstats_user_cmd(struct sk_buff *skb, struct genl_info *info)
 		return -EINVAL;
 
 	fd = nla_get_u32(info->attrs[CGROUPSTATS_CMD_ATTR_FD]);
-	f = fdget(fd);
-	if (!f.file)
+	f = fdgetr(fd, CAP_FSTAT);
+	if (IS_ERR(f.file))
 		return 0;
 
 	size = nla_total_size(sizeof(struct cgroupstats));
diff --git a/kernel/time/posix-clock.c b/kernel/time/posix-clock.c
index ce033c7aa2e8..0766473c77ea 100644
--- a/kernel/time/posix-clock.c
+++ b/kernel/time/posix-clock.c
@@ -246,13 +246,18 @@ struct posix_clock_desc {
 	struct posix_clock *clk;
 };
 
-static int get_clock_desc(const clockid_t id, struct posix_clock_desc *cd)
+static int get_clock_desc(const clockid_t id, struct posix_clock_desc *cd,
+			  u64 right)
 {
-	struct file *fp = fget(CLOCKID_TO_FD(id));
+	struct file *fp = fgetr(CLOCKID_TO_FD(id), right);
 	int err = -EINVAL;
 
-	if (!fp)
+	if (IS_ERR(fp)) {
+		err = PTR_ERR(fp);
+		if (err == -EBADF)
+			err = -EINVAL;
 		return err;
+	}
 
 	if (fp->f_op->open != posix_clock_open || !fp->private_data)
 		goto out;
@@ -278,7 +283,7 @@ static int pc_clock_adjtime(clockid_t id, struct timex *tx)
 	struct posix_clock_desc cd;
 	int err;
 
-	err = get_clock_desc(id, &cd);
+	err = get_clock_desc(id, &cd, CAP_WRITE);
 	if (err)
 		return err;
 
@@ -302,7 +307,7 @@ static int pc_clock_gettime(clockid_t id, struct timespec *ts)
 	struct posix_clock_desc cd;
 	int err;
 
-	err = get_clock_desc(id, &cd);
+	err = get_clock_desc(id, &cd, CAP_READ);
 	if (err)
 		return err;
 
@@ -321,7 +326,7 @@ static int pc_clock_getres(clockid_t id, struct timespec *ts)
 	struct posix_clock_desc cd;
 	int err;
 
-	err = get_clock_desc(id, &cd);
+	err = get_clock_desc(id, &cd, CAP_READ);
 	if (err)
 		return err;
 
@@ -340,7 +345,7 @@ static int pc_clock_settime(clockid_t id, const struct timespec *ts)
 	struct posix_clock_desc cd;
 	int err;
 
-	err = get_clock_desc(id, &cd);
+	err = get_clock_desc(id, &cd, CAP_WRITE);
 	if (err)
 		return err;
 
@@ -365,7 +370,7 @@ static int pc_timer_create(struct k_itimer *kit)
 	struct posix_clock_desc cd;
 	int err;
 
-	err = get_clock_desc(id, &cd);
+	err = get_clock_desc(id, &cd, CAP_WRITE);
 	if (err)
 		return err;
 
@@ -385,7 +390,7 @@ static int pc_timer_delete(struct k_itimer *kit)
 	struct posix_clock_desc cd;
 	int err;
 
-	err = get_clock_desc(id, &cd);
+	err = get_clock_desc(id, &cd, CAP_WRITE);
 	if (err)
 		return err;
 
@@ -404,7 +409,7 @@ static void pc_timer_gettime(struct k_itimer *kit, struct itimerspec *ts)
 	clockid_t id = kit->it_clock;
 	struct posix_clock_desc cd;
 
-	if (get_clock_desc(id, &cd))
+	if (get_clock_desc(id, &cd, CAP_READ))
 		return;
 
 	if (cd.clk->ops.timer_gettime)
@@ -420,7 +425,7 @@ static int pc_timer_settime(struct k_itimer *kit, int flags,
 	struct posix_clock_desc cd;
 	int err;
 
-	err = get_clock_desc(id, &cd);
+	err = get_clock_desc(id, &cd, CAP_WRITE);
 	if (err)
 		return err;
 
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81db45e..69d51a43dc56 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -27,7 +27,7 @@
  */
 SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 {
-	struct fd f = fdget(fd);
+	struct fd f;
 	struct address_space *mapping;
 	struct backing_dev_info *bdi;
 	loff_t endbyte;			/* inclusive */
@@ -36,8 +36,9 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 	unsigned long nrpages;
 	int ret = 0;
 
-	if (!f.file)
-		return -EBADF;
+	f = fdgetr(fd, CAP_LIST_END);
+	if (IS_ERR(f.file))
+		return PTR_ERR(f.file);
 
 	if (S_ISFIFO(file_inode(f.file)->i_mode)) {
 		ret = -ESPIPE;
diff --git a/mm/internal.h b/mm/internal.h
index 07b67361a40a..fc58791021af 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -13,6 +13,8 @@
 
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/capsicum.h>
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
@@ -91,6 +93,23 @@ static inline void get_page_foll(struct page *page)
 	}
 }
 
+static inline struct capsicum_rights *
+mmap_rights(struct capsicum_rights *rights,
+	    unsigned long prot,
+	    unsigned long flags)
+{
+#ifdef CONFIG_SECURITY_CAPSICUM
+	cap_rights_init(rights, CAP_MMAP);
+	if (prot & PROT_READ)
+		cap_rights_set(rights, CAP_MMAP_R);
+	if ((flags & MAP_SHARED) && (prot & PROT_WRITE))
+		cap_rights_set(rights, CAP_MMAP_W);
+	if (prot & PROT_EXEC)
+		cap_rights_set(rights, CAP_MMAP_X);
+#endif
+	return rights;
+}
+
 extern unsigned long highest_memmap_pfn;
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5177c6d4a2dd..b113301b5b2b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6000,9 +6000,9 @@ static int memcg_write_event_control(struct cgroup_subsys_state *css,
 	init_waitqueue_func_entry(&event->wait, memcg_event_wake);
 	INIT_WORK(&event->remove, memcg_event_remove);
 
-	efile = fdget(efd);
-	if (!efile.file) {
-		ret = -EBADF;
+	efile = fdgetr(efd, CAP_WRITE);
+	if (IS_ERR(efile.file)) {
+		ret = PTR_ERR(efile.file);
 		goto out_kfree;
 	}
 
@@ -6012,9 +6012,9 @@ static int memcg_write_event_control(struct cgroup_subsys_state *css,
 		goto out_put_efile;
 	}
 
-	cfile = fdget(cfd);
-	if (!cfile.file) {
-		ret = -EBADF;
+	cfile = fdgetr(cfd, CAP_READ);
+	if (IS_ERR(cfile.file)) {
+		ret = PTR_ERR(cfile.file);
 		goto out_put_eventfd;
 	}
 
diff --git a/mm/mmap.c b/mm/mmap.c
index b1202cf81f4b..b347a2c5984c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1379,10 +1379,13 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 	unsigned long retval = -EBADF;
 
 	if (!(flags & MAP_ANONYMOUS)) {
+		struct capsicum_rights rights;
 		audit_mmap_fd(fd, flags);
-		file = fget(fd);
-		if (!file)
+		file = fget_rights(fd, mmap_rights(&rights, prot, flags));
+		if (IS_ERR(file)) {
+			retval = PTR_ERR(file);
 			goto out;
+		}
 		if (is_file_hugepages(file))
 			len = ALIGN(len, huge_page_size(hstate_file(file)));
 		retval = -EINVAL;
diff --git a/mm/nommu.c b/mm/nommu.c
index 85f8d6698d48..a2d03530a8c4 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1493,13 +1493,16 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 		unsigned long, fd, unsigned long, pgoff)
 {
 	struct file *file = NULL;
-	unsigned long retval = -EBADF;
+	unsigned long retval;
 
 	audit_mmap_fd(fd, flags);
 	if (!(flags & MAP_ANONYMOUS)) {
-		file = fget(fd);
-		if (!file)
+		struct capsicum_rights rights;
+		file = fget_rights(fd, mmap_rights(&rights, prot, flags));
+		if (IS_ERR(file)) {
+			retval = PTR_ERR(file);
 			goto out;
+		}
 	}
 
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
diff --git a/mm/readahead.c b/mm/readahead.c
index 0ca36a7770b1..781125653dcf 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -566,8 +566,8 @@ SYSCALL_DEFINE3(readahead, int, fd, loff_t, offset, size_t, count)
 	struct fd f;
 
 	ret = -EBADF;
-	f = fdget(fd);
-	if (f.file) {
+	f = fdgetr(fd, CAP_PREAD);
+	if (!IS_ERR(f.file)) {
 		if (f.file->f_mode & FMODE_READ) {
 			struct address_space *mapping = f.file->f_mapping;
 			pgoff_t start = offset >> PAGE_CACHE_SHIFT;
@@ -576,6 +576,8 @@ SYSCALL_DEFINE3(readahead, int, fd, loff_t, offset, size_t, count)
 			ret = do_readahead(mapping, f.file, start, len);
 		}
 		fdput(f);
+	} else {
+		ret = PTR_ERR(f.file);
 	}
 	return ret;
 }
diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 80d08f6664cb..6d0866ac873d 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -789,12 +789,12 @@ static int p9_fd_open(struct p9_client *client, int rfd, int wfd)
 	if (!ts)
 		return -ENOMEM;
 
-	ts->rd = fget(rfd);
-	ts->wr = fget(wfd);
-	if (!ts->rd || !ts->wr) {
-		if (ts->rd)
+	ts->rd = fgetr(rfd, CAP_READ, CAP_POLL_EVENT);
+	ts->wr = fgetr(wfd, CAP_WRITE, CAP_POLL_EVENT);
+	if (IS_ERR(ts->rd) || IS_ERR(ts->wr)) {
+		if (!IS_ERR(ts->rd))
 			fput(ts->rd);
-		if (ts->wr)
+		if (!IS_ERR(ts->wr))
 			fput(ts->wr);
 		kfree(ts);
 		return -EIO;
diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
index b653ab001fba..8bd3eb38f260 100644
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -1611,10 +1611,14 @@ static int snd_pcm_link(struct snd_pcm_substream *substream, int fd)
 	struct snd_pcm_file *pcm_file;
 	struct snd_pcm_substream *substream1;
 	struct snd_pcm_group *group;
-	struct fd f = fdget(fd);
+	struct fd f = fdgetr(fd, CAP_LIST_END);
 
-	if (!f.file)
-		return -EBADFD;
+	if (IS_ERR(f.file)) {
+		res = PTR_ERR(f.file);
+		if (res == -EBADF)
+			return -EBADFD;
+		return res;
+	}
 	if (!is_pcm_file(f.file)) {
 		res = -EBADFD;
 		goto _badf;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 29c2a04e036e..75c101677edd 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -306,9 +306,9 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 	INIT_WORK(&irqfd->inject, irqfd_inject);
 	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
 
-	f = fdget(args->fd);
-	if (!f.file) {
-		ret = -EBADF;
+	f = fdgetr(args->fd, CAP_WRITE);
+	if (IS_ERR(f.file)) {
+		ret = PTR_ERR(f.file);
 		goto out;
 	}
 
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index ba1a93f935c7..1f427fafa03b 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -124,9 +124,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		if (get_user(fd, argp))
 			return -EFAULT;
 
-		f = fdget(fd);
-		if (!f.file)
-			return -EBADF;
+		f = fdgetr(fd, CAP_FSTAT);
+		if (IS_ERR(f.file))
+			return PTR_ERR(f.file);
 
 		vfio_group = kvm_vfio_group_get_external_user(f.file);
 		fdput(f);
@@ -164,9 +164,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
 		if (get_user(fd, argp))
 			return -EFAULT;
 
-		f = fdget(fd);
-		if (!f.file)
-			return -EBADF;
+		f = fdgetr(fd, CAP_FSTAT);
+		if (IS_ERR(f.file))
+			return PTR_ERR(f.file);
 
 		vfio_group = kvm_vfio_group_get_external_user(f.file);
 		fdput(f);
-- 
2.0.0.526.g5318336



* [PATCH 06/11] capsicum: implement sockfd_lookupr()
  2014-06-30 10:28 [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) David Drysdale
                   ` (4 preceding siblings ...)
  2014-06-30 10:28 ` [PATCH 05/11] capsicum: convert callers to use fgetr() etc David Drysdale
@ 2014-06-30 10:28 ` David Drysdale
  2014-06-30 10:28 ` [PATCH 07/11] capsicum: convert callers to use sockfd_lookupr() etc David Drysdale
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Add variants of sockfd_lookup() and related functions where the caller
indicates the operations that will be performed on the socket.

If CONFIG_SECURITY_CAPSICUM is defined, these variants use the
fgetr()-style functions to retrieve the struct file from the file
descriptor.

If CONFIG_SECURITY_CAPSICUM is not defined, these variants fall back to
the normal fget()-based lookups and ignore the rights arguments.

Signed-off-by: David Drysdale <drysdale@google.com>
---
 include/linux/net.h |  16 +++++++
 net/socket.c        | 118 ++++++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 117 insertions(+), 17 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 17d83393afcc..05429ce3b730 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -24,6 +24,7 @@
 #include <linux/fcntl.h>	/* For O_CLOEXEC and O_NONBLOCK */
 #include <linux/kmemcheck.h>
 #include <linux/rcupdate.h>
+#include <linux/capsicum.h>
 #include <linux/jump_label.h>
 #include <uapi/linux/net.h>
 
@@ -222,6 +223,21 @@ struct socket *sock_from_file(struct file *file, int *err);
 #define		     sockfd_put(sock) fput(sock->file)
 int net_ratelimit(void);
 
+#ifdef CONFIG_SECURITY_CAPSICUM
+struct socket *sockfd_lookup_rights(int fd, int *err,
+				    struct capsicum_rights *rights);
+struct socket *_sockfd_lookupr(int fd, int *err, ...);
+#define sockfd_lookupr(fd, err, ...) \
+	_sockfd_lookupr((fd), (err), __VA_ARGS__, 0ULL)
+#else
+static inline struct socket *
+sockfd_lookup_rights(int fd, int *err, struct capsicum_rights *rights)
+{
+	return sockfd_lookup(fd, err);
+}
+#define sockfd_lookupr(fd, err, ...)	sockfd_lookup((fd), (err))
+#endif
+
 #define net_ratelimited_function(function, ...)			\
 do {								\
 	if (net_ratelimit())					\
diff --git a/net/socket.c b/net/socket.c
index abf56b2a14f9..f254e9bf9c4d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -96,6 +96,7 @@
 #include <net/compat.h>
 #include <net/wext.h>
 #include <net/cls_cgroup.h>
+#include <net/sctp/sctp.h>
 
 #include <net/sock.h>
 #include <linux/netfilter.h>
@@ -418,6 +419,106 @@ struct socket *sock_from_file(struct file *file, int *err)
 }
 EXPORT_SYMBOL(sock_from_file);
 
+static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
+{
+	struct fd f = fdget(fd);
+	struct socket *sock;
+
+	*err = -EBADF;
+	if (f.file) {
+		sock = sock_from_file(f.file, err);
+		if (likely(sock)) {
+			*fput_needed = f.flags;
+			return sock;
+		}
+		fdput(f);
+	}
+	return NULL;
+}
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+struct socket *sockfd_lookup_rights(int fd, int *err,
+				    struct capsicum_rights *rights)
+{
+	struct file *file;
+	struct socket *sock;
+
+	file = fget_rights(fd, rights);
+	if (IS_ERR(file)) {
+		*err = PTR_ERR(file);
+		return NULL;
+	}
+
+	sock = sock_from_file(file, err);
+	if (!sock)
+		fput(file);
+	return sock;
+}
+EXPORT_SYMBOL(sockfd_lookup_rights);
+
+static struct socket *
+sockfd_lookup_light_rights(int fd, int *err, int *fput_needed,
+			   const struct capsicum_rights **actual_rights,
+			   const struct capsicum_rights *required_rights)
+{
+	struct fd f = fdget_raw_rights(fd, actual_rights, required_rights);
+	struct socket *sock;
+
+	*err = -EBADF;
+	if (!IS_ERR(f.file)) {
+		sock = sock_from_file(f.file, err);
+		if (likely(sock)) {
+			*fput_needed = f.flags;
+			return sock;
+		}
+		fdput(f);
+	} else {
+		*err = PTR_ERR(f.file);
+	}
+	return NULL;
+}
+
+struct socket *_sockfd_lookupr(int fd, int *err, ...)
+{
+	struct capsicum_rights rights;
+	struct socket *sock;
+	va_list ap;
+	va_start(ap, err);
+	sock = sockfd_lookup_rights(fd, err, cap_rights_vinit(&rights, ap));
+	va_end(ap);
+	return sock;
+}
+EXPORT_SYMBOL(_sockfd_lookupr);
+
+struct socket *_sockfd_lookupr_light(int fd, int *err, int *fput_needed, ...)
+{
+	struct capsicum_rights rights;
+	struct socket *sock;
+	va_list ap;
+	va_start(ap, fput_needed);
+	sock = sockfd_lookup_light_rights(fd, err, fput_needed,
+					  NULL, cap_rights_vinit(&rights, ap));
+	va_end(ap);
+	return sock;
+}
+#define sockfd_lookupr_light(fd, err, fpn, ...) \
+	_sockfd_lookupr_light((fd), (err), (fpn), __VA_ARGS__, 0ULL)
+
+#else
+
+static inline struct socket *
+sockfd_lookup_light_rights(int fd, int *err, int *fput_needed,
+			   const struct capsicum_rights **actual_rights,
+			   const struct capsicum_rights *required_rights)
+{
+	return sockfd_lookup_light(fd, err, fput_needed);
+}
+
+#define sockfd_lookupr_light(f, e, p, ...) \
+	sockfd_lookup_light((f), (e), (p))
+
+#endif
+
 /**
  *	sockfd_lookup - Go from a file number to its socket slot
  *	@fd: file handle
@@ -449,23 +550,6 @@ struct socket *sockfd_lookup(int fd, int *err)
 }
 EXPORT_SYMBOL(sockfd_lookup);
 
-static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
-{
-	struct fd f = fdget(fd);
-	struct socket *sock;
-
-	*err = -EBADF;
-	if (f.file) {
-		sock = sock_from_file(f.file, err);
-		if (likely(sock)) {
-			*fput_needed = f.flags;
-			return sock;
-		}
-		fdput(f);
-	}
-	return NULL;
-}
-
 #define XATTR_SOCKPROTONAME_SUFFIX "sockprotoname"
 #define XATTR_NAME_SOCKPROTONAME (XATTR_SYSTEM_PREFIX XATTR_SOCKPROTONAME_SUFFIX)
 #define XATTR_NAME_SOCKPROTONAME_LEN (sizeof(XATTR_NAME_SOCKPROTONAME)-1)
-- 
2.0.0.526.g5318336



* [PATCH 07/11] capsicum: convert callers to use sockfd_lookupr() etc
  2014-06-30 10:28 [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) David Drysdale
                   ` (5 preceding siblings ...)
  2014-06-30 10:28 ` [PATCH 06/11] capsicum: implement sockfd_lookupr() David Drysdale
@ 2014-06-30 10:28 ` David Drysdale
  2014-06-30 10:28 ` [PATCH 08/11] capsicum: add new LSM hooks on FD/file conversion David Drysdale
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Convert callers of sockfd_lookup() and related functions to use the
equivalent sockfd_lookupr() variants instead.

Annotate each such call with an indication of what operations will
be performed on the retrieved socket, to allow future policing
of rights associated with file descriptors.

Signed-off-by: David Drysdale <drysdale@google.com>
---
 drivers/block/nbd.c                |   3 +-
 drivers/scsi/iscsi_tcp.c           |   2 +-
 drivers/staging/usbip/stub_dev.c   |   2 +-
 drivers/staging/usbip/vhci_sysfs.c |   2 +-
 drivers/vhost/net.c                |   2 +-
 fs/ncpfs/inode.c                   |   5 +-
 net/bluetooth/bnep/sock.c          |   2 +-
 net/bluetooth/cmtp/sock.c          |   2 +-
 net/bluetooth/hidp/sock.c          |   4 +-
 net/compat.c                       |   4 +-
 net/l2tp/l2tp_core.c               |  11 ++--
 net/l2tp/l2tp_core.h               |   2 +
 net/sched/sch_atm.c                |   2 +-
 net/socket.c                       | 115 +++++++++++++++++++++++--------------
 net/sunrpc/svcsock.c               |   4 +-
 15 files changed, 98 insertions(+), 64 deletions(-)
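
As an illustration of what the annotations described above buy a caller,
here is a small userspace sketch.  It is not taken from the patch: the
table, the demo_* names and the DEMO_ENOTCAPABLE value are all
hypothetical stand-ins, and rights are again modelled as a simple bitmask.
The point is only that a converted caller states the rights it needs and
hands back whatever error the lookup reported, so a rights failure can be
told apart from a bad descriptor.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical rights and error code, for illustration only. */
#define DEMO_CAP_READ		(1ULL << 0)
#define DEMO_CAP_WRITE		(1ULL << 1)
#define DEMO_ENOTCAPABLE	1000

struct demo_sock {
	unsigned long long rights;	/* rights attached to this descriptor */
	int in_use;
};

/* A toy descriptor table: fd 0 may read and write, fd 1 may only read. */
static struct demo_sock demo_table[4] = {
	[0] = { .rights = DEMO_CAP_READ | DEMO_CAP_WRITE, .in_use = 1 },
	[1] = { .rights = DEMO_CAP_READ, .in_use = 1 },
};

/* Return the socket only if the fd is valid and holds every required
 * right; otherwise return NULL and report the reason through *err. */
static struct demo_sock *demo_lookup_rights(int fd, int *err,
					    unsigned long long required)
{
	int nfds = (int)(sizeof(demo_table) / sizeof(demo_table[0]));

	if (fd < 0 || fd >= nfds || !demo_table[fd].in_use) {
		*err = -EBADF;
		return NULL;
	}
	if ((demo_table[fd].rights & required) != required) {
		*err = -DEMO_ENOTCAPABLE;
		return NULL;
	}
	*err = 0;
	return &demo_table[fd];
}

/* A converted caller: it declares that it will read and write, and it
 * propagates the lookup's error code unchanged. */
static int demo_attach(int fd)
{
	int err;
	struct demo_sock *sock;

	sock = demo_lookup_rights(fd, &err, DEMO_CAP_READ | DEMO_CAP_WRITE);
	if (!sock)
		return err;
	return 0;
}
```

In this sketch demo_attach(0) succeeds, demo_attach(1) fails with the
rights error because the descriptor lacks write access, and
demo_attach(2) fails with -EBADF because the slot is unused.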

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index d6f55e3052fb..8439bbd1ad17 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -646,7 +646,8 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		int err;
 		if (nbd->sock)
 			return -EBUSY;
-		sock = sockfd_lookup(arg, &err);
+		sock = sockfd_lookupr(arg, &err,
+				      CAP_READ, CAP_WRITE, CAP_SHUTDOWN);
 		if (sock) {
 			nbd->sock = sock;
 			if (max_part > 0)
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 11854845393b..9354b333887c 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -652,7 +652,7 @@ iscsi_sw_tcp_conn_bind(struct iscsi_cls_session *cls_session,
 	int err;
 
 	/* lookup for existing socket */
-	sock = sockfd_lookup((int)transport_eph, &err);
+	sock = sockfd_lookupr((int)transport_eph, &err, CAP_SOCK_SERVER);
 	if (!sock) {
 		iscsi_conn_printk(KERN_ERR, conn,
 				  "sockfd_lookup failed %d\n", err);
diff --git a/drivers/staging/usbip/stub_dev.c b/drivers/staging/usbip/stub_dev.c
index de692d7011a5..3ac80c595343 100644
--- a/drivers/staging/usbip/stub_dev.c
+++ b/drivers/staging/usbip/stub_dev.c
@@ -108,7 +108,7 @@ static ssize_t store_sockfd(struct device *dev, struct device_attribute *attr,
 			goto err;
 		}
 
-		socket = sockfd_lookup(sockfd, &err);
+		socket = sockfd_lookupr(sockfd, &err, CAP_LIST_END);
 		if (!socket)
 			goto err;
 
diff --git a/drivers/staging/usbip/vhci_sysfs.c b/drivers/staging/usbip/vhci_sysfs.c
index 211f43f67ea2..efe9d7625433 100644
--- a/drivers/staging/usbip/vhci_sysfs.c
+++ b/drivers/staging/usbip/vhci_sysfs.c
@@ -195,7 +195,7 @@ static ssize_t store_attach(struct device *dev, struct device_attribute *attr,
 		return -EINVAL;
 
 	/* Extract socket from fd. */
-	socket = sockfd_lookup(sockfd, &err);
+	socket = sockfd_lookupr(sockfd, &err, CAP_LIST_END);
 	if (!socket)
 		return -EINVAL;
 
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 6fed594f12d3..f4db0caf817d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -838,7 +838,7 @@ static struct socket *get_raw_socket(int fd)
 		char  buf[MAX_ADDR_LEN];
 	} uaddr;
 	int uaddr_len = sizeof uaddr, r;
-	struct socket *sock = sockfd_lookup(fd, &r);
+	struct socket *sock = sockfd_lookupr(fd, &r, CAP_READ, CAP_WRITE);
 
 	if (!sock)
 		return ERR_PTR(-ENOTSOCK);
diff --git a/fs/ncpfs/inode.c b/fs/ncpfs/inode.c
index e31e589369a4..580024e60d20 100644
--- a/fs/ncpfs/inode.c
+++ b/fs/ncpfs/inode.c
@@ -539,7 +539,7 @@ static int ncp_fill_super(struct super_block *sb, void *raw_data, int silent)
 	if (!uid_valid(data.mounted_uid) || !uid_valid(data.uid) ||
 	    !gid_valid(data.gid))
 		goto out;
-	sock = sockfd_lookup(data.ncp_fd, &error);
+	sock = sockfd_lookupr(data.ncp_fd, &error, CAP_WRITE, CAP_FSTAT);
 	if (!sock)
 		goto out;
 
@@ -567,7 +567,8 @@ static int ncp_fill_super(struct super_block *sb, void *raw_data, int silent)
 	server->ncp_sock = sock;
 	
 	if (data.info_fd != -1) {
-		struct socket *info_sock = sockfd_lookup(data.info_fd, &error);
+		struct socket *info_sock = sockfd_lookupr(data.info_fd, &error,
+							  CAP_WRITE, CAP_FSTAT);
 		if (!info_sock)
 			goto out_bdi;
 		server->info_sock = info_sock;
diff --git a/net/bluetooth/bnep/sock.c b/net/bluetooth/bnep/sock.c
index 5f051290daba..1a69b6b05d2e 100644
--- a/net/bluetooth/bnep/sock.c
+++ b/net/bluetooth/bnep/sock.c
@@ -69,7 +69,7 @@ static int bnep_sock_ioctl(struct socket *sock, unsigned int cmd, unsigned long
 		if (copy_from_user(&ca, argp, sizeof(ca)))
 			return -EFAULT;
 
-		nsock = sockfd_lookup(ca.sock, &err);
+		nsock = sockfd_lookupr(ca.sock, &err, CAP_READ, CAP_WRITE);
 		if (!nsock)
 			return err;
 
diff --git a/net/bluetooth/cmtp/sock.c b/net/bluetooth/cmtp/sock.c
index d82787d417bd..4033b771e6ca 100644
--- a/net/bluetooth/cmtp/sock.c
+++ b/net/bluetooth/cmtp/sock.c
@@ -83,7 +83,7 @@ static int cmtp_sock_ioctl(struct socket *sock, unsigned int cmd, unsigned long
 		if (copy_from_user(&ca, argp, sizeof(ca)))
 			return -EFAULT;
 
-		nsock = sockfd_lookup(ca.sock, &err);
+		nsock = sockfd_lookupr(ca.sock, &err, CAP_READ, CAP_WRITE);
 		if (!nsock)
 			return err;
 
diff --git a/net/bluetooth/hidp/sock.c b/net/bluetooth/hidp/sock.c
index cb3fdde1968a..85afd39595f3 100644
--- a/net/bluetooth/hidp/sock.c
+++ b/net/bluetooth/hidp/sock.c
@@ -67,11 +67,11 @@ static int hidp_sock_ioctl(struct socket *sock, unsigned int cmd, unsigned long
 		if (copy_from_user(&ca, argp, sizeof(ca)))
 			return -EFAULT;
 
-		csock = sockfd_lookup(ca.ctrl_sock, &err);
+		csock = sockfd_lookupr(ca.ctrl_sock, &err, CAP_READ, CAP_WRITE);
 		if (!csock)
 			return err;
 
-		isock = sockfd_lookup(ca.intr_sock, &err);
+		isock = sockfd_lookupr(ca.intr_sock, &err, CAP_READ, CAP_WRITE);
 		if (!isock) {
 			sockfd_put(csock);
 			return err;
diff --git a/net/compat.c b/net/compat.c
index 9a76eaf63184..06655190173e 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -388,7 +388,7 @@ COMPAT_SYSCALL_DEFINE5(setsockopt, int, fd, int, level, int, optname,
 		       char __user *, optval, unsigned int, optlen)
 {
 	int err;
-	struct socket *sock = sockfd_lookup(fd, &err);
+	struct socket *sock = sockfd_lookupr(fd, &err, CAP_SETSOCKOPT);
 
 	if (sock) {
 		err = security_socket_setsockopt(sock, level, optname);
@@ -508,7 +508,7 @@ COMPAT_SYSCALL_DEFINE5(getsockopt, int, fd, int, level, int, optname,
 		       char __user *, optval, int __user *, optlen)
 {
 	int err;
-	struct socket *sock = sockfd_lookup(fd, &err);
+	struct socket *sock = sockfd_lookupr(fd, &err, CAP_GETSOCKOPT);
 
 	if (sock) {
 		err = security_socket_getsockopt(sock, level, optname);
diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index a4e37d7158dc..64e6df42cfda 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -175,7 +175,8 @@ l2tp_session_id_hash_2(struct l2tp_net *pn, u32 session_id)
  * owned by userspace.  A struct sock returned from this function must be
  * released using l2tp_tunnel_sock_put once you're done with it.
  */
-static struct sock *l2tp_tunnel_sock_lookup(struct l2tp_tunnel *tunnel)
+static struct sock *l2tp_tunnel_sock_lookup(struct l2tp_tunnel *tunnel,
+					    struct capsicum_rights *rights)
 {
 	int err = 0;
 	struct socket *sock = NULL;
@@ -189,7 +190,7 @@ static struct sock *l2tp_tunnel_sock_lookup(struct l2tp_tunnel *tunnel)
 		 * of closing it.  Look the socket up using the fd to ensure
 		 * consistency.
 		 */
-		sock = sockfd_lookup(tunnel->fd, &err);
+		sock = sockfd_lookup_rights(tunnel->fd, &err, rights);
 		if (sock)
 			sk = sock->sk;
 	} else {
@@ -1411,9 +1412,11 @@ static void l2tp_tunnel_del_work(struct work_struct *work)
 	struct l2tp_tunnel *tunnel = NULL;
 	struct socket *sock = NULL;
 	struct sock *sk = NULL;
+	struct capsicum_rights rights;
 
 	tunnel = container_of(work, struct l2tp_tunnel, del_work);
-	sk = l2tp_tunnel_sock_lookup(tunnel);
+	sk = l2tp_tunnel_sock_lookup(tunnel,
+				     cap_rights_init(&rights, CAP_SHUTDOWN));
 	if (!sk)
 		return;
 
@@ -1614,7 +1617,7 @@ int l2tp_tunnel_create(struct net *net, int fd, int version, u32 tunnel_id, u32
 		if (err < 0)
 			goto err;
 	} else {
-		sock = sockfd_lookup(fd, &err);
+		sock = sockfd_lookupr(fd, &err, CAP_READ, CAP_WRITE);
 		if (!sock) {
 			pr_err("tunl %u: sockfd_lookup(fd=%d) returned %d\n",
 			       tunnel_id, fd, err);
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index 3f93ccd6ba97..fd1e282d4e8a 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -11,6 +11,8 @@
 #ifndef _L2TP_CORE_H_
 #define _L2TP_CORE_H_
 
+#include <linux/capsicum.h>
+
 /* Just some random numbers */
 #define L2TP_TUNNEL_MAGIC	0x42114DDA
 #define L2TP_SESSION_MAGIC	0x0C04EB7D
diff --git a/net/sched/sch_atm.c b/net/sched/sch_atm.c
index 8449b337f9e3..8131efa6d164 100644
--- a/net/sched/sch_atm.c
+++ b/net/sched/sch_atm.c
@@ -238,7 +238,7 @@ static int atm_tc_change(struct Qdisc *sch, u32 classid, u32 parent,
 	}
 	pr_debug("atm_tc_change: type %d, payload %d, hdr_len %d\n",
 		 opt->nla_type, nla_len(opt), hdr_len);
-	sock = sockfd_lookup(fd, &error);
+	sock = sockfd_lookupr(fd, &error, CAP_GETSOCKNAME);
 	if (!sock)
 		return error;	/* f_count++ */
 	pr_debug("atm_tc_change: f_count %ld\n", file_count(sock->file));
diff --git a/net/socket.c b/net/socket.c
index f254e9bf9c4d..dbc00f0b992a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -419,23 +419,6 @@ struct socket *sock_from_file(struct file *file, int *err)
 }
 EXPORT_SYMBOL(sock_from_file);
 
-static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
-{
-	struct fd f = fdget(fd);
-	struct socket *sock;
-
-	*err = -EBADF;
-	if (f.file) {
-		sock = sock_from_file(f.file, err);
-		if (likely(sock)) {
-			*fput_needed = f.flags;
-			return sock;
-		}
-		fdput(f);
-	}
-	return NULL;
-}
-
 #ifdef CONFIG_SECURITY_CAPSICUM
 struct socket *sockfd_lookup_rights(int fd, int *err,
 				    struct capsicum_rights *rights)
@@ -506,6 +489,23 @@ struct socket *_sockfd_lookupr_light(int fd, int *err, int *fput_needed, ...)
 
 #else
 
+static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
+{
+	struct fd f = fdget(fd);
+	struct socket *sock;
+
+	*err = -EBADF;
+	if (f.file) {
+		sock = sock_from_file(f.file, err);
+		if (likely(sock)) {
+			*fput_needed = f.flags;
+			return sock;
+		}
+		fdput(f);
+	}
+	return NULL;
+}
+
 static inline struct socket *
 sockfd_lookup_light_rights(int fd, int *err, int *fput_needed,
 			   const struct capsicum_rights **actual_rights,
@@ -1608,7 +1608,7 @@ SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen)
 	struct sockaddr_storage address;
 	int err, fput_needed;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_BIND);
 	if (sock) {
 		err = move_addr_to_kernel(umyaddr, addrlen, &address);
 		if (err >= 0) {
@@ -1637,7 +1637,7 @@ SYSCALL_DEFINE2(listen, int, fd, int, backlog)
 	int err, fput_needed;
 	int somaxconn;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_LISTEN);
 	if (sock) {
 		somaxconn = sock_net(sock->sk)->core.sysctl_somaxconn;
 		if ((unsigned int)backlog > somaxconn)
@@ -1671,6 +1671,8 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
 	struct file *newfile;
 	int err, len, newfd, fput_needed;
 	struct sockaddr_storage address;
+	struct capsicum_rights rights;
+	const struct capsicum_rights *listen_rights = NULL;
 
 	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
 		return -EINVAL;
@@ -1678,7 +1680,9 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
 	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
 		flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	sock = sockfd_lookup_light_rights(fd, &err, &fput_needed,
+					  &listen_rights,
+					  cap_rights_init(&rights, CAP_ACCEPT));
 	if (!sock)
 		goto out;
 
@@ -1770,7 +1774,7 @@ SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr,
 	struct sockaddr_storage address;
 	int err, fput_needed;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_CONNECT);
 	if (!sock)
 		goto out;
 	err = move_addr_to_kernel(uservaddr, addrlen, &address);
@@ -1802,7 +1806,7 @@ SYSCALL_DEFINE3(getsockname, int, fd, struct sockaddr __user *, usockaddr,
 	struct sockaddr_storage address;
 	int len, err, fput_needed;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_GETSOCKNAME);
 	if (!sock)
 		goto out;
 
@@ -1833,7 +1837,7 @@ SYSCALL_DEFINE3(getpeername, int, fd, struct sockaddr __user *, usockaddr,
 	struct sockaddr_storage address;
 	int len, err, fput_needed;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_GETPEERNAME);
 	if (sock != NULL) {
 		err = security_socket_getpeername(sock);
 		if (err) {
@@ -1871,7 +1875,8 @@ SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len,
 
 	if (len > INT_MAX)
 		len = INT_MAX;
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	sock = sockfd_lookupr_light(fd, &err, &fput_needed,
+				    CAP_WRITE, addr ? CAP_CONNECT : 0ULL);
 	if (!sock)
 		goto out;
 
@@ -1930,7 +1935,7 @@ SYSCALL_DEFINE6(recvfrom, int, fd, void __user *, ubuf, size_t, size,
 
 	if (size > INT_MAX)
 		size = INT_MAX;
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_READ);
 	if (!sock)
 		goto out;
 
@@ -1984,7 +1989,7 @@ SYSCALL_DEFINE5(setsockopt, int, fd, int, level, int, optname,
 	if (optlen < 0)
 		return -EINVAL;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_SETSOCKOPT);
 	if (sock != NULL) {
 		err = security_socket_setsockopt(sock, level, optname);
 		if (err)
@@ -2015,7 +2020,10 @@ SYSCALL_DEFINE5(getsockopt, int, fd, int, level, int, optname,
 	int err, fput_needed;
 	struct socket *sock;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_GETSOCKOPT,
+				(level == SOL_SCTP &&
+				 optname == SCTP_SOCKOPT_PEELOFF)
+				? CAP_PEELOFF : 0ULL);
 	if (sock != NULL) {
 		err = security_socket_getsockopt(sock, level, optname);
 		if (err)
@@ -2044,7 +2052,7 @@ SYSCALL_DEFINE2(shutdown, int, fd, int, how)
 	int err, fput_needed;
 	struct socket *sock;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_SHUTDOWN);
 	if (sock != NULL) {
 		err = security_socket_shutdown(sock, how);
 		if (!err)
@@ -2080,10 +2088,12 @@ static int copy_msghdr_from_user(struct msghdr *kmsg,
 	return 0;
 }
 
-static int ___sys_sendmsg(struct socket *sock, struct msghdr __user *msg,
+static int ___sys_sendmsg(struct socket *sock_noaddr, struct socket *sock_addr,
+			 struct msghdr __user *msg,
 			 struct msghdr *msg_sys, unsigned int flags,
 			 struct used_address *used_address)
 {
+	struct socket *sock;
 	struct compat_msghdr __user *msg_compat =
 	    (struct compat_msghdr __user *)msg;
 	struct sockaddr_storage address;
@@ -2103,6 +2113,9 @@ static int ___sys_sendmsg(struct socket *sock, struct msghdr __user *msg,
 		if (err)
 			return err;
 	}
+	sock = (msg_sys->msg_name ? sock_addr : sock_noaddr);
+	if (!sock)
+		return -EBADF;
 
 	if (msg_sys->msg_iovlen > UIO_FASTIOV) {
 		err = -EMSGSIZE;
@@ -2202,15 +2215,22 @@ long __sys_sendmsg(int fd, struct msghdr __user *msg, unsigned flags)
 {
 	int fput_needed, err;
 	struct msghdr msg_sys;
-	struct socket *sock;
-
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
-	if (!sock)
+	struct socket *sock_addr;
+	struct socket *sock_noaddr;
+
+	sock_addr = sockfd_lookupr_light(fd, &err, &fput_needed,
+					 CAP_WRITE, CAP_CONNECT);
+	sock_noaddr = sock_addr;
+	if (!sock_noaddr)
+		sock_noaddr = sockfd_lookupr_light(fd, &err, &fput_needed,
+						   CAP_WRITE);
+	if (!sock_noaddr)
 		goto out;
 
-	err = ___sys_sendmsg(sock, msg, &msg_sys, flags, NULL);
+	err = ___sys_sendmsg(sock_noaddr, sock_addr, msg, &msg_sys, flags,
+			     NULL);
 
-	fput_light(sock->file, fput_needed);
+	fput_light(sock_noaddr->file, fput_needed);
 out:
 	return err;
 }
@@ -2230,7 +2250,8 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 		   unsigned int flags)
 {
 	int fput_needed, err, datagrams;
-	struct socket *sock;
+	struct socket *sock_addr;
+	struct socket *sock_noaddr;
 	struct mmsghdr __user *entry;
 	struct compat_mmsghdr __user *compat_entry;
 	struct msghdr msg_sys;
@@ -2241,8 +2262,13 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 
 	datagrams = 0;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
-	if (!sock)
+	sock_addr = sockfd_lookupr_light(fd, &err, &fput_needed,
+					 CAP_WRITE, CAP_CONNECT);
+	sock_noaddr = sock_addr;
+	if (!sock_noaddr)
+		sock_noaddr = sockfd_lookupr_light(fd, &err, &fput_needed,
+						   CAP_WRITE);
+	if (!sock_noaddr)
 		return err;
 
 	used_address.name_len = UINT_MAX;
@@ -2252,14 +2278,15 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 
 	while (datagrams < vlen) {
 		if (MSG_CMSG_COMPAT & flags) {
-			err = ___sys_sendmsg(sock, (struct msghdr __user *)compat_entry,
-					     &msg_sys, flags, &used_address);
+			err = ___sys_sendmsg(sock_noaddr, sock_addr,
+					(struct msghdr __user *)compat_entry,
+					&msg_sys, flags, &used_address);
 			if (err < 0)
 				break;
 			err = __put_user(err, &compat_entry->msg_len);
 			++compat_entry;
 		} else {
-			err = ___sys_sendmsg(sock,
+			err = ___sys_sendmsg(sock_noaddr, sock_addr,
 					     (struct msghdr __user *)entry,
 					     &msg_sys, flags, &used_address);
 			if (err < 0)
@@ -2273,7 +2300,7 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 		++datagrams;
 	}
 
-	fput_light(sock->file, fput_needed);
+	fput_light(sock_noaddr->file, fput_needed);
 
 	/* We only return an error if no datagrams were able to be sent */
 	if (datagrams != 0)
@@ -2392,7 +2419,7 @@ long __sys_recvmsg(int fd, struct msghdr __user *msg, unsigned flags)
 	struct msghdr msg_sys;
 	struct socket *sock;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_READ);
 	if (!sock)
 		goto out;
 
@@ -2432,7 +2459,7 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 
 	datagrams = 0;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_READ);
 	if (!sock)
 		return err;
 
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 43bcb4699d69..9568b63b8aef 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1400,7 +1400,7 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
 bool svc_alien_sock(struct net *net, int fd)
 {
 	int err;
-	struct socket *sock = sockfd_lookup(fd, &err);
+	struct socket *sock = sockfd_lookupr(fd, &err, CAP_LIST_END);
 	bool ret = false;
 
 	if (!sock)
@@ -1428,7 +1428,7 @@ int svc_addsock(struct svc_serv *serv, const int fd, char *name_return,
 		const size_t len)
 {
 	int err = 0;
-	struct socket *so = sockfd_lookup(fd, &err);
+	struct socket *so = sockfd_lookupr(fd, &err, CAP_LISTEN);
 	struct svc_sock *svsk = NULL;
 	struct sockaddr_storage addr;
 	struct sockaddr *sin = (struct sockaddr *)&addr;
-- 
2.0.0.526.g5318336


* [PATCH 08/11] capsicum: add new LSM hooks on FD/file conversion
  2014-06-30 10:28 [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) David Drysdale
                   ` (6 preceding siblings ...)
  2014-06-30 10:28 ` [PATCH 07/11] capsicum: convert callers to use sockfd_lookupr() etc David Drysdale
@ 2014-06-30 10:28 ` David Drysdale
  2014-06-30 10:28 ` [PATCH 09/11] capsicum: implementations of new LSM hooks David Drysdale
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Add the following new LSM hooks:
 - file_lookup: check an fd->struct file conversion operation,
   potentially failing the lookup or potentially altering the looked
   up file
 - file_install: check a file to be installed in the fd table, to
   potentially allow the LSM to replace it.
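
As a rough illustration of the contract these two hooks share (each takes a
file and returns either a file to use in its place or an error pointer), here
is a minimal userspace sketch.  It is not the kernel code: struct model_file,
struct model_ops, the model_* helpers and the use of a plain uint64_t for
rights are all hypothetical simplifications made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(). */
#define MODEL_MAX_ERRNO 4095

static inline void *model_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline int model_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

static inline long model_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Hypothetical, heavily simplified stand-in for struct file. */
struct model_file {
	int id;
};

/* Hypothetical stand-in for the two new security_operations members. */
struct model_ops {
	struct model_file *(*file_lookup)(struct model_file *orig,
					  const uint64_t *required_rights,
					  const uint64_t **actual_rights);
	struct model_file *(*file_install)(const uint64_t *base_rights,
					   struct model_file *file);
};

/* Pass-through behaviour, like the inline stubs used without an LSM. */
static struct model_file *passthrough_lookup(struct model_file *orig,
					     const uint64_t *required_rights,
					     const uint64_t **actual_rights)
{
	(void)required_rights;
	(void)actual_rights;
	return orig;
}

static struct model_file *passthrough_install(const uint64_t *base_rights,
					      struct model_file *file)
{
	(void)base_rights;
	return file;
}

/* A hook that fails every lookup, to exercise the error-pointer path. */
static struct model_file *deny_lookup(struct model_file *orig,
				      const uint64_t *required_rights,
				      const uint64_t **actual_rights)
{
	(void)orig;
	(void)required_rights;
	(void)actual_rights;
	return model_err_ptr(-EBADF);
}

/* Dispatchers in the style of security_file_lookup()/_install(). */
static struct model_file *
model_security_file_lookup(const struct model_ops *ops,
			   struct model_file *orig,
			   const uint64_t *required_rights,
			   const uint64_t **actual_rights)
{
	return ops->file_lookup(orig, required_rights, actual_rights);
}

static struct model_file *
model_security_file_install(const struct model_ops *ops,
			    const uint64_t *base_rights,
			    struct model_file *file)
{
	return ops->file_install(base_rights, file);
}
```

A caller would check the result with model_is_err() before using it, in the
same way that kernel callers of a pointer-returning hook would use IS_ERR().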

Signed-off-by: David Drysdale <drysdale@google.com>
---
 include/linux/security.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 security/security.c      | 13 +++++++++++++
 2 files changed, 61 insertions(+)

diff --git a/include/linux/security.h b/include/linux/security.h
index 6478ce3252c7..4d0c079187d4 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -53,6 +53,7 @@ struct msg_queue;
 struct xattr;
 struct xfrm_sec_ctx;
 struct mm_struct;
+struct capsicum_rights;
 
 /* Maximum number of letters for an LSM name string */
 #define SECURITY_NAME_MAX	10
@@ -656,6 +657,28 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  *	to receive an open file descriptor via socket IPC.
  *	@file contains the file structure being received.
  *	Return 0 if permission is granted.
+ * @file_lookup:
+ *	This hook allows security modules to intercept file descriptor lookups
+ *	to check whether a required set of rights are available for the file
+ *	descriptor. This allows the security model to fail the lookup, or to
+ *	substitute a new return value for fget().
+ *	@file is the file in the process's file table, which may be replaced by
+ *	another file as the return value from the hook.
+ *	@required_rights is the rights that the file descriptor should hold, or
+ *	may be NULL to indicate that no specific rights are needed.
+ *	@actual_rights is returned (if it is non-NULL) as a pointer to the
+ *	rights that the file descriptor has.  The caller does not own this
+ *	memory, and should only use it while maintaining a refcount to the
+ *	returned unwrapped file.
+ *	Return the unwrapped file, or an ERR_PTR() value on failure.
+ * @file_install:
+ *	This hook allows security modules to intercept newly created files that
+ *	are about to be installed in the file descriptor table, to potentially
+ *	substitute a different file for the newly opened file.
+ *	@base_rights is the rights associated with an existing file that the
+ *	new file is derived from; CAP_ALL for non-capabilities.
+ *	@file is the newly opened struct file.
+ *	Return the struct file to be used, or an ERR_PTR() value on failure.
  * @file_open
  *	Save open-time permission checking state for later use upon
  *	file_permission, and recheck access if anything has changed
@@ -1555,6 +1578,11 @@ struct security_operations {
 				    struct fown_struct *fown, int sig);
 	int (*file_receive) (struct file *file);
 	int (*file_open) (struct file *file, const struct cred *cred);
+	struct file * (*file_lookup)(struct file *orig,
+				const struct capsicum_rights *required_rights,
+				const struct capsicum_rights **actual_rights);
+	struct file * (*file_install)(const struct capsicum_rights *base_rights,
+				      struct file *file);
 
 	int (*task_create) (unsigned long clone_flags);
 	void (*task_free) (struct task_struct *task);
@@ -1829,6 +1857,11 @@ int security_file_send_sigiotask(struct task_struct *tsk,
 				 struct fown_struct *fown, int sig);
 int security_file_receive(struct file *file);
 int security_file_open(struct file *file, const struct cred *cred);
+struct file *security_file_lookup(struct file *orig,
+				  const struct capsicum_rights *required_rights,
+				  const struct capsicum_rights **actual_rights);
+struct file *security_file_install(const struct capsicum_rights *base_rights,
+				   struct file *file);
 int security_task_create(unsigned long clone_flags);
 void security_task_free(struct task_struct *task);
 int security_cred_alloc_blank(struct cred *cred, gfp_t gfp);
@@ -2324,6 +2357,21 @@ static inline int security_file_open(struct file *file,
 	return 0;
 }
 
+static inline struct file *
+security_file_lookup(struct file *orig,
+		     const struct capsicum_rights *required_rights,
+		     const struct capsicum_rights **actual_rights)
+{
+	return orig;
+}
+
+static inline struct file *
+security_file_install(const struct capsicum_rights *base_rights,
+		      struct file *file)
+{
+	return file;
+}
+
 static inline int security_task_create(unsigned long clone_flags)
 {
 	return 0;
diff --git a/security/security.c b/security/security.c
index 8b774f362a3d..5ab3e893b46c 100644
--- a/security/security.c
+++ b/security/security.c
@@ -802,6 +802,19 @@ int security_file_open(struct file *file, const struct cred *cred)
 	return fsnotify_perm(file, MAY_OPEN);
 }
 
+struct file *security_file_lookup(struct file *file,
+				  const struct capsicum_rights *required_rights,
+				  const struct capsicum_rights **actual_rights)
+{
+	return security_ops->file_lookup(file, required_rights, actual_rights);
+}
+
+struct file *security_file_install(const struct capsicum_rights *base_rights,
+				   struct file *file)
+{
+	return security_ops->file_install(base_rights, file);
+}
+
 int security_task_create(unsigned long clone_flags)
 {
 	return security_ops->task_create(clone_flags);
-- 
2.0.0.526.g5318336


* [PATCH 09/11] capsicum: implementations of new LSM hooks
  2014-06-30 10:28 [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) David Drysdale
                   ` (7 preceding siblings ...)
  2014-06-30 10:28 ` [PATCH 08/11] capsicum: add new LSM hooks on FD/file conversion David Drysdale
@ 2014-06-30 10:28 ` David Drysdale
  2014-06-30 16:05     ` Andy Lutomirski
  2014-06-30 10:28 ` [PATCH 10/11] capsicum: invocation " David Drysdale
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

If the LSM does not provide implementations of the .file_lookup and
.file_install LSM hooks, fall back to the Capsicum implementations.

The Capsicum implementation of file_lookup checks for a Capsicum
capability wrapper file and unwraps it if the appropriate rights
are available.

The Capsicum implementation of file_install checks whether the file
has restricted rights associated with it.  If it does, it is replaced
with a Capsicum capability wrapper file before installation into the
fdtable.
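
The wrap/unwrap behaviour described above can be sketched in userspace as
follows.  This is only a model under stated assumptions: struct model_file,
model_lookup(), model_install() and the MODEL_RIGHT_* bit values are invented
for illustration, rights are collapsed to a single uint64_t bitmask, failures
are reported via an int rather than ERR_PTR(), and reference counting of the
underlying file is omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_ENOTCAPABLE 135            /* value proposed in this patch */
#define MODEL_RIGHT_READ  (1ULL << 0)    /* hypothetical bit assignments */
#define MODEL_RIGHT_WRITE (1ULL << 1)
#define MODEL_RIGHTS_ALL  (~0ULL)

struct model_file {
	bool is_cap;                   /* models f_op == &capsicum_file_ops */
	uint64_t rights;               /* only meaningful for a wrapper */
	struct model_file *underlying; /* only meaningful for a wrapper */
};

/*
 * Lookup: a normal file passes straight through.  A wrapper is unwrapped
 * only if its rights contain every required right; otherwise the lookup
 * fails with -MODEL_ENOTCAPABLE.
 */
static struct model_file *model_lookup(struct model_file *file,
				       uint64_t required, int *err)
{
	*err = 0;
	if (!file->is_cap)
		return file;
	if ((file->rights & required) != required) {
		*err = -MODEL_ENOTCAPABLE;
		return NULL;
	}
	return file->underlying;
}

/*
 * Install: a file with unrestricted rights is installed as-is.  A file
 * with restricted rights is replaced by a freshly allocated wrapper that
 * records the rights and points at the real file.  Returns NULL if the
 * wrapper cannot be allocated.
 */
static struct model_file *model_install(uint64_t base_rights,
					struct model_file *file)
{
	struct model_file *capf;

	if (base_rights == MODEL_RIGHTS_ALL)
		return file;

	capf = calloc(1, sizeof(*capf));
	if (!capf)
		return NULL;
	capf->is_cap = true;
	capf->rights = base_rights;
	capf->underlying = file;
	return capf;
}
```

For example, a wrapper created with only MODEL_RIGHT_READ unwraps for a
lookup that requires read, but a lookup that requires read and write fails.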

Signed-off-by: David Drysdale <drysdale@google.com>
---
 include/linux/capsicum.h         |   7 ++
 include/uapi/asm-generic/errno.h |   3 +
 security/Makefile                |   2 +-
 security/capability.c            |  17 ++-
 security/capsicum.c              | 257 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 283 insertions(+), 3 deletions(-)
 create mode 100644 security/capsicum.c

diff --git a/include/linux/capsicum.h b/include/linux/capsicum.h
index 74f79756097a..a3e7540f15e7 100644
--- a/include/linux/capsicum.h
+++ b/include/linux/capsicum.h
@@ -13,6 +13,13 @@ struct capsicum_rights {
 	unsigned int *ioctls;
 };
 
+/* LSM hook fallback functions */
+struct file *capsicum_file_lookup(struct file *file,
+				  const struct capsicum_rights *required_rights,
+				  const struct capsicum_rights **actual_rights);
+struct file *capsicum_file_install(const struct capsicum_rights *base_rights,
+				   struct file *file);
+
 #define CAP_LIST_END	0ULL
 
 #ifdef CONFIG_SECURITY_CAPSICUM
diff --git a/include/uapi/asm-generic/errno.h b/include/uapi/asm-generic/errno.h
index 1e1ea6e6e7a5..550570ed7b9f 100644
--- a/include/uapi/asm-generic/errno.h
+++ b/include/uapi/asm-generic/errno.h
@@ -110,4 +110,7 @@
 
 #define EHWPOISON	133	/* Memory page has hardware error */
 
+#define ECAPMODE        134     /* Not permitted in capability mode */
+#define ENOTCAPABLE     135     /* Capability FD rights insufficient */
+
 #endif
diff --git a/security/Makefile b/security/Makefile
index c5e1363ae136..e46d014a74b3 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -14,7 +14,7 @@ obj-y					+= commoncap.o
 obj-$(CONFIG_MMU)			+= min_addr.o
 
 # Object file lists
-obj-$(CONFIG_SECURITY)			+= security.o capability.o capsicum-rights.o
+obj-$(CONFIG_SECURITY)			+= security.o capability.o capsicum.o capsicum-rights.o
 obj-$(CONFIG_SECURITYFS)		+= inode.o
 obj-$(CONFIG_SECURITY_SELINUX)		+= selinux/
 obj-$(CONFIG_SECURITY_SMACK)		+= smack/
diff --git a/security/capability.c b/security/capability.c
index ad0d4de69944..11d5a1bd6e57 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -11,6 +11,7 @@
  */
 
 #include <linux/security.h>
+#include <linux/capsicum.h>
 
 static int cap_syslog(int type)
 {
@@ -917,9 +918,19 @@ static void cap_audit_rule_free(void *lsmrule)
 #define set_to_cap_if_null(ops, function)				\
 	do {								\
 		if (!ops->function) {					\
-			ops->function = cap_##function;			\
+			ops->function = cap_##function;		\
 			pr_debug("Had to override the " #function	\
-				 " security operation with the default.\n");\
+				 " security operation with the default "\
+				 "cap_" #function ".\n");		\
+			}						\
+	} while (0)
+#define set_to_capsicum_if_null(ops, function)				\
+	do {								\
+		if (!ops->function) {					\
+			ops->function = capsicum_##function;		\
+			pr_debug("Had to override the " #function	\
+				 " security operation with the default "\
+				 "capsicum_" #function ".\n");		\
 			}						\
 	} while (0)
 
@@ -1007,6 +1018,8 @@ void __init security_fixup_ops(struct security_operations *ops)
 	set_to_cap_if_null(ops, file_send_sigiotask);
 	set_to_cap_if_null(ops, file_receive);
 	set_to_cap_if_null(ops, file_open);
+	set_to_capsicum_if_null(ops, file_lookup);
+	set_to_capsicum_if_null(ops, file_install);
 	set_to_cap_if_null(ops, task_create);
 	set_to_cap_if_null(ops, task_free);
 	set_to_cap_if_null(ops, cred_alloc_blank);
diff --git a/security/capsicum.c b/security/capsicum.c
new file mode 100644
index 000000000000..83677eef3fb6
--- /dev/null
+++ b/security/capsicum.c
@@ -0,0 +1,257 @@
+/*
+ * Main implementation of Capsicum, a capability framework for UNIX.
+ *
+ * Copyright (C) 2012-2013 The Chromium OS Authors
+ *                         <chromium-os-dev@chromium.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2, as
+ * published by the Free Software Foundation.
+ *
+ * See Documentation/security/capsicum.txt for information on Capsicum.
+ */
+
+#include <linux/anon_inodes.h>
+#include <linux/fs.h>
+#include <linux/fdtable.h>
+#include <linux/file.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+#include <linux/security.h>
+#include <linux/syscalls.h>
+#include <linux/capsicum.h>
+
+#include "capsicum-rights.h"
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+/*
+ * Capsicum capability structure, holding the associated rights and underlying
+ * real file.  Capabilities are not stacked, i.e. underlying always points to a
+ * normal file not another Capsicum capability. Accessed via file->private_data.
+ */
+struct capsicum_capability {
+	struct capsicum_rights rights;
+	struct file *underlying;
+};
+
+static void capsicum_panic_not_unwrapped(void);
+static int capsicum_release(struct inode *i, struct file *capf);
+static int capsicum_show_fdinfo(struct seq_file *m, struct file *capf);
+
+#define panic_ptr ((void *)&capsicum_panic_not_unwrapped)
+static const struct file_operations capsicum_file_ops = {
+	.owner = NULL,
+	.llseek = panic_ptr,
+	.read = panic_ptr,
+	.write = panic_ptr,
+	.aio_read = panic_ptr,
+	.aio_write = panic_ptr,
+	.iterate = panic_ptr,
+	.poll = panic_ptr,
+	.unlocked_ioctl = panic_ptr,
+	.compat_ioctl = panic_ptr,
+	.mmap = panic_ptr,
+	.open = panic_ptr,
+	.flush = NULL,  /* This is called on close if implemented. */
+	.release = capsicum_release,  /* This is the only one we want. */
+	.fsync = panic_ptr,
+	.aio_fsync = panic_ptr,
+	.fasync = panic_ptr,
+	.lock = panic_ptr,
+	.sendpage = panic_ptr,
+	.get_unmapped_area = panic_ptr,
+	.check_flags = panic_ptr,
+	.flock = panic_ptr,
+	.splice_write = panic_ptr,
+	.splice_read = panic_ptr,
+	.setlease = panic_ptr,
+	.fallocate = panic_ptr,
+	.show_fdinfo = capsicum_show_fdinfo
+};
+
+static inline bool capsicum_is_cap(const struct file *file)
+{
+	return file->f_op == &capsicum_file_ops;
+}
+
+static struct capsicum_rights all_rights = {
+	.primary = {.cr_rights = {CAP_ALL0, CAP_ALL1} },
+	.fcntls = CAP_FCNTL_ALL,
+	.nioctls = -1,
+	.ioctls = NULL
+};
+
+static struct file *capsicum_cap_alloc(const struct capsicum_rights *rights,
+				       bool take_ioctls)
+{
+	int err;
+	struct file *capf;
+	/* memory to be freed on error exit: */
+	struct capsicum_capability *cap = NULL;
+	unsigned int *ioctls = (take_ioctls ? rights->ioctls : NULL);
+
+	BUG_ON((rights->nioctls > 0) != (rights->ioctls != NULL));
+
+	cap = kmalloc(sizeof(*cap), GFP_KERNEL);
+	if (!cap) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+	cap->underlying = NULL;
+	cap->rights = *rights;
+	if (!take_ioctls && rights->nioctls > 0) {
+		cap->rights.ioctls = kmemdup(rights->ioctls,
+					rights->nioctls * sizeof(unsigned int),
+					GFP_KERNEL);
+		if (!cap->rights.ioctls) {
+			err = -ENOMEM;
+			goto out_err;
+		}
+		ioctls = cap->rights.ioctls;
+	}
+
+	capf = anon_inode_getfile("[capability]", &capsicum_file_ops, cap, 0);
+	if (IS_ERR(capf)) {
+		err = PTR_ERR(capf);
+		goto out_err;
+	}
+	return capf;
+
+out_err:
+	kfree(ioctls);
+	kfree(cap);
+	return ERR_PTR(err);
+}
+
+/*
+ * File operations functions.
+ */
+
+/*
+ * When we release a Capsicum capability, release our reference to the
+ * underlying (wrapped) file as well.
+ */
+static int capsicum_release(struct inode *i, struct file *capf)
+{
+	struct capsicum_capability *cap;
+
+	if (!capsicum_is_cap(capf))
+		return -EINVAL;
+
+	cap = capf->private_data;
+	BUG_ON(!cap);
+	if (cap->underlying)
+		fput(cap->underlying);
+	cap->underlying = NULL;
+	kfree(cap->rights.ioctls);
+	kfree(cap);
+	return 0;
+}
+
+static int capsicum_show_fdinfo(struct seq_file *m, struct file *capf)
+{
+	int i;
+	struct capsicum_capability *cap;
+
+	if (!capsicum_is_cap(capf))
+		return -EINVAL;
+
+	cap = capf->private_data;
+	BUG_ON(!cap);
+	seq_puts(m, "rights:");
+	for (i = 0; i < (CAP_RIGHTS_VERSION + 2); i++)
+		seq_printf(m, "\t%#016llx", cap->rights.primary.cr_rights[i]);
+	seq_puts(m, "\n");
+	seq_printf(m, " fcntls: %#08x\n", cap->rights.fcntls);
+	if (cap->rights.nioctls > 0) {
+		seq_puts(m, " ioctls:");
+		for (i = 0; i < cap->rights.nioctls; i++)
+			seq_printf(m, "\t%#08x", cap->rights.ioctls[i]);
+		seq_puts(m, "\n");
+	}
+	return 0;
+}
+
+static void capsicum_panic_not_unwrapped(void)
+{
+	/*
+	 * General Capsicum file operations should never be called, because the
+	 * relevant file should always be unwrapped and the underlying real file
+	 * used instead.
+	 */
+	panic("Called a file_operations member on a Capsicum wrapper");
+}
+
+/*
+ * LSM hook fallback functions.
+ */
+
+/*
+ * We are looking up a file by its file descriptor. If it is a Capsicum
+ * capability, and has the required rights, we unwrap it and return the
+ * underlying file.
+ */
+struct file *capsicum_file_lookup(struct file *file,
+				  const struct capsicum_rights *required_rights,
+				  const struct capsicum_rights **actual_rights)
+{
+	struct capsicum_capability *cap;
+
+	/* See if the file in question is a Capsicum capability. */
+	if (!capsicum_is_cap(file)) {
+		if (actual_rights)
+			*actual_rights = &all_rights;
+		return file;
+	}
+	cap = file->private_data;
+	if (required_rights &&
+	    !cap_rights_contains(&cap->rights, required_rights)) {
+		return ERR_PTR(-ENOTCAPABLE);
+	}
+	if (actual_rights)
+		*actual_rights = &cap->rights;
+	return cap->underlying;
+}
+EXPORT_SYMBOL(capsicum_file_lookup);
+
+struct file *capsicum_file_install(const struct capsicum_rights *base_rights,
+				   struct file *file)
+{
+	struct file *capf;
+	struct capsicum_capability *cap;
+	if (!base_rights || cap_rights_is_all(base_rights))
+		return file;
+
+	capf = capsicum_cap_alloc(base_rights, false);
+	if (IS_ERR(capf))
+		return capf;
+
+	if (!atomic_long_inc_not_zero(&file->f_count)) {
+		fput(capf);
+		return ERR_PTR(-EBADF);
+	}
+	cap = capf->private_data;
+	cap->underlying = file;
+	return capf;
+}
+EXPORT_SYMBOL(capsicum_file_install);
+
+#else
+
+struct file *capsicum_file_lookup(struct file *file,
+				  const struct capsicum_rights *required_rights,
+				  const struct capsicum_rights **actual_rights)
+{
+	return file;
+}
+
+struct file *
+capsicum_file_install(const struct capsicum_rights *base_rights,
+		      struct file *file)
+{
+	return file;
+}
+
+#endif
-- 
2.0.0.526.g5318336



* [PATCH 10/11] capsicum: invocation of new LSM hooks
  2014-06-30 10:28 [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) David Drysdale
                   ` (8 preceding siblings ...)
  2014-06-30 10:28 ` [PATCH 09/11] capsicum: implementations of new LSM hooks David Drysdale
@ 2014-06-30 10:28 ` David Drysdale
  2014-06-30 10:28 ` [PATCH 11/11] capsicum: add syscalls to limit FD rights David Drysdale
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Places that call fcheck() to convert a file descriptor into a
struct file need to call the new .file_lookup LSM hook.  The
most important instances of this are in the fget() function,
but there are a few other direct users of fcheck().

If a new file descriptor is created from an existing file
descriptor, then any rights associated with the original FD
need to be propagated to the new FD.  The .file_install LSM
hook takes care of this, by potentially changing the struct
file that is about to be installed into the FD table.  This
affects accept(2) and openat(2); for the latter, the rights
associated with the dfd need to be propagated through the
code in fs/namei.c to allow this.

The path walking code in fs/namei.c is also modified to enable
the O_BENEATH_ONLY flag if the process is in capability mode,
or if the dfd is a Capsicum capability.

Signed-off-by: David Drysdale <drysdale@google.com>
---
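Note for reviewers (below the '---', so not part of the commit): the
lookup/install pairing described above can be sketched as a small
userspace model.  Everything here is hypothetical and for illustration
only: the model_* names, the flat 64-bit rights mask (the real
struct capsicum_rights is a compound structure) and the
MODEL_ENOTCAPABLE value are assumptions, and reference counting is
left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; these are not the kernel's types. */
#define MODEL_ENOTCAPABLE 1	/* stand-in error code for this sketch only */

#define MODEL_CAP_READ   (UINT64_C(1) << 0)
#define MODEL_CAP_WRITE  (UINT64_C(1) << 1)
#define MODEL_CAP_LOOKUP (UINT64_C(1) << 2)
#define MODEL_CAP_ALL    UINT64_MAX

struct model_file {
	int is_cap;			/* nonzero: capability wrapper */
	uint64_t rights;		/* only meaningful for wrappers */
	struct model_file *underlying;	/* only meaningful for wrappers */
};

/*
 * Model of the .file_lookup step: a normal file is returned as-is; a
 * wrapper is unwrapped only if it holds every required right.
 */
static struct model_file *model_file_lookup(struct model_file *file,
					     uint64_t required, int *err)
{
	assert(file != NULL && err != NULL);
	*err = 0;
	if (!file->is_cap)
		return file;
	if ((file->rights & required) != required) {
		*err = MODEL_ENOTCAPABLE;
		return NULL;
	}
	return file->underlying;
}

/*
 * Model of the .file_install step: if the FD that the new file was
 * derived from was restricted, fill in 'wrapper' around the new file
 * with the same rights and return it; otherwise return the new file
 * unchanged.
 */
static struct model_file *model_file_install(uint64_t base_rights,
					      struct model_file *newfile,
					      struct model_file *wrapper)
{
	if (base_rights == MODEL_CAP_ALL)
		return newfile;
	wrapper->is_cap = 1;
	wrapper->rights = base_rights;
	wrapper->underlying = newfile;
	return wrapper;
}
```

In this model, a file opened relative to a directory capability holding
only READ|LOOKUP comes back wrapped with READ|LOOKUP, and a later lookup
on it that requires WRITE fails.
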
 arch/powerpc/platforms/cell/spufs/coredump.c |   2 +
 fs/file.c                                    |   2 +-
 fs/locks.c                                   |   2 +
 fs/namei.c                                   | 217 ++++++++++++++++++++-------
 fs/notify/dnotify/dnotify.c                  |   2 +
 fs/proc/fd.c                                 |  16 +-
 net/socket.c                                 |  10 +-
 7 files changed, 192 insertions(+), 59 deletions(-)
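
For openat(2), the rights demanded of the dfd depend on the open flags.
The mapping that this patch adds as openat_primary_rights() can be
sketched in userspace roughly as follows.  This is a simplified model
under stated assumptions: the MODEL_O_* and MODEL_CAP_* constants are
hypothetical stand-ins with arbitrary values, the rights are a flat
bitmask, and the O_DSYNC/FASYNC case from the patch is omitted.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the open(2) flags; values are arbitrary. */
#define MODEL_O_RDONLY  0x00u
#define MODEL_O_WRONLY  0x01u
#define MODEL_O_RDWR    0x02u
#define MODEL_O_ACCMODE 0x03u
#define MODEL_O_CREAT   0x10u
#define MODEL_O_TRUNC   0x20u
#define MODEL_O_APPEND  0x40u

/* Hypothetical flat rights bits standing in for the CAP_* rights. */
#define MODEL_CAP_LOOKUP    (UINT64_C(1) << 0)
#define MODEL_CAP_READ      (UINT64_C(1) << 1)
#define MODEL_CAP_WRITE     (UINT64_C(1) << 2)
#define MODEL_CAP_SEEK      (UINT64_C(1) << 3)
#define MODEL_CAP_CREATE    (UINT64_C(1) << 4)
#define MODEL_CAP_FTRUNCATE (UINT64_C(1) << 5)

/* Rights the directory FD must hold for an open with these flags. */
static uint64_t model_openat_rights(unsigned int flags)
{
	/* path_openat() in the patch starts from CAP_LOOKUP */
	uint64_t rights = MODEL_CAP_LOOKUP;
	unsigned int acc = flags & MODEL_O_ACCMODE;

	if (acc == MODEL_O_RDONLY || acc == MODEL_O_RDWR)
		rights |= MODEL_CAP_READ;
	if (acc == MODEL_O_WRONLY || acc == MODEL_O_RDWR) {
		rights |= MODEL_CAP_WRITE;
		/* seeking is only needed when not appending/truncating */
		if (!(flags & (MODEL_O_APPEND | MODEL_O_TRUNC)))
			rights |= MODEL_CAP_SEEK;
	}
	if (flags & MODEL_O_CREAT)
		rights |= MODEL_CAP_CREATE;
	if (flags & MODEL_O_TRUNC)
		rights |= MODEL_CAP_FTRUNCATE;
	return rights;
}
```

So, in this model, a read-only open needs only LOOKUP and READ on the
dfd, while a write-only open with the append flag needs LOOKUP and WRITE
but not SEEK.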

diff --git a/arch/powerpc/platforms/cell/spufs/coredump.c b/arch/powerpc/platforms/cell/spufs/coredump.c
index be6212ddbf06..589fad12c715 100644
--- a/arch/powerpc/platforms/cell/spufs/coredump.c
+++ b/arch/powerpc/platforms/cell/spufs/coredump.c
@@ -29,6 +29,7 @@
 #include <linux/syscalls.h>
 #include <linux/coredump.h>
 #include <linux/binfmts.h>
+#include <linux/security.h>
 
 #include <asm/uaccess.h>
 
@@ -101,6 +102,7 @@ static struct spu_context *coredump_next_context(int *fd)
 		return NULL;
 	*fd = n - 1;
 	file = fcheck(*fd);
+	file = security_file_lookup(file, NULL, NULL);
 	return SPUFS_I(file_inode(file))->i_ctx;
 }
 
diff --git a/fs/file.c b/fs/file.c
index 562cc82ba442..5a784234fd3a 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -742,7 +742,7 @@ static struct file *unwrap_file(struct file *orig,
 		return ERR_PTR(-EBADF);
 	if (IS_ERR(orig))
 		return orig;
-	f = orig;  /* TODO: pass to an LSM hook here */
+	f = security_file_lookup(orig, required_rights, actual_rights);
 	if (f != orig && update_refcnt) {
 		/* We're not returning the original, and the calling code
 		 * has already incremented the refcount on it, we need to
diff --git a/fs/locks.c b/fs/locks.c
index 375fac3392b9..fd95ced5ced1 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2121,6 +2121,7 @@ again:
 	 */
 	spin_lock(&current->files->file_lock);
 	f = fcheck(fd);
+	f = security_file_lookup(f, NULL, NULL);
 	spin_unlock(&current->files->file_lock);
 	if (!error && f != filp && flock.l_type != F_UNLCK) {
 		flock.l_type = F_UNLCK;
@@ -2255,6 +2256,7 @@ again:
 	 */
 	spin_lock(&current->files->file_lock);
 	f = fcheck(fd);
+	f = security_file_lookup(f, NULL, NULL);
 	spin_unlock(&current->files->file_lock);
 	if (!error && f != filp && flock.l_type != F_UNLCK) {
 		flock.l_type = F_UNLCK;
diff --git a/fs/namei.c b/fs/namei.c
index c93f7993960e..001baf46b7a5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -34,6 +34,7 @@
 #include <linux/device_cgroup.h>
 #include <linux/fs_struct.h>
 #include <linux/posix_acl.h>
+#include <linux/capsicum.h>
 #include <asm/uaccess.h>
 
 #include "internal.h"
@@ -1750,7 +1751,7 @@ static int link_path_walk(const char *name, struct nameidata *nd,
 {
 	struct path next;
 	int err;
-	
+
 	while (*name == '/') {
 		if (flags & LOOKUP_BENEATH_ONLY) {
 			err = -EACCES;
@@ -1836,15 +1837,18 @@ exit:
 	return err;
 }
 
-static int path_init(int dfd, const char *name, unsigned int flags,
-		     struct nameidata *nd, struct file **fp)
+static int path_init(int dfd, const char *name, unsigned int *flags,
+		struct nameidata *nd, struct file **fp,
+		const struct capsicum_rights **dfd_rights,
+		const struct capsicum_rights *rights)
 {
 	int retval = 0;
 
 	nd->last_type = LAST_ROOT; /* if there are only slashes... */
-	nd->flags = flags | LOOKUP_JUMPED;
+	nd->flags = (*flags) | LOOKUP_PARENT | LOOKUP_JUMPED;
 	nd->depth = 0;
-	if (flags & LOOKUP_ROOT) {
+
+	if ((*flags) & LOOKUP_ROOT) {
 		struct dentry *root = nd->root.dentry;
 		struct inode *inode = root->d_inode;
 		if (*name) {
@@ -1856,7 +1860,7 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 		}
 		nd->path = nd->root;
 		nd->inode = inode;
-		if (flags & LOOKUP_RCU) {
+		if ((*flags) & LOOKUP_RCU) {
 			rcu_read_lock();
 			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
 			nd->m_seq = read_seqbegin(&mount_lock);
@@ -1870,9 +1874,11 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 
 	nd->m_seq = read_seqbegin(&mount_lock);
 	if (*name=='/') {
-		if (flags & LOOKUP_BENEATH_ONLY)
+		if ((*flags) & LOOKUP_BENEATH_ONLY)
 			return -EACCES;
-		if (flags & LOOKUP_RCU) {
+		if (dfd_rights)
+			*dfd_rights = NULL;
+		if ((*flags) & LOOKUP_RCU) {
 			rcu_read_lock();
 			set_root_rcu(nd);
 		} else {
@@ -1881,7 +1887,9 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 		}
 		nd->path = nd->root;
 	} else if (dfd == AT_FDCWD) {
-		if (flags & LOOKUP_RCU) {
+		if (dfd_rights)
+			*dfd_rights = NULL;
+		if ((*flags) & LOOKUP_RCU) {
 			struct fs_struct *fs = current->fs;
 			unsigned seq;
 
@@ -1897,11 +1905,13 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 		}
 	} else {
 		/* Caller must check execute permissions on the starting path component */
-		struct fd f = fdget_raw(dfd);
+		struct fd f = fdget_raw_rights(dfd, dfd_rights, rights);
 		struct dentry *dentry;
 
-		if (!f.file)
-			return -EBADF;
+		if (IS_ERR(f.file))
+			return PTR_ERR(f.file);
+		if (!cap_rights_is_all(*dfd_rights))
+			*flags |= LOOKUP_BENEATH_ONLY;
 
 		dentry = f.file->f_path.dentry;
 
@@ -1913,7 +1923,7 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 		}
 
 		nd->path = f.file->f_path;
-		if (flags & LOOKUP_RCU) {
+		if ((*flags) & LOOKUP_RCU) {
 			if (f.flags & FDPUT_FPUT)
 				*fp = f.file;
 			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
@@ -1938,9 +1948,12 @@ static inline int lookup_last(struct nameidata *nd, struct path *path)
 }
 
 /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
-static int path_lookupat(int dfd, const char *name,
-				unsigned int flags, struct nameidata *nd)
+static int path_lookupat(int dfd,
+			 const char *name, unsigned int flags,
+			 struct nameidata *nd,
+			 const struct capsicum_rights *rights)
 {
+	const struct capsicum_rights *dfd_rights;
 	struct file *base = NULL;
 	struct path path;
 	int err;
@@ -1959,7 +1972,7 @@ static int path_lookupat(int dfd, const char *name,
 	 * be handled by restarting a traditional ref-walk (which will always
 	 * be able to complete).
 	 */
-	err = path_init(dfd, name, flags | LOOKUP_PARENT, nd, &base);
+	err = path_init(dfd, name, &flags, nd, &base, &dfd_rights, rights);
 
 	if (unlikely(err))
 		return err;
@@ -2004,27 +2017,32 @@ static int path_lookupat(int dfd, const char *name,
 	return err;
 }
 
-static int filename_lookup(int dfd, struct filename *name,
-				unsigned int flags, struct nameidata *nd)
+static int filename_lookup(int dfd,
+			struct filename *name, unsigned int flags,
+			struct nameidata *nd,
+			const struct capsicum_rights *rights)
 {
-	int retval = path_lookupat(dfd, name->name, flags | LOOKUP_RCU, nd);
+	int retval = path_lookupat(dfd, name->name, flags | LOOKUP_RCU, nd,
+				   rights);
 	if (unlikely(retval == -ECHILD))
-		retval = path_lookupat(dfd, name->name, flags, nd);
+		retval = path_lookupat(dfd, name->name, flags, nd, rights);
 	if (unlikely(retval == -ESTALE))
-		retval = path_lookupat(dfd, name->name,
-						flags | LOOKUP_REVAL, nd);
+		retval = path_lookupat(dfd, name->name, flags | LOOKUP_REVAL,
+				       nd, rights);
 
 	if (likely(!retval))
 		audit_inode(name, nd->path.dentry, flags & LOOKUP_PARENT);
 	return retval;
 }
 
-static int do_path_lookup(int dfd, const char *name,
-				unsigned int flags, struct nameidata *nd)
+static int do_path_lookup(int dfd,
+			  const char *name, unsigned int flags,
+			  struct nameidata *nd,
+			  const struct capsicum_rights *rights)
 {
 	struct filename filename = { .name = name };
 
-	return filename_lookup(dfd, &filename, flags, nd);
+	return filename_lookup(dfd, &filename, flags, nd, rights);
 }
 
 /* does lookup, returns the object with parent locked */
@@ -2032,7 +2050,8 @@ struct dentry *kern_path_locked(const char *name, struct path *path)
 {
 	struct nameidata nd;
 	struct dentry *d;
-	int err = do_path_lookup(AT_FDCWD, name, LOOKUP_PARENT, &nd);
+	int err;
+	err = do_path_lookup(AT_FDCWD, name, LOOKUP_PARENT, &nd, NULL);
 	if (err)
 		return ERR_PTR(err);
 	if (nd.last_type != LAST_NORM) {
@@ -2053,7 +2072,8 @@ struct dentry *kern_path_locked(const char *name, struct path *path)
 int kern_path(const char *name, unsigned int flags, struct path *path)
 {
 	struct nameidata nd;
-	int res = do_path_lookup(AT_FDCWD, name, flags, &nd);
+	int res;
+	res = do_path_lookup(AT_FDCWD, name, flags, &nd, NULL);
 	if (!res)
 		*path = nd.path;
 	return res;
@@ -2078,7 +2098,7 @@ int vfs_path_lookup(struct dentry *dentry, struct vfsmount *mnt,
 	nd.root.mnt = mnt;
 	BUG_ON(flags & LOOKUP_PARENT);
 	/* the first argument of do_path_lookup() is ignored with LOOKUP_ROOT */
-	err = do_path_lookup(AT_FDCWD, name, flags | LOOKUP_ROOT, &nd);
+	err = do_path_lookup(AT_FDCWD, name, flags | LOOKUP_ROOT, &nd, NULL);
 	if (!err)
 		*path = nd.path;
 	return err;
@@ -2161,8 +2181,7 @@ static int user_path_at_empty_rights(int dfd,
 	if (!IS_ERR(tmp)) {
 
 		BUG_ON(flags & LOOKUP_PARENT);
-
-		err = filename_lookup(dfd, tmp, flags, &nd);
+		err = filename_lookup(dfd, tmp, flags, &nd, rights);
 		putname(tmp);
 		if (!err)
 			*path = nd.path;
@@ -2211,7 +2230,7 @@ int _user_path_atr(int dfd,
  */
 static struct filename *
 user_path_parent(int dfd, const char __user *path, struct nameidata *nd,
-		 unsigned int flags)
+		 unsigned int flags, const struct capsicum_rights *rights)
 {
 	struct filename *s = getname(path);
 	int error;
@@ -2222,7 +2241,7 @@ user_path_parent(int dfd, const char __user *path, struct nameidata *nd,
 	if (IS_ERR(s))
 		return s;
 
-	error = filename_lookup(dfd, s, flags | LOOKUP_PARENT, nd);
+	error = filename_lookup(dfd, s, flags | LOOKUP_PARENT, nd, rights);
 	if (error) {
 		putname(s);
 		return ERR_PTR(error);
@@ -2338,9 +2357,11 @@ path_mountpoint(int dfd, const char *name, struct path *path, unsigned int flags
 {
 	struct file *base = NULL;
 	struct nameidata nd;
+	const struct capsicum_rights *dfd_rights;
 	int err;
 
-	err = path_init(dfd, name, flags | LOOKUP_PARENT, &nd, &base);
+	err = path_init(dfd, name, &flags, &nd, &base,
+			&dfd_rights, &lookup_rights);
 	if (unlikely(err))
 		return err;
 
@@ -3165,8 +3186,9 @@ static int do_tmpfile(int dfd, struct filename *pathname,
 	static const struct qstr name = QSTR_INIT("/", 1);
 	struct dentry *dentry, *child;
 	struct inode *dir;
-	int error = path_lookupat(dfd, pathname->name,
-				  flags | LOOKUP_DIRECTORY, nd);
+	int error;
+	error = path_lookupat(dfd, pathname->name, flags | LOOKUP_DIRECTORY, nd,
+			      &lookup_rights);
 	if (unlikely(error))
 		return error;
 	error = mnt_want_write(nd->path.mnt);
@@ -3218,15 +3240,42 @@ out:
 	return error;
 }
 
+static void openat_primary_rights(struct capsicum_rights *rights,
+				  unsigned int flags)
+{
+	switch (flags & O_ACCMODE) {
+	case O_RDONLY:
+		cap_rights_set(rights, CAP_READ);
+		break;
+	case O_RDWR:
+		cap_rights_set(rights, CAP_READ);
+		/* FALLTHRU */
+	case O_WRONLY:
+		cap_rights_set(rights, CAP_WRITE);
+		if (!(flags & (O_APPEND | O_TRUNC)))
+			cap_rights_set(rights, CAP_SEEK);
+		break;
+	}
+	if (flags & O_CREAT)
+		cap_rights_set(rights, CAP_CREATE);
+	if (flags & O_TRUNC)
+		cap_rights_set(rights, CAP_FTRUNCATE);
+	if (flags & (O_DSYNC|FASYNC))
+		cap_rights_set(rights, CAP_FSYNC);
+}
+
 static struct file *path_openat(int dfd, struct filename *pathname,
 		struct nameidata *nd, const struct open_flags *op, int flags)
 {
+	struct capsicum_rights rights;
+	const struct capsicum_rights *dfd_rights;
 	struct file *base = NULL;
 	struct file *file;
 	struct path path;
 	int opened = 0;
 	int error;
 
+	cap_rights_init(&rights, CAP_LOOKUP);
 	file = get_empty_filp();
 	if (IS_ERR(file))
 		return file;
@@ -3238,7 +3287,9 @@ static struct file *path_openat(int dfd, struct filename *pathname,
 		goto out;
 	}
 
-	error = path_init(dfd, pathname->name, flags | LOOKUP_PARENT, nd, &base);
+	openat_primary_rights(&rights, file->f_flags);
+	error = path_init(dfd, pathname->name, &flags, nd, &base,
+			  &dfd_rights, &rights);
 	if (unlikely(error))
 		goto out;
 
@@ -3268,6 +3319,16 @@ static struct file *path_openat(int dfd, struct filename *pathname,
 		error = do_last(nd, &path, file, op, &opened, pathname);
 		put_link(nd, &link, cookie);
 	}
+	if (!error) {
+		struct file *install_file;
+		install_file = security_file_install(dfd_rights, file);
+		if (IS_ERR(install_file)) {
+			error = PTR_ERR(install_file);
+			goto out;
+		} else {
+			file = install_file;
+		}
+	}
 out:
 	if (nd->root.mnt && !(nd->flags & LOOKUP_ROOT))
 		path_put(&nd->root);
@@ -3326,8 +3387,12 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
 	return file;
 }
 
-struct dentry *kern_path_create(int dfd, const char *pathname,
-				struct path *path, unsigned int lookup_flags)
+static struct dentry *
+kern_path_create_rights(int dfd,
+			const char *pathname,
+			struct path *path,
+			unsigned int lookup_flags,
+			const struct capsicum_rights *rights)
 {
 	struct dentry *dentry = ERR_PTR(-EEXIST);
 	struct nameidata nd;
@@ -3341,7 +3406,8 @@ struct dentry *kern_path_create(int dfd, const char *pathname,
 	 */
 	lookup_flags &= LOOKUP_REVAL;
 
-	error = do_path_lookup(dfd, pathname, LOOKUP_PARENT|lookup_flags, &nd);
+	error = do_path_lookup(dfd, pathname, LOOKUP_PARENT|lookup_flags, &nd,
+			       rights);
 	if (error)
 		return ERR_PTR(error);
 
@@ -3395,6 +3461,13 @@ out:
 	path_put(&nd.path);
 	return dentry;
 }
+
+struct dentry *kern_path_create(int dfd, const char *pathname,
+				struct path *path, unsigned int lookup_flags)
+{
+	return kern_path_create_rights(dfd, pathname, path, lookup_flags,
+				       &lookup_rights);
+}
 EXPORT_SYMBOL(kern_path_create);
 
 void done_path_create(struct path *path, struct dentry *dentry)
@@ -3406,17 +3479,29 @@ void done_path_create(struct path *path, struct dentry *dentry)
 }
 EXPORT_SYMBOL(done_path_create);
 
-struct dentry *user_path_create(int dfd, const char __user *pathname,
-				struct path *path, unsigned int lookup_flags)
+static struct dentry *
+user_path_create_rights(int dfd,
+			const char __user *pathname,
+			struct path *path,
+			unsigned int lookup_flags,
+			const struct capsicum_rights *rights)
 {
 	struct filename *tmp = getname(pathname);
 	struct dentry *res;
 	if (IS_ERR(tmp))
 		return ERR_CAST(tmp);
-	res = kern_path_create(dfd, tmp->name, path, lookup_flags);
+	res = kern_path_create_rights(dfd, tmp->name, path, lookup_flags,
+				      rights);
 	putname(tmp);
 	return res;
 }
+
+struct dentry *user_path_create(int dfd, const char __user *pathname,
+				struct path *path, unsigned int lookup_flags)
+{
+	return user_path_create_rights(dfd, pathname, path, lookup_flags,
+				       &lookup_rights);
+}
 EXPORT_SYMBOL(user_path_create);
 
 int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
@@ -3467,16 +3552,28 @@ static int may_mknod(umode_t mode)
 SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
 		unsigned, dev)
 {
+	struct capsicum_rights rights;
 	struct dentry *dentry;
 	struct path path;
 	int error;
 	unsigned int lookup_flags = 0;
 
+	cap_rights_init(&rights, CAP_LOOKUP);
 	error = may_mknod(mode);
 	if (error)
 		return error;
+
+	switch (mode & S_IFMT) {
+	case S_IFCHR: case S_IFBLK:
+		cap_rights_set(&rights, CAP_MKNODAT);
+		break;
+	case S_IFIFO:
+		cap_rights_set(&rights, CAP_MKFIFOAT);
+		break;
+	}
 retry:
-	dentry = user_path_create(dfd, filename, &path, lookup_flags);
+	dentry = user_path_create_rights(dfd, filename, &path, lookup_flags,
+					 &rights);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
@@ -3543,9 +3640,12 @@ SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_DIRECTORY;
+	struct capsicum_rights rights;
+	cap_rights_init(&rights, CAP_LOOKUP, CAP_MKDIRAT);
 
 retry:
-	dentry = user_path_create(dfd, pathname, &path, lookup_flags);
+	dentry = user_path_create_rights(dfd, pathname, &path, lookup_flags,
+					 &rights);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
@@ -3636,9 +3736,11 @@ static long do_rmdir(int dfd, const char __user *pathname)
 	struct filename *name;
 	struct dentry *dentry;
 	struct nameidata nd;
+	struct capsicum_rights rights;
 	unsigned int lookup_flags = 0;
+	cap_rights_init(&rights, CAP_UNLINKAT);
 retry:
-	name = user_path_parent(dfd, pathname, &nd, lookup_flags);
+	name = user_path_parent(dfd, pathname, &nd, lookup_flags, &rights);
 	if (IS_ERR(name))
 		return PTR_ERR(name);
 
@@ -3763,8 +3865,10 @@ static long do_unlinkat(int dfd, const char __user *pathname)
 	struct inode *inode = NULL;
 	struct inode *delegated_inode = NULL;
 	unsigned int lookup_flags = 0;
+	struct capsicum_rights rights;
+	cap_rights_init(&rights, CAP_UNLINKAT);
 retry:
-	name = user_path_parent(dfd, pathname, &nd, lookup_flags);
+	name = user_path_parent(dfd, pathname, &nd, lookup_flags, &rights);
 	if (IS_ERR(name))
 		return PTR_ERR(name);
 
@@ -3870,12 +3974,15 @@ SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
 	struct dentry *dentry;
 	struct path path;
 	unsigned int lookup_flags = 0;
+	struct capsicum_rights rights;
 
 	from = getname(oldname);
 	if (IS_ERR(from))
 		return PTR_ERR(from);
+	cap_rights_init(&rights, CAP_SYMLINKAT);
 retry:
-	dentry = user_path_create(newdfd, newname, &path, lookup_flags);
+	dentry = user_path_create_rights(newdfd, newname, &path, lookup_flags,
+					 &rights);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto out_putname;
@@ -3986,6 +4093,7 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
 	struct dentry *new_dentry;
 	struct path old_path, new_path;
 	struct inode *delegated_inode = NULL;
+	struct capsicum_rights rights;
 	int how = 0;
 	int error;
 
@@ -4004,13 +4112,14 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
 
 	if (flags & AT_SYMLINK_FOLLOW)
 		how |= LOOKUP_FOLLOW;
+	cap_rights_init(&rights, CAP_LINKAT);
 retry:
 	error = user_path_at(olddfd, oldname, how, &old_path);
 	if (error)
 		return error;
 
-	new_dentry = user_path_create(newdfd, newname, &new_path,
-					(how & LOOKUP_REVAL));
+	new_dentry = user_path_create_rights(newdfd, newname, &new_path,
+					     (how & LOOKUP_REVAL), &rights);
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry))
 		goto out;
@@ -4241,6 +4350,8 @@ SYSCALL_DEFINE5(renameat2, int, olddfd, const char __user *, oldname,
 	struct inode *delegated_inode = NULL;
 	struct filename *from;
 	struct filename *to;
+	struct capsicum_rights old_rights;
+	struct capsicum_rights new_rights;
 	unsigned int lookup_flags = 0;
 	bool should_retry = false;
 	int error;
@@ -4251,14 +4362,18 @@ SYSCALL_DEFINE5(renameat2, int, olddfd, const char __user *, oldname,
 	if ((flags & RENAME_NOREPLACE) && (flags & RENAME_EXCHANGE))
 		return -EINVAL;
 
+	cap_rights_init(&old_rights, CAP_RENAMEAT);
+	cap_rights_init(&new_rights, CAP_LINKAT);
 retry:
-	from = user_path_parent(olddfd, oldname, &oldnd, lookup_flags);
+	from = user_path_parent(olddfd, oldname, &oldnd, lookup_flags,
+				&old_rights);
 	if (IS_ERR(from)) {
 		error = PTR_ERR(from);
 		goto exit;
 	}
 
-	to = user_path_parent(newdfd, newname, &newnd, lookup_flags);
+	to = user_path_parent(newdfd, newname, &newnd, lookup_flags,
+			      &new_rights);
 	if (IS_ERR(to)) {
 		error = PTR_ERR(to);
 		goto exit1;
diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index abc8cbcfe90e..33a269166b05 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -25,6 +25,7 @@
 #include <linux/slab.h>
 #include <linux/fdtable.h>
 #include <linux/fsnotify_backend.h>
+#include <linux/security.h>
 
 int dir_notify_enable __read_mostly = 1;
 
@@ -327,6 +328,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
 
 	rcu_read_lock();
 	f = fcheck(fd);
+	f = security_file_lookup(f, NULL, NULL);
 	rcu_read_unlock();
 
 	/* if (f != filp) means that we lost a race and another task/thread
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 0788d093f5d8..d260dd1acdee 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -20,6 +20,7 @@ static int seq_show(struct seq_file *m, void *v)
 	struct files_struct *files = NULL;
 	int f_flags = 0, ret = -ENOENT;
 	struct file *file = NULL;
+	struct file *underlying = NULL;
 	struct task_struct *task;
 
 	task = get_proc_task(m->private);
@@ -36,12 +37,13 @@ static int seq_show(struct seq_file *m, void *v)
 		file = fcheck_files(files, fd);
 		if (file) {
 			struct fdtable *fdt = files_fdtable(files);
-
-			f_flags = file->f_flags;
+			underlying = security_file_lookup(file, NULL, NULL);
+			f_flags = underlying->f_flags;
 			if (close_on_exec(fd, fdt))
 				f_flags |= O_CLOEXEC;
 
 			get_file(file);
+			get_file(underlying);
 			ret = 0;
 		}
 		spin_unlock(&files->file_lock);
@@ -50,10 +52,11 @@ static int seq_show(struct seq_file *m, void *v)
 
 	if (!ret) {
 		seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\n",
-			   (long long)file->f_pos, f_flags,
-			   real_mount(file->f_path.mnt)->mnt_id);
+			   (long long)underlying->f_pos, f_flags,
+			   real_mount(underlying->f_path.mnt)->mnt_id);
 		if (file->f_op->show_fdinfo)
 			ret = file->f_op->show_fdinfo(m, file);
+		fput(underlying);
 		fput(file);
 	}
 
@@ -95,7 +98,9 @@ static int tid_fd_revalidate(struct dentry *dentry, unsigned int flags)
 			rcu_read_lock();
 			file = fcheck_files(files, fd);
 			if (file) {
-				unsigned f_mode = file->f_mode;
+				unsigned f_mode;
+				file = security_file_lookup(file, NULL, NULL);
+				f_mode = file->f_mode;
 
 				rcu_read_unlock();
 				put_files_struct(files);
@@ -158,6 +163,7 @@ static int proc_fd_link(struct dentry *dentry, struct path *path)
 		spin_lock(&files->file_lock);
 		fd_file = fcheck_files(files, fd);
 		if (fd_file) {
+			fd_file = security_file_lookup(fd_file, NULL, NULL);
 			*path = fd_file->f_path;
 			path_get(&fd_file->f_path);
 			ret = 0;
diff --git a/net/socket.c b/net/socket.c
index dbc00f0b992a..f635dc3f9a3c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1669,6 +1669,7 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
 {
 	struct socket *sock, *newsock;
 	struct file *newfile;
+	struct file *installfile;
 	int err, len, newfd, fput_needed;
 	struct sockaddr_storage address;
 	struct capsicum_rights rights;
@@ -1736,7 +1737,12 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
 
 	/* File flags are not inherited via accept() unlike another OSes. */
 
-	fd_install(newfd, newfile);
+	installfile = security_file_install(listen_rights, newfile);
+	if (IS_ERR(installfile)) {
+		err = PTR_ERR(installfile);
+		goto out_fd;
+	}
+	fd_install(newfd, installfile);
 	err = newfd;
 
 out_put:
@@ -2115,7 +2121,7 @@ static int ___sys_sendmsg(struct socket *sock_noaddr, struct socket *sock_addr,
 	}
 	sock = (msg_sys->msg_name ? sock_addr : sock_noaddr);
 	if (!sock)
-		return -EBADF;
+		return -ENOTCAPABLE;
 
 	if (msg_sys->msg_iovlen > UIO_FASTIOV) {
 		err = -EMSGSIZE;
-- 
2.0.0.526.g5318336



* [PATCH 11/11] capsicum: add syscalls to limit FD rights
  2014-06-30 10:28 [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) David Drysdale
                   ` (9 preceding siblings ...)
  2014-06-30 10:28 ` [PATCH 10/11] capsicum: invocation " David Drysdale
@ 2014-06-30 10:28 ` David Drysdale
  2014-06-30 10:28 ` [PATCH 1/5] man-pages: open.2: describe O_BENEATH_ONLY flag David Drysdale
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Add the cap_rights_limit(2) and cap_rights_get(2) syscalls to
allow modification and retrieval of the rights associated with
a file descriptor.

When a normal file descriptor has its rights restricted in any
way, it becomes a Capsicum capability file descriptor.  This is
a wrapper struct file that is installed in the fdtable in place
of the original file.  From this point on, when the FD is converted
to a struct file by fget() (or equivalent), the wrapper is checked
for the appropriate rights and the wrapped inner normal file is
returned.

When a Capsicum capability file descriptor has its rights restricted
further (they cannot be expanded), a new wrapper is created with
the restricted rights, also wrapping the same inner normal file.
In other words, the .underlying field in a struct capsicum_capability
is always a normal file, never another Capsicum capability file.

These syscalls take the components of the compound rights
structure as separate arguments, so that an unspecified
component is left unchanged.

Note that in FreeBSD 10.x the function of this pair of syscalls
is implemented as 3 distinct pairs of syscalls, one pair for each
component of the compound rights (primary/fcntl/ioctl).

Signed-off-by: David Drysdale <drysdale@google.com>
---
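Note for reviewers (below the '---', so not part of the commit): the
two invariants described above, that rights can only be narrowed and
that a wrapper always points at a normal file, can be sketched as a
userspace model.  All names here are hypothetical, the rights are a
flat bitmask rather than the compound structure, the FD table is a
single slot, and locking and reference counting are omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; these are not the kernel's types. */
#define MODEL_ENOTCAPABLE 1	/* stand-in error code for this sketch only */

struct model_file {
	int is_cap;			/* nonzero: capability wrapper */
	uint64_t rights;		/* only meaningful for wrappers */
	struct model_file *underlying;	/* only meaningful for wrappers */
};

/*
 * Replace the file in 'slot' with 'wrapper' carrying 'new_rights'.
 * If the slot already holds a wrapper, the new rights must be a subset
 * of the old ones, and the new wrapper points at the old wrapper's
 * underlying file, never at the old wrapper itself.
 */
static int model_rights_limit(struct model_file **slot,
			      struct model_file *wrapper,
			      uint64_t new_rights)
{
	struct model_file *cur = *slot;
	struct model_file *base = cur;

	if (cur->is_cap) {
		/* Reject attempts to widen existing rights */
		if ((cur->rights & new_rights) != new_rights)
			return -MODEL_ENOTCAPABLE;
		base = cur->underlying;
	}
	wrapper->is_cap = 1;
	wrapper->rights = new_rights;
	wrapper->underlying = base;
	*slot = wrapper;
	return 0;
}
```

Limiting the same slot twice in this model leaves the second wrapper
pointing directly at the original file, and an attempt to regain a
dropped right fails without changing the slot.
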
 arch/x86/syscalls/syscall_64.tbl |   2 +
 include/linux/syscalls.h         |  12 ++++
 kernel/sys_ni.c                  |   4 ++
 security/capsicum.c              | 143 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 161 insertions(+)

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 04376ac3d9ef..d408116dace5 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,8 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+318	common	cap_rights_limit	sys_cap_rights_limit
+319	common	cap_rights_get		sys_cap_rights_get
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a4a0588c5397..55666f3a4185 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -65,6 +65,7 @@ struct old_linux_dirent;
 struct perf_event_attr;
 struct file_handle;
 struct sigaltstack;
+struct cap_rights;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -866,4 +867,15 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
 asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
 			 unsigned long idx1, unsigned long idx2);
 asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_cap_rights_limit(unsigned int orig_fd,
+				     const struct cap_rights __user *new_rights,
+				     unsigned int fcntls,
+				     int nioctls,
+				     unsigned int __user *ioctls);
+asmlinkage long sys_cap_rights_get(unsigned int fd,
+				   struct cap_rights __user *rightsp,
+				   unsigned int __user *fcntls,
+				   int __user *nioctls,
+				   unsigned int __user *ioctls);
+
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index bc8d1b74a6b9..2f09e5ee64f7 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -211,3 +211,7 @@ cond_syscall(compat_sys_open_by_handle_at);
 
 /* compare kernel pointers */
 cond_syscall(sys_kcmp);
+
+/* capsicum object capabilities */
+cond_syscall(sys_cap_rights_get);
+cond_syscall(sys_cap_rights_limit);
diff --git a/security/capsicum.c b/security/capsicum.c
index 83677eef3fb6..4e4458801866 100644
--- a/security/capsicum.c
+++ b/security/capsicum.c
@@ -125,6 +125,149 @@ out_err:
 	return ERR_PTR(err);
 }
 
+/* Takes ownership of rights->ioctls */
+static int capsicum_rights_limit(unsigned int fd,
+				 struct capsicum_rights *rights)
+{
+	int rc = -EBADF;
+	struct capsicum_capability *cap;
+	struct file *capf = NULL;
+	struct file *file;  /* current file for fd */
+	struct file *underlying; /* base file for capability */
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+
+	/* Allocate capability before taking files->file_lock */
+	capf = capsicum_cap_alloc(rights, true);
+	rights->ioctls = NULL;  /* capsicum_cap_alloc took ownership */
+	if (IS_ERR(capf))
+		return PTR_ERR(capf);
+	cap = capf->private_data;
+
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+	if (fd >= fdt->max_fds)
+		goto out_err;
+	file = fdt->fd[fd];
+	if (!file)
+		goto out_err;
+
+	/* If we're limiting an existing Capsicum capability object, ensure
+	 * we wrap its underlying normal file. */
+	if (capsicum_is_cap(file)) {
+		struct capsicum_capability *old_cap = file->private_data;
+		/* Reject attempts to widen existing rights */
+		if (!cap_rights_contains(&old_cap->rights, &cap->rights)) {
+			rc = -ENOTCAPABLE;
+			goto out_err;
+		}
+		underlying = old_cap->underlying;
+	} else {
+		underlying = file;
+	}
+	if (!atomic_long_inc_not_zero(&underlying->f_count)) {
+		rc = -EBADF;
+		goto out_err;
+	}
+	cap->underlying = underlying;
+
+	fput(file);
+	rcu_assign_pointer(fdt->fd[fd], capf);
+	spin_unlock(&files->file_lock);
+	return 0;
+out_err:
+	spin_unlock(&files->file_lock);
+	fput(capf);
+	return rc;
+}
+
+SYSCALL_DEFINE5(cap_rights_limit,
+		unsigned int, fd,
+		const struct cap_rights __user *, new_rights,
+		unsigned int, new_fcntls,
+		int, nioctls,
+		unsigned int __user *, new_ioctls)
+{
+	struct capsicum_rights rights;
+
+	if (!new_rights)
+		return -EFAULT;
+	if (nioctls < 0 && nioctls != -1)
+		return -EINVAL;
+	if (copy_from_user(&rights.primary, new_rights,
+			   sizeof(struct cap_rights)))
+		return -EFAULT;
+	rights.fcntls = new_fcntls;
+	rights.nioctls = nioctls;
+	if (rights.nioctls > 0) {
+		size_t size;
+		if (!new_ioctls)
+			return -EINVAL;
+		size = rights.nioctls * sizeof(unsigned int);
+		rights.ioctls = kmalloc(size, GFP_KERNEL);
+		if (!rights.ioctls)
+			return -ENOMEM;
+		if (copy_from_user(rights.ioctls, new_ioctls, size)) {
+			kfree(rights.ioctls);
+			return -EFAULT;
+		}
+	} else {
+		rights.ioctls = NULL;
+	}
+	if (cap_rights_regularize(&rights))
+		return -ENOTCAPABLE;
+
+	return capsicum_rights_limit(fd, &rights);
+}
+
+SYSCALL_DEFINE5(cap_rights_get,
+		unsigned int, fd,
+		struct cap_rights __user *, rightsp,
+		unsigned int __user *, fcntls,
+		int __user *, nioctls,
+		unsigned int __user *, ioctls)
+{
+	int result = -EFAULT;
+	struct file *file;
+	struct capsicum_rights *rights = &all_rights;
+	int ioctls_to_copy = -1;
+
+	file = fget_raw(fd);
+	if (file == NULL)
+		return -EBADF;
+	if (capsicum_is_cap(file)) {
+		struct capsicum_capability *cap = file->private_data;
+		rights = &cap->rights;
+	}
+
+	if (rightsp) {
+		if (copy_to_user(rightsp, &rights->primary,
+				 sizeof(struct cap_rights)))
+			goto out;
+	}
+	if (fcntls) {
+		if (put_user(rights->fcntls, fcntls))
+			goto out;
+	}
+	if (nioctls) {
+		int n;
+		if (get_user(n, nioctls))
+			goto out;
+		if (put_user(rights->nioctls, nioctls))
+			goto out;
+		ioctls_to_copy = min(rights->nioctls, n);
+	}
+	if (ioctls && ioctls_to_copy > 0) {
+		if (copy_to_user(ioctls, rights->ioctls,
+				 ioctls_to_copy * sizeof(unsigned int)))
+			goto out;
+	}
+	result = 0;
+out:
+	fput(file);
+	return result;
+}
+
 /*
  * File operations functions.
  */
-- 
2.0.0.526.g5318336
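
The argument checks at the top of the cap_rights_limit syscall in the patch above can be restated as a small userspace sketch. This is only a model of the validation order visible in the diff: the helper name model_check_limit_args is hypothetical, and it omits the copy_from_user() failures that the real code also maps to -EFAULT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model (not kernel code) of the argument validation in
 * the patch: a NULL rights pointer is -EFAULT, a negative nioctls other
 * than -1 is -EINVAL, and a positive nioctls needs a non-NULL array.
 */
static int model_check_limit_args(const void *rights, int nioctls,
				  const unsigned int *ioctls)
{
	if (!rights)
		return -EFAULT;
	if (nioctls < -1)
		return -EINVAL;		/* only -1 means "all ioctls" */
	if (nioctls > 0 && !ioctls)
		return -EINVAL;		/* count given but no list */
	return 0;			/* ioctls is ignored for -1 and 0 */
}
```

Note that with nioctls of -1 or 0 the ioctls pointer is never inspected, which matches the "argument is ignored" wording in the cap_rights_limit.2 page later in the series.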



* [PATCH 1/5] man-pages: open.2: describe O_BENEATH_ONLY flag
  2014-06-30 10:28 [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) David Drysdale
                   ` (10 preceding siblings ...)
  2014-06-30 10:28 ` [PATCH 11/11] capsicum: add syscalls to limit FD rights David Drysdale
@ 2014-06-30 10:28 ` David Drysdale
  2014-06-30 22:22   ` Andy Lutomirski
  2014-06-30 10:28 ` [PATCH 2/5] man-pages: capsicum.7: describe Capsicum capability framework David Drysdale
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Signed-off-by: David Drysdale <drysdale@google.com>
---
 man2/open.2 | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/man2/open.2 b/man2/open.2
index 3824ab5be1f0..ba0da01c1a4f 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -713,7 +713,7 @@ in a fully formed state (using
 as described above).
 .RE
 .IP
-.B O_TMPFILE
+.B O_TMPFILE " (since Linux 3.??)"
 requires support by the underlying filesystem;
 only a subset of Linux filesystems provide that support.
 In the initial implementation, support was provided in
@@ -723,6 +723,31 @@ XFS support was added
 .\" commit ab29743117f9f4c22ac44c13c1647fb24fb2bafe
 in Linux 3.15.
 .TP
+.B O_BENEATH_ONLY
+Ensure that the
+.I pathname
+is beneath the current working directory (for
+.BR open (2))
+or the
+.I dirfd
+(for
+.BR openat (2)).
+If the
+.I pathname
+is absolute or contains a path component of "..", the
+.BR open ()
+fails with the error
+.BR EACCES .
+This occurs even if the ".." path component would not actually
+escape the original directory; for example, a
+.I pathname
+of "subdir/../filename" would be rejected.
+Path components that are symbolic links to absolute paths, or to
+relative paths containing a ".." component, also cause the
+.BR open ()
+operation to fail with the error
+.BR EACCES .
+.TP
 .B O_TRUNC
 If the file already exists and is a regular file and the access mode allows
 writing (i.e., is
@@ -799,7 +824,11 @@ The requested access to the file is not allowed, or search permission
 is denied for one of the directories in the path prefix of
 .IR pathname ,
 or the file did not exist yet and write access to the parent directory
-is not allowed.
+is not allowed, or the
+.B O_BENEATH_ONLY
+flag was specified and the
+.I pathname
+was not beneath the relevant directory.
 (See also
 .BR path_resolution (7).)
 .TP
--
2.0.0.526.g5318336
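
The purely lexical part of the O_BENEATH_ONLY rule described in the man page text above (reject absolute paths and any ".." component, even a harmless one) can be sketched in userspace. This is an illustration only: the helper beneath_only_lexical_ok is hypothetical and not part of the patch, and it cannot model the symbolic-link checks, which need the kernel's path walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): returns true if `path`
 * passes the lexical checks described for O_BENEATH_ONLY.
 * Symbolic links are NOT examined here.
 */
static bool beneath_only_lexical_ok(const char *path)
{
	const char *p = path;

	if (p[0] == '/')
		return false;			/* absolute path */
	while (*p) {
		size_t len = strcspn(p, "/");	/* length of this component */

		if (len == 2 && p[0] == '.' && p[1] == '.')
			return false;		/* ".." component */
		p += len;
		while (*p == '/')
			p++;			/* skip separator(s) */
	}
	return true;
}
```

Under this sketch "subdir/../filename" is rejected, as the man page text says, while a component such as "..foo" is accepted because it is not exactly "..".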



* [PATCH 2/5] man-pages: capsicum.7: describe Capsicum capability framework
  2014-06-30 10:28 [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) David Drysdale
                   ` (11 preceding siblings ...)
  2014-06-30 10:28 ` [PATCH 1/5] man-pages: open.2: describe O_BENEATH_ONLY flag David Drysdale
@ 2014-06-30 10:28 ` David Drysdale
  2014-06-30 10:28 ` [PATCH 3/5] man-pages: rights.7: Describe Capsicum primary rights David Drysdale
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Signed-off-by: David Drysdale <drysdale@google.com>
---
 man7/capsicum.7 | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)
 create mode 100644 man7/capsicum.7

diff --git a/man7/capsicum.7 b/man7/capsicum.7
new file mode 100644
index 000000000000..e736060bb5bc
--- /dev/null
+++ b/man7/capsicum.7
@@ -0,0 +1,97 @@
+.\"
+.\" Copyright (c) 2014 Google, Inc.
+.\" Copyright (c) 2011, 2013 Robert N. M. Watson
+.\" Copyright (c) 2011 Jonathan Anderson
+.\" All rights reserved.
+.\"
+.\" %%%LICENSE_START(BSD_2_CLAUSE)
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\" %%%LICENSE_END
+.\"
+.TH CAPSICUM 7 2014-05-07 "Linux" "Linux Programmer's Manual"
+.SH NAME
+capsicum \- lightweight OS capability and sandbox framework
+.SH SYNOPSIS
+.B #include <sys/capsicum.h>
+.SH DESCRIPTION
+Capsicum is a lightweight OS capability and sandbox framework implementing a hybrid
+capability system model.
+Capsicum can be used for application and library compartmentalisation, the
+decomposition of larger bodies of software into isolated (sandboxed)
+components in order to implement security policies and limit the impact of
+software vulnerabilities.
+.PP
+Capsicum provides three core kernel mechanisms,
+.IR "Capsicum capabilities" ,
+.I "capability mode"
+and
+.IR "process descriptors" ,
+each described below.
+
+.SS Capsicum Capabilities
+A
+.I Capsicum capability
+is a file descriptor that has been limited so that only
+certain operations can be performed on it.
+For example, a file descriptor returned by
+.BR open (2)
+may be refined using
+.BR cap_rights_limit (2)
+so that only
+.BR read (2)
+and
+.BR write (2)
+can be called on it, but not
+.BR fchmod (2).
+The complete list of the capability rights can be found in the
+.BR rights (7)
+manual page.
+
+.SS Capability Mode
+Capsicum capability mode is a process mode, entered by invoking
+.BR cap_enter (3),
+in which access to global OS namespaces (such as the file system and PID
+namespaces) is restricted; only explicitly delegated rights, referenced by
+memory mappings or file descriptors, may be used.
+Once set, the flag is inherited by future child processes, and may not be
+cleared.
+
+.SS Process Descriptors
+.I Process descriptors
+are file descriptors representing processes, allowing parent processes to manage
+child processes without requiring access to the PID namespace, and are described in
+greater detail in
+.BR procdesc (7).
+.SH VERSIONS
+Capsicum support is available in the kernel since version 3.???.
+.SH SEE ALSO
+.BR cap_enter (3),
+.BR cap_getmode (3),
+.BR cap_rights_get (2),
+.BR cap_rights_limit (2),
+.BR pdfork (2),
+.BR pdgetpid (2),
+.BR pdkill (2),
+.BR pdwait4 (2),
+.BR procdesc (7),
+.BR rights (7)
+
--
2.0.0.526.g5318336



* [PATCH 3/5] man-pages: rights.7: Describe Capsicum primary rights
  2014-06-30 10:28 [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) David Drysdale
                   ` (12 preceding siblings ...)
  2014-06-30 10:28 ` [PATCH 2/5] man-pages: capsicum.7: describe Capsicum capability framework David Drysdale
@ 2014-06-30 10:28 ` David Drysdale
  2014-06-30 10:28 ` [PATCH 4/5] man-pages: cap_rights_limit.2: limit FD rights for Capsicum David Drysdale
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Signed-off-by: David Drysdale <drysdale@google.com>
---
 man7/rights.7 | 525 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 525 insertions(+)
 create mode 100644 man7/rights.7

diff --git a/man7/rights.7 b/man7/rights.7
new file mode 100644
index 000000000000..33bb8e48d12f
--- /dev/null
+++ b/man7/rights.7
@@ -0,0 +1,525 @@
+.\"
+.\" Copyright (c) 2014 Google, Inc.
+.\" Copyright (c) 2012-2013 The FreeBSD Foundation
+.\" Copyright (c) 2008-2010 Robert N. M. Watson
+.\" All rights reserved.
+.\"
+.\" This software was developed at the University of Cambridge Computer
+.\" Laboratory with support from a grant from Google, Inc.
+.\"
+.\" %%%LICENSE_START(BSD_2_CLAUSE)
+.\" Portions of this documentation were written by Pawel Jakub Dawidek
+.\" under sponsorship from the FreeBSD Foundation.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\" %%%LICENSE_END
+.\"
+.TH RIGHTS 7 2014-05-07 "Linux" "Linux Programmer's Manual"
+.SH NAME
+rights \- Capsicum capability rights for file descriptors
+.SH SYNOPSIS
+.B #include <linux/capsicum.h>
+.SH DESCRIPTION
+When a file descriptor is created by a function such as
+.BR accept (2),
+.BR accept4 (2),
+.BR creat (2),
+.BR epoll_create (2),
+.BR eventfd (2),
+.BR mq_open (2),
+.BR open (2),
+.BR openat (2),
+.BR pdfork (2),
+.BR pipe (2),
+.BR pipe2 (2),
+.BR signalfd (2),
+.BR socket (2),
+.BR socketpair (2)
+or
+.BR timerfd_create (2),
+it implicitly has all Capsicum capability rights.
+Those rights can be reduced (but never expanded) by using the
+.BR cap_rights_limit (2)
+system call.
+Once capability rights are reduced, operations on the file descriptor will be
+limited to those permitted by the associated rights.
+.PP
+The list of primary capability rights is provided below. In addition,
+.BR ioctl (2)
+and
+.BR fcntl (2)
+can also be restricted to only allow specific commands.
+.PP
+The
+.I "struct cap_rights"
+type is used to store a list of primary capability rights; the
+.BR cap_rights_init (3)
+family of functions should be used to manage the structure.
+.SH RIGHTS
+The following rights may be specified in a rights mask:
+.TP
+.B CAP_ACCEPT
+Permit
+.BR accept (2)
+and
+.BR accept4 (2).
+.TP
+.B CAP_BIND
+Permit
+.BR bind (2).
+Note that sockets can also become bound implicitly as a result of
+.BR connect (2)
+or
+.BR send (2),
+and that socket options set with
+.BR setsockopt (2)
+may also affect binding behavior.
+.TP
+.B CAP_CONNECT
+Permit
+.BR connect (2);
+also required for
+.BR sendto (2)
+with a non-NULL destination address.
+.TP
+.B CAP_CREATE
+Permit
+.BR openat (2)
+with the
+.B O_CREAT
+flag.
+.TP
+.B CAP_EVENT
+Permit
+.BR select (2),
+.BR poll (2),
+and
+.BR epoll (7)
+to be used in monitoring the file descriptor for events.
+.TP
+.B CAP_EXTATTR_DELETE
+Permit
+.BR fremovexattr (2).
+.TP
+.B CAP_EXTATTR_GET
+Permit
+.BR fgetxattr (2).
+.TP
+.B CAP_EXTATTR_LIST
+Permit
+.BR flistxattr (2).
+.TP
+.B CAP_EXTATTR_SET
+Permit
+.BR fsetxattr (2).
+.TP
+.B CAP_FCHDIR
+Permit
+.BR fchdir (2).
+.TP
+.B CAP_FCHMOD
+Permit
+.BR fchmod (2)
+and
+.BR fchmodat (2)
+if the
+.B CAP_LOOKUP
+right is also present.
+.TP
+.B CAP_FCHMODAT
+An alias to
+.B CAP_FCHMOD
+and
+.BR CAP_LOOKUP .
+.TP
+.B CAP_FCHOWN
+Permit
+.BR fchown (2)
+and
+.BR fchownat (2)
+if the
+.B CAP_LOOKUP
+right is also present.
+.TP
+.B CAP_FCHOWNAT
+An alias to
+.B CAP_FCHOWN
+and
+.BR CAP_LOOKUP .
+.TP
+.B CAP_FCNTL
+Permit
+.BR fcntl (2).
+Note that only the
+.BR F_GETFL ,
+.BR F_SETFL ,
+.BR F_GETOWN ,
+.BR F_SETOWN ,
+.B F_GETOWN_EX
+and
+.B F_SETOWN_EX
+commands require this capability right.
+Also note that the list of permitted commands can be further limited with the
+.BR cap_rights_limit (2)
+system call.
+.TP
+.B CAP_FEXECVE
+Permit
+.BR execveat (2)
+and
+.BR openat (2)
+with the
+.B O_EXEC
+flag;
+.B CAP_READ
+is also required.
+.TP
+.B CAP_FLOCK
+Permit
+.BR flock (2)
+and
+.BR fcntl (2)
+(with
+.BR F_GETLK ,
+.BR F_SETLK
+or
+.B F_SETLKW
+command).
+.TP
+.B CAP_FSTAT
+Permit
+.BR fstat (2).
+.TP
+.B CAP_FSTATFS
+Permit
+.BR fstatfs (2).
+.TP
+.B CAP_FSYNC
+Permit
+.BR fsync (2)
+and
+.BR openat (2)
+with the
+.B O_SYNC
+flag.
+.TP
+.B CAP_FTRUNCATE
+Permit
+.BR ftruncate (2)
+and
+.BR openat (2)
+with the
+.B O_TRUNC
+flag.
+.TP
+.B CAP_FUTIMES
+Permit
+.BR futimesat (2)
+if the
+.B CAP_LOOKUP
+right is also present.
+.TP
+.B CAP_FUTIMESAT
+An alias to
+.B CAP_FUTIMES
+and
+.BR CAP_LOOKUP .
+.TP
+.B CAP_GETPEERNAME
+Permit
+.BR getpeername (2).
+.TP
+.B CAP_GETSOCKNAME
+Permit
+.BR getsockname (2).
+.TP
+.B CAP_GETSOCKOPT
+Permit
+.BR getsockopt (2).
+.TP
+.B CAP_IOCTL
+Permit
+.BR ioctl (2).
+Be aware that this system call has enormous scope, including potentially
+global scope for some objects.
+The list of permitted ioctl commands can be further limited with the
+.BR cap_rights_limit (2)
+system call.
+.TP
+.B CAP_LINKAT
+Permit
+.BR linkat (2)
+and
+.BR renameat (2)
+on the destination directory descriptor.
+This right includes the
+.B CAP_LOOKUP
+right.
+.TP
+.B CAP_LISTEN
+Permit
+.BR listen (2);
+not much use (generally) without
+.BR CAP_BIND .
+.TP
+.B CAP_LOOKUP
+Permit the file descriptor to be used as a starting directory for calls such as
+.BR linkat (2),
+.BR openat (2),
+and
+.BR unlinkat (2).
+.TP
+.B CAP_MKDIRAT
+Permit
+.BR mkdirat (2).
+This right includes the
+.B CAP_LOOKUP
+right.
+.TP
+.B CAP_MKFIFOAT
+Permit
+.BR mkfifoat (2).
+This right includes the
+.B CAP_LOOKUP
+right.
+.TP
+.B CAP_MKNODAT
+Permit
+.BR mknodat (2).
+This right includes the
+.B CAP_LOOKUP
+right.
+.TP
+.B CAP_MMAP
+Permit
+.BR mmap (2)
+with the
+.B PROT_NONE
+protection.
+.TP
+.B CAP_MMAP_R
+Permit
+.BR mmap (2)
+with the
+.B PROT_READ
+protection.
+This right includes the
+.B CAP_READ
+and
+.B CAP_SEEK
+rights.
+.TP
+.B CAP_MMAP_RW
+An alias to
+.B CAP_MMAP_R
+and
+.BR CAP_MMAP_W .
+.TP
+.B CAP_MMAP_RWX
+An alias to
+.BR CAP_MMAP_R ,
+.B CAP_MMAP_W
+and
+.BR CAP_MMAP_X .
+.TP
+.B CAP_MMAP_RX
+An alias to
+.B CAP_MMAP_R
+and
+.BR CAP_MMAP_X .
+.TP
+.B CAP_MMAP_W
+Permit
+.BR mmap (2)
+with the
+.B PROT_WRITE
+protection.
+This right includes the
+.B CAP_WRITE
+and
+.B CAP_SEEK
+rights.
+.TP
+.B CAP_MMAP_WX
+An alias to
+.B CAP_MMAP_W
+and
+.BR CAP_MMAP_X .
+.TP
+.B CAP_MMAP_X
+Permit
+.BR mmap (2)
+with the
+.B PROT_EXEC
+protection.
+This right includes the
+.B CAP_SEEK
+right.
+.TP
+.B CAP_PDGETPID
+Permit
+.BR pdgetpid (2).
+.TP
+.B CAP_PDKILL
+Permit
+.BR pdkill (2).
+.TP
+.B CAP_PDWAIT
+Permit
+.BR pdwait4 (2).
+.TP
+.B CAP_PEELOFF
+Permit
+.BR sctp_peeloff (3).
+.TP
+.B CAP_PREAD
+An alias to
+.B CAP_READ
+and
+.BR CAP_SEEK .
+.TP
+.B CAP_PWRITE
+An alias to
+.B CAP_SEEK
+and
+.BR CAP_WRITE .
+.TP
+.B CAP_READ
+Permit
+.BR openat (2)
+with the
+.BR O_RDONLY " flag,"
+.BR read (2),
+.BR readv (2),
+.BR recv (2),
+.BR recvfrom (2),
+.BR recvmsg (2),
+.BR pread (2)
+(
+.B CAP_SEEK
+is also required),
+.BR preadv (2)
+(
+.B CAP_SEEK
+is also required) and related system calls.
+.TP
+.B CAP_RECV
+An alias to
+.BR CAP_READ .
+.TP
+.B CAP_RENAMEAT
+Permit
+.BR renameat (2).
+This right is required on the source directory descriptor.
+This right includes the
+.B CAP_LOOKUP
+right.
+.TP
+.B CAP_SEEK
+Permit operations that seek on the file descriptor, such as
+.BR lseek (2),
+but also required for I/O system calls that can read or write at any position
+in the file, such as
+.BR pread (2)
+and
+.BR pwrite (2).
+.TP
+.B CAP_SEND
+An alias to
+.BR CAP_WRITE .
+.TP
+.B CAP_SETSOCKOPT
+Permit
+.BR setsockopt (2);
+this controls various aspects of socket behavior and may affect binding,
+connecting, and other behaviors with global scope.
+.TP
+.B CAP_SHUTDOWN
+Permit explicit
+.BR shutdown (2);
+closing the socket will also generally shut down any connections on it.
+.TP
+.B CAP_SYMLINKAT
+Permit
+.BR symlinkat (2).
+This right includes the
+.B CAP_LOOKUP
+right.
+.TP
+.B CAP_UNLINKAT
+Permit
+.BR unlinkat (2)
+and
+.BR renameat (2).
+This right is only required for
+.BR renameat (2)
+on the destination directory descriptor if the destination object already
+exists and will be removed by the rename.
+This right includes the
+.B CAP_LOOKUP
+right.
+.TP
+.B CAP_WRITE
+Allow
+.BR openat (2)
+with
+.B O_WRONLY
+and
+.B O_APPEND
+flags set,
+.BR send (2),
+.BR sendmsg (2),
+.BR sendto (2),
+.BR write (2),
+.BR writev (2),
+.BR pwrite (2),
+.BR pwritev (2)
+and related system calls.
+For
+.BR sendto (2)
+with a non-NULL connection address,
+.B CAP_CONNECT
+is also required.
+For
+.BR openat (2)
+with the
+.B O_WRONLY
+flag, but without the
+.B O_APPEND
+flag,
+.B CAP_SEEK
+is also required.
+For
+.BR pwrite (2)
+and
+.BR pwritev (2),
+.B CAP_SEEK
+is also required.
+.SH VERSIONS
+Capsicum support was originally added to the kernel in version 3.???.
+.SH SEE ALSO
+.BR cap_enter (3),
+.BR cap_fcntls_limit (3),
+.BR cap_ioctls_limit (3),
+.BR cap_rights_limit (2),
+.BR cap_rights_limit (3),
+.BR capsicum (7)
--
2.0.0.526.g5318336



* [PATCH 4/5] man-pages: cap_rights_limit.2: limit FD rights for Capsicum
  2014-06-30 10:28 [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) David Drysdale
                   ` (13 preceding siblings ...)
  2014-06-30 10:28 ` [PATCH 3/5] man-pages: rights.7: Describe Capsicum primary rights David Drysdale
@ 2014-06-30 10:28 ` David Drysdale
  2014-06-30 14:53     ` Andy Lutomirski
  2014-06-30 10:28 ` [PATCH 5/5] man-pages: cap_rights_get: retrieve Capsicum fd rights David Drysdale
  2014-07-03  9:12   ` [Qemu-devel] " Paolo Bonzini
  16 siblings, 1 reply; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Signed-off-by: David Drysdale <drysdale@google.com>
---
 man2/cap_rights_limit.2 | 171 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 171 insertions(+)
 create mode 100644 man2/cap_rights_limit.2

diff --git a/man2/cap_rights_limit.2 b/man2/cap_rights_limit.2
new file mode 100644
index 000000000000..3484ee1076aa
--- /dev/null
+++ b/man2/cap_rights_limit.2
@@ -0,0 +1,171 @@
+.\"
+.\" Copyright (c) 2008-2010 Robert N. M. Watson
+.\" Copyright (c) 2012-2013 The FreeBSD Foundation
+.\" Copyright (c) 2013-2014 Google, Inc.
+.\" All rights reserved.
+.\"
+.\" %%%LICENSE_START(BSD_2_CLAUSE)
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\" %%%LICENSE_END
+.\"
+.TH CAP_RIGHTS_LIMIT 2 2014-05-07 "Linux" "Linux Programmer's Manual"
+.SH NAME
+cap_rights_limit \- limit Capsicum capability rights
+.SH SYNOPSIS
+.nf
+.B #include <sys/capsicum.h>
+.sp
+.BI "int cap_rights_limit(int " fd ", const struct cap_rights *" rights ,
+.BI "                     unsigned int " fcntls ,
+.BI "                     int " nioctls ", unsigned int *" ioctls );
+.SH DESCRIPTION
+When a file descriptor is created by a function such as
+.BR accept (2),
+.BR accept4 (2),
+.BR creat (2),
+.BR epoll_create (2),
+.BR eventfd (2),
+.BR mq_open (2),
+.BR open (2),
+.BR openat (2),
+.BR pdfork (2),
+.BR pipe (2),
+.BR pipe2 (2),
+.BR signalfd (2),
+.BR socket (2),
+.BR socketpair (2)
+or
+.BR timerfd_create (2),
+it implicitly has all Capsicum capability rights.
+Those rights can be reduced (but never expanded) by using the
+.BR cap_rights_limit ()
+system call.
+Once Capsicum capability rights are reduced, operations on the file descriptor
+.I fd
+will be limited to those permitted by the remainder of the arguments.
+.PP
+The
+.I rights
+argument describes the primary rights for the file descriptor, and
+should be prepared using the
+.BR cap_rights_init (3)
+family of functions.  The complete list of primary rights can be found in the
+.BR rights (7)
+manual page.
+.PP
+If a file descriptor is granted the
+.B CAP_FCNTL
+primary capability right, the list of allowed
+.BR fcntl (2)
+commands can be selectively reduced (but never expanded) with the
+.I fcntls
+argument.  The following flags may be specified in the
+.I fcntls
+argument:
+.TP
+.B CAP_FCNTL_GETFL
+Permit
+.B F_GETFL
+command.
+.TP
+.B CAP_FCNTL_SETFL
+Permit
+.B F_SETFL
+command.
+.TP
+.B CAP_FCNTL_GETOWN
+Permit
+.B F_GETOWN
+command.
+.TP
+.B CAP_FCNTL_SETOWN
+Permit
+.B F_SETOWN
+command.
+.PP
+A value of
+.B CAP_FCNTL_ALL
+for the
+.I fcntls
+argument leaves the set of allowed
+.BR fcntl (2)
+commands unchanged.
+.PP
+If a file descriptor is granted the
+.B CAP_IOCTL
+capability right, the list of allowed
+.BR ioctl (2)
+commands can be selectively reduced (but never expanded) using the
+.I nioctls
+and
+.I ioctls
+arguments.
+The
+.I ioctls
+argument is an array of
+.BR ioctl (2)
+command values and the
+.I nioctls
+argument specifies the number of elements in the array.
+.PP
+If the
+.I nioctls
+argument is -1 or 0, the
+.I ioctls
+argument is ignored, and either all
+.BR ioctl (2)
+operations or no
+.BR ioctl (2)
+operations (respectively) will be allowed.
+.PP
+Capsicum capability rights assigned to a file descriptor can be obtained with the
+.BR cap_rights_get (2)
+system call.
+.SH RETURN VALUE
+.BR cap_rights_limit ()
+returns zero on success. On error, -1 is returned and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EBADF
+.I fd
+isn't a valid open file descriptor.
+.TP
+.B EINVAL
+An invalid set of rights has been requested in
+.IR rights .
+.TP
+.B ENOMEM
+Out of memory.
+.TP
+.B ENOTCAPABLE
+The arguments contain capability rights not present for the given file descriptor
+(the Capsicum capability rights list can only be reduced, never expanded).
+.SH VERSIONS
+Capsicum support was added to the kernel in version 3.???.
+.SH SEE ALSO
+.BR cap_enter (2),
+.BR cap_rights_get (2),
+.BR cap_rights_init (3),
+.BR capsicum (7),
+.BR rights (7)
--
2.0.0.526.g5318336
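
The "reduced (but never expanded)" rule in the page above, together with the nioctls values of -1 (all ioctls) and 0 (none), can be illustrated with a simplified userspace model. Everything here is an assumption made for illustration: struct model_rights and the model_* functions are hypothetical, the primary rights are flattened to a single bitmask, and -1 stands in for the ENOTCAPABLE error. It is not the kernel's cap_rights_contains().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_IOCTLS_ALL (-1)	/* nioctls == -1: every ioctl allowed */

/* Hypothetical stand-in for the rights attached to one descriptor. */
struct model_rights {
	unsigned long primary;		/* bitmask of primary rights */
	unsigned int fcntls;		/* bitmask of allowed fcntl commands */
	int nioctls;			/* MODEL_IOCTLS_ALL, 0, or a count */
	const unsigned int *ioctls;	/* nioctls entries when nioctls > 0 */
};

static bool model_ioctl_allowed(const struct model_rights *r, unsigned int cmd)
{
	if (r->nioctls == MODEL_IOCTLS_ALL)
		return true;
	for (int i = 0; i < r->nioctls; i++)
		if (r->ioctls[i] == cmd)
			return true;
	return false;
}

/* True if `want` asks for nothing that `have` lacks. */
static bool model_rights_contains(const struct model_rights *have,
				  const struct model_rights *want)
{
	if (want->primary & ~have->primary)
		return false;
	if (want->fcntls & ~have->fcntls)
		return false;
	if (want->nioctls == MODEL_IOCTLS_ALL)
		return have->nioctls == MODEL_IOCTLS_ALL;
	for (int i = 0; i < want->nioctls; i++)
		if (!model_ioctl_allowed(have, want->ioctls[i]))
			return false;
	return true;
}

/* Narrow *cur to *want; returns 0, or -1 if that would widen rights. */
static int model_rights_limit(struct model_rights *cur,
			      const struct model_rights *want)
{
	if (!model_rights_contains(cur, want))
		return -1;	/* stands in for ENOTCAPABLE */
	*cur = *want;
	return 0;
}
```

In this model a descriptor that has been limited to a two-entry ioctl list can later be narrowed to one of those entries, but asking for MODEL_IOCTLS_ALL again fails and leaves the stored rights unchanged.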



* [PATCH 5/5] man-pages: cap_rights_get: retrieve Capsicum fd rights
  2014-06-30 10:28 [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) David Drysdale
                   ` (14 preceding siblings ...)
  2014-06-30 10:28 ` [PATCH 4/5] man-pages: cap_rights_limit.2: limit FD rights for Capsicum David Drysdale
@ 2014-06-30 10:28 ` David Drysdale
  2014-06-30 22:28     ` Andy Lutomirski
  2014-07-03  9:12   ` [Qemu-devel] " Paolo Bonzini
  16 siblings, 1 reply; 87+ messages in thread
From: David Drysdale @ 2014-06-30 10:28 UTC (permalink / raw)
  To: linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, David Drysdale

Signed-off-by: David Drysdale <drysdale@google.com>
---
 man2/cap_rights_get.2 | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 126 insertions(+)
 create mode 100644 man2/cap_rights_get.2

diff --git a/man2/cap_rights_get.2 b/man2/cap_rights_get.2
new file mode 100644
index 000000000000..966c0ed7e336
--- /dev/null
+++ b/man2/cap_rights_get.2
@@ -0,0 +1,126 @@
+.\"
+.\" Copyright (c) 2008-2010 Robert N. M. Watson
+.\" Copyright (c) 2012-2013 The FreeBSD Foundation
+.\" Copyright (c) 2013-2014 Google, Inc.
+.\" All rights reserved.
+.\"
+.\" %%%LICENSE_START(BSD_2_CLAUSE)
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\" %%%LICENSE_END
+.\"
+.TH CAP_RIGHTS_GET 2 2014-05-07 "Linux" "Linux Programmer's Manual"
+.SH NAME
+cap_rights_get \- retrieve Capsicum capability rights
+.SH SYNOPSIS
+.nf
+.B #include <sys/capsicum.h>
+.sp
+.BI "int cap_rights_get(int " fd ", struct cap_rights *" rights ,
+.BI "                   unsigned int *" fcntls ,
+.BI "                   int *" nioctls ", unsigned int *" ioctls );
+.SH DESCRIPTION
+Obtain the current Capsicum capability rights for a file descriptor.
+.PP
+The function will fill the
+.I rights
+argument (if non-NULL) with the primary capability rights of the
+.I fd
+descriptor.  The result can be examined with the
+.BR cap_rights_is_set (3)
+family of functions.  The complete list of primary rights can be found in the
+.BR rights (7)
+manual page.
+.PP
+If the
+.I fcntls
+argument is non-NULL, it will be filled in with a bitmask of allowed
+.BR fcntl (2)
+commands; see
+.BR cap_rights_limit (2)
+for values.  If the file descriptor does not have the
+.B CAP_FCNTL
+primary right, the returned
+.I fcntls
+value will be zero.
+.PP
+If the
+.I nioctls
+argument is non-NULL, it will be filled in with the number of allowed
+.BR ioctl (2)
+commands, or with the value CAP_IOCTLS_ALL to indicate that all
+.BR ioctl (2)
+commands are allowed.  If the file descriptor does not have the
+.B CAP_IOCTL
+primary right, the returned
+.I nioctls
+value will be zero.
+.PP
+The
+.I ioctls
+argument (if non-NULL) should point at memory that can hold up to
+.I nioctls
+values.
+The system call populates the provided buffer with up to
+.I nioctls
+elements, but always returns the total number of
+.BR ioctl (2)
+commands allowed for the given file descriptor in
+.I nioctls
+as described above.
+.PP
+If all
+.BR ioctl (2)
+commands are allowed (the
+.B CAP_IOCTL
+primary capability right is assigned to the file descriptor and the
+set of allowed
+.BR ioctl (2)
+commands was never limited for this file descriptor), the
+system call will not modify the buffer pointed to by the
+.I ioctls
+argument.
+.PP
+Capsicum capability rights assigned to a file descriptor can be reduced with the
+.BR cap_rights_limit (2)
+system call.
+.SH RETURN VALUE
+.BR cap_rights_get ()
+returns zero on success. On error, -1 is returned and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EBADF
+.I fd
+isn't a valid open file descriptor.
+.TP
+.B EFAULT
+Invalid pointer argument.
+.SH VERSIONS
+Capsicum support was added to the kernel in version 3.???.
+.SH SEE ALSO
+.BR cap_enter (2),
+.BR cap_rights_limit (2),
+.BR cap_rights_init (3),
+.BR capsicum (7),
+.BR rights (7)
+
--
2.0.0.526.g5318336
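
The in/out behaviour of the nioctls argument described above (the caller passes the buffer capacity, the call copies at most that many entries but always reports the total) can be modelled in userspace as follows. The function model_get_ioctls is hypothetical and only mirrors the documented semantics; it is not the syscall itself, and -1 stands in for the "all ioctls allowed" value.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical model of the ioctl-list retrieval:
 *   stored, stored_n : the list attached to the descriptor
 *                      (stored_n == -1 means all ioctls allowed)
 *   buf              : caller's buffer, may be NULL
 *   nioctls          : in: capacity of buf; out: total count or -1
 */
static void model_get_ioctls(const unsigned int *stored, int stored_n,
			     unsigned int *buf, int *nioctls)
{
	int capacity = *nioctls;
	int to_copy = stored_n < capacity ? stored_n : capacity;

	*nioctls = stored_n;		/* always report the full total */
	if (buf && to_copy > 0)
		memcpy(buf, stored, (size_t)to_copy * sizeof(*buf));
}
```

A caller that gets back a total larger than the capacity it passed can allocate a bigger buffer and call again. When the total is -1 the buffer is left untouched, as the page states.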



* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
  2014-06-30 10:28 ` [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2) David Drysdale
@ 2014-06-30 14:49   ` Andy Lutomirski
  2014-06-30 15:49     ` David Drysdale
  2014-06-30 20:40   ` Andi Kleen
  2014-07-08 12:03     ` Christoph Hellwig
  2 siblings, 1 reply; 87+ messages in thread
From: Andy Lutomirski @ 2014-06-30 14:49 UTC (permalink / raw)
  To: David Drysdale
  Cc: Al Viro, LSM List, Greg Kroah-Hartman, James Morris, Kees Cook,
	Linux API, Meredydd Luff, linux-kernel

On Jun 30, 2014 3:36 AM, "David Drysdale" <drysdale@google.com> wrote:
>
> Add a new O_BENEATH_ONLY flag for openat(2) which restricts the
> provided path, rejecting (with -EACCES) paths that are not beneath
> the provided dfd.  In particular, reject:
>  - paths that contain .. components
>  - paths that begin with /
>  - symlinks that have paths as above.
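
For illustration only, the purely lexical part of the rules quoted above can be sketched in userspace.  The helper name demo_path_is_beneath() is hypothetical and not part of the patch, and treating an empty path as invalid is my own assumption.  A string check like this cannot see symlinks, which is presumably why the real check has to live in the kernel's path walk.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical userspace helper (not from the patch): lexical check only.
 * Returns true if `path` is relative and contains no ".." component.
 * Symlinks are NOT examined here; only the kernel-side flag can cover them.
 */
static bool demo_path_is_beneath(const char *path)
{
	const char *p = path;

	/* Assumption: an empty path is rejected along with absolute paths. */
	if (*p == '\0' || *p == '/')
		return false;

	while (*p != '\0') {
		size_t len = strcspn(p, "/");	/* length of this component */

		if (len == 2 && p[0] == '.' && p[1] == '.')
			return false;		/* ".." component */
		p += len;
		while (*p == '/')		/* skip separator(s) */
			p++;
	}
	return true;
}
```

Note that a component such as "..b" is accepted, since only an exact ".." walks upwards.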

I like this a lot.  However, I think I'd like it even better if it
were AT_BENEATH_ONLY so that it could be added to the rest of the *at
family.

--Andy

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 4/5] man-pages: cap_rights_limit.2: limit FD rights for Capsicum
@ 2014-06-30 14:53     ` Andy Lutomirski
  0 siblings, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2014-06-30 14:53 UTC (permalink / raw)
  To: David Drysdale
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API

On Mon, Jun 30, 2014 at 3:28 AM, David Drysdale <drysdale@google.com> wrote:
> Signed-off-by: David Drysdale <drysdale@google.com>
> ---
>  man2/cap_rights_limit.2 | 171 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 171 insertions(+)
>  create mode 100644 man2/cap_rights_limit.2
>
> diff --git a/man2/cap_rights_limit.2 b/man2/cap_rights_limit.2
> new file mode 100644
> index 000000000000..3484ee1076aa
> --- /dev/null
> +++ b/man2/cap_rights_limit.2
> @@ -0,0 +1,171 @@
> +.\"
> +.\" Copyright (c) 2008-2010 Robert N. M. Watson
> +.\" Copyright (c) 2012-2013 The FreeBSD Foundation
> +.\" Copyright (c) 2013-2014 Google, Inc.
> +.\" All rights reserved.
> +.\"
> +.\" %%%LICENSE_START(BSD_2_CLAUSE)
> +.\" Redistribution and use in source and binary forms, with or without
> +.\" modification, are permitted provided that the following conditions
> +.\" are met:
> +.\" 1. Redistributions of source code must retain the above copyright
> +.\"    notice, this list of conditions and the following disclaimer.
> +.\" 2. Redistributions in binary form must reproduce the above copyright
> +.\"    notice, this list of conditions and the following disclaimer in the
> +.\"    documentation and/or other materials provided with the distribution.
> +.\"
> +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
> +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> +.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
> +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> +.\" SUCH DAMAGE.
> +.\" %%%LICENSE_END
> +.\"
> +.TH CAP_RIGHTS_LIMIT 2 2014-05-07 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +cap_rights_limit \- limit Capsicum capability rights
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/capsicum.h>
> +.sp
> +.BI "int cap_rights_limit(int " fd ", const struct cap_rights *" rights ,
> +.BI "                     unsigned int " fcntls ,
> +.BI "                     int " nioctls ", unsigned int *" ioctls );

Am I missing the docs for struct cap_rights somewhere?

> +.SH DESCRIPTION
> +When a file descriptor is created by a function such as
> +.BR accept (2),
> +.BR accept4 (2),
> +.BR creat (2),
> +.BR epoll_create (2),
> +.BR eventfd (2),
> +.BR mq_open (2),
> +.BR open (2),
> +.BR openat (2),
> +.BR pdfork (2),
> +.BR pipe (2),
> +.BR pipe2 (2),
> +.BR signalfd (2),
> +.BR socket (2),
> +.BR socketpair (2)
> +or
> +.BR timerfd_create (2),
> +it implicitly has all Capsicum capability rights.
> +Those rights can be reduced (but never expanded) by using the
> +.BR cap_rights_limit ()
> +system call.
> +Once Capsicum capability rights are reduced, operations on the file descriptor
> +.I fd
> +will be limited to those permitted by the remainder of the arguments.
> +.PP
> +The
> +.I rights
> +argument describes the primary rights for the file descriptor, and
> +should be prepared using the
> +.BR cap_rights_init (3)
> +family of functions.  The complete list of primary rights can be found in the
> +.BR rights (7)
> +manual page.
> +.PP
> +If a file descriptor is granted the
> +.B CAP_FCNTL
> +primary capability right, the list of allowed
> +.BR fcntl (2)
> +commands can be selectively reduced (but never expanded) with the
> +.I fcntls
> +argument.  The following flags may be specified in the
> +.I fcntls
> +argument:
> +.TP
> +.B CAP_FCNTL_GETFL
> +Permit
> +.B F_GETFL
> +command.
> +.TP
> +.B CAP_FCNTL_SETFL
> +Permit
> +.B F_SETFL
> +command.
> +.TP
> +.B CAP_FCNTL_GETOWN
> +Permit
> +.B F_GETOWN
> +command.
> +.TP
> +.B CAP_FCNTL_SETOWN
> +Permit
> +.B F_SETOWN
> +command.
> +.PP
> +A value of
> +.B CAP_FCNTL_ALL
> +for the
> +.I fcntls
> +argument leaves the set of allowed
> +.BR fcntl (2)
> +commands unchanged.

What about the locking fcntl operations?  (Arguably the old crappy
POSIX lock operations should be flat-out disallowed on capability fds,
but I see nothing wrong with selectively allowing the new open file
description locks.)

--Andy

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 4/5] man-pages: cap_rights_limit.2: limit FD rights for Capsicum
@ 2014-06-30 15:35       ` David Drysdale
  0 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-06-30 15:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API

On Mon, Jun 30, 2014 at 07:53:57AM -0700, Andy Lutomirski wrote:
> On Mon, Jun 30, 2014 at 3:28 AM, David Drysdale <drysdale@google.com> wrote:
> > Signed-off-by: David Drysdale <drysdale@google.com>
> > ---
> >  man2/cap_rights_limit.2 | 171 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 171 insertions(+)
> >  create mode 100644 man2/cap_rights_limit.2
> >
> > diff --git a/man2/cap_rights_limit.2 b/man2/cap_rights_limit.2
> > new file mode 100644
> > index 000000000000..3484ee1076aa
> > --- /dev/null
> > +++ b/man2/cap_rights_limit.2
> > @@ -0,0 +1,171 @@
> > +.\"
> > +.\" Copyright (c) 2008-2010 Robert N. M. Watson
> > +.\" Copyright (c) 2012-2013 The FreeBSD Foundation
> > +.\" Copyright (c) 2013-2014 Google, Inc.
> > +.\" All rights reserved.
> > +.\"
> > +.\" %%%LICENSE_START(BSD_2_CLAUSE)
> > +.\" Redistribution and use in source and binary forms, with or without
> > +.\" modification, are permitted provided that the following conditions
> > +.\" are met:
> > +.\" 1. Redistributions of source code must retain the above copyright
> > +.\"    notice, this list of conditions and the following disclaimer.
> > +.\" 2. Redistributions in binary form must reproduce the above copyright
> > +.\"    notice, this list of conditions and the following disclaimer in the
> > +.\"    documentation and/or other materials provided with the distribution.
> > +.\"
> > +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
> > +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> > +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> > +.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
> > +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> > +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> > +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> > +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> > +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> > +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> > +.\" SUCH DAMAGE.
> > +.\" %%%LICENSE_END
> > +.\"
> > +.TH CAP_RIGHTS_LIMIT 2 2014-05-07 "Linux" "Linux Programmer's Manual"
> > +.SH NAME
> > +cap_rights_limit \- limit Capsicum capability rights
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include <sys/capsicum.h>
> > +.sp
> > +.BI "int cap_rights_limit(int " fd ", const struct cap_rights *" rights ,
> > +.BI "                     unsigned int " fcntls ,
> > +.BI "                     int " nioctls ", unsigned int *" ioctls );
> 
> Am I missing the docs for struct cap_rights somewhere?

There's a little bit of discussion in rights.7 (mail 3/5 of the
man-pages set), but there isn't a structure description.

I was trying to keep the structure opaque to userspace, which would
be expected to manipulate the rights with various utility functions
rather than directly.

But I now realize this leaves a gap -- the description of this syscall
doesn't include a full description of its ABI.

So I'll add in a description of the structure to this page -- basically:

  struct cap_rights {
  	__u64	cr_rights[2];
  };

with a slightly complicated scheme to encode rights into the bitmask
array.  (The encoding scheme is taken from the FreeBSD implementation,
which I've tried to stick to unless there's good reason to change.)
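
As a rough sketch of what such an encoding can look like (written from memory of the FreeBSD headers, so treat every constant here as an assumption rather than the final ABI): each right is a single 64-bit value whose high bits carry a one-hot marker selecting the array element, and whose low bits carry the right itself.  All demo_/DEMO_ names are hypothetical, and DEMO_CAP_EXTRA is a made-up right that exists only to exercise the second element.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; layout modelled on the FreeBSD scheme. */
struct demo_cap_rights {
	uint64_t cr_rights[2];
};

/* A right = one-hot element marker in bits 57..61, plus a bit in bits 0..56. */
#define DEMO_CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))
#define DEMO_CAP_READ	DEMO_CAPRIGHT(0, 0x1ULL)
#define DEMO_CAP_WRITE	DEMO_CAPRIGHT(0, 0x2ULL)
#define DEMO_CAP_EXTRA	DEMO_CAPRIGHT(1, 0x1ULL)	/* made-up right */

/* Recover the array element a right belongs to from its one-hot marker. */
static int demo_right_index(uint64_t right)
{
	uint64_t marker = (right >> 57) & 0x1f;
	int idx = 0;

	while (marker > 1) {
		marker >>= 1;
		idx++;
	}
	return idx;
}

/* Start with no rights: each element holds only its own marker bit. */
static void demo_rights_init(struct demo_cap_rights *r)
{
	for (int i = 0; i < 2; i++)
		r->cr_rights[i] = 1ULL << (57 + i);
}

static void demo_rights_set(struct demo_cap_rights *r, uint64_t right)
{
	r->cr_rights[demo_right_index(right)] |= right;
}

static bool demo_rights_is_set(const struct demo_cap_rights *r, uint64_t right)
{
	return (r->cr_rights[demo_right_index(right)] & right) == right;
}

/* True if every right in `little` is also present in `big`. */
static bool demo_rights_contains(const struct demo_cap_rights *big,
				 const struct demo_cap_rights *little)
{
	for (int i = 0; i < 2; i++)
		if ((big->cr_rights[i] & little->cr_rights[i]) !=
		    little->cr_rights[i])
			return false;
	return true;
}
```

The subset test is the operation that "reduced but never expanded" comes down to: a new rights set would only be accepted if the existing one contains it.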
 
> > +.SH DESCRIPTION
> > +When a file descriptor is created by a function such as
> > +.BR accept (2),
> > +.BR accept4 (2),
> > +.BR creat (2),
> > +.BR epoll_create (2),
> > +.BR eventfd (2),
> > +.BR mq_open (2),
> > +.BR open (2),
> > +.BR openat (2),
> > +.BR pdfork (2),
> > +.BR pipe (2),
> > +.BR pipe2 (2),
> > +.BR signalfd (2),
> > +.BR socket (2),
> > +.BR socketpair (2)
> > +or
> > +.BR timerfd_create (2),
> > +it implicitly has all Capsicum capability rights.
> > +Those rights can be reduced (but never expanded) by using the
> > +.BR cap_rights_limit ()
> > +system call.
> > +Once Capsicum capability rights are reduced, operations on the file descriptor
> > +.I fd
> > +will be limited to those permitted by the remainder of the arguments.
> > +.PP
> > +The
> > +.I rights
> > +argument describes the primary rights for the file descriptor, and
> > +should be prepared using the
> > +.BR cap_rights_init (3)
> > +family of functions.  The complete list of primary rights can be found in the
> > +.BR rights (7)
> > +manual page.
> > +.PP
> > +If a file descriptor is granted the
> > +.B CAP_FCNTL
> > +primary capability right, the list of allowed
> > +.BR fcntl (2)
> > +commands can be selectively reduced (but never expanded) with the
> > +.I fcntls
> > +argument.  The following flags may be specified in the
> > +.I fcntls
> > +argument:
> > +.TP
> > +.B CAP_FCNTL_GETFL
> > +Permit
> > +.B F_GETFL
> > +command.
> > +.TP
> > +.B CAP_FCNTL_SETFL
> > +Permit
> > +.B F_SETFL
> > +command.
> > +.TP
> > +.B CAP_FCNTL_GETOWN
> > +Permit
> > +.B F_GETOWN
> > +command.
> > +.TP
> > +.B CAP_FCNTL_SETOWN
> > +Permit
> > +.B F_SETOWN
> > +command.
> > +.PP
> > +A value of
> > +.B CAP_FCNTL_ALL
> > +for the
> > +.I fcntls
> > +argument leaves the set of allowed
> > +.BR fcntl (2)
> > +commands unchanged.
> 
> What about the locking fcntl operations?  (Arguably the old crappy
> POSIX lock operations should be flat-out disallowed on capability fds,
> but I see nothing wrong with selectively allowing the new open file
> description locks.)
> 
> --Andy

The locking operations are policed against a separate CAP_FLOCK right,
consistent with flock(2).  I'll try to improve the wording -- there
are actually a few fcntl operations that are covered by different
rights because they're analogous to other functionality (e.g.
F_GETPIPE_SZ/F_SETPIPE_SZ need CAP_GETSOCKOPT/CAP_SETSOCKOPT).
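
To make that split concrete, here is a small sketch of the command-to-right mapping as described in this mail.  The enum and the function demo_right_for_fcntl() are hypothetical, and only the commands named in this thread are mapped; which right covers any other fcntl command is not stated here, so everything else falls through to "unknown".

```c
#define _GNU_SOURCE		/* for F_GETPIPE_SZ / F_SETPIPE_SZ on Linux */
#include <fcntl.h>

/* Hypothetical stand-ins for the primary rights mentioned in the thread. */
enum demo_right {
	DEMO_CAP_FCNTL,		/* further narrowed by the fcntls sub-rights */
	DEMO_CAP_FLOCK,
	DEMO_CAP_GETSOCKOPT,
	DEMO_CAP_SETSOCKOPT,
	DEMO_CAP_UNKNOWN	/* not discussed in the thread */
};

/* Which primary right polices a given fcntl(2) command, per the text above. */
static enum demo_right demo_right_for_fcntl(int cmd)
{
	switch (cmd) {
	case F_GETFL:
	case F_SETFL:
	case F_GETOWN:
	case F_SETOWN:
		return DEMO_CAP_FCNTL;
	case F_GETLK:
	case F_SETLK:
	case F_SETLKW:
		return DEMO_CAP_FLOCK;
#ifdef F_GETPIPE_SZ
	case F_GETPIPE_SZ:
		return DEMO_CAP_GETSOCKOPT;
	case F_SETPIPE_SZ:
		return DEMO_CAP_SETSOCKOPT;
#endif
	default:
		return DEMO_CAP_UNKNOWN;
	}
}
```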

 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
  2014-06-30 14:49   ` Andy Lutomirski
@ 2014-06-30 15:49     ` David Drysdale
  2014-06-30 15:53       ` Andy Lutomirski
  0 siblings, 1 reply; 87+ messages in thread
From: David Drysdale @ 2014-06-30 15:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, LSM List, Greg Kroah-Hartman, James Morris, Kees Cook,
	Linux API, Meredydd Luff, linux-kernel

On Mon, Jun 30, 2014 at 07:49:41AM -0700, Andy Lutomirski wrote:
> On Jun 30, 2014 3:36 AM, "David Drysdale" <drysdale@google.com> wrote:
> >
> > Add a new O_BENEATH_ONLY flag for openat(2) which restricts the
> > provided path, rejecting (with -EACCES) paths that are not beneath
> > the provided dfd.  In particular, reject:
> >  - paths that contain .. components
> >  - paths that begin with /
> >  - symlinks that have paths as above.
> 
> I like this a lot.  However, I think I'd like it even better if it
> were AT_BENEATH_ONLY so that it could be added to the rest of the *at
> family.
> 
> --Andy

Wouldn't it need to be both O_BENEATH_ONLY (for openat()) and 
AT_BENEATH_ONLY (for other *at() functions), like O_NOFOLLOW and
AT_SYMLINK_NOFOLLOW?  (I.e. aren't the AT_* flags in a different
numbering space than O_* flags?)

Or am I misunderstanding?


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
  2014-06-30 15:49     ` David Drysdale
@ 2014-06-30 15:53       ` Andy Lutomirski
  2014-07-08 12:07           ` Christoph Hellwig
  0 siblings, 1 reply; 87+ messages in thread
From: Andy Lutomirski @ 2014-06-30 15:53 UTC (permalink / raw)
  To: David Drysdale
  Cc: Al Viro, LSM List, Greg Kroah-Hartman, James Morris, Kees Cook,
	Linux API, Meredydd Luff, linux-kernel

On Mon, Jun 30, 2014 at 8:49 AM, David Drysdale <drysdale@google.com> wrote:
> On Mon, Jun 30, 2014 at 07:49:41AM -0700, Andy Lutomirski wrote:
>> On Jun 30, 2014 3:36 AM, "David Drysdale" <drysdale@google.com> wrote:
>> >
>> > Add a new O_BENEATH_ONLY flag for openat(2) which restricts the
>> > provided path, rejecting (with -EACCES) paths that are not beneath
>> > the provided dfd.  In particular, reject:
>> >  - paths that contain .. components
>> >  - paths that begin with /
>> >  - symlinks that have paths as above.
>>
>> I like this a lot.  However, I think I'd like it even better if it
>> were AT_BENEATH_ONLY so that it could be added to the rest of the *at
>> family.
>>
>> --Andy
>
> Wouldn't it need to be both O_BENEATH_ONLY (for openat()) and
> AT_BENEATH_ONLY (for other *at() functions), like O_NOFOLLOW and
> AT_SYMLINK_NOFOLLOW?  (I.e. aren't the AT_* flags in a different
> numbering space than O_* flags?)
>
> Or am I misunderstanding?
>

Ugh, you're probably right.  I wish openat had separate flags and
atflags arguments.  Oh well.
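
For reference, this is how the existing pair already looks from userspace: the same "don't follow the final symlink" request is spelled with an O_* flag for openat() and with an AT_* flag for fstatat().  The two wrapper names are hypothetical; the flags and syscalls are the standard POSIX.1-2008 ones.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* O_* flag space: openat() takes O_NOFOLLOW in its open flags. */
static int demo_open_nofollow(int dfd, const char *path)
{
	return openat(dfd, path, O_RDONLY | O_NOFOLLOW);
}

/* AT_* flag space: fstatat() takes AT_SYMLINK_NOFOLLOW in its flags. */
static int demo_stat_nofollow(int dfd, const char *path, struct stat *st)
{
	return fstatat(dfd, path, st, AT_SYMLINK_NOFOLLOW);
}
```

Because openat() has only the one flags argument, a "beneath" restriction would have to be allocated separately in each space, as with the pair above.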

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 09/11] capsicum: implementations of new LSM hooks
@ 2014-06-30 16:05     ` Andy Lutomirski
  0 siblings, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2014-06-30 16:05 UTC (permalink / raw)
  To: David Drysdale
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API

On Mon, Jun 30, 2014 at 3:28 AM, David Drysdale <drysdale@google.com> wrote:
> If the LSM does not provide implementations of the .file_lookup and
> .file_install LSM hooks, always use the Capsicum implementations.
>
> The Capsicum implementation of file_lookup checks for a Capsicum
> capability wrapper file and unwraps it if the appropriate rights
> are available.
>
> The Capsicum implementation of file_install checks whether the file
> has restricted rights associated with it.  If it does, it is replaced
> with a Capsicum capability wrapper file before installation into the
> fdtable.

I think I fall on the "no LSM" side of the fence.  This kind of stuff
should be available regardless of selected LSM (as it is in your
code), but until someone has a use case for the LSM hooks in real
LSMs, I don't really see the point.
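
To make the quoted wrap/unwrap behaviour concrete, here is a toy userspace model of the two steps, independent of whether they end up as LSM hooks.  Everything in it is hypothetical: struct demo_file is not struct file, the rights are a plain bitmask, and DEMO_ENOTCAPABLE is a placeholder error value chosen for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_ENOTCAPABLE 1	/* placeholder error code for this sketch */

struct demo_file {
	const char *name;
	struct demo_file *underlying;	/* non-NULL only for a wrapper */
	uint64_t rights;		/* meaningful only for a wrapper */
};

/*
 * file_install analogue: if the file carries restricted rights, install a
 * wrapper around it instead of the file itself.  Returns what would go
 * into the fd table.
 */
static struct demo_file *demo_file_install(struct demo_file *f, int restricted,
					   uint64_t rights,
					   struct demo_file *wrapper)
{
	if (!restricted)
		return f;
	wrapper->name = "capability wrapper";
	wrapper->underlying = f;
	wrapper->rights = rights;
	return wrapper;
}

/*
 * file_lookup analogue: unwrap a wrapper only if it holds every required
 * right.  Returns the usable file, or NULL with *err set.
 */
static struct demo_file *demo_file_lookup(struct demo_file *f,
					  uint64_t required, int *err)
{
	*err = 0;
	if (f->underlying == NULL)
		return f;	/* ordinary file: implicitly has all rights */
	if ((f->rights & required) != required) {
		*err = DEMO_ENOTCAPABLE;
		return NULL;
	}
	return f->underlying;
}
```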

--Andy

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 4/5] man-pages: cap_rights_limit.2: limit FD rights for Capsicum
@ 2014-06-30 16:06         ` Andy Lutomirski
  0 siblings, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2014-06-30 16:06 UTC (permalink / raw)
  To: David Drysdale
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API

On Mon, Jun 30, 2014 at 8:35 AM, David Drysdale <drysdale@google.com> wrote:
> On Mon, Jun 30, 2014 at 07:53:57AM -0700, Andy Lutomirski wrote:
>> On Mon, Jun 30, 2014 at 3:28 AM, David Drysdale <drysdale@google.com> wrote:
>> > Signed-off-by: David Drysdale <drysdale@google.com>
>> > ---
>> >  man2/cap_rights_limit.2 | 171 ++++++++++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 171 insertions(+)
>> >  create mode 100644 man2/cap_rights_limit.2
>> >
>> > diff --git a/man2/cap_rights_limit.2 b/man2/cap_rights_limit.2
>> > new file mode 100644
>> > index 000000000000..3484ee1076aa
>> > --- /dev/null
>> > +++ b/man2/cap_rights_limit.2
>> > @@ -0,0 +1,171 @@
>> > +.\"
>> > +.\" Copyright (c) 2008-2010 Robert N. M. Watson
>> > +.\" Copyright (c) 2012-2013 The FreeBSD Foundation
>> > +.\" Copyright (c) 2013-2014 Google, Inc.
>> > +.\" All rights reserved.
>> > +.\"
>> > +.\" %%%LICENSE_START(BSD_2_CLAUSE)
>> > +.\" Redistribution and use in source and binary forms, with or without
>> > +.\" modification, are permitted provided that the following conditions
>> > +.\" are met:
>> > +.\" 1. Redistributions of source code must retain the above copyright
>> > +.\"    notice, this list of conditions and the following disclaimer.
>> > +.\" 2. Redistributions in binary form must reproduce the above copyright
>> > +.\"    notice, this list of conditions and the following disclaimer in the
>> > +.\"    documentation and/or other materials provided with the distribution.
>> > +.\"
>> > +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
>> > +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>> > +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>> > +.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
>> > +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>> > +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>> > +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>> > +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>> > +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>> > +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>> > +.\" SUCH DAMAGE.
>> > +.\" %%%LICENSE_END
>> > +.\"
>> > +.TH CAP_RIGHTS_LIMIT 2 2014-05-07 "Linux" "Linux Programmer's Manual"
>> > +.SH NAME
>> > +cap_rights_limit \- limit Capsicum capability rights
>> > +.SH SYNOPSIS
>> > +.nf
>> > +.B #include <sys/capsicum.h>
>> > +.sp
>> > +.BI "int cap_rights_limit(int " fd ", const struct cap_rights *" rights ,
>> > +.BI "                     unsigned int " fcntls ,
>> > +.BI "                     int " nioctls ", unsigned int *" ioctls );
>>
>> Am I missing the docs for struct cap_rights somewhere?
>
> There's a little bit of discussion in rights.7 (mail 3/5 of the
> man-pages set), but there isn't a structure description.
>
> I was trying to keep the structure opaque to userspace, which would
> be expected to manipulate the rights with various utility functions
> rather than directly.
>
> But I now realize this leaves a gap -- the description of this syscall
> doesn't include a full description of its ABI.
>
> So I'll add in a description of the structure to this page -- basically:
>
>   struct cap_rights {
>         __u64   cr_rights[2];
>   };
>
> with a slightly complicated scheme to encode rights into the bitmask
> array.  (The encoding scheme is taken from the FreeBSD implementation,
> which I've tried to stick to unless there's good reason to change.)

How does extensibility work?  For example, what happens when someone
needs to add a new right for whatever reason and they fall off the end
of the list?

Linux so-called capabilities have done this a few times, resulting in
a giant mess.

--Andy

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 4/5] man-pages: cap_rights_limit.2: limit FD rights for Capsicum
  2014-06-30 16:06         ` Andy Lutomirski
  (?)
@ 2014-06-30 16:32         ` David Drysdale
  -1 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-06-30 16:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API

On Mon, Jun 30, 2014 at 09:06:41AM -0700, Andy Lutomirski wrote:
> On Mon, Jun 30, 2014 at 8:35 AM, David Drysdale <drysdale@google.com> wrote:
> > On Mon, Jun 30, 2014 at 07:53:57AM -0700, Andy Lutomirski wrote:
> >> On Mon, Jun 30, 2014 at 3:28 AM, David Drysdale <drysdale@google.com> wrote:
> >> > Signed-off-by: David Drysdale <drysdale@google.com>
> >> > ---
> >> >  man2/cap_rights_limit.2 | 171 ++++++++++++++++++++++++++++++++++++++++++++++++
> >> >  1 file changed, 171 insertions(+)
> >> >  create mode 100644 man2/cap_rights_limit.2
> >> >
> >> > diff --git a/man2/cap_rights_limit.2 b/man2/cap_rights_limit.2
> >> > new file mode 100644
> >> > index 000000000000..3484ee1076aa
> >> > --- /dev/null
> >> > +++ b/man2/cap_rights_limit.2
> >> > @@ -0,0 +1,171 @@
> >> > +.\"
> >> > +.\" Copyright (c) 2008-2010 Robert N. M. Watson
> >> > +.\" Copyright (c) 2012-2013 The FreeBSD Foundation
> >> > +.\" Copyright (c) 2013-2014 Google, Inc.
> >> > +.\" All rights reserved.
> >> > +.\"
> >> > +.\" %%%LICENSE_START(BSD_2_CLAUSE)
> >> > +.\" Redistribution and use in source and binary forms, with or without
> >> > +.\" modification, are permitted provided that the following conditions
> >> > +.\" are met:
> >> > +.\" 1. Redistributions of source code must retain the above copyright
> >> > +.\"    notice, this list of conditions and the following disclaimer.
> >> > +.\" 2. Redistributions in binary form must reproduce the above copyright
> >> > +.\"    notice, this list of conditions and the following disclaimer in the
> >> > +.\"    documentation and/or other materials provided with the distribution.
> >> > +.\"
> >> > +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
> >> > +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> >> > +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> >> > +.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
> >> > +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> >> > +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> >> > +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> >> > +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> >> > +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> >> > +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> >> > +.\" SUCH DAMAGE.
> >> > +.\" %%%LICENSE_END
> >> > +.\"
> >> > +.TH CAP_RIGHTS_LIMIT 2 2014-05-07 "Linux" "Linux Programmer's Manual"
> >> > +.SH NAME
> >> > +cap_rights_limit \- limit Capsicum capability rights
> >> > +.SH SYNOPSIS
> >> > +.nf
> >> > +.B #include <sys/capsicum.h>
> >> > +.sp
> >> > +.BI "int cap_rights_limit(int " fd ", const struct cap_rights *" rights ,
> >> > +.BI "                     unsigned int " fcntls ,
> >> > +.BI "                     int " nioctls ", unsigned int *" ioctls );
> >>
> >> Am I missing the docs for struct cap_rights somewhere?
> >
> > There's a little bit of discussion in rights.7 (mail 3/5 of the
> > man-pages set), but there isn't a structure description.
> >
> > I was trying to keep the structure opaque to userspace, which would
> > be expected to manipulate the rights with various utility functions
> > rather than directly.
> >
> > But I now realize this leaves a gap -- the description of this syscall
> > doesn't include a full description of its ABI.
> >
> > So I'll add in a description of the structure to this page -- basically:
> >
> >   struct cap_rights {
> >         __u64   cr_rights[2];
> >   };
> >
> > with a slightly complicated scheme to encode rights into the bitmask
> > array.  (The encoding scheme is taken from the FreeBSD implementation,
> > which I've tried to stick to unless there's good reason to change.)
> 
> How does extensibility work?  For example, what happens when someone
> needs to add a new right for whatever reason and they fall off the end
> of the list?
> 
> Linux so-called capabilities have done this a few times, resulting in
> a giant mess.
> 
> --Andy

The rights encoding scheme is supposed to cope with extensions, so let me
have a go at explaining it.

The size of the array in the structure can potentially change in future,
so a less abbreviated version is:

  #define CAP_RIGHTS_VERSION_00   0
  #define CAP_RIGHTS_VERSION_01   1
  #define CAP_RIGHTS_VERSION_02   2
  #define CAP_RIGHTS_VERSION_03   3
  #define CAP_RIGHTS_VERSION      CAP_RIGHTS_VERSION_00
  struct cap_rights {
      uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
  };

The encoding rules are then:
 - There are between 2 and 5 entries in the array.
 - The number of entries in the array is indicated by the top 2 bits of
   cr_rights[0] (as array size minus 2); this allows for future
   expansion (up to 285 distinct rights):
     0b00 = 2 entries
     0b01 = 3 entries
     0b10 = 4 entries
     0b11 = 5 entries
 - The top 2 bits of cr_rights[i] are 0b00 for i>0.
 - The next 5 bits of each array entry indicate its position in the
   array:
     0b00001 for cr_rights[0]
     0b00010 for cr_rights[1]
     0b00100 for cr_rights[2]
     0b01000 for cr_rights[3]
     0b10000 for cr_rights[4]
 - The remaining 57 bits of each entry are used to hold rights values,
   so the current structure can hold 114 rights, and the maximum is
   285.
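
As a rough sketch of the layout described above (all names here are
made up for illustration, not taken from the patchset or from the
FreeBSD headers), the header fields of each entry could be built and
picked apart like this:

```c
#include <stdint.h>

/* Hypothetical names; the bit positions follow the rules listed above. */
#define RIGHTS_VERSION_SHIFT 62              /* top 2 bits: array size minus 2 */
#define RIGHTS_INDEX_SHIFT   57              /* next 5 bits: one-hot array position */
#define RIGHTS_INDEX_MASK    UINT64_C(0x1f)
#define RIGHTS_VALUE_BITS    57              /* remaining bits hold rights values */
#define RIGHTS_VALUE_MASK    ((UINT64_C(1) << RIGHTS_VALUE_BITS) - 1)

/* Number of array entries announced by cr_rights[0]: between 2 and 5. */
static inline unsigned rights_array_size(uint64_t first_word)
{
    return (unsigned)(first_word >> RIGHTS_VERSION_SHIFT) + 2;
}

/* Build entry i of an array with nentries entries, carrying the given rights bits. */
static inline uint64_t rights_make_entry(unsigned i, unsigned nentries, uint64_t bits)
{
    /* Only entry 0 carries the size; the top 2 bits are 0b00 for i > 0. */
    uint64_t version = (i == 0) ? (uint64_t)(nentries - 2) : 0;

    return (version << RIGHTS_VERSION_SHIFT)
         | (UINT64_C(1) << (RIGHTS_INDEX_SHIFT + i))
         | (bits & RIGHTS_VALUE_MASK);
}

/* Array position encoded in an entry, or -1 if the one-hot field is malformed. */
static inline int rights_entry_index(uint64_t word)
{
    uint64_t field = (word >> RIGHTS_INDEX_SHIFT) & RIGHTS_INDEX_MASK;

    for (int i = 0; i < 5; i++)
        if (field == (UINT64_C(1) << i))
            return i;
    return -1;
}

/* Number of distinct rights an array of nentries entries can hold. */
static inline unsigned rights_capacity(unsigned nentries)
{
    return nentries * RIGHTS_VALUE_BITS;
}
```

With these helpers, rights_capacity(2) gives the 114 rights of the
current structure and rights_capacity(5) gives the 285 maximum.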

So a future kernel (with an expanded array) can cope with an old binary
(that uses a narrow array) by reading the first u64 from the structure,
and using the top 2 bits to figure out how much more memory to copy
from userspace.  Slightly inefficient, but I wouldn't expect rights
setting to be a performance critical operation.
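
To make that two-step read concrete, here is a userspace simulation
(not the code in the patchset).  fetch_words() is a hypothetical
stand-in for copy_from_user(), and treating the entries an old binary
did not supply as "no rights" is my assumption for the sketch:

```c
#include <stdint.h>
#include <string.h>

#define MAX_ENTRIES 5

struct rights_buf {
    uint64_t w[MAX_ENTRIES];
};

/* Hypothetical stand-in for copy_from_user(); the real call can fail. */
static int fetch_words(uint64_t *dst, const uint64_t *user_src, size_t n)
{
    memcpy(dst, user_src, n * sizeof(*dst));
    return 0;
}

/* Returns the number of entries supplied (2..5), or -1 if malformed. */
static int read_rights(struct rights_buf *out, const uint64_t *user_src)
{
    unsigned n;

    /* Entries the caller does not supply stay zero (assumed: no rights). */
    memset(out, 0, sizeof(*out));

    /* Step 1: read only the first word and decode the array size. */
    if (fetch_words(&out->w[0], user_src, 1) != 0)
        return -1;
    n = (unsigned)(out->w[0] >> 62) + 2;

    /* Step 2: read the remaining n - 1 words that this caller provided. */
    if (fetch_words(&out->w[1], user_src + 1, n - 1) != 0)
        return -1;

    /* Check the one-hot position field, and the zero top bits for i > 0. */
    for (unsigned i = 0; i < n; i++) {
        if (((out->w[i] >> 57) & 0x1f) != (UINT64_C(1) << i))
            return -1;
        if (i > 0 && (out->w[i] >> 62) != 0)
            return -1;
    }
    return (int)n;
}
```

Only the words after the first are fetched in the second step, so the
size decoded in step 1 is never re-read from the caller's memory.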

Of course, we can deviate from the FreeBSD implementation details if
we want to -- these details are deliberately hidden from userspace
programs in the rights-manipulation library functions, so a different
implementation under the covers wouldn't affect Capsicum-using
applications.  But I figured it's best to stay close unless there's
a good reason to diverge.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
  2014-06-30 10:28 ` [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2) David Drysdale
  2014-06-30 14:49   ` Andy Lutomirski
@ 2014-06-30 20:40   ` Andi Kleen
  2014-06-30 21:11     ` Andy Lutomirski
  2014-07-01  9:53       ` David Drysdale
  2014-07-08 12:03     ` Christoph Hellwig
  2 siblings, 2 replies; 87+ messages in thread
From: Andi Kleen @ 2014-06-30 20:40 UTC (permalink / raw)
  To: David Drysdale
  Cc: linux-security-module, linux-kernel, Greg Kroah-Hartman,
	Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api

David Drysdale <drysdale@google.com> writes:

> Add a new O_BENEATH_ONLY flag for openat(2) which restricts the
> provided path, rejecting (with -EACCES) paths that are not beneath
> the provided dfd.  In particular, reject:
>  - paths that contain .. components
>  - paths that begin with /
>  - symlinks that have paths as above.
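
For illustration only, the purely lexical part of the rules quoted
above (a leading / or a .. component) could be checked in userspace as
below.  The helper name is hypothetical and this is not how the patch
implements the flag; the symlink case cannot be covered this way, since
it depends on what the kernel finds during the path walk.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns false for paths that the quoted rules
 * reject on lexical grounds alone (absolute, or containing a ".."
 * component).  Symlinks are not examined.
 */
static bool path_is_lexically_beneath(const char *path)
{
    const char *p = path;

    if (path[0] == '/')
        return false;               /* absolute path */

    while (*p != '\0') {
        size_t len = strcspn(p, "/");   /* length of this component */

        if (len == 2 && p[0] == '.' && p[1] == '.')
            return false;           /* ".." component, even if harmless */

        p += len;
        while (*p == '/')           /* skip one or more separators */
            p++;
    }
    return true;
}
```

Note that "subdir/../filename" is rejected here too, matching the
description's point that .. is refused even when it would not escape
the directory.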

How about bind mounts?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
  2014-06-30 20:40   ` Andi Kleen
@ 2014-06-30 21:11     ` Andy Lutomirski
  2014-07-01  9:53       ` David Drysdale
  1 sibling, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2014-06-30 21:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Drysdale, LSM List, linux-kernel, Greg Kroah-Hartman,
	Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	Linux API

On Mon, Jun 30, 2014 at 1:40 PM, Andi Kleen <andi@firstfloor.org> wrote:
> David Drysdale <drysdale@google.com> writes:
>
>> Add a new O_BENEATH_ONLY flag for openat(2) which restricts the
>> provided path, rejecting (with -EACCES) paths that are not beneath
>> the provided dfd.  In particular, reject:
>>  - paths that contain .. components
>>  - paths that begin with /
>>  - symlinks that have paths as above.
>
> How about bind mounts?

What's the problematic scenario?

--Andy

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 1/5] man-pages: open.2: describe O_BENEATH_ONLY flag
  2014-06-30 10:28 ` [PATCH 1/5] man-pages: open.2: describe O_BENEATH_ONLY flag David Drysdale
@ 2014-06-30 22:22   ` Andy Lutomirski
  0 siblings, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2014-06-30 22:22 UTC (permalink / raw)
  To: David Drysdale
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API

On Mon, Jun 30, 2014 at 3:28 AM, David Drysdale <drysdale@google.com> wrote:
> Signed-off-by: David Drysdale <drysdale@google.com>
> ---
>  man2/open.2 | 33 +++++++++++++++++++++++++++++++--
>  1 file changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/man2/open.2 b/man2/open.2
> index 3824ab5be1f0..ba0da01c1a4f 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -713,7 +713,7 @@ in a fully formed state (using
>  as described above).
>  .RE
>  .IP
> -.B O_TMPFILE
> +.B O_TMPFILE " (since Linux 3.??)"
>  requires support by the underlying filesystem;
>  only a subset of Linux filesystems provide that support.
>  In the initial implementation, support was provided in
> @@ -723,6 +723,31 @@ XFS support was added
>  .\" commit ab29743117f9f4c22ac44c13c1647fb24fb2bafe
>  in Linux 3.15.
>  .TP
> +.B O_BENEATH_ONLY
> +Ensure that the
> +.I pathname
> +is beneath the current working directory (for
> +.BR open (2))
> +or the
> +.I dirfd
> +(for
> +.BR openat (2)).
> +If the
> +.I pathname
> +is absolute or contains a path component of "..", the
> +.BR open ()
> +fails with the error
> +.BR EACCES.
> +This occurs even if ".." path component would not actually
> +escape the original directory; for example, a
> +.I pathname
> +of "subdir/../filename" would be rejected.
> +Path components that are symbolic links to absolute paths, or that are
> +relative paths containing a ".." component, are cause the

"are cause" is a typo.  Do you mean "will also cause"?

--Andy

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 5/5] man-pages: cap_rights_get: retrieve Capsicum fd rights
@ 2014-06-30 22:28     ` Andy Lutomirski
  0 siblings, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2014-06-30 22:28 UTC (permalink / raw)
  To: David Drysdale
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API

On Mon, Jun 30, 2014 at 3:28 AM, David Drysdale <drysdale@google.com> wrote:
> Signed-off-by: David Drysdale <drysdale@google.com>
> ---
>  man2/cap_rights_get.2 | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 126 insertions(+)
>  create mode 100644 man2/cap_rights_get.2
>
> diff --git a/man2/cap_rights_get.2 b/man2/cap_rights_get.2
> new file mode 100644
> index 000000000000..966c0ed7e336
> --- /dev/null
> +++ b/man2/cap_rights_get.2
> @@ -0,0 +1,126 @@
> +.\"
> +.\" Copyright (c) 2008-2010 Robert N. M. Watson
> +.\" Copyright (c) 2012-2013 The FreeBSD Foundation
> +.\" Copyright (c) 2013-2014 Google, Inc.
> +.\" All rights reserved.
> +.\"
> +.\" %%%LICENSE_START(BSD_2_CLAUSE)
> +.\" Redistribution and use in source and binary forms, with or without
> +.\" modification, are permitted provided that the following conditions
> +.\" are met:
> +.\" 1. Redistributions of source code must retain the above copyright
> +.\"    notice, this list of conditions and the following disclaimer.
> +.\" 2. Redistributions in binary form must reproduce the above copyright
> +.\"    notice, this list of conditions and the following disclaimer in the
> +.\"    documentation and/or other materials provided with the distribution.
> +.\"
> +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
> +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> +.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
> +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> +.\" SUCH DAMAGE.
> +.\" %%%LICENSE_END
> +.\"
> +.TH CAP_RIGHTS_GET 2 2014-05-07 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +cap_rights_get \- retrieve Capsicum capability rights
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/capsicum.h>
> +.sp
> +.BI "int cap_rights_get(int " fd ", struct cap_rights *" rights ,
> +.BI "                   unsigned int *" fcntls ,
> +.BI "                   int *" nioctls ", unsigned int *" ioctls );
> +.SH DESCRIPTION
> +Obtain the current Capsicum capability rights for a file descriptor.
> +.PP
> +The function will fill the
> +.I rights
> +argument (if non-NULL) with the primary capability rights of the
> +.I fd
> +descriptor.  The result can be examined with the
> +.BR cap_rights_is_set (3)
> +family of functions.  The complete list of primary rights can be found in the
> +.BR rights (7)
> +manual page.
> +.PP
> +If the
> +.I fcntls
> +argument is non-NULL, it will be filled in with a bitmask of allowed
> +.BR fcntl (2)
> +commands; see
> +.BR cap_rights_limit (2)
> +for values.  If the file descriptor does not have the
> +.B CAP_FCNTL
> +primary right, the returned
> +.I fcntls
> +value will be zero.
> +.PP
> +If the
> +.I nioctls
> +argument is non-NULL, it will be filled in with the number of allowed
> +.BR ioctl (2)
> +commands, or with the value CAP_IOCTLS_ALL to indicate that all
> +.BR ioctl (2)
> +commands are allowed.  If the file descriptor does not have the
> +.B CAP_IOCTL
> +primary right, the returned
> +.I nioctls
> +value will be zero.
> +.PP
> +The
> +.I ioctls
> +argument (if non-NULL) should point at memory that can hold up to
> +.I nioctls
> +values.
> +The system call populates the provided buffer with up to
> +.I nioctls
> +elements, but always returns the total number of

I assume you mean "up to the initial value of *nioctls elements" or
something.  Can you clarify?

--Andy

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 5/5] man-pages: cap_rights_get: retrieve Capsicum fd rights
  2014-06-30 22:28     ` Andy Lutomirski
@ 2014-07-01  9:19       ` David Drysdale
  -1 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-07-01  9:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API

On Mon, Jun 30, 2014 at 03:28:14PM -0700, Andy Lutomirski wrote:
> On Mon, Jun 30, 2014 at 3:28 AM, David Drysdale <drysdale@google.com> wrote:
> > Signed-off-by: David Drysdale <drysdale@google.com>
> > ---
> >  man2/cap_rights_get.2 | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 126 insertions(+)
> >  create mode 100644 man2/cap_rights_get.2
> >
> > diff --git a/man2/cap_rights_get.2 b/man2/cap_rights_get.2
> > new file mode 100644
> > index 000000000000..966c0ed7e336
> > --- /dev/null
> > +++ b/man2/cap_rights_get.2
> > @@ -0,0 +1,126 @@
> > +.\"
> > +.\" Copyright (c) 2008-2010 Robert N. M. Watson
> > +.\" Copyright (c) 2012-2013 The FreeBSD Foundation
> > +.\" Copyright (c) 2013-2014 Google, Inc.
> > +.\" All rights reserved.
> > +.\"
> > +.\" %%%LICENSE_START(BSD_2_CLAUSE)
> > +.\" Redistribution and use in source and binary forms, with or without
> > +.\" modification, are permitted provided that the following conditions
> > +.\" are met:
> > +.\" 1. Redistributions of source code must retain the above copyright
> > +.\"    notice, this list of conditions and the following disclaimer.
> > +.\" 2. Redistributions in binary form must reproduce the above copyright
> > +.\"    notice, this list of conditions and the following disclaimer in the
> > +.\"    documentation and/or other materials provided with the distribution.
> > +.\"
> > +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
> > +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> > +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> > +.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
> > +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> > +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> > +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> > +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> > +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> > +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> > +.\" SUCH DAMAGE.
> > +.\" %%%LICENSE_END
> > +.\"
> > +.TH CAP_RIGHTS_GET 2 2014-05-07 "Linux" "Linux Programmer's Manual"
> > +.SH NAME
> > +cap_rights_get \- retrieve Capsicum capability rights
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include <sys/capsicum.h>
> > +.sp
> > +.BI "int cap_rights_get(int " fd ", struct cap_rights *" rights ,
> > +.BI "                   unsigned int *" fcntls ,
> > +.BI "                   int *" nioctls ", unsigned int *" ioctls );
> > +.SH DESCRIPTION
> > +Obtain the current Capsicum capability rights for a file descriptor.
> > +.PP
> > +The function will fill the
> > +.I rights
> > +argument (if non-NULL) with the primary capability rights of the
> > +.I fd
> > +descriptor.  The result can be examined with the
> > +.BR cap_rights_is_set (3)
> > +family of functions.  The complete list of primary rights can be found in the
> > +.BR rights (7)
> > +manual page.
> > +.PP
> > +If the
> > +.I fcntls
> > +argument is non-NULL, it will be filled in with a bitmask of allowed
> > +.BR fcntl (2)
> > +commands; see
> > +.BR cap_rights_limit (2)
> > +for values.  If the file descriptor does not have the
> > +.B CAP_FCNTL
> > +primary right, the returned
> > +.I fcntls
> > +value will be zero.
> > +.PP
> > +If the
> > +.I nioctls
> > +argument is non-NULL, it will be filled in with the number of allowed
> > +.BR ioctl (2)
> > +commands, or with the value CAP_IOCTLS_ALL to indicate that all
> > +.BR ioctl (2)
> > +commands are allowed.  If the file descriptor does not have the
> > +.B CAP_IOCTL
> > +primary right, the returned
> > +.I nioctls
> > +value will be zero.
> > +.PP
> > +The
> > +.I ioctls
> > +argument (if non-NULL) should point at memory that can hold up to
> > +.I nioctls
> > +values.
> > +The system call populates the provided buffer with up to
> > +.I nioctls
> > +elements, but always returns the total number of
> 
> I assume you mean "up to the initial value of *nioctls elements" or
> something.  Can you clarify?
> 
> --Andy

Yeah, that's what I meant.  Is this clearer?

  If the ioctls argument is non-NULL, the caller should specify
  the size of the provided buffer as the initial value of the
  nioctls argument (as a count of the number of ioctl(2) command
  values the buffer can hold).  On successful completion of the
  system call, the ioctls buffer is filled with the ioctl(2)
  command values, up to a maximum of the initial value of nioctls.
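
A small mock of those semantics, in case it helps.  This is not the
syscall itself: mock_get_ioctls() and its "allowed"/"total" parameters
are invented for the sketch, the CAP_IOCTLS_ALL case is ignored, and
updating *nioctls to the total on return is inferred from the draft
page's description of nioctls rather than stated in the text above.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the ioctl-list part of cap_rights_get().
 * "allowed" and "total" play the role of the kernel's per-FD list.
 */
static int mock_get_ioctls(const unsigned int *allowed, int total,
                           int *nioctls, unsigned int *ioctls)
{
    if (nioctls == NULL)
        return 0;                   /* caller is not interested */

    if (ioctls != NULL) {
        /* The initial value of *nioctls is the buffer size in elements. */
        int room = *nioctls;
        int n = total < room ? total : room;

        for (int i = 0; i < n; i++)
            ioctls[i] = allowed[i];
    }

    /* Assumed: report the total, even if the buffer was too small. */
    *nioctls = total;
    return 0;
}
```

Under these assumptions a caller can pass a NULL ioctls buffer first to
learn the count, then call again with a buffer of that size.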


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 5/5] man-pages: cap_rights_get: retrieve Capsicum fd rights
@ 2014-07-01  9:19       ` David Drysdale
  0 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-07-01  9:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API

On Mon, Jun 30, 2014 at 03:28:14PM -0700, Andy Lutomirski wrote:
> On Mon, Jun 30, 2014 at 3:28 AM, David Drysdale <drysdale@google.com> wrote:
> > Signed-off-by: David Drysdale <drysdale@google.com>
> > ---
> >  man2/cap_rights_get.2 | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 126 insertions(+)
> >  create mode 100644 man2/cap_rights_get.2
> >
> > diff --git a/man2/cap_rights_get.2 b/man2/cap_rights_get.2
> > new file mode 100644
> > index 000000000000..966c0ed7e336
> > --- /dev/null
> > +++ b/man2/cap_rights_get.2
> > @@ -0,0 +1,126 @@
> > +.\"
> > +.\" Copyright (c) 2008-2010 Robert N. M. Watson
> > +.\" Copyright (c) 2012-2013 The FreeBSD Foundation
> > +.\" Copyright (c) 2013-2014 Google, Inc.
> > +.\" All rights reserved.
> > +.\"
> > +.\" %%%LICENSE_START(BSD_2_CLAUSE)
> > +.\" Redistribution and use in source and binary forms, with or without
> > +.\" modification, are permitted provided that the following conditions
> > +.\" are met:
> > +.\" 1. Redistributions of source code must retain the above copyright
> > +.\"    notice, this list of conditions and the following disclaimer.
> > +.\" 2. Redistributions in binary form must reproduce the above copyright
> > +.\"    notice, this list of conditions and the following disclaimer in the
> > +.\"    documentation and/or other materials provided with the distribution.
> > +.\"
> > +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
> > +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> > +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> > +.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
> > +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> > +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> > +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> > +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> > +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> > +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> > +.\" SUCH DAMAGE.
> > +.\" %%%LICENSE_END
> > +.\"
> > +.TH CAP_RIGHTS_GET 2 2014-05-07 "Linux" "Linux Programmer's Manual"
> > +.SH NAME
> > +cap_rights_get \- retrieve Capsicum capability rights
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include <sys/capsicum.h>
> > +.sp
> > +.BI "int cap_rights_get(int " fd ", struct cap_rights *" rights ,
> > +.BI "                   unsigned int *" fcntls ,
> > +.BI "                   int *" nioctls ", unsigned int *" ioctls );
> > +.SH DESCRIPTION
> > +Obtain the current Capsicum capability rights for a file descriptor.
> > +.PP
> > +The function will fill the
> > +.I rights
> > +argument (if non-NULL) with the primary capability rights of the
> > +.I fd
> > +descriptor.  The result can be examined with the
> > +.BR cap_rights_is_set (3)
> > +family of functions.  The complete list of primary rights can be found in the
> > +.BR rights (7)
> > +manual page.
> > +.PP
> > +If the
> > +.I fcntls
> > +argument is non-NULL, it will be filled in with a bitmask of allowed
> > +.BR fcntl (2)
> > +commands; see
> > +.BR cap_rights_limit (2)
> > +for values.  If the file descriptor does not have the
> > +.B CAP_FCNTL
> > +primary right, the returned
> > +.I fcntls
> > +value will be zero.
> > +.PP
> > +If the
> > +.I nioctls
> > +argument is non-NULL, it will be filled in with the number of allowed
> > +.BR ioctl (2)
> > +commands, or with the value CAP_IOCTLS_ALL to indicate that all
> > +.BR ioctl (2)
> > +commands are allowed.  If the file descriptor does not have the
> > +.B CAP_IOCTL
> > +primary right, the returned
> > +.I nioctls
> > +value will be zero.
> > +.PP
> > +The
> > +.I ioctls
> > +argument (if non-NULL) should point at memory that can hold up to
> > +.I nioctls
> > +values.
> > +The system call populates the provided buffer with up to
> > +.I nioctls
> > +elements, but always returns the total number of
> 
> I assume you mean "up to the initial value of *nioctls elements" or
> something.  Can you clarify?
> 
> --Andy

Yeah, that's what I meant.  Is this clearer?

  If  the  ioctls argument is non-NULL, the caller should specify
  the size of the provided buffer as the  initial  value  of  the
  nioctls  argument (as a count of the number of ioctl(2) command
  values the buffer can hold).  On successful completion  of  the
  system call, the ioctls buffer is filled with the ioctl(2) command
  values, up to a maximum of the initial value of nioctls.
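
To make the in/out convention concrete, here is a minimal sketch of the
buffer handling only.  mock_get_ioctls() and the allowed_ioctls table are
hypothetical stand-ins, not the real system call, and the command values
are arbitrary; the sketch assumes (per the draft man page text) that
nioctls always comes back holding the total count, even when the buffer
was too small to take every entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table of commands allowed on some descriptor (arbitrary values). */
static const unsigned int allowed_ioctls[] = { 0x5401, 0x5402, 0x5413 };
#define N_ALLOWED ((int)(sizeof(allowed_ioctls) / sizeof(allowed_ioctls[0])))

/*
 * Hypothetical stand-in that models only the ioctls/nioctls handling:
 *  - on entry, *nioctls is the capacity of the ioctls buffer;
 *  - on return, the buffer holds at most that many entries and
 *    *nioctls holds the total number of allowed commands.
 */
int mock_get_ioctls(int *nioctls, unsigned int *ioctls)
{
    if (nioctls == NULL)
        return 0;                   /* caller did not ask for ioctl info */

    if (ioctls != NULL) {
        int capacity = *nioctls;    /* initial value = buffer size */
        assert(capacity >= 0);
        for (int i = 0; i < capacity && i < N_ALLOWED; i++)
            ioctls[i] = allowed_ioctls[i];
    }

    *nioctls = N_ALLOWED;           /* total count, even if truncated */
    return 0;
}
```

A caller that finds the returned count larger than its buffer can then
retry with a buffer of that size.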

--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
@ 2014-07-01  9:53       ` David Drysdale
  0 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-07-01  9:53 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-security-module, linux-kernel, Greg Kroah-Hartman,
	Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api

On Mon, Jun 30, 2014 at 01:40:40PM -0700, Andi Kleen wrote:
> David Drysdale <drysdale@google.com> writes:
> 
> > Add a new O_BENEATH_ONLY flag for openat(2) which restricts the
> > provided path, rejecting (with -EACCES) paths that are not beneath
> > the provided dfd.  In particular, reject:
> >  - paths that contain .. components
> >  - paths that begin with /
> >  - symlinks that have paths as above.
> 
> How about bind mounts?
> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only

Bind mounts won't get rejected because they just look like normal
path components.  In other words, if dir/subdir is a bind mount to
/root/dir then:
  fd = openat(AT_FDCWD, "dir/subdir", O_RDONLY|O_BENEATH_ONLY);
will work fine.
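
For illustration, here is a rough userspace approximation of the purely
lexical part of the check.  path_looks_beneath() is a hypothetical helper,
not code from the patch: it only inspects the path string, so it cannot
see symlink targets (which the kernel flag checks during lookup), and it
accepts a bind mount for the same reason given above, namely that it is
just another ordinary component name.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if the path is relative and has no
 * ".." component.  String check only; symlinks are not examined.
 */
bool path_looks_beneath(const char *path)
{
    assert(path != NULL);

    if (path[0] == '/')
        return false;                   /* absolute path */

    const char *p = path;
    while (*p != '\0') {
        size_t len = strcspn(p, "/");   /* length of this component */
        if (len == 2 && p[0] == '.' && p[1] == '.')
            return false;               /* ".." component */
        p += len;
        while (*p == '/')               /* skip separator(s) */
            p++;
    }
    return true;
}
```

With this sketch, "dir/subdir" passes whether or not subdir is a bind
mount, while "/etc/passwd" and "dir/../secret" are rejected.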

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 5/5] man-pages: cap_rights_get: retrieve Capsicum fd rights
  2014-07-01  9:19       ` David Drysdale
  (?)
@ 2014-07-01 14:18       ` Andy Lutomirski
  -1 siblings, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2014-07-01 14:18 UTC (permalink / raw)
  To: David Drysdale
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API

On Tue, Jul 1, 2014 at 2:19 AM, David Drysdale <drysdale@google.com> wrote:
> On Mon, Jun 30, 2014 at 03:28:14PM -0700, Andy Lutomirski wrote:
>> On Mon, Jun 30, 2014 at 3:28 AM, David Drysdale <drysdale@google.com> wrote:
>> > Signed-off-by: David Drysdale <drysdale@google.com>
>> > ---
>> >  man2/cap_rights_get.2 | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 126 insertions(+)
>> >  create mode 100644 man2/cap_rights_get.2
>> >
>> > diff --git a/man2/cap_rights_get.2 b/man2/cap_rights_get.2
>> > new file mode 100644
>> > index 000000000000..966c0ed7e336
>> > --- /dev/null
>> > +++ b/man2/cap_rights_get.2
>> > @@ -0,0 +1,126 @@
>> > +.\"
>> > +.\" Copyright (c) 2008-2010 Robert N. M. Watson
>> > +.\" Copyright (c) 2012-2013 The FreeBSD Foundation
>> > +.\" Copyright (c) 2013-2014 Google, Inc.
>> > +.\" All rights reserved.
>> > +.\"
>> > +.\" %%%LICENSE_START(BSD_2_CLAUSE)
>> > +.\" Redistribution and use in source and binary forms, with or without
>> > +.\" modification, are permitted provided that the following conditions
>> > +.\" are met:
>> > +.\" 1. Redistributions of source code must retain the above copyright
>> > +.\"    notice, this list of conditions and the following disclaimer.
>> > +.\" 2. Redistributions in binary form must reproduce the above copyright
>> > +.\"    notice, this list of conditions and the following disclaimer in the
>> > +.\"    documentation and/or other materials provided with the distribution.
>> > +.\"
>> > +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
>> > +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>> > +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>> > +.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
>> > +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>> > +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>> > +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>> > +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>> > +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>> > +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>> > +.\" SUCH DAMAGE.
>> > +.\" %%%LICENSE_END
>> > +.\"
>> > +.TH CAP_RIGHTS_GET 2 2014-05-07 "Linux" "Linux Programmer's Manual"
>> > +.SH NAME
>> > +cap_rights_get \- retrieve Capsicum capability rights
>> > +.SH SYNOPSIS
>> > +.nf
>> > +.B #include <sys/capsicum.h>
>> > +.sp
>> > +.BI "int cap_rights_get(int " fd ", struct cap_rights *" rights ,
>> > +.BI "                   unsigned int *" fcntls ,
>> > +.BI "                   int *" nioctls ", unsigned int *" ioctls );
>> > +.SH DESCRIPTION
>> > +Obtain the current Capsicum capability rights for a file descriptor.
>> > +.PP
>> > +The function will fill the
>> > +.I rights
>> > +argument (if non-NULL) with the primary capability rights of the
>> > +.I fd
>> > +descriptor.  The result can be examined with the
>> > +.BR cap_rights_is_set (3)
>> > +family of functions.  The complete list of primary rights can be found in the
>> > +.BR rights (7)
>> > +manual page.
>> > +.PP
>> > +If the
>> > +.I fcntls
>> > +argument is non-NULL, it will be filled in with a bitmask of allowed
>> > +.BR fcntl (2)
>> > +commands; see
>> > +.BR cap_rights_limit (2)
>> > +for values.  If the file descriptor does not have the
>> > +.B CAP_FCNTL
>> > +primary right, the returned
>> > +.I fcntls
>> > +value will be zero.
>> > +.PP
>> > +If the
>> > +.I nioctls
>> > +argument is non-NULL, it will be filled in with the number of allowed
>> > +.BR ioctl (2)
>> > +commands, or with the value CAP_IOCTLS_ALL to indicate that all
>> > +.BR ioctl (2)
>> > +commands are allowed.  If the file descriptor does not have the
>> > +.B CAP_IOCTL
>> > +primary right, the returned
>> > +.I nioctls
>> > +value will be zero.
>> > +.PP
>> > +The
>> > +.I ioctls
>> > +argument (if non-NULL) should point at memory that can hold up to
>> > +.I nioctls
>> > +values.
>> > +The system call populates the provided buffer with up to
>> > +.I nioctls
>> > +elements, but always returns the total number of
>>
>> I assume you mean "up to the initial value of *nioctls elements" or
>> something.  Can you clarify?
>>
>> --Andy
>
> Yeah, that's what I meant.  Is this clearer?
>
>   If  the  ioctls argument is non-NULL, the caller should specify
>   the size of the provided buffer as the  initial  value  of  the
>   nioctls  argument (as a count of the number of ioctl(2) command
>   values the buffer can hold).  On successful completion  of  the
>   system call, the ioctls buffer is filled with the ioctl(2) command
>   values, up to a maximum of the initial value of nioctls.
>

Yes.  Thanks.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
  2014-07-01  9:53       ` David Drysdale
  (?)
@ 2014-07-01 18:58       ` Loganaden Velvindron
  -1 siblings, 0 replies; 87+ messages in thread
From: Loganaden Velvindron @ 2014-07-01 18:58 UTC (permalink / raw)
  To: David Drysdale
  Cc: Andi Kleen, linux-security-module, linux-kernel,
	Greg Kroah-Hartman, Alexander Viro, Meredydd Luff, Kees Cook,
	James Morris, linux-api

On Tue, Jul 1, 2014 at 1:53 PM, David Drysdale <drysdale@google.com> wrote:
> On Mon, Jun 30, 2014 at 01:40:40PM -0700, Andi Kleen wrote:
>> David Drysdale <drysdale@google.com> writes:
>>
>> > Add a new O_BENEATH_ONLY flag for openat(2) which restricts the
>> > provided path, rejecting (with -EACCES) paths that are not beneath
>> > the provided dfd.  In particular, reject:
>> >  - paths that contain .. components
>> >  - paths that begin with /
>> >  - symlinks that have paths as above.
>>
>> How about bind mounts?
>>
>> -Andi
>>
>> --
>> ak@linux.intel.com -- Speaking for myself only
>
> Bind mounts won't get rejected because they just look like normal
> path components.  In other words, if dir/subdir is a bind mount to
> /root/dir then:
>   fd = openat(AT_FDCWD, "dir/subdir", O_RDONLY|O_BENEATH_ONLY);
> will work fine.

Talking about David's efforts at porting Capsicum to Linux, I've already
implemented support for Capsicum in OpenSSH.  It shouldn't be complicated
to enable it on Linux systems that support it.

I would very much like to see Capsicum integrated into mainline, as it's a
high-quality sandbox solution that will benefit a lot of server software
that implements privilege separation.

-- 
This message is strictly personal and the opinions expressed do not
represent those of my employers, either past or present.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 09/11] capsicum: implementations of new LSM hooks
@ 2014-07-02 13:49       ` Paul Moore
  0 siblings, 0 replies; 87+ messages in thread
From: Paul Moore @ 2014-07-02 13:49 UTC (permalink / raw)
  To: David Drysdale
  Cc: Andy Lutomirski, LSM List, linux-kernel, Greg Kroah-Hartman,
	Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	Linux API

On Monday, June 30, 2014 09:05:38 AM Andy Lutomirski wrote:
> On Mon, Jun 30, 2014 at 3:28 AM, David Drysdale <drysdale@google.com> wrote:
> > If the LSM does not provide implementations of the .file_lookup and
> > .file_install LSM hooks, always use the Capsicum implementations.
> > 
> > The Capsicum implementation of file_lookup checks for a Capsicum
> > capability wrapper file and unwraps it if the appropriate rights
> > are available.
> > 
> > The Capsicum implementation of file_install checks whether the file
> > has restricted rights associated with it.  If it does, it is replaced
> > with a Capsicum capability wrapper file before installation into the
> > fdtable.
> 
> I think I fall on the "no LSM" side of the fence.  This kind of stuff
> should be available regardless of selected LSM (as it is in your
> code) ...

I agree.  Looking quickly at the patches, the code seems to take an odd 
approach of living largely outside the LSM framework, but then relying on a 
couple of LSM hooks.  Capsicum should either live fully as an LSM or fully
outside of it; this mix seems a bit silly to me.

-- 
paul moore
www.paul-moore.com


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 09/11] capsicum: implementations of new LSM hooks
  2014-07-02 13:49       ` Paul Moore
@ 2014-07-02 17:09         ` David Drysdale
  -1 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-07-02 17:09 UTC (permalink / raw)
  To: Paul Moore
  Cc: Andy Lutomirski, LSM List, linux-kernel, Greg Kroah-Hartman,
	Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	Linux API

On Wed, Jul 2, 2014 at 2:49 PM, Paul Moore <paul@paul-moore.com> wrote:
> On Monday, June 30, 2014 09:05:38 AM Andy Lutomirski wrote:
>> On Mon, Jun 30, 2014 at 3:28 AM, David Drysdale <drysdale@google.com> wrote:
>> > If the LSM does not provide implementations of the .file_lookup and
>> > .file_install LSM hooks, always use the Capsicum implementations.
>> >
>> > The Capsicum implementation of file_lookup checks for a Capsicum
>> > capability wrapper file and unwraps it if the appropriate rights
>> > are available.
>> >
>> > The Capsicum implementation of file_install checks whether the file
>> > has restricted rights associated with it.  If it does, it is replaced
>> > with a Capsicum capability wrapper file before installation into the
>> > fdtable.
>>
>> I think I fall on the "no LSM" side of the fence.  This kind of stuff
>> should be available regardless of selected LSM (as it is in your
>> code) ...
>
> I agree.  Looking quickly at the patches, the code seems to take an odd
> approach of living largely outside the LSM framework, but then relying on a
> couple of LSM hooks.  Capsicum should either live fully as an LSM or fully
> outside of it; this mix seems a bit silly to me.

Yeah, the end result was definitely a bit odd, hence the queries in the
cover email.  The consensus so far seems to be that the LSM hooks don't
help, so I'll remove them in the next iteration.

Thanks,
David

> --
> paul moore
> www.paul-moore.com
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
  2014-06-30 10:28 [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) David Drysdale
@ 2014-07-03  9:12   ` Paolo Bonzini
  2014-06-30 10:28   ` David Drysdale
                     ` (15 subsequent siblings)
  16 siblings, 0 replies; 87+ messages in thread
From: Paolo Bonzini @ 2014-07-03  9:12 UTC (permalink / raw)
  To: David Drysdale, linux-security-module, linux-kernel, Greg Kroah-Hartman
  Cc: Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api, qemu-devel

Il 30/06/2014 12:28, David Drysdale ha scritto:
> Hi all,
>
> The last couple of versions of FreeBSD (9.x/10.x) have included the
> Capsicum security framework [1], which allows security-aware
> applications to sandbox themselves in a very fine-grained way.  For
> example, OpenSSH now (>= 6.5) uses Capsicum in its FreeBSD version to
> restrict sshd's credentials checking process, to reduce the chances of
> credential leakage.

Hi David,

we've had similar goals in QEMU.  QEMU can be used as a virtual machine 
monitor from the command line, but it also has an API that lets a 
management tool drive QEMU via AF_UNIX sockets.  Long term, we would 
like to have a restricted mode for QEMU where all file descriptors are 
obtained via SCM_RIGHTS or /dev/fd, and syscalls can be locked down.

Currently we do use seccomp v2 BPF filters, but unfortunately this
didn't help very much.  QEMU supports hotplugging, hence the filter must
whitelist anything that _might_ be used in the future, which is
generally... too much.

Something like Capsicum would be really nice because it attaches
capabilities to file descriptors.  However, I wonder how extensible
Capsicum could be, and I am worried about the proliferation of
capabilities that its design naturally leads to.

Given Linux's previous experience with BPF filters, what do you think 
about attaching specific BPF programs to file descriptors?  Then 
whenever a syscall is run that affects a file descriptor, the BPF 
program for the file descriptor (attached to a struct file* as in 
Capsicum) would run in addition to the process-wide filter.

An equivalent of PR_SET_NO_NEW_PRIVS can also be added to file 
descriptors, so that a program that doesn't lock down syscalls can still 
lock down the operations (including fcntls and ioctls) on specific file 
descriptors.

Converting FreeBSD capabilities to BPF programs can be easily 
implemented in userspace.

>   [Capsicum also includes 'capability mode', which locks down the
>   available syscalls so the rights restrictions can't just be bypassed
>   by opening new file descriptors; I'll describe that separately later.]

This can also be implemented in userspace via seccomp and 
PR_SET_NO_NEW_PRIVS.
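
As a rough sketch of that kind of userspace lockdown (much cruder than
Capsicum's capability mode, and only an illustration): the function names
below are hypothetical, the filter blocks just socket(2) as an example of
a call that creates new descriptors, and it deliberately omits the
architecture check that a production filter should perform first.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: make socket(2) fail with EPERM from now on. */
int deny_socket_creation(void)
{
    struct sock_filter filter[] = {
        /* Load the syscall number (no seccomp_data.arch check: sketch only). */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it is socket(2), fall through to the ERRNO return. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        /* Everything else is allowed. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    assert(sizeof(filter) / sizeof(filter[0]) <= BPF_MAXINSNS);

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        return -1;
    return 0;
}

/* Hypothetical helper: returns 1 if socket(2) now fails with EPERM. */
int socket_creation_blocked(void)
{
    int s = socket(AF_UNIX, SOCK_STREAM, 0);
    if (s >= 0) {
        close(s);
        return 0;
    }
    return errno == EPERM;
}
```

Descriptors passed in beforehand (for example via SCM_RIGHTS) remain
usable after the filter is installed; only the creation of new sockets is
refused.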

>   [Policing the rights checks anywhere else, for example at the system
>   call boundary, isn't a good idea because it opens up the possibility
>   of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
>   changed (as openat/close/dup2 are allowed in capability mode) between
>   the 'check' at syscall entry and the 'use' at fget() invocation.]

In the case of BPF filters, I wonder if you could stash the BPF 
"environment" somewhere and then use it at fget() invocation. 
Alternatively, it can be reconstructed at fget() time, similar to your 
introduction of fgetr().

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
  2014-07-03  9:12   ` [Qemu-devel] " Paolo Bonzini
@ 2014-07-03 10:01     ` Loganaden Velvindron
  -1 siblings, 0 replies; 87+ messages in thread
From: Loganaden Velvindron @ 2014-07-03 10:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Drysdale, linux-security-module, linux-kernel,
	Greg Kroah-Hartman, Alexander Viro, Meredydd Luff, Kees Cook,
	James Morris, linux-api, qemu-devel

On Thu, Jul 3, 2014 at 1:12 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 30/06/2014 12:28, David Drysdale ha scritto:
>>
>> Hi all,
>>
>> The last couple of versions of FreeBSD (9.x/10.x) have included the
>> Capsicum security framework [1], which allows security-aware
>> applications to sandbox themselves in a very fine-grained way.  For
>> example, OpenSSH now (>= 6.5) uses Capsicum in its FreeBSD version to
>> restrict sshd's credentials checking process, to reduce the chances of
>> credential leakage.

Aside from OpenSSH, I've also been working on implementing Capsicum
support in other userspace software.

>
>
> Hi David,
>
> we've had similar goals in QEMU.  QEMU can be used as a virtual machine
> monitor from the command line, but it also has an API that lets a management
> tool drive QEMU via AF_UNIX sockets.  Long term, we would like to have a
> restricted mode for QEMU where all file descriptors are obtained via
> SCM_RIGHTS or /dev/fd, and syscalls can be locked down.
>
> Currently we do use seccomp v2 BPF filters, but unfortunately this didn't
> help very much.  QEMU supports hotplugging, hence the filter must whitelist
> anything that _might_ be used in the future, which is generally... too much.
>
> Something like Capsicum would be really nice because it attaches
> capabilities to file descriptors.  However, I wonder how extensible
> Capsicum could be, and I am worried about the proliferation of capabilities
> that its design naturally leads to.
>
> Given Linux's previous experience with BPF filters, what do you think about
> attaching specific BPF programs to file descriptors?  Then whenever a
> syscall is run that affects a file descriptor, the BPF program for the file
> descriptor (attached to a struct file* as in Capsicum) would run in addition
> to the process-wide filter.
>
> An equivalent of PR_SET_NO_NEW_PRIVS can also be added to file descriptors,
> so that a program that doesn't lock down syscalls can still lock down the
> operations (including fcntls and ioctls) on specific file descriptors.
>
> Converting FreeBSD capabilities to BPF programs can be easily implemented in
> userspace.
>
>>   [Capsicum also includes 'capability mode', which locks down the
>>   available syscalls so the rights restrictions can't just be bypassed
>>   by opening new file descriptors; I'll describe that separately later.]
>
>
> This can also be implemented in userspace via seccomp and
> PR_SET_NO_NEW_PRIVS.
>
>>   [Policing the rights checks anywhere else, for example at the system
>>   call boundary, isn't a good idea because it opens up the possibility
>>   of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
>>   changed (as openat/close/dup2 are allowed in capability mode) between
>>   the 'check' at syscall entry and the 'use' at fget() invocation.]
>
>
> In the case of BPF filters, I wonder if you could stash the BPF
> "environment" somewhere and then use it at fget() invocation. Alternatively,
> it can be reconstructed at fget() time, similar to your introduction of
> fgetr().
>
> Thanks,
>
> Paolo

-- 
This message is strictly personal and the opinions expressed do not
represent those of my employers, either past or present.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
@ 2014-07-03 10:01     ` Loganaden Velvindron
  0 siblings, 0 replies; 87+ messages in thread
From: Loganaden Velvindron @ 2014-07-03 10:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kees Cook, Greg Kroah-Hartman, Meredydd Luff, linux-kernel,
	qemu-devel, linux-security-module, Alexander Viro, James Morris,
	linux-api, David Drysdale

On Thu, Jul 3, 2014 at 1:12 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 30/06/2014 12:28, David Drysdale ha scritto:
>>
>> Hi all,
>>
>> The last couple of versions of FreeBSD (9.x/10.x) have included the
>> Capsicum security framework [1], which allows security-aware
>> applications to sandbox themselves in a very fine-grained way.  For
>> example, OpenSSH now (>= 6.5) uses Capsicum in its FreeBSD version to
>> restrict sshd's credentials checking process, to reduce the chances of
>> credential leakage.

Aside from OpenSSH, I've also been working on adding Capsicum support
to other userspace software.



>
>
> Hi David,
>
> we've had similar goals in QEMU.  QEMU can be used as a virtual machine
> monitor from the command line, but it also has an API that lets a management
> tool drive QEMU via AF_UNIX sockets.  Long term, we would like to have a
> restricted mode for QEMU where all file descriptors are obtained via
> SCM_RIGHTS or /dev/fd, and syscalls can be locked down.
>
> Currently we do use seccomp v2 BPF filters, but unfortunately this didn't
> help very much.  QEMU supports hotplugging hence the filter must whitelist
> anything that _might_ be used in the future, which is generally... too much.
>
> Something like Capsicum would be really nice because it attaches
> capabilities to file descriptors.  However, I wonder how extensible
> Capsicum could be, and I am worried about the proliferation of capabilities
> that its design naturally leads to.
>
> Given Linux's previous experience with BPF filters, what do you think about
> attaching specific BPF programs to file descriptors?  Then whenever a
> syscall is run that affects a file descriptor, the BPF program for the file
> descriptor (attached to a struct file* as in Capsicum) would run in addition
> to the process-wide filter.
>
> An equivalent of PR_SET_NO_NEW_PRIVS can also be added to file descriptors,
> so that a program that doesn't lock down syscalls can still lock down the
> operations (including fcntls and ioctls) on specific file descriptors.
>
> Converting FreeBSD capabilities to BPF programs can be easily implemented in
> userspace.
>
>>   [Capsicum also includes 'capability mode', which locks down the
>>   available syscalls so the rights restrictions can't just be bypassed
>>   by opening new file descriptors; I'll describe that separately later.]
>
>
> This can also be implemented in userspace via seccomp and
> PR_SET_NO_NEW_PRIVS.
>
>>   [Policing the rights checks anywhere else, for example at the system
>>   call boundary, isn't a good idea because it opens up the possibility
>>   of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
>>   changed (as openat/close/dup2 are allowed in capability mode) between
>>   the 'check' at syscall entry and the 'use' at fget() invocation.]
>
>
> In the case of BPF filters, I wonder if you could stash the BPF
> "environment" somewhere and then use it at fget() invocation. Alternatively,
> it can be reconstructed at fget() time, similar to your introduction of
> fgetr().
>
> Thanks,
>
> Paolo
> --
> To unsubscribe from this list: send the line "unsubscribe
> linux-security-module" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
This message is strictly personal and the opinions expressed do not
represent those of my employers, either past or present.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
@ 2014-07-03 18:39     ` David Drysdale
  0 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-07-03 18:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API, qemu-devel

On Thu, Jul 03, 2014 at 11:12:33AM +0200, Paolo Bonzini wrote:
> Il 30/06/2014 12:28, David Drysdale ha scritto:
> >Hi all,
> >
> >The last couple of versions of FreeBSD (9.x/10.x) have included the
> >Capsicum security framework [1], which allows security-aware
> >applications to sandbox themselves in a very fine-grained way.  For
> >example, OpenSSH now (>= 6.5) uses Capsicum in its FreeBSD version to
> >restrict sshd's credentials checking process, to reduce the chances of
> >credential leakage.
>
> Hi David,
>
> we've had similar goals in QEMU.  QEMU can be used as a virtual
> machine monitor from the command line, but it also has an API that
> lets a management tool drive QEMU via AF_UNIX sockets.  Long term,
> we would like to have a restricted mode for QEMU where all file
> descriptors are obtained via SCM_RIGHTS or /dev/fd, and syscalls can
> be locked down.
>
> Currently we do use seccomp v2 BPF filters, but unfortunately this
> didn't help very much.  QEMU supports hotplugging hence the filter
> must whitelist anything that _might_ be used in the future, which is
> generally... too much.
>
> Something like Capsicum would be really nice because it attaches
> capabilities to file descriptors.  However, I wonder how
> extensible Capsicum could be, and I am worried about the
> proliferation of capabilities that its design naturally leads to.

True, capability rights are likely to expand over time (although
FreeBSD only expanded from 55 to 60 between 9.x and 10.x).
 
> Given Linux's previous experience with BPF filters, what do you
> think about attaching specific BPF programs to file descriptors?
> Then whenever a syscall is run that affects a file descriptor, the
> BPF program for the file descriptor (attached to a struct file* as
> in Capsicum) would run in addition to the process-wide filter.

That sounds kind of clever, but also kind of complicated.

Off the top of my head, one particular problem is that not all
fd->struct file conversions in the kernel are completely specified
by an enclosing syscall and the explicit values of its parameters.

For example, the actual contents of the arguments to io_submit(2)
aren't visible to a seccomp-bpf program (as it can't read the __user
memory for the iocb structures), and so it can't distinguish a
read from a write.
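
To make the io_submit(2) point concrete, here is a minimal sketch (not
taken from the patchset; the struct visible_args type and both helper
names are made up for illustration).  It builds the three register-level
arguments a syscall-entry filter would be handed and shows that they are
identical for a read request and a write request, because the opcode
lives in the iocb in user memory.  Nothing is actually submitted.

```c
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>

/* The register-level arguments of io_submit(ctx, nr, iocbpp).  A seccomp-bpf
 * filter is handed these values and cannot follow the pointer.
 * (Illustrative type, not a kernel structure.) */
struct visible_args {
    uint64_t ctx;
    uint64_t nr;
    uint64_t iocbpp;
};

static struct visible_args io_submit_visible(aio_context_t ctx, long nr,
                                             struct iocb **iocbpp)
{
    struct visible_args v = {
        .ctx = (uint64_t)ctx,
        .nr = (uint64_t)nr,
        .iocbpp = (uint64_t)(uintptr_t)iocbpp,
    };
    return v;
}

/* Returns 1 if a read request and a write request look identical at the
 * syscall boundary.  Only the arguments are built; no I/O is submitted. */
static int read_and_write_look_alike(void)
{
    struct iocb cb;
    struct iocb *list[1] = { &cb };
    struct visible_args as_read, as_write;

    memset(&cb, 0, sizeof(cb));
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    as_read = io_submit_visible(0, 1, list);

    cb.aio_lio_opcode = IOCB_CMD_PWRITE;   /* only user memory changes */
    as_write = io_submit_visible(0, 1, list);

    return memcmp(&as_read, &as_write, sizeof(as_read)) == 0;
}
```

The function returns 1: switching aio_lio_opcode changes nothing that a
filter running at syscall entry could observe.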

Also, there could potentially be some odd interactions with file
descriptors passed between processes, if the BPF program relies
on assumptions about the environment of the original process.  For
example, what happens if an x86_64 process passes a filter-attached
FD to an ia32 process?  Given that the syscall numbers are
arch-specific, I guess that means the filter program would have
to include arch-specific branches for any possible variant.
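
As a rough sketch of what "arch-specific branches" means in practice,
the classic-BPF program below allows only read(2) and has to test
seccomp_data.arch first, because read is syscall 0 on x86_64 but 3 on
ia32.  The filter layout, the tiny run_filter() interpreter and the
verdict() helper are assumptions made for this example (the interpreter
handles only the three instruction forms used here); a real per-FD
filter, if it existed, might look quite different.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* read(2) syscall numbers: 0 on x86_64, 3 on ia32. */
#define NR_READ_X86_64 0
#define NR_READ_I386   3

static const struct sock_filter read_only_filter[] = {
    /* 0 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* 1 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2),
    /* 2 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* 3 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NR_READ_X86_64, 4, 3),
    /* 4 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_I386, 0, 2),
    /* 5 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* 6 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NR_READ_I386, 1, 0),
    /* 7 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* 8 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Toy interpreter for the three instruction forms used above, so the
 * filter can be exercised without installing it. */
static uint32_t run_filter(const struct sock_filter *prog, size_t len,
                           const struct seccomp_data *sd)
{
    uint32_t acc = 0;

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *in = &prog[pc];

        switch (in->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            memcpy(&acc, (const char *)sd + in->k, sizeof(acc));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* jump offsets are relative to the next instruction */
            pc += (acc == in->k) ? in->jt : in->jf;
            break;
        case BPF_RET | BPF_K:
            return in->k;
        default:
            return SECCOMP_RET_KILL;
        }
    }
    return SECCOMP_RET_KILL;
}

/* Verdict of the filter for a given (arch, syscall number) pair. */
static uint32_t verdict(uint32_t arch, int nr)
{
    struct seccomp_data sd;

    memset(&sd, 0, sizeof(sd));
    sd.arch = arch;
    sd.nr = nr;
    return run_filter(read_only_filter,
                      sizeof(read_only_filter) / sizeof(read_only_filter[0]),
                      &sd);
}
```

With this filter, verdict(AUDIT_ARCH_X86_64, 0) and
verdict(AUDIT_ARCH_I386, 3) both allow the call, while syscall number 3
on x86_64 (which is close(2) there) is rejected.  Every additional ABI
needs its own branch of the same shape.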

More generally, I suspect that keeping things simpler will end
up being more secure.  Capsicum was based on well-studied ideas
from the world of object capability-based security, and I'd be
nervous about adding complications that take us further away from
that. 

> An equivalent of PR_SET_NO_NEW_PRIVS can also be added to file
> descriptors, so that a program that doesn't lock down syscalls can
> still lock down the operations (including fcntls and ioctls) on
> specific file descriptors.
>
> Converting FreeBSD capabilities to BPF programs can be easily
> implemented in userspace.

I get the idea, but I'm not sure it would be that easy!  The
BPF-generation library would need to hold all of the mappings
from system calls (and their arguments) to the equivalent
required rights -- and vice versa.

That mapping would also need to be kept closely in sync with the kernel
and other system libraries -- if a new syscall is added and libc (or
some other library) starts using it, the equivalent BPF chunks would
need to be updated to cope.
 
> >  [Capsicum also includes 'capability mode', which locks down the
> >  available syscalls so the rights restrictions can't just be bypassed
> >  by opening new file descriptors; I'll describe that separately later.]
>
> This can also be implemented in userspace via seccomp and
> PR_SET_NO_NEW_PRIVS.

Well, mostly (and in fact I've got an attempt to do exactly that at
https://github.com/google/capsicum-test/blob/dev/linux-bpf-capmode.c).

But there are a few wrinkles with that approach.

First, we need Kees Cook's patches to allow seccomp filters
to be synchronized across existing threads, but hopefully they
will make it in soon.

Next, there's one awkward syscall case.  In capability mode we'd like
to prevent processes from sending signals with kill(2)/tgkill(2)
to other processes, but they should still be able to send themselves
signals.  For example, abort(3) generates:
  tgkill(gettid(), gettid(), SIGABRT)

Only allowing kill(self) is hard to encode in a seccomp-bpf program, at
least in a way that survives forking.
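
A small sketch of why the fork case is awkward, under the assumption
that the filter compares tgkill()'s first argument against an id baked
in when the filter is built (both helper names here are hypothetical).
A classic-BPF comparison carries its operand as an immediate, so the
value is frozen at build time.

```c
#include <linux/filter.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Build the comparison a filter might use for "is the target of tgkill()
 * my own thread-group id?".  The id becomes an immediate operand, so it is
 * fixed at the moment the filter is built.  (Hypothetical helper.) */
static struct sock_filter build_self_check(pid_t self)
{
    struct sock_filter insn =
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (uint32_t)self, 0, 1);
    return insn;
}

/* Would that instruction match a tgkill() aimed at tgid?
 * (Hypothetical helper; models only the equality test.) */
static int self_check_matches(struct sock_filter insn, pid_t tgid)
{
    return insn.k == (uint32_t)tgid;
}
```

A filter built in the parent matches getpid() there, but a forked child
inherits the same instruction with the parent's id in it, so the child's
own tgkill(self) no longer matches.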

Finally, capability mode also turns on strict-relative lookups
process-wide; in other words, every openat(dfd, ...) operation
acts as though it has the O_BENEATH_ONLY flag set, regardless of
whether the dfd is a Capsicum capability.  I can't see a way to
do that with a BPF program (although it would be possible to add
a filter that polices the requirement to include O_BENEATH_ONLY
rather than implicitly adding it).
 
So although a capability-mode implementation in terms of seccomp-bpf
is tantalizingly close, at the moment I've got it implemented as a new
seccomp mode.

> >  [Policing the rights checks anywhere else, for example at the system
> >  call boundary, isn't a good idea because it opens up the possibility
> >  of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
> >  changed (as openat/close/dup2 are allowed in capability mode) between
> >  the 'check' at syscall entry and the 'use' at fget() invocation.]
>
> In the case of BPF filters, I wonder if you could stash the BPF
> "environment" somewhere and then use it at fget() invocation.
> Alternatively, it can be reconstructed at fget() time, similar to
> your introduction of fgetr().

Stashing something at syscall entry to be referred to later always
makes me worry about TOCTOU vulnerabilities, but the details might
be OK in this case (given that no check occurs at syscall entry)...
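
For readers less familiar with the hazard, here is a userspace-only
sketch of the general pattern (the helper names are invented for the
example, and it says nothing about whether the stashing idea itself is
safe).  An FD number is checked, rebound with dup2(2), and then used;
the check result no longer describes what the number refers to.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* 1 if fd refers to a pipe/FIFO, 0 if not, -1 on error. */
static int is_fifo(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    return S_ISFIFO(st.st_mode) ? 1 : 0;
}

/* Returns 1 if the same FD number passed the check as a pipe but named
 * something else by the time of use, 0 if not, -1 on error. */
static int fd_number_rebound_between_check_and_use(void)
{
    int p[2];
    int other, checked, used;

    if (pipe(p) < 0)
        return -1;
    other = open("/dev/null", O_RDONLY);
    if (other < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    checked = is_fifo(p[0]);          /* time of check: a pipe */
    if (dup2(other, p[0]) < 0) {      /* in real life: another thread */
        close(other);
        close(p[0]);
        close(p[1]);
        return -1;
    }
    used = is_fifo(p[0]);             /* time of use: now /dev/null */

    close(other);
    close(p[0]);
    close(p[1]);
    return checked == 1 && used == 0;
}
```

The function returns 1.  In the sketch the rebinding happens on the same
thread for simplicity; the concern in the thread is a second thread
doing the dup2() in the window between check and use.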
 
> Thanks,
>
> Paolo

Many thanks for taking the time to comment and think of innovative
ideas!

David

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
@ 2014-07-04  7:03       ` Paolo Bonzini
  0 siblings, 0 replies; 87+ messages in thread
From: Paolo Bonzini @ 2014-07-04  7:03 UTC (permalink / raw)
  To: David Drysdale
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API, qemu-devel


Il 03/07/2014 20:39, David Drysdale ha scritto:
> On Thu, Jul 03, 2014 at 11:12:33AM +0200, Paolo Bonzini wrote:
>> Given Linux's previous experience with BPF filters, what do you
>> think about attaching specific BPF programs to file descriptors?
>> Then whenever a syscall is run that affects a file descriptor, the
>> BPF program for the file descriptor (attached to a struct file* as
>> in Capsicum) would run in addition to the process-wide filter.
>
> That sounds kind of clever, but also kind of complicated.
>
> Off the top of my head, one particular problem is that not all
> fd->struct file conversions in the kernel are completely specified
> by an enclosing syscall and the explicit values of its parameters.
>
> For example, the actual contents of the arguments to io_submit(2)
> aren't visible to a seccomp-bpf program (as it can't read the __user
> memory for the iocb structures), and so it can't distinguish a
> read from a write.

I think that's more easily done by opening the file as
O_RDONLY/O_WRONLY/O_RDWR.  With BPF, you could also do it by running the
file descriptor's seccomp-bpf program once per iocb with synthesized
syscall numbers and argument vectors.
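
As a small illustration of the first option, the access mode chosen at
open(2) time can be read back with fcntl(F_GETFL) and masked with
O_ACCMODE.  The helper name fd_allows_write() is made up for this
sketch.

```c
#include <fcntl.h>

/* Returns 1 if the open file description behind fd permits writing,
 * 0 if it does not, -1 on error.  (Illustrative helper.) */
static int fd_allows_write(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    int mode;

    if (flags < 0)
        return -1;
    mode = flags & O_ACCMODE;
    return mode == O_WRONLY || mode == O_RDWR;
}
```

For example, an FD from open("/dev/null", O_RDONLY) yields 0 and one
opened O_WRONLY yields 1, regardless of which syscall later tries to
write through it.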

BTW, there's one thing I'm not sure I understand (because my knowledge
of VFS is really only cursory).  Are the capabilities associated with
the file _descriptor_ (a la F_GETFD/SETFD) or _description_
(F_GETFL/SETFL)?

If it is the former, there is some value in read/write capabilities 
because you could for example block a child process from reading an 
eventfd and simulate the two file descriptors returned by pipe(2).  But 
if it is the latter, it looks like an important usability problem in 
the Capsicum model.  (Granted, it's just about usability---in the end 
it does exactly what it's meant and documented to do).
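
The descriptor/description distinction itself can be seen with plain
fcntl(2), independent of Capsicum.  In the sketch below (the function
name is invented), two descriptors from dup(2) share one open file
description, so O_NONBLOCK set through one is visible through the other,
while FD_CLOEXEC stays per-descriptor.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if status flags are shared across dup() while descriptor
 * flags are not, 0 if not, -1 on error. */
static int dup_shares_status_not_fd_flags(void)
{
    int p[2];
    int copy, fl, shared, separate;

    if (pipe(p) < 0)
        return -1;
    copy = dup(p[0]);
    if (copy < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    /* Per-description state: set O_NONBLOCK through the original... */
    fl = fcntl(p[0], F_GETFL);
    fcntl(p[0], F_SETFL, fl | O_NONBLOCK);
    /* ...and it shows up through the duplicate. */
    shared = (fcntl(copy, F_GETFL) & O_NONBLOCK) != 0;

    /* Per-descriptor state: FD_CLOEXEC on the original only. */
    fcntl(p[0], F_SETFD, FD_CLOEXEC);
    separate = (fcntl(copy, F_GETFD) & FD_CLOEXEC) == 0;

    close(copy);
    close(p[0]);
    close(p[1]);
    return shared && separate;
}
```

So rights stored like F_GETFD state could differ between two
descriptors for the same open file, whereas rights stored like F_GETFL
state would be common to every duplicate.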

> Also, there could potentially be some odd interactions with file
> descriptors passed between processes, if the BPF program relies
> on assumptions about the environment of the original process.  For
> example, what happens if an x86_64 process passes a filter-attached
> FD to an ia32 process?  Given that the syscall numbers are
> arch-specific, I guess that means the filter program would have
> to include arch-specific branches for any possible variant.

This is the same for using seccompv2 to limit child processes, no?  So 
there may be a problem but it has to be solved anyway by libseccomp.

> More generally, I suspect that keeping things simpler will end
> up being more secure.  Capsicum was based on well-studied ideas
> from the world of object capability-based security, and I'd be
> nervous about adding complications that take us further away from
> that.

True.

> That mapping would also need to be kept closely in sync with the kernel
> and other system libraries -- if a new syscall is added and libc (or
> some other library) starts using it, the equivalent BPF chunks would
> need to be updated to cope.

Again, this is the same problem that has to be solved for process-wide 
seccompv2.

>>>  [Capsicum also includes 'capability mode', which locks down the
>>>  available syscalls so the rights restrictions can't just be bypassed
>>>  by opening new file descriptors; I'll describe that separately later.]
>>
>> This can also be implemented in userspace via seccomp and
>> PR_SET_NO_NEW_PRIVS.
>
> Well, mostly (and in fact I've got an attempt to do exactly that at
> https://github.com/google/capsicum-test/blob/dev/linux-bpf-capmode.c).
>
> [..] there's one awkward syscall case.  In capability mode we'd like
> to prevent processes from sending signals with kill(2)/tgkill(2)
> to other processes, but they should still be able to send themselves
> signals.  For example, abort(3) generates:
>   tgkill(gettid(), gettid(), SIGABRT)
>
> Only allowing kill(self) is hard to encode in a seccomp-bpf program, at
> least in a way that survives forking.

I guess the thread id could be added as a special seccomp-bpf argument 
(ancillary datum?).

> Finally, capability mode also turns on strict-relative lookups
> process-wide; in other words, every openat(dfd, ...) operation
> acts as though it has the O_BENEATH_ONLY flag set, regardless of
> whether the dfd is a Capsicum capability.  I can't see a way to
> do that with a BPF program (although it would be possible to add
> a filter that polices the requirement to include O_BENEATH_ONLY
> rather than implicitly adding it).

That can be a new prctl (one that PR_SET_NO_NEW_PRIVS would lock up). 
It seems useful independent of Capsicum, and the Linux APIs tend to be 
fine-grained more often than coarse-grained.

>>>  [Policing the rights checks anywhere else, for example at the system
>>>  call boundary, isn't a good idea because it opens up the possibility
>>>  of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
>>>  changed (as openat/close/dup2 are allowed in capability mode) between
>>>  the 'check' at syscall entry and the 'use' at fget() invocation.]
>>
>> In the case of BPF filters, I wonder if you could stash the BPF
>> "environment" somewhere and then use it at fget() invocation.
>> Alternatively, it can be reconstructed at fget() time, similar to
>> your introduction of fgetr().
>
> Stashing something at syscall entry to be referred to later always
> makes me worry about TOCTOU vulnerabilities, but the details might
> be OK in this case (given that no check occurs at syscall entry)...

Yeah, that was pretty much the idea.  But I was cautious enough to 
label it with "I wonder"...

Paolo

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
@ 2014-07-04  7:03       ` Paolo Bonzini
  0 siblings, 0 replies; 87+ messages in thread
From: Paolo Bonzini @ 2014-07-04  7:03 UTC (permalink / raw)
  To: David Drysdale
  Cc: LSM List, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Greg Kroah-Hartman, Alexander Viro, Meredydd Luff, Kees Cook,
	James Morris, Linux API, qemu-devel


Il 03/07/2014 20:39, David Drysdale ha scritto:
> On Thu, Jul 03, 2014 at 11:12:33AM +0200, Paolo Bonzini wrote:
>> Given Linux's previous experience with BPF filters, what do you
>> think about attaching specific BPF programs to file descriptors?
>> Then whenever a syscall is run that affects a file descriptor, the
>> BPF program for the file descriptor (attached to a struct file* as
>> in Capsicum) would run in addition to the process-wide filter.
>
> That sounds kind of clever, but also kind of complicated.
>
> Off the top of my head, one particular problem is that not all
> fd->struct file conversions in the kernel are completely specified
> by an enclosing syscall and the explicit values of its parameters.
>
> For example, the actual contents of the arguments to io_submit(2)
> aren't visible to a seccomp-bpf program (as it can't read the __user
> memory for the iocb structures), and so it can't distinguish a
> read from a write.

I think that's more easily done by opening the file as O_RDONLY/O_WRONLY
/O_RDWR.   You could do it by running the file descriptor's seccomp-bpf 
program once per iocb with synthesized syscall numbers and argument 
vectors.

BTW, there's one thing I'm not sure I understand (because my knowledge 
of VFS is really only cursory).  Are the capabilities associated to the 
file _descriptor_ (a la F_GETFD/SETFD) or _description_ 
(F_GETFL/SETFL)?!?

If it is the former, there is some value in read/write capabilities 
because you could for example block a child process from reading an 
eventfd and simulate the two file descriptors returned by pipe(2).  But 
if it is the latter, it looks like an important usability problem in 
the Capsicum model.  (Granted, it's just about usability---in the end 
it does exactly what it's meant and documented to do).

> Also, there could potentially be some odd interactions with file
> descriptors passed between processes, if the BPF program relies
> on assumptions about the environment of the original process.  For
> example, what happens if an x86_64 process passes a filter-attached
> FD to an ia32 process?  Given that the syscall numbers are
> arch-specific, I guess that means the filter program would have
> to include arch-specific branches for any possible variant.

This is the same for using seccompv2 to limit child processes, no?  So 
there may be a problem but it has to be solved anyway by libseccomp.

> More generally, I suspect that keeping things simpler will end
> up being more secure.  Capsicum was based on well-studied ideas
> from the world of object capability-based security, and I'd be
> nervous about adding complications that take us further away from
> that.

True.

> That mapping would also need be kept closely in sync with the kernel
> and other system libraries -- if a new syscall is added and libc (or
> some other library) started using it, the equivalent BPF chunks would
> need to be updated to cope.

Again, this is the same problem that has to be solved for process-wide 
seccompv2.

>>>  [Capsicum also includes 'capability mode', which locks down the
>>>  available syscalls so the rights restrictions can't just be bypassed
>>>  by opening new file descriptors; I'll describe that separately later.]
>>
>> This can also be implemented in userspace via seccomp and
>> PR_SET_NO_NEW_PRIVS.
>
> Well, mostly (and in fact I've got an attempt to do exactly that at
> https://github.com/google/capsicum-test/blob/dev/linux-bpf-capmode.c).
>
> [..] there's one awkward syscall case.  In capability mode we'd like
> to prevent processes from sending signals with kill(2)/tgkill(2)
> to other processes, but they should still be able to send themselves
> signals.  For example, abort(3) generates:
>   tgkill(gettid(), gettid(), SIGABRT)
>
> Only allowing kill(self) is hard to encode in a seccomp-bpf program, at
> least in a way that survives forking.

I guess the thread id could be added as a special seccomp-bpf argument 
(ancillary datum?).

> Finally, capability mode also turns on strict-relative lookups
> process-wide; in other words, every openat(dfd, ...) operation
> acts as though it has the O_BENEATH_ONLY flag set, regardless of
> whether the dfd is a Capsicum capability.  I can't see a way to
> do that with a BPF program (although it would be possible to add
> a filter that polices the requirement to include O_BENEATH_ONLY
> rather than implicitly adding it).

That can be a new prctl (one that PR_SET_NO_NEW_PRIVS would lock up). 
It seems useful independent of Capsicum, and the Linux APIs tend to be 
fine-grained more often than coarse-grained.

>>>  [Policing the rights checks anywhere else, for example at the system
>>>  call boundary, isn't a good idea because it opens up the possibility
>>>  of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
>>>  changed (as openat/close/dup2 are allowed in capability mode) between
>>>  the 'check' at syscall entry and the 'use' at fget() invocation.]
>>
>> In the case of BPF filters, I wonder if you could stash the BPF
>> "environment" somewhere and then use it at fget() invocation.
>> Alternatively, it can be reconstructed at fget() time, similar to
>> your introduction of fgetr().
>
> Stashing something at syscall entry to be referred to later always
> makes me worry about TOCTOU vulnerabilities, but the details might
> be OK in this case (given that no check occurs at syscall entry)...

Yeah, that was pretty much the idea.  But I was cautious enough to 
label it with "I wonder"...

Paolo

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
  2014-07-04  7:03       ` Paolo Bonzini
@ 2014-07-07 10:29         ` David Drysdale
  -1 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-07-07 10:29 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API, qemu-devel

On Fri, Jul 4, 2014 at 8:03 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Il 03/07/2014 20:39, David Drysdale ha scritto:
>> On Thu, Jul 03, 2014 at 11:12:33AM +0200, Paolo Bonzini wrote:
>>> Given Linux's previous experience with BPF filters, what do you
>>> think about attaching specific BPF programs to file descriptors?
>>> Then whenever a syscall is run that affects a file descriptor, the
>>> BPF program for the file descriptor (attached to a struct file* as
>>> in Capsicum) would run in addition to the process-wide filter.
>>
>> That sounds kind of clever, but also kind of complicated.
>>
>> Off the top of my head, one particular problem is that not all
>> fd->struct file conversions in the kernel are completely specified
>> by an enclosing syscall and the explicit values of its parameters.
>>
>> For example, the actual contents of the arguments to io_submit(2)
>> aren't visible to a seccomp-bpf program (as it can't read the __user
>> memory for the iocb structures), and so it can't distinguish a
>> read from a write.
>
> I think that's more easily done by opening the file as O_RDONLY/O_WRONLY
> /O_RDWR.   You could do it by running the file descriptor's seccomp-bpf
> program once per iocb with synthesized syscall numbers and argument
> vectors.

Right, but generating the equivalent seccomp input environment for an
equivalent single-fd syscall is going to be subtle and complex (which
are worrying words to mention in a security context).  And how many
other syscalls are going to need similar special-case processing?
(poll? select? send[m]msg? ...)

> BTW, there's one thing I'm not sure I understand (because my knowledge
> of VFS is really only cursory).  Are the capabilities associated to the
> file _descriptor_ (a la F_GETFD/SETFD) or _description_
> (F_GETFL/SETFL)?!?

Capsicum capabilities are associated with the file descriptor (a la
F_GETFD), not the open file itself -- different FDs with different
associated rights can map to the same underlying open file.
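
To make the descriptor/description distinction concrete, here is a
minimal sketch in plain C on stock Linux (no Capsicum involved).  The
struct and function names are made up for illustration.  FD_CLOEXEC
stands in for a per-descriptor attribute, which is the level at which
Capsicum rights live, while O_APPEND and the file offset belong to the
shared open file description:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical result record: what the dup'd descriptor observes. */
struct fd_sharing {
    int cloexec_seen_on_dup;   /* 1 if FD_CLOEXEC is visible via the duplicate */
    int append_seen_on_dup;    /* 1 if O_APPEND is visible via the duplicate */
    long offset_seen_on_dup;   /* file offset reported via the duplicate */
};

/* Returns 0 on success and fills *out, -1 on any syscall failure. */
int probe_fd_sharing(struct fd_sharing *out)
{
    char path[] = "/tmp/fd-sharing-XXXXXX";
    int fd1, fd2, status_flags, rc = -1;

    assert(out != NULL);
    fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;
    unlink(path);              /* the open descriptors keep the file alive */

    fd2 = dup(fd1);            /* second descriptor, same open file description */
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }

    status_flags = fcntl(fd1, F_GETFL);
    if (status_flags >= 0 &&
        fcntl(fd1, F_SETFD, FD_CLOEXEC) == 0 &&              /* per descriptor */
        fcntl(fd1, F_SETFL, status_flags | O_APPEND) == 0 && /* per description */
        write(fd1, "abc", 3) == 3) {
        int fd_flags = fcntl(fd2, F_GETFD);
        int fl_flags = fcntl(fd2, F_GETFL);
        off_t pos = lseek(fd2, 0, SEEK_CUR);

        if (fd_flags >= 0 && fl_flags >= 0 && pos >= 0) {
            out->cloexec_seen_on_dup = (fd_flags & FD_CLOEXEC) != 0;
            out->append_seen_on_dup = (fl_flags & O_APPEND) != 0;
            out->offset_seen_on_dup = (long)pos;
            rc = 0;
        }
    }

    close(fd2);
    close(fd1);
    return rc;
}
```

Everything here is changed through fd1 and observed through fd2: the
FD_CLOEXEC setting should not show up on the duplicate, while O_APPEND
and the advanced offset should.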

> If it is the former, there is some value in read/write capabilities
> because you could for example block a child process from reading an
> eventfd and simulate the two file descriptors returned by pipe(2).  But
> if it is the latter, it looks like an important usability problem in
> the Capsicum model.  (Granted, it's just about usability---in the end
> it does exactly what it's meant and documented to do).

Attaching the rights to the FD also comes back to the association with
object-capability security.  The FD is an unforgeable reference to the
object (file) in question, but these references (with their rights) can
be transferred to other programs -- either by inheritance after fork, or
by explicitly passing the FD across a Unix domain socket.
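
For reference, the explicit-passing case looks roughly like this on
stock Linux.  This is only a sketch of the standard SCM_RIGHTS mechanism;
it involves no Capsicum rights, and the helper names (send_fd, recv_fd,
fd_passing_roundtrip) are invented for the example.  Both ends run in
one process here purely to keep it short:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one descriptor over a Unix domain socket; 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 'F';           /* one byte of ordinary data carries the cmsg */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;  /* forces suitable alignment for buf */
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new FD number, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Pass a pipe's read end through a socketpair and read through the copy.
 * Returns 1 if the received FD has a different number yet yields the
 * pending data, 0 if not, -1 if the sockets or pipe could not be set up. */
int fd_passing_roundtrip(void)
{
    int sv[2], pfd[2], got, ok = 0;
    char buf[2] = { 0, 0 };

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (pipe(pfd) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    if (write(pfd[1], "hi", 2) == 2 && send_fd(sv[0], pfd[0]) == 0) {
        got = recv_fd(sv[1]);
        if (got >= 0) {
            ok = got != pfd[0] && read(got, buf, 2) == 2 &&
                 memcmp(buf, "hi", 2) == 0;
            close(got);
        }
    }

    close(pfd[0]);
    close(pfd[1]);
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

The receiver ends up with a new descriptor number referring to the same
underlying object, which is the reference-transfer property the
object-capability view relies on.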

>> Also, there could potentially be some odd interactions with file
>> descriptors passed between processes, if the BPF program relies
>> on assumptions about the environment of the original process.  For
>> example, what happens if an x86_64 process passes a filter-attached
>> FD to an ia32 process?  Given that the syscall numbers are
>> arch-specific, I guess that means the filter program would have
>> to include arch-specific branches for any possible variant.
>
> This is the same for using seccompv2 to limit child processes, no?  So
> there may be a problem but it has to be solved anyway by libseccomp.

I don't know whether libseccomp would worry about this, but being able
to send FDs between processes via Unix domain sockets makes this more
visible in the Capsicum case.

>> More generally, I suspect that keeping things simpler will end
>> up being more secure.  Capsicum was based on well-studied ideas
>> from the world of object capability-based security, and I'd be
>> nervous about adding complications that take us further away from
>> that.
>
> True.
>
>> That mapping would also need be kept closely in sync with the kernel
>> and other system libraries -- if a new syscall is added and libc (or
>> some other library) started using it, the equivalent BPF chunks would
>> need to be updated to cope.
>
> Again, this is the same problem that has to be solved for process-wide
> seccompv2.

True.  I guess new syscalls are sufficiently rare in practice that this
isn't a serious concern.

>>>>  [Capsicum also includes 'capability mode', which locks down the
>>>>  available syscalls so the rights restrictions can't just be bypassed
>>>>  by opening new file descriptors; I'll describe that separately later.]
>>>
>>> This can also be implemented in userspace via seccomp and
>>> PR_SET_NO_NEW_PRIVS.
>>
>> Well, mostly (and in fact I've got an attempt to do exactly that at
>> https://github.com/google/capsicum-test/blob/dev/linux-bpf-capmode.c).
>>
>> [..] there's one awkward syscall case.  In capability mode we'd like
>> to prevent processes from sending signals with kill(2)/tgkill(2)
>> to other processes, but they should still be able to send themselves
>> signals.  For example, abort(3) generates:
>>   tgkill(gettid(), gettid(), SIGABRT)
>>
>> Only allowing kill(self) is hard to encode in a seccomp-bpf program, at
>> least in a way that survives forking.
>
> I guess the thread id could be added as a special seccomp-bpf argument
> (ancillary datum?).

Yeah, I tried exactly that a while ago
(https://github.com/google/capsicum-linux/commit/e163c6348328)
but didn't run with it because of the process-wide beneath-only issue below.
But a combination of that and your new prctl() suggestion below might do
the trick.

>> Finally, capability mode also turns on strict-relative lookups
>> process-wide; in other words, every openat(dfd, ...) operation
>> acts as though it has the O_BENEATH_ONLY flag set, regardless of
>> whether the dfd is a Capsicum capability.  I can't see a way to
>> do that with a BPF program (although it would be possible to add
>> a filter that polices the requirement to include O_BENEATH_ONLY
>> rather than implicitly adding it).
>
> That can be a new prctl (one that PR_SET_NO_NEW_PRIVS would lock up).
> It seems useful independent of Capsicum, and the Linux APIs tend to be
> fine-grained more often than coarse-grained.

That sounds like a good idea, particularly in combination with the idea
above -- thanks!  I'll have a think/investigate...

>>>>  [Policing the rights checks anywhere else, for example at the system
>>>>  call boundary, isn't a good idea because it opens up the possibility
>>>>  of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
>>>>  changed (as openat/close/dup2 are allowed in capability mode) between
>>>>  the 'check' at syscall entry and the 'use' at fget() invocation.]
>>>
>>> In the case of BPF filters, I wonder if you could stash the BPF
>>> "environment" somewhere and then use it at fget() invocation.
>>> Alternatively, it can be reconstructed at fget() time, similar to
>>> your introduction of fgetr().
>>
>> Stashing something at syscall entry to be referred to later always
>> makes me worry about TOCTOU vulnerabilities, but the details might
>> be OK in this case (given that no check occurs at syscall entry)...
>
> Yeah, that was pretty much the idea.  But I was cautious enough to
> label it with "I wonder"...
>
> Paolo

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
  2014-07-07 10:29         ` [Qemu-devel] " David Drysdale
@ 2014-07-07 12:20           ` Paolo Bonzini
  -1 siblings, 0 replies; 87+ messages in thread
From: Paolo Bonzini @ 2014-07-07 12:20 UTC (permalink / raw)
  To: David Drysdale
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API, qemu-devel

Il 07/07/2014 12:29, David Drysdale ha scritto:
>> I think that's more easily done by opening the file as O_RDONLY/O_WRONLY
>> /O_RDWR.   You could do it by running the file descriptor's seccomp-bpf
>> program once per iocb with synthesized syscall numbers and argument
>> vectors.
>
> Right, but generating the equivalent seccomp input environment for an
> equivalent single-fd syscall is going to be subtle and complex (which
> are worrying words to mention in a security context).  And how many
> other syscalls are going to need similar special-case processing?
> (poll? select? send[m]msg? ...)

Yeah, the difficult part is getting the right balance between:

1) limitations due to seccomp's inability to chase pointers (which
is not something that can be lifted, as it's required for correctness)

2) subtlety and complexity of the resulting code.

The problem arises when you have a single syscall operating on
multiple file descriptors.  So for example, among the cases you mention,
poll and select are problematic; sendm?msg are not.  They would be if
Capsicum had a capability for SCM_RIGHTS file descriptor passing, but I
cannot find one.

But then you also have to strike the right balance between a complete 
solution and an overengineered one.

For example, even though poll and select are problematic, I wonder what
the point would really be in blocking them; poll/select are
level-triggered, and calling them should be idempotent as far as the
file descriptor is concerned.  You might want to prevent a process/thread
from issuing blocking system calls, but you'd do that with a per-process
filter, not with per-file-descriptor filters or capabilities.
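
A small sketch of both points, in plain C with invented helper names
(poll_pipe_ends, demo_level_triggered): a single poll(2) call covers two
descriptors through an array in user memory, so a filter that only sees
the syscall arguments gets a pointer and a count rather than the FDs
themselves, and repeating the call does not change what it reports:

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* One poll(2) call covering both pipe ends.  Returns a bitmask:
 * bit 0 set if the read end is readable, bit 1 set if the write end
 * is writable; -1 on error. */
int poll_pipe_ends(int rd, int wr)
{
    struct pollfd fds[2] = {
        { .fd = rd, .events = POLLIN },
        { .fd = wr, .events = POLLOUT },
    };

    if (poll(fds, 2, 0) < 0)   /* timeout 0: report state, never block */
        return -1;
    return ((fds[0].revents & POLLIN) ? 1 : 0) |
           ((fds[1].revents & POLLOUT) ? 2 : 0);
}

/* Records three readiness snapshots: two back-to-back with one byte
 * pending, then one after that byte has been consumed.
 * Returns 0 on success, -1 on failure. */
int demo_level_triggered(int snapshots[3])
{
    int pfd[2], rc = -1;
    char c;

    assert(snapshots != NULL);
    if (pipe(pfd) < 0)
        return -1;

    if (write(pfd[1], "x", 1) == 1) {
        snapshots[0] = poll_pipe_ends(pfd[0], pfd[1]);
        snapshots[1] = poll_pipe_ends(pfd[0], pfd[1]);
        if (read(pfd[0], &c, 1) == 1) {
            snapshots[2] = poll_pipe_ends(pfd[0], pfd[1]);
            rc = 0;
        }
    }

    close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```

The first two snapshots should be identical (readable and writable), and
only the read() in between changes the third, which is what I mean by
poll being idempotent with respect to the file descriptor.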

> Capsicum capabilities are associated with the file descriptor (a la
> F_GETFD), not the open file itself -- different FDs with different
> associated rights can map to the same underlying open file.

Good to know, thanks.  I suppose you have testcases that cover this.

Paolo

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
@ 2014-07-07 14:11             ` David Drysdale
  0 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-07-07 14:11 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API, qemu-devel

On Mon, Jul 7, 2014 at 1:20 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 07/07/2014 12:29, David Drysdale ha scritto:
>> Capsicum capabilities are associated with the file descriptor (a la
>> F_GETFD), not the open file itself -- different FDs with different
>> associated rights can map to the same underlying open file.
>
>
> Good to know, thanks.  I suppose you have testcases that cover this.
>
> Paolo

Yeah, there's lots of tests at:
  https://github.com/google/capsicum-test
(which is in a separate repo so it's easy to run against
FreeBSD as well as the Linux code); in particular
  https://github.com/google/capsicum-test/blob/dev/capability-fd.cc
has various interactions of capability FDs.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
@ 2014-07-07 22:33             ` Alexei Starovoitov
  0 siblings, 0 replies; 87+ messages in thread
From: Alexei Starovoitov @ 2014-07-07 22:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Drysdale, LSM List, linux-kernel, Greg Kroah-Hartman,
	Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	Linux API, qemu-devel

On Mon, Jul 7, 2014 at 5:20 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 07/07/2014 12:29, David Drysdale ha scritto:
>
>>> I think that's more easily done by opening the file as O_RDONLY/O_WRONLY
>>> /O_RDWR.   You could do it by running the file descriptor's seccomp-bpf
>>> program once per iocb with synthesized syscall numbers and argument
>>> vectors.
>>
>>
>> Right, but generating the equivalent seccomp input environment for an
>> equivalent single-fd syscall is going to be subtle and complex (which
>> are worrying words to mention in a security context).  And how many
>> other syscalls are going to need similar special-case processing?
>> (poll? select? send[m]msg? ...)
>
>
> Yeah, the difficult part is getting the right balance between:
>
> 1) limitations due to seccomp's inability to chase pointers (which is
> not something that can be lifted, as it's required for correctness)

BTW, once seccomp moves to eBPF it will be able to 'chase pointers',
since pointer walking will be possible via a bpf_load_pointer() function call,
which is a wrapper around:
  probe_kernel_read(&ptr, unsafe_ptr, sizeof(void *));
  return ptr;
Not sure whether it helps in this case. Just FYI.

> 2) subtlety and complexity of the resulting code.
>
> The problem arises when you have a single syscall operating on
> multiple file descriptors.  So for example among the cases you mention poll
> and select are problematic; sendm?msg are not.  They would be if Capsicum
> had a capability for SCM_RIGHTS file descriptor passing, but I cannot find
> it.
>
> But then you also have to strike the right balance between a complete
> solution and an overengineered one.
>
> For example, even though poll and select are problematic, I wonder what
> the point of blocking them would really be; poll/select are level-triggered,
> and calling them should be idempotent as far as the file descriptor is
> concerned.  You might want to prevent a process/thread from issuing blocking
> system calls, but you'd do that with a per-process filter, not with
> per-file-descriptor filters or capabilities.
>
>
>> Capsicum capabilities are associated with the file descriptor (a la
>> F_GETFD), not the open file itself -- different FDs with different
>> associated rights can map to the same underlying open file.
>
>
> Good to know, thanks.  I suppose you have testcases that cover this.
>
> Paolo

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
@ 2014-07-08 12:03     ` Christoph Hellwig
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2014-07-08 12:03 UTC (permalink / raw)
  To: David Drysdale
  Cc: linux-security-module, linux-kernel, Greg Kroah-Hartman,
	Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	linux-api

On Mon, Jun 30, 2014 at 11:28:01AM +0100, David Drysdale wrote:
> Add a new O_BENEATH_ONLY flag for openat(2) which restricts the
> provided path, rejecting (with -EACCES) paths that are not beneath
> the provided dfd.  In particular, reject:
>  - paths that contain .. components
>  - paths that begin with /
>  - symlinks that have paths as above.


How is this implemented in FreeBSD?  I can't find any references to
O_BENEATH_ONLY except for your patchset.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
@ 2014-07-08 12:07           ` Christoph Hellwig
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2014-07-08 12:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Drysdale, Al Viro, LSM List, Greg Kroah-Hartman,
	James Morris, Kees Cook, Linux API, Meredydd Luff, linux-kernel,
	linux-man

On Mon, Jun 30, 2014 at 08:53:01AM -0700, Andy Lutomirski wrote:
> > Wouldn't it need to be both O_BENEATH_ONLY (for openat()) and
> > AT_BENEATH_ONLY (for other *at() functions), like O_NOFOLLOW and
> > AT_SYMLINK_NOFOLLOW?  (I.e. aren't the AT_* flags in a different
> > numbering space than O_* flags?)
> >
> > Or am I misunderstanding?
> >
> 
> Ugh, you're probably right.  I wish openat had separate flags and
> atflags arguments.  Oh well.

There are two different AT_* namespaces: the flags that most *at syscalls
take, and the one for the dfd argument, which currently only contains
AT_FDCWD, although a new constant has recently been proposed for it.

Having an AT_BENEATH magic value for the dfd argument certainly feels
elegant to me, but seems to go against the language for openat in POSIX:

"The openat() function shall be equivalent to the open() function except
in the case where path specifies a relative path. In this case the file
to be opened is determined relative to the directory associated with the
file descriptor fd instead of the current working directory. If the file
descriptor was opened without O_SEARCH, the function shall check whether
directory searches are permitted using the current permissions of the
directory underlying the file descriptor. If the file descriptor was
opened with O_SEARCH, the function shall not perform the check.

The oflag parameter and the optional fourth parameter correspond exactly
to the parameters of open().

If openat() is passed the special value AT_FDCWD in the fd parameter,
the current working directory shall be used and the behavior shall be
identical to a call to open()."


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
@ 2014-07-08 12:48             ` Meredydd Luff
  0 siblings, 0 replies; 87+ messages in thread
From: Meredydd Luff @ 2014-07-08 12:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andy Lutomirski, David Drysdale, Al Viro, LSM List,
	Greg Kroah-Hartman, James Morris, Kees Cook, Linux API,
	linux-kernel, linux-man

On 8 July 2014 13:07, Christoph Hellwig <hch@infradead.org> wrote:
> There are two different AT_* namespaces: the flags that most *at syscalls
> take, and the one for the dfd argument, which currently only contains
> AT_FDCWD, although a new constant has recently been proposed for it.
>
> Having an AT_BENEATH magic value for the dfd argument certainly feels
> elegant to me

How would that work? The directory beneath which openat is looking is
conveyed in the dfd argument itself. If I'm understanding this right,
you'd have to pass a different value for "open relative to fd#5" and
"open relative to fd#5, but beneath it only", which doesn't sound
hugely elegant to me.

Meredydd

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
@ 2014-07-08 12:51               ` Christoph Hellwig
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2014-07-08 12:51 UTC (permalink / raw)
  To: Meredydd Luff
  Cc: Christoph Hellwig, Andy Lutomirski, David Drysdale, Al Viro,
	LSM List, Greg Kroah-Hartman, James Morris, Kees Cook, Linux API,
	linux-kernel, linux-man

On Tue, Jul 08, 2014 at 01:48:27PM +0100, Meredydd Luff wrote:
> How would that work? The directory beneath which openat is looking is
> conveyed in the dfd argument itself. If I'm understanding this right,
> you'd have to pass a different value for "open relative to fd#5" and
> "open relative to fd#5, but beneath it only", which doesn't sound
> hugely elegant to me.

Yeah, it won't work for an explicit directory - I was thinking of
working relative to $CWD.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
@ 2014-07-08 13:04                 ` Meredydd Luff
  0 siblings, 0 replies; 87+ messages in thread
From: Meredydd Luff @ 2014-07-08 13:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andy Lutomirski, David Drysdale, Al Viro, LSM List,
	Greg Kroah-Hartman, James Morris, Kees Cook, Linux API,
	linux-kernel, linux-man

On 8 July 2014 13:51, Christoph Hellwig <hch@infradead.org> wrote:
> Yeah, it won't work for an explicit directory - I was thinking of
> working relative to $CWD.

I think that would sacrifice far too much flexibility. Even without
Capsicum, it would be worthwhile to be able to wire up a static
seccomp-bpf filter to enforce constraints such as "you can open files
under fd#5 for reading, but you can only write to files under fd#6,
and you can't do any global lookups."

Meredydd

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
  2014-07-08 13:04                 ` Meredydd Luff
  (?)
@ 2014-07-08 13:12                 ` Christoph Hellwig
  -1 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2014-07-08 13:12 UTC (permalink / raw)
  To: Meredydd Luff
  Cc: Christoph Hellwig, Andy Lutomirski, David Drysdale, Al Viro,
	LSM List, Greg Kroah-Hartman, James Morris, Kees Cook, Linux API,
	linux-kernel, linux-man

On Tue, Jul 08, 2014 at 02:04:45PM +0100, Meredydd Luff wrote:
> On 8 July 2014 13:51, Christoph Hellwig <hch@infradead.org> wrote:
> > Yeah, it won't work for an explicit directory - I was thinking of
> > working relative to $CWD.
> 
> I think that would sacrifice far too much flexibility. Even without
> Capsicum, it would be worthwhile to be able to wire up a static
> seccomp-bpf filter to enforce constraints such as "you can open files
> under fd#5 for reading, but you can only write to files under fd#6,
> and you can't do any global lookups."

Yeah, I didn't intend to advocate this further after your reply.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
@ 2014-07-08 14:58               ` Kees Cook
  0 siblings, 0 replies; 87+ messages in thread
From: Kees Cook @ 2014-07-08 14:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Paolo Bonzini, David Drysdale, LSM List, linux-kernel,
	Greg Kroah-Hartman, Alexander Viro, Meredydd Luff, James Morris,
	Linux API, qemu-devel

On Mon, Jul 7, 2014 at 3:33 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Jul 7, 2014 at 5:20 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Il 07/07/2014 12:29, David Drysdale ha scritto:
>>
>>>> I think that's more easily done by opening the file as O_RDONLY/O_WRONLY
>>>> /O_RDWR.   You could do it by running the file descriptor's seccomp-bpf
>>>> program once per iocb with synthesized syscall numbers and argument
>>>> vectors.
>>>
>>>
>>> Right, but generating the equivalent seccomp input environment for an
>>> equivalent single-fd syscall is going to be subtle and complex (which
>>> are worrying words to mention in a security context).  And how many
>>> other syscalls are going to need similar special-case processing?
>>> (poll? select? send[m]msg? ...)
>>
>>
>> Yeah, the difficult part is getting the right balance between:
>>
>> 1) limitations due to seccomp's impossibility to chase pointers (which is
>> not something that can be lifted, as it's required for correctness)
>
> btw once seccomp moves to eBPF it will be able to 'chase pointers',
> since pointer walking will be possible via bpf_load_pointer() function call,
> which is a wrapper of:
>   probe_kernel_read(&ptr, unsafe_ptr, sizeof(void *));
>   return ptr;
> Not sure whether it helps this case or not. Just fyi.

It won't immediately help, since threads can race pointer target
contents (i.e. seccomp sees one thing, and then the syscall sees
another thing). Having an immutable memory area could help with this
(i.e. some kind of "locked" memory range that holds all the "approved"
argument strings, at which point seccomp could then trust the chased
pointers that land in this range.) Obviously eBPF is a prerequisite to
this, but it isn't the full solution, as far as I understand it.
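
The race can be sketched in userspace C. This is a single-threaded
simulation, not kernel code: a callback stands in for the racing thread, and
all names are made up for the example. The second function uses a
copy-once-then-check approach, which is not the locked memory range
suggested above (a filter cannot make the syscall use its private copy), but
it shows the same requirement: the check and the use must see the same bytes.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SNAPSHOT_LEN 64

/* Stand-in for another thread rewriting the argument between check and use. */
typedef void (*race_hook)(char *user_buf);

static bool path_allowed(const char *path)
{
    return strncmp(path, "/tmp/", 5) == 0;
}

/* Double fetch: the check reads user_buf, then the "syscall" reads it again. */
bool check_in_place_then_use(char *user_buf, race_hook hook,
                             char *used, size_t used_len)
{
    if (!path_allowed(user_buf))
        return false;
    if (hook)
        hook(user_buf);                     /* racing writer runs here */
    snprintf(used, used_len, "%s", user_buf);
    return true;
}

/* Single fetch: copy once, then check and use only the private copy. */
bool copy_then_check_and_use(char *user_buf, race_hook hook,
                             char *used, size_t used_len)
{
    char snapshot[SNAPSHOT_LEN];
    snprintf(snapshot, sizeof snapshot, "%s", user_buf);
    if (!path_allowed(snapshot))
        return false;
    if (hook)
        hook(user_buf);                     /* too late to affect snapshot */
    snprintf(used, used_len, "%s", snapshot);
    return true;
}

/* Example attacker: swap the approved path for a forbidden one. */
void swap_to_etc_shadow(char *user_buf)
{
    strcpy(user_buf, "/etc/shadow");
}
```

With a buffer holding "/tmp/ok" and swap_to_etc_shadow as the hook, the first
function passes the check yet ends up using "/etc/shadow", while the second
uses "/tmp/ok".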

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
@ 2014-07-08 16:54       ` David Drysdale
  0 siblings, 0 replies; 87+ messages in thread
From: David Drysdale @ 2014-07-08 16:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: LSM List, linux-kernel, Greg Kroah-Hartman, Alexander Viro,
	Meredydd Luff, Kees Cook, James Morris, Linux API

On Tue, Jul 8, 2014 at 1:03 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, Jun 30, 2014 at 11:28:01AM +0100, David Drysdale wrote:
>> Add a new O_BENEATH_ONLY flag for openat(2) which restricts the
>> provided path, rejecting (with -EACCES) paths that are not beneath
>> the provided dfd.  In particular, reject:
>>  - paths that contain .. components
>>  - paths that begin with /
>>  - symlinks that have paths as above.
>
>
> How is this implemented in FreeBSD?  I can't find any references to
> O_BENEATH_ONLY except for your patchset.

FreeBSD have the relative-only behaviour for openat() relative to a
Capsicum capability dfd [1], and for a process in capability-mode [2],
but they don't expose it as a separately accessible openat() flag.
However, it seemed like a more widely useful idea, so separating it out
was suggested.

[1] http://fxr.watson.org/fxr/source/kern/vfs_lookup.c?v=FREEBSD10#L238
[2] http://fxr.watson.org/fxr/source/kern/vfs_lookup.c?v=FREEBSD10#L171
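
As a rough illustration of the first two rejection rules quoted above
(paths that begin with / and paths that contain .. components), here is a
minimal user-space sketch.  path_is_lexically_beneath() is a hypothetical
helper, not code from the patchset or from FreeBSD.  It is purely lexical,
so it cannot cover the third rule: symlinks can only be checked while the
kernel walks the path.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Returns false for absolute paths and for paths with a ".." component.
 * Purely lexical: it never touches the filesystem, so symlinks that
 * escape the directory are NOT detected here.
 */
bool path_is_lexically_beneath(const char *path)
{
    const char *p = path;

    if (path[0] == '/')
        return false;                       /* begins with / */

    while (*p) {
        size_t len = strcspn(p, "/");       /* length of this component */

        if (len == 2 && p[0] == '.' && p[1] == '.')
            return false;                   /* ".." component */
        p += len;
        while (*p == '/')                   /* skip separators */
            p++;
    }
    return true;
}
```

A caller could run this before openat(dfd, path, ...), but a check made in
userspace is no substitute for one made in the kernel during lookup, since
the directory contents can change between the check and the open.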

* Re: [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2)
  2014-07-08 16:54       ` David Drysdale
  (?)
@ 2014-07-09  8:48       ` Christoph Hellwig
  -1 siblings, 0 replies; 87+ messages in thread
From: Christoph Hellwig @ 2014-07-09  8:48 UTC (permalink / raw)
  To: David Drysdale
  Cc: Christoph Hellwig, LSM List, linux-kernel, Greg Kroah-Hartman,
	Alexander Viro, Meredydd Luff, Kees Cook, James Morris,
	Linux API

On Tue, Jul 08, 2014 at 05:54:24PM +0100, David Drysdale wrote:
> > How is this implemented in FreeBSD?  I can't find any references to
> > O_BENEATH_ONLY except for your patchset.
> 
> FreeBSD have the relative-only behaviour for openat() relative to a
> Capsicum capability dfd [1], and for a process in capability-mode [2],
> but they don't have the O_BENEATH_ONLY as a separately-accessible
> openat() flag.  However, it seemed like a more widely useful idea so
> separating it out was suggested.

In that case we should make sure to use the same name and semantics for
it.  As far as I'm concerned I'd prefer a less clumsy name like
O_BENEATH.


* Re: [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1)
@ 2014-08-16 15:41               ` Pavel Machek
  0 siblings, 0 replies; 87+ messages in thread
From: Pavel Machek @ 2014-08-16 15:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Paolo Bonzini, David Drysdale, LSM List, linux-kernel,
	Greg Kroah-Hartman, Alexander Viro, Meredydd Luff, Kees Cook,
	James Morris, Linux API, qemu-devel

Hi!

> >>> I think that's more easily done by opening the file as O_RDONLY/O_WRONLY
> >>> /O_RDWR.   You could do it by running the file descriptor's seccomp-bpf
> >>> program once per iocb with synthesized syscall numbers and argument
> >>> vectors.
> >>
> >>
> >> Right, but generating the equivalent seccomp input environment for an
> >> equivalent single-fd syscall is going to be subtle and complex (which
> >> are worrying words to mention in a security context).  And how many
> >> other syscalls are going to need similar special-case processing?
> >> (poll? select? send[m]msg? ...)
> >
> >
> > Yeah, the difficult part is getting the right balance between:
> >
> > 1) limitations due to seccomp's inability to chase pointers (which is
> > not something that can be lifted, as it's required for correctness)
> 
> btw once seccomp moves to eBPF it will be able to 'chase pointers',
> since pointer walking will be possible via bpf_load_pointer() function call,
> which is a wrapper of:

Even if you could make Capsicum work with eBPF... please don't.

Capsicum is kind of an obvious, elegant solution. BPF is quite
complex. And security semantics should not be pushed to userspace...

						Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Thread overview: 87+ messages
2014-06-30 10:28 [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) David Drysdale
2014-06-30 10:28 ` [PATCH 01/11] fs: add O_BENEATH_ONLY flag to openat(2) David Drysdale
2014-06-30 14:49   ` Andy Lutomirski
2014-06-30 15:49     ` David Drysdale
2014-06-30 15:53       ` Andy Lutomirski
2014-07-08 12:07         ` Christoph Hellwig
2014-07-08 12:07           ` Christoph Hellwig
2014-07-08 12:48           ` Meredydd Luff
2014-07-08 12:48             ` Meredydd Luff
2014-07-08 12:51             ` Christoph Hellwig
2014-07-08 12:51               ` Christoph Hellwig
2014-07-08 13:04               ` Meredydd Luff
2014-07-08 13:04                 ` Meredydd Luff
2014-07-08 13:12                 ` Christoph Hellwig
2014-06-30 20:40   ` Andi Kleen
2014-06-30 21:11     ` Andy Lutomirski
2014-07-01  9:53     ` David Drysdale
2014-07-01  9:53       ` David Drysdale
2014-07-01 18:58       ` Loganaden Velvindron
2014-07-08 12:03   ` Christoph Hellwig
2014-07-08 12:03     ` Christoph Hellwig
2014-07-08 16:54     ` David Drysdale
2014-07-08 16:54       ` David Drysdale
2014-07-09  8:48       ` Christoph Hellwig
2014-06-30 10:28 ` [PATCH 02/11] selftests: Add test of O_BENEATH_ONLY & openat(2) David Drysdale
2014-06-30 10:28   ` David Drysdale
2014-06-30 10:28 ` [PATCH 03/11] capsicum: rights values and structure definitions David Drysdale
2014-06-30 10:28   ` David Drysdale
2014-06-30 10:28 ` [PATCH 04/11] capsicum: implement fgetr() and friends David Drysdale
2014-06-30 10:28   ` David Drysdale
2014-06-30 10:28 ` [PATCH 05/11] capsicum: convert callers to use fgetr() etc David Drysdale
2014-06-30 10:28 ` [PATCH 06/11] capsicum: implement sockfd_lookupr() David Drysdale
2014-06-30 10:28 ` [PATCH 07/11] capsicum: convert callers to use sockfd_lookupr() etc David Drysdale
2014-06-30 10:28 ` [PATCH 08/11] capsicum: add new LSM hooks on FD/file conversion David Drysdale
2014-06-30 10:28 ` [PATCH 09/11] capsicum: implementations of new LSM hooks David Drysdale
2014-06-30 16:05   ` Andy Lutomirski
2014-06-30 16:05     ` Andy Lutomirski
2014-07-02 13:49     ` Paul Moore
2014-07-02 13:49       ` Paul Moore
2014-07-02 17:09       ` David Drysdale
2014-07-02 17:09         ` David Drysdale
2014-06-30 10:28 ` [PATCH 10/11] capsicum: invocation " David Drysdale
2014-06-30 10:28 ` [PATCH 11/11] capsicum: add syscalls to limit FD rights David Drysdale
2014-06-30 10:28 ` [PATCH 1/5] man-pages: open.2: describe O_BENEATH_ONLY flag David Drysdale
2014-06-30 22:22   ` Andy Lutomirski
2014-06-30 10:28 ` [PATCH 2/5] man-pages: capsicum.7: describe Capsicum capability framework David Drysdale
2014-06-30 10:28 ` [PATCH 3/5] man-pages: rights.7: Describe Capsicum primary rights David Drysdale
2014-06-30 10:28 ` [PATCH 4/5] man-pages: cap_rights_limit.2: limit FD rights for Capsicum David Drysdale
2014-06-30 14:53   ` Andy Lutomirski
2014-06-30 14:53     ` Andy Lutomirski
2014-06-30 15:35     ` David Drysdale
2014-06-30 15:35       ` David Drysdale
2014-06-30 16:06       ` Andy Lutomirski
2014-06-30 16:06         ` Andy Lutomirski
2014-06-30 16:32         ` David Drysdale
2014-06-30 10:28 ` [PATCH 5/5] man-pages: cap_rights_get: retrieve Capsicum fd rights David Drysdale
2014-06-30 22:28   ` Andy Lutomirski
2014-06-30 22:28     ` Andy Lutomirski
2014-07-01  9:19     ` David Drysdale
2014-07-01  9:19       ` David Drysdale
2014-07-01 14:18       ` Andy Lutomirski
2014-07-03  9:12 ` [RFC PATCH 00/11] Adding FreeBSD's Capsicum security framework (part 1) Paolo Bonzini
2014-07-03  9:12   ` [Qemu-devel] " Paolo Bonzini
2014-07-03 10:01   ` Loganaden Velvindron
2014-07-03 10:01     ` [Qemu-devel] " Loganaden Velvindron
2014-07-03 18:39   ` David Drysdale
2014-07-03 18:39     ` [Qemu-devel] " David Drysdale
2014-07-03 18:39     ` David Drysdale
2014-07-04  7:03     ` Paolo Bonzini
2014-07-04  7:03       ` [Qemu-devel] " Paolo Bonzini
2014-07-04  7:03       ` Paolo Bonzini
2014-07-07 10:29       ` David Drysdale
2014-07-07 10:29         ` [Qemu-devel] " David Drysdale
2014-07-07 12:20         ` Paolo Bonzini
2014-07-07 12:20           ` [Qemu-devel] " Paolo Bonzini
2014-07-07 14:11           ` David Drysdale
2014-07-07 14:11             ` [Qemu-devel] " David Drysdale
2014-07-07 14:11             ` David Drysdale
2014-07-07 22:33           ` Alexei Starovoitov
2014-07-07 22:33             ` [Qemu-devel] " Alexei Starovoitov
2014-07-07 22:33             ` Alexei Starovoitov
2014-07-08 14:58             ` Kees Cook
2014-07-08 14:58               ` [Qemu-devel] " Kees Cook
2014-07-08 14:58               ` Kees Cook
2014-08-16 15:41             ` Pavel Machek
2014-08-16 15:41               ` [Qemu-devel] " Pavel Machek
2014-08-16 15:41               ` Pavel Machek