linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/28] Permit filesystem local caching [try #2]
@ 2007-12-05 19:38 David Howells
  2007-12-05 19:38 ` [PATCH 01/28] KEYS: Increase the payload size when instantiating a key " David Howells
                   ` (27 more replies)
  0 siblings, 28 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:38 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells



These patches add local caching for network filesystems such as NFS and AFS.

The patches can roughly be broken down into a number of sets:

  (*) 01-keys-inc-payload.diff
  (*) 02-keys-search-keyring.diff
  (*) 03-keys-callout-blob.diff

      Three patches to the keyring code made to help the CIFS people.
      Included because of patches 05-08.

  (*) 04-keys-get-label.diff

      A patch to allow the security label of a key to be retrieved.
      Included because of patches 05-08.

  (*) 05-security-current-fsugid.diff
  (*) 06-security-separate-task-bits.diff
  (*) 07-security-subjective.diff
  (*) 08-security-kernel-service.diff

      Patches to permit the subjective security of a task to be overridden.
      All the security details in task_struct are decanted into a new struct
      that task_struct then has two pointers two: one that defines the
      objective security of that task (how other tasks may affect it) and one
      that defines the subjective security (how it may affect other objects).

      Note that I have dropped the idea of struct cred for the moment.  With
      the amount of stuff that was excluded from it, it wasn't actually any
      use to me.  However, it can be added later.

      Required for cachefiles.

  (*) 09-release-page.diff
  (*) 10-fscache-page-flags.diff
  (*) 11-add_wait_queue_tail.diff
  (*) 12-fscache.diff

      Patches to provide a local caching facility for network filesystems.

  (*) 13-cachefiles-ia64.diff
  (*) 14-cachefiles-ext3-f_mapping.diff
  (*) 15-cachefiles-write.diff
  (*) 16-cachefiles-monitor.diff
  (*) 17-cachefiles-export.diff
  (*) 18-cachefiles.diff

      Patches to provide a local cache in a directory of an already mounted
      filesystem.

  (*) 19-fscache-nfs.diff
  (*) 20-fscache-nfs-mount.diff
  (*) 21-fscache-nfs-display.diff

      Patches to provide NFS with local caching.

  (*) 22-fcrypt-bit-annotate.diff

      A fix for AFS.

  (*) 23-afs-testsetpageerror.diff
  (*) 24-afs-cancel_rejected_write.diff
  (*) 25-afs-rejected-writeback.diff
  (*) 26-afs-opID.diff
  (*) 27-afs-shared-writable-mmap.diff

      Patches to provide AFS with improved write support.

  (*) 28-fscache-afs.diff

      Patches to provide AFS with local caching.

There are some issues with these patches that I'd like advice on:

 (1) Is the security override stuff acceptable?

 (2) Should the audit context be placed in the task_security struct?

 (3) Should the task security context actually be shared by CLONE_THREAD?
     (should it be placed in struct thread_group_security).

 (4) How to handle superblock sharing in NFS?  (I've sent a separate email on
     this)

Andrew, Linus, can you please hold off on taking these patches for the moment.


--
A tarball of the patches is available at:

	http://people.redhat.com/~dhowells/fscache/patches/nfs+fscache-25.tar.bz2


To use this version of CacheFiles, the cachefilesd-0.9 is also required.  It
is available as an SRPM:

	http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9-1.fc7.src.rpm

Or as individual bits:

	http://people.redhat.com/~dhowells/fscache/cachefilesd-0.9.tar.bz2
	http://people.redhat.com/~dhowells/fscache/cachefilesd.fc
	http://people.redhat.com/~dhowells/fscache/cachefilesd.if
	http://people.redhat.com/~dhowells/fscache/cachefilesd.te
	http://people.redhat.com/~dhowells/fscache/cachefilesd.spec

The .fc, .if and .te files are for manipulating SELinux.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH 01/28] KEYS: Increase the payload size when instantiating a key [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
@ 2007-12-05 19:38 ` David Howells
  2007-12-05 19:38 ` [PATCH 02/28] KEYS: Check starting keyring as part of search " David Howells
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:38 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Increase the size of a payload that can be used to instantiate a key in
add_key() and keyctl_instantiate_key().  This permits huge CIFS SPNEGO blobs to
be passed around.  The limit is raised to 1MB.  If kmalloc() can't allocate a
buffer of sufficient size, vmalloc() will be tried instead.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 security/keys/keyctl.c |   38 ++++++++++++++++++++++++++++++--------
 1 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index d9ca15c..8ec8432 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -19,6 +19,7 @@
 #include <linux/capability.h>
 #include <linux/string.h>
 #include <linux/err.h>
+#include <linux/vmalloc.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 
@@ -62,9 +63,10 @@ asmlinkage long sys_add_key(const char __user *_type,
 	char type[32], *description;
 	void *payload;
 	long ret;
+	bool vm;
 
 	ret = -EINVAL;
-	if (plen > 32767)
+	if (plen > 1024 * 1024 - 1)
 		goto error;
 
 	/* draw all the data into kernel space */
@@ -81,11 +83,18 @@ asmlinkage long sys_add_key(const char __user *_type,
 	/* pull the payload in if one was supplied */
 	payload = NULL;
 
+	vm = false;
 	if (_payload) {
 		ret = -ENOMEM;
 		payload = kmalloc(plen, GFP_KERNEL);
-		if (!payload)
-			goto error2;
+		if (!payload) {
+			if (plen <= PAGE_SIZE)
+				goto error2;
+			vm = true;
+			payload = vmalloc(plen);
+			if (!payload)
+				goto error2;
+		}
 
 		ret = -EFAULT;
 		if (copy_from_user(payload, _payload, plen) != 0)
@@ -113,7 +122,10 @@ asmlinkage long sys_add_key(const char __user *_type,
 
 	key_ref_put(keyring_ref);
  error3:
-	kfree(payload);
+	if (!vm)
+		kfree(payload);
+	else
+		vfree(payload);
  error2:
 	kfree(description);
  error:
@@ -821,9 +833,10 @@ long keyctl_instantiate_key(key_serial_t id,
 	key_ref_t keyring_ref;
 	void *payload;
 	long ret;
+	bool vm = false;
 
 	ret = -EINVAL;
-	if (plen > 32767)
+	if (plen > 1024 * 1024 - 1)
 		goto error;
 
 	/* the appropriate instantiation authorisation key must have been
@@ -843,8 +856,14 @@ long keyctl_instantiate_key(key_serial_t id,
 	if (_payload) {
 		ret = -ENOMEM;
 		payload = kmalloc(plen, GFP_KERNEL);
-		if (!payload)
-			goto error;
+		if (!payload) {
+			if (plen <= PAGE_SIZE)
+				goto error;
+			vm = true;
+			payload = vmalloc(plen);
+			if (!payload)
+				goto error;
+		}
 
 		ret = -EFAULT;
 		if (copy_from_user(payload, _payload, plen) != 0)
@@ -877,7 +896,10 @@ long keyctl_instantiate_key(key_serial_t id,
 	}
 
 error2:
-	kfree(payload);
+	if (!vm)
+		kfree(payload);
+	else
+		vfree(payload);
 error:
 	return ret;
 


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 02/28] KEYS: Check starting keyring as part of search [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
  2007-12-05 19:38 ` [PATCH 01/28] KEYS: Increase the payload size when instantiating a key " David Howells
@ 2007-12-05 19:38 ` David Howells
  2007-12-05 19:38 ` [PATCH 03/28] KEYS: Allow the callout data to be passed as a blob rather than a string " David Howells
                   ` (25 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:38 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Check the starting keyring as part of the search to (a) see if that is what
we're searching for, and (b) to check it is still valid for searching.

The scenario:  User in process A does things that cause things to be
created in its process session keyring.  The user then does an su to
another user and starts a new process, B.  The two processes now
share the same process session keyring.

Process B does an NFS access which results in an upcall to gssd.
When gssd attempts to instantiate the context key (to be linked
into the process session keyring), it is denied access even though it
has an authorization key.

The order of calls is:

   keyctl_instantiate_key()
      lookup_user_key()				    (the default: case)
         search_process_keyrings(current)
	    search_process_keyrings(rka->context)   (recursive call)
	       keyring_search_aux()

keyring_search_aux() verifies the keys and keyrings underneath the
top-level keyring it is given, but that top-level keyring is neither
fully validated nor checked to see if it is the thing being searched for.

This patch changes keyring_search_aux() to:
1) do more validation on the top keyring it is given and
2) check whether that top-level keyring is the thing being searched for


Signed-off-by: Kevin Coffman <kwc@citi.umich.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 security/keys/keyring.c |   35 +++++++++++++++++++++++++++++++----
 1 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index 88292e3..76b89b2 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -292,7 +292,7 @@ key_ref_t keyring_search_aux(key_ref_t keyring_ref,
 
 	struct keyring_list *keylist;
 	struct timespec now;
-	unsigned long possessed;
+	unsigned long possessed, kflags;
 	struct key *keyring, *key;
 	key_ref_t key_ref;
 	long err;
@@ -318,6 +318,32 @@ key_ref_t keyring_search_aux(key_ref_t keyring_ref,
 	now = current_kernel_time();
 	err = -EAGAIN;
 	sp = 0;
+	
+	/* firstly we should check to see if this top-level keyring is what we
+	 * are looking for */
+	key_ref = ERR_PTR(-EAGAIN);
+	kflags = keyring->flags;
+	if (keyring->type == type && match(keyring, description)) {
+		key = keyring;
+
+		/* check it isn't negative and hasn't expired or been
+		 * revoked */
+		if (kflags & (1 << KEY_FLAG_REVOKED))
+			goto error_2;
+		if (key->expiry && now.tv_sec >= key->expiry)
+			goto error_2;
+		key_ref = ERR_PTR(-ENOKEY);
+		if (kflags & (1 << KEY_FLAG_NEGATIVE))
+			goto error_2;
+		goto found;
+	}
+
+	/* otherwise, the top keyring must not be revoked, expired, or
+	 * negatively instantiated if we are to search it */
+	key_ref = ERR_PTR(-EAGAIN);
+	if (kflags & ((1 << KEY_FLAG_REVOKED) | (1 << KEY_FLAG_NEGATIVE)) ||
+	    (keyring->expiry && now.tv_sec >= keyring->expiry))
+		goto error_2;
 
 	/* start processing a new keyring */
 descend:
@@ -331,13 +357,14 @@ descend:
 	/* iterate through the keys in this keyring first */
 	for (kix = 0; kix < keylist->nkeys; kix++) {
 		key = keylist->keys[kix];
+		kflags = key->flags;
 
 		/* ignore keys not of this type */
 		if (key->type != type)
 			continue;
 
 		/* skip revoked keys and expired keys */
-		if (test_bit(KEY_FLAG_REVOKED, &key->flags))
+		if (kflags & (1 << KEY_FLAG_REVOKED))
 			continue;
 
 		if (key->expiry && now.tv_sec >= key->expiry)
@@ -352,8 +379,8 @@ descend:
 					context, KEY_SEARCH) < 0)
 			continue;
 
-		/* we set a different error code if we find a negative key */
-		if (test_bit(KEY_FLAG_NEGATIVE, &key->flags)) {
+		/* we set a different error code if we pass a negative key */
+		if (kflags & (1 << KEY_FLAG_NEGATIVE)) {
 			err = -ENOKEY;
 			continue;
 		}


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 03/28] KEYS: Allow the callout data to be passed as a blob rather than a string [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
  2007-12-05 19:38 ` [PATCH 01/28] KEYS: Increase the payload size when instantiating a key " David Howells
  2007-12-05 19:38 ` [PATCH 02/28] KEYS: Check starting keyring as part of search " David Howells
@ 2007-12-05 19:38 ` David Howells
  2007-12-05 19:38 ` [PATCH 04/28] KEYS: Add keyctl function to get a security label " David Howells
                   ` (24 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:38 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Allow the callout data to be passed as a blob rather than a string for internal
kernel services that call any request_key_*() interface other than
request_key().  request_key() itself still takes a NUL-terminated string.

The functions that change are:

	request_key_with_auxdata()
	request_key_async()
	request_key_async_with_auxdata()

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/keys-request-key.txt |   11 +++++---
 Documentation/keys.txt             |   14 +++++++---
 include/linux/key.h                |    9 ++++---
 security/keys/internal.h           |    9 ++++---
 security/keys/keyctl.c             |    7 ++++-
 security/keys/request_key.c        |   49 ++++++++++++++++++++++--------------
 security/keys/request_key_auth.c   |   12 +++++----
 7 files changed, 70 insertions(+), 41 deletions(-)

diff --git a/Documentation/keys-request-key.txt b/Documentation/keys-request-key.txt
index 266955d..09b55e4 100644
--- a/Documentation/keys-request-key.txt
+++ b/Documentation/keys-request-key.txt
@@ -11,26 +11,29 @@ request_key*():
 
 	struct key *request_key(const struct key_type *type,
 				const char *description,
-				const char *callout_string);
+				const char *callout_info);
 
 or:
 
 	struct key *request_key_with_auxdata(const struct key_type *type,
 					     const char *description,
-					     const char *callout_string,
+					     const char *callout_info,
+					     size_t callout_len,
 					     void *aux);
 
 or:
 
 	struct key *request_key_async(const struct key_type *type,
 				      const char *description,
-				      const char *callout_string);
+				      const char *callout_info,
+				      size_t callout_len);
 
 or:
 
 	struct key *request_key_async_with_auxdata(const struct key_type *type,
 						   const char *description,
-						   const char *callout_string,
+						   const char *callout_info,
+					     	   size_t callout_len,
 						   void *aux);
 
 Or by userspace invoking the request_key system call:
diff --git a/Documentation/keys.txt b/Documentation/keys.txt
index 51652d3..b82d38d 100644
--- a/Documentation/keys.txt
+++ b/Documentation/keys.txt
@@ -771,7 +771,7 @@ payload contents" for more information.
 
 	struct key *request_key(const struct key_type *type,
 				const char *description,
-				const char *callout_string);
+				const char *callout_info);
 
     This is used to request a key or keyring with a description that matches
     the description specified according to the key type's match function. This
@@ -793,24 +793,28 @@ payload contents" for more information.
 
 	struct key *request_key_with_auxdata(const struct key_type *type,
 					     const char *description,
-					     const char *callout_string,
+					     const void *callout_info,
+					     size_t callout_len,
 					     void *aux);
 
     This is identical to request_key(), except that the auxiliary data is
-    passed to the key_type->request_key() op if it exists.
+    passed to the key_type->request_key() op if it exists, and the callout_info
+    is a blob of length callout_len, if given (the length may be 0).
 
 
 (*) A key can be requested asynchronously by calling one of:
 
 	struct key *request_key_async(const struct key_type *type,
 				      const char *description,
-				      const char *callout_string);
+				      const void *callout_info,
+				      size_t callout_len);
 
     or:
 
 	struct key *request_key_async_with_auxdata(const struct key_type *type,
 						   const char *description,
-						   const char *callout_string,
+						   const char *callout_info,
+					     	   size_t callout_len,
 					     	   void *aux);
 
     which are asynchronous equivalents of request_key() and
diff --git a/include/linux/key.h b/include/linux/key.h
index fcdbd5e..4a6021a 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -208,16 +208,19 @@ extern struct key *request_key(struct key_type *type,
 
 extern struct key *request_key_with_auxdata(struct key_type *type,
 					    const char *description,
-					    const char *callout_info,
+					    const void *callout_info,
+					    size_t callout_len,
 					    void *aux);
 
 extern struct key *request_key_async(struct key_type *type,
 				     const char *description,
-				     const char *callout_info);
+				     const void *callout_info,
+				     size_t callout_len);
 
 extern struct key *request_key_async_with_auxdata(struct key_type *type,
 						  const char *description,
-						  const char *callout_info,
+						  const void *callout_info,
+						  size_t callout_len,
 						  void *aux);
 
 extern int wait_for_key_construction(struct key *key, bool intr);
diff --git a/security/keys/internal.h b/security/keys/internal.h
index d36d693..f004835 100644
--- a/security/keys/internal.h
+++ b/security/keys/internal.h
@@ -109,7 +109,8 @@ extern int install_process_keyring(struct task_struct *tsk);
 
 extern struct key *request_key_and_link(struct key_type *type,
 					const char *description,
-					const char *callout_info,
+					const void *callout_info,
+					size_t callout_len,
 					void *aux,
 					struct key *dest_keyring,
 					unsigned long flags);
@@ -120,13 +121,15 @@ extern struct key *request_key_and_link(struct key_type *type,
 struct request_key_auth {
 	struct key		*target_key;
 	struct task_struct	*context;
-	char			*callout_info;
+	void			*callout_info;
+	size_t			callout_len;
 	pid_t			pid;
 };
 
 extern struct key_type key_type_request_key_auth;
 extern struct key *request_key_auth_new(struct key *target,
-					const char *callout_info);
+					const void *callout_info,
+					size_t callout_len);
 
 extern struct key *key_get_instantiation_authkey(key_serial_t target_id);
 
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 8ec8432..1698bf9 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -152,6 +152,7 @@ asmlinkage long sys_request_key(const char __user *_type,
 	struct key_type *ktype;
 	struct key *key;
 	key_ref_t dest_ref;
+	size_t callout_len;
 	char type[32], *description, *callout_info;
 	long ret;
 
@@ -169,12 +170,14 @@ asmlinkage long sys_request_key(const char __user *_type,
 
 	/* pull the callout info into kernel space */
 	callout_info = NULL;
+	callout_len = 0;
 	if (_callout_info) {
 		callout_info = strndup_user(_callout_info, PAGE_SIZE);
 		if (IS_ERR(callout_info)) {
 			ret = PTR_ERR(callout_info);
 			goto error2;
 		}
+		callout_len = strlen(callout_info);
 	}
 
 	/* get the destination keyring if specified */
@@ -195,8 +198,8 @@ asmlinkage long sys_request_key(const char __user *_type,
 	}
 
 	/* do the search */
-	key = request_key_and_link(ktype, description, callout_info, NULL,
-				   key_ref_to_ptr(dest_ref),
+	key = request_key_and_link(ktype, description, callout_info,
+				   callout_len, NULL, key_ref_to_ptr(dest_ref),
 				   KEY_ALLOC_IN_QUOTA);
 	if (IS_ERR(key)) {
 		ret = PTR_ERR(key);
diff --git a/security/keys/request_key.c b/security/keys/request_key.c
index 6381e61..aee4897 100644
--- a/security/keys/request_key.c
+++ b/security/keys/request_key.c
@@ -161,21 +161,22 @@ error_alloc:
  * call out to userspace for key construction
  * - we ignore program failure and go on key status instead
  */
-static int construct_key(struct key *key, const char *callout_info, void *aux)
+static int construct_key(struct key *key, const void *callout_info,
+			 size_t callout_len, void *aux)
 {
 	struct key_construction *cons;
 	request_key_actor_t actor;
 	struct key *authkey;
 	int ret;
 
-	kenter("%d,%s,%p", key->serial, callout_info, aux);
+	kenter("%d,%p,%zu,%p", key->serial, callout_info, callout_len, aux);
 
 	cons = kmalloc(sizeof(*cons), GFP_KERNEL);
 	if (!cons)
 		return -ENOMEM;
 
 	/* allocate an authorisation key */
-	authkey = request_key_auth_new(key, callout_info);
+	authkey = request_key_auth_new(key, callout_info, callout_len);
 	if (IS_ERR(authkey)) {
 		kfree(cons);
 		ret = PTR_ERR(authkey);
@@ -331,6 +332,7 @@ alloc_failed:
 static struct key *construct_key_and_link(struct key_type *type,
 					  const char *description,
 					  const char *callout_info,
+					  size_t callout_len,
 					  void *aux,
 					  struct key *dest_keyring,
 					  unsigned long flags)
@@ -348,7 +350,7 @@ static struct key *construct_key_and_link(struct key_type *type,
 	key_user_put(user);
 
 	if (ret == 0) {
-		ret = construct_key(key, callout_info, aux);
+		ret = construct_key(key, callout_info, callout_len, aux);
 		if (ret < 0)
 			goto construction_failed;
 	}
@@ -370,7 +372,8 @@ construction_failed:
  */
 struct key *request_key_and_link(struct key_type *type,
 				 const char *description,
-				 const char *callout_info,
+				 const void *callout_info,
+				 size_t callout_len,
 				 void *aux,
 				 struct key *dest_keyring,
 				 unsigned long flags)
@@ -378,8 +381,8 @@ struct key *request_key_and_link(struct key_type *type,
 	struct key *key;
 	key_ref_t key_ref;
 
-	kenter("%s,%s,%s,%p,%p,%lx",
-	       type->name, description, callout_info, aux,
+	kenter("%s,%s,%p,%zu,%p,%p,%lx",
+	       type->name, description, callout_info, callout_len, aux,
 	       dest_keyring, flags);
 
 	/* search all the process keyrings for a key */
@@ -398,7 +401,8 @@ struct key *request_key_and_link(struct key_type *type,
 			goto error;
 
 		key = construct_key_and_link(type, description, callout_info,
-					     aux, dest_keyring, flags);
+					     callout_len, aux, dest_keyring,
+					     flags);
 	}
 
 error:
@@ -434,10 +438,13 @@ struct key *request_key(struct key_type *type,
 			const char *callout_info)
 {
 	struct key *key;
+	size_t callout_len = 0;
 	int ret;
 
-	key = request_key_and_link(type, description, callout_info, NULL,
-				   NULL, KEY_ALLOC_IN_QUOTA);
+	if (callout_info)
+		callout_len = strlen(callout_info);
+	key = request_key_and_link(type, description, callout_info, callout_len,
+				   NULL, NULL, KEY_ALLOC_IN_QUOTA);
 	if (!IS_ERR(key)) {
 		ret = wait_for_key_construction(key, false);
 		if (ret < 0) {
@@ -458,14 +465,15 @@ EXPORT_SYMBOL(request_key);
  */
 struct key *request_key_with_auxdata(struct key_type *type,
 				     const char *description,
-				     const char *callout_info,
+				     const void *callout_info,
+				     size_t callout_len,
 				     void *aux)
 {
 	struct key *key;
 	int ret;
 
-	key = request_key_and_link(type, description, callout_info, aux,
-				   NULL, KEY_ALLOC_IN_QUOTA);
+	key = request_key_and_link(type, description, callout_info, callout_len,
+				   aux, NULL, KEY_ALLOC_IN_QUOTA);
 	if (!IS_ERR(key)) {
 		ret = wait_for_key_construction(key, false);
 		if (ret < 0) {
@@ -485,10 +493,12 @@ EXPORT_SYMBOL(request_key_with_auxdata);
  */
 struct key *request_key_async(struct key_type *type,
 			      const char *description,
-			      const char *callout_info)
+			      const void *callout_info,
+			      size_t callout_len)
 {
-	return request_key_and_link(type, description, callout_info, NULL,
-				    NULL, KEY_ALLOC_IN_QUOTA);
+	return request_key_and_link(type, description, callout_info,
+				    callout_len, NULL, NULL,
+				    KEY_ALLOC_IN_QUOTA);
 }
 EXPORT_SYMBOL(request_key_async);
 
@@ -500,10 +510,11 @@ EXPORT_SYMBOL(request_key_async);
  */
 struct key *request_key_async_with_auxdata(struct key_type *type,
 					   const char *description,
-					   const char *callout_info,
+					   const void *callout_info,
+					   size_t callout_len,
 					   void *aux)
 {
-	return request_key_and_link(type, description, callout_info, aux,
-				    NULL, KEY_ALLOC_IN_QUOTA);
+	return request_key_and_link(type, description, callout_info,
+				    callout_len, aux, NULL, KEY_ALLOC_IN_QUOTA);
 }
 EXPORT_SYMBOL(request_key_async_with_auxdata);
diff --git a/security/keys/request_key_auth.c b/security/keys/request_key_auth.c
index 510f7be..6827ae3 100644
--- a/security/keys/request_key_auth.c
+++ b/security/keys/request_key_auth.c
@@ -61,7 +61,7 @@ static void request_key_auth_describe(const struct key *key,
 
 	seq_puts(m, "key:");
 	seq_puts(m, key->description);
-	seq_printf(m, " pid:%d ci:%zu", rka->pid, strlen(rka->callout_info));
+	seq_printf(m, " pid:%d ci:%zu", rka->pid, rka->callout_len);
 
 } /* end request_key_auth_describe() */
 
@@ -77,7 +77,7 @@ static long request_key_auth_read(const struct key *key,
 	size_t datalen;
 	long ret;
 
-	datalen = strlen(rka->callout_info);
+	datalen = rka->callout_len;
 	ret = datalen;
 
 	/* we can return the data as is */
@@ -137,7 +137,8 @@ static void request_key_auth_destroy(struct key *key)
  * create an authorisation token for /sbin/request-key or whoever to gain
  * access to the caller's security data
  */
-struct key *request_key_auth_new(struct key *target, const char *callout_info)
+struct key *request_key_auth_new(struct key *target, const void *callout_info,
+				 size_t callout_len)
 {
 	struct request_key_auth *rka, *irka;
 	struct key *authkey = NULL;
@@ -152,7 +153,7 @@ struct key *request_key_auth_new(struct key *target, const char *callout_info)
 		kleave(" = -ENOMEM");
 		return ERR_PTR(-ENOMEM);
 	}
-	rka->callout_info = kmalloc(strlen(callout_info) + 1, GFP_KERNEL);
+	rka->callout_info = kmalloc(callout_len, GFP_KERNEL);
 	if (!rka->callout_info) {
 		kleave(" = -ENOMEM");
 		kfree(rka);
@@ -186,7 +187,8 @@ struct key *request_key_auth_new(struct key *target, const char *callout_info)
 	}
 
 	rka->target_key = key_get(target);
-	strcpy(rka->callout_info, callout_info);
+	memcpy(rka->callout_info, callout_info, callout_len);
+	rka->callout_len = callout_len;
 
 	/* allocate the auth key */
 	sprintf(desc, "%x", target->serial);


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 04/28] KEYS: Add keyctl function to get a security label [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (2 preceding siblings ...)
  2007-12-05 19:38 ` [PATCH 03/28] KEYS: Allow the callout data to be passed as a blob rather than a string " David Howells
@ 2007-12-05 19:38 ` David Howells
  2007-12-05 19:38 ` [PATCH 05/28] Security: Change current->fs[ug]id to current_fs[ug]id() " David Howells
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:38 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Add a keyctl() function to get the security label of a key.

The following is added to Documentation/keys.txt:

 (*) Get the LSM security context attached to a key.

	long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer,
		    size_t buflen)

     This function returns a string that represents the LSM security context
     attached to a key in the buffer provided.

     Unless there's an error, it always returns the amount of data it could
     produce, even if that's too big for the buffer, but it won't copy more
     than requested to userspace. If the buffer pointer is NULL then no copy
     will take place.

     A NUL character is included at the end of the string if the buffer is
     sufficiently big.  This is included in the returned count.  If no LSM is
     in force then an empty string will be returned.

     A process must have view permission on the key for this function to be
     successful.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/keys.txt   |   21 +++++++++++++++
 include/linux/keyctl.h   |    1 +
 include/linux/security.h |   20 +++++++++++++-
 security/dummy.c         |    8 ++++++
 security/keys/compat.c   |    3 ++
 security/keys/keyctl.c   |   66 ++++++++++++++++++++++++++++++++++++++++++++++
 security/security.c      |    5 +++
 security/selinux/hooks.c |   21 +++++++++++++--
 8 files changed, 141 insertions(+), 4 deletions(-)

diff --git a/Documentation/keys.txt b/Documentation/keys.txt
index b82d38d..be424b0 100644
--- a/Documentation/keys.txt
+++ b/Documentation/keys.txt
@@ -711,6 +711,27 @@ The keyctl syscall functions are:
      The assumed authoritative key is inherited across fork and exec.
 
 
+ (*) Get the LSM security context attached to a key.
+
+	long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer,
+		    size_t buflen)
+
+     This function returns a string that represents the LSM security context
+     attached to a key in the buffer provided.
+
+     Unless there's an error, it always returns the amount of data it could
+     produce, even if that's too big for the buffer, but it won't copy more
+     than requested to userspace. If the buffer pointer is NULL then no copy
+     will take place.
+
+     A NUL character is included at the end of the string if the buffer is
+     sufficiently big.  This is included in the returned count.  If no LSM is
+     in force then an empty string will be returned.
+
+     A process must have view permission on the key for this function to be
+     successful.
+
+
 ===============
 KERNEL SERVICES
 ===============
diff --git a/include/linux/keyctl.h b/include/linux/keyctl.h
index 3365945..656ee6b 100644
--- a/include/linux/keyctl.h
+++ b/include/linux/keyctl.h
@@ -49,5 +49,6 @@
 #define KEYCTL_SET_REQKEY_KEYRING	14	/* set default request-key keyring */
 #define KEYCTL_SET_TIMEOUT		15	/* set key timeout */
 #define KEYCTL_ASSUME_AUTHORITY		16	/* assume request_key() authorisation */
+#define KEYCTL_GET_SECURITY		17	/* get key security label */
 
 #endif /*  _LINUX_KEYCTL_H */
diff --git a/include/linux/security.h b/include/linux/security.h
index ac05083..8d9e946 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -959,6 +959,17 @@ struct request_sock;
  *	@perm describes the combination of permissions required of this key.
  *	Return 1 if permission granted, 0 if permission denied and -ve it the
  *      normal permissions model should be effected.
+ * @key_getsecurity:
+ *	Get a textual representation of the security context attached to a key
+ *	for the purposes of honouring KEYCTL_GETSECURITY.  This function
+ *	allocates the storage for the NUL-terminated string and the caller
+ *	should free it.
+ *	@key points to the key to be queried.
+ *	@_buffer points to a pointer that should be set to point to the
+ *	 resulting string (if no label or an error occurs).
+ *	Return the length of the string (including terminating NUL) or -ve if
+ *      an error.
+ *	May also return 0 (and a NULL buffer pointer) if there is no label.
  *
  * Security hooks affecting all System V IPC operations.
  *
@@ -1437,7 +1448,7 @@ struct security_operations {
 	int (*key_permission)(key_ref_t key_ref,
 			      struct task_struct *context,
 			      key_perm_t perm);
-
+	int (*key_getsecurity)(struct key *key, char **_buffer);
 #endif	/* CONFIG_KEYS */
 
 };
@@ -2567,6 +2578,7 @@ int security_key_alloc(struct key *key, struct task_struct *tsk, unsigned long f
 void security_key_free(struct key *key);
 int security_key_permission(key_ref_t key_ref,
 			    struct task_struct *context, key_perm_t perm);
+int security_key_getsecurity(struct key *key, char **_buffer);
 
 #else
 
@@ -2588,6 +2600,12 @@ static inline int security_key_permission(key_ref_t key_ref,
 	return 0;
 }
 
+static inline int security_key_getsecurity(struct key *key, char **_buffer)
+{
+	*_buffer = NULL;
+	return 0;
+}
+
 #endif
 #endif /* CONFIG_KEYS */
 
diff --git a/security/dummy.c b/security/dummy.c
index 6d895ad..7993b30 100644
--- a/security/dummy.c
+++ b/security/dummy.c
@@ -949,6 +949,13 @@ static inline int dummy_key_permission(key_ref_t key_ref,
 {
 	return 0;
 }
+
+static int dummy_key_getsecurity(struct key *key, char **_buffer)
+{
+	*_buffer = NULL;
+	return 0;
+}
+
 #endif /* CONFIG_KEYS */
 
 struct security_operations dummy_security_ops;
@@ -1133,6 +1140,7 @@ void security_fixup_ops (struct security_operations *ops)
 	set_to_dummy_if_null(ops, key_alloc);
 	set_to_dummy_if_null(ops, key_free);
 	set_to_dummy_if_null(ops, key_permission);
+	set_to_dummy_if_null(ops, key_getsecurity);
 #endif	/* CONFIG_KEYS */
 
 }
diff --git a/security/keys/compat.c b/security/keys/compat.c
index e10ec99..c766c68 100644
--- a/security/keys/compat.c
+++ b/security/keys/compat.c
@@ -79,6 +79,9 @@ asmlinkage long compat_sys_keyctl(u32 option,
 	case KEYCTL_ASSUME_AUTHORITY:
 		return keyctl_assume_authority(arg2);
 
+	case KEYCTL_GET_SECURITY:
+		return keyctl_get_security(arg2, compat_ptr(arg3), arg4);
+
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 1698bf9..56e963b 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -20,6 +20,7 @@
 #include <linux/string.h>
 #include <linux/err.h>
 #include <linux/vmalloc.h>
+#include <linux/security.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 
@@ -1080,6 +1081,66 @@ error:
 
 } /* end keyctl_assume_authority() */
 
+/*
+ * get the security label of a key
+ * - the key must grant us view permission
+ * - if there's a buffer, we place up to buflen bytes of data into it
+ * - unless there's an error, we return the amount of information available,
+ *   irrespective of how much we may have copied (including the terminal NUL)
+ * - implements keyctl(KEYCTL_GET_SECURITY)
+ */
+long keyctl_get_security(key_serial_t keyid,
+			 char __user *buffer,
+			 size_t buflen)
+{
+	struct key *key, *instkey;
+	key_ref_t key_ref;
+	char *context;
+	long ret;
+
+	key_ref = lookup_user_key(NULL, keyid, 0, 1, KEY_VIEW);
+	if (IS_ERR(key_ref)) {
+		if (PTR_ERR(key_ref) != -EACCES)
+			return PTR_ERR(key_ref);
+
+		/* viewing a key under construction is also permitted if we
+		 * have the authorisation token handy */
+		instkey = key_get_instantiation_authkey(keyid);
+		if (IS_ERR(instkey))
+			return PTR_ERR(key_ref);
+		key_put(instkey);
+
+		key_ref = lookup_user_key(NULL, keyid, 0, 1, 0);
+		if (IS_ERR(key_ref))
+			return PTR_ERR(key_ref);
+	}
+
+	key = key_ref_to_ptr(key_ref);
+	ret = security_key_getsecurity(key, &context);
+	if (ret == 0) {
+		/* if no information was returned, give userspace an empty
+		 * string */
+		ret = 1;
+		if (buffer && buflen > 0 &&
+		    copy_to_user(buffer, "", 1) != 0)
+			ret = -EFAULT;
+	} else if (ret > 0) {
+		/* return as much data as there's room for */
+		if (buffer && buflen > 0) {
+			if (buflen > ret)
+				buflen = ret;
+
+			if (copy_to_user(buffer, context, buflen) != 0)
+				ret = -EFAULT;
+		}
+
+		kfree(context);
+	}
+
+	key_ref_put(key_ref);
+	return ret;
+}
+
 /*****************************************************************************/
 /*
  * the key control system call
@@ -1160,6 +1221,11 @@ asmlinkage long sys_keyctl(int option, unsigned long arg2, unsigned long arg3,
 	case KEYCTL_ASSUME_AUTHORITY:
 		return keyctl_assume_authority((key_serial_t) arg2);
 
+	case KEYCTL_GET_SECURITY:
+		return keyctl_get_security((key_serial_t) arg2,
+					   (char *) arg3,
+					   (size_t) arg4);
+
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/security/security.c b/security/security.c
index 0e1f1f1..16213e3 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1079,4 +1079,9 @@ int security_key_permission(key_ref_t key_ref,
 	return security_ops->key_permission(key_ref, context, perm);
 }
 
+int security_key_getsecurity(struct key *key, char **_buffer)
+{
+	return security_ops->key_getsecurity(key, _buffer);
+}
+
 #endif	/* CONFIG_KEYS */
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 9f3124b..bd4cfab 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4768,6 +4768,20 @@ static int selinux_key_permission(key_ref_t key_ref,
 			    SECCLASS_KEY, perm, NULL);
 }
 
+static int selinux_key_getsecurity(struct key *key, char **_buffer)
+{
+	struct key_security_struct *ksec = key->security;
+	char *context = NULL;
+	unsigned len;
+	int rc;
+
+	rc = security_sid_to_context(ksec->sid, &context, &len);
+	if (!rc)
+		rc = len;
+	*_buffer = context;
+	return rc;
+}
+
 #endif
 
 static struct security_operations selinux_ops = {
@@ -4943,9 +4957,10 @@ static struct security_operations selinux_ops = {
 #endif
 
 #ifdef CONFIG_KEYS
-	.key_alloc =                    selinux_key_alloc,
-	.key_free =                     selinux_key_free,
-	.key_permission =               selinux_key_permission,
+	.key_alloc =			selinux_key_alloc,
+	.key_free =			selinux_key_free,
+	.key_permission =		selinux_key_permission,
+	.key_getsecurity =		selinux_key_getsecurity,
 #endif
 };
 


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 05/28] Security: Change current->fs[ug]id to current_fs[ug]id() [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (3 preceding siblings ...)
  2007-12-05 19:38 ` [PATCH 04/28] KEYS: Add keyctl function to get a security label " David Howells
@ 2007-12-05 19:38 ` David Howells
  2007-12-05 19:38 ` [PATCH 06/28] SECURITY: Separate task security context from task_struct " David Howells
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:38 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Change current->fs[ug]id to current_fs[ug]id() so that fsgid and fsuid can be
separated from the task_struct.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/ia64/kernel/perfmon.c                |    4 ++--
 arch/powerpc/platforms/cell/spufs/inode.c |    4 ++--
 drivers/isdn/capi/capifs.c                |    4 ++--
 drivers/usb/core/inode.c                  |    4 ++--
 fs/9p/fid.c                               |    2 +-
 fs/9p/vfs_inode.c                         |    4 ++--
 fs/9p/vfs_super.c                         |    4 ++--
 fs/affs/inode.c                           |    4 ++--
 fs/anon_inodes.c                          |    4 ++--
 fs/attr.c                                 |    4 ++--
 fs/bfs/dir.c                              |    4 ++--
 fs/cifs/cifsproto.h                       |    2 +-
 fs/cifs/dir.c                             |   12 ++++++------
 fs/cifs/inode.c                           |    8 ++++----
 fs/cifs/misc.c                            |    4 ++--
 fs/coda/cache.c                           |    6 +++---
 fs/coda/upcall.c                          |    4 ++--
 fs/devpts/inode.c                         |    4 ++--
 fs/dquot.c                                |    2 +-
 fs/exec.c                                 |    4 ++--
 fs/ext2/balloc.c                          |    2 +-
 fs/ext2/ialloc.c                          |    4 ++--
 fs/ext2/ioctl.c                           |    2 +-
 fs/ext3/balloc.c                          |    2 +-
 fs/ext3/ialloc.c                          |    4 ++--
 fs/ext4/balloc.c                          |    2 +-
 fs/ext4/ialloc.c                          |    4 ++--
 fs/fuse/dev.c                             |    4 ++--
 fs/gfs2/inode.c                           |   10 +++++-----
 fs/hfs/inode.c                            |    4 ++--
 fs/hfsplus/inode.c                        |    4 ++--
 fs/hpfs/namei.c                           |   24 ++++++++++++------------
 fs/hugetlbfs/inode.c                      |   16 ++++++++--------
 fs/jffs2/fs.c                             |    4 ++--
 fs/jfs/jfs_inode.c                        |    4 ++--
 fs/locks.c                                |    2 +-
 fs/minix/bitmap.c                         |    4 ++--
 fs/namei.c                                |    8 ++++----
 fs/nfsd/vfs.c                             |    4 ++--
 fs/ocfs2/dlm/dlmfs.c                      |    8 ++++----
 fs/ocfs2/namei.c                          |    4 ++--
 fs/pipe.c                                 |    4 ++--
 fs/posix_acl.c                            |    4 ++--
 fs/ramfs/inode.c                          |    4 ++--
 fs/reiserfs/namei.c                       |    4 ++--
 fs/sysv/ialloc.c                          |    4 ++--
 fs/udf/ialloc.c                           |    4 ++--
 fs/udf/namei.c                            |    2 +-
 fs/ufs/ialloc.c                           |    4 ++--
 fs/xfs/linux-2.6/xfs_linux.h              |    4 ++--
 fs/xfs/xfs_acl.c                          |    6 +++---
 fs/xfs/xfs_attr.c                         |    2 +-
 fs/xfs/xfs_inode.c                        |    6 +++---
 fs/xfs/xfs_vnodeops.c                     |    8 ++++----
 include/linux/fs.h                        |    2 +-
 include/linux/sched.h                     |    3 +++
 ipc/mqueue.c                              |    4 ++--
 kernel/cgroup.c                           |    4 ++--
 mm/shmem.c                                |    8 ++++----
 net/9p/client.c                           |    2 +-
 net/socket.c                              |    4 ++--
 net/sunrpc/auth.c                         |    8 ++++----
 security/commoncap.c                      |    8 ++++----
 security/keys/key.c                       |    2 +-
 security/keys/keyctl.c                    |    2 +-
 security/keys/request_key.c               |   10 +++++-----
 security/keys/request_key_auth.c          |    2 +-
 67 files changed, 163 insertions(+), 160 deletions(-)

diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 73e7c2e..ef383d9 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -2206,8 +2206,8 @@ pfm_alloc_fd(struct file **cfile)
 	DPRINT(("new inode ino=%ld @%p\n", inode->i_ino, inode));
 
 	inode->i_mode = S_IFCHR|S_IRUGO;
-	inode->i_uid  = current->fsuid;
-	inode->i_gid  = current->fsgid;
+	inode->i_uid  = current_fsuid();
+	inode->i_gid  = current_fsgid();
 
 	sprintf(name, "[%lu]", inode->i_ino);
 	this.name = name;
diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index c0e968a..4efe7bf 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -85,8 +85,8 @@ spufs_new_inode(struct super_block *sb, int mode)
 		goto out;
 
 	inode->i_mode = mode;
-	inode->i_uid = current->fsuid;
-	inode->i_gid = current->fsgid;
+	inode->i_uid = current_fsuid();
+	inode->i_gid = current_fsgid();
 	inode->i_blocks = 0;
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 out:
diff --git a/drivers/isdn/capi/capifs.c b/drivers/isdn/capi/capifs.c
index 2dd1b57..26b9aea 100644
--- a/drivers/isdn/capi/capifs.c
+++ b/drivers/isdn/capi/capifs.c
@@ -148,8 +148,8 @@ void capifs_new_ncci(unsigned int number, dev_t device)
 	if (!inode)
 		return;
 	inode->i_ino = number+2;
-	inode->i_uid = config.setuid ? config.uid : current->fsuid;
-	inode->i_gid = config.setgid ? config.gid : current->fsgid;
+	inode->i_uid = config.setuid ? config.uid : current_fsuid();
+	inode->i_gid = config.setgid ? config.gid : current_fsgid();
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
 	init_special_inode(inode, S_IFCHR|config.mode, device);
 	//inode->i_op = &capifs_file_inode_operations;
diff --git a/drivers/usb/core/inode.c b/drivers/usb/core/inode.c
index cd4f111..9173eae 100644
--- a/drivers/usb/core/inode.c
+++ b/drivers/usb/core/inode.c
@@ -246,8 +246,8 @@ static struct inode *usbfs_get_inode (struct super_block *sb, int mode, dev_t de
 
 	if (inode) {
 		inode->i_mode = mode;
-		inode->i_uid = current->fsuid;
-		inode->i_gid = current->fsgid;
+		inode->i_uid = current_fsuid();
+		inode->i_gid = current_fsgid();
 		inode->i_blocks = 0;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
diff --git a/fs/9p/fid.c b/fs/9p/fid.c
index b364da7..cd7799c 100644
--- a/fs/9p/fid.c
+++ b/fs/9p/fid.c
@@ -121,7 +121,7 @@ struct p9_fid *v9fs_fid_lookup(struct dentry *dentry)
 	switch (access) {
 	case V9FS_ACCESS_SINGLE:
 	case V9FS_ACCESS_USER:
-		uid = current->fsuid;
+		uid = current_fsuid();
 		any = 0;
 		break;
 
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 23581bc..b6d7fc9 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -202,8 +202,8 @@ struct inode *v9fs_get_inode(struct super_block *sb, int mode)
 	inode = new_inode(sb);
 	if (inode) {
 		inode->i_mode = mode;
-		inode->i_uid = current->fsuid;
-		inode->i_gid = current->fsgid;
+		inode->i_uid = current_fsuid();
+		inode->i_gid = current_fsgid();
 		inode->i_blocks = 0;
 		inode->i_rdev = 0;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
diff --git a/fs/9p/vfs_super.c b/fs/9p/vfs_super.c
index 678c02f..465520d 100644
--- a/fs/9p/vfs_super.c
+++ b/fs/9p/vfs_super.c
@@ -112,8 +112,8 @@ static int v9fs_get_sb(struct file_system_type *fs_type, int flags,
 	struct v9fs_session_info *v9ses = NULL;
 	struct p9_stat *st = NULL;
 	int mode = S_IRWXUGO | S_ISVTX;
-	uid_t uid = current->fsuid;
-	gid_t gid = current->fsgid;
+	uid_t uid = current_fsuid();
+	gid_t gid = current_fsgid();
 	struct p9_fid *fid;
 	int retval = 0;
 
diff --git a/fs/affs/inode.c b/fs/affs/inode.c
index 4609a6c..fb84ebc 100644
--- a/fs/affs/inode.c
+++ b/fs/affs/inode.c
@@ -305,8 +305,8 @@ affs_new_inode(struct inode *dir)
 	mark_buffer_dirty_inode(bh, inode);
 	affs_brelse(bh);
 
-	inode->i_uid     = current->fsuid;
-	inode->i_gid     = current->fsgid;
+	inode->i_uid     = current_fsuid();
+	inode->i_gid     = current_fsgid();
 	inode->i_ino     = block;
 	inode->i_nlink   = 1;
 	inode->i_mtime   = inode->i_atime = inode->i_ctime = CURRENT_TIME_SEC;
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 2332188..a2f6a13 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -162,8 +162,8 @@ static struct inode *anon_inode_mkinode(void)
 	 */
 	inode->i_state = I_DIRTY;
 	inode->i_mode = S_IRUSR | S_IWUSR;
-	inode->i_uid = current->fsuid;
-	inode->i_gid = current->fsgid;
+	inode->i_uid = current_fsuid();
+	inode->i_gid = current_fsgid();
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 	return inode;
 }
diff --git a/fs/attr.c b/fs/attr.c
index 966b73e..117cca7 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -29,13 +29,13 @@ int inode_change_ok(struct inode *inode, struct iattr *attr)
 
 	/* Make sure a caller can chown. */
 	if ((ia_valid & ATTR_UID) &&
-	    (current->fsuid != inode->i_uid ||
+	    (current_fsuid() != inode->i_uid ||
 	     attr->ia_uid != inode->i_uid) && !capable(CAP_CHOWN))
 		goto error;
 
 	/* Make sure caller can chgrp. */
 	if ((ia_valid & ATTR_GID) &&
-	    (current->fsuid != inode->i_uid ||
+	    (current_fsuid() != inode->i_uid ||
 	    (!in_group_p(attr->ia_gid) && attr->ia_gid != inode->i_gid)) &&
 	    !capable(CAP_CHOWN))
 		goto error;
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index 1fd056d..499c531 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -104,8 +104,8 @@ static int bfs_create(struct inode *dir, struct dentry *dentry, int mode,
 	}
 	set_bit(ino, info->si_imap);
 	info->si_freei--;
-	inode->i_uid = current->fsuid;
-	inode->i_gid = (dir->i_mode & S_ISGID) ? dir->i_gid : current->fsgid;
+	inode->i_uid = current_fsuid();
+	inode->i_gid = (dir->i_mode & S_ISGID) ? dir->i_gid : current_fsgid();
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME_SEC;
 	inode->i_blocks = 0;
 	inode->i_op = &bfs_file_inops;
diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h
index 8350eec..1c659db 100644
--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -39,7 +39,7 @@ extern int smb_send(struct socket *, struct smb_hdr *,
 			unsigned int /* length */ , struct sockaddr *);
 extern unsigned int _GetXid(void);
 extern void _FreeXid(unsigned int);
-#define GetXid() (int)_GetXid(); cFYI(1,("CIFS VFS: in %s as Xid: %d with uid: %d",__FUNCTION__, xid,current->fsuid));
+#define GetXid() (int)_GetXid(); cFYI(1,("CIFS VFS: in %s as Xid: %d with uid: %d",__FUNCTION__, xid,current_fsuid()));
 #define FreeXid(curr_xid) {_FreeXid(curr_xid); cFYI(1,("CIFS VFS: leaving %s (xid = %d) rc = %d",__FUNCTION__,curr_xid,(int)rc));}
 extern char *build_path_from_dentry(struct dentry *);
 extern char *build_wildcard_path_from_dentry(struct dentry *direntry);
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 37dc97a..728726a 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -211,8 +211,8 @@ cifs_create(struct inode *inode, struct dentry *direntry, int mode,
 			mode &= ~current->fs->umask;
 			if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_SET_UID) {
 				CIFSSMBUnixSetPerms(xid, pTcon, full_path, mode,
-					(__u64)current->fsuid,
-					(__u64)current->fsgid,
+					(__u64)current_fsuid(),
+					(__u64)current_fsgid(),
 					0 /* dev */,
 					cifs_sb->local_nls,
 					cifs_sb->mnt_cifs_flags &
@@ -246,8 +246,8 @@ cifs_create(struct inode *inode, struct dentry *direntry, int mode,
 				if ((oplock & CIFS_CREATE_ACTION) &&
 				    (cifs_sb->mnt_cifs_flags &
 				     CIFS_MOUNT_SET_UID)) {
-					newinode->i_uid = current->fsuid;
-					newinode->i_gid = current->fsgid;
+					newinode->i_uid = current_fsuid();
+					newinode->i_gid = current_fsgid();
 				}
 			}
 		}
@@ -340,8 +340,8 @@ int cifs_mknod(struct inode *inode, struct dentry *direntry, int mode,
 		mode &= ~current->fs->umask;
 		if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_SET_UID) {
 			rc = CIFSSMBUnixSetPerms(xid, pTcon, full_path,
-				mode, (__u64)current->fsuid,
-				(__u64)current->fsgid,
+				mode, (__u64)current_fsuid(),
+				(__u64)current_fsgid(),
 				device_number, cifs_sb->local_nls,
 				cifs_sb->mnt_cifs_flags &
 					CIFS_MOUNT_MAP_SPECIAL_CHR);
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index e915eb1..8040b1b 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1031,8 +1031,8 @@ mkdir_get_info:
 			if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_SET_UID) {
 				CIFSSMBUnixSetPerms(xid, pTcon, full_path,
 						    mode,
-						    (__u64)current->fsuid,
-						    (__u64)current->fsgid,
+						    (__u64)current_fsuid(),
+						    (__u64)current_fsgid(),
 						    0 /* dev_t */,
 						    cifs_sb->local_nls,
 						    cifs_sb->mnt_cifs_flags &
@@ -1055,9 +1055,9 @@ mkdir_get_info:
 				if (cifs_sb->mnt_cifs_flags &
 				     CIFS_MOUNT_SET_UID) {
 					direntry->d_inode->i_uid =
-						current->fsuid;
+						current_fsuid();
 					direntry->d_inode->i_gid =
-						current->fsgid;
+						current_fsgid();
 				}
 			}
 		}
diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c
index 15546c2..b862231 100644
--- a/fs/cifs/misc.c
+++ b/fs/cifs/misc.c
@@ -351,13 +351,13 @@ header_assemble(struct smb_hdr *buffer, char smb_command /* command */ ,
 		/*  BB Add support for establishing new tCon and SMB Session  */
 		/*      with userid/password pairs found on the smb session   */
 		/*	for other target tcp/ip addresses 		BB    */
-				if (current->fsuid != treeCon->ses->linux_uid) {
+				if (current_fsuid() != treeCon->ses->linux_uid) {
 					cFYI(1, ("Multiuser mode and UID "
 						 "did not match tcon uid"));
 					read_lock(&GlobalSMBSeslock);
 					list_for_each(temp_item, &GlobalSMBSessionList) {
 						ses = list_entry(temp_item, struct cifsSesInfo, cifsSessionList);
-						if (ses->linux_uid == current->fsuid) {
+						if (ses->linux_uid == current_fsuid()) {
 							if (ses->server == treeCon->ses->server) {
 								cFYI(1, ("found matching uid substitute right smb_uid"));
 								buffer->Uid = ses->Suid;
diff --git a/fs/coda/cache.c b/fs/coda/cache.c
index 8a23703..a5bf577 100644
--- a/fs/coda/cache.c
+++ b/fs/coda/cache.c
@@ -32,8 +32,8 @@ void coda_cache_enter(struct inode *inode, int mask)
 	struct coda_inode_info *cii = ITOC(inode);
 
 	cii->c_cached_epoch = atomic_read(&permission_epoch);
-	if (cii->c_uid != current->fsuid) {
-                cii->c_uid = current->fsuid;
+	if (cii->c_uid != current_fsuid()) {
+		cii->c_uid = current_fsuid();
                 cii->c_cached_perm = mask;
         } else
                 cii->c_cached_perm |= mask;
@@ -60,7 +60,7 @@ int coda_cache_check(struct inode *inode, int mask)
         int hit;
 	
         hit = (mask & cii->c_cached_perm) == mask &&
-		cii->c_uid == current->fsuid &&
+		cii->c_uid == current_fsuid() &&
 		cii->c_cached_epoch == atomic_read(&permission_epoch);
 
         return hit;
diff --git a/fs/coda/upcall.c b/fs/coda/upcall.c
index 359e531..806e6aa 100644
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -54,9 +54,9 @@ static void *alloc_upcall(int opcode, int size)
 	inp->ih.pgid = task_pgrp_nr(current);
 #ifdef CONFIG_CODA_FS_OLD_API
 	memset(&inp->ih.cred, 0, sizeof(struct coda_cred));
-	inp->ih.cred.cr_fsuid = current->fsuid;
+	inp->ih.cred.cr_fsuid = current_fsuid();
 #else
-	inp->ih.uid = current->fsuid;
+	inp->ih.uid = current_fsuid();
 #endif
 	return (void*)inp;
 }
diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 06ef9a2..1c9ffd9 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -172,8 +172,8 @@ int devpts_pty_new(struct tty_struct *tty)
 		return -ENOMEM;
 
 	inode->i_ino = number+2;
-	inode->i_uid = config.setuid ? config.uid : current->fsuid;
-	inode->i_gid = config.setgid ? config.gid : current->fsgid;
+	inode->i_uid = config.setuid ? config.uid : current_fsuid();
+	inode->i_gid = config.setgid ? config.gid : current_fsgid();
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
 	init_special_inode(inode, S_IFCHR|config.mode, device);
 	inode->i_private = tty;
diff --git a/fs/dquot.c b/fs/dquot.c
index 2809768..3dc58a8 100644
--- a/fs/dquot.c
+++ b/fs/dquot.c
@@ -837,7 +837,7 @@ static inline int need_print_warning(struct dquot *dquot)
 
 	switch (dquot->dq_type) {
 		case USRQUOTA:
-			return current->fsuid == dquot->dq_id;
+			return current_fsuid() == dquot->dq_id;
 		case GRPQUOTA:
 			return in_group_p(dquot->dq_id);
 	}
diff --git a/fs/exec.c b/fs/exec.c
index 282240a..a09ce1b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1678,7 +1678,7 @@ int do_coredump(long signr, int exit_code, struct pt_regs * regs)
 	struct inode * inode;
 	struct file * file;
 	int retval = 0;
-	int fsuid = current->fsuid;
+	int fsuid = current_fsuid();
 	int flag = 0;
 	int ispipe = 0;
 	unsigned long core_limit = current->signal->rlim[RLIMIT_CORE].rlim_cur;
@@ -1784,7 +1784,7 @@ int do_coredump(long signr, int exit_code, struct pt_regs * regs)
 	 * Dont allow local users get cute and trick others to coredump
 	 * into their pre-created files:
 	 */
-	if (inode->i_uid != current->fsuid)
+	if (inode->i_uid != current_fsuid())
 		goto close_fail;
 	if (!file->f_op)
 		goto close_fail;
diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
index 377ad17..bbddd14 100644
--- a/fs/ext2/balloc.c
+++ b/fs/ext2/balloc.c
@@ -1128,7 +1128,7 @@ static int ext2_has_free_blocks(struct ext2_sb_info *sbi)
 	free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
 	root_blocks = le32_to_cpu(sbi->s_es->s_r_blocks_count);
 	if (free_blocks < root_blocks + 1 && !capable(CAP_SYS_RESOURCE) &&
-		sbi->s_resuid != current->fsuid &&
+		sbi->s_resuid != current_fsuid() &&
 		(sbi->s_resgid == 0 || !in_group_p (sbi->s_resgid))) {
 		return 0;
 	}
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
index 5deb8b7..1d020a9 100644
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -554,7 +554,7 @@ got:
 
 	sb->s_dirt = 1;
 	mark_buffer_dirty(bh2);
-	inode->i_uid = current->fsuid;
+	inode->i_uid = current_fsuid();
 	if (test_opt (sb, GRPID))
 		inode->i_gid = dir->i_gid;
 	else if (dir->i_mode & S_ISGID) {
@@ -562,7 +562,7 @@ got:
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
 	} else
-		inode->i_gid = current->fsgid;
+		inode->i_gid = current_fsgid();
 	inode->i_mode = mode;
 
 	inode->i_ino = ino;
diff --git a/fs/ext2/ioctl.c b/fs/ext2/ioctl.c
index 320b2cb..f5fdb95 100644
--- a/fs/ext2/ioctl.c
+++ b/fs/ext2/ioctl.c
@@ -105,7 +105,7 @@ int ext2_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
 		if (IS_RDONLY(inode))
 			return -EROFS;
 
-		if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
+		if ((current_fsuid() != inode->i_uid) && !capable(CAP_FOWNER))
 			return -EACCES;
 
 		if (get_user(rsv_window_size, (int __user *)arg))
diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index a8ba7e8..55e39a3 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -1360,7 +1360,7 @@ static int ext3_has_free_blocks(struct ext3_sb_info *sbi)
 	free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
 	root_blocks = le32_to_cpu(sbi->s_es->s_r_blocks_count);
 	if (free_blocks < root_blocks + 1 && !capable(CAP_SYS_RESOURCE) &&
-		sbi->s_resuid != current->fsuid &&
+		sbi->s_resuid != current_fsuid() &&
 		(sbi->s_resgid == 0 || !in_group_p (sbi->s_resgid))) {
 		return 0;
 	}
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index 1bc8cd8..fe20718 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -543,7 +543,7 @@ got:
 		percpu_counter_inc(&sbi->s_dirs_counter);
 	sb->s_dirt = 1;
 
-	inode->i_uid = current->fsuid;
+	inode->i_uid = current_fsuid();
 	if (test_opt (sb, GRPID))
 		inode->i_gid = dir->i_gid;
 	else if (dir->i_mode & S_ISGID) {
@@ -551,7 +551,7 @@ got:
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
 	} else
-		inode->i_gid = current->fsgid;
+		inode->i_gid = current_fsgid();
 	inode->i_mode = mode;
 
 	inode->i_ino = ino;
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 71ee95e..a67d6bc 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1480,7 +1480,7 @@ static int ext4_has_free_blocks(struct ext4_sb_info *sbi)
 	free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
 	root_blocks = ext4_r_blocks_count(sbi->s_es);
 	if (free_blocks < root_blocks + 1 && !capable(CAP_SYS_RESOURCE) &&
-		sbi->s_resuid != current->fsuid &&
+		sbi->s_resuid != current_fsuid() &&
 		(sbi->s_resgid == 0 || !in_group_p (sbi->s_resgid))) {
 		return 0;
 	}
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index c61f37f..37e95d2 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -674,7 +674,7 @@ got:
 		percpu_counter_inc(&sbi->s_dirs_counter);
 	sb->s_dirt = 1;
 
-	inode->i_uid = current->fsuid;
+	inode->i_uid = current_fsuid();
 	if (test_opt (sb, GRPID))
 		inode->i_gid = dir->i_gid;
 	else if (dir->i_mode & S_ISGID) {
@@ -682,7 +682,7 @@ got:
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
 	} else
-		inode->i_gid = current->fsgid;
+		inode->i_gid = current_fsgid();
 	inode->i_mode = mode;
 
 	inode->i_ino = ino + group * EXT4_INODES_PER_GROUP(sb);
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index db534bc..471a256 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -79,8 +79,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static void fuse_req_init_context(struct fuse_req *req)
 {
-	req->in.h.uid = current->fsuid;
-	req->in.h.gid = current->fsgid;
+	req->in.h.uid = current_fsuid();
+	req->in.h.gid = current_fsgid();
 	req->in.h.pid = current->pid;
 }
 
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 5f6dc32..7719132 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -688,18 +688,18 @@ static void munge_mode_uid_gid(struct gfs2_inode *dip, unsigned int *mode,
 	    (dip->i_inode.i_mode & S_ISUID) && dip->i_inode.i_uid) {
 		if (S_ISDIR(*mode))
 			*mode |= S_ISUID;
-		else if (dip->i_inode.i_uid != current->fsuid)
+		else if (dip->i_inode.i_uid != current_fsuid())
 			*mode &= ~07111;
 		*uid = dip->i_inode.i_uid;
 	} else
-		*uid = current->fsuid;
+		*uid = current_fsuid();
 
 	if (dip->i_inode.i_mode & S_ISGID) {
 		if (S_ISDIR(*mode))
 			*mode |= S_ISGID;
 		*gid = dip->i_inode.i_gid;
 	} else
-		*gid = current->fsgid;
+		*gid = current_fsgid();
 }
 
 static int alloc_dinode(struct gfs2_inode *dip, u64 *no_addr, u64 *generation)
@@ -1108,8 +1108,8 @@ int gfs2_unlink_ok(struct gfs2_inode *dip, const struct qstr *name,
 		return -EPERM;
 
 	if ((dip->i_inode.i_mode & S_ISVTX) &&
-	    dip->i_inode.i_uid != current->fsuid &&
-	    ip->i_inode.i_uid != current->fsuid && !capable(CAP_FOWNER))
+	    dip->i_inode.i_uid != current_fsuid() &&
+	    ip->i_inode.i_uid != current_fsuid() && !capable(CAP_FOWNER))
 		return -EPERM;
 
 	if (IS_APPEND(&dip->i_inode))
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 97f8446..29caee5 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -155,8 +155,8 @@ struct inode *hfs_new_inode(struct inode *dir, struct qstr *name, int mode)
 	hfs_cat_build_key(sb, (btree_key *)&HFS_I(inode)->cat_key, dir->i_ino, name);
 	inode->i_ino = HFS_SB(sb)->next_id++;
 	inode->i_mode = mode;
-	inode->i_uid = current->fsuid;
-	inode->i_gid = current->fsgid;
+	inode->i_uid = current_fsuid();
+	inode->i_gid = current_fsgid();
 	inode->i_nlink = 1;
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME_SEC;
 	HFS_I(inode)->flags = 0;
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 37744cf..af54c28 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -312,8 +312,8 @@ struct inode *hfsplus_new_inode(struct super_block *sb, int mode)
 
 	inode->i_ino = HFSPLUS_SB(sb).next_cnid++;
 	inode->i_mode = mode;
-	inode->i_uid = current->fsuid;
-	inode->i_gid = current->fsgid;
+	inode->i_uid = current_fsuid();
+	inode->i_gid = current_fsgid();
 	inode->i_nlink = 1;
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME_SEC;
 	INIT_LIST_HEAD(&HFSPLUS_I(inode).open_dir_list);
diff --git a/fs/hpfs/namei.c b/fs/hpfs/namei.c
index d256559..2af4578 100644
--- a/fs/hpfs/namei.c
+++ b/fs/hpfs/namei.c
@@ -92,11 +92,11 @@ static int hpfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 	inc_nlink(dir);
 	insert_inode_hash(result);
 
-	if (result->i_uid != current->fsuid ||
-	    result->i_gid != current->fsgid ||
+	if (result->i_uid != current_fsuid() ||
+	    result->i_gid != current_fsgid() ||
 	    result->i_mode != (mode | S_IFDIR)) {
-		result->i_uid = current->fsuid;
-		result->i_gid = current->fsgid;
+		result->i_uid = current_fsuid();
+		result->i_gid = current_fsgid();
 		result->i_mode = mode | S_IFDIR;
 		hpfs_write_inode_nolock(result);
 	}
@@ -184,11 +184,11 @@ static int hpfs_create(struct inode *dir, struct dentry *dentry, int mode, struc
 
 	insert_inode_hash(result);
 
-	if (result->i_uid != current->fsuid ||
-	    result->i_gid != current->fsgid ||
+	if (result->i_uid != current_fsuid() ||
+	    result->i_gid != current_fsgid() ||
 	    result->i_mode != (mode | S_IFREG)) {
-		result->i_uid = current->fsuid;
-		result->i_gid = current->fsgid;
+		result->i_uid = current_fsuid();
+		result->i_gid = current_fsgid();
 		result->i_mode = mode | S_IFREG;
 		hpfs_write_inode_nolock(result);
 	}
@@ -247,8 +247,8 @@ static int hpfs_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t
 	result->i_mtime.tv_nsec = 0;
 	result->i_atime.tv_nsec = 0;
 	hpfs_i(result)->i_ea_size = 0;
-	result->i_uid = current->fsuid;
-	result->i_gid = current->fsgid;
+	result->i_uid = current_fsuid();
+	result->i_gid = current_fsgid();
 	result->i_nlink = 1;
 	result->i_size = 0;
 	result->i_blocks = 1;
@@ -325,8 +325,8 @@ static int hpfs_symlink(struct inode *dir, struct dentry *dentry, const char *sy
 	result->i_atime.tv_nsec = 0;
 	hpfs_i(result)->i_ea_size = 0;
 	result->i_mode = S_IFLNK | 0777;
-	result->i_uid = current->fsuid;
-	result->i_gid = current->fsgid;
+	result->i_uid = current_fsuid();
+	result->i_gid = current_fsgid();
 	result->i_blocks = 1;
 	result->i_nlink = 1;
 	result->i_size = strlen(symlink);
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 09ee07f..39ad919 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -543,9 +543,9 @@ static int hugetlbfs_mknod(struct inode *dir,
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
 	} else {
-		gid = current->fsgid;
+		gid = current_fsgid();
 	}
-	inode = hugetlbfs_get_inode(dir->i_sb, current->fsuid, gid, mode, dev);
+	inode = hugetlbfs_get_inode(dir->i_sb, current_fsuid(), gid, mode, dev);
 	if (inode) {
 		dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 		d_instantiate(dentry, inode);
@@ -578,9 +578,9 @@ static int hugetlbfs_symlink(struct inode *dir,
 	if (dir->i_mode & S_ISGID)
 		gid = dir->i_gid;
 	else
-		gid = current->fsgid;
+		gid = current_fsgid();
 
-	inode = hugetlbfs_get_inode(dir->i_sb, current->fsuid,
+	inode = hugetlbfs_get_inode(dir->i_sb, current_fsuid(),
 					gid, S_IFLNK|S_IRWXUGO, 0);
 	if (inode) {
 		int l = strlen(symname)+1;
@@ -819,8 +819,8 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
 
 	config.nr_blocks = -1; /* No limit on size by default */
 	config.nr_inodes = -1; /* No limit on number of inodes by default */
-	config.uid = current->fsuid;
-	config.gid = current->fsgid;
+	config.uid = current_fsuid();
+	config.gid = current_fsgid();
 	config.mode = 0755;
 	ret = hugetlbfs_parse_options(data, &config);
 	if (ret)
@@ -933,8 +933,8 @@ struct file *hugetlb_file_setup(const char *name, size_t size)
 		goto out_shm_unlock;
 
 	error = -ENOSPC;
-	inode = hugetlbfs_get_inode(root->d_sb, current->fsuid,
-				current->fsgid, S_IFREG | S_IRWXUGO, 0);
+	inode = hugetlbfs_get_inode(root->d_sb, current_fsuid(),
+				current_fsgid(), S_IFREG | S_IRWXUGO, 0);
 	if (!inode)
 		goto out_dentry;
 
diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
index d2e06f7..61849f3 100644
--- a/fs/jffs2/fs.c
+++ b/fs/jffs2/fs.c
@@ -425,14 +425,14 @@ struct inode *jffs2_new_inode (struct inode *dir_i, int mode, struct jffs2_raw_i
 
 	memset(ri, 0, sizeof(*ri));
 	/* Set OS-specific defaults for new inodes */
-	ri->uid = cpu_to_je16(current->fsuid);
+	ri->uid = cpu_to_je16(current_fsuid());
 
 	if (dir_i->i_mode & S_ISGID) {
 		ri->gid = cpu_to_je16(dir_i->i_gid);
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
 	} else {
-		ri->gid = cpu_to_je16(current->fsgid);
+		ri->gid = cpu_to_je16(current_fsgid());
 	}
 
 	/* POSIX ACLs have to be processed now, at least partly.
diff --git a/fs/jfs/jfs_inode.c b/fs/jfs/jfs_inode.c
index ed6574b..70022fd 100644
--- a/fs/jfs/jfs_inode.c
+++ b/fs/jfs/jfs_inode.c
@@ -93,13 +93,13 @@ struct inode *ialloc(struct inode *parent, umode_t mode)
 		return ERR_PTR(rc);
 	}
 
-	inode->i_uid = current->fsuid;
+	inode->i_uid = current_fsuid();
 	if (parent->i_mode & S_ISGID) {
 		inode->i_gid = parent->i_gid;
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
 	} else
-		inode->i_gid = current->fsgid;
+		inode->i_gid = current_fsgid();
 
 	/*
 	 * New inodes need to save sane values on disk when
diff --git a/fs/locks.c b/fs/locks.c
index 8b8388e..359030b 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1358,7 +1358,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
 	struct inode *inode = dentry->d_inode;
 	int error, rdlease_count = 0, wrlease_count = 0;
 
-	if ((current->fsuid != inode->i_uid) && !capable(CAP_LEASE))
+	if ((current_fsuid() != inode->i_uid) && !capable(CAP_LEASE))
 		return -EACCES;
 	if (!S_ISREG(inode->i_mode))
 		return -EINVAL;
diff --git a/fs/minix/bitmap.c b/fs/minix/bitmap.c
index 703cc35..3aebe32 100644
--- a/fs/minix/bitmap.c
+++ b/fs/minix/bitmap.c
@@ -262,8 +262,8 @@ struct inode * minix_new_inode(const struct inode * dir, int * error)
 		iput(inode);
 		return NULL;
 	}
-	inode->i_uid = current->fsuid;
-	inode->i_gid = (dir->i_mode & S_ISGID) ? dir->i_gid : current->fsgid;
+	inode->i_uid = current_fsuid();
+	inode->i_gid = (dir->i_mode & S_ISGID) ? dir->i_gid : current_fsgid();
 	inode->i_ino = j;
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME_SEC;
 	inode->i_blocks = 0;
diff --git a/fs/namei.c b/fs/namei.c
index 3b993db..8963e91 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -184,7 +184,7 @@ int generic_permission(struct inode *inode, int mask,
 {
 	umode_t			mode = inode->i_mode;
 
-	if (current->fsuid == inode->i_uid)
+	if (current_fsuid() == inode->i_uid)
 		mode >>= 6;
 	else {
 		if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl) {
@@ -452,7 +452,7 @@ static int exec_permission_lite(struct inode *inode,
 	if (inode->i_op && inode->i_op->permission)
 		return -EAGAIN;
 
-	if (current->fsuid == inode->i_uid)
+	if (current_fsuid() == inode->i_uid)
 		mode >>= 6;
 	else if (in_group_p(inode->i_gid))
 		mode >>= 3;
@@ -1435,9 +1435,9 @@ static inline int check_sticky(struct inode *dir, struct inode *inode)
 {
 	if (!(dir->i_mode & S_ISVTX))
 		return 0;
-	if (inode->i_uid == current->fsuid)
+	if (inode->i_uid == current_fsuid())
 		return 0;
-	if (dir->i_uid == current->fsuid)
+	if (dir->i_uid == current_fsuid())
 		return 0;
 	return !capable(CAP_FOWNER);
 }
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index d019918..f11c48d 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1857,7 +1857,7 @@ nfsd_permission(struct svc_rqst *rqstp, struct svc_export *exp,
 		IS_APPEND(inode)?	" append" : "",
 		IS_RDONLY(inode)?	" ro" : "");
 	dprintk("      owner %d/%d user %d/%d\n",
-		inode->i_uid, inode->i_gid, current->fsuid, current->fsgid);
+		inode->i_uid, inode->i_gid, current_fsuid(), current->fsgid);
 #endif
 
 	/* Normally we reject any write/sattr etc access on a read-only file
@@ -1899,7 +1899,7 @@ nfsd_permission(struct svc_rqst *rqstp, struct svc_export *exp,
 	 * with NFSv3.
 	 */
 	if ((acc & MAY_OWNER_OVERRIDE) &&
-	    inode->i_uid == current->fsuid)
+	    inode->i_uid == current_fsuid())
 		return 0;
 
 	err = permission(inode, acc & (MAY_READ|MAY_WRITE|MAY_EXEC), NULL);
diff --git a/fs/ocfs2/dlm/dlmfs.c b/fs/ocfs2/dlm/dlmfs.c
index 6639baa..ea65979 100644
--- a/fs/ocfs2/dlm/dlmfs.c
+++ b/fs/ocfs2/dlm/dlmfs.c
@@ -328,8 +328,8 @@ static struct inode *dlmfs_get_root_inode(struct super_block *sb)
 		ip = DLMFS_I(inode);
 
 		inode->i_mode = mode;
-		inode->i_uid = current->fsuid;
-		inode->i_gid = current->fsgid;
+		inode->i_uid = current_fsuid();
+		inode->i_gid = current_fsgid();
 		inode->i_blocks = 0;
 		inode->i_mapping->backing_dev_info = &dlmfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
@@ -354,8 +354,8 @@ static struct inode *dlmfs_get_inode(struct inode *parent,
 		return NULL;
 
 	inode->i_mode = mode;
-	inode->i_uid = current->fsuid;
-	inode->i_gid = current->fsgid;
+	inode->i_uid = current_fsuid();
+	inode->i_gid = current_fsgid();
 	inode->i_blocks = 0;
 	inode->i_mapping->backing_dev_info = &dlmfs_backing_dev_info;
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 989ac27..86e2717 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -426,13 +426,13 @@ static int ocfs2_mknod_locked(struct ocfs2_super *osb,
 	fe->i_blkno = cpu_to_le64(fe_blkno);
 	fe->i_suballoc_bit = cpu_to_le16(suballoc_bit);
 	fe->i_suballoc_slot = cpu_to_le16(osb->slot_num);
-	fe->i_uid = cpu_to_le32(current->fsuid);
+	fe->i_uid = cpu_to_le32(current_fsuid());
 	if (dir->i_mode & S_ISGID) {
 		fe->i_gid = cpu_to_le32(dir->i_gid);
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
 	} else
-		fe->i_gid = cpu_to_le32(current->fsgid);
+		fe->i_gid = cpu_to_le32(current_fsgid());
 	fe->i_mode = cpu_to_le16(mode);
 	if (S_ISCHR(mode) || S_ISBLK(mode))
 		fe->id1.dev1.i_rdev = cpu_to_le64(huge_encode_dev(dev));
diff --git a/fs/pipe.c b/fs/pipe.c
index e66ec48..598cf6c 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -938,8 +938,8 @@ static struct inode * get_pipe_inode(void)
 	 */
 	inode->i_state = I_DIRTY;
 	inode->i_mode = S_IFIFO | S_IRUSR | S_IWUSR;
-	inode->i_uid = current->fsuid;
-	inode->i_gid = current->fsgid;
+	inode->i_uid = current_fsuid();
+	inode->i_gid = current_fsgid();
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 
 	return inode;
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index aec931e..39df95a 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -217,11 +217,11 @@ posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
                 switch(pa->e_tag) {
                         case ACL_USER_OBJ:
 				/* (May have been checked already) */
-                                if (inode->i_uid == current->fsuid)
+				if (inode->i_uid == current_fsuid())
                                         goto check_perm;
                                 break;
                         case ACL_USER:
-                                if (pa->e_id == current->fsuid)
+				if (pa->e_id == current_fsuid())
                                         goto mask;
 				break;
                         case ACL_GROUP_OBJ:
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 8428d5b..98421f7 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -55,8 +55,8 @@ struct inode *ramfs_get_inode(struct super_block *sb, int mode, dev_t dev)
 
 	if (inode) {
 		inode->i_mode = mode;
-		inode->i_uid = current->fsuid;
-		inode->i_gid = current->fsgid;
+		inode->i_uid = current_fsuid();
+		inode->i_gid = current_fsgid();
 		inode->i_blocks = 0;
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c
index b378eea..84458f3 100644
--- a/fs/reiserfs/namei.c
+++ b/fs/reiserfs/namei.c
@@ -582,7 +582,7 @@ static int new_inode_init(struct inode *inode, struct inode *dir, int mode)
 	/* the quota init calls have to know who to charge the quota to, so
 	 ** we have to set uid and gid here
 	 */
-	inode->i_uid = current->fsuid;
+	inode->i_uid = current_fsuid();
 	inode->i_mode = mode;
 	/* Make inode invalid - just in case we are going to drop it before
 	 * the initialization happens */
@@ -593,7 +593,7 @@ static int new_inode_init(struct inode *inode, struct inode *dir, int mode)
 		if (S_ISDIR(mode))
 			inode->i_mode |= S_ISGID;
 	} else {
-		inode->i_gid = current->fsgid;
+		inode->i_gid = current_fsgid();
 	}
 	DQUOT_INIT(inode);
 	return 0;
diff --git a/fs/sysv/ialloc.c b/fs/sysv/ialloc.c
index 115ab0d..241e976 100644
--- a/fs/sysv/ialloc.c
+++ b/fs/sysv/ialloc.c
@@ -165,9 +165,9 @@ struct inode * sysv_new_inode(const struct inode * dir, mode_t mode)
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
 	} else
-		inode->i_gid = current->fsgid;
+		inode->i_gid = current_fsgid();
 
-	inode->i_uid = current->fsuid;
+	inode->i_uid = current_fsuid();
 	inode->i_ino = fs16_to_cpu(sbi, ino);
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME_SEC;
 	inode->i_blocks = 0;
diff --git a/fs/udf/ialloc.c b/fs/udf/ialloc.c
index 636d8f6..1f3c01a 100644
--- a/fs/udf/ialloc.c
+++ b/fs/udf/ialloc.c
@@ -105,13 +105,13 @@ struct inode *udf_new_inode(struct inode *dir, int mode, int *err)
 		mark_buffer_dirty(UDF_SB_LVIDBH(sb));
 	}
 	inode->i_mode = mode;
-	inode->i_uid = current->fsuid;
+	inode->i_uid = current_fsuid();
 	if (dir->i_mode & S_ISGID) {
 		inode->i_gid = dir->i_gid;
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
 	} else {
-		inode->i_gid = current->fsgid;
+		inode->i_gid = current_fsgid();
 	}
 
 	UDF_I_LOCATION(inode).logicalBlockNum = block;
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index bec96a6..e2d466c 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -636,7 +636,7 @@ static int udf_mknod(struct inode *dir, struct dentry *dentry, int mode,
 	if (!inode)
 		goto out;
 
-	inode->i_uid = current->fsuid;
+	inode->i_uid = current_fsuid();
 	init_special_inode(inode, mode, rdev);
 	if (!(fi = udf_add_entry(dir, dentry, &fibh, &cfi, &err))) {
 		inode->i_nlink--;
diff --git a/fs/ufs/ialloc.c b/fs/ufs/ialloc.c
index 7e260bc..88f8889 100644
--- a/fs/ufs/ialloc.c
+++ b/fs/ufs/ialloc.c
@@ -304,13 +304,13 @@ cg_found:
 
 	inode->i_ino = cg * uspi->s_ipg + bit;
 	inode->i_mode = mode;
-	inode->i_uid = current->fsuid;
+	inode->i_uid = current_fsuid();
 	if (dir->i_mode & S_ISGID) {
 		inode->i_gid = dir->i_gid;
 		if (S_ISDIR(mode))
 			inode->i_mode |= S_ISGID;
 	} else
-		inode->i_gid = current->fsgid;
+		inode->i_gid = current_fsgid();
 
 	inode->i_blocks = 0;
 	inode->i_generation = 0;
diff --git a/fs/xfs/linux-2.6/xfs_linux.h b/fs/xfs/linux-2.6/xfs_linux.h
index dc3752d..0b943dd 100644
--- a/fs/xfs/linux-2.6/xfs_linux.h
+++ b/fs/xfs/linux-2.6/xfs_linux.h
@@ -126,8 +126,8 @@
 
 #define current_cpu()		(raw_smp_processor_id())
 #define current_pid()		(current->pid)
-#define current_fsuid(cred)	(current->fsuid)
-#define current_fsgid(cred)	(current->fsgid)
+#define this_fsuid(cred)	(current_fsuid())
+#define this_fsgid(cred)	(current_fsgid())
 #define current_test_flags(f)	(current->flags & (f))
 #define current_set_flags_nested(sp, f)		\
 		(*(sp) = current->flags, current->flags |= (f))
diff --git a/fs/xfs/xfs_acl.c b/fs/xfs/xfs_acl.c
index 5bfb66f..878ca6e 100644
--- a/fs/xfs/xfs_acl.c
+++ b/fs/xfs/xfs_acl.c
@@ -386,7 +386,7 @@ xfs_acl_allow_set(
 	error = xfs_getattr(ip, &va, 0);
 	if (error)
 		return error;
-	if (va.va_uid != current->fsuid && !capable(CAP_FOWNER))
+	if (va.va_uid != current_fsuid() && !capable(CAP_FOWNER))
 		return EPERM;
 	return error;
 }
@@ -460,13 +460,13 @@ xfs_acl_access(
 		switch (fap->acl_entry[i].ae_tag) {
 		case ACL_USER_OBJ:
 			seen_userobj = 1;
-			if (fuid != current->fsuid)
+			if (fuid != current_fsuid())
 				continue;
 			matched.ae_tag = ACL_USER_OBJ;
 			matched.ae_perm = allows;
 			break;
 		case ACL_USER:
-			if (fap->acl_entry[i].ae_id != current->fsuid)
+			if (fap->acl_entry[i].ae_id != current_fsuid())
 				continue;
 			matched.ae_tag = ACL_USER;
 			matched.ae_perm = allows;
diff --git a/fs/xfs/xfs_attr.c b/fs/xfs/xfs_attr.c
index 93fa64d..33c1173 100644
--- a/fs/xfs/xfs_attr.c
+++ b/fs/xfs/xfs_attr.c
@@ -2627,7 +2627,7 @@ attr_user_capable(
 	    !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 	if (S_ISDIR(inode->i_mode) && (inode->i_mode & S_ISVTX) &&
-	    (current_fsuid(cred) != inode->i_uid) && !capable(CAP_FOWNER))
+	    (this_fsuid(cred) != inode->i_uid) && !capable(CAP_FOWNER))
 		return -EPERM;
 	return 0;
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index abf509a..31eba0d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1132,8 +1132,8 @@ xfs_ialloc(
 	ip->i_d.di_onlink = 0;
 	ip->i_d.di_nlink = nlink;
 	ASSERT(ip->i_d.di_nlink == nlink);
-	ip->i_d.di_uid = current_fsuid(cr);
-	ip->i_d.di_gid = current_fsgid(cr);
+	ip->i_d.di_uid = this_fsuid(cr);
+	ip->i_d.di_gid = this_fsgid(cr);
 	ip->i_d.di_projid = prid;
 	memset(&(ip->i_d.di_pad[0]), 0, sizeof(ip->i_d.di_pad));
 
@@ -3640,7 +3640,7 @@ xfs_iaccess(
 	if ((error = _ACL_XFS_IACCESS(ip, mode, cr)) != -1)
 		return error ? XFS_ERROR(error) : 0;
 
-	if (current_fsuid(cr) != ip->i_d.di_uid) {
+	if (this_fsuid(cr) != ip->i_d.di_uid) {
 		mode >>= 3;
 		if (!in_group_p((gid_t)ip->i_d.di_gid))
 			mode >>= 3;
diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index efd5aff..86bc8ec 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -341,7 +341,7 @@ xfs_setattr(
 	xfs_ilock(ip, lock_flags);
 
 	/* boolean: are we the file owner? */
-	file_owner = (current_fsuid(credp) == ip->i_d.di_uid);
+	file_owner = (this_fsuid(credp) == ip->i_d.di_uid);
 
 	/*
 	 * Change various properties of a file.
@@ -1878,7 +1878,7 @@ xfs_create(
 	 * Make sure that we have allocated dquot(s) on disk.
 	 */
 	error = XFS_QM_DQVOPALLOC(mp, dp,
-			current_fsuid(credp), current_fsgid(credp), prid,
+			this_fsuid(credp), this_fsgid(credp), prid,
 			XFS_QMOPT_QUOTALL|XFS_QMOPT_INHERIT, &udqp, &gdqp);
 	if (error)
 		goto std_return;
@@ -2757,7 +2757,7 @@ xfs_mkdir(
 	 * Make sure that we have allocated dquot(s) on disk.
 	 */
 	error = XFS_QM_DQVOPALLOC(mp, dp,
-			current_fsuid(credp), current_fsgid(credp), prid,
+			this_fsuid(credp), this_fsgid(credp), prid,
 			XFS_QMOPT_QUOTALL | XFS_QMOPT_INHERIT, &udqp, &gdqp);
 	if (error)
 		goto std_return;
@@ -3249,7 +3249,7 @@ xfs_symlink(
 	 * Make sure that we have allocated dquot(s) on disk.
 	 */
 	error = XFS_QM_DQVOPALLOC(mp, dp,
-			current_fsuid(credp), current_fsgid(credp), prid,
+			this_fsuid(credp), this_fsgid(credp), prid,
 			XFS_QMOPT_QUOTALL | XFS_QMOPT_INHERIT, &udqp, &gdqp);
 	if (error)
 		goto std_return;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..850d3fc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1057,7 +1057,7 @@ enum {
 #define has_fs_excl() atomic_read(&current->fs_excl)
 
 #define is_owner_or_cap(inode)	\
-	((current->fsuid == (inode)->i_uid) || capable(CAP_FOWNER))
+	((current_fsuid() == (inode)->i_uid) || capable(CAP_FOWNER))
 
 /* not quite ready to be deprecated, but... */
 extern void lock_super(struct super_block *);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ac3d496..88a5626 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1180,6 +1180,9 @@ struct task_struct {
 	struct prop_local_single dirties;
 };
 
+#define current_fsuid() (current->fsuid)
+#define current_fsgid() (current->fsgid)
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 6ca7b97..590045a 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -109,8 +109,8 @@ static struct inode *mqueue_get_inode(struct super_block *sb, int mode,
 	inode = new_inode(sb);
 	if (inode) {
 		inode->i_mode = mode;
-		inode->i_uid = current->fsuid;
-		inode->i_gid = current->fsgid;
+		inode->i_uid = current_fsuid();
+		inode->i_gid = current_fsgid();
 		inode->i_blocks = 0;
 		inode->i_mtime = inode->i_ctime = inode->i_atime =
 				CURRENT_TIME;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1a3c239..1b85df5 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -577,8 +577,8 @@ static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
 
 	if (inode) {
 		inode->i_mode = mode;
-		inode->i_uid = current->fsuid;
-		inode->i_gid = current->fsgid;
+		inode->i_uid = current_fsuid();
+		inode->i_gid = current_fsgid();
 		inode->i_blocks = 0;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		inode->i_mapping->backing_dev_info = &cgroup_backing_dev_info;
diff --git a/mm/shmem.c b/mm/shmem.c
index 51b3d6c..292b329 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1412,8 +1412,8 @@ shmem_get_inode(struct super_block *sb, int mode, dev_t dev)
 	inode = new_inode(sb);
 	if (inode) {
 		inode->i_mode = mode;
-		inode->i_uid = current->fsuid;
-		inode->i_gid = current->fsgid;
+		inode->i_uid = current_fsuid();
+		inode->i_gid = current_fsgid();
 		inode->i_blocks = 0;
 		inode->i_mapping->a_ops = &shmem_aops;
 		inode->i_mapping->backing_dev_info = &shmem_backing_dev_info;
@@ -2241,8 +2241,8 @@ static int shmem_fill_super(struct super_block *sb,
 	struct inode *inode;
 	struct dentry *root;
 	int mode   = S_IRWXUGO | S_ISVTX;
-	uid_t uid = current->fsuid;
-	gid_t gid = current->fsgid;
+	uid_t uid = current_fsuid();
+	gid_t gid = current_fsgid();
 	int err = -ENOMEM;
 	struct shmem_sb_info *sbinfo;
 	unsigned long blocks = 0;
diff --git a/net/9p/client.c b/net/9p/client.c
index af91993..9cb8b3b 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -938,7 +938,7 @@ static struct p9_fid *p9_fid_create(struct p9_client *clnt)
 	fid->rdir_fpos = 0;
 	fid->rdir_pos = 0;
 	fid->rdir_fcall = NULL;
-	fid->uid = current->fsuid;
+	fid->uid = current_fsuid();
 	fid->clnt = clnt;
 	fid->aux = NULL;
 
diff --git a/net/socket.c b/net/socket.c
index 74784df..e5f8151 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -484,8 +484,8 @@ static struct socket *sock_alloc(void)
 	sock = SOCKET_I(inode);
 
 	inode->i_mode = S_IFSOCK | S_IRWXUGO;
-	inode->i_uid = current->fsuid;
-	inode->i_gid = current->fsgid;
+	inode->i_uid = current_fsuid();
+	inode->i_gid = current_fsgid();
 
 	get_cpu_var(sockets_in_use)++;
 	put_cpu_var(sockets_in_use);
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index 1ea2755..390a1ec 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -337,8 +337,8 @@ struct rpc_cred *
 rpcauth_lookupcred(struct rpc_auth *auth, int flags)
 {
 	struct auth_cred acred = {
-		.uid = current->fsuid,
-		.gid = current->fsgid,
+		.uid = current_fsuid(),
+		.gid = current_fsgid(),
 		.group_info = current->group_info,
 	};
 	struct rpc_cred *ret;
@@ -373,8 +373,8 @@ rpcauth_bindcred(struct rpc_task *task)
 {
 	struct rpc_auth *auth = task->tk_client->cl_auth;
 	struct auth_cred acred = {
-		.uid = current->fsuid,
-		.gid = current->fsgid,
+		.uid = current_fsuid(),
+		.gid = current_fsgid(),
 		.group_info = current->group_info,
 	};
 	struct rpc_cred *ret;
diff --git a/security/commoncap.c b/security/commoncap.c
index 5bc1895..bbe188e 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -336,8 +336,8 @@ void cap_bprm_apply_creds (struct linux_binprm *bprm, int unsafe)
 		}
 	}
 
-	current->suid = current->euid = current->fsuid = bprm->e_uid;
-	current->sgid = current->egid = current->fsgid = bprm->e_gid;
+	current->suid = current->euid = current_fsuid() = bprm->e_uid;
+	current->sgid = current->egid = current_fsgid() = bprm->e_gid;
 
 	/* For init, we want to retain the capabilities set
 	 * in the init_task struct. Thus we skip the usual
@@ -466,11 +466,11 @@ int cap_task_post_setuid (uid_t old_ruid, uid_t old_euid, uid_t old_suid,
 			 */
 
 			if (!issecure (SECURE_NO_SETUID_FIXUP)) {
-				if (old_fsuid == 0 && current->fsuid != 0) {
+				if (old_fsuid == 0 && current_fsuid() != 0) {
 					cap_t (current->cap_effective) &=
 					    ~CAP_FS_MASK;
 				}
-				if (old_fsuid != 0 && current->fsuid == 0) {
+				if (old_fsuid != 0 && current_fsuid() == 0) {
 					cap_t (current->cap_effective) |=
 					    (cap_t (current->cap_permitted) &
 					     CAP_FS_MASK);
diff --git a/security/keys/key.c b/security/keys/key.c
index fdd5ca6..48fabb1 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -817,7 +817,7 @@ key_ref_t key_create_or_update(key_ref_t keyring_ref,
 		perm |= KEY_USR_WRITE;
 
 	/* allocate a new key */
-	key = key_alloc(ktype, description, current->fsuid, current->fsgid,
+	key = key_alloc(ktype, description, current_fsuid(), current->fsgid,
 			current, perm, flags);
 	if (IS_ERR(key)) {
 		key_ref = ERR_PTR(PTR_ERR(key));
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 56e963b..b3a63dd 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -810,7 +810,7 @@ long keyctl_setperm_key(key_serial_t id, key_perm_t perm)
 	down_write(&key->sem);
 
 	/* if we're not the sysadmin, we can only change a key that we own */
-	if (capable(CAP_SYS_ADMIN) || key->uid == current->fsuid) {
+	if (capable(CAP_SYS_ADMIN) || key->uid == current_fsuid()) {
 		key->perm = perm;
 		ret = 0;
 	}
diff --git a/security/keys/request_key.c b/security/keys/request_key.c
index aee4897..6d25911 100644
--- a/security/keys/request_key.c
+++ b/security/keys/request_key.c
@@ -76,7 +76,7 @@ static int call_sbin_request_key(struct key_construction *cons,
 	/* allocate a new session keyring */
 	sprintf(desc, "_req.%u", key->serial);
 
-	keyring = keyring_alloc(desc, current->fsuid, current->fsgid, current,
+	keyring = keyring_alloc(desc, current_fsuid(), current->fsgid, current,
 				KEY_ALLOC_QUOTA_OVERRUN, NULL);
 	if (IS_ERR(keyring)) {
 		ret = PTR_ERR(keyring);
@@ -89,8 +89,8 @@ static int call_sbin_request_key(struct key_construction *cons,
 		goto error_link;
 
 	/* record the UID and GID */
-	sprintf(uid_str, "%d", current->fsuid);
-	sprintf(gid_str, "%d", current->fsgid);
+	sprintf(uid_str, "%d", current_fsuid());
+	sprintf(gid_str, "%d", current_fsgid());
 
 	/* we say which key is under construction */
 	sprintf(key_str, "%d", key->serial);
@@ -278,7 +278,7 @@ static int construct_alloc_key(struct key_type *type,
 	mutex_lock(&user->cons_lock);
 
 	key = key_alloc(type, description,
-			current->fsuid, current->fsgid, current, KEY_POS_ALL,
+			current_fsuid(), current->fsgid, current, KEY_POS_ALL,
 			flags);
 	if (IS_ERR(key))
 		goto alloc_failed;
@@ -341,7 +341,7 @@ static struct key *construct_key_and_link(struct key_type *type,
 	struct key *key;
 	int ret;
 
-	user = key_user_lookup(current->fsuid);
+	user = key_user_lookup(current_fsuid());
 	if (!user)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/security/keys/request_key_auth.c b/security/keys/request_key_auth.c
index 6827ae3..cce6b4d 100644
--- a/security/keys/request_key_auth.c
+++ b/security/keys/request_key_auth.c
@@ -194,7 +194,7 @@ struct key *request_key_auth_new(struct key *target, const void *callout_info,
 	sprintf(desc, "%x", target->serial);
 
 	authkey = key_alloc(&key_type_request_key_auth, desc,
-			    current->fsuid, current->fsgid, current,
+			    current_fsuid(), current->fsgid, current,
 			    KEY_POS_VIEW | KEY_POS_READ | KEY_POS_SEARCH |
 			    KEY_USR_VIEW, KEY_ALLOC_NOT_IN_QUOTA);
 	if (IS_ERR(authkey)) {


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 06/28] SECURITY: Separate task security context from task_struct [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (4 preceding siblings ...)
  2007-12-05 19:38 ` [PATCH 05/28] Security: Change current->fs[ug]id to current_fs[ug]id() " David Howells
@ 2007-12-05 19:38 ` David Howells
  2007-12-05 19:38 ` [PATCH 07/28] SECURITY: De-embed task security record from task and use refcounting " David Howells
                   ` (21 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:38 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Separate the task security context from task_struct.  At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 drivers/block/loop.c             |    5 -
 drivers/char/agp/frontend.c      |    2 
 drivers/char/drm/drm_fops.c      |    2 
 drivers/char/tty_audit.c         |    2 
 fs/affs/super.c                  |    4 -
 fs/autofs/inode.c                |    4 -
 fs/autofs4/inode.c               |    4 -
 fs/autofs4/waitq.c               |    4 -
 fs/binfmt_elf.c                  |   12 +-
 fs/cifs/connect.c                |    5 -
 fs/cifs/ioctl.c                  |    2 
 fs/ecryptfs/messaging.c          |   15 +-
 fs/exec.c                        |   20 ++-
 fs/fat/inode.c                   |    4 -
 fs/fcntl.c                       |    7 +
 fs/file_table.c                  |    4 -
 fs/fuse/dir.c                    |   12 +-
 fs/hfs/super.c                   |    4 -
 fs/hfsplus/options.c             |    4 -
 fs/hpfs/super.c                  |    4 -
 fs/hugetlbfs/inode.c             |    4 -
 fs/inotify_user.c                |    2 
 fs/ioprio.c                      |   12 +-
 fs/namei.c                       |    6 +
 fs/ncpfs/ioctl.c                 |   32 ++---
 fs/open.c                        |   22 ++-
 fs/proc/array.c                  |   14 +-
 fs/proc/base.c                   |   16 +-
 fs/proc/proc_sysctl.c            |    4 -
 fs/quota.c                       |    4 -
 fs/smbfs/dir.c                   |    4 -
 fs/smbfs/inode.c                 |    2 
 fs/smbfs/proc.c                  |    2 
 include/linux/init_task.h        |   23 +++
 include/linux/sched.h            |   78 +++++++++---
 include/net/scm.h                |    4 -
 ipc/mqueue.c                     |    4 -
 ipc/msg.c                        |    4 -
 ipc/sem.c                        |    4 -
 ipc/shm.c                        |   16 +-
 ipc/util.c                       |    7 +
 kernel/acct.c                    |    8 +
 kernel/auditsc.c                 |   40 +++---
 kernel/cgroup.c                  |    5 -
 kernel/exit.c                    |   12 +-
 kernel/fork.c                    |   23 ++-
 kernel/futex.c                   |    8 +
 kernel/futex_compat.c            |    5 -
 kernel/ptrace.c                  |   14 +-
 kernel/sched.c                   |    9 +
 kernel/signal.c                  |   26 ++--
 kernel/sys.c                     |  251 ++++++++++++++++++++------------------
 kernel/sysctl.c                  |    2 
 kernel/timer.c                   |    8 +
 kernel/uid16.c                   |   28 ++--
 kernel/user.c                    |    4 -
 kernel/user_namespace.c          |    2 
 mm/oom_kill.c                    |    6 -
 net/core/scm.c                   |   10 +-
 net/sunrpc/auth.c                |    4 -
 net/unix/af_unix.c               |   12 +-
 security/dummy.c                 |   40 ++++--
 security/keys/key.c              |    2 
 security/keys/keyctl.c           |   25 ++--
 security/keys/permission.c       |   11 +-
 security/keys/process_keys.c     |   76 ++++++------
 security/keys/request_key.c      |   17 +--
 security/keys/request_key_auth.c |   14 +-
 security/selinux/exports.c       |    4 -
 security/selinux/hooks.c         |  111 ++++++++---------
 security/selinux/selinuxfs.c     |    2 
 71 files changed, 635 insertions(+), 528 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 56e2304..36caefb 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -928,7 +928,8 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 	int err;
 	struct loop_func_table *xfer;
 
-	if (lo->lo_encrypt_key_size && lo->lo_key_owner != current->uid &&
+	if (lo->lo_encrypt_key_size &&
+	    lo->lo_key_owner != current->act_as->uid &&
 	    !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 	if (lo->lo_state != Lo_bound)
@@ -979,7 +980,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 	if (info->lo_encrypt_key_size) {
 		memcpy(lo->lo_encrypt_key, info->lo_encrypt_key,
 		       info->lo_encrypt_key_size);
-		lo->lo_key_owner = current->uid;
+		lo->lo_key_owner = current->act_as->uid;
 	}	
 
 	return 0;
diff --git a/drivers/char/agp/frontend.c b/drivers/char/agp/frontend.c
index 7791e98..b07d2d2 100644
--- a/drivers/char/agp/frontend.c
+++ b/drivers/char/agp/frontend.c
@@ -689,7 +689,7 @@ static int agp_open(struct inode *inode, struct file *file)
 	set_bit(AGP_FF_ALLOW_CLIENT, &priv->access_flags);
 	priv->my_pid = current->pid;
 
-	if ((current->uid == 0) || (current->suid == 0)) {
+	if ((current->act_as->uid == 0) || (current->act_as->suid == 0)) {
 		/* Root priv, can be controller */
 		set_bit(AGP_FF_ALLOW_CONTROLLER, &priv->access_flags);
 	}
diff --git a/drivers/char/drm/drm_fops.c b/drivers/char/drm/drm_fops.c
index 3992f73..1f8d0a7 100644
--- a/drivers/char/drm/drm_fops.c
+++ b/drivers/char/drm/drm_fops.c
@@ -243,7 +243,7 @@ static int drm_open_helper(struct inode *inode, struct file *filp,
 	memset(priv, 0, sizeof(*priv));
 	filp->private_data = priv;
 	priv->filp = filp;
-	priv->uid = current->euid;
+	priv->uid = current->act_as->euid;
 	priv->pid = task_pid_nr(current);
 	priv->minor = minor;
 	priv->head = drm_heads[minor];
diff --git a/drivers/char/tty_audit.c b/drivers/char/tty_audit.c
index d222012..625c12b 100644
--- a/drivers/char/tty_audit.c
+++ b/drivers/char/tty_audit.c
@@ -86,7 +86,7 @@ static void tty_audit_buf_push(struct task_struct *tsk, uid_t loginuid,
 		char name[sizeof(tsk->comm)];
 
 		audit_log_format(ab, "tty pid=%u uid=%u auid=%u major=%d "
-				 "minor=%d comm=", tsk->pid, tsk->uid,
+				 "minor=%d comm=", tsk->pid, tsk->sec->uid,
 				 loginuid, buf->major, buf->minor);
 		get_task_comm(name, tsk);
 		audit_log_untrustedstring(ab, name);
diff --git a/fs/affs/super.c b/fs/affs/super.c
index b53e5d0..ed79ab3 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -159,8 +159,8 @@ parse_options(char *options, uid_t *uid, gid_t *gid, int *mode, int *reserved, s
 
 	/* Fill in defaults */
 
-	*uid        = current->uid;
-	*gid        = current->gid;
+	*uid        = current->sec->uid;
+	*gid        = current->sec->gid;
 	*reserved   = 2;
 	*root       = -1;
 	*blocksize  = -1;
diff --git a/fs/autofs/inode.c b/fs/autofs/inode.c
index 45f5992..ac3bd58 100644
--- a/fs/autofs/inode.c
+++ b/fs/autofs/inode.c
@@ -78,8 +78,8 @@ static int parse_options(char *options, int *pipefd, uid_t *uid, gid_t *gid,
 	substring_t args[MAX_OPT_ARGS];
 	int option;
 
-	*uid = current->uid;
-	*gid = current->gid;
+	*uid = current->sec->uid;
+	*gid = current->sec->gid;
 	*pgrp = task_pgrp_nr(current);
 
 	*minproto = *maxproto = AUTOFS_PROTO_VERSION;
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 7f05d6c..fac6121 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -224,8 +224,8 @@ static int parse_options(char *options, int *pipefd, uid_t *uid, gid_t *gid,
 	substring_t args[MAX_OPT_ARGS];
 	int option;
 
-	*uid = current->uid;
-	*gid = current->gid;
+	*uid = current->sec->uid;
+	*gid = current->sec->gid;
 	*pgrp = task_pgrp_nr(current);
 
 	*minproto = AUTOFS_MIN_PROTO_VERSION;
diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 1fe28e4..f41f5b7 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -294,8 +294,8 @@ int autofs4_wait(struct autofs_sb_info *sbi, struct dentry *dentry,
 		wq->len = len;
 		wq->dev = autofs4_get_dev(sbi);
 		wq->ino = autofs4_get_ino(sbi);
-		wq->uid = current->uid;
-		wq->gid = current->gid;
+		wq->uid = current->sec->uid;
+		wq->gid = current->sec->gid;
 		wq->pid = current->pid;
 		wq->tgid = current->tgid;
 		wq->status = -EINTR; /* Status return if interrupted */
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index ba8de7c..d50db36 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -200,10 +200,10 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
 	NEW_AUX_ENT(AT_BASE, interp_load_addr);
 	NEW_AUX_ENT(AT_FLAGS, 0);
 	NEW_AUX_ENT(AT_ENTRY, exec->e_entry);
-	NEW_AUX_ENT(AT_UID, tsk->uid);
-	NEW_AUX_ENT(AT_EUID, tsk->euid);
-	NEW_AUX_ENT(AT_GID, tsk->gid);
-	NEW_AUX_ENT(AT_EGID, tsk->egid);
+	NEW_AUX_ENT(AT_UID, tsk->sec->uid);
+	NEW_AUX_ENT(AT_EUID, tsk->sec->euid);
+	NEW_AUX_ENT(AT_GID, tsk->sec->gid);
+	NEW_AUX_ENT(AT_EGID, tsk->sec->egid);
  	NEW_AUX_ENT(AT_SECURE, security_bprm_secureexec(bprm));
 	if (k_platform) {
 		NEW_AUX_ENT(AT_PLATFORM,
@@ -1440,8 +1440,8 @@ static int fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p,
 	psinfo->pr_zomb = psinfo->pr_sname == 'Z';
 	psinfo->pr_nice = task_nice(p);
 	psinfo->pr_flag = p->flags;
-	SET_UID(psinfo->pr_uid, p->uid);
-	SET_GID(psinfo->pr_gid, p->gid);
+	SET_UID(psinfo->pr_uid, p->sec->uid);
+	SET_GID(psinfo->pr_gid, p->sec->gid);
 	strncpy(psinfo->pr_fname, p->comm, sizeof(psinfo->pr_fname));
 	
 	return 0;
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index fd9147c..77b3e30 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -818,8 +818,9 @@ cifs_parse_mount_options(char *options, const char *devname,
 	/* null target name indicates to use *SMBSERVR default called name
 	   if we end up sending RFC1001 session initialize */
 	vol->target_rfc1001_name[0] = 0;
-	vol->linux_uid = current->uid;	/* current->euid instead? */
-	vol->linux_gid = current->gid;
+	vol->linux_uid = current->sec->uid;  /* use current->act_as->euid
+					      * instead? */
+	vol->linux_gid = current->sec->gid;
 	vol->dir_mode = S_IRWXUGO;
 	/* 2767 perms indicate mandatory locking support */
 	vol->file_mode = (S_IRWXUGO | S_ISGID) & (~S_IXGRP);
diff --git a/fs/cifs/ioctl.c b/fs/cifs/ioctl.c
index d24fe68..bf61a78 100644
--- a/fs/cifs/ioctl.c
+++ b/fs/cifs/ioctl.c
@@ -65,7 +65,7 @@ int cifs_ioctl (struct inode *inode, struct file *filep,
 	switch (command) {
 		case CIFS_IOC_CHECKUMOUNT:
 			cFYI(1, ("User unmount attempted"));
-			if (cifs_sb->mnt_uid == current->uid)
+			if (cifs_sb->mnt_uid == current->sec->uid)
 				rc = 0;
 			else {
 				rc = -EACCES;
diff --git a/fs/ecryptfs/messaging.c b/fs/ecryptfs/messaging.c
index a96d341..8f78205 100644
--- a/fs/ecryptfs/messaging.c
+++ b/fs/ecryptfs/messaging.c
@@ -264,26 +264,27 @@ int ecryptfs_process_response(struct ecryptfs_message *msg, uid_t uid,
 	}
 	msg_ctx = &ecryptfs_msg_ctx_arr[msg->index];
 	mutex_lock(&msg_ctx->mux);
-	if (ecryptfs_find_daemon_id(msg_ctx->task->euid, &id)) {
+	if (ecryptfs_find_daemon_id(msg_ctx->task->sec->euid, &id)) {
 		rc = -EBADMSG;
 		ecryptfs_printk(KERN_WARNING, "User [%d] received a "
 				"message response from process [%d] but does "
 				"not have a registered daemon\n",
-				msg_ctx->task->euid, pid);
+				msg_ctx->task->sec->euid, pid);
 		goto wake_up;
 	}
-	if (msg_ctx->task->euid != uid) {
+	if (msg_ctx->task->sec->euid != uid) {
 		rc = -EBADMSG;
 		ecryptfs_printk(KERN_WARNING, "Received message from user "
 				"[%d]; expected message from user [%d]\n",
-				uid, msg_ctx->task->euid);
+				uid, msg_ctx->task->sec->euid);
 		goto unlock;
 	}
 	if (id->pid != pid) {
 		rc = -EBADMSG;
 		ecryptfs_printk(KERN_ERR, "User [%d] received a "
 				"message response from an unrecognized "
-				"process [%d]\n", msg_ctx->task->euid, pid);
+				"process [%d]\n",
+				msg_ctx->task->sec->euid, pid);
 		goto unlock;
 	}
 	if (msg_ctx->state != ECRYPTFS_MSG_CTX_STATE_PENDING) {
@@ -331,11 +332,11 @@ int ecryptfs_send_message(unsigned int transport, char *data, int data_len,
 	int rc;
 
 	mutex_lock(&ecryptfs_daemon_id_hash_mux);
-	if (ecryptfs_find_daemon_id(current->euid, &id)) {
+	if (ecryptfs_find_daemon_id(current->act_as->euid, &id)) {
 		mutex_unlock(&ecryptfs_daemon_id_hash_mux);
 		rc = -ENOTCONN;
 		ecryptfs_printk(KERN_ERR, "User [%d] does not have a daemon "
-				"registered\n", current->euid);
+				"registered\n", current->sec->euid);
 		goto out;
 	}
 	mutex_unlock(&ecryptfs_daemon_id_hash_mux);
diff --git a/fs/exec.c b/fs/exec.c
index a09ce1b..da01655 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1000,7 +1000,8 @@ int flush_old_exec(struct linux_binprm * bprm)
 
 	current->sas_ss_sp = current->sas_ss_size = 0;
 
-	if (current->euid == current->uid && current->egid == current->gid)
+	if (current->sec->euid == current->sec->uid &&
+	    current->sec->egid == current->sec->gid)
 		set_dumpable(current->mm, 1);
 	else
 		set_dumpable(current->mm, suid_dumpable);
@@ -1027,7 +1028,8 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 */
 	current->mm->task_size = TASK_SIZE;
 
-	if (bprm->e_uid != current->euid || bprm->e_gid != current->egid) {
+	if (bprm->e_uid != current->sec->euid ||
+	    bprm->e_gid != current->sec->egid) {
 		suid_keys(current);
 		set_dumpable(current->mm, suid_dumpable);
 		current->pdeath_signal = 0;
@@ -1069,8 +1071,8 @@ int prepare_binprm(struct linux_binprm *bprm)
 	if (bprm->file->f_op == NULL)
 		return -EACCES;
 
-	bprm->e_uid = current->euid;
-	bprm->e_gid = current->egid;
+	bprm->e_uid = current->sec->euid;
+	bprm->e_gid = current->sec->egid;
 
 	if(!(bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)) {
 		/* Set-uid? */
@@ -1123,7 +1125,7 @@ void compute_creds(struct linux_binprm *bprm)
 {
 	int unsafe;
 
-	if (bprm->e_uid != current->uid) {
+	if (bprm->e_uid != current->sec->uid) {
 		suid_keys(current);
 		current->pdeath_signal = 0;
 	}
@@ -1441,7 +1443,7 @@ static int format_corename(char *corename, const char *pattern, long signr)
 			/* uid */
 			case 'u':
 				rc = snprintf(out_ptr, out_end - out_ptr,
-					      "%d", current->uid);
+					      "%d", current->sec->uid);
 				if (rc > out_end - out_ptr)
 					goto out;
 				out_ptr += rc;
@@ -1449,7 +1451,7 @@ static int format_corename(char *corename, const char *pattern, long signr)
 			/* gid */
 			case 'g':
 				rc = snprintf(out_ptr, out_end - out_ptr,
-					      "%d", current->gid);
+					      "%d", current->sec->gid);
 				if (rc > out_end - out_ptr)
 					goto out;
 				out_ptr += rc;
@@ -1707,7 +1709,7 @@ int do_coredump(long signr, int exit_code, struct pt_regs * regs)
 	 */
 	if (get_dumpable(mm) == 2) {	/* Setuid core dump mode */
 		flag = O_EXCL;		/* Stop rewrite attacks */
-		current->fsuid = 0;	/* Dump root private */
+		current->act_as->fsuid = 0;	/* Dump root private */
 	}
 
 	retval = coredump_wait(exit_code);
@@ -1803,7 +1805,7 @@ fail_unlock:
 	if (helper_argv)
 		argv_free(helper_argv);
 
-	current->fsuid = fsuid;
+	current->act_as->fsuid = fsuid;
 	complete_all(&mm->core_done);
 fail:
 	return retval;
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 920a576..f49733f 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -934,8 +934,8 @@ static int parse_options(char *options, int is_vfat, int silent, int *debug,
 
 	opts->isvfat = is_vfat;
 
-	opts->fs_uid = current->uid;
-	opts->fs_gid = current->gid;
+	opts->fs_uid = current->sec->uid;
+	opts->fs_gid = current->sec->gid;
 	opts->fs_fmask = opts->fs_dmask = current->fs->umask;
 	opts->codepage = fat_default_codepage;
 	opts->iocharset = fat_default_iocharset;
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 8685263..35d8f74 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -276,7 +276,8 @@ int __f_setown(struct file *filp, struct pid *pid, enum pid_type type,
 	if (err)
 		return err;
 
-	f_modown(filp, pid, type, current->uid, current->euid, force);
+	f_modown(filp, pid, type, current->sec->uid, current->act_as->euid,
+		 force);
 	return 0;
 }
 EXPORT_SYMBOL(__f_setown);
@@ -461,8 +462,8 @@ static inline int sigio_perm(struct task_struct *p,
                              struct fown_struct *fown, int sig)
 {
 	return (((fown->euid == 0) ||
-		 (fown->euid == p->suid) || (fown->euid == p->uid) ||
-		 (fown->uid == p->suid) || (fown->uid == p->uid)) &&
+		 (fown->euid == p->sec->suid) || (fown->euid == p->sec->uid) ||
+		 (fown->uid == p->sec->suid) || (fown->uid == p->sec->uid)) &&
 		!security_file_send_sigiotask(p, fown, sig));
 }
 
diff --git a/fs/file_table.c b/fs/file_table.c
index 664e3f2..e559b50 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -114,8 +114,8 @@ struct file *get_empty_filp(void)
 	INIT_LIST_HEAD(&f->f_u.fu_list);
 	atomic_set(&f->f_count, 1);
 	rwlock_init(&f->f_owner.lock);
-	f->f_uid = tsk->fsuid;
-	f->f_gid = tsk->fsgid;
+	f->f_uid = tsk->act_as->fsuid;
+	f->f_gid = tsk->act_as->fsgid;
 	eventpoll_init_file(f);
 	/* f->f_version: 0 */
 	return f;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 80d2f52..0c78b97 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -830,12 +830,12 @@ int fuse_allow_task(struct fuse_conn *fc, struct task_struct *task)
 	if (fc->flags & FUSE_ALLOW_OTHER)
 		return 1;
 
-	if (task->euid == fc->user_id &&
-	    task->suid == fc->user_id &&
-	    task->uid == fc->user_id &&
-	    task->egid == fc->group_id &&
-	    task->sgid == fc->group_id &&
-	    task->gid == fc->group_id)
+	if (task->sec->euid == fc->user_id &&
+	    task->sec->suid == fc->user_id &&
+	    task->sec->uid == fc->user_id &&
+	    task->sec->egid == fc->group_id &&
+	    task->sec->sgid == fc->group_id &&
+	    task->sec->gid == fc->group_id)
 		return 1;
 
 	return 0;
diff --git a/fs/hfs/super.c b/fs/hfs/super.c
index 16cbd90..54a1d32 100644
--- a/fs/hfs/super.c
+++ b/fs/hfs/super.c
@@ -210,8 +210,8 @@ static int parse_options(char *options, struct hfs_sb_info *hsb)
 	int tmp, token;
 
 	/* initialize the sb with defaults */
-	hsb->s_uid = current->uid;
-	hsb->s_gid = current->gid;
+	hsb->s_uid = current->sec->uid;
+	hsb->s_gid = current->sec->gid;
 	hsb->s_file_umask = 0133;
 	hsb->s_dir_umask = 0022;
 	hsb->s_type = hsb->s_creator = cpu_to_be32(0x3f3f3f3f);	/* == '????' */
diff --git a/fs/hfsplus/options.c b/fs/hfsplus/options.c
index dc64fac..fa5e015 100644
--- a/fs/hfsplus/options.c
+++ b/fs/hfsplus/options.c
@@ -49,8 +49,8 @@ void hfsplus_fill_defaults(struct hfsplus_sb_info *opts)
 	opts->creator = HFSPLUS_DEF_CR_TYPE;
 	opts->type = HFSPLUS_DEF_CR_TYPE;
 	opts->umask = current->fs->umask;
-	opts->uid = current->uid;
-	opts->gid = current->gid;
+	opts->uid = current->sec->uid;
+	opts->gid = current->sec->gid;
 	opts->part = -1;
 	opts->session = -1;
 }
diff --git a/fs/hpfs/super.c b/fs/hpfs/super.c
index 00971d9..cf4c6b5 100644
--- a/fs/hpfs/super.c
+++ b/fs/hpfs/super.c
@@ -464,8 +464,8 @@ static int hpfs_fill_super(struct super_block *s, void *options, int silent)
 
 	init_MUTEX(&sbi->hpfs_creation_de);
 
-	uid = current->uid;
-	gid = current->gid;
+	uid = current->sec->uid;
+	gid = current->sec->gid;
 	umask = current->fs->umask;
 	lowercase = 0;
 	conv = CONV_BINARY;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 39ad919..1e4d411 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -921,7 +921,7 @@ struct file *hugetlb_file_setup(const char *name, size_t size)
 	if (!can_do_hugetlb_shm())
 		return ERR_PTR(-EPERM);
 
-	if (!user_shm_lock(size, current->user))
+	if (!user_shm_lock(size, current->sec->user))
 		return ERR_PTR(-ENOMEM);
 
 	root = hugetlbfs_vfsmount->mnt_root;
@@ -960,7 +960,7 @@ out_inode:
 out_dentry:
 	dput(dentry);
 out_shm_unlock:
-	user_shm_unlock(size, current->user);
+	user_shm_unlock(size, current->sec->user);
 	return ERR_PTR(error);
 }
 
diff --git a/fs/inotify_user.c b/fs/inotify_user.c
index 5e00933..6d68f3e 100644
--- a/fs/inotify_user.c
+++ b/fs/inotify_user.c
@@ -558,7 +558,7 @@ asmlinkage long sys_inotify_init(void)
 		goto out_put_fd;
 	}
 
-	user = get_uid(current->user);
+	user = get_uid(current->sec->user);
 	if (unlikely(atomic_read(&user->inotify_devs) >=
 			inotify_max_user_instances)) {
 		ret = -EMFILE;
diff --git a/fs/ioprio.c b/fs/ioprio.c
index e4e01bc..5392a60 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -32,8 +32,8 @@ static int set_task_ioprio(struct task_struct *task, int ioprio)
 	int err;
 	struct io_context *ioc;
 
-	if (task->uid != current->euid &&
-	    task->uid != current->uid && !capable(CAP_SYS_NICE))
+	if (task->sec->uid != current->act_as->euid &&
+	    task->sec->uid != current->act_as->uid && !capable(CAP_SYS_NICE))
 		return -EPERM;
 
 	err = security_task_setioprio(task, ioprio);
@@ -115,7 +115,7 @@ asmlinkage long sys_ioprio_set(int which, int who, int ioprio)
 			break;
 		case IOPRIO_WHO_USER:
 			if (!who)
-				user = current->user;
+				user = current->sec->user;
 			else
 				user = find_user(who);
 
@@ -123,7 +123,7 @@ asmlinkage long sys_ioprio_set(int which, int who, int ioprio)
 				break;
 
 			do_each_thread(g, p) {
-				if (p->uid != who)
+				if (p->sec->uid != who)
 					continue;
 				ret = set_task_ioprio(p, ioprio);
 				if (ret)
@@ -206,7 +206,7 @@ asmlinkage long sys_ioprio_get(int which, int who)
 			break;
 		case IOPRIO_WHO_USER:
 			if (!who)
-				user = current->user;
+				user = current->sec->user;
 			else
 				user = find_user(who);
 
@@ -214,7 +214,7 @@ asmlinkage long sys_ioprio_get(int which, int who)
 				break;
 
 			do_each_thread(g, p) {
-				if (p->uid != user->uid)
+				if (p->sec->uid != user->uid)
 					continue;
 				tmpio = get_task_ioprio(p);
 				if (tmpio < 0)
diff --git a/fs/namei.c b/fs/namei.c
index 8963e91..428c195 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1433,11 +1433,13 @@ int fastcall __user_walk(const char __user *name, unsigned flags, struct nameida
  */
 static inline int check_sticky(struct inode *dir, struct inode *inode)
 {
+	uid_t fsuid = current->act_as->fsuid;
+
 	if (!(dir->i_mode & S_ISVTX))
 		return 0;
-	if (inode->i_uid == current_fsuid())
+	if (inode->i_uid == fsuid)
 		return 0;
-	if (dir->i_uid == current_fsuid())
+	if (dir->i_uid == fsuid)
 		return 0;
 	return !capable(CAP_FOWNER);
 }
diff --git a/fs/ncpfs/ioctl.c b/fs/ncpfs/ioctl.c
index c67b4bd..5f1adaf 100644
--- a/fs/ncpfs/ioctl.c
+++ b/fs/ncpfs/ioctl.c
@@ -40,7 +40,7 @@ ncp_get_fs_info(struct ncp_server * server, struct file *file,
 	struct ncp_fs_info info;
 
 	if ((file_permission(file, MAY_WRITE) != 0)
-	    && (current->uid != server->m.mounted_uid)) {
+	    && (current->act_as->uid != server->m.mounted_uid)) {
 		return -EACCES;
 	}
 	if (copy_from_user(&info, arg, sizeof(info)))
@@ -70,7 +70,7 @@ ncp_get_fs_info_v2(struct ncp_server * server, struct file *file,
 	struct ncp_fs_info_v2 info2;
 
 	if ((file_permission(file, MAY_WRITE) != 0)
-	    && (current->uid != server->m.mounted_uid)) {
+	    && (current->act_as->uid != server->m.mounted_uid)) {
 		return -EACCES;
 	}
 	if (copy_from_user(&info2, arg, sizeof(info2)))
@@ -141,7 +141,7 @@ ncp_get_compat_fs_info_v2(struct ncp_server * server, struct file *file,
 	struct compat_ncp_fs_info_v2 info2;
 
 	if ((file_permission(file, MAY_WRITE) != 0)
-	    && (current->uid != server->m.mounted_uid)) {
+	    && (current->act_as->uid != server->m.mounted_uid)) {
 		return -EACCES;
 	}
 	if (copy_from_user(&info2, arg, sizeof(info2)))
@@ -276,7 +276,7 @@ int ncp_ioctl(struct inode *inode, struct file *filp,
 #endif
 	case NCP_IOC_NCPREQUEST:
 		if ((file_permission(filp, MAY_WRITE) != 0)
-		    && (current->uid != server->m.mounted_uid)) {
+		    && (current->act_as->uid != server->m.mounted_uid)) {
 			return -EACCES;
 		}
 #ifdef CONFIG_COMPAT
@@ -356,7 +356,7 @@ int ncp_ioctl(struct inode *inode, struct file *filp,
 	case NCP_IOC_GETMOUNTUID32:
 	case NCP_IOC_GETMOUNTUID64:
 		if ((file_permission(filp, MAY_READ) != 0)
-			&& (current->uid != server->m.mounted_uid)) {
+			&& (current->act_as->uid != server->m.mounted_uid)) {
 			return -EACCES;
 		}
 		if (cmd == NCP_IOC_GETMOUNTUID16) {
@@ -380,7 +380,7 @@ int ncp_ioctl(struct inode *inode, struct file *filp,
 			struct ncp_setroot_ioctl sr;
 
 			if ((file_permission(filp, MAY_READ) != 0)
-			    && (current->uid != server->m.mounted_uid))
+			    && (current->act_as->uid != server->m.mounted_uid))
 			{
 				return -EACCES;
 			}
@@ -455,7 +455,7 @@ int ncp_ioctl(struct inode *inode, struct file *filp,
 #ifdef CONFIG_NCPFS_PACKET_SIGNING	
 	case NCP_IOC_SIGN_INIT:
 		if ((file_permission(filp, MAY_WRITE) != 0)
-		    && (current->uid != server->m.mounted_uid))
+		    && (current->act_as->uid != server->m.mounted_uid))
 		{
 			return -EACCES;
 		}
@@ -478,7 +478,7 @@ int ncp_ioctl(struct inode *inode, struct file *filp,
 		
         case NCP_IOC_SIGN_WANTED:
 		if ((file_permission(filp, MAY_READ) != 0)
-		    && (current->uid != server->m.mounted_uid))
+		    && (current->act_as->uid != server->m.mounted_uid))
 		{
 			return -EACCES;
 		}
@@ -491,7 +491,7 @@ int ncp_ioctl(struct inode *inode, struct file *filp,
 			int newstate;
 
 			if ((file_permission(filp, MAY_WRITE) != 0)
-			    && (current->uid != server->m.mounted_uid))
+			    && (current->act_as->uid != server->m.mounted_uid))
 			{
 				return -EACCES;
 			}
@@ -512,7 +512,7 @@ int ncp_ioctl(struct inode *inode, struct file *filp,
 #ifdef CONFIG_NCPFS_IOCTL_LOCKING
 	case NCP_IOC_LOCKUNLOCK:
 		if ((file_permission(filp, MAY_WRITE) != 0)
-		    && (current->uid != server->m.mounted_uid))
+		    && (current->act_as->uid != server->m.mounted_uid))
 		{
 			return -EACCES;
 		}
@@ -585,7 +585,7 @@ outrel:
 
 #ifdef CONFIG_COMPAT
 	case NCP_IOC_GETOBJECTNAME_32:
-		if (current->uid != server->m.mounted_uid) {
+		if (current->act_as->uid != server->m.mounted_uid) {
 			return -EACCES;
 		}
 		{
@@ -610,7 +610,7 @@ outrel:
 		}
 #endif
 	case NCP_IOC_GETOBJECTNAME:
-		if (current->uid != server->m.mounted_uid) {
+		if (current->act_as->uid != server->m.mounted_uid) {
 			return -EACCES;
 		}
 		{
@@ -637,7 +637,7 @@ outrel:
 	case NCP_IOC_SETOBJECTNAME_32:
 #endif
 	case NCP_IOC_SETOBJECTNAME:
-		if (current->uid != server->m.mounted_uid) {
+		if (current->act_as->uid != server->m.mounted_uid) {
 			return -EACCES;
 		}
 		{
@@ -695,7 +695,7 @@ outrel:
 	case NCP_IOC_GETPRIVATEDATA_32:
 #endif
 	case NCP_IOC_GETPRIVATEDATA:
-		if (current->uid != server->m.mounted_uid) {
+		if (current->act_as->uid != server->m.mounted_uid) {
 			return -EACCES;
 		}
 		{
@@ -740,7 +740,7 @@ outrel:
 	case NCP_IOC_SETPRIVATEDATA_32:
 #endif
 	case NCP_IOC_SETPRIVATEDATA:
-		if (current->uid != server->m.mounted_uid) {
+		if (current->act_as->uid != server->m.mounted_uid) {
 			return -EACCES;
 		}
 		{
@@ -795,7 +795,7 @@ outrel:
 
 	case NCP_IOC_SETDENTRYTTL:
 		if ((file_permission(filp, MAY_WRITE) != 0) &&
-				 (current->uid != server->m.mounted_uid))
+		    current->act_as->uid != server->m.mounted_uid)
 			return -EACCES;
 		{
 			u_int32_t user;
diff --git a/fs/open.c b/fs/open.c
index 4932b4d..6d7a29c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -428,12 +428,12 @@ asmlinkage long sys_faccessat(int dfd, const char __user *filename, int mode)
 	if (mode & ~S_IRWXO)	/* where's F_OK, X_OK, W_OK, R_OK? */
 		return -EINVAL;
 
-	old_fsuid = current->fsuid;
-	old_fsgid = current->fsgid;
-	old_cap = current->cap_effective;
+	old_fsuid = current->act_as->fsuid;
+	old_fsgid = current->act_as->fsgid;
+	old_cap = current->act_as->cap_effective;
 
-	current->fsuid = current->uid;
-	current->fsgid = current->gid;
+	current->act_as->fsuid = current->act_as->uid;
+	current->act_as->fsgid = current->act_as->gid;
 
 	/*
 	 * Clear the capabilities if we switch to a non-root user
@@ -443,10 +443,10 @@ asmlinkage long sys_faccessat(int dfd, const char __user *filename, int mode)
 	 * value below.  We should hold task_capabilities_lock,
 	 * but we cannot because user_path_walk can sleep.
 	 */
-	if (current->uid)
-		cap_clear(current->cap_effective);
+	if (current->act_as->uid)
+		cap_clear(current->act_as->cap_effective);
 	else
-		current->cap_effective = current->cap_permitted;
+		current->act_as->cap_effective = current->act_as->cap_permitted;
 
 	res = __user_walk_fd(dfd, filename, LOOKUP_FOLLOW|LOOKUP_ACCESS, &nd);
 	if (res)
@@ -464,9 +464,9 @@ asmlinkage long sys_faccessat(int dfd, const char __user *filename, int mode)
 out_path_release:
 	path_release(&nd);
 out:
-	current->fsuid = old_fsuid;
-	current->fsgid = old_fsgid;
-	current->cap_effective = old_cap;
+	current->act_as->fsuid = old_fsuid;
+	current->act_as->fsgid = old_fsgid;
+	current->act_as->cap_effective = old_cap;
 
 	return res;
 }
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 65c62e1..94875ce 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -182,8 +182,8 @@ static inline char *task_state(struct task_struct *p, char *buffer)
 		task_tgid_nr_ns(p, ns),
 		task_pid_nr_ns(p, ns),
 		ppid, tpid,
-		p->uid, p->euid, p->suid, p->fsuid,
-		p->gid, p->egid, p->sgid, p->fsgid);
+		p->sec->uid, p->sec->euid, p->sec->suid, p->sec->fsuid,
+		p->sec->gid, p->sec->egid, p->sec->sgid, p->sec->fsgid);
 
 	task_lock(p);
 	if (p->files)
@@ -194,7 +194,7 @@ static inline char *task_state(struct task_struct *p, char *buffer)
 		fdt ? fdt->max_fds : 0);
 	rcu_read_unlock();
 
-	group_info = p->group_info;
+	group_info = p->sec->group_info;
 	get_group_info(group_info);
 	task_unlock(p);
 
@@ -267,7 +267,7 @@ static inline char *task_sig(struct task_struct *p, char *buffer)
 		blocked = p->blocked;
 		collect_sigign_sigcatch(p, &ignored, &caught);
 		num_threads = atomic_read(&p->signal->count);
-		qsize = atomic_read(&p->user->sigpending);
+		qsize = atomic_read(&p->sec->user->sigpending);
 		qlim = p->signal->rlim[RLIMIT_SIGPENDING].rlim_cur;
 		unlock_task_sighand(p, &flags);
 	}
@@ -291,9 +291,9 @@ static inline char *task_cap(struct task_struct *p, char *buffer)
     return buffer + sprintf(buffer, "CapInh:\t%016x\n"
 			    "CapPrm:\t%016x\n"
 			    "CapEff:\t%016x\n",
-			    cap_t(p->cap_inheritable),
-			    cap_t(p->cap_permitted),
-			    cap_t(p->cap_effective));
+			    cap_t(p->sec->cap_inheritable),
+			    cap_t(p->sec->cap_permitted),
+			    cap_t(p->sec->cap_effective));
 }
 
 static inline char *task_context_switch_counts(struct task_struct *p,
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 02a63ac..4e4482b 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1181,8 +1181,8 @@ static struct inode *proc_pid_make_inode(struct super_block * sb, struct task_st
 	inode->i_uid = 0;
 	inode->i_gid = 0;
 	if (task_dumpable(task)) {
-		inode->i_uid = task->euid;
-		inode->i_gid = task->egid;
+		inode->i_uid = task->sec->euid;
+		inode->i_gid = task->sec->egid;
 	}
 	security_task_to_inode(task, inode);
 
@@ -1207,8 +1207,8 @@ static int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat
 	if (task) {
 		if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
 		    task_dumpable(task)) {
-			stat->uid = task->euid;
-			stat->gid = task->egid;
+			stat->uid = task->sec->euid;
+			stat->gid = task->sec->egid;
 		}
 	}
 	rcu_read_unlock();
@@ -1239,8 +1239,8 @@ static int pid_revalidate(struct dentry *dentry, struct nameidata *nd)
 	if (task) {
 		if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
 		    task_dumpable(task)) {
-			inode->i_uid = task->euid;
-			inode->i_gid = task->egid;
+			inode->i_uid = task->sec->euid;
+			inode->i_gid = task->sec->egid;
 		} else {
 			inode->i_uid = 0;
 			inode->i_gid = 0;
@@ -1413,8 +1413,8 @@ static int tid_fd_revalidate(struct dentry *dentry, struct nameidata *nd)
 				rcu_read_unlock();
 				put_files_struct(files);
 				if (task_dumpable(task)) {
-					inode->i_uid = task->euid;
-					inode->i_gid = task->egid;
+					inode->i_uid = task->sec->euid;
+					inode->i_gid = task->sec->egid;
 				} else {
 					inode->i_uid = 0;
 					inode->i_gid = 0;
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 4e57fcf..e4ddfdb 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -393,9 +393,9 @@ static int proc_sys_permission(struct inode *inode, int mask, struct nameidata *
 	error = -EACCES;
 	mode = inode->i_mode;
 
-	if (current->euid == 0)
+	if (current->act_as->euid == 0)
 		mode >>= 6;
-	else if (in_group_p(0))
+	else if (in_egroup_p(0))
 		mode >>= 3;
 
 	if ((mode & mask & (MAY_READ|MAY_WRITE|MAY_EXEC)) == mask)
diff --git a/fs/quota.c b/fs/quota.c
index 99b24b5..ab4f1d9 100644
--- a/fs/quota.c
+++ b/fs/quota.c
@@ -80,7 +80,7 @@ static int generic_quotactl_valid(struct super_block *sb, int type, int cmd, qid
 
 	/* Check privileges */
 	if (cmd == Q_GETQUOTA) {
-		if (((type == USRQUOTA && current->euid != id) ||
+		if (((type == USRQUOTA && current->act_as->euid != id) ||
 		     (type == GRPQUOTA && !in_egroup_p(id))) &&
 		    !capable(CAP_SYS_ADMIN))
 			return -EPERM;
@@ -131,7 +131,7 @@ static int xqm_quotactl_valid(struct super_block *sb, int type, int cmd, qid_t i
 
 	/* Check privileges */
 	if (cmd == Q_XGETQUOTA) {
-		if (((type == XQM_USRQUOTA && current->euid != id) ||
+		if (((type == XQM_USRQUOTA && current->act_as->euid != id) ||
 		     (type == XQM_GRPQUOTA && !in_egroup_p(id))) &&
 		     !capable(CAP_SYS_ADMIN))
 			return -EPERM;
diff --git a/fs/smbfs/dir.c b/fs/smbfs/dir.c
index 48da4fa..53e03a3 100644
--- a/fs/smbfs/dir.c
+++ b/fs/smbfs/dir.c
@@ -667,8 +667,8 @@ smb_make_node(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
 
 	attr.ia_valid = ATTR_MODE | ATTR_UID | ATTR_GID;
 	attr.ia_mode = mode;
-	attr.ia_uid = current->euid;
-	attr.ia_gid = current->egid;
+	attr.ia_uid = current->act_as->euid;
+	attr.ia_gid = current->act_as->egid;
 
 	if (!new_valid_dev(dev))
 		return -EINVAL;
diff --git a/fs/smbfs/inode.c b/fs/smbfs/inode.c
index 9416ead..95a2455 100644
--- a/fs/smbfs/inode.c
+++ b/fs/smbfs/inode.c
@@ -579,7 +579,7 @@ static int smb_fill_super(struct super_block *sb, void *raw_data, int silent)
 		if (parse_options(mnt, raw_data))
 			goto out_bad_option;
 	}
-	mnt->mounted_uid = current->uid;
+	mnt->mounted_uid = current->act_as->uid;
 	smb_setcodepage(server, &mnt->codepage);
 
 	/*
diff --git a/fs/smbfs/proc.c b/fs/smbfs/proc.c
index d517a27..a55d9cd 100644
--- a/fs/smbfs/proc.c
+++ b/fs/smbfs/proc.c
@@ -865,7 +865,7 @@ smb_newconn(struct smb_sb_info *server, struct smb_conn_opt *opt)
 		goto out;
 
 	error = -EACCES;
-	if (current->uid != server->mnt->mounted_uid && 
+	if (current->act_as->uid != server->mnt->mounted_uid &&
 	    !capable(CAP_SYS_ADMIN))
 		goto out;
 
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index cae35b6..6fa8413 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -114,6 +114,20 @@ extern struct group_info init_groups;
 	.pid = &init_struct_pid,				\
 }
 
+extern struct task_security init_task_security;
+
+#define INIT_TASK_SECURITY(p)					\
+{								\
+	.usage			= ATOMIC_INIT(3),		\
+	.keep_capabilities	= 0,				\
+	.cap_inheritable	= CAP_INIT_INH_SET,		\
+	.cap_permitted		= CAP_FULL_SET,			\
+	.cap_effective		= CAP_INIT_EFF_SET,		\
+	.user			= INIT_USER,			\
+	.group_info		= &init_groups,			\
+	.lock			= __SPIN_LOCK_UNLOCKED(p.lock),	\
+}
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -143,12 +157,9 @@ extern struct group_info init_groups;
 	.children	= LIST_HEAD_INIT(tsk.children),			\
 	.sibling	= LIST_HEAD_INIT(tsk.sibling),			\
 	.group_leader	= &tsk,						\
-	.group_info	= &init_groups,					\
-	.cap_effective	= CAP_INIT_EFF_SET,				\
-	.cap_inheritable = CAP_INIT_INH_SET,				\
-	.cap_permitted	= CAP_FULL_SET,					\
-	.keep_capabilities = 0,						\
-	.user		= INIT_USER,					\
+	.__temp_sec	= INIT_TASK_SECURITY(tsk.__temp_sec),		\
+	.sec		= &tsk.__temp_sec,				\
+	.act_as		= &tsk.__temp_sec,				\
 	.comm		= "swapper",					\
 	.thread		= INIT_THREAD,					\
 	.fs		= &init_fs,					\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 88a5626..bcd785a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -570,6 +570,63 @@ extern struct user_struct *find_user(uid_t);
 extern struct user_struct root_user;
 #define INIT_USER (&root_user)
 
+
+/*
+ * The security context of a task
+ *
+ * The parts of the context break down into two categories:
+ *
+ *  (1) The objective context of a task.  These parts are used when some other
+ *	task is attempting to affect this one.
+ *
+ *  (2) The subjective context.  These details are used when the task is acting
+ *	upon another object, be that a file, a task, a key or whatever.
+ *
+ * Note that some members of this structure belong to both categories - the
+ * LSM security pointer for instance.
+ *
+ * A task has two security pointers.  task->sec points to the objective context
+ * that defines that task's actual details.  The objective part of this context
+ * is used whenever that task is acted upon.
+ *
+ * task->act_as points to the subjective context that defines the details of
+ * how that task is going to act upon another object.  This may be overridden
+ * temporarily to point to another security context, but normally points to the
+ * same context as task->sec.
+ */
+struct task_security {
+	atomic_t	usage;
+	uid_t		uid;		/* real UID of the task */
+	gid_t		gid;		/* real GID of the task */
+	uid_t		suid;		/* saved UID of the task */
+	gid_t		sgid;		/* saved GID of the task */
+	uid_t		euid;		/* effective UID of the task */
+	gid_t		egid;		/* effective GID of the task */
+	uid_t		fsuid;		/* UID for VFS ops */
+	gid_t		fsgid;		/* GID for VFS ops */
+	unsigned	keep_capabilities:1;
+	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
+	kernel_cap_t	cap_permitted;	/* caps we're permitted */
+	kernel_cap_t	cap_effective;	/* caps we can actually use */
+#ifdef CONFIG_KEYS
+	unsigned char	jit_keyring;	/* default keyring to attach requested
+					 * keys to */
+	struct key	*thread_keyring; /* keyring private to this thread */
+	struct key	*request_key_auth; /* assumed request_key authority */
+#endif
+#ifdef CONFIG_SECURITY
+	void		*security;	/* subjective LSM security */
+#endif
+	struct user_struct *user;	/* real user ID subscription */
+	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
+	spinlock_t	lock;		/* lock for pointer changes */
+};
+
+#define current_fsuid() (current->act_as->fsuid)
+#define current_fsgid() (current->act_as->fsgid)
+#define current_cap()	(current->act_as->cap_effective)
+
+
 struct backing_dev_info;
 struct reclaim_state;
 
@@ -1025,17 +1082,10 @@ struct task_struct {
 	struct list_head cpu_timers[3];
 
 /* process credentials */
-	uid_t uid,euid,suid,fsuid;
-	gid_t gid,egid,sgid,fsgid;
-	struct group_info *group_info;
-	kernel_cap_t   cap_effective, cap_inheritable, cap_permitted;
-	unsigned keep_capabilities:1;
-	struct user_struct *user;
-#ifdef CONFIG_KEYS
-	struct key *request_key_auth;	/* assumed request_key authority */
-	struct key *thread_keyring;	/* keyring private to this thread */
-	unsigned char jit_keyring;	/* default keyring to attach requested keys to */
-#endif
+	struct task_security __temp_sec __deprecated; /* temporary security to be removed */
+	struct task_security *sec;	/* actual/objective task security */
+	struct task_security *act_as;	/* effective/subjective task security */
+
 	char comm[TASK_COMM_LEN]; /* executable name excluding path
 				     - access with [gs]et_task_comm (which lock
 				       it with task_lock())
@@ -1067,9 +1117,6 @@ struct task_struct {
 	int (*notifier)(void *priv);
 	void *notifier_data;
 	sigset_t *notifier_mask;
-#ifdef CONFIG_SECURITY
-	void *security;
-#endif
 	struct audit_context *audit_context;
 	seccomp_t seccomp;
 
@@ -1180,9 +1227,6 @@ struct task_struct {
 	struct prop_local_single dirties;
 };
 
-#define current_fsuid() (current->fsuid)
-#define current_fsgid() (current->fsgid)
-
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
diff --git a/include/net/scm.h b/include/net/scm.h
index 06df126..b133114 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -54,8 +54,8 @@ static __inline__ int scm_send(struct socket *sock, struct msghdr *msg,
 			       struct scm_cookie *scm)
 {
 	struct task_struct *p = current;
-	scm->creds.uid = p->uid;
-	scm->creds.gid = p->gid;
+	scm->creds.uid = p->sec->uid;
+	scm->creds.gid = p->sec->gid;
 	scm->creds.pid = task_tgid_vnr(p);
 	scm->fp = NULL;
 	scm->seq = 0;
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 590045a..43a3228 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -118,7 +118,7 @@ static struct inode *mqueue_get_inode(struct super_block *sb, int mode,
 		if (S_ISREG(mode)) {
 			struct mqueue_inode_info *info;
 			struct task_struct *p = current;
-			struct user_struct *u = p->user;
+			struct user_struct *u = p->sec->user;
 			unsigned long mq_bytes, mq_msg_tblsz;
 
 			inode->i_fop = &mqueue_file_operations;
@@ -511,7 +511,7 @@ static void __do_notify(struct mqueue_inode_info *info)
 			sig_i.si_code = SI_MESGQ;
 			sig_i.si_value = info->notify.sigev_value;
 			sig_i.si_pid = task_pid_vnr(current);
-			sig_i.si_uid = current->uid;
+			sig_i.si_uid = current->act_as->uid;
 
 			kill_pid_info(info->notify.sigev_signo,
 				      &sig_i, info->notify_owner);
diff --git a/ipc/msg.c b/ipc/msg.c
index fdf3db5..83c7a29 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -551,8 +551,8 @@ asmlinkage long sys_msgctl(int msqid, int cmd, struct msqid_ds __user *buf)
 	}
 
 	err = -EPERM;
-	if (current->euid != ipcp->cuid &&
-	    current->euid != ipcp->uid && !capable(CAP_SYS_ADMIN))
+	if (current->act_as->euid != ipcp->cuid &&
+	    current->act_as->euid != ipcp->uid && !capable(CAP_SYS_ADMIN))
 		/* We _could_ check for CAP_CHOWN above, but we don't */
 		goto out_unlock_up;
 
diff --git a/ipc/sem.c b/ipc/sem.c
index 35952c0..35736f5 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -924,8 +924,8 @@ static int semctl_down(struct ipc_namespace *ns, int semid, int semnum,
 		if (err)
 			goto out_unlock;
 	}
-	if (current->euid != ipcp->cuid && 
-	    current->euid != ipcp->uid && !capable(CAP_SYS_ADMIN)) {
+	if (current->act_as->euid != ipcp->cuid &&
+	    current->act_as->euid != ipcp->uid && !capable(CAP_SYS_ADMIN)) {
 	    	err=-EPERM;
 		goto out_unlock;
 	}
diff --git a/ipc/shm.c b/ipc/shm.c
index 3818fae..89e232e 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -417,7 +417,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	if (shmflg & SHM_HUGETLB) {
 		/* hugetlb_file_setup takes care of mlock user accounting */
 		file = hugetlb_file_setup(name, size);
-		shp->mlock_user = current->user;
+		shp->mlock_user = current->sec->user;
 	} else {
 		int acctflag = VM_ACCOUNT;
 		/*
@@ -770,8 +770,8 @@ asmlinkage long sys_shmctl (int shmid, int cmd, struct shmid_ds __user *buf)
 
 		if (!capable(CAP_IPC_LOCK)) {
 			err = -EPERM;
-			if (current->euid != shp->shm_perm.uid &&
-			    current->euid != shp->shm_perm.cuid)
+			if (current->act_as->euid != shp->shm_perm.uid &&
+			    current->act_as->euid != shp->shm_perm.cuid)
 				goto out_unlock;
 			if (cmd == SHM_LOCK &&
 			    !current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur)
@@ -783,7 +783,7 @@ asmlinkage long sys_shmctl (int shmid, int cmd, struct shmid_ds __user *buf)
 			goto out_unlock;
 		
 		if(cmd==SHM_LOCK) {
-			struct user_struct * user = current->user;
+			struct user_struct *user = current->act_as->user;
 			if (!is_file_hugepages(shp->shm_file)) {
 				err = shmem_lock(shp->shm_file, 1, user);
 				if (!err && !(shp->shm_perm.mode & SHM_LOCKED)){
@@ -822,8 +822,8 @@ asmlinkage long sys_shmctl (int shmid, int cmd, struct shmid_ds __user *buf)
 		if (err)
 			goto out_unlock_up;
 
-		if (current->euid != shp->shm_perm.uid &&
-		    current->euid != shp->shm_perm.cuid && 
+		if (current->act_as->euid != shp->shm_perm.uid &&
+		    current->act_as->euid != shp->shm_perm.cuid &&
 		    !capable(CAP_SYS_ADMIN)) {
 			err=-EPERM;
 			goto out_unlock_up;
@@ -862,8 +862,8 @@ asmlinkage long sys_shmctl (int shmid, int cmd, struct shmid_ds __user *buf)
 		if (err)
 			goto out_unlock_up;
 		err=-EPERM;
-		if (current->euid != shp->shm_perm.uid &&
-		    current->euid != shp->shm_perm.cuid && 
+		if (current->act_as->euid != shp->shm_perm.uid &&
+		    current->act_as->euid != shp->shm_perm.cuid &&
 		    !capable(CAP_SYS_ADMIN)) {
 			goto out_unlock_up;
 		}
diff --git a/ipc/util.c b/ipc/util.c
index 1aa0ebf..fc22a9c 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -283,8 +283,8 @@ int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
 
 	ids->in_use++;
 
-	new->cuid = new->uid = current->euid;
-	new->gid = new->cgid = current->egid;
+	new->cuid = new->uid = current->act_as->euid;
+	new->gid = new->cgid = current->act_as->egid;
 
 	new->seq = ids->seq++;
 	if(ids->seq > ids->seq_max)
@@ -632,7 +632,8 @@ int ipcperms (struct kern_ipc_perm *ipcp, short flag)
 		return err;
 	requested_mode = (flag >> 6) | (flag >> 3) | flag;
 	granted_mode = ipcp->mode;
-	if (current->euid == ipcp->cuid || current->euid == ipcp->uid)
+	if (current->act_as->euid == ipcp->cuid ||
+	    current->act_as->euid == ipcp->uid)
 		granted_mode >>= 6;
 	else if (in_group_p(ipcp->cgid) || in_group_p(ipcp->gid))
 		granted_mode >>= 3;
diff --git a/kernel/acct.c b/kernel/acct.c
index cf19547..d66a849 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -470,15 +470,15 @@ static void do_acct_process(struct file *file)
 	do_div(elapsed, AHZ);
 	ac.ac_btime = get_seconds() - elapsed;
 	/* we really need to bite the bullet and change layout */
-	ac.ac_uid = current->uid;
-	ac.ac_gid = current->gid;
+	ac.ac_uid = current->sec->uid;
+	ac.ac_gid = current->sec->gid;
 #if ACCT_VERSION==2
 	ac.ac_ahz = AHZ;
 #endif
 #if ACCT_VERSION==1 || ACCT_VERSION==2
 	/* backward-compatible 16 bit fields */
-	ac.ac_uid16 = current->uid;
-	ac.ac_gid16 = current->gid;
+	ac.ac_uid16 = current->sec->uid;
+	ac.ac_gid16 = current->sec->gid;
 #endif
 #if ACCT_VERSION==3
 	ac.ac_pid = current->tgid;
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index bce9ecd..46fe72a 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -394,6 +394,7 @@ static int audit_filter_rules(struct task_struct *tsk,
 			      struct audit_names *name,
 			      enum audit_state *state)
 {
+	struct task_security *sec = tsk->sec;
 	int i, j, need_sid = 1;
 	u32 sid;
 
@@ -413,28 +414,28 @@ static int audit_filter_rules(struct task_struct *tsk,
 			}
 			break;
 		case AUDIT_UID:
-			result = audit_comparator(tsk->uid, f->op, f->val);
+			result = audit_comparator(sec->uid, f->op, f->val);
 			break;
 		case AUDIT_EUID:
-			result = audit_comparator(tsk->euid, f->op, f->val);
+			result = audit_comparator(sec->euid, f->op, f->val);
 			break;
 		case AUDIT_SUID:
-			result = audit_comparator(tsk->suid, f->op, f->val);
+			result = audit_comparator(sec->suid, f->op, f->val);
 			break;
 		case AUDIT_FSUID:
-			result = audit_comparator(tsk->fsuid, f->op, f->val);
+			result = audit_comparator(sec->fsuid, f->op, f->val);
 			break;
 		case AUDIT_GID:
-			result = audit_comparator(tsk->gid, f->op, f->val);
+			result = audit_comparator(sec->gid, f->op, f->val);
 			break;
 		case AUDIT_EGID:
-			result = audit_comparator(tsk->egid, f->op, f->val);
+			result = audit_comparator(sec->egid, f->op, f->val);
 			break;
 		case AUDIT_SGID:
-			result = audit_comparator(tsk->sgid, f->op, f->val);
+			result = audit_comparator(sec->sgid, f->op, f->val);
 			break;
 		case AUDIT_FSGID:
-			result = audit_comparator(tsk->fsgid, f->op, f->val);
+			result = audit_comparator(sec->fsgid, f->op, f->val);
 			break;
 		case AUDIT_PERS:
 			result = audit_comparator(tsk->personality, f->op, f->val);
@@ -997,6 +998,7 @@ static void audit_log_execve_info(struct audit_buffer *ab,
 
 static void audit_log_exit(struct audit_context *context, struct task_struct *tsk)
 {
+	struct task_security *sec = tsk->sec;
 	int i, call_panic = 0;
 	struct audit_buffer *ab;
 	struct audit_aux_data *aux;
@@ -1006,14 +1008,14 @@ static void audit_log_exit(struct audit_context *context, struct task_struct *ts
 	context->pid = tsk->pid;
 	if (!context->ppid)
 		context->ppid = sys_getppid();
-	context->uid = tsk->uid;
-	context->gid = tsk->gid;
-	context->euid = tsk->euid;
-	context->suid = tsk->suid;
-	context->fsuid = tsk->fsuid;
-	context->egid = tsk->egid;
-	context->sgid = tsk->sgid;
-	context->fsgid = tsk->fsgid;
+	context->uid = sec->uid;
+	context->gid = sec->gid;
+	context->euid = sec->euid;
+	context->suid = sec->suid;
+	context->fsuid = sec->fsuid;
+	context->egid = sec->egid;
+	context->sgid = sec->sgid;
+	context->fsgid = sec->fsgid;
 	context->personality = tsk->personality;
 
 	ab = audit_log_start(context, GFP_KERNEL, AUDIT_SYSCALL);
@@ -1788,7 +1790,7 @@ int audit_set_loginuid(struct task_struct *task, uid_t loginuid)
 			if (ab) {
 				audit_log_format(ab, "login pid=%d uid=%u "
 					"old auid=%u new auid=%u",
-					task->pid, task->uid,
+					task->pid, task->sec->uid,
 					context->loginuid, loginuid);
 				audit_log_end(ab);
 			}
@@ -2219,7 +2221,7 @@ int __audit_signal_info(int sig, struct task_struct *t)
 			if (ctx)
 				audit_sig_uid = ctx->loginuid;
 			else
-				audit_sig_uid = tsk->uid;
+				audit_sig_uid = tsk->sec->uid;
 			selinux_get_task_sid(tsk, &audit_sig_sid);
 		}
 		if (!audit_signals || audit_dummy_context())
@@ -2274,7 +2276,7 @@ void audit_core_dumps(long signr)
 	ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_ANOM_ABEND);
 	audit_log_format(ab, "auid=%u uid=%u gid=%u",
 			audit_get_loginuid(current->audit_context),
-			current->uid, current->gid);
+			current->sec->uid, current->sec->gid);
 	selinux_get_task_sid(current, &sid);
 	if (sid) {
 		char *ctx = NULL;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1b85df5..0016012 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1247,8 +1247,9 @@ static int attach_task_by_pid(struct cgroup *cgrp, char *pidbuf)
 		get_task_struct(tsk);
 		rcu_read_unlock();
 
-		if ((current->euid) && (current->euid != tsk->uid)
-		    && (current->euid != tsk->suid)) {
+		if (current->act_as->euid &&
+		    current->act_as->euid != tsk->sec->uid &&
+		    current->act_as->euid != tsk->sec->suid) {
 			put_task_struct(tsk);
 			return -EACCES;
 		}
diff --git a/kernel/exit.c b/kernel/exit.c
index 549c055..d793e22 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -147,7 +147,7 @@ void release_task(struct task_struct * p)
 	struct task_struct *leader;
 	int zap_leader;
 repeat:
-	atomic_dec(&p->user->processes);
+	atomic_dec(&p->sec->user->processes);
 	proc_flush_task(p);
 	write_lock_irq(&tasklist_lock);
 	ptrace_unlink(p);
@@ -1198,7 +1198,7 @@ static int wait_task_zombie(struct task_struct *p, int noreap,
 
 	if (unlikely(noreap)) {
 		pid_t pid = task_pid_nr_ns(p, ns);
-		uid_t uid = p->uid;
+		uid_t uid = p->sec->uid;
 		int exit_code = p->exit_code;
 		int why, status;
 
@@ -1318,7 +1318,7 @@ static int wait_task_zombie(struct task_struct *p, int noreap,
 	if (!retval && infop)
 		retval = put_user(task_pid_nr_ns(p, ns), &infop->si_pid);
 	if (!retval && infop)
-		retval = put_user(p->uid, &infop->si_uid);
+		retval = put_user(p->sec->uid, &infop->si_uid);
 	if (!retval)
 		retval = task_pid_nr_ns(p, ns);
 
@@ -1381,7 +1381,7 @@ static int wait_task_stopped(struct task_struct *p, int delayed_group_leader,
 	read_unlock(&tasklist_lock);
 
 	if (unlikely(noreap)) {
-		uid_t uid = p->uid;
+		uid_t uid = p->sec->uid;
 		int why = (p->ptrace & PT_PTRACED) ? CLD_TRAPPED : CLD_STOPPED;
 
 		exit_code = p->exit_code;
@@ -1452,7 +1452,7 @@ bail_ref:
 	if (!retval && infop)
 		retval = put_user(pid, &infop->si_pid);
 	if (!retval && infop)
-		retval = put_user(p->uid, &infop->si_uid);
+		retval = put_user(p->sec->uid, &infop->si_uid);
 	if (!retval)
 		retval = pid;
 	put_task_struct(p);
@@ -1491,7 +1491,7 @@ static int wait_task_continued(struct task_struct *p, int noreap,
 
 	ns = current->nsproxy->pid_ns;
 	pid = task_pid_nr_ns(p, ns);
-	uid = p->uid;
+	uid = p->sec->uid;
 	get_task_struct(p);
 	read_unlock(&tasklist_lock);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 8ca1a14..2f267bd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -122,8 +122,8 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(tsk == current);
 
 	security_task_free(tsk);
-	free_uid(tsk->user);
-	put_group_info(tsk->group_info);
+	free_uid(tsk->__temp_sec.user);
+	put_group_info(tsk->__temp_sec.group_info);
 	delayacct_tsk_free(tsk);
 
 	if (!profile_handoff_task(tsk))
@@ -1014,17 +1014,18 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
 	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
 #endif
+	p->act_as = p->sec = &p->__temp_sec;
 	retval = -EAGAIN;
-	if (atomic_read(&p->user->processes) >=
+	if (atomic_read(&p->sec->user->processes) >=
 			p->signal->rlim[RLIMIT_NPROC].rlim_cur) {
 		if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_RESOURCE) &&
-		    p->user != current->nsproxy->user_ns->root_user)
+		    p->sec->user != current->nsproxy->user_ns->root_user)
 			goto bad_fork_free;
 	}
 
-	atomic_inc(&p->user->__count);
-	atomic_inc(&p->user->processes);
-	get_group_info(p->group_info);
+	atomic_inc(&p->sec->user->__count);
+	atomic_inc(&p->sec->user->processes);
+	get_group_info(p->sec->group_info);
 
 	/*
 	 * If multiple threads are within copy_process(), then this check
@@ -1080,7 +1081,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->real_start_time = p->start_time;
 	monotonic_to_bootbased(&p->real_start_time);
 #ifdef CONFIG_SECURITY
-	p->security = NULL;
+	p->sec->security = NULL;
 #endif
 	p->io_context = NULL;
 	p->audit_context = NULL;
@@ -1359,9 +1360,9 @@ bad_fork_cleanup_cgroup:
 bad_fork_cleanup_put_domain:
 	module_put(task_thread_info(p)->exec_domain->module);
 bad_fork_cleanup_count:
-	put_group_info(p->group_info);
-	atomic_dec(&p->user->processes);
-	free_uid(p->user);
+	put_group_info(p->sec->group_info);
+	atomic_dec(&p->sec->user->processes);
+	free_uid(p->sec->user);
 bad_fork_free:
 	free_task(p);
 fork_out:
diff --git a/kernel/futex.c b/kernel/futex.c
index 9dc591a..cdaa691 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -444,7 +444,8 @@ static struct task_struct * futex_find_get_task(pid_t pid)
 
 	rcu_read_lock();
 	p = find_task_by_vpid(pid);
-	if (!p || ((current->euid != p->euid) && (current->euid != p->uid)))
+	if (!p || (current->act_as->euid != p->sec->euid &&
+		   current->act_as->euid != p->sec->uid))
 		p = ERR_PTR(-ESRCH);
 	else
 		get_task_struct(p);
@@ -1857,8 +1858,9 @@ sys_get_robust_list(int pid, struct robust_list_head __user * __user *head_ptr,
 		if (!p)
 			goto err_unlock;
 		ret = -EPERM;
-		if ((current->euid != p->euid) && (current->euid != p->uid) &&
-				!capable(CAP_SYS_PTRACE))
+		if (current->act_as->euid != p->sec->euid &&
+		    current->act_as->euid != p->sec->uid &&
+		    !capable(CAP_SYS_PTRACE))
 			goto err_unlock;
 		head = p->robust_list;
 		rcu_read_unlock();
diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c
index 0a43def..ae41737 100644
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -141,8 +141,9 @@ compat_sys_get_robust_list(int pid, compat_uptr_t __user *head_ptr,
 		if (!p)
 			goto err_unlock;
 		ret = -EPERM;
-		if ((current->euid != p->euid) && (current->euid != p->uid) &&
-				!capable(CAP_SYS_PTRACE))
+		if (current->act_as->euid != p->sec->euid &&
+		    current->act_as->euid != p->sec->uid &&
+		    !capable(CAP_SYS_PTRACE))
 			goto err_unlock;
 		head = p->compat_robust_list;
 		read_unlock(&tasklist_lock);
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 7c76f2f..039b9d2 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -122,6 +122,8 @@ int ptrace_check_attach(struct task_struct *child, int kill)
 
 static int may_attach(struct task_struct *task)
 {
+	struct task_security *sec = current->act_as, *tsec = task->sec;
+
 	/* May we inspect the given task?
 	 * This check is used both for attaching with ptrace
 	 * and for allowing access to sensitive information in /proc.
@@ -134,12 +136,12 @@ static int may_attach(struct task_struct *task)
 	/* Don't let security modules deny introspection */
 	if (task == current)
 		return 0;
-	if (((current->uid != task->euid) ||
-	     (current->uid != task->suid) ||
-	     (current->uid != task->uid) ||
-	     (current->gid != task->egid) ||
-	     (current->gid != task->sgid) ||
-	     (current->gid != task->gid)) && !capable(CAP_SYS_PTRACE))
+	if (((sec->uid != tsec->euid) ||
+	     (sec->uid != tsec->suid) ||
+	     (sec->uid != tsec->uid) ||
+	     (sec->gid != tsec->egid) ||
+	     (sec->gid != tsec->sgid) ||
+	     (sec->gid != tsec->gid)) && !capable(CAP_SYS_PTRACE))
 		return -EPERM;
 	smp_rmb();
 	if (task->mm)
diff --git a/kernel/sched.c b/kernel/sched.c
index b062856..ff69615 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4300,8 +4300,8 @@ recheck:
 			return -EPERM;
 
 		/* can't change other user's priorities */
-		if ((current->euid != p->euid) &&
-		    (current->euid != p->uid))
+		if ((current->act_as->euid != p->sec->euid) &&
+		    (current->act_as->euid != p->sec->uid))
 			return -EPERM;
 	}
 
@@ -4498,8 +4498,9 @@ long sched_setaffinity(pid_t pid, cpumask_t new_mask)
 	read_unlock(&tasklist_lock);
 
 	retval = -EPERM;
-	if ((current->euid != p->euid) && (current->euid != p->uid) &&
-			!capable(CAP_SYS_NICE))
+	if ((current->act_as->euid != p->sec->euid) &&
+	    (current->act_as->euid != p->sec->uid) &&
+	    !capable(CAP_SYS_NICE))
 		goto out_unlock;
 
 	retval = security_task_setscheduler(p, 0, NULL);
diff --git a/kernel/signal.c b/kernel/signal.c
index afa4f78..5c093a0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -174,7 +174,7 @@ static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags,
 	 * In order to avoid problems with "switch_user()", we want to make
 	 * sure that the compiler doesn't re-load "t->user"
 	 */
-	user = t->user;
+	user = t->sec->user;
 	barrier();
 	atomic_inc(&user->sigpending);
 	if (override_rlimit ||
@@ -537,8 +537,10 @@ static int check_kill_permission(int sig, struct siginfo *info,
 		error = -EPERM;
 		if (((sig != SIGCONT) ||
 			(task_session_nr(current) != task_session_nr(t)))
-		    && (current->euid ^ t->suid) && (current->euid ^ t->uid)
-		    && (current->uid ^ t->suid) && (current->uid ^ t->uid)
+		    && (current->act_as->euid ^ t->sec->suid)
+		    && (current->act_as->euid ^ t->sec->uid)
+		    && (current->act_as->uid ^ t->sec->suid)
+		    && (current->act_as->uid ^ t->sec->uid)
 		    && !capable(CAP_KILL))
 		return error;
 	}
@@ -695,7 +697,7 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
 			q->info.si_errno = 0;
 			q->info.si_code = SI_USER;
 			q->info.si_pid = task_pid_vnr(current);
-			q->info.si_uid = current->uid;
+			q->info.si_uid = current->act_as->uid;
 			break;
 		case (unsigned long) SEND_SIG_PRIV:
 			q->info.si_signo = sig;
@@ -1111,8 +1113,8 @@ int kill_pid_info_as_uid(int sig, struct siginfo *info, struct pid *pid,
 		goto out_unlock;
 	}
 	if ((info == SEND_SIG_NOINFO || (!is_si_special(info) && SI_FROMUSER(info)))
-	    && (euid != p->suid) && (euid != p->uid)
-	    && (uid != p->suid) && (uid != p->uid)) {
+	    && (euid != p->sec->suid) && (euid != p->sec->uid)
+	    && (uid != p->sec->suid) && (uid != p->sec->uid)) {
 		ret = -EPERM;
 		goto out_unlock;
 	}
@@ -1464,7 +1466,7 @@ void do_notify_parent(struct task_struct *tsk, int sig)
 	info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns);
 	rcu_read_unlock();
 
-	info.si_uid = tsk->uid;
+	info.si_uid = tsk->sec->uid;
 
 	/* FIXME: find out whether or not this is supposed to be c*time. */
 	info.si_utime = cputime_to_jiffies(cputime_add(tsk->utime,
@@ -1535,7 +1537,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
 	info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns);
 	rcu_read_unlock();
 
-	info.si_uid = tsk->uid;
+	info.si_uid = tsk->sec->uid;
 
 	/* FIXME: find out whether or not this is supposed to be c*time. */
 	info.si_utime = cputime_to_jiffies(tsk->utime);
@@ -1661,7 +1663,7 @@ void ptrace_notify(int exit_code)
 	info.si_signo = SIGTRAP;
 	info.si_code = exit_code;
 	info.si_pid = task_pid_vnr(current);
-	info.si_uid = current->uid;
+	info.si_uid = current->sec->uid;
 
 	/* Let the debugger run.  */
 	spin_lock_irq(&current->sighand->siglock);
@@ -1831,7 +1833,7 @@ relock:
 				info->si_errno = 0;
 				info->si_code = SI_USER;
 				info->si_pid = task_pid_vnr(current->parent);
-				info->si_uid = current->parent->uid;
+				info->si_uid = current->parent->sec->uid;
 			}
 
 			/* If the (new) signal is now blocked, requeue it.  */
@@ -2218,7 +2220,7 @@ sys_kill(int pid, int sig)
 	info.si_errno = 0;
 	info.si_code = SI_USER;
 	info.si_pid = task_tgid_vnr(current);
-	info.si_uid = current->uid;
+	info.si_uid = current->act_as->uid;
 
 	return kill_something_info(sig, &info, pid);
 }
@@ -2234,7 +2236,7 @@ static int do_tkill(int tgid, int pid, int sig)
 	info.si_errno = 0;
 	info.si_code = SI_TKILL;
 	info.si_pid = task_tgid_vnr(current);
-	info.si_uid = current->uid;
+	info.si_uid = current->act_as->uid;
 
 	read_lock(&tasklist_lock);
 	p = find_task_by_vpid(pid);
diff --git a/kernel/sys.c b/kernel/sys.c
index d1fe71e..14acc1b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -110,8 +110,8 @@ static int set_one_prio(struct task_struct *p, int niceval, int error)
 {
 	int no_nice;
 
-	if (p->uid != current->euid &&
-		p->euid != current->euid && !capable(CAP_SYS_NICE)) {
+	if (p->sec->uid != current->act_as->euid &&
+	    p->sec->euid != current->act_as->euid && !capable(CAP_SYS_NICE)) {
 		error = -EPERM;
 		goto out;
 	}
@@ -168,18 +168,19 @@ asmlinkage long sys_setpriority(int which, int who, int niceval)
 			} while_each_pid_task(pgrp, PIDTYPE_PGID, p);
 			break;
 		case PRIO_USER:
-			user = current->user;
+			user = current->sec->user;
 			if (!who)
-				who = current->uid;
+				who = current->sec->uid;
 			else
-				if ((who != current->uid) && !(user = find_user(who)))
+				if ((who != current->sec->uid) &&
+				    !(user = find_user(who)))
 					goto out_unlock;	/* No processes for this user */
 
 			do_each_thread(g, p)
-				if (p->uid == who)
+				if (p->sec->uid == who)
 					error = set_one_prio(p, niceval, error);
 			while_each_thread(g, p);
-			if (who != current->uid)
+			if (who != current->sec->uid)
 				free_uid(user);		/* For find_user() */
 			break;
 	}
@@ -230,21 +231,22 @@ asmlinkage long sys_getpriority(int which, int who)
 			} while_each_pid_task(pgrp, PIDTYPE_PGID, p);
 			break;
 		case PRIO_USER:
-			user = current->user;
+			user = current->sec->user;
 			if (!who)
-				who = current->uid;
+				who = current->sec->uid;
 			else
-				if ((who != current->uid) && !(user = find_user(who)))
+				if ((who != current->sec->uid) &&
+				    !(user = find_user(who)))
 					goto out_unlock;	/* No processes for this user */
 
 			do_each_thread(g, p)
-				if (p->uid == who) {
+				if (p->sec->uid == who) {
 					niceval = 20 - task_nice(p);
 					if (niceval > retval)
 						retval = niceval;
 				}
 			while_each_thread(g, p);
-			if (who != current->uid)
+			if (who != current->sec->uid)
 				free_uid(user);		/* for find_user() */
 			break;
 	}
@@ -481,8 +483,9 @@ void ctrl_alt_del(void)
  */
 asmlinkage long sys_setregid(gid_t rgid, gid_t egid)
 {
-	int old_rgid = current->gid;
-	int old_egid = current->egid;
+	struct task_security *sec = current->sec;
+	int old_rgid = sec->gid;
+	int old_egid = sec->egid;
 	int new_rgid = old_rgid;
 	int new_egid = old_egid;
 	int retval;
@@ -493,7 +496,7 @@ asmlinkage long sys_setregid(gid_t rgid, gid_t egid)
 
 	if (rgid != (gid_t) -1) {
 		if ((old_rgid == rgid) ||
-		    (current->egid==rgid) ||
+		    (sec->egid == rgid) ||
 		    capable(CAP_SETGID))
 			new_rgid = rgid;
 		else
@@ -501,8 +504,8 @@ asmlinkage long sys_setregid(gid_t rgid, gid_t egid)
 	}
 	if (egid != (gid_t) -1) {
 		if ((old_rgid == egid) ||
-		    (current->egid == egid) ||
-		    (current->sgid == egid) ||
+		    (sec->egid == egid) ||
+		    (sec->sgid == egid) ||
 		    capable(CAP_SETGID))
 			new_egid = egid;
 		else
@@ -514,10 +517,10 @@ asmlinkage long sys_setregid(gid_t rgid, gid_t egid)
 	}
 	if (rgid != (gid_t) -1 ||
 	    (egid != (gid_t) -1 && egid != old_rgid))
-		current->sgid = new_egid;
-	current->fsgid = new_egid;
-	current->egid = new_egid;
-	current->gid = new_rgid;
+		sec->sgid = new_egid;
+	sec->fsgid = new_egid;
+	sec->egid = new_egid;
+	sec->gid = new_rgid;
 	key_fsgid_changed(current);
 	proc_id_connector(current, PROC_EVENT_GID);
 	return 0;
@@ -530,7 +533,8 @@ asmlinkage long sys_setregid(gid_t rgid, gid_t egid)
  */
 asmlinkage long sys_setgid(gid_t gid)
 {
-	int old_egid = current->egid;
+	struct task_security *sec = current->sec;
+	int old_egid = sec->egid;
 	int retval;
 
 	retval = security_task_setgid(gid, (gid_t)-1, (gid_t)-1, LSM_SETID_ID);
@@ -542,13 +546,13 @@ asmlinkage long sys_setgid(gid_t gid)
 			set_dumpable(current->mm, suid_dumpable);
 			smp_wmb();
 		}
-		current->gid = current->egid = current->sgid = current->fsgid = gid;
-	} else if ((gid == current->gid) || (gid == current->sgid)) {
+		sec->gid = sec->egid = sec->sgid = sec->fsgid = gid;
+	} else if ((gid == sec->gid) || (gid == sec->sgid)) {
 		if (old_egid != gid) {
 			set_dumpable(current->mm, suid_dumpable);
 			smp_wmb();
 		}
-		current->egid = current->fsgid = gid;
+		sec->egid = sec->fsgid = gid;
 	}
 	else
 		return -EPERM;
@@ -579,7 +583,7 @@ static int set_user(uid_t new_ruid, int dumpclear)
 		set_dumpable(current->mm, suid_dumpable);
 		smp_wmb();
 	}
-	current->uid = new_ruid;
+	current->sec->uid = new_ruid;
 	return 0;
 }
 
@@ -600,6 +604,7 @@ static int set_user(uid_t new_ruid, int dumpclear)
  */
 asmlinkage long sys_setreuid(uid_t ruid, uid_t euid)
 {
+	struct task_security *sec = current->sec;
 	int old_ruid, old_euid, old_suid, new_ruid, new_euid;
 	int retval;
 
@@ -607,14 +612,14 @@ asmlinkage long sys_setreuid(uid_t ruid, uid_t euid)
 	if (retval)
 		return retval;
 
-	new_ruid = old_ruid = current->uid;
-	new_euid = old_euid = current->euid;
-	old_suid = current->suid;
+	new_ruid = old_ruid = sec->uid;
+	new_euid = old_euid = sec->euid;
+	old_suid = sec->suid;
 
 	if (ruid != (uid_t) -1) {
 		new_ruid = ruid;
 		if ((old_ruid != ruid) &&
-		    (current->euid != ruid) &&
+		    (sec->euid != ruid) &&
 		    !capable(CAP_SETUID))
 			return -EPERM;
 	}
@@ -622,8 +627,8 @@ asmlinkage long sys_setreuid(uid_t ruid, uid_t euid)
 	if (euid != (uid_t) -1) {
 		new_euid = euid;
 		if ((old_ruid != euid) &&
-		    (current->euid != euid) &&
-		    (current->suid != euid) &&
+		    (sec->euid != euid) &&
+		    (sec->suid != euid) &&
 		    !capable(CAP_SETUID))
 			return -EPERM;
 	}
@@ -635,11 +640,11 @@ asmlinkage long sys_setreuid(uid_t ruid, uid_t euid)
 		set_dumpable(current->mm, suid_dumpable);
 		smp_wmb();
 	}
-	current->fsuid = current->euid = new_euid;
+	sec->fsuid = sec->euid = new_euid;
 	if (ruid != (uid_t) -1 ||
 	    (euid != (uid_t) -1 && euid != old_ruid))
-		current->suid = current->euid;
-	current->fsuid = current->euid;
+		sec->suid = sec->euid;
+	sec->fsuid = sec->euid;
 
 	key_fsuid_changed(current);
 	proc_id_connector(current, PROC_EVENT_UID);
@@ -662,7 +667,8 @@ asmlinkage long sys_setreuid(uid_t ruid, uid_t euid)
  */
 asmlinkage long sys_setuid(uid_t uid)
 {
-	int old_euid = current->euid;
+	struct task_security *sec = current->sec;
+	int old_euid = sec->euid;
 	int old_ruid, old_suid, new_suid;
 	int retval;
 
@@ -670,23 +676,23 @@ asmlinkage long sys_setuid(uid_t uid)
 	if (retval)
 		return retval;
 
-	old_ruid = current->uid;
-	old_suid = current->suid;
+	old_ruid = sec->uid;
+	old_suid = sec->suid;
 	new_suid = old_suid;
 	
 	if (capable(CAP_SETUID)) {
 		if (uid != old_ruid && set_user(uid, old_euid != uid) < 0)
 			return -EAGAIN;
 		new_suid = uid;
-	} else if ((uid != current->uid) && (uid != new_suid))
+	} else if ((uid != sec->uid) && (uid != new_suid))
 		return -EPERM;
 
 	if (old_euid != uid) {
 		set_dumpable(current->mm, suid_dumpable);
 		smp_wmb();
 	}
-	current->fsuid = current->euid = uid;
-	current->suid = new_suid;
+	sec->fsuid = sec->euid = uid;
+	sec->suid = new_suid;
 
 	key_fsuid_changed(current);
 	proc_id_connector(current, PROC_EVENT_UID);
@@ -701,9 +707,10 @@ asmlinkage long sys_setuid(uid_t uid)
  */
 asmlinkage long sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
 {
-	int old_ruid = current->uid;
-	int old_euid = current->euid;
-	int old_suid = current->suid;
+	struct task_security *sec = current->sec;
+	int old_ruid = sec->uid;
+	int old_euid = sec->euid;
+	int old_suid = sec->suid;
 	int retval;
 
 	retval = security_task_setuid(ruid, euid, suid, LSM_SETID_RES);
@@ -711,30 +718,31 @@ asmlinkage long sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
 		return retval;
 
 	if (!capable(CAP_SETUID)) {
-		if ((ruid != (uid_t) -1) && (ruid != current->uid) &&
-		    (ruid != current->euid) && (ruid != current->suid))
+		if ((ruid != (uid_t) -1) && (ruid != sec->uid) &&
+		    (ruid != sec->euid) && (ruid != sec->suid))
 			return -EPERM;
-		if ((euid != (uid_t) -1) && (euid != current->uid) &&
-		    (euid != current->euid) && (euid != current->suid))
+		if ((euid != (uid_t) -1) && (euid != sec->uid) &&
+		    (euid != sec->euid) && (euid != sec->suid))
 			return -EPERM;
-		if ((suid != (uid_t) -1) && (suid != current->uid) &&
-		    (suid != current->euid) && (suid != current->suid))
+		if ((suid != (uid_t) -1) && (suid != sec->uid) &&
+		    (suid != sec->euid) && (suid != sec->suid))
 			return -EPERM;
 	}
 	if (ruid != (uid_t) -1) {
-		if (ruid != current->uid && set_user(ruid, euid != current->euid) < 0)
+		if (ruid != sec->uid &&
+		    set_user(ruid, euid != sec->euid) < 0)
 			return -EAGAIN;
 	}
 	if (euid != (uid_t) -1) {
-		if (euid != current->euid) {
+		if (euid != sec->euid) {
 			set_dumpable(current->mm, suid_dumpable);
 			smp_wmb();
 		}
-		current->euid = euid;
+		sec->euid = euid;
 	}
-	current->fsuid = current->euid;
+	sec->fsuid = sec->euid;
 	if (suid != (uid_t) -1)
-		current->suid = suid;
+		sec->suid = suid;
 
 	key_fsuid_changed(current);
 	proc_id_connector(current, PROC_EVENT_UID);
@@ -744,11 +752,12 @@ asmlinkage long sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
 
 asmlinkage long sys_getresuid(uid_t __user *ruid, uid_t __user *euid, uid_t __user *suid)
 {
+	struct task_security *sec = current->sec;
 	int retval;
 
-	if (!(retval = put_user(current->uid, ruid)) &&
-	    !(retval = put_user(current->euid, euid)))
-		retval = put_user(current->suid, suid);
+	if (!(retval = put_user(sec->uid, ruid)) &&
+	    !(retval = put_user(sec->euid, euid)))
+		retval = put_user(sec->suid, suid);
 
 	return retval;
 }
@@ -758,6 +767,7 @@ asmlinkage long sys_getresuid(uid_t __user *ruid, uid_t __user *euid, uid_t __us
  */
 asmlinkage long sys_setresgid(gid_t rgid, gid_t egid, gid_t sgid)
 {
+	struct task_security *sec = current->sec;
 	int retval;
 
 	retval = security_task_setgid(rgid, egid, sgid, LSM_SETID_RES);
@@ -765,28 +775,28 @@ asmlinkage long sys_setresgid(gid_t rgid, gid_t egid, gid_t sgid)
 		return retval;
 
 	if (!capable(CAP_SETGID)) {
-		if ((rgid != (gid_t) -1) && (rgid != current->gid) &&
-		    (rgid != current->egid) && (rgid != current->sgid))
+		if ((rgid != (gid_t) -1) && (rgid != sec->gid) &&
+		    (rgid != sec->egid) && (rgid != sec->sgid))
 			return -EPERM;
-		if ((egid != (gid_t) -1) && (egid != current->gid) &&
-		    (egid != current->egid) && (egid != current->sgid))
+		if ((egid != (gid_t) -1) && (egid != sec->gid) &&
+		    (egid != sec->egid) && (egid != sec->sgid))
 			return -EPERM;
-		if ((sgid != (gid_t) -1) && (sgid != current->gid) &&
-		    (sgid != current->egid) && (sgid != current->sgid))
+		if ((sgid != (gid_t) -1) && (sgid != sec->gid) &&
+		    (sgid != sec->egid) && (sgid != sec->sgid))
 			return -EPERM;
 	}
 	if (egid != (gid_t) -1) {
-		if (egid != current->egid) {
+		if (egid != sec->egid) {
 			set_dumpable(current->mm, suid_dumpable);
 			smp_wmb();
 		}
-		current->egid = egid;
+		sec->egid = egid;
 	}
-	current->fsgid = current->egid;
+	sec->fsgid = sec->egid;
 	if (rgid != (gid_t) -1)
-		current->gid = rgid;
+		sec->gid = rgid;
 	if (sgid != (gid_t) -1)
-		current->sgid = sgid;
+		sec->sgid = sgid;
 
 	key_fsgid_changed(current);
 	proc_id_connector(current, PROC_EVENT_GID);
@@ -795,11 +805,12 @@ asmlinkage long sys_setresgid(gid_t rgid, gid_t egid, gid_t sgid)
 
 asmlinkage long sys_getresgid(gid_t __user *rgid, gid_t __user *egid, gid_t __user *sgid)
 {
+	struct task_security *sec = current->sec;
 	int retval;
 
-	if (!(retval = put_user(current->gid, rgid)) &&
-	    !(retval = put_user(current->egid, egid)))
-		retval = put_user(current->sgid, sgid);
+	if (!(retval = put_user(sec->gid, rgid)) &&
+	    !(retval = put_user(sec->egid, egid)))
+		retval = put_user(sec->sgid, sgid);
 
 	return retval;
 }
@@ -813,20 +824,21 @@ asmlinkage long sys_getresgid(gid_t __user *rgid, gid_t __user *egid, gid_t __us
  */
 asmlinkage long sys_setfsuid(uid_t uid)
 {
+	struct task_security *sec = current->sec;
 	int old_fsuid;
 
-	old_fsuid = current->fsuid;
+	old_fsuid = sec->fsuid;
 	if (security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_FS))
 		return old_fsuid;
 
-	if (uid == current->uid || uid == current->euid ||
-	    uid == current->suid || uid == current->fsuid || 
+	if (uid == sec->uid || uid == sec->euid ||
+	    uid == sec->suid || uid == sec->fsuid ||
 	    capable(CAP_SETUID)) {
 		if (uid != old_fsuid) {
 			set_dumpable(current->mm, suid_dumpable);
 			smp_wmb();
 		}
-		current->fsuid = uid;
+		sec->fsuid = uid;
 	}
 
 	key_fsuid_changed(current);
@@ -842,20 +854,21 @@ asmlinkage long sys_setfsuid(uid_t uid)
  */
 asmlinkage long sys_setfsgid(gid_t gid)
 {
+	struct task_security *sec = current->sec;
 	int old_fsgid;
 
-	old_fsgid = current->fsgid;
+	old_fsgid = sec->fsgid;
 	if (security_task_setgid(gid, (gid_t)-1, (gid_t)-1, LSM_SETID_FS))
 		return old_fsgid;
 
-	if (gid == current->gid || gid == current->egid ||
-	    gid == current->sgid || gid == current->fsgid || 
+	if (gid == sec->gid || gid == sec->egid ||
+	    gid == sec->sgid || gid == sec->fsgid ||
 	    capable(CAP_SETGID)) {
 		if (gid != old_fsgid) {
 			set_dumpable(current->mm, suid_dumpable);
 			smp_wmb();
 		}
-		current->fsgid = gid;
+		sec->fsgid = gid;
 		key_fsgid_changed(current);
 		proc_id_connector(current, PROC_EVENT_GID);
 	}
@@ -1235,6 +1248,7 @@ int groups_search(struct group_info *group_info, gid_t grp)
 /* validate and set current->group_info */
 int set_current_groups(struct group_info *group_info)
 {
+	struct task_security *sec = current->sec;
 	int retval;
 	struct group_info *old_info;
 
@@ -1245,10 +1259,10 @@ int set_current_groups(struct group_info *group_info)
 	groups_sort(group_info);
 	get_group_info(group_info);
 
-	task_lock(current);
-	old_info = current->group_info;
-	current->group_info = group_info;
-	task_unlock(current);
+	spin_lock(&sec->lock);
+	old_info = sec->group_info;
+	sec->group_info = group_info;
+	spin_unlock(&sec->lock);
 
 	put_group_info(old_info);
 
@@ -1259,6 +1273,7 @@ EXPORT_SYMBOL(set_current_groups);
 
 asmlinkage long sys_getgroups(int gidsetsize, gid_t __user *grouplist)
 {
+	struct task_security *sec = current->sec;
 	int i = 0;
 
 	/*
@@ -1270,13 +1285,13 @@ asmlinkage long sys_getgroups(int gidsetsize, gid_t __user *grouplist)
 		return -EINVAL;
 
 	/* no need to grab task_lock here; it cannot change */
-	i = current->group_info->ngroups;
+	i = sec->group_info->ngroups;
 	if (gidsetsize) {
 		if (i > gidsetsize) {
 			i = -EINVAL;
 			goto out;
 		}
-		if (groups_to_user(grouplist, current->group_info)) {
+		if (groups_to_user(grouplist, sec->group_info)) {
 			i = -EFAULT;
 			goto out;
 		}
@@ -1320,9 +1335,10 @@ asmlinkage long sys_setgroups(int gidsetsize, gid_t __user *grouplist)
  */
 int in_group_p(gid_t grp)
 {
+	struct task_security *act_as = current->act_as;
 	int retval = 1;
-	if (grp != current->fsgid)
-		retval = groups_search(current->group_info, grp);
+	if (grp != act_as->fsgid)
+		retval = groups_search(act_as->group_info, grp);
 	return retval;
 }
 
@@ -1330,9 +1346,10 @@ EXPORT_SYMBOL(in_group_p);
 
 int in_egroup_p(gid_t grp)
 {
+	struct task_security *act_as = current->act_as;
 	int retval = 1;
-	if (grp != current->egid)
-		retval = groups_search(current->group_info, grp);
+	if (grp != act_as->egid)
+		retval = groups_search(act_as->group_info, grp);
 	return retval;
 }
 
@@ -1641,6 +1658,9 @@ asmlinkage long sys_umask(int mask)
 asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
 			  unsigned long arg4, unsigned long arg5)
 {
+	struct task_struct *me = current;
+	struct task_security *sec = me->sec;
+	unsigned char comm[sizeof(me->comm)];
 	long error;
 
 	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
@@ -1653,39 +1673,39 @@ asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
 				error = -EINVAL;
 				break;
 			}
-			current->pdeath_signal = arg2;
+			me->pdeath_signal = arg2;
 			break;
 		case PR_GET_PDEATHSIG:
-			error = put_user(current->pdeath_signal, (int __user *)arg2);
+			error = put_user(me->pdeath_signal, (int __user *)arg2);
 			break;
 		case PR_GET_DUMPABLE:
-			error = get_dumpable(current->mm);
+			error = get_dumpable(me->mm);
 			break;
 		case PR_SET_DUMPABLE:
 			if (arg2 < 0 || arg2 > 1) {
 				error = -EINVAL;
 				break;
 			}
-			set_dumpable(current->mm, arg2);
+			set_dumpable(me->mm, arg2);
 			break;
 
 		case PR_SET_UNALIGN:
-			error = SET_UNALIGN_CTL(current, arg2);
+			error = SET_UNALIGN_CTL(me, arg2);
 			break;
 		case PR_GET_UNALIGN:
-			error = GET_UNALIGN_CTL(current, arg2);
+			error = GET_UNALIGN_CTL(me, arg2);
 			break;
 		case PR_SET_FPEMU:
-			error = SET_FPEMU_CTL(current, arg2);
+			error = SET_FPEMU_CTL(me, arg2);
 			break;
 		case PR_GET_FPEMU:
-			error = GET_FPEMU_CTL(current, arg2);
+			error = GET_FPEMU_CTL(me, arg2);
 			break;
 		case PR_SET_FPEXC:
-			error = SET_FPEXC_CTL(current, arg2);
+			error = SET_FPEXC_CTL(me, arg2);
 			break;
 		case PR_GET_FPEXC:
-			error = GET_FPEXC_CTL(current, arg2);
+			error = GET_FPEXC_CTL(me, arg2);
 			break;
 		case PR_GET_TIMING:
 			error = PR_TIMING_STATISTICAL;
@@ -1698,7 +1718,7 @@ asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
 			break;
 
 		case PR_GET_KEEPCAPS:
-			if (current->keep_capabilities)
+			if (sec->keep_capabilities)
 				error = 1;
 			break;
 		case PR_SET_KEEPCAPS:
@@ -1706,33 +1726,26 @@ asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
 				error = -EINVAL;
 				break;
 			}
-			current->keep_capabilities = arg2;
+			sec->keep_capabilities = arg2;
 			break;
-		case PR_SET_NAME: {
-			struct task_struct *me = current;
-			unsigned char ncomm[sizeof(me->comm)];
-
-			ncomm[sizeof(me->comm)-1] = 0;
-			if (strncpy_from_user(ncomm, (char __user *)arg2,
+		case PR_SET_NAME:
+			comm[sizeof(me->comm)-1] = 0;
+			if (strncpy_from_user(comm, (char __user *)arg2,
 						sizeof(me->comm)-1) < 0)
 				return -EFAULT;
-			set_task_comm(me, ncomm);
+			set_task_comm(me, comm);
 			return 0;
-		}
-		case PR_GET_NAME: {
-			struct task_struct *me = current;
-			unsigned char tcomm[sizeof(me->comm)];
-
-			get_task_comm(tcomm, me);
-			if (copy_to_user((char __user *)arg2, tcomm, sizeof(tcomm)))
+		case PR_GET_NAME:
+			get_task_comm(comm, me);
+			if (copy_to_user((char __user *)arg2, comm,
+					 sizeof(comm)))
 				return -EFAULT;
 			return 0;
-		}
 		case PR_GET_ENDIAN:
-			error = GET_ENDIAN(current, arg2);
+			error = GET_ENDIAN(me, arg2);
 			break;
 		case PR_SET_ENDIAN:
-			error = SET_ENDIAN(current, arg2);
+			error = SET_ENDIAN(me, arg2);
 			break;
 
 		case PR_GET_SECCOMP:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 0deed82..d3863df 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1382,7 +1382,7 @@ out:
 
 static int test_perm(int mode, int op)
 {
-	if (!current->euid)
+	if (!current->act_as->euid)
 		mode >>= 6;
 	else if (in_egroup_p(0))
 		mode >>= 3;
diff --git a/kernel/timer.c b/kernel/timer.c
index a05817c..5710255 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -987,25 +987,25 @@ asmlinkage long sys_getppid(void)
 asmlinkage long sys_getuid(void)
 {
 	/* Only we change this so SMP safe */
-	return current->uid;
+	return current->sec->uid;
 }
 
 asmlinkage long sys_geteuid(void)
 {
 	/* Only we change this so SMP safe */
-	return current->euid;
+	return current->sec->euid;
 }
 
 asmlinkage long sys_getgid(void)
 {
 	/* Only we change this so SMP safe */
-	return current->gid;
+	return current->sec->gid;
 }
 
 asmlinkage long sys_getegid(void)
 {
 	/* Only we change this so SMP safe */
-	return  current->egid;
+	return  current->sec->egid;
 }
 
 #endif
diff --git a/kernel/uid16.c b/kernel/uid16.c
index dd308ba..c56f6fe 100644
--- a/kernel/uid16.c
+++ b/kernel/uid16.c
@@ -86,9 +86,9 @@ asmlinkage long sys_getresuid16(old_uid_t __user *ruid, old_uid_t __user *euid,
 {
 	int retval;
 
-	if (!(retval = put_user(high2lowuid(current->uid), ruid)) &&
-	    !(retval = put_user(high2lowuid(current->euid), euid)))
-		retval = put_user(high2lowuid(current->suid), suid);
+	if (!(retval = put_user(high2lowuid(current->sec->uid), ruid)) &&
+	    !(retval = put_user(high2lowuid(current->sec->euid), euid)))
+		retval = put_user(high2lowuid(current->sec->suid), suid);
 
 	return retval;
 }
@@ -106,9 +106,9 @@ asmlinkage long sys_getresgid16(old_gid_t __user *rgid, old_gid_t __user *egid,
 {
 	int retval;
 
-	if (!(retval = put_user(high2lowgid(current->gid), rgid)) &&
-	    !(retval = put_user(high2lowgid(current->egid), egid)))
-		retval = put_user(high2lowgid(current->sgid), sgid);
+	if (!(retval = put_user(high2lowgid(current->sec->gid), rgid)) &&
+	    !(retval = put_user(high2lowgid(current->sec->egid), egid)))
+		retval = put_user(high2lowgid(current->sec->sgid), sgid);
 
 	return retval;
 }
@@ -166,20 +166,20 @@ asmlinkage long sys_getgroups16(int gidsetsize, old_gid_t __user *grouplist)
 	if (gidsetsize < 0)
 		return -EINVAL;
 
-	get_group_info(current->group_info);
-	i = current->group_info->ngroups;
+	get_group_info(current->sec->group_info);
+	i = current->sec->group_info->ngroups;
 	if (gidsetsize) {
 		if (i > gidsetsize) {
 			i = -EINVAL;
 			goto out;
 		}
-		if (groups16_to_user(grouplist, current->group_info)) {
+		if (groups16_to_user(grouplist, current->sec->group_info)) {
 			i = -EFAULT;
 			goto out;
 		}
 	}
 out:
-	put_group_info(current->group_info);
+	put_group_info(current->sec->group_info);
 	return i;
 }
 
@@ -210,20 +210,20 @@ asmlinkage long sys_setgroups16(int gidsetsize, old_gid_t __user *grouplist)
 
 asmlinkage long sys_getuid16(void)
 {
-	return high2lowuid(current->uid);
+	return high2lowuid(current->sec->uid);
 }
 
 asmlinkage long sys_geteuid16(void)
 {
-	return high2lowuid(current->euid);
+	return high2lowuid(current->sec->euid);
 }
 
 asmlinkage long sys_getgid16(void)
 {
-	return high2lowgid(current->gid);
+	return high2lowgid(current->sec->gid);
 }
 
 asmlinkage long sys_getegid16(void)
 {
-	return high2lowgid(current->egid);
+	return high2lowgid(current->sec->egid);
 }
diff --git a/kernel/user.c b/kernel/user.c
index 8320a87..d9a84b1 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -416,11 +416,11 @@ void switch_uid(struct user_struct *new_user)
 	 * cheaply with the new uid cache, so if it matters
 	 * we should be checking for it.  -DaveM
 	 */
-	old_user = current->user;
+	old_user = current->sec->user;
 	atomic_inc(&new_user->processes);
 	atomic_dec(&old_user->processes);
 	switch_uid_keyring(new_user);
-	current->user = new_user;
+	current->sec->user = new_user;
 	sched_switch_user(current);
 
 	/*
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 7af90fc..fef8023 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -49,7 +49,7 @@ static struct user_namespace *clone_user_ns(struct user_namespace *old_ns)
 	}
 
 	/* Reset current->user with a new one */
-	new_user = alloc_uid(ns, current->uid);
+	new_user = alloc_uid(ns, current->sec->uid);
 	if (!new_user) {
 		free_uid(ns->root_user);
 		kfree(ns);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 91a081a..917a90e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -125,8 +125,8 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	 * Superuser processes are usually more important, so we make it
 	 * less likely that we kill those.
 	 */
-	if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_ADMIN) ||
-				p->uid == 0 || p->euid == 0)
+	if (cap_t(p->sec->cap_effective) & CAP_TO_MASK(CAP_SYS_ADMIN) ||
+				p->sec->uid == 0 || p->sec->euid == 0)
 		points /= 4;
 
 	/*
@@ -135,7 +135,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	 * tend to only have this flag set on applications they think
 	 * of as important.
 	 */
-	if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
+	if (cap_t(p->sec->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
 		points /= 4;
 
 	/*
diff --git a/net/core/scm.c b/net/core/scm.c
index 100ba6d..c5b076b 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -44,11 +44,13 @@
 
 static __inline__ int scm_check_creds(struct ucred *creds)
 {
+	struct task_security *sec = current->act_as;
+
 	if ((creds->pid == task_tgid_vnr(current) || capable(CAP_SYS_ADMIN)) &&
-	    ((creds->uid == current->uid || creds->uid == current->euid ||
-	      creds->uid == current->suid) || capable(CAP_SETUID)) &&
-	    ((creds->gid == current->gid || creds->gid == current->egid ||
-	      creds->gid == current->sgid) || capable(CAP_SETGID))) {
+	    ((creds->uid == sec->uid || creds->uid == sec->euid ||
+	      creds->uid == sec->suid) || capable(CAP_SETUID)) &&
+	    ((creds->gid == sec->gid || creds->gid == sec->egid ||
+	      creds->gid == sec->sgid) || capable(CAP_SETGID))) {
 	       return 0;
 	}
 	return -EPERM;
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index 390a1ec..e254e56 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -339,7 +339,7 @@ rpcauth_lookupcred(struct rpc_auth *auth, int flags)
 	struct auth_cred acred = {
 		.uid = current_fsuid(),
 		.gid = current_fsgid(),
-		.group_info = current->group_info,
+		.group_info = current->act_as->group_info,
 	};
 	struct rpc_cred *ret;
 
@@ -375,7 +375,7 @@ rpcauth_bindcred(struct rpc_task *task)
 	struct auth_cred acred = {
 		.uid = current_fsuid(),
 		.gid = current_fsgid(),
-		.group_info = current->group_info,
+		.group_info = current->act_as->group_info,
 	};
 	struct rpc_cred *ret;
 	int flags = 0;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 060bba4..974037d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -484,8 +484,8 @@ static int unix_listen(struct socket *sock, int backlog)
 	sk->sk_state		= TCP_LISTEN;
 	/* set credentials so connect can copy them */
 	sk->sk_peercred.pid	= task_tgid_vnr(current);
-	sk->sk_peercred.uid	= current->euid;
-	sk->sk_peercred.gid	= current->egid;
+	sk->sk_peercred.uid	= current->act_as->euid;
+	sk->sk_peercred.gid	= current->act_as->egid;
 	err = 0;
 
 out_unlock:
@@ -1135,8 +1135,8 @@ restart:
 	newsk->sk_state		= TCP_ESTABLISHED;
 	newsk->sk_type		= sk->sk_type;
 	newsk->sk_peercred.pid	= task_tgid_vnr(current);
-	newsk->sk_peercred.uid	= current->euid;
-	newsk->sk_peercred.gid	= current->egid;
+	newsk->sk_peercred.uid	= current->act_as->euid;
+	newsk->sk_peercred.gid	= current->act_as->egid;
 	newu = unix_sk(newsk);
 	newsk->sk_sleep		= &newu->peer_wait;
 	otheru = unix_sk(other);
@@ -1196,8 +1196,8 @@ static int unix_socketpair(struct socket *socka, struct socket *sockb)
 	unix_peer(ska)=skb;
 	unix_peer(skb)=ska;
 	ska->sk_peercred.pid = skb->sk_peercred.pid = task_tgid_vnr(current);
-	ska->sk_peercred.uid = skb->sk_peercred.uid = current->euid;
-	ska->sk_peercred.gid = skb->sk_peercred.gid = current->egid;
+	ska->sk_peercred.uid = skb->sk_peercred.uid = current->act_as->euid;
+	ska->sk_peercred.gid = skb->sk_peercred.gid = current->act_as->egid;
 
 	if (ska->sk_type != SOCK_DGRAM) {
 		ska->sk_state = TCP_ESTABLISHED;
diff --git a/security/dummy.c b/security/dummy.c
index 7993b30..287f3ec 100644
--- a/security/dummy.c
+++ b/security/dummy.c
@@ -37,11 +37,11 @@ static int dummy_capget (struct task_struct *target, kernel_cap_t * effective,
 			 kernel_cap_t * inheritable, kernel_cap_t * permitted)
 {
 	*effective = *inheritable = *permitted = 0;
-	if (target->euid == 0) {
+	if (target->sec->euid == 0) {
 		*permitted |= (~0 & ~CAP_FS_MASK);
 		*effective |= (~0 & ~CAP_TO_MASK(CAP_SETPCAP) & ~CAP_FS_MASK);
 	}
-	if (target->fsuid == 0) {
+	if (target->sec->fsuid == 0) {
 		*permitted |= CAP_FS_MASK;
 		*effective |= CAP_FS_MASK;
 	}
@@ -71,7 +71,7 @@ static int dummy_acct (struct file *file)
 
 static int dummy_capable (struct task_struct *tsk, int cap)
 {
-	if (cap_raised (tsk->cap_effective, cap))
+	if (cap_raised(tsk->act_as->cap_effective, cap))
 		return 0;
 	return -EPERM;
 }
@@ -93,7 +93,7 @@ static int dummy_quota_on (struct dentry *dentry)
 
 static int dummy_syslog (int type)
 {
-	if ((type != 3 && type != 10) && current->euid)
+	if ((type != 3 && type != 10) && current->act_as->euid)
 		return -EPERM;
 	return 0;
 }
@@ -126,19 +126,24 @@ static void dummy_bprm_free_security (struct linux_binprm *bprm)
 
 static void dummy_bprm_apply_creds (struct linux_binprm *bprm, int unsafe)
 {
-	if (bprm->e_uid != current->uid || bprm->e_gid != current->gid) {
+	struct task_security *sec = current->sec;
+
+	if (bprm->e_uid != sec->uid || bprm->e_gid != sec->gid) {
 		set_dumpable(current->mm, suid_dumpable);
 
 		if ((unsafe & ~LSM_UNSAFE_PTRACE_CAP) && !capable(CAP_SETUID)) {
-			bprm->e_uid = current->uid;
-			bprm->e_gid = current->gid;
+			bprm->e_uid = sec->uid;
+			bprm->e_gid = sec->gid;
 		}
 	}
 
-	current->suid = current->euid = current->fsuid = bprm->e_uid;
-	current->sgid = current->egid = current->fsgid = bprm->e_gid;
+	sec->suid = sec->euid = sec->fsuid = bprm->e_uid;
+	sec->sgid = sec->egid = sec->fsgid = bprm->e_gid;
 
-	dummy_capget(current, &current->cap_effective, &current->cap_inheritable, &current->cap_permitted);
+	dummy_capget(current,
+		     &sec->cap_effective,
+		     &sec->cap_inheritable,
+		     &sec->cap_permitted);
 }
 
 static void dummy_bprm_post_apply_creds (struct linux_binprm *bprm)
@@ -162,8 +167,8 @@ static int dummy_bprm_secureexec (struct linux_binprm *bprm)
 	   in the AT_SECURE field to decide whether secure mode
 	   is required.  Hence, this logic is required to preserve
 	   the legacy decision algorithm used by the old userland. */
-	return (current->euid != current->uid ||
-		current->egid != current->gid);
+	return (current->sec->euid != current->sec->uid ||
+		current->sec->egid != current->sec->gid);
 }
 
 static int dummy_sb_alloc_security (struct super_block *sb)
@@ -492,7 +497,12 @@ static int dummy_task_setuid (uid_t id0, uid_t id1, uid_t id2, int flags)
 
 static int dummy_task_post_setuid (uid_t id0, uid_t id1, uid_t id2, int flags)
 {
-	dummy_capget(current, &current->cap_effective, &current->cap_inheritable, &current->cap_permitted);
+	struct task_security *sec = current->sec;
+
+	dummy_capget(current,
+		     &sec->cap_effective,
+		     &sec->cap_inheritable,
+		     &sec->cap_permitted);
 	return 0;
 }
 
@@ -579,7 +589,7 @@ static int dummy_task_prctl (int option, unsigned long arg2, unsigned long arg3,
 
 static void dummy_task_reparent_to_init (struct task_struct *p)
 {
-	p->euid = p->fsuid = 0;
+	p->sec->euid = p->sec->fsuid = 0;
 	return;
 }
 
@@ -689,7 +699,7 @@ static int dummy_sem_semop (struct sem_array *sma,
 
 static int dummy_netlink_send (struct sock *sk, struct sk_buff *skb)
 {
-	NETLINK_CB(skb).eff_cap = current->cap_effective;
+	NETLINK_CB(skb).eff_cap = current->act_as->cap_effective;
 	return 0;
 }
 
diff --git a/security/keys/key.c b/security/keys/key.c
index 48fabb1..1692988 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -817,7 +817,7 @@ key_ref_t key_create_or_update(key_ref_t keyring_ref,
 		perm |= KEY_USR_WRITE;
 
 	/* allocate a new key */
-	key = key_alloc(ktype, description, current_fsuid(), current->fsgid,
+	key = key_alloc(ktype, description, current_fsuid(), current_fsgid(),
 			current, perm, flags);
 	if (IS_ERR(key)) {
 		key_ref = ERR_PTR(PTR_ERR(key));
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index b3a63dd..4051948 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -846,7 +846,7 @@ long keyctl_instantiate_key(key_serial_t id,
 	/* the appropriate instantiation authorisation key must have been
 	 * assumed before calling this */
 	ret = -EPERM;
-	instkey = current->request_key_auth;
+	instkey = current->sec->request_key_auth;
 	if (!instkey)
 		goto error;
 
@@ -895,8 +895,8 @@ long keyctl_instantiate_key(key_serial_t id,
 	/* discard the assumed authority if it's just been disabled by
 	 * instantiation of the key */
 	if (ret == 0) {
-		key_put(current->request_key_auth);
-		current->request_key_auth = NULL;
+		key_put(current->sec->request_key_auth);
+		current->sec->request_key_auth = NULL;
 	}
 
 error2:
@@ -924,7 +924,7 @@ long keyctl_negate_key(key_serial_t id, unsigned timeout, key_serial_t ringid)
 	/* the appropriate instantiation authorisation key must have been
 	 * assumed before calling this */
 	ret = -EPERM;
-	instkey = current->request_key_auth;
+	instkey = current->sec->request_key_auth;
 	if (!instkey)
 		goto error;
 
@@ -952,8 +952,8 @@ long keyctl_negate_key(key_serial_t id, unsigned timeout, key_serial_t ringid)
 	/* discard the assumed authority if it's just been disabled by
 	 * instantiation of the key */
 	if (ret == 0) {
-		key_put(current->request_key_auth);
-		current->request_key_auth = NULL;
+		key_put(current->sec->request_key_auth);
+		current->sec->request_key_auth = NULL;
 	}
 
 error:
@@ -968,6 +968,7 @@ error:
  */
 long keyctl_set_reqkey_keyring(int reqkey_defl)
 {
+	struct task_security *sec = current->sec;
 	int ret;
 
 	switch (reqkey_defl) {
@@ -987,10 +988,10 @@ long keyctl_set_reqkey_keyring(int reqkey_defl)
 	case KEY_REQKEY_DEFL_USER_KEYRING:
 	case KEY_REQKEY_DEFL_USER_SESSION_KEYRING:
 	set:
-		current->jit_keyring = reqkey_defl;
+		sec->jit_keyring = reqkey_defl;
 
 	case KEY_REQKEY_DEFL_NO_CHANGE:
-		return current->jit_keyring;
+		return sec->jit_keyring;
 
 	case KEY_REQKEY_DEFL_GROUP_KEYRING:
 	default:
@@ -1055,8 +1056,8 @@ long keyctl_assume_authority(key_serial_t id)
 
 	/* we divest ourselves of authority if given an ID of 0 */
 	if (id == 0) {
-		key_put(current->request_key_auth);
-		current->request_key_auth = NULL;
+		key_put(current->sec->request_key_auth);
+		current->sec->request_key_auth = NULL;
 		ret = 0;
 		goto error;
 	}
@@ -1072,8 +1073,8 @@ long keyctl_assume_authority(key_serial_t id)
 		goto error;
 	}
 
-	key_put(current->request_key_auth);
-	current->request_key_auth = authkey;
+	key_put(current->sec->request_key_auth);
+	current->sec->request_key_auth = authkey;
 	ret = authkey->serial;
 
 error:
diff --git a/security/keys/permission.c b/security/keys/permission.c
index 3b41f9b..07898bd 100644
--- a/security/keys/permission.c
+++ b/security/keys/permission.c
@@ -22,6 +22,7 @@ int key_task_permission(const key_ref_t key_ref,
 			struct task_struct *context,
 			key_perm_t perm)
 {
+	struct task_security *sec = context->act_as;
 	struct key *key;
 	key_perm_t kperm;
 	int ret;
@@ -29,7 +30,7 @@ int key_task_permission(const key_ref_t key_ref,
 	key = key_ref_to_ptr(key_ref);
 
 	/* use the second 8-bits of permissions for keys the caller owns */
-	if (key->uid == context->fsuid) {
+	if (key->uid == sec->fsuid) {
 		kperm = key->perm >> 16;
 		goto use_these_perms;
 	}
@@ -37,14 +38,14 @@ int key_task_permission(const key_ref_t key_ref,
 	/* use the third 8-bits of permissions for keys the caller has a group
 	 * membership in common with */
 	if (key->gid != -1 && key->perm & KEY_GRP_ALL) {
-		if (key->gid == context->fsgid) {
+		if (key->gid == sec->fsgid) {
 			kperm = key->perm >> 8;
 			goto use_these_perms;
 		}
 
-		task_lock(context);
-		ret = groups_search(context->group_info, key->gid);
-		task_unlock(context);
+		spin_lock(&sec->lock);
+		ret = groups_search(sec->group_info, key->gid);
+		spin_unlock(&sec->lock);
 
 		if (ret) {
 			kperm = key->perm >> 8;
diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
index 2a0eb94..98854b2 100644
--- a/security/keys/process_keys.c
+++ b/security/keys/process_keys.c
@@ -145,7 +145,7 @@ int install_thread_keyring(struct task_struct *tsk)
 
 	sprintf(buf, "_tid.%u", tsk->pid);
 
-	keyring = keyring_alloc(buf, tsk->uid, tsk->gid, tsk,
+	keyring = keyring_alloc(buf, tsk->sec->uid, tsk->sec->gid, tsk,
 				KEY_ALLOC_QUOTA_OVERRUN, NULL);
 	if (IS_ERR(keyring)) {
 		ret = PTR_ERR(keyring);
@@ -153,8 +153,8 @@ int install_thread_keyring(struct task_struct *tsk)
 	}
 
 	task_lock(tsk);
-	old = tsk->thread_keyring;
-	tsk->thread_keyring = keyring;
+	old = tsk->sec->thread_keyring;
+	tsk->sec->thread_keyring = keyring;
 	task_unlock(tsk);
 
 	ret = 0;
@@ -180,7 +180,7 @@ int install_process_keyring(struct task_struct *tsk)
 	if (!tsk->signal->process_keyring) {
 		sprintf(buf, "_pid.%u", tsk->tgid);
 
-		keyring = keyring_alloc(buf, tsk->uid, tsk->gid, tsk,
+		keyring = keyring_alloc(buf, tsk->sec->uid, tsk->sec->gid, tsk,
 					KEY_ALLOC_QUOTA_OVERRUN, NULL);
 		if (IS_ERR(keyring)) {
 			ret = PTR_ERR(keyring);
@@ -226,7 +226,7 @@ static int install_session_keyring(struct task_struct *tsk,
 		if (tsk->signal->session_keyring)
 			flags = KEY_ALLOC_IN_QUOTA;
 
-		keyring = keyring_alloc(buf, tsk->uid, tsk->gid, tsk,
+		keyring = keyring_alloc(buf, tsk->sec->uid, tsk->sec->gid, tsk,
 					flags, NULL);
 		if (IS_ERR(keyring))
 			return PTR_ERR(keyring);
@@ -280,14 +280,14 @@ int copy_thread_group_keys(struct task_struct *tsk)
  */
 int copy_keys(unsigned long clone_flags, struct task_struct *tsk)
 {
-	key_check(tsk->thread_keyring);
-	key_check(tsk->request_key_auth);
+	key_check(tsk->sec->thread_keyring);
+	key_check(tsk->sec->request_key_auth);
 
 	/* no thread keyring yet */
-	tsk->thread_keyring = NULL;
+	tsk->sec->thread_keyring = NULL;
 
 	/* copy the request_key() authorisation for this thread */
-	key_get(tsk->request_key_auth);
+	key_get(tsk->sec->request_key_auth);
 
 	return 0;
 
@@ -310,8 +310,8 @@ void exit_thread_group_keys(struct signal_struct *tg)
  */
 void exit_keys(struct task_struct *tsk)
 {
-	key_put(tsk->thread_keyring);
-	key_put(tsk->request_key_auth);
+	key_put(tsk->sec->thread_keyring);
+	key_put(tsk->sec->request_key_auth);
 
 } /* end exit_keys() */
 
@@ -325,8 +325,8 @@ int exec_keys(struct task_struct *tsk)
 
 	/* newly exec'd tasks don't get a thread keyring */
 	task_lock(tsk);
-	old = tsk->thread_keyring;
-	tsk->thread_keyring = NULL;
+	old = tsk->sec->thread_keyring;
+	tsk->sec->thread_keyring = NULL;
 	task_unlock(tsk);
 
 	key_put(old);
@@ -361,10 +361,11 @@ int suid_keys(struct task_struct *tsk)
 void key_fsuid_changed(struct task_struct *tsk)
 {
 	/* update the ownership of the thread keyring */
-	if (tsk->thread_keyring) {
-		down_write(&tsk->thread_keyring->sem);
-		tsk->thread_keyring->uid = tsk->fsuid;
-		up_write(&tsk->thread_keyring->sem);
+	BUG_ON(!tsk->sec);
+	if (tsk->sec->thread_keyring) {
+		down_write(&tsk->sec->thread_keyring->sem);
+		tsk->sec->thread_keyring->uid = tsk->sec->fsuid;
+		up_write(&tsk->sec->thread_keyring->sem);
 	}
 
 } /* end key_fsuid_changed() */
@@ -376,10 +377,11 @@ void key_fsuid_changed(struct task_struct *tsk)
 void key_fsgid_changed(struct task_struct *tsk)
 {
 	/* update the ownership of the thread keyring */
-	if (tsk->thread_keyring) {
-		down_write(&tsk->thread_keyring->sem);
-		tsk->thread_keyring->gid = tsk->fsgid;
-		up_write(&tsk->thread_keyring->sem);
+	BUG_ON(!tsk->sec);
+	if (tsk->sec->thread_keyring) {
+		down_write(&tsk->sec->thread_keyring->sem);
+		tsk->sec->thread_keyring->gid = tsk->sec->fsgid;
+		up_write(&tsk->sec->thread_keyring->sem);
 	}
 
 } /* end key_fsgid_changed() */
@@ -414,9 +416,9 @@ key_ref_t search_process_keyrings(struct key_type *type,
 	err = ERR_PTR(-EAGAIN);
 
 	/* search the thread keyring first */
-	if (context->thread_keyring) {
+	if (context->sec->thread_keyring) {
 		key_ref = keyring_search_aux(
-			make_key_ref(context->thread_keyring, 1),
+			make_key_ref(context->sec->thread_keyring, 1),
 			context, type, description, match);
 		if (!IS_ERR(key_ref))
 			goto found;
@@ -483,7 +485,7 @@ key_ref_t search_process_keyrings(struct key_type *type,
 	/* or search the user-session keyring */
 	else {
 		key_ref = keyring_search_aux(
-			make_key_ref(context->user->session_keyring, 1),
+			make_key_ref(context->sec->user->session_keyring, 1),
 			context, type, description, match);
 		if (!IS_ERR(key_ref))
 			goto found;
@@ -505,20 +507,20 @@ key_ref_t search_process_keyrings(struct key_type *type,
 	 * search the keyrings of the process mentioned there
 	 * - we don't permit access to request_key auth keys via this method
 	 */
-	if (context->request_key_auth &&
+	if (context->sec->request_key_auth &&
 	    context == current &&
 	    type != &key_type_request_key_auth
 	    ) {
 		/* defend against the auth key being revoked */
-		down_read(&context->request_key_auth->sem);
+		down_read(&context->sec->request_key_auth->sem);
 
-		if (key_validate(context->request_key_auth) == 0) {
-			rka = context->request_key_auth->payload.data;
+		if (key_validate(context->sec->request_key_auth) == 0) {
+			rka = context->sec->request_key_auth->payload.data;
 
 			key_ref = search_process_keyrings(type, description,
 							  match, rka->context);
 
-			up_read(&context->request_key_auth->sem);
+			up_read(&context->sec->request_key_auth->sem);
 
 			if (!IS_ERR(key_ref))
 				goto found;
@@ -535,7 +537,7 @@ key_ref_t search_process_keyrings(struct key_type *type,
 				break;
 			}
 		} else {
-			up_read(&context->request_key_auth->sem);
+			up_read(&context->sec->request_key_auth->sem);
 		}
 	}
 
@@ -577,7 +579,7 @@ key_ref_t lookup_user_key(struct task_struct *context, key_serial_t id,
 
 	switch (id) {
 	case KEY_SPEC_THREAD_KEYRING:
-		if (!context->thread_keyring) {
+		if (!context->sec->thread_keyring) {
 			if (!create)
 				goto error;
 
@@ -588,7 +590,7 @@ key_ref_t lookup_user_key(struct task_struct *context, key_serial_t id,
 			}
 		}
 
-		key = context->thread_keyring;
+		key = context->sec->thread_keyring;
 		atomic_inc(&key->usage);
 		key_ref = make_key_ref(key, 1);
 		break;
@@ -615,7 +617,7 @@ key_ref_t lookup_user_key(struct task_struct *context, key_serial_t id,
 			/* always install a session keyring upon access if one
 			 * doesn't exist yet */
 			ret = install_session_keyring(
-				context, context->user->session_keyring);
+				context, context->sec->user->session_keyring);
 			if (ret < 0)
 				goto error;
 		}
@@ -628,13 +630,13 @@ key_ref_t lookup_user_key(struct task_struct *context, key_serial_t id,
 		break;
 
 	case KEY_SPEC_USER_KEYRING:
-		key = context->user->uid_keyring;
+		key = context->sec->user->uid_keyring;
 		atomic_inc(&key->usage);
 		key_ref = make_key_ref(key, 1);
 		break;
 
 	case KEY_SPEC_USER_SESSION_KEYRING:
-		key = context->user->session_keyring;
+		key = context->sec->user->session_keyring;
 		atomic_inc(&key->usage);
 		key_ref = make_key_ref(key, 1);
 		break;
@@ -645,7 +647,7 @@ key_ref_t lookup_user_key(struct task_struct *context, key_serial_t id,
 		goto error;
 
 	case KEY_SPEC_REQKEY_AUTH_KEY:
-		key = context->request_key_auth;
+		key = context->sec->request_key_auth;
 		if (!key)
 			goto error;
 
@@ -747,7 +749,7 @@ long join_session_keyring(const char *name)
 	keyring = find_keyring_by_name(name, 0);
 	if (PTR_ERR(keyring) == -ENOKEY) {
 		/* not found - try and create a new one */
-		keyring = keyring_alloc(name, tsk->uid, tsk->gid, tsk,
+		keyring = keyring_alloc(name, tsk->sec->uid, tsk->sec->gid, tsk,
 					KEY_ALLOC_IN_QUOTA, NULL);
 		if (IS_ERR(keyring)) {
 			ret = PTR_ERR(keyring);
diff --git a/security/keys/request_key.c b/security/keys/request_key.c
index 6d25911..54b03bb 100644
--- a/security/keys/request_key.c
+++ b/security/keys/request_key.c
@@ -76,7 +76,7 @@ static int call_sbin_request_key(struct key_construction *cons,
 	/* allocate a new session keyring */
 	sprintf(desc, "_req.%u", key->serial);
 
-	keyring = keyring_alloc(desc, current_fsuid(), current->fsgid, current,
+	keyring = keyring_alloc(desc, current_fsuid(), current_fsgid(), current,
 				KEY_ALLOC_QUOTA_OVERRUN, NULL);
 	if (IS_ERR(keyring)) {
 		ret = PTR_ERR(keyring);
@@ -97,7 +97,8 @@ static int call_sbin_request_key(struct key_construction *cons,
 
 	/* we specify the process's default keyrings */
 	sprintf(keyring_str[0], "%d",
-		tsk->thread_keyring ? tsk->thread_keyring->serial : 0);
+		tsk->act_as->thread_keyring ?
+		tsk->act_as->thread_keyring->serial : 0);
 
 	prkey = 0;
 	if (tsk->signal->process_keyring)
@@ -110,7 +111,7 @@ static int call_sbin_request_key(struct key_construction *cons,
 		sskey = rcu_dereference(tsk->signal->session_keyring)->serial;
 		rcu_read_unlock();
 	} else {
-		sskey = tsk->user->session_keyring->serial;
+		sskey = tsk->act_as->user->session_keyring->serial;
 	}
 
 	sprintf(keyring_str[2], "%d", sskey);
@@ -216,10 +217,10 @@ static void construct_key_make_link(struct key *key, struct key *dest_keyring)
 
 	/* find the appropriate keyring */
 	if (!dest_keyring) {
-		switch (tsk->jit_keyring) {
+		switch (tsk->act_as->jit_keyring) {
 		case KEY_REQKEY_DEFL_DEFAULT:
 		case KEY_REQKEY_DEFL_THREAD_KEYRING:
-			dest_keyring = tsk->thread_keyring;
+			dest_keyring = tsk->act_as->thread_keyring;
 			if (dest_keyring)
 				break;
 
@@ -239,11 +240,11 @@ static void construct_key_make_link(struct key *key, struct key *dest_keyring)
 				break;
 
 		case KEY_REQKEY_DEFL_USER_SESSION_KEYRING:
-			dest_keyring = tsk->user->session_keyring;
+			dest_keyring = tsk->act_as->user->session_keyring;
 			break;
 
 		case KEY_REQKEY_DEFL_USER_KEYRING:
-			dest_keyring = tsk->user->uid_keyring;
+			dest_keyring = tsk->act_as->user->uid_keyring;
 			break;
 
 		case KEY_REQKEY_DEFL_GROUP_KEYRING:
@@ -278,7 +279,7 @@ static int construct_alloc_key(struct key_type *type,
 	mutex_lock(&user->cons_lock);
 
 	key = key_alloc(type, description,
-			current_fsuid(), current->fsgid, current, KEY_POS_ALL,
+			current_fsuid(), current_fsgid(), current, KEY_POS_ALL,
 			flags);
 	if (IS_ERR(key))
 		goto alloc_failed;
diff --git a/security/keys/request_key_auth.c b/security/keys/request_key_auth.c
index cce6b4d..9598670 100644
--- a/security/keys/request_key_auth.c
+++ b/security/keys/request_key_auth.c
@@ -162,22 +162,22 @@ struct key *request_key_auth_new(struct key *target, const void *callout_info,
 
 	/* see if the calling process is already servicing the key request of
 	 * another process */
-	if (current->request_key_auth) {
+	if (current->act_as->request_key_auth) {
 		/* it is - use that instantiation context here too */
-		down_read(&current->request_key_auth->sem);
+		down_read(&current->act_as->request_key_auth->sem);
 
 		/* if the auth key has been revoked, then the key we're
 		 * servicing is already instantiated */
 		if (test_bit(KEY_FLAG_REVOKED,
-			     &current->request_key_auth->flags))
+			     &current->act_as->request_key_auth->flags))
 			goto auth_key_revoked;
 
-		irka = current->request_key_auth->payload.data;
+		irka = current->act_as->request_key_auth->payload.data;
 		rka->context = irka->context;
 		rka->pid = irka->pid;
 		get_task_struct(rka->context);
 
-		up_read(&current->request_key_auth->sem);
+		up_read(&current->act_as->request_key_auth->sem);
 	}
 	else {
 		/* it isn't - use this process as the context */
@@ -194,7 +194,7 @@ struct key *request_key_auth_new(struct key *target, const void *callout_info,
 	sprintf(desc, "%x", target->serial);
 
 	authkey = key_alloc(&key_type_request_key_auth, desc,
-			    current_fsuid(), current->fsgid, current,
+			    current_fsuid(), current_fsgid(), current,
 			    KEY_POS_VIEW | KEY_POS_READ | KEY_POS_SEARCH |
 			    KEY_USR_VIEW, KEY_ALLOC_NOT_IN_QUOTA);
 	if (IS_ERR(authkey)) {
@@ -211,7 +211,7 @@ struct key *request_key_auth_new(struct key *target, const void *callout_info,
 	return authkey;
 
 auth_key_revoked:
-	up_read(&current->request_key_auth->sem);
+	up_read(&current->act_as->request_key_auth->sem);
 	kfree(rka->callout_info);
 	kfree(rka);
 	kleave("= -EKEYREVOKED");
diff --git a/security/selinux/exports.c b/security/selinux/exports.c
index b6f9694..f660690 100644
--- a/security/selinux/exports.c
+++ b/security/selinux/exports.c
@@ -56,7 +56,7 @@ void selinux_get_ipc_sid(const struct kern_ipc_perm *ipcp, u32 *sid)
 void selinux_get_task_sid(struct task_struct *tsk, u32 *sid)
 {
 	if (selinux_enabled) {
-		struct task_security_struct *tsec = tsk->security;
+		struct task_security_struct *tsec = tsk->sec->security;
 		*sid = tsec->sid;
 		return;
 	}
@@ -77,7 +77,7 @@ EXPORT_SYMBOL_GPL(selinux_string_to_sid);
 int selinux_relabel_packet_permission(u32 sid)
 {
 	if (selinux_enabled) {
-		struct task_security_struct *tsec = current->security;
+		struct task_security_struct *tsec = current->act_as->security;
 
 		return avc_has_perm(tsec->sid, sid, SECCLASS_PACKET,
 				    PACKET__RELABELTO, NULL);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index bd4cfab..e56529f 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -165,21 +165,21 @@ static int task_alloc_security(struct task_struct *task)
 
 	tsec->task = task;
 	tsec->osid = tsec->sid = tsec->ptrace_sid = SECINITSID_UNLABELED;
-	task->security = tsec;
+	task->sec->security = tsec;
 
 	return 0;
 }
 
 static void task_free_security(struct task_struct *task)
 {
-	struct task_security_struct *tsec = task->security;
-	task->security = NULL;
+	struct task_security_struct *tsec = task->sec->security;
+	task->sec->security = NULL;
 	kfree(tsec);
 }
 
 static int inode_alloc_security(struct inode *inode)
 {
-	struct task_security_struct *tsec = current->security;
+	struct task_security_struct *tsec = current->act_as->security;
 	struct inode_security_struct *isec;
 
 	isec = kmem_cache_zalloc(sel_inode_cache, GFP_KERNEL);
@@ -213,7 +213,7 @@ static void inode_free_security(struct inode *inode)
 
 static int file_alloc_security(struct file *file)
 {
-	struct task_security_struct *tsec = current->security;
+	struct task_security_struct *tsec = current->act_as->security;
 	struct file_security_struct *fsec;
 
 	fsec = kzalloc(sizeof(struct file_security_struct), GFP_KERNEL);
@@ -373,7 +373,7 @@ static int try_context_mount(struct super_block *sb, void *data)
 	const char *name;
 	u32 sid;
 	int alloc = 0, rc = 0, seen = 0;
-	struct task_security_struct *tsec = current->security;
+	struct task_security_struct *tsec = current->act_as->security;
 	struct superblock_security_struct *sbsec = sb->s_security;
 
 	if (!data)
@@ -1033,8 +1033,8 @@ static int task_has_perm(struct task_struct *tsk1,
 {
 	struct task_security_struct *tsec1, *tsec2;
 
-	tsec1 = tsk1->security;
-	tsec2 = tsk2->security;
+	tsec1 = tsk1->act_as->security;
+	tsec2 = tsk2->sec->security;
 	return avc_has_perm(tsec1->sid, tsec2->sid,
 			    SECCLASS_PROCESS, perms, NULL);
 }
@@ -1046,7 +1046,7 @@ static int task_has_capability(struct task_struct *tsk,
 	struct task_security_struct *tsec;
 	struct avc_audit_data ad;
 
-	tsec = tsk->security;
+	tsec = tsk->sec->security;
 
 	AVC_AUDIT_DATA_INIT(&ad,CAP);
 	ad.tsk = tsk;
@@ -1062,7 +1062,7 @@ static int task_has_system(struct task_struct *tsk,
 {
 	struct task_security_struct *tsec;
 
-	tsec = tsk->security;
+	tsec = tsk->sec->security;
 
 	return avc_has_perm(tsec->sid, SECINITSID_KERNEL,
 			    SECCLASS_SYSTEM, perms, NULL);
@@ -1083,7 +1083,7 @@ static int inode_has_perm(struct task_struct *tsk,
 	if (unlikely (IS_PRIVATE (inode)))
 		return 0;
 
-	tsec = tsk->security;
+	tsec = tsk->sec->security;
 	isec = inode->i_security;
 
 	if (!adp) {
@@ -1123,7 +1123,7 @@ static int file_has_perm(struct task_struct *tsk,
 				struct file *file,
 				u32 av)
 {
-	struct task_security_struct *tsec = tsk->security;
+	struct task_security_struct *tsec = tsk->sec->security;
 	struct file_security_struct *fsec = file->f_security;
 	struct vfsmount *mnt = file->f_path.mnt;
 	struct dentry *dentry = file->f_path.dentry;
@@ -1163,7 +1163,7 @@ static int may_create(struct inode *dir,
 	struct avc_audit_data ad;
 	int rc;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	dsec = dir->i_security;
 	sbsec = dir->i_sb->s_security;
 
@@ -1200,7 +1200,7 @@ static int may_create_key(u32 ksid,
 {
 	struct task_security_struct *tsec;
 
-	tsec = ctx->security;
+	tsec = ctx->sec->security;
 
 	return avc_has_perm(tsec->sid, ksid, SECCLASS_KEY, KEY__CREATE, NULL);
 }
@@ -1221,7 +1221,7 @@ static int may_link(struct inode *dir,
 	u32 av;
 	int rc;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	dsec = dir->i_security;
 	isec = dentry->d_inode->i_security;
 
@@ -1265,7 +1265,7 @@ static inline int may_rename(struct inode *old_dir,
 	int old_is_dir, new_is_dir;
 	int rc;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	old_dsec = old_dir->i_security;
 	old_isec = old_dentry->d_inode->i_security;
 	old_is_dir = S_ISDIR(old_dentry->d_inode->i_mode);
@@ -1318,7 +1318,7 @@ static int superblock_has_perm(struct task_struct *tsk,
 	struct task_security_struct *tsec;
 	struct superblock_security_struct *sbsec;
 
-	tsec = tsk->security;
+	tsec = tsk->act_as->security;
 	sbsec = sb->s_security;
 	return avc_has_perm(tsec->sid, sbsec->sid, SECCLASS_FILESYSTEM,
 			    perms, ad);
@@ -1373,8 +1373,8 @@ static inline u32 file_to_av(struct file *file)
 
 static int selinux_ptrace(struct task_struct *parent, struct task_struct *child)
 {
-	struct task_security_struct *psec = parent->security;
-	struct task_security_struct *csec = child->security;
+	struct task_security_struct *psec = parent->act_as->security;
+	struct task_security_struct *csec = child->sec->security;
 	int rc;
 
 	rc = secondary_ops->ptrace(parent,child);
@@ -1482,7 +1482,7 @@ static int selinux_sysctl(ctl_table *table, int op)
 	if (rc)
 		return rc;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 
 	rc = selinux_sysctl_get_sid(table, (op == 0001) ?
 				    SECCLASS_DIR : SECCLASS_FILE, &tsid);
@@ -1591,7 +1591,7 @@ static int selinux_syslog(int type)
 static int selinux_vm_enough_memory(struct mm_struct *mm, long pages)
 {
 	int rc, cap_sys_admin = 0;
-	struct task_security_struct *tsec = current->security;
+	struct task_security_struct *tsec = current->act_as->security;
 
 	rc = secondary_ops->capable(current, CAP_SYS_ADMIN);
 	if (rc == 0)
@@ -1644,7 +1644,7 @@ static int selinux_bprm_set_security(struct linux_binprm *bprm)
 	if (bsec->set)
 		return 0;
 
-	tsec = current->security;
+	tsec = current->sec->security;
 	isec = inode->i_security;
 
 	/* Default to the current task SID. */
@@ -1710,7 +1710,7 @@ static int selinux_bprm_check_security (struct linux_binprm *bprm)
 
 static int selinux_bprm_secureexec (struct linux_binprm *bprm)
 {
-	struct task_security_struct *tsec = current->security;
+	struct task_security_struct *tsec = current->sec->security;
 	int atsecure = 0;
 
 	if (tsec->osid != tsec->sid) {
@@ -1833,7 +1833,7 @@ static void selinux_bprm_apply_creds(struct linux_binprm *bprm, int unsafe)
 
 	secondary_ops->bprm_apply_creds(bprm, unsafe);
 
-	tsec = current->security;
+	tsec = current->sec->security;
 
 	bsec = bprm->security;
 	sid = bsec->sid;
@@ -1878,7 +1878,7 @@ static void selinux_bprm_post_apply_creds(struct linux_binprm *bprm)
 	struct bprm_security_struct *bsec;
 	int rc, i;
 
-	tsec = current->security;
+	tsec = current->sec->security;
 	bsec = bprm->security;
 
 	if (bsec->unsafe) {
@@ -2133,7 +2133,7 @@ static int selinux_inode_init_security(struct inode *inode, struct inode *dir,
 	int rc;
 	char *namep = NULL, *context;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	dsec = dir->i_security;
 	sbsec = dir->i_sb->s_security;
 
@@ -2318,7 +2318,7 @@ static int selinux_inode_setotherxattr(struct dentry *dentry, char *name)
 
 static int selinux_inode_setxattr(struct dentry *dentry, char *name, void *value, size_t size, int flags)
 {
-	struct task_security_struct *tsec = current->security;
+	struct task_security_struct *tsec = current->act_as->security;
 	struct inode *inode = dentry->d_inode;
 	struct inode_security_struct *isec = inode->i_security;
 	struct superblock_security_struct *sbsec;
@@ -2492,7 +2492,7 @@ static int selinux_revalidate_file_permission(struct file *file, int mask)
 static int selinux_file_permission(struct file *file, int mask)
 {
 	struct inode *inode = file->f_path.dentry->d_inode;
-	struct task_security_struct *tsec = current->security;
+	struct task_security_struct *tsec = current->act_as->security;
 	struct file_security_struct *fsec = file->f_security;
 	struct inode_security_struct *isec = inode->i_security;
 
@@ -2600,7 +2600,8 @@ static int selinux_file_mmap(struct file *file, unsigned long reqprot,
 			     unsigned long addr, unsigned long addr_only)
 {
 	int rc = 0;
-	u32 sid = ((struct task_security_struct*)(current->security))->sid;
+	u32 sid = ((struct task_security_struct *)
+		   (current->act_as->security))->sid;
 
 	if (addr < mmap_min_addr)
 		rc = avc_has_perm(sid, sid, SECCLASS_MEMPROTECT,
@@ -2712,7 +2713,7 @@ static int selinux_file_set_fowner(struct file *file)
 	struct task_security_struct *tsec;
 	struct file_security_struct *fsec;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	fsec = file->f_security;
 	fsec->fown_sid = tsec->sid;
 
@@ -2730,7 +2731,7 @@ static int selinux_file_send_sigiotask(struct task_struct *tsk,
 	/* struct fown_struct is never outside the context of a struct file */
         file = container_of(fown, struct file, f_owner);
 
-	tsec = tsk->security;
+	tsec = tsk->sec->security;
 	fsec = file->f_security;
 
 	if (!signum)
@@ -2793,12 +2794,12 @@ static int selinux_task_alloc_security(struct task_struct *tsk)
 	struct task_security_struct *tsec1, *tsec2;
 	int rc;
 
-	tsec1 = current->security;
+	tsec1 = current->act_as->security;
 
 	rc = task_alloc_security(tsk);
 	if (rc)
 		return rc;
-	tsec2 = tsk->security;
+	tsec2 = tsk->sec->security;
 
 	tsec2->osid = tsec1->osid;
 	tsec2->sid = tsec1->sid;
@@ -2955,7 +2956,7 @@ static int selinux_task_kill(struct task_struct *p, struct siginfo *info,
 		perm = PROCESS__SIGNULL; /* null signal; existence test */
 	else
 		perm = signal_to_av(sig);
-	tsec = p->security;
+	tsec = p->sec->security;
 	if (secid)
 		rc = avc_has_perm(secid, tsec->sid, SECCLASS_PROCESS, perm, NULL);
 	else
@@ -2986,7 +2987,7 @@ static void selinux_task_reparent_to_init(struct task_struct *p)
 
 	secondary_ops->task_reparent_to_init(p);
 
-	tsec = p->security;
+	tsec = p->sec->security;
 	tsec->osid = tsec->sid;
 	tsec->sid = SECINITSID_KERNEL;
 	return;
@@ -2995,7 +2996,7 @@ static void selinux_task_reparent_to_init(struct task_struct *p)
 static void selinux_task_to_inode(struct task_struct *p,
 				  struct inode *inode)
 {
-	struct task_security_struct *tsec = p->security;
+	struct task_security_struct *tsec = p->sec->security;
 	struct inode_security_struct *isec = inode->i_security;
 
 	isec->sid = tsec->sid;
@@ -3227,7 +3228,7 @@ static int socket_has_perm(struct task_struct *task, struct socket *sock,
 	struct avc_audit_data ad;
 	int err = 0;
 
-	tsec = task->security;
+	tsec = task->act_as->security;
 	isec = SOCK_INODE(sock)->i_security;
 
 	if (isec->sid == SECINITSID_KERNEL)
@@ -3251,7 +3252,7 @@ static int selinux_socket_create(int family, int type,
 	if (kern)
 		goto out;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	newsid = tsec->sockcreate_sid ? : tsec->sid;
 	err = avc_has_perm(tsec->sid, newsid,
 			   socket_type_to_security_class(family, type,
@@ -3272,7 +3273,7 @@ static int selinux_socket_post_create(struct socket *sock, int family,
 
 	isec = SOCK_INODE(sock)->i_security;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	newsid = tsec->sockcreate_sid ? : tsec->sid;
 	isec->sclass = socket_type_to_security_class(family, type, protocol);
 	isec->sid = kern ? SECINITSID_KERNEL : newsid;
@@ -3317,7 +3318,7 @@ static int selinux_socket_bind(struct socket *sock, struct sockaddr *address, in
 		struct sock *sk = sock->sk;
 		u32 sid, node_perm, addrlen;
 
-		tsec = current->security;
+		tsec = current->act_as->security;
 		isec = SOCK_INODE(sock)->i_security;
 
 		if (family == PF_INET) {
@@ -4091,7 +4092,7 @@ static int ipc_alloc_security(struct task_struct *task,
 			      struct kern_ipc_perm *perm,
 			      u16 sclass)
 {
-	struct task_security_struct *tsec = task->security;
+	struct task_security_struct *tsec = task->act_as->security;
 	struct ipc_security_struct *isec;
 
 	isec = kzalloc(sizeof(struct ipc_security_struct), GFP_KERNEL);
@@ -4143,7 +4144,7 @@ static int ipc_has_perm(struct kern_ipc_perm *ipc_perms,
 	struct ipc_security_struct *isec;
 	struct avc_audit_data ad;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	isec = ipc_perms->security;
 
 	AVC_AUDIT_DATA_INIT(&ad, IPC);
@@ -4174,7 +4175,7 @@ static int selinux_msg_queue_alloc_security(struct msg_queue *msq)
 	if (rc)
 		return rc;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	isec = msq->q_perm.security;
 
 	AVC_AUDIT_DATA_INIT(&ad, IPC);
@@ -4200,7 +4201,7 @@ static int selinux_msg_queue_associate(struct msg_queue *msq, int msqflg)
 	struct ipc_security_struct *isec;
 	struct avc_audit_data ad;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	isec = msq->q_perm.security;
 
 	AVC_AUDIT_DATA_INIT(&ad, IPC);
@@ -4246,7 +4247,7 @@ static int selinux_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg,
 	struct avc_audit_data ad;
 	int rc;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	isec = msq->q_perm.security;
 	msec = msg->security;
 
@@ -4294,7 +4295,7 @@ static int selinux_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg,
 	struct avc_audit_data ad;
 	int rc;
 
-	tsec = target->security;
+	tsec = target->act_as->security;
 	isec = msq->q_perm.security;
 	msec = msg->security;
 
@@ -4321,7 +4322,7 @@ static int selinux_shm_alloc_security(struct shmid_kernel *shp)
 	if (rc)
 		return rc;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	isec = shp->shm_perm.security;
 
 	AVC_AUDIT_DATA_INIT(&ad, IPC);
@@ -4347,7 +4348,7 @@ static int selinux_shm_associate(struct shmid_kernel *shp, int shmflg)
 	struct ipc_security_struct *isec;
 	struct avc_audit_data ad;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	isec = shp->shm_perm.security;
 
 	AVC_AUDIT_DATA_INIT(&ad, IPC);
@@ -4420,7 +4421,7 @@ static int selinux_sem_alloc_security(struct sem_array *sma)
 	if (rc)
 		return rc;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	isec = sma->sem_perm.security;
 
 	AVC_AUDIT_DATA_INIT(&ad, IPC);
@@ -4446,7 +4447,7 @@ static int selinux_sem_associate(struct sem_array *sma, int semflg)
 	struct ipc_security_struct *isec;
 	struct avc_audit_data ad;
 
-	tsec = current->security;
+	tsec = current->act_as->security;
 	isec = sma->sem_perm.security;
 
 	AVC_AUDIT_DATA_INIT(&ad, IPC);
@@ -4565,7 +4566,7 @@ static int selinux_getprocattr(struct task_struct *p,
 			return error;
 	}
 
-	tsec = p->security;
+	tsec = p->sec->security;
 
 	if (!strcmp(name, "current"))
 		sid = tsec->sid;
@@ -4642,7 +4643,7 @@ static int selinux_setprocattr(struct task_struct *p,
 	   operation.  See selinux_bprm_set_security for the execve
 	   checks and may_create for the file creation checks. The
 	   operation will then fail if the context is not permitted. */
-	tsec = p->security;
+	tsec = p->sec->security;
 	if (!strcmp(name, "exec"))
 		tsec->exec_sid = sid;
 	else if (!strcmp(name, "fscreate"))
@@ -4720,7 +4721,7 @@ static void selinux_release_secctx(char *secdata, u32 seclen)
 static int selinux_key_alloc(struct key *k, struct task_struct *tsk,
 			     unsigned long flags)
 {
-	struct task_security_struct *tsec = tsk->security;
+	struct task_security_struct *tsec = tsk->sec->security;
 	struct key_security_struct *ksec;
 
 	ksec = kzalloc(sizeof(struct key_security_struct), GFP_KERNEL);
@@ -4755,7 +4756,7 @@ static int selinux_key_permission(key_ref_t key_ref,
 
 	key = key_ref_to_ptr(key_ref);
 
-	tsec = ctx->security;
+	tsec = ctx->sec->security;
 	ksec = key->security;
 
 	/* if no specific permissions are requested, we skip the
@@ -4978,7 +4979,7 @@ static __init int selinux_init(void)
 	/* Set the security state for the initial task. */
 	if (task_alloc_security(current))
 		panic("SELinux:  Failed to initialize initial task.\n");
-	tsec = current->security;
+	tsec = current->sec->security;
 	tsec->osid = tsec->sid = SECINITSID_KERNEL;
 
 	sel_inode_cache = kmem_cache_create("selinux_inode_security",
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index f5f3e6d..ee7d535 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -79,7 +79,7 @@ static int task_has_security(struct task_struct *tsk,
 {
 	struct task_security_struct *tsec;
 
-	tsec = tsk->security;
+	tsec = tsk->act_as->security;
 	if (!tsec)
 		return -EACCES;
 


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 07/28] SECURITY: De-embed task security record from task and use refcounting [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (5 preceding siblings ...)
  2007-12-05 19:38 ` [PATCH 06/28] SECURITY: Separate task security context from task_struct " David Howells
@ 2007-12-05 19:38 ` David Howells
  2007-12-05 19:38 ` [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions " David Howells
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:38 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Remove the temporarily embedded task security record from task_struct.  Instead
it is made to dangle from the task_struct::sec and task_struct::act_as pointers
with references counted for each.

do_coredump() is made to create a copy of the security record, modify it and
then use that to override the main one for a task.  sys_faccessat() is made to
do the same.

The process and session keyrings are moved from signal_struct into a new
thread_group_security struct.  This is then refcounted, with pointers coming
from the task_security struct instead of from signal_struct.

The keyring functions then take pointers to task_security structs rather than
task_structs for their security contexts.  This is so that request_key() can
proceed asynchronously without having to worry about the initiator task's
act_as pointer changing.

The LSM hooks for dealing with task security are modified to deal with the task
security struct directly rather than going via the task_struct as appopriate.

This permits the subjective security context of a task to be overridden by
changing its act_as pointer without altering its objective security pointer,
and thus not breaking signalling, ptrace, etc. whilst the override is in force.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/exec.c                        |   15 +-
 fs/open.c                        |   37 ++---
 include/linux/init_task.h        |   17 --
 include/linux/key-ui.h           |   10 +
 include/linux/key.h              |   31 +---
 include/linux/sched.h            |   40 ++++-
 include/linux/security.h         |   43 ++++--
 kernel/Makefile                  |    2 
 kernel/cred.c                    |  139 ++++++++++++++++++
 kernel/exit.c                    |    1 
 kernel/fork.c                    |   40 ++---
 kernel/kmod.c                    |   10 +
 kernel/sys.c                     |   16 +-
 kernel/user.c                    |    2 
 net/rxrpc/ar-key.c               |    4 -
 security/dummy.c                 |   14 +-
 security/keys/internal.h         |   10 +
 security/keys/key.c              |    6 -
 security/keys/keyctl.c           |    6 -
 security/keys/keyring.c          |   14 +-
 security/keys/permission.c       |    5 -
 security/keys/proc.c             |    2 
 security/keys/process_keys.c     |  290 +++++++++++++++++---------------------
 security/keys/request_key.c      |   59 ++++----
 security/keys/request_key_auth.c |   38 ++---
 security/security.c              |   20 ++-
 security/selinux/hooks.c         |   46 +++++-
 27 files changed, 526 insertions(+), 391 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index da01655..f10e3fd 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1674,13 +1674,13 @@ int get_dumpable(struct mm_struct *mm)
 
 int do_coredump(long signr, int exit_code, struct pt_regs * regs)
 {
+	struct task_security *sec, *old_act_as;
 	char corename[CORENAME_MAX_SIZE + 1];
 	struct mm_struct *mm = current->mm;
 	struct linux_binfmt * binfmt;
 	struct inode * inode;
 	struct file * file;
 	int retval = 0;
-	int fsuid = current_fsuid();
 	int flag = 0;
 	int ispipe = 0;
 	unsigned long core_limit = current->signal->rlim[RLIMIT_CORE].rlim_cur;
@@ -1692,7 +1692,10 @@ int do_coredump(long signr, int exit_code, struct pt_regs * regs)
 
 	binfmt = current->binfmt;
 	if (!binfmt || !binfmt->core_dump)
-		goto fail;
+		goto fail_nosubj;
+	sec = dup_task_security(current->sec);
+	if (!sec)
+		goto fail_nosubj;
 	down_write(&mm->mmap_sem);
 	/*
 	 * If another thread got here first, or we are not dumpable, bail out.
@@ -1707,9 +1710,11 @@ int do_coredump(long signr, int exit_code, struct pt_regs * regs)
 	 *	process nor do we know its entire history. We only know it
 	 *	was tainted so we dump it as root in mode 2.
 	 */
+	old_act_as = current->act_as;
 	if (get_dumpable(mm) == 2) {	/* Setuid core dump mode */
 		flag = O_EXCL;		/* Stop rewrite attacks */
-		current->act_as->fsuid = 0;	/* Dump root private */
+		sec->fsuid = 0;		/* Dump root private */
+		current->act_as = sec;
 	}
 
 	retval = coredump_wait(exit_code);
@@ -1805,8 +1810,10 @@ fail_unlock:
 	if (helper_argv)
 		argv_free(helper_argv);
 
-	current->act_as->fsuid = fsuid;
+	current->act_as = old_act_as;
 	complete_all(&mm->core_done);
 fail:
+	put_task_security(sec);
+fail_nosubj:
 	return retval;
 }
diff --git a/fs/open.c b/fs/open.c
index 6d7a29c..710102a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -420,34 +420,27 @@ out:
  */
 asmlinkage long sys_faccessat(int dfd, const char __user *filename, int mode)
 {
+	struct task_security *sec, *old_act_as;
 	struct nameidata nd;
-	int old_fsuid, old_fsgid;
-	kernel_cap_t old_cap;
 	int res;
 
 	if (mode & ~S_IRWXO)	/* where's F_OK, X_OK, W_OK, R_OK? */
 		return -EINVAL;
 
-	old_fsuid = current->act_as->fsuid;
-	old_fsgid = current->act_as->fsgid;
-	old_cap = current->act_as->cap_effective;
+	sec = dup_task_security(current->sec);
+	if (!sec)
+		return -ENOMEM;
+	sec->fsuid = current->sec->uid;
+	sec->fsgid = current->sec->gid;
 
-	current->act_as->fsuid = current->act_as->uid;
-	current->act_as->fsgid = current->act_as->gid;
-
-	/*
-	 * Clear the capabilities if we switch to a non-root user
-	 *
-	 * FIXME: There is a race here against sys_capset.  The
-	 * capabilities can change yet we will restore the old
-	 * value below.  We should hold task_capabilities_lock,
-	 * but we cannot because user_path_walk can sleep.
-	 */
-	if (current->act_as->uid)
-		cap_clear(current->act_as->cap_effective);
+	/* Clear the capabilities if we switch to a non-root user */
+	if (current->sec->uid)
+		cap_clear(sec->cap_effective);
 	else
-		current->act_as->cap_effective = current->act_as->cap_permitted;
+		sec->cap_effective = current->sec->cap_permitted;
 
+	old_act_as = current->act_as;
+	current->act_as = sec;
 	res = __user_walk_fd(dfd, filename, LOOKUP_FOLLOW|LOOKUP_ACCESS, &nd);
 	if (res)
 		goto out;
@@ -464,10 +457,8 @@ asmlinkage long sys_faccessat(int dfd, const char __user *filename, int mode)
 out_path_release:
 	path_release(&nd);
 out:
-	current->act_as->fsuid = old_fsuid;
-	current->act_as->fsgid = old_fsgid;
-	current->act_as->cap_effective = old_cap;
-
+	current->act_as = old_act_as;
+	put_task_security(sec);
 	return res;
 }
 
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6fa8413..380cc6a 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -116,18 +116,6 @@ extern struct group_info init_groups;
 
 extern struct task_security init_task_security;
 
-#define INIT_TASK_SECURITY(p)					\
-{								\
-	.usage			= ATOMIC_INIT(3),		\
-	.keep_capabilities	= 0,				\
-	.cap_inheritable	= CAP_INIT_INH_SET,		\
-	.cap_permitted		= CAP_FULL_SET,			\
-	.cap_effective		= CAP_INIT_EFF_SET,		\
-	.user			= INIT_USER,			\
-	.group_info		= &init_groups,			\
-	.lock			= __SPIN_LOCK_UNLOCKED(p.lock),	\
-}
-
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -157,9 +145,8 @@ extern struct task_security init_task_security;
 	.children	= LIST_HEAD_INIT(tsk.children),			\
 	.sibling	= LIST_HEAD_INIT(tsk.sibling),			\
 	.group_leader	= &tsk,						\
-	.__temp_sec	= INIT_TASK_SECURITY(tsk.__temp_sec),		\
-	.sec		= &tsk.__temp_sec,				\
-	.act_as		= &tsk.__temp_sec,				\
+	.sec		= &init_task_security,				\
+	.act_as		= &init_task_security,				\
 	.comm		= "swapper",					\
 	.thread		= INIT_THREAD,					\
 	.fs		= &init_fs,					\
diff --git a/include/linux/key-ui.h b/include/linux/key-ui.h
index e8b8a7a..f15ea9d 100644
--- a/include/linux/key-ui.h
+++ b/include/linux/key-ui.h
@@ -43,15 +43,13 @@ struct keyring_list {
  * check to see whether permission is granted to use a key in the desired way
  */
 extern int key_task_permission(const key_ref_t key_ref,
-			       struct task_struct *context,
+			       struct task_security *sec,
 			       key_perm_t perm);
 
-static inline int key_permission(const key_ref_t key_ref, key_perm_t perm)
-{
-	return key_task_permission(key_ref, current, perm);
-}
+#define key_permission(key_ref, perm) \
+	key_task_permission((key_ref), current->act_as, (perm))
 
-extern key_ref_t lookup_user_key(struct task_struct *context,
+extern key_ref_t lookup_user_key(struct task_security *sec,
 				 key_serial_t id, int create, int partial,
 				 key_perm_t perm);
 
diff --git a/include/linux/key.h b/include/linux/key.h
index 4a6021a..f5edd89 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -70,6 +70,8 @@ struct key;
 struct seq_file;
 struct user_struct;
 struct signal_struct;
+struct task_security;
+struct thread_group_security;
 
 struct key_type;
 struct key_owner;
@@ -178,7 +180,7 @@ struct key {
 extern struct key *key_alloc(struct key_type *type,
 			     const char *desc,
 			     uid_t uid, gid_t gid,
-			     struct task_struct *ctx,
+			     struct task_security *sec,
 			     key_perm_t perm,
 			     unsigned long flags);
 
@@ -245,7 +247,7 @@ extern int key_unlink(struct key *keyring,
 		      struct key *key);
 
 extern struct key *keyring_alloc(const char *description, uid_t uid, gid_t gid,
-				 struct task_struct *ctx,
+				 struct task_security *sec,
 				 unsigned long flags,
 				 struct key *dest);
 
@@ -267,24 +269,16 @@ extern struct key *key_lookup(key_serial_t id);
  */
 extern struct key root_user_keyring, root_session_keyring;
 extern int alloc_uid_keyring(struct user_struct *user,
-			     struct task_struct *ctx);
+			     struct task_security *sec);
 extern void switch_uid_keyring(struct user_struct *new_user);
-extern int copy_keys(unsigned long clone_flags, struct task_struct *tsk);
-extern int copy_thread_group_keys(struct task_struct *tsk);
-extern void exit_keys(struct task_struct *tsk);
-extern void exit_thread_group_keys(struct signal_struct *tg);
+extern int copy_thread_group_keys(struct thread_group_security *tgsec);
 extern int suid_keys(struct task_struct *tsk);
 extern int exec_keys(struct task_struct *tsk);
-extern void key_fsuid_changed(struct task_struct *tsk);
-extern void key_fsgid_changed(struct task_struct *tsk);
+extern void key_fsuid_changed(struct task_security *sec);
+extern void key_fsgid_changed(struct task_security *sec);
 extern void key_init(void);
-
-#define __install_session_keyring(tsk, keyring)			\
-({								\
-	struct key *old_session = tsk->signal->session_keyring;	\
-	tsk->signal->session_keyring = keyring;			\
-	old_session;						\
-})
+extern void __install_session_keyring(struct task_struct *tsk,
+				      struct key *keyring);
 
 #else /* CONFIG_KEYS */
 
@@ -298,11 +292,8 @@ extern void key_init(void);
 #define is_key_possessed(k)		0
 #define alloc_uid_keyring(u,c)		0
 #define switch_uid_keyring(u)		do { } while(0)
-#define __install_session_keyring(t, k)	({ NULL; })
-#define copy_keys(f,t)			0
+#define __install_session_keyring(t, k)	do {} while (0)
 #define copy_thread_group_keys(t)	0
-#define exit_keys(t)			do { } while(0)
-#define exit_thread_group_keys(tg)	do { } while(0)
 #define suid_keys(t)			do { } while(0)
 #define exec_keys(t)			do { } while(0)
 #define key_fsuid_changed(t)		do { } while(0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bcd785a..f6df115 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -491,12 +491,6 @@ struct signal_struct {
 
 	struct list_head cpu_timers[3];
 
-	/* keep the process-shared keyrings here so that they do the right
-	 * thing in threads created with CLONE_THREAD */
-#ifdef CONFIG_KEYS
-	struct key *session_keyring;	/* keyring inherited over fork */
-	struct key *process_keyring;	/* keyring private to this process */
-#endif
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct pacct_struct pacct;	/* per-process accounting information */
 #endif
@@ -572,6 +566,20 @@ extern struct user_struct root_user;
 
 
 /*
+ * The common security details for a thread group
+ * - shared by CLONE_THREAD
+ */
+#ifdef CONFIG_KEYS
+struct thread_group_security {
+	atomic_t	usage;
+	pid_t		tgid;			/* thread group process ID */
+	spinlock_t	lock;
+	struct key	*session_keyring;	/* keyring inherited over fork */
+	struct key	*process_keyring;	/* keyring private to this process */
+};
+#endif
+
+/*
  * The security context of a task
  *
  * The parts of the context break down into two categories:
@@ -613,6 +621,7 @@ struct task_security {
 					 * keys to */
 	struct key	*thread_keyring; /* keyring private to this thread */
 	struct key	*request_key_auth; /* assumed request_key authority */
+	struct thread_group_security *tgsec;
 #endif
 #ifdef CONFIG_SECURITY
 	void		*security;	/* subjective LSM security */
@@ -622,10 +631,28 @@ struct task_security {
 	spinlock_t	lock;		/* lock for pointer changes */
 };
 
+extern struct task_security *dup_task_security(struct task_security *);
+extern int copy_task_security(struct task_struct *, unsigned long);
+extern void put_task_security(struct task_security *);
+
 #define current_fsuid() (current->act_as->fsuid)
 #define current_fsgid() (current->act_as->fsgid)
 #define current_cap()	(current->act_as->cap_effective)
 
+/**
+ * get_task_security - Get an extra reference on a task security record
+ * @sec: The security record to get the reference on
+ *
+ * Get an extra reference on a task security record.  The caller must arrange
+ * for this to be released.
+ */
+static inline
+struct task_security *get_task_security(struct task_security *sec)
+{
+	atomic_inc(&sec->usage);
+	return sec;
+}
+
 
 struct backing_dev_info;
 struct reclaim_state;
@@ -1082,7 +1109,6 @@ struct task_struct {
 	struct list_head cpu_timers[3];
 
 /* process credentials */
-	struct task_security __temp_sec __deprecated; /* temporary security to be removed */
 	struct task_security *sec;	/* actual/objective task security */
 	struct task_security *act_as;	/* effective/subjective task security */
 
diff --git a/include/linux/security.h b/include/linux/security.h
index 8d9e946..b7ba073 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -550,8 +550,13 @@ struct request_sock;
  *	allocated.
  *	Return 0 if operation was successful.
  * @task_free_security:
- *	@p contains the task_struct for process.
+ *	@p points to the task_security struct to be freed.
  *	Deallocate and clear the p->security field.
+ * @task_dup_security:
+ *	@p points to the task_security struct to be copied
+ *	Duplicate and attach the security structure currently attached to the
+ *	p->security field.
+ *	Return 0 if operation was successful.
  * @task_setuid:
  *	Check permission before setting one or more of the user identity
  *	attributes of the current process.  The @flags parameter indicates
@@ -944,6 +949,7 @@ struct request_sock;
  *	Permit allocation of a key and assign security data. Note that key does
  *	not have a serial number assigned at this point.
  *	@key points to the key.
+ *	@sec points to the task security record to use.
  *	@flags is the allocation flags
  *	Return 0 if permission is granted, -ve error otherwise.
  * @key_free:
@@ -954,8 +960,8 @@ struct request_sock;
  *	See whether a specific operational right is granted to a process on a
  *      key.
  *	@key_ref refers to the key (key pointer + possession attribute bit).
- *	@context points to the process to provide the context against which to
- *       evaluate the security data on the key.
+ *	@sec points to the process's security recored to provide the context
+ *       against which to evaluate the security data on the key.
  *	@perm describes the combination of permissions required of this key.
  *	Return 1 if permission granted, 0 if permission denied and -ve it the
  *      normal permissions model should be effected.
@@ -1312,8 +1318,9 @@ struct security_operations {
 	int (*dentry_open)  (struct file *file);
 
 	int (*task_create) (unsigned long clone_flags);
-	int (*task_alloc_security) (struct task_struct * p);
-	void (*task_free_security) (struct task_struct * p);
+	int (*task_alloc_security) (struct task_struct *p);
+	void (*task_free_security) (struct task_security *p);
+	int (*task_dup_security) (struct task_security *p);
 	int (*task_setuid) (uid_t id0, uid_t id1, uid_t id2, int flags);
 	int (*task_post_setuid) (uid_t old_ruid /* or fsuid */ ,
 				 uid_t old_euid, uid_t old_suid, int flags);
@@ -1443,10 +1450,11 @@ struct security_operations {
 
 	/* key management security hooks */
 #ifdef CONFIG_KEYS
-	int (*key_alloc)(struct key *key, struct task_struct *tsk, unsigned long flags);
+	int (*key_alloc)(struct key *key, struct task_security *context,
+			 unsigned long flags);
 	void (*key_free)(struct key *key);
 	int (*key_permission)(key_ref_t key_ref,
-			      struct task_struct *context,
+			      struct task_security *context,
 			      key_perm_t perm);
 	int (*key_getsecurity)(struct key *key, char **_buffer);
 #endif	/* CONFIG_KEYS */
@@ -1561,7 +1569,8 @@ int security_file_receive(struct file *file);
 int security_dentry_open(struct file *file);
 int security_task_create(unsigned long clone_flags);
 int security_task_alloc(struct task_struct *p);
-void security_task_free(struct task_struct *p);
+void security_task_free(struct task_security *p);
+int security_task_dup(struct task_security *p);
 int security_task_setuid(uid_t id0, uid_t id1, uid_t id2, int flags);
 int security_task_post_setuid(uid_t old_ruid, uid_t old_euid,
 			       uid_t old_suid, int flags);
@@ -2032,14 +2041,19 @@ static inline int security_task_create (unsigned long clone_flags)
 	return 0;
 }
 
-static inline int security_task_alloc (struct task_struct *p)
+static inline int security_task_alloc(struct task_struct *p)
 {
 	return 0;
 }
 
-static inline void security_task_free (struct task_struct *p)
+static inline void security_task_free(struct task_security *p)
 { }
 
+static inline int security_task_dup(struct task_security *p)
+{
+	return 0;
+}
+
 static inline int security_task_setuid (uid_t id0, uid_t id1, uid_t id2,
 					int flags)
 {
@@ -2574,16 +2588,17 @@ static inline void security_skb_classify_flow(struct sk_buff *skb, struct flowi
 #ifdef CONFIG_KEYS
 #ifdef CONFIG_SECURITY
 
-int security_key_alloc(struct key *key, struct task_struct *tsk, unsigned long flags);
+int security_key_alloc(struct key *key, struct task_security *sec,
+		       unsigned long flags);
 void security_key_free(struct key *key);
 int security_key_permission(key_ref_t key_ref,
-			    struct task_struct *context, key_perm_t perm);
+			    struct task_security *sec, key_perm_t perm);
 int security_key_getsecurity(struct key *key, char **_buffer);
 
 #else
 
 static inline int security_key_alloc(struct key *key,
-				     struct task_struct *tsk,
+				     struct task_security *sec,
 				     unsigned long flags)
 {
 	return 0;
@@ -2594,7 +2609,7 @@ static inline void security_key_free(struct key *key)
 }
 
 static inline int security_key_permission(key_ref_t key_ref,
-					  struct task_struct *context,
+					  struct task_security *sec,
 					  key_perm_t perm)
 {
 	return 0;
diff --git a/kernel/Makefile b/kernel/Makefile
index dfa9695..eb2af05 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
 	    rcupdate.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
 	    hrtimer.o rwsem.o latency.o nsproxy.o srcu.o \
-	    utsname.o notifier.o
+	    utsname.o notifier.o cred.o
 
 obj-$(CONFIG_SYSCTL) += sysctl_check.o
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
diff --git a/kernel/cred.c b/kernel/cred.c
new file mode 100644
index 0000000..ddf98c6
--- /dev/null
+++ b/kernel/cred.c
@@ -0,0 +1,139 @@
+/* Tasks security and credentials management
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/key.h>
+#include <linux/init_task.h>
+#include <linux/security.h>
+
+#ifdef CONFIG_KEYS
+static struct thread_group_security init_thread_group_security = {
+	.usage	= ATOMIC_INIT(2),
+	.lock	= __SPIN_LOCK_UNLOCKED(init_thread_group_security.lock),
+};
+#endif
+
+struct task_security init_task_security = {
+	.usage			= ATOMIC_INIT(3),
+	.keep_capabilities	= 0,
+	.cap_inheritable	= CAP_INIT_INH_SET,
+	.cap_permitted		= CAP_FULL_SET,
+	.cap_effective		= CAP_INIT_EFF_SET,
+	.user			= INIT_USER,
+#ifdef CONFIG_KEYS
+	.tgsec			= &init_thread_group_security,
+#endif
+	.group_info		= &init_groups,
+	.lock			= __SPIN_LOCK_UNLOCKED(init_task_security.lock),
+};
+
+/**
+ * dup_task_security - Duplicate task security record
+ * @sec: The record to duplicate
+ *
+ * Returns a duplicate of a task security record or NULL if out of memory.
+ */
+struct task_security *dup_task_security(struct task_security *_sec)
+{
+	struct task_security *sec;
+
+	sec = kmemdup(_sec, sizeof(*sec), GFP_KERNEL);
+	if (sec) {
+		atomic_set(&sec->usage, 1);
+		get_uid(sec->user);
+		get_group_info(sec->group_info);
+		key_get(sec->thread_keyring);
+		key_get(sec->request_key_auth);
+		security_task_dup(sec);
+	}
+	return sec;
+}
+EXPORT_SYMBOL(dup_task_security);
+
+/**
+ * copy_task_security - Copy the task security records for fork
+ * @p: The new task
+ * @clone_flags: The details of what the new process shares of the old
+ *
+ * Copy the task security records on a task so that it can affect objects
+ * in the same way as its parent.  Returns 0 if successful or -ENOMEM if out of
+ * memory.
+ */
+int copy_task_security(struct task_struct *p, unsigned long clone_flags)
+{
+	struct task_security *sec;
+
+	sec = kmemdup(p->sec, sizeof(*sec), GFP_KERNEL);
+	if (!sec)
+		return -ENOMEM;
+
+	atomic_set(&sec->usage, 2);
+	spin_lock_init(&sec->lock);
+	get_group_info(sec->group_info);
+	get_uid(p->sec->user);
+
+#ifdef CONFIG_KEYS
+	if (clone_flags & CLONE_THREAD) {
+		atomic_inc(&sec->tgsec->usage);
+	} else {
+		struct thread_group_security *tgsec;
+
+		tgsec = kmalloc(sizeof(*tgsec), GFP_KERNEL);
+		if (!tgsec) {
+			kfree(sec);
+			return -ENOMEM;
+		}
+		atomic_set(&tgsec->usage, 1);
+		spin_lock_init(&tgsec->lock);
+		tgsec->tgid = p->tgid;
+		copy_thread_group_keys(tgsec);
+		sec->tgsec = tgsec;
+	}
+	key_get(sec->request_key_auth);
+	sec->thread_keyring = NULL;
+#endif
+
+#ifdef CONFIG_SECURITY
+	sec->security = NULL;
+#endif
+
+	p->act_as = p->sec = sec;
+	return 0;
+}
+EXPORT_SYMBOL(copy_task_security);
+
+/**
+ * put_task_security - Release a ref on a task security record
+ * @sec: The record to release
+ *
+ * Release a reference to a task security record and destroy it when
+ * there are no references remaining.
+ */
+void put_task_security(struct task_security *sec)
+{
+	if (sec && atomic_dec_and_test(&sec->usage)) {
+		security_task_free(sec);
+		key_put(sec->thread_keyring);
+		key_put(sec->request_key_auth);
+		put_group_info(sec->group_info);
+		free_uid(sec->user);
+
+#ifdef CONFIG_KEYS
+		if (atomic_dec_and_test(&sec->tgsec->usage)) {
+			key_put(sec->tgsec->session_keyring);
+			key_put(sec->tgsec->process_keyring);
+		}
+#endif
+
+		kfree(sec);
+	}
+}
+EXPORT_SYMBOL(put_task_security);
diff --git a/kernel/exit.c b/kernel/exit.c
index d793e22..507b54b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1000,7 +1000,6 @@ fastcall NORET_TYPE void do_exit(long code)
 	check_stack_usage();
 	exit_thread();
 	cgroup_exit(tsk, 1);
-	exit_keys(tsk);
 
 	if (group_dead && tsk->signal->leader)
 		disassociate_ctty(1);
diff --git a/kernel/fork.c b/kernel/fork.c
index 2f267bd..e4572b0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -121,9 +121,8 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(atomic_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
-	security_task_free(tsk);
-	free_uid(tsk->__temp_sec.user);
-	put_group_info(tsk->__temp_sec.group_info);
+	put_task_security(tsk->sec);
+	put_task_security(tsk->act_as);
 	delayacct_tsk_free(tsk);
 
 	if (!profile_handoff_task(tsk))
@@ -845,7 +844,6 @@ void __cleanup_sighand(struct sighand_struct *sighand)
 static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 {
 	struct signal_struct *sig;
-	int ret;
 
 	if (clone_flags & CLONE_THREAD) {
 		atomic_inc(&current->signal->count);
@@ -857,12 +855,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	if (!sig)
 		return -ENOMEM;
 
-	ret = copy_thread_group_keys(tsk);
-	if (ret < 0) {
-		kmem_cache_free(signal_cachep, sig);
-		return ret;
-	}
-
 	atomic_set(&sig->count, 1);
 	atomic_set(&sig->live, 1);
 	init_waitqueue_head(&sig->wait_chldexit);
@@ -920,7 +912,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 void __cleanup_signal(struct signal_struct *sig)
 {
-	exit_thread_group_keys(sig);
 	kmem_cache_free(signal_cachep, sig);
 }
 
@@ -1014,18 +1005,19 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
 	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
 #endif
-	p->act_as = p->sec = &p->__temp_sec;
+	retval = copy_task_security(p, clone_flags);
+	if (retval < 0)
+		goto bad_fork_free;
+
 	retval = -EAGAIN;
 	if (atomic_read(&p->sec->user->processes) >=
 			p->signal->rlim[RLIMIT_NPROC].rlim_cur) {
 		if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_RESOURCE) &&
 		    p->sec->user != current->nsproxy->user_ns->root_user)
-			goto bad_fork_free;
+			goto bad_fork_cleanup_put_task_sec;
 	}
 
-	atomic_inc(&p->sec->user->__count);
 	atomic_inc(&p->sec->user->processes);
-	get_group_info(p->sec->group_info);
 
 	/*
 	 * If multiple threads are within copy_process(), then this check
@@ -1080,9 +1072,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	do_posix_clock_monotonic_gettime(&p->start_time);
 	p->real_start_time = p->start_time;
 	monotonic_to_bootbased(&p->real_start_time);
-#ifdef CONFIG_SECURITY
-	p->sec->security = NULL;
-#endif
 	p->io_context = NULL;
 	p->audit_context = NULL;
 	cgroup_fork(p);
@@ -1130,7 +1119,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	if ((retval = security_task_alloc(p)))
 		goto bad_fork_cleanup_policy;
 	if ((retval = audit_alloc(p)))
-		goto bad_fork_cleanup_security;
+		goto bad_fork_cleanup_policy;
 	/* copy all the process information */
 	if ((retval = copy_semundo(clone_flags, p)))
 		goto bad_fork_cleanup_audit;
@@ -1144,10 +1133,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_sighand;
 	if ((retval = copy_mm(clone_flags, p)))
 		goto bad_fork_cleanup_signal;
-	if ((retval = copy_keys(clone_flags, p)))
-		goto bad_fork_cleanup_mm;
 	if ((retval = copy_namespaces(clone_flags, p)))
-		goto bad_fork_cleanup_keys;
+		goto bad_fork_cleanup_mm;
 	retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
@@ -1329,8 +1316,6 @@ bad_fork_free_pid:
 		free_pid(pid);
 bad_fork_cleanup_namespaces:
 	exit_task_namespaces(p);
-bad_fork_cleanup_keys:
-	exit_keys(p);
 bad_fork_cleanup_mm:
 	if (p->mm)
 		mmput(p->mm);
@@ -1346,8 +1331,6 @@ bad_fork_cleanup_semundo:
 	exit_sem(p);
 bad_fork_cleanup_audit:
 	audit_free(p);
-bad_fork_cleanup_security:
-	security_task_free(p);
 bad_fork_cleanup_policy:
 #ifdef CONFIG_NUMA
 	mpol_free(p->mempolicy);
@@ -1360,9 +1343,10 @@ bad_fork_cleanup_cgroup:
 bad_fork_cleanup_put_domain:
 	module_put(task_thread_info(p)->exec_domain->module);
 bad_fork_cleanup_count:
-	put_group_info(p->sec->group_info);
 	atomic_dec(&p->sec->user->processes);
-	free_uid(p->sec->user);
+bad_fork_cleanup_put_task_sec:
+	put_task_security(p->act_as);
+	put_task_security(p->sec);
 bad_fork_free:
 	free_task(p);
 fork_out:
diff --git a/kernel/kmod.c b/kernel/kmod.c
index c6a4f8a..7e34d46 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -133,20 +133,18 @@ struct subprocess_info {
 static int ____call_usermodehelper(void *data)
 {
 	struct subprocess_info *sub_info = data;
-	struct key *new_session, *old_session;
 	int retval;
 
-	/* Unblock all signals and set the session keyring. */
-	new_session = key_get(sub_info->ring);
+	/* Set the session keyring. */
+	__install_session_keyring(current, sub_info->ring);
+
+	/* Unblock all signals. */
 	spin_lock_irq(&current->sighand->siglock);
-	old_session = __install_session_keyring(current, new_session);
 	flush_signal_handlers(current, 1);
 	sigemptyset(&current->blocked);
 	recalc_sigpending();
 	spin_unlock_irq(&current->sighand->siglock);
 
-	key_put(old_session);
-
 	/* Install input pipe when needed */
 	if (sub_info->stdin) {
 		struct files_struct *f = current->files;
diff --git a/kernel/sys.c b/kernel/sys.c
index 14acc1b..e331b6f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -521,7 +521,7 @@ asmlinkage long sys_setregid(gid_t rgid, gid_t egid)
 	sec->fsgid = new_egid;
 	sec->egid = new_egid;
 	sec->gid = new_rgid;
-	key_fsgid_changed(current);
+	key_fsgid_changed(sec);
 	proc_id_connector(current, PROC_EVENT_GID);
 	return 0;
 }
@@ -557,7 +557,7 @@ asmlinkage long sys_setgid(gid_t gid)
 	else
 		return -EPERM;
 
-	key_fsgid_changed(current);
+	key_fsgid_changed(sec);
 	proc_id_connector(current, PROC_EVENT_GID);
 	return 0;
 }
@@ -646,7 +646,7 @@ asmlinkage long sys_setreuid(uid_t ruid, uid_t euid)
 		sec->suid = sec->euid;
 	sec->fsuid = sec->euid;
 
-	key_fsuid_changed(current);
+	key_fsuid_changed(sec);
 	proc_id_connector(current, PROC_EVENT_UID);
 
 	return security_task_post_setuid(old_ruid, old_euid, old_suid, LSM_SETID_RE);
@@ -694,7 +694,7 @@ asmlinkage long sys_setuid(uid_t uid)
 	sec->fsuid = sec->euid = uid;
 	sec->suid = new_suid;
 
-	key_fsuid_changed(current);
+	key_fsuid_changed(sec);
 	proc_id_connector(current, PROC_EVENT_UID);
 
 	return security_task_post_setuid(old_ruid, old_euid, old_suid, LSM_SETID_ID);
@@ -744,7 +744,7 @@ asmlinkage long sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
 	if (suid != (uid_t) -1)
 		sec->suid = suid;
 
-	key_fsuid_changed(current);
+	key_fsuid_changed(sec);
 	proc_id_connector(current, PROC_EVENT_UID);
 
 	return security_task_post_setuid(old_ruid, old_euid, old_suid, LSM_SETID_RES);
@@ -798,7 +798,7 @@ asmlinkage long sys_setresgid(gid_t rgid, gid_t egid, gid_t sgid)
 	if (sgid != (gid_t) -1)
 		sec->sgid = sgid;
 
-	key_fsgid_changed(current);
+	key_fsgid_changed(sec);
 	proc_id_connector(current, PROC_EVENT_GID);
 	return 0;
 }
@@ -841,7 +841,7 @@ asmlinkage long sys_setfsuid(uid_t uid)
 		sec->fsuid = uid;
 	}
 
-	key_fsuid_changed(current);
+	key_fsuid_changed(sec);
 	proc_id_connector(current, PROC_EVENT_UID);
 
 	security_task_post_setuid(old_fsuid, (uid_t)-1, (uid_t)-1, LSM_SETID_FS);
@@ -869,7 +869,7 @@ asmlinkage long sys_setfsgid(gid_t gid)
 			smp_wmb();
 		}
 		sec->fsgid = gid;
-		key_fsgid_changed(current);
+		key_fsgid_changed(sec);
 		proc_id_connector(current, PROC_EVENT_GID);
 	}
 	return old_fsgid;
diff --git a/kernel/user.c b/kernel/user.c
index d9a84b1..0555576 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -356,7 +356,7 @@ struct user_struct * alloc_uid(struct user_namespace *ns, uid_t uid)
 #endif
 		new->locked_shm = 0;
 
-		if (alloc_uid_keyring(new, current) < 0) {
+		if (alloc_uid_keyring(new, current->sec) < 0) {
 			kmem_cache_free(uid_cachep, new);
 			uids_mutex_unlock();
 			return NULL;
diff --git a/net/rxrpc/ar-key.c b/net/rxrpc/ar-key.c
index 9a8ff68..14979a5 100644
--- a/net/rxrpc/ar-key.c
+++ b/net/rxrpc/ar-key.c
@@ -297,7 +297,7 @@ int rxrpc_get_server_data_key(struct rxrpc_connection *conn,
 
 	_enter("");
 
-	key = key_alloc(&key_type_rxrpc, "x", 0, 0, current, 0,
+	key = key_alloc(&key_type_rxrpc, "x", 0, 0, current->act_as, 0,
 			KEY_ALLOC_NOT_IN_QUOTA);
 	if (IS_ERR(key)) {
 		_leave(" = -ENOMEM [alloc %ld]", PTR_ERR(key));
@@ -343,7 +343,7 @@ struct key *rxrpc_get_null_key(const char *keyname)
 	struct key *key;
 	int ret;
 
-	key = key_alloc(&key_type_rxrpc, keyname, 0, 0, current,
+	key = key_alloc(&key_type_rxrpc, keyname, 0, 0, current->act_as,
 			KEY_POS_SEARCH, KEY_ALLOC_NOT_IN_QUOTA);
 	if (IS_ERR(key))
 		return key;
diff --git a/security/dummy.c b/security/dummy.c
index 287f3ec..526549d 100644
--- a/security/dummy.c
+++ b/security/dummy.c
@@ -480,16 +480,21 @@ static int dummy_task_create (unsigned long clone_flags)
 	return 0;
 }
 
-static int dummy_task_alloc_security (struct task_struct *p)
+static int dummy_task_alloc_security(struct task_struct *p)
 {
 	return 0;
 }
 
-static void dummy_task_free_security (struct task_struct *p)
+static void dummy_task_free_security(struct task_security *sec)
 {
 	return;
 }
 
+static int dummy_task_dup_security(struct task_security *p)
+{
+	return 0;
+}
+
 static int dummy_task_setuid (uid_t id0, uid_t id1, uid_t id2, int flags)
 {
 	return 0;
@@ -943,7 +948,7 @@ static void dummy_release_secctx(char *secdata, u32 seclen)
 }
 
 #ifdef CONFIG_KEYS
-static inline int dummy_key_alloc(struct key *key, struct task_struct *ctx,
+static inline int dummy_key_alloc(struct key *key, struct task_security *sec,
 				  unsigned long flags)
 {
 	return 0;
@@ -954,7 +959,7 @@ static inline void dummy_key_free(struct key *key)
 }
 
 static inline int dummy_key_permission(key_ref_t key_ref,
-				       struct task_struct *context,
+				       struct task_security *sec,
 				       key_perm_t perm)
 {
 	return 0;
@@ -1057,6 +1062,7 @@ void security_fixup_ops (struct security_operations *ops)
 	set_to_dummy_if_null(ops, task_create);
 	set_to_dummy_if_null(ops, task_alloc_security);
 	set_to_dummy_if_null(ops, task_free_security);
+	set_to_dummy_if_null(ops, task_dup_security);
 	set_to_dummy_if_null(ops, task_setuid);
 	set_to_dummy_if_null(ops, task_post_setuid);
 	set_to_dummy_if_null(ops, task_setgid);
diff --git a/security/keys/internal.h b/security/keys/internal.h
index f004835..a439889 100644
--- a/security/keys/internal.h
+++ b/security/keys/internal.h
@@ -92,7 +92,7 @@ extern struct key *keyring_search_instkey(struct key *keyring,
 typedef int (*key_match_func_t)(const struct key *, const void *);
 
 extern key_ref_t keyring_search_aux(key_ref_t keyring_ref,
-				    struct task_struct *tsk,
+				    struct task_security *sec,
 				    struct key_type *type,
 				    const void *description,
 				    key_match_func_t match);
@@ -100,12 +100,12 @@ extern key_ref_t keyring_search_aux(key_ref_t keyring_ref,
 extern key_ref_t search_process_keyrings(struct key_type *type,
 					 const void *description,
 					 key_match_func_t match,
-					 struct task_struct *tsk);
+					 struct task_security *sec);
 
 extern struct key *find_keyring_by_name(const char *name, key_serial_t bound);
 
-extern int install_thread_keyring(struct task_struct *tsk);
-extern int install_process_keyring(struct task_struct *tsk);
+extern int install_thread_keyring(struct task_security *sec);
+extern int install_process_keyring(struct task_security *sec);
 
 extern struct key *request_key_and_link(struct key_type *type,
 					const char *description,
@@ -120,7 +120,7 @@ extern struct key *request_key_and_link(struct key_type *type,
  */
 struct request_key_auth {
 	struct key		*target_key;
-	struct task_struct	*context;
+	struct task_security	*sec;
 	void			*callout_info;
 	size_t			callout_len;
 	pid_t			pid;
diff --git a/security/keys/key.c b/security/keys/key.c
index 1692988..68483d7 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -243,7 +243,7 @@ serial_exists:
  *   instantiate the key or discard it before returning
  */
 struct key *key_alloc(struct key_type *type, const char *desc,
-		      uid_t uid, gid_t gid, struct task_struct *ctx,
+		      uid_t uid, gid_t gid, struct task_security *sec,
 		      key_perm_t perm, unsigned long flags)
 {
 	struct key_user *user = NULL;
@@ -314,7 +314,7 @@ struct key *key_alloc(struct key_type *type, const char *desc,
 #endif
 
 	/* let the security module know about the key */
-	ret = security_key_alloc(key, ctx, flags);
+	ret = security_key_alloc(key, sec, flags);
 	if (ret < 0)
 		goto security_error;
 
@@ -818,7 +818,7 @@ key_ref_t key_create_or_update(key_ref_t keyring_ref,
 
 	/* allocate a new key */
 	key = key_alloc(ktype, description, current_fsuid(), current_fsgid(),
-			current, perm, flags);
+			current->act_as, perm, flags);
 	if (IS_ERR(key)) {
 		key_ref = ERR_PTR(PTR_ERR(key));
 		goto error_3;
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 4051948..2900451 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -878,7 +878,7 @@ long keyctl_instantiate_key(key_serial_t id,
 	 * requesting task */
 	keyring_ref = NULL;
 	if (ringid) {
-		keyring_ref = lookup_user_key(rka->context, ringid, 1, 0,
+		keyring_ref = lookup_user_key(rka->sec, ringid, 1, 0,
 					      KEY_WRITE);
 		if (IS_ERR(keyring_ref)) {
 			ret = PTR_ERR(keyring_ref);
@@ -973,13 +973,13 @@ long keyctl_set_reqkey_keyring(int reqkey_defl)
 
 	switch (reqkey_defl) {
 	case KEY_REQKEY_DEFL_THREAD_KEYRING:
-		ret = install_thread_keyring(current);
+		ret = install_thread_keyring(sec);
 		if (ret < 0)
 			return ret;
 		goto set;
 
 	case KEY_REQKEY_DEFL_PROCESS_KEYRING:
-		ret = install_process_keyring(current);
+		ret = install_process_keyring(sec);
 		if (ret < 0)
 			return ret;
 
diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index 76b89b2..6ccd8f8 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -244,14 +244,14 @@ static long keyring_read(const struct key *keyring,
  * allocate a keyring and link into the destination keyring
  */
 struct key *keyring_alloc(const char *description, uid_t uid, gid_t gid,
-			  struct task_struct *ctx, unsigned long flags,
+			  struct task_security *sec, unsigned long flags,
 			  struct key *dest)
 {
 	struct key *keyring;
 	int ret;
 
 	keyring = key_alloc(&key_type_keyring, description,
-			    uid, gid, ctx,
+			    uid, gid, sec,
 			    (KEY_POS_ALL & ~KEY_POS_SETATTR) | KEY_USR_ALL,
 			    flags);
 
@@ -280,7 +280,7 @@ struct key *keyring_alloc(const char *description, uid_t uid, gid_t gid,
  * - we propagate the possession attribute from the keyring ref to the key ref
  */
 key_ref_t keyring_search_aux(key_ref_t keyring_ref,
-			     struct task_struct *context,
+			     struct task_security *sec,
 			     struct key_type *type,
 			     const void *description,
 			     key_match_func_t match)
@@ -303,7 +303,7 @@ key_ref_t keyring_search_aux(key_ref_t keyring_ref,
 	key_check(keyring);
 
 	/* top keyring must have search permission to begin the search */
-        err = key_task_permission(keyring_ref, context, KEY_SEARCH);
+	err = key_task_permission(keyring_ref, sec, KEY_SEARCH);
 	if (err < 0) {
 		key_ref = ERR_PTR(err);
 		goto error;
@@ -376,7 +376,7 @@ descend:
 
 		/* key must have search permissions */
 		if (key_task_permission(make_key_ref(key, possessed),
-					context, KEY_SEARCH) < 0)
+					sec, KEY_SEARCH) < 0)
 			continue;
 
 		/* we set a different error code if we pass a negative key */
@@ -403,7 +403,7 @@ ascend:
 			continue;
 
 		if (key_task_permission(make_key_ref(key, possessed),
-					context, KEY_SEARCH) < 0)
+					sec, KEY_SEARCH) < 0)
 			continue;
 
 		/* stack the current position */
@@ -458,7 +458,7 @@ key_ref_t keyring_search(key_ref_t keyring,
 	if (!type->match)
 		return ERR_PTR(-ENOKEY);
 
-	return keyring_search_aux(keyring, current,
+	return keyring_search_aux(keyring, current->sec,
 				  type, description, type->match);
 
 } /* end keyring_search() */
diff --git a/security/keys/permission.c b/security/keys/permission.c
index 07898bd..eff3e29 100644
--- a/security/keys/permission.c
+++ b/security/keys/permission.c
@@ -19,10 +19,9 @@
  * but permit the security modules to override
  */
 int key_task_permission(const key_ref_t key_ref,
-			struct task_struct *context,
+			struct task_security *sec,
 			key_perm_t perm)
 {
-	struct task_security *sec = context->act_as;
 	struct key *key;
 	key_perm_t kperm;
 	int ret;
@@ -69,7 +68,7 @@ use_these_perms:
 		return -EACCES;
 
 	/* let LSM be the final arbiter */
-	return security_key_permission(key_ref, context, perm);
+	return security_key_permission(key_ref, sec, perm);
 
 } /* end key_task_permission() */
 
diff --git a/security/keys/proc.c b/security/keys/proc.c
index 3e0d0a6..91e6df2 100644
--- a/security/keys/proc.c
+++ b/security/keys/proc.c
@@ -141,7 +141,7 @@ static int proc_keys_show(struct seq_file *m, void *v)
 
 	/* check whether the current task is allowed to view the key (assuming
 	 * non-possession) */
-	rc = key_task_permission(make_key_ref(key, 0), current, KEY_VIEW);
+	rc = key_task_permission(make_key_ref(key, 0), current->sec, KEY_VIEW);
 	if (rc < 0)
 		return 0;
 
diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
index 98854b2..f51a08c 100644
--- a/security/keys/process_keys.c
+++ b/security/keys/process_keys.c
@@ -68,7 +68,7 @@ struct key root_session_keyring = {
  * allocate the keyrings to be associated with a UID
  */
 int alloc_uid_keyring(struct user_struct *user,
-		      struct task_struct *ctx)
+		      struct task_security *sec)
 {
 	struct key *uid_keyring, *session_keyring;
 	char buf[20];
@@ -77,7 +77,7 @@ int alloc_uid_keyring(struct user_struct *user,
 	/* concoct a default session keyring */
 	sprintf(buf, "_uid_ses.%u", user->uid);
 
-	session_keyring = keyring_alloc(buf, user->uid, (gid_t) -1, ctx,
+	session_keyring = keyring_alloc(buf, user->uid, (gid_t) -1, sec,
 					KEY_ALLOC_IN_QUOTA, NULL);
 	if (IS_ERR(session_keyring)) {
 		ret = PTR_ERR(session_keyring);
@@ -88,7 +88,7 @@ int alloc_uid_keyring(struct user_struct *user,
 	 * keyring */
 	sprintf(buf, "_uid.%u", user->uid);
 
-	uid_keyring = keyring_alloc(buf, user->uid, (gid_t) -1, ctx,
+	uid_keyring = keyring_alloc(buf, user->uid, (gid_t) -1, sec,
 				    KEY_ALLOC_IN_QUOTA, session_keyring);
 	if (IS_ERR(uid_keyring)) {
 		key_put(session_keyring);
@@ -135,33 +135,29 @@ void switch_uid_keyring(struct user_struct *new_user)
 
 /*****************************************************************************/
 /*
- * install a fresh thread keyring, discarding the old one
+ * make sure a thread keyring is installed
  */
-int install_thread_keyring(struct task_struct *tsk)
+int install_thread_keyring(struct task_security *sec)
 {
-	struct key *keyring, *old;
+	struct key *keyring;
 	char buf[20];
-	int ret;
 
-	sprintf(buf, "_tid.%u", tsk->pid);
+	sprintf(buf, "_tid.%u", current->pid);
 
-	keyring = keyring_alloc(buf, tsk->sec->uid, tsk->sec->gid, tsk,
+	keyring = keyring_alloc(buf, sec->uid, sec->gid, sec,
 				KEY_ALLOC_QUOTA_OVERRUN, NULL);
-	if (IS_ERR(keyring)) {
-		ret = PTR_ERR(keyring);
-		goto error;
-	}
-
-	task_lock(tsk);
-	old = tsk->sec->thread_keyring;
-	tsk->sec->thread_keyring = keyring;
-	task_unlock(tsk);
+	if (IS_ERR(keyring))
+		return PTR_ERR(keyring);
 
-	ret = 0;
+	spin_lock(&sec->lock);
+	if (!sec->thread_keyring) {
+		sec->thread_keyring = keyring;
+		keyring = NULL;
+	}
+	spin_unlock(&sec->lock);
 
-	key_put(old);
-error:
-	return ret;
+	key_put(keyring);
+	return 0;
 
 } /* end install_thread_keyring() */
 
@@ -169,38 +165,36 @@ error:
 /*
  * make sure a process keyring is installed
  */
-int install_process_keyring(struct task_struct *tsk)
+int install_process_keyring(struct task_security *sec)
 {
+	struct thread_group_security *tgsec;
 	struct key *keyring;
 	char buf[20];
-	int ret;
 
 	might_sleep();
+	sec = current->sec;
+	tgsec = sec->tgsec;
 
-	if (!tsk->signal->process_keyring) {
-		sprintf(buf, "_pid.%u", tsk->tgid);
+	if (!tgsec->process_keyring) {
+		sprintf(buf, "_pid.%u", tgsec->tgid);
 
-		keyring = keyring_alloc(buf, tsk->sec->uid, tsk->sec->gid, tsk,
+		keyring = keyring_alloc(buf, sec->uid, sec->gid, sec,
 					KEY_ALLOC_QUOTA_OVERRUN, NULL);
-		if (IS_ERR(keyring)) {
-			ret = PTR_ERR(keyring);
-			goto error;
-		}
+		if (IS_ERR(keyring))
+			return PTR_ERR(keyring);
 
 		/* attach keyring */
-		spin_lock_irq(&tsk->sighand->siglock);
-		if (!tsk->signal->process_keyring) {
-			tsk->signal->process_keyring = keyring;
+		spin_lock(&tgsec->lock);
+		if (!tgsec->process_keyring) {
+			tgsec->process_keyring = keyring;
 			keyring = NULL;
 		}
-		spin_unlock_irq(&tsk->sighand->siglock);
+		spin_unlock(&tgsec->lock);
 
 		key_put(keyring);
 	}
 
-	ret = 0;
-error:
-	return ret;
+	return 0;
 
 } /* end install_process_keyring() */
 
@@ -209,37 +203,38 @@ error:
  * install a session keyring, discarding the old one
  * - if a keyring is not supplied, an empty one is invented
  */
-static int install_session_keyring(struct task_struct *tsk,
+static int install_session_keyring(struct task_security *sec,
 				   struct key *keyring)
 {
+	struct thread_group_security *tgsec;
 	unsigned long flags;
 	struct key *old;
 	char buf[20];
 
 	might_sleep();
+	tgsec = sec->tgsec;
 
 	/* create an empty session keyring */
 	if (!keyring) {
-		sprintf(buf, "_ses.%u", tsk->tgid);
+		sprintf(buf, "_ses.%u", tgsec->tgid);
 
 		flags = KEY_ALLOC_QUOTA_OVERRUN;
-		if (tsk->signal->session_keyring)
+		if (tgsec->session_keyring)
 			flags = KEY_ALLOC_IN_QUOTA;
 
-		keyring = keyring_alloc(buf, tsk->sec->uid, tsk->sec->gid, tsk,
+		keyring = keyring_alloc(buf, sec->uid, sec->gid, sec,
 					flags, NULL);
 		if (IS_ERR(keyring))
 			return PTR_ERR(keyring);
-	}
-	else {
+	} else {
 		atomic_inc(&keyring->usage);
 	}
 
 	/* install the keyring */
-	spin_lock_irq(&tsk->sighand->siglock);
-	old = tsk->signal->session_keyring;
-	rcu_assign_pointer(tsk->signal->session_keyring, keyring);
-	spin_unlock_irq(&tsk->sighand->siglock);
+	spin_lock_irq(&tgsec->lock);
+	old = tgsec->session_keyring;
+	rcu_assign_pointer(tgsec->session_keyring, keyring);
+	spin_unlock_irq(&tgsec->lock);
 
 	/* we're using RCU on the pointer, but there's no point synchronising
 	 * on it if it didn't previously point to anything */
@@ -252,68 +247,49 @@ static int install_session_keyring(struct task_struct *tsk,
 
 } /* end install_session_keyring() */
 
-/*****************************************************************************/
 /*
- * copy the keys in a thread group for fork without CLONE_THREAD
+ * install a session keyring for kmod
  */
-int copy_thread_group_keys(struct task_struct *tsk)
+void __install_session_keyring(struct task_struct *tsk, struct key *keyring)
 {
-	key_check(current->thread_group->session_keyring);
-	key_check(current->thread_group->process_keyring);
-
-	/* no process keyring yet */
-	tsk->signal->process_keyring = NULL;
+	struct thread_group_security *tgsec = tsk->sec->tgsec;
+	struct key *old;
 
-	/* same session keyring */
-	rcu_read_lock();
-	tsk->signal->session_keyring =
-		key_get(rcu_dereference(current->signal->session_keyring));
-	rcu_read_unlock();
+	key_get(keyring);
 
-	return 0;
+	spin_lock(&tgsec->lock);
+	old = tgsec->session_keyring;
+	rcu_assign_pointer(tgsec->session_keyring, keyring);
+	spin_unlock(&tgsec->lock);
 
-} /* end copy_thread_group_keys() */
+	/* we're using RCU on the pointer, but there's no point synchronising
+	 * on it if it didn't previously point to anything */
+	if (old) {
+		synchronize_rcu();
+		key_put(old);
+	}
+}
 
 /*****************************************************************************/
 /*
- * copy the keys for fork
+ * copy the keys in a thread group for fork without CLONE_THREAD
  */
-int copy_keys(unsigned long clone_flags, struct task_struct *tsk)
+int copy_thread_group_keys(struct thread_group_security *tgsec)
 {
-	key_check(tsk->sec->thread_keyring);
-	key_check(tsk->sec->request_key_auth);
+	key_check(tgsec->session_keyring);
+	key_check(tgsec->process_keyring);
 
-	/* no thread keyring yet */
-	tsk->sec->thread_keyring = NULL;
+	/* no process keyring yet */
+	tgsec->process_keyring = NULL;
 
-	/* copy the request_key() authorisation for this thread */
-	key_get(tsk->sec->request_key_auth);
+	/* same session keyring */
+	rcu_read_lock();
+	tgsec->session_keyring =
+		key_get(rcu_dereference(current->sec->tgsec->session_keyring));
+	rcu_read_unlock();
 
 	return 0;
-
-} /* end copy_keys() */
-
-/*****************************************************************************/
-/*
- * dispose of thread group keys upon thread group destruction
- */
-void exit_thread_group_keys(struct signal_struct *tg)
-{
-	key_put(tg->session_keyring);
-	key_put(tg->process_keyring);
-
-} /* end exit_thread_group_keys() */
-
-/*****************************************************************************/
-/*
- * dispose of per-thread keys upon thread exit
- */
-void exit_keys(struct task_struct *tsk)
-{
-	key_put(tsk->sec->thread_keyring);
-	key_put(tsk->sec->request_key_auth);
-
-} /* end exit_keys() */
+}
 
 /*****************************************************************************/
 /*
@@ -321,21 +297,23 @@ void exit_keys(struct task_struct *tsk)
  */
 int exec_keys(struct task_struct *tsk)
 {
+	struct thread_group_security *tgsec = tsk->sec->tgsec;
+	struct task_security *sec = tsk->sec;
 	struct key *old;
 
 	/* newly exec'd tasks don't get a thread keyring */
-	task_lock(tsk);
-	old = tsk->sec->thread_keyring;
-	tsk->sec->thread_keyring = NULL;
-	task_unlock(tsk);
+	spin_lock(&sec->lock);
+	old = sec->thread_keyring;
+	sec->thread_keyring = NULL;
+	spin_unlock(&sec->lock);
 
 	key_put(old);
 
 	/* discard the process keyring from a newly exec'd task */
-	spin_lock_irq(&tsk->sighand->siglock);
-	old = tsk->signal->process_keyring;
-	tsk->signal->process_keyring = NULL;
-	spin_unlock_irq(&tsk->sighand->siglock);
+	spin_lock(&tgsec->lock);
+	old = tgsec->process_keyring;
+	tgsec->process_keyring = NULL;
+	spin_unlock(&tgsec->lock);
 
 	key_put(old);
 
@@ -358,14 +336,13 @@ int suid_keys(struct task_struct *tsk)
 /*
  * the filesystem user ID changed
  */
-void key_fsuid_changed(struct task_struct *tsk)
+void key_fsuid_changed(struct task_security *sec)
 {
 	/* update the ownership of the thread keyring */
-	BUG_ON(!tsk->sec);
-	if (tsk->sec->thread_keyring) {
-		down_write(&tsk->sec->thread_keyring->sem);
-		tsk->sec->thread_keyring->uid = tsk->sec->fsuid;
-		up_write(&tsk->sec->thread_keyring->sem);
+	if (sec->thread_keyring) {
+		down_write(&sec->thread_keyring->sem);
+		sec->thread_keyring->uid = sec->fsuid;
+		up_write(&sec->thread_keyring->sem);
 	}
 
 } /* end key_fsuid_changed() */
@@ -374,14 +351,13 @@ void key_fsuid_changed(struct task_struct *tsk)
 /*
  * the filesystem group ID changed
  */
-void key_fsgid_changed(struct task_struct *tsk)
+void key_fsgid_changed(struct task_security *sec)
 {
 	/* update the ownership of the thread keyring */
-	BUG_ON(!tsk->sec);
-	if (tsk->sec->thread_keyring) {
-		down_write(&tsk->sec->thread_keyring->sem);
-		tsk->sec->thread_keyring->gid = tsk->sec->fsgid;
-		up_write(&tsk->sec->thread_keyring->sem);
+	if (sec->thread_keyring) {
+		down_write(&sec->thread_keyring->sem);
+		sec->thread_keyring->gid = sec->fsgid;
+		up_write(&sec->thread_keyring->sem);
 	}
 
 } /* end key_fsgid_changed() */
@@ -397,7 +373,7 @@ void key_fsgid_changed(struct task_struct *tsk)
 key_ref_t search_process_keyrings(struct key_type *type,
 				  const void *description,
 				  key_match_func_t match,
-				  struct task_struct *context)
+				  struct task_security *sec)
 {
 	struct request_key_auth *rka;
 	key_ref_t key_ref, ret, err;
@@ -416,10 +392,10 @@ key_ref_t search_process_keyrings(struct key_type *type,
 	err = ERR_PTR(-EAGAIN);
 
 	/* search the thread keyring first */
-	if (context->sec->thread_keyring) {
+	if (sec->thread_keyring) {
 		key_ref = keyring_search_aux(
-			make_key_ref(context->sec->thread_keyring, 1),
-			context, type, description, match);
+			make_key_ref(sec->thread_keyring, 1),
+			sec, type, description, match);
 		if (!IS_ERR(key_ref))
 			goto found;
 
@@ -437,10 +413,10 @@ key_ref_t search_process_keyrings(struct key_type *type,
 	}
 
 	/* search the process keyring second */
-	if (context->signal->process_keyring) {
+	if (sec->tgsec->process_keyring) {
 		key_ref = keyring_search_aux(
-			make_key_ref(context->signal->process_keyring, 1),
-			context, type, description, match);
+			make_key_ref(sec->tgsec->process_keyring, 1),
+			sec, type, description, match);
 		if (!IS_ERR(key_ref))
 			goto found;
 
@@ -458,13 +434,13 @@ key_ref_t search_process_keyrings(struct key_type *type,
 	}
 
 	/* search the session keyring */
-	if (context->signal->session_keyring) {
+	if (sec->tgsec->session_keyring) {
 		rcu_read_lock();
 		key_ref = keyring_search_aux(
 			make_key_ref(rcu_dereference(
-					     context->signal->session_keyring),
+					     sec->tgsec->session_keyring),
 				     1),
-			context, type, description, match);
+			sec, type, description, match);
 		rcu_read_unlock();
 
 		if (!IS_ERR(key_ref))
@@ -485,8 +461,8 @@ key_ref_t search_process_keyrings(struct key_type *type,
 	/* or search the user-session keyring */
 	else {
 		key_ref = keyring_search_aux(
-			make_key_ref(context->sec->user->session_keyring, 1),
-			context, type, description, match);
+			make_key_ref(sec->user->session_keyring, 1),
+			sec, type, description, match);
 		if (!IS_ERR(key_ref))
 			goto found;
 
@@ -507,20 +483,20 @@ key_ref_t search_process_keyrings(struct key_type *type,
 	 * search the keyrings of the process mentioned there
 	 * - we don't permit access to request_key auth keys via this method
 	 */
-	if (context->sec->request_key_auth &&
-	    context == current &&
+	if (sec->request_key_auth &&
+	    sec == current->sec &&
 	    type != &key_type_request_key_auth
 	    ) {
 		/* defend against the auth key being revoked */
-		down_read(&context->sec->request_key_auth->sem);
+		down_read(&sec->request_key_auth->sem);
 
-		if (key_validate(context->sec->request_key_auth) == 0) {
-			rka = context->sec->request_key_auth->payload.data;
+		if (key_validate(sec->request_key_auth) == 0) {
+			rka = sec->request_key_auth->payload.data;
 
 			key_ref = search_process_keyrings(type, description,
-							  match, rka->context);
+							  match, rka->sec);
 
-			up_read(&context->sec->request_key_auth->sem);
+			up_read(&sec->request_key_auth->sem);
 
 			if (!IS_ERR(key_ref))
 				goto found;
@@ -537,7 +513,7 @@ key_ref_t search_process_keyrings(struct key_type *type,
 				break;
 			}
 		} else {
-			up_read(&context->sec->request_key_auth->sem);
+			up_read(&sec->request_key_auth->sem);
 		}
 	}
 
@@ -565,78 +541,78 @@ static int lookup_user_key_possessed(const struct key *key, const void *target)
  * - don't create special keyrings unless so requested
  * - partially constructed keys aren't found unless requested
  */
-key_ref_t lookup_user_key(struct task_struct *context, key_serial_t id,
+key_ref_t lookup_user_key(struct task_security *sec, key_serial_t id,
 			  int create, int partial, key_perm_t perm)
 {
 	key_ref_t key_ref, skey_ref;
 	struct key *key;
 	int ret;
 
-	if (!context)
-		context = current;
+	if (!sec)
+		sec = current->act_as;
 
 	key_ref = ERR_PTR(-ENOKEY);
 
 	switch (id) {
 	case KEY_SPEC_THREAD_KEYRING:
-		if (!context->sec->thread_keyring) {
+		if (!sec->thread_keyring) {
 			if (!create)
 				goto error;
 
-			ret = install_thread_keyring(context);
+			ret = install_thread_keyring(sec);
 			if (ret < 0) {
 				key = ERR_PTR(ret);
 				goto error;
 			}
 		}
 
-		key = context->sec->thread_keyring;
+		key = sec->thread_keyring;
 		atomic_inc(&key->usage);
 		key_ref = make_key_ref(key, 1);
 		break;
 
 	case KEY_SPEC_PROCESS_KEYRING:
-		if (!context->signal->process_keyring) {
+		if (!sec->tgsec->process_keyring) {
 			if (!create)
 				goto error;
 
-			ret = install_process_keyring(context);
+			ret = install_process_keyring(sec);
 			if (ret < 0) {
 				key = ERR_PTR(ret);
 				goto error;
 			}
 		}
 
-		key = context->signal->process_keyring;
+		key = sec->tgsec->process_keyring;
 		atomic_inc(&key->usage);
 		key_ref = make_key_ref(key, 1);
 		break;
 
 	case KEY_SPEC_SESSION_KEYRING:
-		if (!context->signal->session_keyring) {
+		if (!sec->tgsec->session_keyring) {
 			/* always install a session keyring upon access if one
 			 * doesn't exist yet */
 			ret = install_session_keyring(
-				context, context->sec->user->session_keyring);
+				sec, sec->user->session_keyring);
 			if (ret < 0)
 				goto error;
 		}
 
 		rcu_read_lock();
-		key = rcu_dereference(context->signal->session_keyring);
+		key = rcu_dereference(sec->tgsec->session_keyring);
 		atomic_inc(&key->usage);
 		rcu_read_unlock();
 		key_ref = make_key_ref(key, 1);
 		break;
 
 	case KEY_SPEC_USER_KEYRING:
-		key = context->sec->user->uid_keyring;
+		key = sec->user->uid_keyring;
 		atomic_inc(&key->usage);
 		key_ref = make_key_ref(key, 1);
 		break;
 
 	case KEY_SPEC_USER_SESSION_KEYRING:
-		key = context->sec->user->session_keyring;
+		key = sec->user->session_keyring;
 		atomic_inc(&key->usage);
 		key_ref = make_key_ref(key, 1);
 		break;
@@ -647,7 +623,7 @@ key_ref_t lookup_user_key(struct task_struct *context, key_serial_t id,
 		goto error;
 
 	case KEY_SPEC_REQKEY_AUTH_KEY:
-		key = context->sec->request_key_auth;
+		key = sec->request_key_auth;
 		if (!key)
 			goto error;
 
@@ -671,7 +647,7 @@ key_ref_t lookup_user_key(struct task_struct *context, key_serial_t id,
 		/* check to see if we possess the key */
 		skey_ref = search_process_keyrings(key->type, key,
 						   lookup_user_key_possessed,
-						   current);
+						   sec);
 
 		if (!IS_ERR(skey_ref)) {
 			key_put(key);
@@ -703,7 +679,7 @@ key_ref_t lookup_user_key(struct task_struct *context, key_serial_t id,
 		goto invalid_key;
 
 	/* check the permissions */
-	ret = key_task_permission(key_ref, context, perm);
+	ret = key_task_permission(key_ref, sec, perm);
 	if (ret < 0)
 		goto invalid_key;
 
@@ -726,18 +702,18 @@ invalid_key:
  */
 long join_session_keyring(const char *name)
 {
-	struct task_struct *tsk = current;
+	struct task_security *sec = current->sec;
 	struct key *keyring;
 	long ret;
 
 	/* if no name is provided, install an anonymous keyring */
 	if (!name) {
-		ret = install_session_keyring(tsk, NULL);
+		ret = install_session_keyring(sec, NULL);
 		if (ret < 0)
 			goto error;
 
 		rcu_read_lock();
-		ret = rcu_dereference(tsk->signal->session_keyring)->serial;
+		ret = rcu_dereference(sec->tgsec->session_keyring)->serial;
 		rcu_read_unlock();
 		goto error;
 	}
@@ -749,7 +725,7 @@ long join_session_keyring(const char *name)
 	keyring = find_keyring_by_name(name, 0);
 	if (PTR_ERR(keyring) == -ENOKEY) {
 		/* not found - try and create a new one */
-		keyring = keyring_alloc(name, tsk->sec->uid, tsk->sec->gid, tsk,
+		keyring = keyring_alloc(name, sec->uid, sec->gid, sec,
 					KEY_ALLOC_IN_QUOTA, NULL);
 		if (IS_ERR(keyring)) {
 			ret = PTR_ERR(keyring);
@@ -762,7 +738,7 @@ long join_session_keyring(const char *name)
 	}
 
 	/* we've got a keyring - now to install it */
-	ret = install_session_keyring(tsk, keyring);
+	ret = install_session_keyring(sec, keyring);
 	if (ret < 0)
 		goto error2;
 
diff --git a/security/keys/request_key.c b/security/keys/request_key.c
index 54b03bb..be88aeb 100644
--- a/security/keys/request_key.c
+++ b/security/keys/request_key.c
@@ -63,7 +63,7 @@ static int call_sbin_request_key(struct key_construction *cons,
 				 const char *op,
 				 void *aux)
 {
-	struct task_struct *tsk = current;
+	struct task_security *sec = current->act_as;
 	key_serial_t prkey, sskey;
 	struct key *key = cons->key, *authkey = cons->authkey, *keyring;
 	char *argv[9], *envp[3], uid_str[12], gid_str[12];
@@ -76,7 +76,7 @@ static int call_sbin_request_key(struct key_construction *cons,
 	/* allocate a new session keyring */
 	sprintf(desc, "_req.%u", key->serial);
 
-	keyring = keyring_alloc(desc, current_fsuid(), current_fsgid(), current,
+	keyring = keyring_alloc(desc, sec->fsuid, sec->fsgid, sec,
 				KEY_ALLOC_QUOTA_OVERRUN, NULL);
 	if (IS_ERR(keyring)) {
 		ret = PTR_ERR(keyring);
@@ -89,29 +89,27 @@ static int call_sbin_request_key(struct key_construction *cons,
 		goto error_link;
 
 	/* record the UID and GID */
-	sprintf(uid_str, "%d", current_fsuid());
-	sprintf(gid_str, "%d", current_fsgid());
+	sprintf(uid_str, "%d", sec->fsuid);
+	sprintf(gid_str, "%d", sec->fsgid);
 
 	/* we say which key is under construction */
 	sprintf(key_str, "%d", key->serial);
 
 	/* we specify the process's default keyrings */
-	sprintf(keyring_str[0], "%d",
-		tsk->act_as->thread_keyring ?
-		tsk->act_as->thread_keyring->serial : 0);
+	sprintf(keyring_str[0], "%d", key_serial(sec->thread_keyring));
 
 	prkey = 0;
-	if (tsk->signal->process_keyring)
-		prkey = tsk->signal->process_keyring->serial;
+	if (sec->tgsec->process_keyring)
+		prkey = sec->tgsec->process_keyring->serial;
 
 	sprintf(keyring_str[1], "%d", prkey);
 
-	if (tsk->signal->session_keyring) {
+	if (sec->tgsec->session_keyring) {
 		rcu_read_lock();
-		sskey = rcu_dereference(tsk->signal->session_keyring)->serial;
+		sskey = rcu_dereference(sec->tgsec->session_keyring)->serial;
 		rcu_read_unlock();
 	} else {
-		sskey = tsk->act_as->user->session_keyring->serial;
+		sskey = sec->user->session_keyring->serial;
 	}
 
 	sprintf(keyring_str[2], "%d", sskey);
@@ -210,29 +208,29 @@ static int construct_key(struct key *key, const void *callout_info,
  */
 static void construct_key_make_link(struct key *key, struct key *dest_keyring)
 {
-	struct task_struct *tsk = current;
+	struct task_security *sec = current->sec;
 	struct key *drop = NULL;
 
 	kenter("{%d},%p", key->serial, dest_keyring);
 
 	/* find the appropriate keyring */
 	if (!dest_keyring) {
-		switch (tsk->act_as->jit_keyring) {
+		switch (sec->jit_keyring) {
 		case KEY_REQKEY_DEFL_DEFAULT:
 		case KEY_REQKEY_DEFL_THREAD_KEYRING:
-			dest_keyring = tsk->act_as->thread_keyring;
+			dest_keyring = sec->thread_keyring;
 			if (dest_keyring)
 				break;
 
 		case KEY_REQKEY_DEFL_PROCESS_KEYRING:
-			dest_keyring = tsk->signal->process_keyring;
+			dest_keyring = sec->tgsec->process_keyring;
 			if (dest_keyring)
 				break;
 
 		case KEY_REQKEY_DEFL_SESSION_KEYRING:
 			rcu_read_lock();
 			dest_keyring = key_get(
-				rcu_dereference(tsk->signal->session_keyring));
+				rcu_dereference(sec->tgsec->session_keyring));
 			rcu_read_unlock();
 			drop = dest_keyring;
 
@@ -240,11 +238,11 @@ static void construct_key_make_link(struct key *key, struct key *dest_keyring)
 				break;
 
 		case KEY_REQKEY_DEFL_USER_SESSION_KEYRING:
-			dest_keyring = tsk->act_as->user->session_keyring;
+			dest_keyring = sec->user->session_keyring;
 			break;
 
 		case KEY_REQKEY_DEFL_USER_KEYRING:
-			dest_keyring = tsk->act_as->user->uid_keyring;
+			dest_keyring = sec->user->uid_keyring;
 			break;
 
 		case KEY_REQKEY_DEFL_GROUP_KEYRING:
@@ -268,6 +266,7 @@ static int construct_alloc_key(struct key_type *type,
 			       const char *description,
 			       struct key *dest_keyring,
 			       unsigned long flags,
+			       struct task_security *sec,
 			       struct key_user *user,
 			       struct key **_key)
 {
@@ -278,9 +277,8 @@ static int construct_alloc_key(struct key_type *type,
 
 	mutex_lock(&user->cons_lock);
 
-	key = key_alloc(type, description,
-			current_fsuid(), current_fsgid(), current, KEY_POS_ALL,
-			flags);
+	key = key_alloc(type, description, sec->fsuid, sec->fsgid, sec,
+			KEY_POS_ALL, flags);
 	if (IS_ERR(key))
 		goto alloc_failed;
 
@@ -294,8 +292,7 @@ static int construct_alloc_key(struct key_type *type,
 	 * waited for locks */
 	mutex_lock(&key_construction_mutex);
 
-	key_ref = search_process_keyrings(type, description, type->match,
-					  current);
+	key_ref = search_process_keyrings(type, description, type->match, sec);
 	if (!IS_ERR(key_ref))
 		goto key_already_present;
 
@@ -336,18 +333,19 @@ static struct key *construct_key_and_link(struct key_type *type,
 					  size_t callout_len,
 					  void *aux,
 					  struct key *dest_keyring,
+					  struct task_security *sec,
 					  unsigned long flags)
 {
 	struct key_user *user;
 	struct key *key;
 	int ret;
 
-	user = key_user_lookup(current_fsuid());
+	user = key_user_lookup(sec->fsuid);
 	if (!user)
 		return ERR_PTR(-ENOMEM);
 
-	ret = construct_alloc_key(type, description, dest_keyring, flags, user,
-				  &key);
+	ret = construct_alloc_key(type, description, dest_keyring, flags, sec,
+				  user, &key);
 	key_user_put(user);
 
 	if (ret == 0) {
@@ -379,6 +377,7 @@ struct key *request_key_and_link(struct key_type *type,
 				 struct key *dest_keyring,
 				 unsigned long flags)
 {
+	struct task_security *sec = current->sec;
 	struct key *key;
 	key_ref_t key_ref;
 
@@ -387,9 +386,7 @@ struct key *request_key_and_link(struct key_type *type,
 	       dest_keyring, flags);
 
 	/* search all the process keyrings for a key */
-	key_ref = search_process_keyrings(type, description, type->match,
-					  current);
-
+	key_ref = search_process_keyrings(type, description, type->match, sec);
 	if (!IS_ERR(key_ref)) {
 		key = key_ref_to_ptr(key_ref);
 	} else if (PTR_ERR(key_ref) != -EAGAIN) {
@@ -403,7 +400,7 @@ struct key *request_key_and_link(struct key_type *type,
 
 		key = construct_key_and_link(type, description, callout_info,
 					     callout_len, aux, dest_keyring,
-					     flags);
+					     sec, flags);
 	}
 
 error:
diff --git a/security/keys/request_key_auth.c b/security/keys/request_key_auth.c
index 9598670..e3c09a4 100644
--- a/security/keys/request_key_auth.c
+++ b/security/keys/request_key_auth.c
@@ -104,10 +104,8 @@ static void request_key_auth_revoke(struct key *key)
 
 	kenter("{%d}", key->serial);
 
-	if (rka->context) {
-		put_task_struct(rka->context);
-		rka->context = NULL;
-	}
+	put_task_security(rka->sec);
+	rka->sec = NULL;
 
 } /* end request_key_auth_revoke() */
 
@@ -121,11 +119,7 @@ static void request_key_auth_destroy(struct key *key)
 
 	kenter("{%d}", key->serial);
 
-	if (rka->context) {
-		put_task_struct(rka->context);
-		rka->context = NULL;
-	}
-
+	put_task_security(rka->sec);
 	key_put(rka->target_key);
 	kfree(rka->callout_info);
 	kfree(rka);
@@ -141,6 +135,7 @@ struct key *request_key_auth_new(struct key *target, const void *callout_info,
 				 size_t callout_len)
 {
 	struct request_key_auth *rka, *irka;
+	struct task_security *sec = current->sec;
 	struct key *authkey = NULL;
 	char desc[20];
 	int ret;
@@ -162,28 +157,25 @@ struct key *request_key_auth_new(struct key *target, const void *callout_info,
 
 	/* see if the calling process is already servicing the key request of
 	 * another process */
-	if (current->act_as->request_key_auth) {
+	if (sec->request_key_auth) {
 		/* it is - use that instantiation context here too */
-		down_read(&current->act_as->request_key_auth->sem);
+		down_read(&sec->request_key_auth->sem);
 
 		/* if the auth key has been revoked, then the key we're
 		 * servicing is already instantiated */
-		if (test_bit(KEY_FLAG_REVOKED,
-			     &current->act_as->request_key_auth->flags))
+		if (test_bit(KEY_FLAG_REVOKED, &sec->request_key_auth->flags))
 			goto auth_key_revoked;
 
-		irka = current->act_as->request_key_auth->payload.data;
-		rka->context = irka->context;
+		irka = sec->request_key_auth->payload.data;
+		rka->sec = irka->sec;
 		rka->pid = irka->pid;
-		get_task_struct(rka->context);
+		get_task_security(rka->sec);
 
-		up_read(&current->act_as->request_key_auth->sem);
-	}
-	else {
+		up_read(&sec->request_key_auth->sem);
+	} else {
 		/* it isn't - use this process as the context */
-		rka->context = current;
+		rka->sec = get_task_security(sec);
 		rka->pid = current->pid;
-		get_task_struct(rka->context);
 	}
 
 	rka->target_key = key_get(target);
@@ -194,7 +186,7 @@ struct key *request_key_auth_new(struct key *target, const void *callout_info,
 	sprintf(desc, "%x", target->serial);
 
 	authkey = key_alloc(&key_type_request_key_auth, desc,
-			    current_fsuid(), current_fsgid(), current,
+			    sec->fsuid, sec->fsgid, sec,
 			    KEY_POS_VIEW | KEY_POS_READ | KEY_POS_SEARCH |
 			    KEY_USR_VIEW, KEY_ALLOC_NOT_IN_QUOTA);
 	if (IS_ERR(authkey)) {
@@ -260,7 +252,7 @@ struct key *key_get_instantiation_authkey(key_serial_t target_id)
 		&key_type_request_key_auth,
 		(void *) (unsigned long) target_id,
 		key_get_instantiation_authkey_match,
-		current);
+		current->act_as);
 
 	if (IS_ERR(authkey_ref)) {
 		authkey = ERR_PTR(PTR_ERR(authkey_ref));
diff --git a/security/security.c b/security/security.c
index 16213e3..92d66d6 100644
--- a/security/security.c
+++ b/security/security.c
@@ -573,9 +573,14 @@ int security_task_alloc(struct task_struct *p)
 	return security_ops->task_alloc_security(p);
 }
 
-void security_task_free(struct task_struct *p)
+void security_task_free(struct task_security *sec)
 {
-	security_ops->task_free_security(p);
+	security_ops->task_free_security(sec);
+}
+
+int security_task_dup(struct task_security *sec)
+{
+	return security_ops->task_dup_security(sec);
 }
 
 int security_task_setuid(uid_t id0, uid_t id1, uid_t id2, int flags)
@@ -1063,9 +1068,10 @@ EXPORT_SYMBOL(security_skb_classify_flow);
 
 #ifdef CONFIG_KEYS
 
-int security_key_alloc(struct key *key, struct task_struct *tsk, unsigned long flags)
+int security_key_alloc(struct key *key, struct task_security *sec,
+		       unsigned long flags)
 {
-	return security_ops->key_alloc(key, tsk, flags);
+	return security_ops->key_alloc(key, sec, flags);
 }
 
 void security_key_free(struct key *key)
@@ -1073,10 +1079,10 @@ void security_key_free(struct key *key)
 	security_ops->key_free(key);
 }
 
-int security_key_permission(key_ref_t key_ref,
-			    struct task_struct *context, key_perm_t perm)
+int security_key_permission(key_ref_t key_ref, struct task_security *sec,
+			    key_perm_t perm)
 {
-	return security_ops->key_permission(key_ref, context, perm);
+	return security_ops->key_permission(key_ref, sec, perm);
 }
 
 int security_key_getsecurity(struct key *key, char **_buffer)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index e56529f..20a6b55 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -170,10 +170,10 @@ static int task_alloc_security(struct task_struct *task)
 	return 0;
 }
 
-static void task_free_security(struct task_struct *task)
+static void task_free_security(struct task_security *sec)
 {
-	struct task_security_struct *tsec = task->sec->security;
-	task->sec->security = NULL;
+	struct task_security_struct *tsec = sec->security;
+	sec->security = NULL;
 	kfree(tsec);
 }
 
@@ -2818,9 +2818,32 @@ static int selinux_task_alloc_security(struct task_struct *tsk)
 	return 0;
 }
 
-static void selinux_task_free_security(struct task_struct *tsk)
+static void selinux_task_free_security(struct task_security *sec)
 {
-	task_free_security(tsk);
+	task_free_security(sec);
+}
+
+static int selinux_task_dup_security(struct task_security *sec)
+{
+	struct task_security_struct *tsec1, *tsec2;
+
+	tsec1 = sec->security;
+
+	tsec2 = kmemdup(tsec1, sizeof(*tsec1), GFP_KERNEL);
+	if (!tsec2)
+		return -ENOMEM;
+
+	tsec2->osid = tsec1->osid;
+	tsec2->sid = tsec1->sid;
+
+	tsec2->exec_sid = tsec1->exec_sid;
+	tsec2->create_sid = tsec1->create_sid;
+	tsec2->keycreate_sid = tsec1->keycreate_sid;
+	tsec2->sockcreate_sid = tsec1->sockcreate_sid;
+	tsec2->ptrace_sid = SECINITSID_UNLABELED;
+	sec->security = tsec2;
+
+	return 0;
 }
 
 static int selinux_task_setuid(uid_t id0, uid_t id1, uid_t id2, int flags)
@@ -4718,10 +4741,10 @@ static void selinux_release_secctx(char *secdata, u32 seclen)
 
 #ifdef CONFIG_KEYS
 
-static int selinux_key_alloc(struct key *k, struct task_struct *tsk,
+static int selinux_key_alloc(struct key *k, struct task_security *context,
 			     unsigned long flags)
 {
-	struct task_security_struct *tsec = tsk->sec->security;
+	struct task_security_struct *tsec = context->security;
 	struct key_security_struct *ksec;
 
 	ksec = kzalloc(sizeof(struct key_security_struct), GFP_KERNEL);
@@ -4747,7 +4770,7 @@ static void selinux_key_free(struct key *k)
 }
 
 static int selinux_key_permission(key_ref_t key_ref,
-			    struct task_struct *ctx,
+			    struct task_security *context,
 			    key_perm_t perm)
 {
 	struct key *key;
@@ -4756,7 +4779,7 @@ static int selinux_key_permission(key_ref_t key_ref,
 
 	key = key_ref_to_ptr(key_ref);
 
-	tsec = ctx->sec->security;
+	tsec = context->security;
 	ksec = key->security;
 
 	/* if no specific permissions are requested, we skip the
@@ -4860,6 +4883,7 @@ static struct security_operations selinux_ops = {
 	.task_create =			selinux_task_create,
 	.task_alloc_security =		selinux_task_alloc_security,
 	.task_free_security =		selinux_task_free_security,
+	.task_dup_security =		selinux_task_dup_security,
 	.task_setuid =			selinux_task_setuid,
 	.task_post_setuid =		selinux_task_post_setuid,
 	.task_setgid =			selinux_task_setgid,
@@ -5001,9 +5025,9 @@ static __init int selinux_init(void)
 
 #ifdef CONFIG_KEYS
 	/* Add security information to initial keyrings */
-	selinux_key_alloc(&root_user_keyring, current,
+	selinux_key_alloc(&root_user_keyring, current->sec,
 			  KEY_ALLOC_NOT_IN_QUOTA);
-	selinux_key_alloc(&root_session_keyring, current,
+	selinux_key_alloc(&root_session_keyring, current->sec,
 			  KEY_ALLOC_NOT_IN_QUOTA);
 #endif
 


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (6 preceding siblings ...)
  2007-12-05 19:38 ` [PATCH 07/28] SECURITY: De-embed task security record from task and use refcounting " David Howells
@ 2007-12-05 19:38 ` David Howells
  2007-12-10 16:46   ` Stephen Smalley
  2007-12-10 17:07   ` David Howells
  2007-12-05 19:39 ` [PATCH 09/28] FS-Cache: Release page->private after failed readahead " David Howells
                   ` (19 subsequent siblings)
  27 siblings, 2 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:38 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Allow kernel services to override LSM settings appropriate to the actions
performed by a task by duplicating a security record, modifying it and then
using task_struct::act_as to point to it when performing operations on behalf
of a task.

This is used, for example, by CacheFiles which has to transparently access the
cache on behalf of a process that thinks it is doing, say, NFS accesses with a
potentially inappropriate (with respect to accessing the cache) set of
security data.

This patch provides two LSM hooks for modifying a task security record:

 (*) security_kernel_act_as() which allows modification of the security datum
     with which a task acts on other objects (most notably files).

 (*) security_create_files_as() which allows modification of the security
     datum that is used to initialise the security data on a file that a task
     creates.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/cred.h     |   22 ++++++++++++
 include/linux/security.h |   35 +++++++++++++++++++
 kernel/cred.c            |   86 ++++++++++++++++++++++++++++++++++++++++++++++
 security/dummy.c         |   15 ++++++++
 security/security.c      |   13 +++++++
 security/selinux/hooks.c |   45 ++++++++++++++++++++++++
 6 files changed, 216 insertions(+), 0 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
new file mode 100644
index 0000000..c9f8906
--- /dev/null
+++ b/include/linux/cred.h
@@ -0,0 +1,22 @@
+/* Credential management
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_CRED_H
+#define _LINUX_CRED_H
+
+struct task_security;
+struct inode;
+
+extern struct task_security *get_kernel_security(const char *,
+						 struct task_struct *);
+extern int change_create_files_as(struct task_security *, struct inode *);
+
+#endif /* _LINUX_CRED_H */
diff --git a/include/linux/security.h b/include/linux/security.h
index b7ba073..f0dce11 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -557,6 +557,18 @@ struct request_sock;
  *	Duplicate and attach the security structure currently attached to the
  *	p->security field.
  *	Return 0 if operation was successful.
+ * @task_kernel_act_as:
+ *	Set the credentials for a kernel service to act as (subjective context).
+ *	@sec points to the task security record to be modified.
+ *	@service names the service making the request.
+ *	@daemon: A userspace daemon to be used as a reference.
+ *	Return 0 if successful.
+ * @task_create_files_as:
+ *	Set the file creation context in a task security record to be the same
+ *	as the objective context of the specified inode.
+ *	@sec points to the task security record to be modified.
+ *	@inode points to the inode to use as a reference.
+ *	Return 0 if successful.
  * @task_setuid:
  *	Check permission before setting one or more of the user identity
  *	attributes of the current process.  The @flags parameter indicates
@@ -1321,6 +1333,11 @@ struct security_operations {
 	int (*task_alloc_security) (struct task_struct *p);
 	void (*task_free_security) (struct task_security *p);
 	int (*task_dup_security) (struct task_security *p);
+	int (*task_kernel_act_as)(struct task_security *sec,
+				  const char *service,
+				  struct task_struct *daemon);
+	int (*task_create_files_as)(struct task_security *sec,
+				    struct inode *inode);
 	int (*task_setuid) (uid_t id0, uid_t id1, uid_t id2, int flags);
 	int (*task_post_setuid) (uid_t old_ruid /* or fsuid */ ,
 				 uid_t old_euid, uid_t old_suid, int flags);
@@ -1571,6 +1588,11 @@ int security_task_create(unsigned long clone_flags);
 int security_task_alloc(struct task_struct *p);
 void security_task_free(struct task_security *p);
 int security_task_dup(struct task_security *p);
+int security_task_kernel_act_as(struct task_security *sec,
+				const char *service,
+				struct task_struct *daemon);
+int security_task_create_files_as(struct task_security *sec,
+				  struct inode *inode);
 int security_task_setuid(uid_t id0, uid_t id1, uid_t id2, int flags);
 int security_task_post_setuid(uid_t old_ruid, uid_t old_euid,
 			       uid_t old_suid, int flags);
@@ -2054,6 +2076,19 @@ static inline int security_task_dup(struct task_security *p)
 	return 0;
 }
 
+static inline int security_task_kernel_act_as(struct task_security *sec,
+					      const char *service,
+					      struct task_struct *daemon)
+{
+	return 0;
+}
+
+static inline int security_task_create_files_as(struct task_security *sec,
+						struct inode *inode)
+{
+	return 0;
+}
+
 static inline int security_task_setuid (uid_t id0, uid_t id1, uid_t id2,
 					int flags)
 {
diff --git a/kernel/cred.c b/kernel/cred.c
index ddf98c6..eac1d0c 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -11,6 +11,7 @@
 #include <linux/module.h>
 #include <linux/sched.h>
 #include <linux/key.h>
+#include <linux/keyctl.h>
 #include <linux/init_task.h>
 #include <linux/security.h>
 
@@ -137,3 +138,88 @@ void put_task_security(struct task_security *sec)
 	}
 }
 EXPORT_SYMBOL(put_task_security);
+
+/**
+ * get_kernel_security - Get a task security record for a named kernel service
+ * @service: The name of the service
+ * @daemon: A userspace daemon to be used as a reference
+ *
+ * Get a task security record for a specific kernel service.  This can then be
+ * used to override a task's own security so that work can be done on behalf of
+ * that task that requires a different security context.
+ *
+ * @daemon is used to provide a base for the security record, but can be NULL.
+ * If @daemon is supplied, then the security data will be derived from that;
+ * otherwise they'll be set to 0 and no groups, full capabilities and no keys.
+ *
+ * @daemon is also passd to the LSM module as a base from which to initialise
+ * any MAC controls.
+ *
+ * The caller may change these controls afterwards if desired.
+ */
+struct task_security *get_kernel_security(const char *service,
+					  struct task_struct *daemon)
+{
+	const struct task_security *dsec;
+	struct task_security *sec;
+	int ret;
+
+	sec = kzalloc(sizeof *sec, GFP_KERNEL);
+	if (!sec)
+		return ERR_PTR(-ENOMEM);
+
+	if (daemon) {
+		rcu_read_lock();
+		dsec = rcu_dereference(daemon->sec);
+		*sec = *dsec;
+		get_group_info(sec->group_info);
+		get_uid(sec->user);
+		rcu_read_unlock();
+#ifdef CONFIG_KEYS
+		sec->request_key_auth = NULL;
+		sec->thread_keyring = NULL;
+		sec->tgsec = NULL;
+#endif
+	} else {
+		sec->keep_capabilities = 0;
+		sec->cap_inheritable = CAP_INIT_INH_SET;
+		sec->cap_permitted = CAP_FULL_SET;
+		sec->cap_effective = CAP_INIT_EFF_SET;
+		sec->user = &root_user;
+		get_uid(sec->user);
+		sec->group_info = &init_groups;
+		get_group_info(sec->group_info);
+	}
+
+	atomic_set(&sec->usage, 1);
+	spin_lock_init(&sec->lock);
+#ifdef CONFIG_KEYS
+	sec->jit_keyring = KEY_REQKEY_DEFL_THREAD_KEYRING;
+#endif
+
+	ret = security_task_kernel_act_as(sec, service, daemon);
+	if (ret < 0) {
+		put_task_security(sec);
+		return ERR_PTR(ret);
+	}
+
+	return sec;
+}
+EXPORT_SYMBOL(get_kernel_security);
+
+/**
+ * change_create_files_as - Change the file create context in a security record
+ * @sec: The security record to alter
+ * @inode: The inode to take the context from
+ *
+ * Change the file creation context in a security record to be the same as the
+ * object context of the specified inode, so that the new inodes have the same
+ * MAC context as that inode.
+ */
+int change_create_files_as(struct task_security *sec, struct inode *inode)
+{
+	sec->fsuid = inode->i_uid;
+	sec->fsgid = inode->i_gid;
+	return security_task_create_files_as(sec, inode);
+}
+EXPORT_SYMBOL(change_create_files_as);
diff --git a/security/dummy.c b/security/dummy.c
index 526549d..1c8bad6 100644
--- a/security/dummy.c
+++ b/security/dummy.c
@@ -495,6 +495,19 @@ static int dummy_task_dup_security(struct task_security *p)
 	return 0;
 }
 
+static int dummy_task_kernel_act_as(struct task_security *sec,
+				    const char *service,
+				    struct task_struct *daemon)
+{
+	return 0;
+}
+
+static int dummy_task_create_files_as(struct task_security *sec,
+				      struct inode *inode)
+{
+	return 0;
+}
+
 static int dummy_task_setuid (uid_t id0, uid_t id1, uid_t id2, int flags)
 {
 	return 0;
@@ -1063,6 +1076,8 @@ void security_fixup_ops (struct security_operations *ops)
 	set_to_dummy_if_null(ops, task_alloc_security);
 	set_to_dummy_if_null(ops, task_free_security);
 	set_to_dummy_if_null(ops, task_dup_security);
+	set_to_dummy_if_null(ops, task_kernel_act_as);
+	set_to_dummy_if_null(ops, task_create_files_as);
 	set_to_dummy_if_null(ops, task_setuid);
 	set_to_dummy_if_null(ops, task_post_setuid);
 	set_to_dummy_if_null(ops, task_setgid);
diff --git a/security/security.c b/security/security.c
index 92d66d6..86d94e5 100644
--- a/security/security.c
+++ b/security/security.c
@@ -583,6 +583,19 @@ int security_task_dup(struct task_security *sec)
 	return security_ops->task_dup_security(sec);
 }
 
+int security_task_kernel_act_as(struct task_security *sec,
+				const char *service,
+				struct task_struct *daemon)
+{
+	return security_ops->task_kernel_act_as(sec, service, daemon);
+}
+
+int security_task_create_files_as(struct task_security *sec,
+				  struct inode *inode)
+{
+	return security_ops->task_create_files_as(sec, inode);
+}
+
 int security_task_setuid(uid_t id0, uid_t id1, uid_t id2, int flags)
 {
 	return security_ops->task_setuid(id0, id1, id2, flags);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 20a6b55..2416f54 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2846,6 +2846,49 @@ static int selinux_task_dup_security(struct task_security *sec)
 	return 0;
 }
 
+/*
+ * get the security data for a kernel service, deriving the subjective context
+ * from the security data of a userspace daemon if one supplied
+ * - all the creation contexts are set to unlabelled
+ */
+static int selinux_task_kernel_act_as(struct task_security *sec,
+				      const char *service,
+				      struct task_struct *daemon)
+{
+	struct task_security_struct *tsec, *dtsec;
+	u32 ksid;
+	int ret;
+
+	tsec = sec->security;
+	dtsec = daemon ? daemon->sec->security : init_task.sec->security;
+
+	ret = security_transition_sid(dtsec->sid, SECINITSID_KERNEL,
+				      SECCLASS_PROCESS, &ksid);
+	if (ret < 0)
+		return ret;
+
+	tsec->sid = ksid;
+	tsec->create_sid = SECINITSID_UNLABELED;
+	tsec->keycreate_sid = SECINITSID_UNLABELED;
+	tsec->sockcreate_sid = SECINITSID_UNLABELED;
+	sec->security = tsec;
+	return 0;
+}
+
+/*
+ * set the file creation context in a security record to the same as the
+ * objective context of the specified inode
+ */
+static int selinux_task_create_files_as(struct task_security *sec,
+					struct inode *inode)
+{
+	struct task_security_struct *tsec = sec->security;
+	struct inode_security_struct *isec = inode->i_security;
+
+	tsec->create_sid = isec->sid;
+	return 0;
+}
+
 static int selinux_task_setuid(uid_t id0, uid_t id1, uid_t id2, int flags)
 {
 	/* Since setuid only affects the current process, and
@@ -4884,6 +4927,8 @@ static struct security_operations selinux_ops = {
 	.task_alloc_security =		selinux_task_alloc_security,
 	.task_free_security =		selinux_task_free_security,
 	.task_dup_security =		selinux_task_dup_security,
+	.task_kernel_act_as =		selinux_task_kernel_act_as,
+	.task_create_files_as =		selinux_task_create_files_as,
 	.task_setuid =			selinux_task_setuid,
 	.task_post_setuid =		selinux_task_post_setuid,
 	.task_setgid =			selinux_task_setgid,


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 09/28] FS-Cache: Release page->private after failed readahead [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (7 preceding siblings ...)
  2007-12-05 19:38 ` [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions " David Howells
@ 2007-12-05 19:39 ` David Howells
  2007-12-14  3:51   ` Nick Piggin
  2007-12-17 22:42   ` David Howells
  2007-12-05 19:39 ` [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management " David Howells
                   ` (18 subsequent siblings)
  27 siblings, 2 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:39 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

The attached patch causes read_cache_pages() to release page-private data on a
page for which add_to_page_cache() fails or the filler function fails. This
permits pages with caching references associated with them to be cleaned up.

The invalidatepage() address space op is called (indirectly) to do the honours.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 mm/readahead.c |   39 +++++++++++++++++++++++++++++++++++++--
 1 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index c9c50ca..75aa6b6 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -44,6 +44,41 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+/*
+ * see if a page needs releasing upon read_cache_pages() failure
+ * - the caller of read_cache_pages() may have set PG_private before calling,
+ *   such as the NFS fs marking pages that are cached locally on disk, thus we
+ *   need to give the fs a chance to clean up in the event of an error
+ */
+static void read_cache_pages_invalidate_page(struct address_space *mapping,
+					     struct page *page)
+{
+	if (PagePrivate(page)) {
+		if (TestSetPageLocked(page))
+			BUG();
+		page->mapping = mapping;
+		do_invalidatepage(page, 0);
+		page->mapping = NULL;
+		unlock_page(page);
+	}
+	page_cache_release(page);
+}
+
+/*
+ * release a list of pages, invalidating them first if need be
+ */
+static void read_cache_pages_invalidate_pages(struct address_space *mapping,
+					      struct list_head *pages)
+{
+	struct page *victim;
+
+	while (!list_empty(pages)) {
+		victim = list_to_page(pages);
+		list_del(&victim->lru);
+		read_cache_pages_invalidate_page(mapping, victim);
+	}
+}
+
 /**
  * read_cache_pages - populate an address space with some pages & start reads against them
  * @mapping: the address_space
@@ -65,14 +100,14 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 		list_del(&page->lru);
 		if (add_to_page_cache_lru(page, mapping,
 					page->index, GFP_KERNEL)) {
-			page_cache_release(page);
+			read_cache_pages_invalidate_page(mapping, page);
 			continue;
 		}
 		page_cache_release(page);
 
 		ret = filler(data, page);
 		if (unlikely(ret)) {
-			put_pages_list(pages);
+			read_cache_pages_invalidate_pages(mapping, pages);
 			break;
 		}
 		task_io_account_read(PAGE_CACHE_SIZE);


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (8 preceding siblings ...)
  2007-12-05 19:39 ` [PATCH 09/28] FS-Cache: Release page->private after failed readahead " David Howells
@ 2007-12-05 19:39 ` David Howells
  2007-12-14  4:08   ` Nick Piggin
  2007-12-17 22:36   ` David Howells
  2007-12-05 19:39 ` [PATCH 11/28] FS-Cache: Provide an add_wait_queue_tail() function " David Howells
                   ` (17 subsequent siblings)
  27 siblings, 2 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:39 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Recruit a couple of page flags to aid in cache management.  The following extra
flags are defined:

 (1) PG_fscache (PG_owner_priv_2)

     The marked page is backed by a local cache and is pinning resources in the
     cache driver.

 (2) PG_fscache_write (PG_owner_priv_3)

     The marked page is being written to the local cache.  The page may not be
     modified whilst this is in progress.

If PG_fscache is set, then things that checked for PG_private will now also
check for that.  This includes things like truncation and page invalidation.
The function page_has_private() had been added to detect this.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/splice.c                |    2 +-
 include/linux/page-flags.h |   38 ++++++++++++++++++++++++++++++++++++--
 include/linux/pagemap.h    |   11 +++++++++++
 mm/filemap.c               |   16 ++++++++++++++++
 mm/migrate.c               |    2 +-
 mm/page_alloc.c            |    3 +++
 mm/readahead.c             |    9 +++++----
 mm/swap.c                  |    4 ++--
 mm/swap_state.c            |    4 ++--
 mm/truncate.c              |   10 +++++-----
 mm/vmscan.c                |    2 +-
 11 files changed, 83 insertions(+), 18 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 6bdcb61..61edad7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -58,7 +58,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
 		 */
 		wait_on_page_writeback(page);
 
-		if (PagePrivate(page))
+		if (page_has_private(page))
 			try_to_release_page(page, GFP_KERNEL);
 
 		/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 209d3a4..fcc9e23 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -77,25 +77,30 @@
 #define PG_active		 6
 #define PG_slab			 7	/* slab debug (Suparna wants this) */
 
-#define PG_owner_priv_1		 8	/* Owner use. If pagecache, fs may use*/
+#define PG_owner_priv_1		 8	/* Owner use. fs may use in pagecache */
 #define PG_arch_1		 9
 #define PG_reserved		10
 #define PG_private		11	/* If pagecache, has fs-private data */
 
 #define PG_writeback		12	/* Page is under writeback */
+#define PG_owner_priv_2		13	/* Owner use. fs may use in pagecache */
 #define PG_compound		14	/* Part of a compound page */
 #define PG_swapcache		15	/* Swap page: swp_entry_t in private */
 
 #define PG_mappedtodisk		16	/* Has blocks allocated on-disk */
 #define PG_reclaim		17	/* To be reclaimed asap */
+#define PG_owner_priv_3		18	/* Owner use. fs may use in pagecache */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
 #define PG_readahead		PG_reclaim /* Reminder to do async read-ahead */
 
-/* PG_owner_priv_1 users should have descriptive aliases */
+/* PG_owner_priv_1/2/3 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
 #define PG_pinned		PG_owner_priv_1	/* Xen pinned pagetable */
+#define PG_fscache		PG_owner_priv_2	/* Backed by local cache */
+#define PG_fscache_write	PG_owner_priv_3	/* Writing to local cache */
+
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -199,6 +204,24 @@ static inline void SetPageUptodate(struct page *page)
 #define TestClearPageWriteback(page) test_and_clear_bit(PG_writeback,	\
 							&(page)->flags)
 
+#define PageFsCache(page)	test_bit(PG_fscache, &(page)->flags)
+#define SetPageFsCache(page)	set_bit(PG_fscache, &(page)->flags)
+#define ClearPageFsCache(page)	clear_bit(PG_fscache, &(page)->flags)
+#define TestSetPageFsCache(page) test_and_set_bit(PG_fscache, &(page)->flags)
+#define TestClearPageFsCache(page) test_and_clear_bit(PG_fscache, \
+						      &(page)->flags)
+
+#define PageFsCacheWrite(page)		test_bit(PG_fscache_write, \
+						 &(page)->flags)
+#define SetPageFsCacheWrite(page)	set_bit(PG_fscache_write, \
+						&(page)->flags)
+#define ClearPageFsCacheWrite(page)	clear_bit(PG_fscache_write, \
+						  &(page)->flags)
+#define TestSetPageFsCacheWrite(page)	test_and_set_bit(PG_fscache_write, \
+							 &(page)->flags)
+#define TestClearPageFsCacheWrite(page)	test_and_clear_bit(PG_fscache_write, \
+							   &(page)->flags)
+
 #define PageBuddy(page)		test_bit(PG_buddy, &(page)->flags)
 #define __SetPageBuddy(page)	__set_bit(PG_buddy, &(page)->flags)
 #define __ClearPageBuddy(page)	__clear_bit(PG_buddy, &(page)->flags)
@@ -272,4 +295,15 @@ static inline void set_page_writeback(struct page *page)
 	test_set_page_writeback(page);
 }
 
+/**
+ * page_has_private - Determine if page has private stuff
+ * @page: The page to be checked
+ *
+ * Determine if a page has private stuff, indicating that release routines
+ * should be invoked upon it.
+ */
+#define page_has_private(page)			\
+	((page)->flags & ((1 << PG_private) |	\
+			  (1 << PG_fscache)))
+
 #endif	/* PAGE_FLAGS_H */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index db8a410..6a1b317 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -212,6 +212,17 @@ static inline void wait_on_page_writeback(struct page *page)
 extern void end_page_writeback(struct page *page);
 
 /*
+ * Wait for a page to finish being written to a local cache
+ */
+static inline void wait_on_page_fscache_write(struct page *page)
+{
+	if (PageFsCacheWrite(page))
+		wait_on_page_bit(page, PG_fscache_write);
+}
+
+extern void end_page_fscache_write(struct page *page);
+
+/*
  * Fault a userspace page into pagetables.  Return non-zero on a fault.
  *
  * This assumes that two userspace pages are always sufficient.  That's
diff --git a/mm/filemap.c b/mm/filemap.c
index 188cf5f..8a8e5b8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -560,6 +560,19 @@ void end_page_writeback(struct page *page)
 EXPORT_SYMBOL(end_page_writeback);
 
 /**
+ * end_page_fscache_write - End writing page to cache
+ * @page: the page
+ */
+void end_page_fscache_write(struct page *page)
+{
+	if (!TestClearPageFsCacheWrite(page))
+		BUG();
+	smp_mb__after_clear_bit();
+	wake_up_page(page, PG_fscache_write);
+}
+EXPORT_SYMBOL(end_page_fscache_write);
+
+/**
  * __lock_page - get a lock on the page, assuming we need to sleep to get it
  * @page: the page to lock
  *
@@ -2528,6 +2541,9 @@ out:
  * (presumably at page->private).  If the release was successful, return `1'.
  * Otherwise return zero.
  *
+ * This may also be called if PG_fscache is set on a page, indicating that the
+ * page is known to the local caching routines.
+ *
  * The @gfp_mask argument specifies whether I/O may be performed to release
  * this page (__GFP_IO), and whether the call may block (__GFP_WAIT).
  *
diff --git a/mm/migrate.c b/mm/migrate.c
index 6a207e8..ebaf557 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -545,7 +545,7 @@ static int fallback_migrate_page(struct address_space *mapping,
 	 * Buffers may be managed in a filesystem specific way.
 	 * We must have no buffers or drop them.
 	 */
-	if (PagePrivate(page) &&
+	if (page_has_private(page) &&
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b5a58d4..ac4c341 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -230,6 +230,7 @@ static void bad_page(struct page *page)
 	dump_stack();
 	page->flags &= ~(1 << PG_lru	|
 			1 << PG_private |
+			1 << PG_fscache	|
 			1 << PG_locked	|
 			1 << PG_active	|
 			1 << PG_dirty	|
@@ -456,6 +457,7 @@ static inline int free_pages_check(struct page *page)
 		(page->flags & (
 			1 << PG_lru	|
 			1 << PG_private |
+			1 << PG_fscache	|
 			1 << PG_locked	|
 			1 << PG_active	|
 			1 << PG_slab	|
@@ -605,6 +607,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 		(page->flags & (
 			1 << PG_lru	|
 			1 << PG_private	|
+			1 << PG_fscache	|
 			1 << PG_locked	|
 			1 << PG_active	|
 			1 << PG_dirty	|
diff --git a/mm/readahead.c b/mm/readahead.c
index 75aa6b6..272ffc7 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -46,14 +46,15 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 /*
  * see if a page needs releasing upon read_cache_pages() failure
- * - the caller of read_cache_pages() may have set PG_private before calling,
- *   such as the NFS fs marking pages that are cached locally on disk, thus we
- *   need to give the fs a chance to clean up in the event of an error
+ * - the caller of read_cache_pages() may have set PG_private or PG_fscache
+ *   before calling, such as the NFS fs marking pages that are cached locally
+ *   on disk, thus we need to give the fs a chance to clean up in the event of
+ *   an error
  */
 static void read_cache_pages_invalidate_page(struct address_space *mapping,
 					     struct page *page)
 {
-	if (PagePrivate(page)) {
+	if (page_has_private(page)) {
 		if (TestSetPageLocked(page))
 			BUG();
 		page->mapping = mapping;
diff --git a/mm/swap.c b/mm/swap.c
index 9ac8832..22d6e03 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -455,8 +455,8 @@ void pagevec_strip(struct pagevec *pvec)
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
 
-		if (PagePrivate(page) && !TestSetPageLocked(page)) {
-			if (PagePrivate(page))
+		if (page_has_private(page) && !TestSetPageLocked(page)) {
+			if (page_has_private(page))
 				try_to_release_page(page, 0);
 			unlock_page(page);
 		}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b526356..33f1e5c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -76,7 +76,7 @@ static int __add_to_swap_cache(struct page *page, swp_entry_t entry,
 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(PageSwapCache(page));
-	BUG_ON(PagePrivate(page));
+	BUG_ON(page_has_private(page));
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
 		write_lock_irq(&swapper_space.tree_lock);
@@ -129,7 +129,7 @@ void __delete_from_swap_cache(struct page *page)
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!PageSwapCache(page));
 	BUG_ON(PageWriteback(page));
-	BUG_ON(PagePrivate(page));
+	BUG_ON(page_has_private(page));
 
 	radix_tree_delete(&swapper_space.page_tree, page_private(page));
 	set_page_private(page, 0);
diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..5b7d1c5 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -49,7 +49,7 @@ void do_invalidatepage(struct page *page, unsigned long offset)
 static inline void truncate_partial_page(struct page *page, unsigned partial)
 {
 	zero_user_page(page, partial, PAGE_CACHE_SIZE - partial, KM_USER0);
-	if (PagePrivate(page))
+	if (page_has_private(page))
 		do_invalidatepage(page, partial);
 }
 
@@ -100,7 +100,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 
 	cancel_dirty_page(page, PAGE_CACHE_SIZE);
 
-	if (PagePrivate(page))
+	if (page_has_private(page))
 		do_invalidatepage(page, 0);
 
 	remove_from_page_cache(page);
@@ -125,7 +125,7 @@ invalidate_complete_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return 0;
 
-	if (PagePrivate(page) && !try_to_release_page(page, 0))
+	if (page_has_private(page) && !try_to_release_page(page, 0))
 		return 0;
 
 	ret = remove_mapping(mapping, page);
@@ -347,14 +347,14 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return 0;
 
-	if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
+	if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
 	write_lock_irq(&mapping->tree_lock);
 	if (PageDirty(page))
 		goto failed;
 
-	BUG_ON(PagePrivate(page));
+	BUG_ON(page_has_private(page));
 	__remove_from_page_cache(page);
 	write_unlock_irq(&mapping->tree_lock);
 	ClearPageUptodate(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e5a9597..ff2fa2d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -578,7 +578,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * process address space (page_count == 1) it can be freed.
 		 * Otherwise, leave the page on the LRU so it is swappable.
 		 */
-		if (PagePrivate(page)) {
+		if (page_has_private(page)) {
 			if (!try_to_release_page(page, sc->gfp_mask))
 				goto activate_locked;
 			if (!mapping && page_count(page) == 1)


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 11/28] FS-Cache: Provide an add_wait_queue_tail() function [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (9 preceding siblings ...)
  2007-12-05 19:39 ` [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management " David Howells
@ 2007-12-05 19:39 ` David Howells
  2007-12-05 19:39 ` [PATCH 12/28] FS-Cache: Generic filesystem caching facility " David Howells
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:39 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Provide an add_wait_queue_tail() function to add a waiter to the back of a
wait queue instead of the front.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/wait.h |    2 ++
 kernel/wait.c        |   18 ++++++++++++++++++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 0e68628..f1038d0 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -118,6 +118,8 @@ static inline int waitqueue_active(wait_queue_head_t *q)
 #define is_sync_wait(wait)	(!(wait) || ((wait)->private))
 
 extern void FASTCALL(add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
+extern void FASTCALL(add_wait_queue_tail(wait_queue_head_t *q,
+					 wait_queue_t *wait));
 extern void FASTCALL(add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait));
 extern void FASTCALL(remove_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
 
diff --git a/kernel/wait.c b/kernel/wait.c
index 444ddbf..7acc9cc 100644
--- a/kernel/wait.c
+++ b/kernel/wait.c
@@ -29,6 +29,24 @@ void fastcall add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
 }
 EXPORT_SYMBOL(add_wait_queue);
 
+/**
+ * add_wait_queue_tail - Add a waiter to the back of a waitqueue
+ * @q: the wait queue to append the waiter to
+ * @wait: the waiter to be queued
+ *
+ * Add a waiter to the back of a waitqueue so that it gets woken up last.
+ */
+void fastcall add_wait_queue_tail(wait_queue_head_t *q, wait_queue_t *wait)
+{
+	unsigned long flags;
+
+	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
+	spin_lock_irqsave(&q->lock, flags);
+	__add_wait_queue_tail(q, wait);
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL(add_wait_queue_tail);
+
 void fastcall add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
 {
 	unsigned long flags;


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 12/28] FS-Cache: Generic filesystem caching facility [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (10 preceding siblings ...)
  2007-12-05 19:39 ` [PATCH 11/28] FS-Cache: Provide an add_wait_queue_tail() function " David Howells
@ 2007-12-05 19:39 ` David Howells
  2007-12-05 19:39 ` [PATCH 13/28] CacheFiles: Add missing copy_page export for ia64 " David Howells
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:39 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

The attached patch adds a generic intermediary (FS-Cache) by which filesystems
may call on local caching capabilities, and by which local caching backends may
make caches available:

	+---------+
	|         |                        +--------------+
	|   NFS   |--+                     |              |
	|         |  |                 +-->|   CacheFS    |
	+---------+  |   +----------+  |   |  /dev/hda5   |
	             |   |          |  |   +--------------+
	+---------+  +-->|          |  |
	|         |      |          |--+
	|   AFS   |----->| FS-Cache |
	|         |      |          |--+
	+---------+  +-->|          |  |
	             |   |          |  |   +--------------+
	+---------+  |   +----------+  |   |              |
	|         |  |                 +-->|  CacheFiles  |
	|  ISOFS  |--+                     |  /var/cache  |
	|         |                        +--------------+
	+---------+

The patch also documents the netfs interface and the cache backend
interface provided by the facility.


There are a number of reasons why I'm not using i_mapping to do this.
These have been discussed a lot on the LKML and CacheFS mailing lists,
but to summarise the basics:

 (1) Most filesystems don't do hole reportage.  Holes in files are treated as
     blocks of zeros and can't be distinguished otherwise, making it difficult
     to distinguish blocks that have been read from the network and cached from
     those that haven't.

 (2) The backing inode must be fully populated before being exposed to
     userspace through the main inode because the VM/VFS goes directly to the
     backing inode and does not interrogate the front inode on VM ops.

     Therefore:

     (a) The backing inode must fit entirely within the cache.

     (b) All backed files currently open must fit entirely within the cache at
     	 the same time.

     (c) A working set of files in total larger than the cache may not be
     	 cached.

     (d) A file may not grow larger than the available space in the cache.

     (e) A file that's open and cached, and remotely grows larger than the
     	 cache is potentially stuffed.

 (3) Writes go to the backing filesystem, and can only be transferred to the
     network when the file is closed.

 (4) There's no record of what changes have been made, so the whole file must
     be written back.

 (5) The pages belong to the backing filesystem, and all metadata associated
     with that page are relevant only to the backing filesystem, and not
     anything stacked atop it.


The attached patch adds a generic core to which both networking filesystems and
caches may bind.  It transfers requests from networking filesystems to
appropriate caches if possible, or else gracefully denies them.

If this facility is disabled in the kernel configuration, then all its
operations will be trivially reducible to nothing by the compiler.

FS-Cache provides the following facilities:

 (1) Caches can be added / removed at any time, even whilst in use.

 (2) Adds a facility by which tags can be used to refer to caches, even if
     they're not mounted yet.

 (3) More than one cache can be used at once.  Caches can be selected
     explicitly by use of tags.

 (4) The netfs is provided with an interface that allows either party to
     withdraw caching facilities from a file (required for (1)).

 (5) A netfs may annotate cache objects that belongs to it.

 (6) Cache objects can be pinned and reservations made.

 (7) The interface to the netfs returns as few errors as possible, preferring
     rather to let the netfs remain oblivious.

 (8) Cookies are used to represent indices, files and other objects to the
     netfs.  The simplest cookie is just a NULL pointer - indicating nothing
     cached there.

 (9) The netfs is allowed to propose - dynamically - any index hierarchy it
     desires, though it must be aware that the index search function is
     recursive, stack space is limited, and indices can only be children of
     indices.

(10) Indices can be used to group files together to reduce key size and to make
     group invalidation easier.  The use of indices may make lookup quicker,
     but that's cache dependent.

(11) Data I/O is effectively done directly to and from the netfs's pages.  The
     netfs indicates that page A is at index B of the data-file represented by
     cookie C, and that it should be read or written.  The cache backend may or
     may not start I/O on that page, but if it does, a netfs callback will be
     invoked to indicate completion.  The I/O may be either synchronous or
     asynchronous.

(12) Cookies can be "retired" upon release.  At this point FS-Cache will mark
     them as obsolete and the index hierarchy rooted at that point will get
     recycled.

(13) The netfs provides a "match" function for index searches.  In addition to
     saying whether a match was made or not, this can also specify that an
     entry should be updated or deleted.


FS-Cache maintains a virtual indexing tree in which all indices, files, objects
and pages are kept.  Bits of this tree may actually reside in one or more
caches.

                                           FSDEF
                                             |
                        +------------------------------------+
                        |                                    |
                       NFS                                  AFS
                        |                                    |
           +--------------------------+                +-----------+
           |                          |                |           |
        homedir                     mirror          afs.org   redhat.com
           |                          |                            |
     +------------+           +---------------+              +----------+
     |            |           |               |              |          |
   00001        00002       00007           00125        vol00001   vol00002
     |            |           |               |                         |
 +---+---+     +-----+      +---+      +------+------+            +-----+----+
 |   |   |     |     |      |   |      |      |      |            |     |    |
PG0 PG1 PG2   PG0  XATTR   PG0 PG1   DIRENT DIRENT DIRENT        R/W   R/O  Bak
                     |                                            |
                    PG0                                       +-------+
                                                              |       |
                                                            00001   00003
                                                              |
                                                          +---+---+
                                                          |   |   |
                                                         PG0 PG1 PG2

In the example above, you can see two netfs's being backed: NFS and AFS.  These
have different index hierarchies:

 (*) The NFS primary index will probably contain per-server indices.  Each
     server index is indexed by NFS file handles to get data file objects.
     Each data file objects can have an array of pages, but may also have
     further child objects, such as extended attributes and directory entries.
     Extended attribute objects themselves have page-array contents.

 (*) The AFS primary index contains per-cell indices.  Each cell index contains
     per-logical-volume indices.  Each of volume index contains up to three
     indices for the read-write, read-only and backup mirrors of those volumes.
     Each of these contains vnode data file objects, each of which contains an
     array of pages.

The very top index is the FS-Cache master index in which individual netfs's
have entries.

Any index object may reside in more than one cache, provided it only has index
children.  Any index with non-index object children will be assumed to only
reside in one cache.


The FS-Cache overview can be found in:

	Documentation/filesystems/caching/fscache.txt

The netfs API to FS-Cache can be found in:

	Documentation/filesystems/caching/netfs-api.txt

The cache backend API to FS-Cache can be found in:

	Documentation/filesystems/caching/backend-api.txt

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/filesystems/caching/backend-api.txt |  625 +++++++++++++++
 Documentation/filesystems/caching/fscache.txt     |  295 +++++++
 Documentation/filesystems/caching/netfs-api.txt   |  741 ++++++++++++++++++
 fs/Kconfig                                        |    6 
 fs/Makefile                                       |    1 
 fs/fscache/Kconfig                                |   49 +
 fs/fscache/Makefile                               |   19 
 fs/fscache/fsc-cache.c                            |  506 ++++++++++++
 fs/fscache/fsc-cookie.c                           |  490 ++++++++++++
 fs/fscache/fsc-fsdef.c                            |  112 +++
 fs/fscache/fsc-internal.h                         |  376 +++++++++
 fs/fscache/fsc-main.c                             |  133 +++
 fs/fscache/fsc-manage.c                           |  257 ++++++
 fs/fscache/fsc-object.c                           |  583 ++++++++++++++
 fs/fscache/fsc-page.c                             |  872 +++++++++++++++++++++
 fs/fscache/fsc-proc.c                             |  398 ++++++++++
 fs/fscache/fsc-stats.c                            |  103 ++
 fs/fscache/fsc-threads.c                          |  676 ++++++++++++++++
 include/linux/fscache-cache.h                     |  433 ++++++++++
 include/linux/fscache.h                           |  593 ++++++++++++++
 20 files changed, 7268 insertions(+), 0 deletions(-)

diff --git a/Documentation/filesystems/caching/backend-api.txt b/Documentation/filesystems/caching/backend-api.txt
new file mode 100644
index 0000000..a7e58eb
--- /dev/null
+++ b/Documentation/filesystems/caching/backend-api.txt
@@ -0,0 +1,625 @@
+			  ==========================
+			  FS-CACHE CACHE BACKEND API
+			  ==========================
+
+The FS-Cache system provides an API by which actual caches can be supplied to
+FS-Cache for it to then serve out to network filesystems and other interested
+parties.
+
+This API is declared in <linux/fscache-cache.h>.
+
+
+====================================
+INITIALISING AND REGISTERING A CACHE
+====================================
+
+To start off, a cache definition must be initialised and registered for each
+cache the backend wants to make available.  For instance, CacheFS does this in
+the fill_super() operation on mounting.
+
+The cache definition (struct fscache_cache) should be initialised by calling:
+
+	void fscache_init_cache(struct fscache_cache *cache,
+				struct fscache_cache_ops *ops,
+				const char *idfmt,
+				...);
+
+Where:
+
+ (*) "cache" is a pointer to the cache definition;
+
+ (*) "ops" is a pointer to the table of operations that the backend supports on
+     this cache; and
+
+ (*) "idfmt" is a format and printf-style arguments for constructing a label
+     for the cache.
+
+
+The cache should then be registered with FS-Cache by passing a pointer to the
+previously initialised cache definition to:
+
+	int fscache_add_cache(struct fscache_cache *cache,
+			      struct fscache_object *fsdef,
+			      const char *tagname);
+
+Two extra arguments should also be supplied:
+
+ (*) "fsdef" which should point to the object representation for the FS-Cache
+     master index in this cache.  Netfs primary index entries will be created
+     here.
+
+ (*) "tagname" which, if given, should be a text string naming this cache.  If
+     this is NULL, the identifier will be used instead.  For CacheFS, the
+     identifier is set to name the underlying block device and the tag can be
+     supplied by mount.
+
+This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag
+is already in use.  0 will be returned on success.
+
+
+=====================
+UNREGISTERING A CACHE
+=====================
+
+A cache can be withdrawn from the system by calling this function with a
+pointer to the cache definition:
+
+	void fscache_withdraw_cache(struct fscache_cache *cache);
+
+In CacheFS's case, this is called by put_super().
+
+
+========
+SECURITY
+========
+
+The cache methods are executed one of two contexts:
+
+ (1) that of the userspace process that issued the netfs operation that caused
+     the cache method to be invoked, or
+
+ (2) that of one of the processes in the FS-Cache thread pool.
+
+In either case, this may not be an appropriate context in which to access the
+cache.
+
+The calling process's fsuid, fsgid and SELinux security identities may need to
+be masqueraded for the duration of the cache driver's access to the cache.
+This is left to the cache to handle; FS-Cache makes no effort in this regard.
+
+
+===================================
+CONTROL AND STATISTICS PRESENTATION
+===================================
+
+The cache may present data to the outside world through FS-Cache's interfaces
+in sysfs and procfs - the former for control and the latter for statistics.
+
+A sysfs directory called /sys/fs/fscache/<cachetag>/ is created if CONFIG_SYSFS
+is enabled.  This is accessible through the kobject struct fscache_cache::kobj
+and is for use by the cache as it sees fit.
+
+The cache driver may create itself a directory named for the cache type in the
+/proc/fs/fscache/ directory.  This is available if CONFIG_FSCACHE_PROC is
+enabled and is accessible through:
+
+	struct proc_dir_entry *proc_fscache;
+
+
+========================
+RELEVANT DATA STRUCTURES
+========================
+
+ (*) Index/Data file FS-Cache representation cookie:
+
+	struct fscache_cookie {
+		struct fscache_object_def	*def;
+		struct fscache_netfs		*netfs;
+		void				*netfs_data;
+		...
+	};
+
+     The fields that might be of use to the backend describe the object
+     definition, the netfs definition and the netfs's data for this cookie.
+     The object definition contain functions supplied by the netfs for loading
+     and matching index entries; these are required to provide some of the
+     cache operations.
+
+
+ (*) In-cache object representation:
+
+	struct fscache_object {
+		int				debug_id;
+		enum {
+			FSCACHE_OBJECT_RECYCLING,
+			...
+		}				state;
+		spinlock_t			lock
+		struct fscache_cache		*cache;
+		struct fscache_cookie		*cookie;
+		...
+	};
+
+     Structures of this type should be allocated by the cache backend and
+     passed to FS-Cache when requested by the appropriate cache operation.  In
+     the case of CacheFS, they're embedded in CacheFS's internal object
+     structures.
+
+     The debug_id is a simple integer that can be used in debugging messages
+     that refer to a particular object.  In such a case it should be printed
+     using "OBJ%x" to be consistent with FS-Cache.
+
+     Each object contains a pointer to the cookie that represents the object it
+     is backing.  An object should retired when put_object() is called if it is
+     in state FSCACHE_OBJECT_RECYCLING.  The fscache_object struct should be
+     initialised by calling fscache_object_init(object).
+
+
+ (*) FS-Cache operation record:
+
+	struct fscache_operation {
+		atomic_t		usage;
+		struct fscache_object	*object;
+		unsigned long		flags;
+	#define FSCACHE_OP_EXCLUSIVE
+		void (*processor)(struct fscache_operation *op);
+		void (*release)(struct fscache_operation *op);
+		...
+	};
+
+     FS-Cache has a pool of threads that it uses to give CPU time to the
+     various asynchronous operations that need to be done as part of driving
+     the cache.  These are represented by the above structure.  The processor
+     method is called to give the op CPU time, and the release method to get
+     rid of it when its usage count reaches 0.
+
+     An operation can be made exclusive upon an object by setting the
+     appropriate flag before enqueuing it with fscache_enqueue_operation().  If
+     an operation needs more processing time, it should be enqueued again.
+
+
+ (*) FS-Cache retrieval operation record:
+
+	struct fscache_retrieval {
+		struct fscache_operation op;
+		struct address_space	*mapping;
+		struct list_head	*to_do;
+		...
+	};
+
+     A structure of this type is allocated by FS-Cache to record retrieval and
+     allocation requests made by the netfs.  This struct is then passed to the
+     backend to do the operation.  The backend may get extra refs to it by
+     calling fscache_get_retrieval() and refs may be discarded by calling
+     fscache_put_retrieval().
+
+     A retrieval operation can be used by the backend to do retrieval work.  To
+     do this, the retrieval->op.processor method pointer should be set
+     appropriately by the backend and fscache_enqueue_retrieval() called to
+     submit it to the thread pool.  CacheFiles, for example, uses this to queue
+     page examination when it detects PG_lock being cleared.
+
+     The to_do field is an empty list available for the cache backend to use as
+     it sees fit.
+
+
+ (*) FS-Cache storage operation record:
+
+	struct fscache_storage {
+		struct fscache_operation op;
+		pgoff_t			store_limit;
+		...
+	};
+
+     A structure of this type is allocated by FS-Cache to record outstanding
+     writes to be made.  FS-Cache itself enqueues this operation and invokes
+     the write_page() method on the object at appropriate times to effect
+     storage.
+
+
+================
+CACHE OPERATIONS
+================
+
+The cache backend provides FS-Cache with a table of operations that can be
+performed on the denizens of the cache.  These are held in a structure of type:
+
+	struct fscache_cache_ops
+
+ (*) Name of cache provider [mandatory]:
+
+	const char *name
+
+     This isn't strictly an operation, but should be pointed at a string naming
+     the backend.
+
+
+ (*) Allocate a new object [mandatory]:
+
+	struct fscache_object *(*alloc_object)(struct fscache_cache *cache,
+					       struct fscache_cookie *cookie)
+
+     This method is used to allocate a cache object representation to back a
+     cookie in a particular cache.  fscache_object_init() should be called on
+     the object to initialise it prior to returning.
+
+     This function may also be used to parse the index key to be used for
+     multiple lookup calls to turn it into a more convenient form.  FS-Cache
+     will call the lookup_complete() method to allow the cache to release the
+     form once lookup is complete or aborted.
+
+
+ (*) Look up and create object [mandatory]:
+
+	void (*lookup_object)(struct fscache_object *object)
+
+     This method is used to look up an object, given that the object is already
+     allocated and attached to the cookie.  This should instantiate that object
+     in the cache if it can.
+
+     The method should call fscache_object_lookup_negative() as soon as
+     possible if it determines the object doesn't exist in the cache.  If the
+     object is found to exist and the netfs indicates that it is valid then
+     fscache_obtained_object() should be called once the object is in a
+     position to have data stored in it.  Similarly, fscache_obtained_object()
+     should also be called once a non-present object has been created.
+
+     If a lookup error occurs, fscache_object_lookup_error() should be called
+     to abort the lookup of that object.
+
+
+ (*) Release lookup data [mandatory]:
+
+	void (*lookup_complete)(struct fscache_object *object)
+
+     This method is called to ask the cache to release any resources it was
+     using to perform a lookup.
+
+
+ (*) Increment object refcount [mandatory]:
+
+	struct fscache_object *(*grab_object)(struct fscache_object *object)
+
+     This method is called to increment the reference count on an object.  It
+     may fail (for instance if the cache is being withdrawn) by returning NULL.
+     It should return the object pointer if successful.
+
+
+ (*) Lock/Unlock object [mandatory]:
+
+	void (*lock_object)(struct fscache_object *object)
+	void (*unlock_object)(struct fscache_object *object)
+
+     These methods are used to exclusively lock an object.  It must be possible
+     to schedule with the lock held, so a spinlock isn't sufficient.
+
+
+ (*) Pin/Unpin object [optional]:
+
+	int (*pin_object)(struct fscache_object *object)
+	void (*unpin_object)(struct fscache_object *object)
+
+     These methods are used to pin an object into the cache.  Once pinned an
+     object cannot be reclaimed to make space.  Return -ENOSPC if there's not
+     enough space in the cache to permit this.
+
+
+ (*) Update object [mandatory]:
+
+	int (*update_object)(struct fscache_object *object)
+
+     This is called to update the index entry for the specified object.  The
+     new information should be in object->cookie->netfs_data.  This can be
+     obtained by calling object->cookie->def->get_aux()/get_attr().
+
+
+ (*) Discard object [mandatory]:
+
+	void (*drop_object)(struct fscache_object *object)
+
+     This method is called to indicate that an object has been unbound from its
+     cookie, and that the cache should release the object's resources and
+     retire it if it's in state FSCACHE_OBJECT_RECYCLING.
+
+     This method should not attempt to release any references held by the
+     caller.  The caller will invoke the put_object() method as appropriate.
+
+
+ (*) Release object reference [mandatory]:
+
+	void (*put_object)(struct fscache_object *object)
+
+     This method is used to discard a reference to an object.  The object may
+     be freed when all the references to it are released.
+
+
+ (*) Synchronise a cache [mandatory]:
+
+	void (*sync)(struct fscache_cache *cache)
+
+     This is called to ask the backend to synchronise a cache with its backing
+     device.
+
+
+ (*) Dissociate a cache [mandatory]:
+
+	void (*dissociate_pages)(struct fscache_cache *cache)
+
+     This is called to ask a cache to perform any page dissociations as part of
+     cache withdrawal.
+
+
+ (*) Notification that the attributes on a netfs file changed [mandatory]:
+
+	int (*attr_changed)(struct fscache_object *object);
+
+     This is called to indicate to the cache that certain attributes on a netfs
+     file have changed (for example the maximum size a file may reach).  The
+     cache can read these from the netfs by calling the cookie's get_attr()
+     method.
+
+     The cache may use the file size information to reserve space on the cache.
+     It should also call fscache_set_store_limit() to indicate to FS-Cache the
+     highest byte it's willing to store for an object.
+
+     This method may return -ve if an error occurred or the cache object cannot
+     be expanded.  In such a case, the object will be withdrawn from service.
+
+     This operation is run asynchronously from FS-Cache's thread pool, and
+     storage and retrieval operations from the netfs are excluded during the
+     execution of this operation.
+
+
+ (*) Reserve cache space for an object's data [optional]:
+
+	int (*reserve_space)(struct fscache_object *object, loff_t size);
+
+     This is called to request that cache space be reserved to hold the data
+     for an object and the metadata used to track it.  Zero size should be
+     taken as request to cancel a reservation.
+
+     This should return 0 if successful, -ENOSPC if there isn't enough space
+     available, or -ENOMEM or -EIO on other errors.
+
+     The reservation may exceed the current size of the object, thus permitting
+     future expansion.  If the amount of space consumed by an object would
+     exceed the reservation, it's permitted to refuse requests to allocate
+     pages, but not required.  An object may be pruned down to its reservation
+     size if larger than that already.
+
+
+ (*) Request page be read from cache [mandatory]:
+
+	int (*read_or_alloc_page)(struct fscache_retrieval *op,
+				  struct page *page,
+				  gfp_t gfp)
+
+     This is called to attempt to read a netfs page from the cache, or to
+     reserve a backing block if not.  FS-Cache will have done as much checking
+     as it can before calling, but most of the work belongs to the backend.
+
+     If there's no page in the cache, then -ENODATA should be returned if the
+     backend managed to reserve a backing block; -ENOBUFS or -ENOMEM if it
+     didn't.
+
+     If there is suitable data in the cache, then a read operation should be
+     queued and 0 returned.  When the read finishes, fscache_end_io() should be
+     called.
+
+     The fscache_mark_pages_cached() should be called for the page if any cache
+     metadata is retained.  This will indicate to the netfs that the page needs
+     explicit uncaching.  This operation takes a pagevec, thus allowing several
+     pages to be marked at once.
+
+     The retrieval record pointed to by op should be retained for each page
+     queued and released when I/O on the page has been formally ended.
+     fscache_get/put_retrieval() are available for this purpose.
+
+     The retrieval record may be used to get CPU time via the FS-Cache thread
+     pool.  If this is desired, the op->op.processor should be set to point to
+     the appropriate processing routine, and fscache_enqueue_retrieval() should
+     be called at an appropriate point to request CPU time.  For instance, the
+     retrieval routine could be enqueued upon the completion of a disk read.
+     The to_do field in the retrieval record is provided to aid in this.
+
+     If an I/O error occurs, fscache_io_error() should be called and -ENOBUFS
+     returned if possible or fscache_end_io() called with a suitable error
+     code..
+
+
+ (*) Request pages be read from cache [mandatory]:
+
+	int (*read_or_alloc_pages)(struct fscache_retrieval *op,
+				   struct list_head *pages,
+				   unsigned *nr_pages,
+				   gfp_t gfp)
+
+     This is like the read_or_alloc_page() method, except it is handed a list
+     of pages instead of one page.  Any pages on which a read operation is
+     started must be added to the page cache for the specified mapping and also
+     to the LRU.  Such pages must also be removed from the pages list and
+     *nr_pages decremented per page.
+
+     If there was an error such as -ENOMEM, then that should be returned; else
+     if one or more pages couldn't be read or allocated, then -ENOBUFS should
+     be returned; else if one or more pages couldn't be read, then -ENODATA
+     should be returned.  If all the pages are dispatched then 0 should be
+     returned.
+
+
+ (*) Request page be allocated in the cache [mandatory]:
+
+	int (*allocate_page)(struct fscache_retrieval *op,
+			     struct page *page,
+			     gfp_t gfp)
+
+     This is like the read_or_alloc_page() method, except that it shouldn't
+     read from the cache, even if there's data there that could be retrieved.
+     It should, however, set up any internal metadata required such that
+     the write_page() method can write to the cache.
+
+     If there's no backing block available, then -ENOBUFS should be returned
+     (or -ENOMEM if there were other problems).  If a block is successfully
+     allocated, then the netfs page should be marked and 0 returned.
+
+
+ (*) Request pages be allocated in the cache [mandatory]:
+
+	int (*allocate_pages)(struct fscache_retrieval *op,
+			      struct list_head *pages,
+			      unsigned *nr_pages,
+			      gfp_t gfp)
+
+     This is an multiple page version of the allocate_page() method.  pages and
+     nr_pages should be treated as for the read_or_alloc_pages() method.
+
+
+ (*) Request page be written to cache [mandatory]:
+
+	int (*write_page)(struct fscache_storage *op,
+			  struct page *page);
+
+     This is called to write from a page on which there was a previously
+     successful read_or_alloc_page() call or similar.  FS-Cache filters out
+     pages that don't have mappings.
+
+     This method is called asynchronously from the FS-Cache thread pool.  It is
+     not required to actually store anything, provided -ENODATA is then
+     returned to the next read of this page.
+
+     If an error occurred, then a negative error code should be returned,
+     otherwise zero should be returned.  FS-Cache will take appropriate action
+     in response to an error, such as withdrawing this object.
+
+     If this method returns success then FS-Cache will inform the netfs
+     appropriately.
+
+
+ (*) Discard retained per-page metadata [mandatory]:
+
+	void (*uncache_page)(struct fscache_object *object, struct page *page)
+
+     This is called when a netfs page is being evicted from the pagecache.  The
+     cache backend should tear down any internal representation or tracking it
+     maintains for this page.
+
+
+==================
+FS-CACHE UTILITIES
+==================
+
+FS-Cache provides some utilities that a cache backend may make use of:
+
+ (*) Note occurrence of an I/O error in a cache:
+
+	void fscache_io_error(struct fscache_cache *cache)
+
+     This tells FS-Cache that an I/O error occurred in the cache.  After this
+     has been called, only resource dissociation operations (object and page
+     release) will be passed from the netfs to the cache backend for the
+     specified cache.
+
+     This does not actually withdraw the cache.  That must be done separately.
+
+
+ (*) Invoke the retrieval I/O completion function:
+
+	void fscache_end_io(struct fscache_retrieval *op, struct page *page,
+			    int error);
+
+     This is called to note the end of an attempt to retrieve a page.  The
+     error value should be 0 if successful and an error otherwise.
+
+
+ (*) Set highest store limit:
+
+	void fscache_set_store_limit(struct fscache_object *object,
+				     loff_t i_size);
+
+     This sets the limit FS-Cache imposes on the highest byte it's willing to
+     try and store for a netfs.  Any page over this limit is automatically
+     rejected by fscache_read_alloc_page() and co with -ENOBUFS.
+
+
+ (*) Mark pages as being cached:
+
+	void fscache_mark_pages_cached(struct fscache_retrieval *op,
+				       struct pagevec *pagevec);
+
+     This marks a set of pages as being cached.  After this has been called,
+     the netfs must call fscache_uncache_page() to unmark the pages.
+
+
+ (*) Initialise a freshly allocated object:
+
+	void fscache_object_init(struct fscache_object *object);
+
+     This initialises all the fields in an object representation.
+
+
+ (*) Indicate negative lookup on an object:
+
+	void fscache_object_lookup_negative(struct fscache_object *object);
+
+     This is called to indicate to FS-Cache that a lookup process for an object
+     found a negative result.
+
+     This changes the state of an object to permit reads pending on lookup
+     completion to go off and start fetching data from the netfs server as it's
+     known at this point that there can't be any data in the cache.
+
+     This may be called multiple times on an object.  Only the first call is
+     significant - all subsequent calls are ignored.
+
+
+ (*) Indicate an object has been obtained:
+
+	void fscache_obtained_object(struct fscache_object *object);
+
+     This is called to indicate to FS-Cache that a lookup process for an object
+     produced a positive result, or that an object was created.  This should
+     only be called once for any particular object.
+
+     This changes the state of an object to indicate:
+
+	(1) if no call to fscache_object_lookup_negative() has been made on
+	    this object, that there may be data available, and that reads can
+	    now go and look for it; and
+
+        (2) that writes may now proceed against this object.
+
+
+ (*) Indicate that object lookup failed:
+
+	void fscache_object_lookup_error(struct fscache_object *object);
+
+     This marks an object as having encountered a fatal error (usually EIO)
+     and causes it to move into a state whereby it will be withdrawn as soon
+     as possible.
+
+
+ (*) Get and release references on a retrieval record:
+
+	void fscache_get_retrieval(struct fscache_retrieval *op);
+	void fscache_put_retrieval(struct fscache_retrieval *op);
+
+     These two functions are used to retain a retrieval record whilst doing
+     asynchronous data retrieval and block allocation.
+
+
+ (*) Enqueue a retrieval record for processing.
+
+	void fscache_enqueue_retrieval(struct fscache_retrieval *op);
+
+     This enqueues a retrieval record for processing by the FS-Cache thread
+     pool.  One of the threads in the pool will invoke the retrieval record's
+     op->op.processor callback function.  This function may be called from
+     within the callback function.
+
+
+ (*) List of object state names:
+
+	const char *fscache_object_states[];
+
+     For debugging purposes, this may be used to turn the state that an object
+     is in into a text string for display purposes.
diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
new file mode 100644
index 0000000..b28f2ca
--- /dev/null
+++ b/Documentation/filesystems/caching/fscache.txt
@@ -0,0 +1,295 @@
+			  ==========================
+			  General Filesystem Caching
+			  ==========================
+
+========
+OVERVIEW
+========
+
+This facility is a general purpose cache for network filesystems, though it
+could be used for caching other things such as ISO9660 filesystems too.
+
+FS-Cache mediates between cache backends (such as CacheFS) and network
+filesystems:
+
+	+---------+
+	|         |                        +--------------+
+	|   NFS   |--+                     |              |
+	|         |  |                 +-->|   CacheFS    |
+	+---------+  |   +----------+  |   |  /dev/hda5   |
+	             |   |          |  |   +--------------+
+	+---------+  +-->|          |  |
+	|         |      |          |--+
+	|   AFS   |----->| FS-Cache |
+	|         |      |          |--+
+	+---------+  +-->|          |  |
+	             |   |          |  |   +--------------+
+	+---------+  |   +----------+  |   |              |
+	|         |  |                 +-->|  CacheFiles  |
+	|  ISOFS  |--+                     |  /var/cache  |
+	|         |                        +--------------+
+	+---------+
+
+
+FS-Cache does not follow the idea of completely loading every netfs file
+opened in its entirety into a cache before permitting it to be accessed and
+then serving the pages out of that cache rather than the netfs inode because:
+
+ (1) It must be practical to operate without a cache.
+
+ (2) The size of any accessible file must not be limited to the size of the
+     cache.
+
+ (3) The combined size of all opened files (this includes mapped libraries)
+     must not be limited to the size of the cache.
+
+ (4) The user should not be forced to download an entire file just to do a
+     one-off access of a small portion of it (such as might be done with the
+     "file" program).
+
+It instead serves the cache out in PAGE_SIZE chunks as and when requested by
+the netfs('s) using it.
+
+
+FS-Cache provides the following facilities:
+
+ (1) More than one cache can be used at once.  Caches can be selected
+     explicitly by use of tags.
+
+ (2) Caches can be added / removed at any time.
+
+ (3) The netfs is provided with an interface that allows either party to
+     withdraw caching facilities from a file (required for (2)).
+
+ (4) The interface to the netfs returns as few errors as possible, preferring
+     rather to let the netfs remain oblivious.
+
+ (5) Cookies are used to represent indices, files and other objects to the
+     netfs.  The simplest cookie is just a NULL pointer - indicating nothing
+     cached there.
+
+ (6) The netfs is allowed to propose - dynamically - any index hierarchy it
+     desires, though it must be aware that the index search function is
+     recursive, stack space is limited, and indices can only be children of
+     indices.
+
+ (7) Data I/O is done direct to and from the netfs's pages.  The netfs
+     indicates that page A is at index B of the data-file represented by cookie
+     C, and that it should be read or written.  The cache backend may or may
+     not start I/O on that page, but if it does, a netfs callback will be
+     invoked to indicate completion.  The I/O may be either synchronous or
+     asynchronous.
+
+ (8) Cookies can be "retired" upon release.  At this point FS-Cache will mark
+     them as obsolete and the index hierarchy rooted at that point will get
+     recycled.
+
+ (9) The netfs provides a "match" function for index searches.  In addition to
+     saying whether a match was made or not, this can also specify that an
+     entry should be updated or deleted.
+
+(10) As much as possible is done asynchronously.
+
+
+FS-Cache maintains a virtual indexing tree in which all indices, files, objects
+and pages are kept.  Bits of this tree may actually reside in one or more
+caches.
+
+                                           FSDEF
+                                             |
+                        +------------------------------------+
+                        |                                    |
+                       NFS                                  AFS
+                        |                                    |
+           +--------------------------+                +-----------+
+           |                          |                |           |
+        homedir                     mirror          afs.org   redhat.com
+           |                          |                            |
+     +------------+           +---------------+              +----------+
+     |            |           |               |              |          |
+   00001        00002       00007           00125        vol00001   vol00002
+     |            |           |               |                         |
+ +---+---+     +-----+      +---+      +------+------+            +-----+----+
+ |   |   |     |     |      |   |      |      |      |            |     |    |
+PG0 PG1 PG2   PG0  XATTR   PG0 PG1   DIRENT DIRENT DIRENT        R/W   R/O  Bak
+                     |                                            |
+                    PG0                                       +-------+
+                                                              |       |
+                                                            00001   00003
+                                                              |
+                                                          +---+---+
+                                                          |   |   |
+                                                         PG0 PG1 PG2
+
+In the example above, you can see two netfs's being backed: NFS and AFS.  These
+have different index hierarchies:
+
+ (*) The NFS primary index contains per-server indices.  Each server index is
+     indexed by NFS file handles to get data file objects.  Each data file
+     objects can have an array of pages, but may also have further child
+     objects, such as extended attributes and directory entries.  Extended
+     attribute objects themselves have page-array contents.
+
+ (*) The AFS primary index contains per-cell indices.  Each cell index contains
+     per-logical-volume indices.  Each of volume index contains up to three
+     indices for the read-write, read-only and backup mirrors of those volumes.
+     Each of these contains vnode data file objects, each of which contains an
+     array of pages.
+
+The very top index is the FS-Cache master index in which individual netfs's
+have entries.
+
+Any index object may reside in more than one cache, provided it only has index
+children.  Any index with non-index object children will be assumed to only
+reside in one cache.
+
+
+The netfs API to FS-Cache can be found in:
+
+	Documentation/filesystems/caching/netfs-api.txt
+
+The cache backend API to FS-Cache can be found in:
+
+	Documentation/filesystems/caching/backend-api.txt
+
+
+=======================
+STATISTICAL INFORMATION
+=======================
+
+If FS-Cache is compiled with the following options enabled:
+
+	CONFIG_FSCACHE_PROC=y (implied by the following two)
+	CONFIG_FSCACHE_STATS=y
+	CONFIG_FSCACHE_HISTOGRAM=y
+
+then it will gather certain statistics and display them through a number of
+proc files.
+
+ (*) /proc/fs/fscache/stats
+
+     This shows counts of a number of events that can happen in FS-Cache:
+
+	CLASS	EVENT	MEANING
+	=======	=======	=======================================================
+	Cookies	idx=N	Number of index cookies allocated
+		dat=N	Number of data storage cookies allocated
+		spc=N	Number of special cookies allocated
+	Objects	alc=N	Number of objects allocated
+		nal=N	Number of object allocation failures
+		avl=N	Number of objects that reached the available state
+	Pages	mrk=N	Number of pages marked as being cached
+		unc=N	Number of uncache page requests seen
+	Acquire	n=N	Number of acquire cookie requests seen
+		nul=N	Number of acq reqs given a NULL parent
+		noc=N	Number of acq reqs rejected due to no cache available
+		ok=N	Number of acq reqs succeeded
+		nbf=N	Number of acq reqs rejected due to error
+		oom=N	Number of acq reqs failed on ENOMEM
+	Lookups	n=N	Number of lookup calls made on cache backends
+		neg=N	Number of negative lookups made
+		pos=N	Number of positive lookups made
+		crt=N	Number of objects created by lookup
+		bst=N	Number of objects with boosted lookup priority
+	Updates	n=N	Number of update cookie requests seen
+		nul=N	Number of upd reqs given a NULL parent
+		run=N	Number of upd reqs granted CPU time
+	Relinqs	n=N	Number of relinquish cookie requests seen
+		nul=N	Number of rlq reqs given a NULL parent
+		wcr=N	Number of rlq reqs waited on completion of creation
+	AttrChg	n=N	Number of attribute changed requests seen
+		ok=N	Number of attr changed requests queued
+		nbf=N	Number of attr changed rejected -ENOBUFS
+		oom=N	Number of attr changed failed -ENOMEM
+		run=N	Number of attr changed ops given CPU time
+	Allocs	n=N	Number of allocation requests seen
+		ok=N	Number of successful alloc reqs
+		wt=N	Number of alloc reqs that waited on lookup completion
+		nbf=N	Number of alloc reqs rejected -ENOBUFS
+		ops=N	Number of alloc reqs submitted
+		owt=N	Number of alloc reqs waited for CPU time
+	Retrvls	n=N	Number of retrieval (read) requests seen
+		ok=N	Number of successful retr reqs
+		wt=N	Number of retr reqs that waited on lookup completion
+		nod=N	Number of retr reqs returned -ENODATA
+		nbf=N	Number of retr reqs rejected -ENOBUFS
+		int=N	Number of retr reqs aborted -ERESTARTSYS
+		oom=N	Number of retr reqs failed -ENOMEM
+		ops=N	Number of retr reqs submitted
+		owt=N	Number of retr reqs waited for CPU time
+	Stores	n=N	Number of storage (write) requests seen
+		ok=N	Number of successful store reqs
+		agn=N	Number of store reqs on a page already pending storage
+		nbf=N	Number of store reqs rejected -ENOBUFS
+		oom=N	Number of store reqs failed -ENOMEM
+		ops=N	Number of store reqs submitted
+		run=N	Number of store reqs granted CPU time
+	Ops	pend=N	Number of times async ops added to pending queues
+		run=N	Number of times async ops given CPU time
+		enq=N	Number of times async ops queued for processing
+		req=N	Number of times async ops requeued for processing
+		rel=N	Number of times async ops released
+
+
+ (*) /proc/fs/fscache/pool
+
+     This shows the number of objects and operations each thread in the thread
+     pool has given CPU time to.
+
+
+ (*) /proc/fs/fscache/histogram
+
+	cat /proc/fs/fscache/histogram 
+	+HZ   +TIME OBJ INST  OP RUNS   OBJ RUNS  RETRV DLY RETRIEVLS
+	===== ===== ========= ========= ========= ========= =========
+
+     This shows the breakdown of the number of times each amount of time
+     between 0 jiffies and HZ-1 jiffies a variety of tasks took to run.  The
+     columns are as follows:
+
+	COLUMN		TIME MEASUREMENT
+	=======		=======================================================
+	OBJ INST	Length of time to instantiate an object
+	OP RUNS		Length of time a call to process an operation took
+	OBJ RUNS	Length of time a call to process an object event took
+	RETRV DLY	Time between an requesting a read and lookup completing
+	RETRIEVLS	Time between beginning and end of a retrieval
+
+     Each row shows the number of events that took a particular range of times.
+     Each step is 1 jiffy in size.  The +HZ column indicates the particular
+     jiffy range covered, and the +TIME field the equivalent number of seconds.
+
+
+=========
+DEBUGGING
+=========
+
+The FS-Cache facility can have runtime debugging enabled by adjusting the value
+in:
+
+	/sys/module/fscache/parameters/debug
+
+This is a bitmask of debugging streams to enable:
+
+	BIT	VALUE	STREAM				POINT
+	=======	=======	===============================	=======================
+	0	1	Cache management		Function entry trace
+	1	2					Function exit trace
+	2	4					General
+	3	8	Cookie management		Function entry trace
+	4	16					Function exit trace
+	5	32					General
+	6	64	Page handling			Function entry trace
+	7	128					Function exit trace
+	8	256					General
+	9	512	Thread pool management		Function entry trace
+	10	1024					Function exit trace
+	11	2048					General
+
+The appropriate set of values should be OR'd together and the result written to
+the control file.  For example:
+
+	echo $((1|8|64)) >/sys/module/fscache/parameters/debug
+
+will turn on all function entry debugging.
+
diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt
new file mode 100644
index 0000000..0b6d09a
--- /dev/null
+++ b/Documentation/filesystems/caching/netfs-api.txt
@@ -0,0 +1,741 @@
+			===============================
+			FS-CACHE NETWORK FILESYSTEM API
+			===============================
+
+There's an API by which a network filesystem can make use of the FS-Cache
+facilities.  This is based around a number of principles:
+
+ (1) Caches can store a number of different object types.  There are two main
+     object types: indices and files.  The first is a special type used by
+     FS-Cache to make finding objects faster and to make retiring of groups of
+     objects easier.
+
+ (2) Every index, file or other object is represented by a cookie.  This cookie
+     may or may not have anything associated with it, but the netfs doesn't
+     need to care.
+
+ (3) Barring the top-level index (one entry per cached netfs), the index
+     hierarchy for each netfs is structured according the whim of the netfs.
+
+This API is declared in <linux/fscache.h>.
+
+This document contains the following sections:
+
+	 (1) Network filesystem definition
+	 (2) Index definition
+	 (3) Object definition
+	 (4) Network filesystem (un)registration
+	 (5) Cache tag lookup
+	 (6) Index registration
+	 (7) Data file registration
+	 (8) Miscellaneous object registration
+	 (9) Setting the data file size
+	(10) Page alloc/read/write
+	(11) Page uncaching
+	(12) Index and data file update
+	(13) Miscellaneous cookie operations
+	(14) Cookie unregistration
+	(15) Index and data file invalidation
+
+
+=============================
+NETWORK FILESYSTEM DEFINITION
+=============================
+
+FS-Cache needs a description of the network filesystem.  This is specified
+using a record of the following structure:
+
+	struct fscache_netfs {
+		uint32_t			version;
+		const char			*name;
+		struct fscache_netfs_operations	*ops;
+		struct fscache_cookie		*primary_index;
+		...
+	};
+
+This first three fields should be filled in before registration, and the fourth
+will be filled in by the registration function; any other fields should just be
+ignored and are for internal use only.
+
+The fields are:
+
+ (1) The name of the netfs (used as the key in the toplevel index).
+
+ (2) The version of the netfs (if the name matches but the version doesn't, the
+     entire in-cache hierarchy for this netfs will be scrapped and begun
+     afresh).
+
+ (3) The operations table is defined as follows:
+
+	struct fscache_netfs_operations {
+	};
+
+     Currently there aren't any functions here.
+
+ (4) The cookie representing the primary index will be allocated according to
+     another parameter passed into the registration function.
+
+For example, kAFS (linux/fs/afs/) uses the following definitions to describe
+itself:
+
+	static struct fscache_netfs_operations afs_cache_ops = {
+	};
+
+	struct fscache_netfs afs_cache_netfs = {
+		.version	= 0,
+		.name		= "afs",
+		.ops		= &afs_cache_ops,
+	};
+
+
+================
+INDEX DEFINITION
+================
+
+Indices are used for two purposes:
+
+ (1) To aid the finding of a file based on a series of keys (such as AFS's
+     "cell", "volume ID", "vnode ID").
+
+ (2) To make it easier to discard a subset of all the files cached based around
+     a particular key - for instance to mirror the removal of an AFS volume.
+
+However, since it's unlikely that any two netfs's are going to want to define
+their index hierarchies in quite the same way, FS-Cache tries to impose as few
+restraints as possible on how an index is structured and where it is placed in
+the tree.  The netfs can even mix indices and data files at the same level, but
+it's not recommended.
+
+Each index entry consists of a key of indeterminate length plus some auxilliary
+data, also of indeterminate length.
+
+There are some limits on indices:
+
+ (1) Any index containing non-index objects should be restricted to a single
+     cache.  Any such objects created within an index will be created in the
+     first cache only.  The cache in which an index is created can be
+     controlled by cache tags (see below).
+
+ (2) The entry data must be atomically journallable, so it is limited to about
+     400 bytes at present.  At least 400 bytes will be available.
+
+ (3) The depth of the index tree should be judged with care as the search
+     function is recursive.  Too many layers will run the kernel out of stack.
+
+
+=================
+OBJECT DEFINITION
+=================
+
+To define an object, a structure of the following type should be filled out:
+
+	struct fscache_cookie_def
+	{
+		uint8_t name[16];
+		uint8_t type;
+
+		struct fscache_cache_tag *(*select_cache)(
+			const void *parent_netfs_data,
+			const void *cookie_netfs_data);
+
+		uint16_t (*get_key)(const void *cookie_netfs_data,
+				    void *buffer,
+				    uint16_t bufmax);
+
+		void (*get_attr)(const void *cookie_netfs_data,
+				 uint64_t *size);
+
+		uint16_t (*get_aux)(const void *cookie_netfs_data,
+				    void *buffer,
+				    uint16_t bufmax);
+
+		enum fscache_checkaux (*check_aux)(void *cookie_netfs_data,
+						   const void *data,
+						   uint16_t datalen);
+
+		void (*get_context)(void *cookie_netfs_data, void *context);
+
+		void (*put_context)(void *cookie_netfs_data, void *context);
+
+		void (*mark_pages_cached)(void *cookie_netfs_data,
+					  struct address_space *mapping,
+					  struct pagevec *cached_pvec);
+
+		void (*now_uncached)(void *cookie_netfs_data);
+	};
+
+This has the following fields:
+
+ (1) The type of the object [mandatory].
+
+     This is one of the following values:
+
+	(*) FSCACHE_COOKIE_TYPE_INDEX
+
+	    This defines an index, which is a special FS-Cache type.
+
+	(*) FSCACHE_COOKIE_TYPE_DATAFILE
+
+	    This defines an ordinary data file.
+
+	(*) Any other value between 2 and 255
+
+	    This defines an extraordinary object such as an XATTR.
+
+ (2) The name of the object type (NUL terminated unless all 16 chars are used)
+     [optional].
+
+ (3) A function to select the cache in which to store an index [optional].
+
+     This function is invoked when an index needs to be instantiated in a cache
+     during the instantiation of a non-index object.  Only the immediate index
+     parent for the non-index object will be queried.  Any indices above that
+     in the hierarchy may be stored in multiple caches.  This function does not
+     need to be supplied for any non-index object or any index that will only
+     have index children.
+
+     If this function is not supplied or if it returns NULL then the first
+     cache in the parent's list will be chosed, or failing that, the first
+     cache in the master list.
+
+ (4) A function to retrieve an object's key from the netfs [mandatory].
+
+     This function will be called with the netfs data that was passed to the
+     cookie acquisition function and the maximum length of key data that it may
+     provide.  It should write the required key data into the given buffer and
+     return the quantity it wrote.
+
+ (5) A function to retrieve attribute data from the netfs [optional].
+
+     This function will be called with the netfs data that was passed to the
+     cookie acquisition function.  It should return the size of the file if
+     this is a data file.  The size may be used to govern how much cache must
+     be reserved for this file in the cache.
+
+     If the function is absent, a file size of 0 is assumed.
+
+ (6) A function to retrieve auxilliary data from the netfs [optional].
+
+     This function will be called with the netfs data that was passed to the
+     cookie acquisition function and the maximum length of auxilliary data that
+     it may provide.  It should write the auxilliary data into the given buffer
+     and return the quantity it wrote.
+
+     If this function is absent, the auxilliary data length will be set to 0.
+
+     The length of the auxilliary data buffer may be dependent on the key
+     length.  A netfs mustn't rely on being able to provide more than 400 bytes
+     for both.
+
+ (7) A function to check the auxilliary data [optional].
+
+     This function will be called to check that a match found in the cache for
+     this object is valid.  For instance with AFS it could check the auxilliary
+     data against the data version number returned by the server to determine
+     whether the index entry in a cache is still valid.
+
+     If this function is absent, it will be assumed that matching objects in a
+     cache are always valid.
+
+     If present, the function should return one of the following values:
+
+	(*) FSCACHE_CHECKAUX_OKAY		- the entry is okay as is
+	(*) FSCACHE_CHECKAUX_NEEDS_UPDATE	- the entry requires update
+	(*) FSCACHE_CHECKAUX_OBSOLETE		- the entry should be deleted
+
+     This function can also be used to extract data from the auxilliary data in
+     the cache and copy it into the netfs's structures.
+
+ (8) A pair of functions to manage contexts for the completion callback
+     [optional].
+
+     The cache read/write functions are passed a context which is then passed
+     to the I/O completion callback function.  To ensure this context remains
+     valid until after the I/O completion is called, two functions may be
+     provided: one to get an extra reference on the context, and one to drop a
+     reference to it.
+
+     If the context is not used or is a type of object that won't go out of
+     scope, then these functions are not required.  These functions are not
+     required for indices as indices may not contain data.  These functions may
+     be called in interrupt context and so may not sleep.
+
+ (9) A function to mark a page as retaining cache metadata [optional].
+
+     This is called by the cache to indicate that it is retaining in-memory
+     information for this page and that the netfs should uncache the page when
+     it has finished.  This does not indicate whether there's data on the disk
+     or not.  Note that several pages at once may be presented for marking.
+
+     The PG_fscache bit is set on the pages before this function would be
+     called, so the function need not be provided if this is sufficient.
+
+     This function is not required for indices as they're not permitted data.
+
+(10) A function to unmark all the pages retaining cache metadata [mandatory].
+
+     This is called by FS-Cache to indicate that a backing store is being
+     unbound from a cookie and that all the marks on the pages should be
+     cleared to prevent confusion.  Note that the cache will have torn down all
+     its tracking information so that the pages don't need to be explicitly
+     uncached.
+
+     This function is not required for indices as they're not permitted data.
+
+
+===================================
+NETWORK FILESYSTEM (UN)REGISTRATION
+===================================
+
+The first step is to declare the network filesystem to the cache.  This also
+involves specifying the layout of the primary index (for AFS, this would be the
+"cell" level).
+
+The registration function is:
+
+	int fscache_register_netfs(struct fscache_netfs *netfs);
+
+It just takes a pointer to the netfs definition.  It returns 0 or an error as
+appropriate.
+
+For kAFS, registration is done as follows:
+
+	ret = fscache_register_netfs(&afs_cache_netfs);
+
+The last step is, of course, unregistration:
+
+	void fscache_unregister_netfs(struct fscache_netfs *netfs);
+
+
+================
+CACHE TAG LOOKUP
+================
+
+FS-Cache permits the use of more than one cache.  To permit particular index
+subtrees to be bound to particular caches, the second step is to look up cache
+representation tags.  This step is optional; it can be left entirely up to
+FS-Cache as to which cache should be used.  The problem with doing that is that
+FS-Cache will always pick the first cache that was registered.
+
+To get the representation for a named tag:
+
+	struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name);
+
+This takes a text string as the name and returns a representation of a tag.  It
+will never return an error.  It may return a dummy tag, however, if it runs out
+of memory; this will inhibit caching with this tag.
+
+Any representation so obtained must be released by passing it to this function:
+
+	void fscache_release_cache_tag(struct fscache_cache_tag *tag);
+
+The tag will be retrieved by FS-Cache when it calls the object definition
+operation select_cache().
+
+
+==================
+INDEX REGISTRATION
+==================
+
+The third step is to inform FS-Cache about part of an index hierarchy that can
+be used to locate files.  This is done by requesting a cookie for each index in
+the path to the file:
+
+	struct fscache_cookie *
+	fscache_acquire_cookie(struct fscache_cookie *parent,
+			       const struct fscache_object_def *def,
+			       void *netfs_data);
+
+This function creates an index entry in the index represented by parent,
+filling in the index entry by calling the operations pointed to by def.
+
+Note that this function never returns an error - all errors are handled
+internally.  It may, however, return NULL to indicate no cookie.  It is quite
+acceptable to pass this token back to this function as the parent to another
+acquisition (or even to the relinquish cookie, read page and write page
+functions - see below).
+
+Note also that no indices are actually created in a cache until a non-index
+object needs to be created somewhere down the hierarchy.  Furthermore, an index
+may be created in several different caches independently at different times.
+This is all handled transparently, and the netfs doesn't see any of it.
+
+For example, with AFS, a cell would be added to the primary index.  This index
+entry would have a dependent inode containing a volume location index for the
+volume mappings within this cell:
+
+	cell->cache =
+		fscache_acquire_cookie(afs_cache_netfs.primary_index,
+				       &afs_cell_cache_index_def,
+				       cell);
+
+Then when a volume location was accessed, it would be entered into the cell's
+index and an inode would be allocated that acts as a volume type and hash chain
+combination:
+
+	vlocation->cache =
+		fscache_acquire_cookie(cell->cache,
+				       &afs_vlocation_cache_index_def,
+				       vlocation);
+
+And then a particular flavour of volume (R/O for example) could be added to
+that index, creating another index for vnodes (AFS inode equivalents):
+
+	volume->cache =
+		fscache_acquire_cookie(vlocation->cache,
+				       &afs_volume_cache_index_def,
+				       volume);
+
+
+======================
+DATA FILE REGISTRATION
+======================
+
+The fourth step is to request a data file be created in the cache.  This is
+identical to index cookie acquisition.  The only difference is that the type in
+the object definition should be something other than index type.
+
+	vnode->cache =
+		fscache_acquire_cookie(volume->cache,
+				       &afs_vnode_cache_object_def,
+				       vnode);
+
+
+=================================
+MISCELLANEOUS OBJECT REGISTRATION
+=================================
+
+An optional step is to request an object of miscellaneous type be created in
+the cache.  This is almost identical to index cookie acquisition.  The only
+difference is that the type in the object definition should be something other
+than index type.  Whilst the parent object could be an index, it's more likely
+it would be some other type of object such as a data file.
+
+	xattr->cache =
+		fscache_acquire_cookie(vnode->cache,
+				       &afs_xattr_cache_object_def,
+				       xattr);
+
+Miscellaneous objects might be used to store extended attributes or directory
+entries for example.
+
+
+==========================
+SETTING THE DATA FILE SIZE
+==========================
+
+The fifth step is to set the physical attributes of the file, such as its size.
+This doesn't automatically reserve any space in the cache, but permits the
+cache to adjust its metadata for data tracking appropriately:
+
+	int fscache_attr_changed(struct fscache_cookie *cookie);
+
+The cache will return -ENOBUFS if there is no backing cache or if there is no
+space to allocate any extra metadata required in the cache.  The attributes
+will be accessed with the get_attr() cookie definition operation.
+
+Note that attempts to read or write data pages in the cache over this size may
+be rebuffed with -ENOBUFS.
+
+This operation schedules an attribute adjustment to happen asynchronously at
+some point in the future, and as such, it may happen after the function returns
+to the caller.  The attribute adjustment excludes read and write operations.
+
+
+=====================
+PAGE READ/ALLOC/WRITE
+=====================
+
+And the sixth step is to store and retrieve pages in the cache.  There are
+three functions that are used to do this.
+
+Note:
+
+ (1) A page should not be re-read or re-allocated without uncaching it first.
+
+ (2) A read or allocated page must be uncached when the netfs page is released
+     from the pagecache.
+
+ (3) A page should only be written to the cache if previous read or allocated.
+
+This permits the cache to maintain its page tracking in proper order.
+
+
+PAGE READ
+---------
+
+Firstly, the netfs should ask FS-Cache to examine the caches and read the
+contents cached for a particular page of a particular file if present, or else
+allocate space to store the contents if not:
+
+	typedef
+	void (*fscache_rw_complete_t)(struct page *page,
+				      void *context,
+				      int error);
+
+	int fscache_read_or_alloc_page(struct fscache_cookie *cookie,
+				       struct page *page,
+				       fscache_rw_complete_t end_io_func,
+				       void *context,
+				       gfp_t gfp);
+
+The cookie argument must specify a cookie for an object that isn't an index,
+the page specified will have the data loaded into it (and is also used to
+specify the page number), and the gfp argument is used to control how any
+memory allocations made are satisfied.
+
+If the cookie indicates the inode is not cached:
+
+ (1) The function will return -ENOBUFS.
+
+Else if there's a copy of the page resident in the cache:
+
+ (1) The mark_pages_cached() cookie operation will be called on that page.
+
+ (2) The function will submit a request to read the data from the cache's
+     backing device directly into the page specified.
+
+ (3) The function will return 0.
+
+ (4) When the read is complete, end_io_func() will be invoked with:
+
+     (*) The netfs data supplied when the cookie was created.
+
+     (*) The page descriptor.
+
+     (*) The context argument passed to the above function.  This will be
+     	 maintained with the get_context/put_context functions mentioned above.
+
+     (*) An argument that's 0 on success or negative for an error code.
+
+     If an error occurs, it should be assumed that the page contains no usable
+     data.
+
+     end_io_func() will be called in process context if the read is results in
+     an error, but it might be called in interrupt context if the read is
+     successful.
+
+Otherwise, if there's not a copy available in cache, but the cache may be able
+to store the page:
+
+ (1) The mark_pages_cached() cookie operation will be called on that page.
+
+ (2) A block may be reserved in the cache and attached to the object at the
+     appropriate place.
+
+ (3) The function will return -ENODATA.
+
+This function may also return -ENOMEM or -EINTR, in which case it won't have
+read any data from the cache.
+
+
+PAGE ALLOCATE
+-------------
+
+Alternatively, if there's not expected to be any data in the cache for a page
+because the file has been extended, a block can simply be allocated instead:
+
+	int fscache_alloc_page(struct fscache_cookie *cookie,
+			       struct page *page,
+			       gfp_t gfp);
+
+This is similar to the fscache_read_or_alloc_page() function, except that it
+never reads from the cache.  It will return 0 if a block has been allocated,
+rather than -ENODATA as the other would.  One or the other must be performed
+before writing to the cache.
+
+The mark_pages_cached() cookie operation will be called on the page if
+successful.
+
+
+PAGE WRITE
+----------
+
+Secondly, if the netfs changes the contents of the page (either due to an
+initial download or if a user performs a write), then the page should be
+written back to the cache:
+
+	int fscache_write_page(struct fscache_cookie *cookie,
+			       struct page *page,
+			       gfp_t gfp);
+
+The cookie argument must specify a data file cookie, the page specified should
+contain the data to be written (and is also used to specify the page number),
+and the gfp argument is used to control how any memory allocations made are
+satisfied.
+
+The page must have first been read or allocated successfully and must not have
+been uncached before writing is performed.
+
+If the cookie indicates the inode is not cached then:
+
+ (1) The function will return -ENOBUFS.
+
+Else if space can be allocated in the cache to hold this page:
+
+ (1) PG_fscache_write will be set on the page.
+
+ (2) The function will submit a request to write the data to cache's backing
+     device directly from the page specified.
+
+ (3) The function will return 0.
+
+ (4) When the write is complete PG_fscache_write is cleared on the page and
+     anyone waiting for that bit will be woken up.
+
+Else if there's no space available in the cache, -ENOBUFS will be returned.  It
+is also possible for the PG_fscache_write bit to be cleared when no write took
+place if unforeseen circumstances arose (such as a disk error).
+
+Writing takes place asynchronously.
+
+
+MULTIPLE PAGE READ
+------------------
+
+A facility is provided to read several pages at once, as requested by the
+readpages() address space operation:
+
+	int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
+					struct address_space *mapping,
+					struct list_head *pages,
+					int *nr_pages,
+					fscache_rw_complete_t end_io_func,
+					void *context,
+					gfp_t gfp);
+
+This works in a similar way to fscache_read_or_alloc_page(), except:
+
+ (1) Any page it can retrieve data for is removed from pages and nr_pages and
+     dispatched for reading to the disk.  Reads of adjacent pages on disk may
+     be merged for greater efficiency.
+
+ (2) The mark_pages_cached() cookie operation will be called on several pages
+     at once if they're being read or allocated.
+
+ (3) If there was an general error, then that error will be returned.
+
+     Else if some pages couldn't be allocated or read, then -ENOBUFS will be
+     returned.
+
+     Else if some pages couldn't be read but were allocated, then -ENODATA will
+     be returned.
+
+     Otherwise, if all pages had reads dispatched, then 0 will be returned, the
+     list will be empty and *nr_pages will be 0.
+
+ (4) end_io_func will be called once for each page being read as the reads
+     complete.  It will be called in process context if error != 0, but it may
+     be called in interrupt context if there is no error.
+
+Note that a return of -ENODATA, -ENOBUFS or any other error does not preclude
+some of the pages being read and some being allocated.  Those pages will have
+been marked appropriately and will need uncaching.
+
+
+==============
+PAGE UNCACHING
+==============
+
+To uncache a page, this function should be called:
+
+	void fscache_uncache_page(struct fscache_cookie *cookie,
+				  struct page *page);
+
+This function permits the cache to release any in-memory representation it
+might be holding for this netfs page.  This function must be called once for
+each page on which the read or write page functions above have been called to
+make sure the cache's in-memory tracking information gets torn down.
+
+Note that pages can't be explicitly deleted from the a data file.  The whole
+data file must be retired (see the relinquish cookie function below).
+
+Furthermore, note that this does not cancel the asynchronous read or write
+operation started by the read/alloc and write functions.
+
+
+==========================
+INDEX AND DATA FILE UPDATE
+==========================
+
+To request an update of the index data for an index or other object, the
+following function should be called:
+
+	void fscache_update_cookie(struct fscache_cookie *cookie);
+
+This function will refer back to the netfs_data pointer stored in the cookie by
+the acquisition function to obtain the data to write into each revised index
+entry.  The update method in the parent index definition will be called to
+transfer the data.
+
+Note that partial updates may happen automatically at other times, such as when
+data blocks are added to a data file object.
+
+
+===============================
+MISCELLANEOUS COOKIE OPERATIONS
+===============================
+
+There are a number of operations that can be used to control cookies:
+
+ (*) Cookie pinning:
+
+	int fscache_pin_cookie(struct fscache_cookie *cookie);
+	void fscache_unpin_cookie(struct fscache_cookie *cookie);
+
+     These operations permit data cookies to be pinned into the cache and to
+     have the pinning removed.  They are not permitted on index cookies.
+
+     The pinning function will return 0 if successful, -ENOBUFS in the cookie
+     isn't backed by a cache, -EOPNOTSUPP if the cache doesn't support pinning,
+     -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
+     -EIO if there's any other problem.
+
+ (*) Data space reservation:
+
+	int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size);
+
+     This permits a netfs to request cache space be reserved to store up to the
+     given amount of a file.  It is permitted to ask for more than the current
+     size of the file to allow for future file expansion.
+
+     If size is given as zero then the reservation will be cancelled.
+
+     The function will return 0 if successful, -ENOBUFS in the cookie isn't
+     backed by a cache, -EOPNOTSUPP if the cache doesn't support reservations,
+     -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
+     -EIO if there's any other problem.
+
+     Note that this doesn't pin an object in a cache; it can still be culled to
+     make space if it's not in use.
+
+
+=====================
+COOKIE UNREGISTRATION
+=====================
+
+To get rid of a cookie, this function should be called.
+
+	void fscache_relinquish_cookie(struct fscache_cookie *cookie,
+				       int retire);
+
+If retire is non-zero, then the object will be marked for recycling, and all
+copies of it will be removed from all active caches in which it is present.
+Not only that but all child objects will also be retired.
+
+If retire is zero, then the object may be available again when next the
+acquisition function is called.  Retirement here will overrule the pinning on a
+cookie.
+
+One very important note - relinquish must NOT be called for a cookie unless all
+the cookies for "child" indices, objects and pages have been relinquished
+first.
+
+
+================================
+INDEX AND DATA FILE INVALIDATION
+================================
+
+There is no direct way to invalidate an index subtree or a data file.  To do
+this, the caller should relinquish and retire the cookie they have, and then
+acquire a new one.
diff --git a/fs/Kconfig b/fs/Kconfig
index 635f3e2..715ab7c 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -656,6 +656,12 @@ config GENERIC_ACL
 	bool
 	select FS_POSIX_ACL
 
+menu "Caches"
+
+source "fs/fscache/Kconfig"
+
+endmenu
+
 if BLOCK
 menu "CD-ROM/DVD Filesystems"
 
diff --git a/fs/Makefile b/fs/Makefile
index 500cf15..1767091 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -65,6 +65,7 @@ obj-$(CONFIG_PROFILING)		+= dcookies.o
 obj-$(CONFIG_DLM)		+= dlm/
  
 # Do not add any filesystems before this line
+obj-$(CONFIG_FSCACHE)		+= fscache/
 obj-$(CONFIG_REISERFS_FS)	+= reiserfs/
 obj-$(CONFIG_EXT3_FS)		+= ext3/ # Before ext2 so root fs can be ext3
 obj-$(CONFIG_EXT4DEV_FS)	+= ext4/ # Before ext2 so root fs can be ext4dev
diff --git a/fs/fscache/Kconfig b/fs/fscache/Kconfig
new file mode 100644
index 0000000..e68c945
--- /dev/null
+++ b/fs/fscache/Kconfig
@@ -0,0 +1,49 @@
+
+config FSCACHE
+	tristate "General filesystem local caching manager"
+	depends on EXPERIMENTAL
+	help
+	  This option enables a generic filesystem caching manager that can be
+	  used by various network and other filesystems to cache data locally.
+	  Different sorts of caches can be plugged in, depending on the
+	  resources available.
+
+	  See Documentation/filesystems/caching/fscache.txt for more information.
+
+config FSCACHE_PROC
+	bool "Provide /proc interface for local caching statistics"
+	depends on FSCACHE && PROC_FS
+
+config FSCACHE_STATS
+	bool "Gather statistical information on local caching"
+	depends on FSCACHE_PROC
+	help
+	  This option causes statistical information to be gathered on local
+	  caching and exported through files:
+
+		/proc/fs/fscache/stats
+		/proc/fs/fscache/pool
+
+	  See Documentation/filesystems/caching/fscache.txt for more information.
+
+config FSCACHE_HISTOGRAM
+	bool "Gather latency information on local caching"
+	depends on FSCACHE_PROC
+	help
+
+	  This option causes latency information to be gathered on local
+	  caching and exported through file:
+
+		/proc/fs/fscache/histogram
+
+	  See Documentation/filesystems/caching/fscache.txt for more information.
+
+config FSCACHE_DEBUG
+	bool "Debug FS-Cache"
+	depends on FSCACHE
+	help
+	  This permits debugging to be dynamically enabled in the local caching
+	  management module.  If this is set, the debugging output may be
+	  enabled by setting bits in /sys/modules/fscache/parameter/debug.
+
+	  See Documentation/filesystems/caching/fscache.txt for more information.
diff --git a/fs/fscache/Makefile b/fs/fscache/Makefile
new file mode 100644
index 0000000..e60dad3
--- /dev/null
+++ b/fs/fscache/Makefile
@@ -0,0 +1,19 @@
+#
+# Makefile for general filesystem caching code
+#
+
+fscache-y := \
+	fsc-cache.o \
+	fsc-cookie.o \
+	fsc-fsdef.o \
+	fsc-main.o \
+	fsc-manage.o \
+	fsc-object.o \
+	fsc-page.o \
+	fsc-threads.o
+
+fscache-$(CONFIG_FSCACHE_PROC) += \
+	fsc-proc.o \
+	fsc-stats.o
+
+obj-$(CONFIG_FSCACHE) := fscache.o
diff --git a/fs/fscache/fsc-cache.c b/fs/fscache/fsc-cache.c
new file mode 100644
index 0000000..7e77673
--- /dev/null
+++ b/fs/fscache/fsc-cache.c
@@ -0,0 +1,506 @@
+/* FS-Cache cache handling
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL CACHE
+#include <linux/module.h>
+#include <linux/slab.h>
+#include "fsc-internal.h"
+
+LIST_HEAD(fscache_cache_list);
+DECLARE_RWSEM(fscache_addremove_sem);
+DECLARE_WAIT_QUEUE_HEAD(fscache_clearance_wq);
+static LIST_HEAD(fscache_cache_tag_list);
+static LIST_HEAD(fscache_netfs_list);
+
+static struct sysfs_ops fscache_cache_sysfs_ops = {
+};
+
+static struct kobj_type fscache_cache_ktype = {
+	.sysfs_ops	= &fscache_cache_sysfs_ops,
+};
+
+/*
+ * look up a cache tag
+ */
+struct fscache_cache_tag *__fscache_lookup_cache_tag(const char *name)
+{
+	struct fscache_cache_tag *tag, *xtag;
+
+	/* firstly check for the existence of the tag under read lock */
+	down_read(&fscache_addremove_sem);
+
+	list_for_each_entry(tag, &fscache_cache_tag_list, link) {
+		if (strcmp(tag->name, name) == 0) {
+			atomic_inc(&tag->usage);
+			up_read(&fscache_addremove_sem);
+			return tag;
+		}
+	}
+
+	up_read(&fscache_addremove_sem);
+
+	/* the tag does not exist - create a candidate */
+	xtag = kzalloc(sizeof(*xtag) + strlen(name) + 1, GFP_KERNEL);
+	if (!xtag)
+		/* return a dummy tag if out of memory */
+		return ERR_PTR(-ENOMEM);
+
+	atomic_set(&xtag->usage, 1);
+	strcpy(xtag->name, name);
+
+	/* write lock, search again and add if still not present */
+	down_write(&fscache_addremove_sem);
+
+	list_for_each_entry(tag, &fscache_cache_tag_list, link) {
+		if (strcmp(tag->name, name) == 0) {
+			atomic_inc(&tag->usage);
+			up_write(&fscache_addremove_sem);
+			kfree(xtag);
+			return tag;
+		}
+	}
+
+	list_add_tail(&xtag->link, &fscache_cache_tag_list);
+	up_write(&fscache_addremove_sem);
+	return xtag;
+}
+
+/*
+ * release a reference to a cache tag
+ */
+void __fscache_release_cache_tag(struct fscache_cache_tag *tag)
+{
+	if (tag != ERR_PTR(-ENOMEM)) {
+		down_write(&fscache_addremove_sem);
+
+		if (atomic_dec_and_test(&tag->usage))
+			list_del_init(&tag->link);
+		else
+			tag = NULL;
+
+		up_write(&fscache_addremove_sem);
+
+		kfree(tag);
+	}
+}
+
+/*
+ * register a network filesystem for caching
+ */
+int __fscache_register_netfs(struct fscache_netfs *netfs)
+{
+	struct fscache_netfs *ptr;
+	int ret;
+
+	_enter("{%s}", netfs->name);
+
+	INIT_LIST_HEAD(&netfs->link);
+
+	/* allocate a cookie for the primary index */
+	netfs->primary_index =
+		kmem_cache_zalloc(fscache_cookie_jar, GFP_KERNEL);
+
+	if (!netfs->primary_index) {
+		_leave(" = -ENOMEM");
+		return -ENOMEM;
+	}
+
+	/* initialise the primary index cookie */
+	atomic_set(&netfs->primary_index->usage, 1);
+	atomic_set(&netfs->primary_index->n_children, 0);
+
+	netfs->primary_index->def		= &fscache_fsdef_netfs_def;
+	netfs->primary_index->parent		= &fscache_fsdef_index;
+	netfs->primary_index->netfs_data	= netfs;
+
+	atomic_inc(&netfs->primary_index->parent->usage);
+	atomic_inc(&netfs->primary_index->parent->n_children);
+
+	spin_lock_init(&netfs->primary_index->lock);
+	INIT_HLIST_HEAD(&netfs->primary_index->backing_objects);
+
+	/* check the netfs type is not already present */
+	down_write(&fscache_addremove_sem);
+
+	ret = -EEXIST;
+	list_for_each_entry(ptr, &fscache_netfs_list, link) {
+		if (strcmp(ptr->name, netfs->name) == 0)
+			goto already_registered;
+	}
+
+	list_add(&netfs->link, &fscache_netfs_list);
+	ret = 0;
+
+	printk(KERN_NOTICE "FS-Cache: Netfs '%s' registered for caching\n",
+	       netfs->name);
+
+already_registered:
+	up_write(&fscache_addremove_sem);
+
+	if (ret < 0) {
+		netfs->primary_index->parent = NULL;
+		__fscache_cookie_put(netfs->primary_index);
+		netfs->primary_index = NULL;
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+}
+EXPORT_SYMBOL(__fscache_register_netfs);
+
+/*
+ * unregister a network filesystem from the cache
+ * - all cookies must have been released first
+ */
+void __fscache_unregister_netfs(struct fscache_netfs *netfs)
+{
+	_enter("{%s.%u}", netfs->name, netfs->version);
+
+	down_write(&fscache_addremove_sem);
+
+	list_del(&netfs->link);
+	fscache_relinquish_cookie(netfs->primary_index, 0);
+
+	up_write(&fscache_addremove_sem);
+
+	printk(KERN_NOTICE "FS-Cache: Netfs '%s' unregistered from caching\n",
+	       netfs->name);
+
+	_leave("");
+}
+EXPORT_SYMBOL(__fscache_unregister_netfs);
+
+/**
+ * fscache_init_cache - Initialise a cache record
+ * @cache: The cache record to be initialised
+ * @ops: The cache operations to be installed in that record
+ * @idfmt: Format string to define identifier
+ * @...: sprintf-style arguments
+ *
+ * Initialise a record of a cache and fill in the name.
+ *
+ * See Documentation/filesystems/caching/backend-api.txt for a complete
+ * description.
+ */
+void fscache_init_cache(struct fscache_cache *cache,
+			const struct fscache_cache_ops *ops,
+			const char *idfmt,
+			...)
+{
+	va_list va;
+
+	memset(cache, 0, sizeof(*cache));
+
+	cache->ops = ops;
+	cache->kobj.kset = &fscache_kset;
+	cache->kobj.ktype = &fscache_cache_ktype;
+
+	va_start(va, idfmt);
+	vsnprintf(cache->identifier, sizeof(cache->identifier), idfmt, va);
+	va_end(va);
+
+	INIT_LIST_HEAD(&cache->link);
+	INIT_LIST_HEAD(&cache->object_list);
+	spin_lock_init(&cache->object_list_lock);
+}
+EXPORT_SYMBOL(fscache_init_cache);
+
+/**
+ * fscache_add_cache - Declare a cache as being open for business
+ * @cache: The record describing the cache
+ * @ifsdef: The record of the cache object describing the top-level index
+ * @tagname: The tag describing this cache
+ *
+ * Add a cache to the system, making it available for netfs's to use.
+ *
+ * See Documentation/filesystems/caching/backend-api.txt for a complete
+ * description.
+ */
+int fscache_add_cache(struct fscache_cache *cache,
+		      struct fscache_object *ifsdef,
+		      const char *tagname)
+{
+	struct fscache_cache_tag *tag;
+	int ret;
+
+	BUG_ON(!cache->ops);
+	BUG_ON(!ifsdef);
+
+	/* make sure the worker threads are present */
+	down_write(&fscache_addremove_sem);
+	fscache_init_threads();
+	up_write(&fscache_addremove_sem);
+
+	cache->flags = 0;
+	ifsdef->event_mask = ULONG_MAX & ~(1 << FSCACHE_OBJECT_EV_CLEARED);
+	ifsdef->state = FSCACHE_OBJECT_ACTIVE;
+
+	if (!tagname)
+		tagname = cache->identifier;
+
+	BUG_ON(!tagname[0]);
+
+	_enter("{%s.%s},,%s", cache->ops->name, cache->identifier, tagname);
+
+	/* we use the cache tag to uniquely identify caches */
+	tag = __fscache_lookup_cache_tag(tagname);
+	if (IS_ERR(tag))
+		goto nomem;
+
+	if (test_and_set_bit(FSCACHE_TAG_RESERVED, &tag->flags))
+		goto tag_in_use;
+
+	ret = kobject_set_name(&cache->kobj, "%s", tagname);
+	if (ret < 0)
+		goto error;
+
+	ret = kobject_register(&cache->kobj);
+	if (ret < 0)
+		goto error;
+
+	if (!cache->ops->grab_object(ifsdef))
+		BUG();
+
+	ifsdef->cookie = &fscache_fsdef_index;
+	ifsdef->cache = cache;
+	cache->fsdef = ifsdef;
+
+	down_write(&fscache_addremove_sem);
+
+	tag->cache = cache;
+	cache->tag = tag;
+
+	/* add the cache to the list */
+	list_add(&cache->link, &fscache_cache_list);
+
+	/* add the cache's netfs definition index object to the cache's
+	 * list */
+	spin_lock(&cache->object_list_lock);
+	list_add_tail(&ifsdef->cache_link, &cache->object_list);
+	spin_unlock(&cache->object_list_lock);
+
+	/* add the cache's netfs definition index object to the top level index
+	 * cookie as a known backing object */
+	spin_lock(&fscache_fsdef_index.lock);
+
+	hlist_add_head(&ifsdef->cookie_link,
+		       &fscache_fsdef_index.backing_objects);
+
+	atomic_inc(&fscache_fsdef_index.usage);
+
+	/* done */
+	spin_unlock(&fscache_fsdef_index.lock);
+	up_write(&fscache_addremove_sem);
+
+	printk(KERN_NOTICE "FS-Cache: Cache \"%s\" added (type %s)\n",
+	       cache->tag->name, cache->ops->name);
+
+	_leave(" = 0 [%s]", cache->identifier);
+	return 0;
+
+tag_in_use:
+	printk(KERN_ERR "FS-Cache: Cache tag '%s' already in use\n", tagname);
+	__fscache_release_cache_tag(tag);
+	_leave(" = -EXIST");
+	return -EEXIST;
+
+error:
+	__fscache_release_cache_tag(tag);
+	_leave(" = %d", ret);
+	return ret;
+
+nomem:
+	_leave(" = -ENOMEM");
+	return -ENOMEM;
+}
+EXPORT_SYMBOL(fscache_add_cache);
+
+/*
+ * select a cache in which to store an object
+ * - the cache addremove semaphore must be at least read-locked by the caller
+ * - the object will never be an index
+ */
+struct fscache_cache *fscache_select_cache_for_object(
+	struct fscache_cookie *cookie)
+{
+	struct fscache_cache_tag *tag;
+	struct fscache_object *object;
+	struct fscache_cache *cache;
+
+	_enter("");
+
+	if (list_empty(&fscache_cache_list)) {
+		_leave(" = NULL [no cache]");
+		return NULL;
+	}
+
+	/* we check the parent to determine the cache to use */
+	spin_lock(&cookie->lock);
+
+	/* the first in the parent's backing list should be the preferred
+	 * cache */
+	if (!hlist_empty(&cookie->backing_objects)) {
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		cache = object->cache;
+		if (object->state >= FSCACHE_OBJECT_DYING ||
+		    test_bit(FSCACHE_IOERROR, &cache->flags))
+			cache = NULL;
+
+		spin_unlock(&cookie->lock);
+		_leave(" = %p [parent]", cache);
+		return cache;
+	}
+
+	/* the parent is unbacked */
+	if (cookie->def->type != FSCACHE_COOKIE_TYPE_INDEX) {
+		/* cookie not an index and is unbacked */
+		spin_unlock(&cookie->lock);
+		_leave(" = NULL [cookie ub,ni]");
+		return NULL;
+	}
+
+	spin_unlock(&cookie->lock);
+
+	if (!cookie->def->select_cache)
+		goto no_preference;
+
+	/* ask the netfs for its preference */
+	tag = cookie->def->select_cache(cookie->parent->netfs_data,
+					cookie->netfs_data);
+	if (!tag)
+		goto no_preference;
+
+	if (tag == ERR_PTR(-ENOMEM)) {
+		_leave(" = NULL [nomem tag]");
+		return NULL;
+	}
+
+	if (!tag->cache) {
+		_leave(" = NULL [unbacked tag]");
+		return NULL;
+	}
+
+	if (test_bit(FSCACHE_IOERROR, &tag->cache->flags))
+		return NULL;
+
+	_leave(" = %p [specific]", tag->cache);
+	return tag->cache;
+
+no_preference:
+	/* netfs has no preference - just select first cache */
+	cache = list_entry(fscache_cache_list.next,
+			   struct fscache_cache, link);
+	_leave(" = %p [first]", cache);
+	return cache;
+}
+
+/**
+ * fscache_io_error - Note a cache I/O error
+ * @cache: The record describing the cache
+ *
+ * Note that an I/O error occurred in a cache and that it should no longer be
+ * used for anything.  This also reports the error into the kernel log.
+ *
+ * See Documentation/filesystems/caching/backend-api.txt for a complete
+ * description.
+ */
+void fscache_io_error(struct fscache_cache *cache)
+{
+	set_bit(FSCACHE_IOERROR, &cache->flags);
+
+	printk(KERN_ERR "FS-Cache: Cache %s stopped due to I/O error\n",
+	       cache->ops->name);
+}
+EXPORT_SYMBOL(fscache_io_error);
+
+/**
+ * fscache_withdraw_cache - Withdraw a cache from the active service
+ * @cache: The record describing the cache
+ *
+ * Withdraw a cache from service, unbinding all its cache objects from the
+ * netfs cookies they're currently representing.
+ *
+ * See Documentation/filesystems/caching/backend-api.txt for a complete
+ * description.
+ */
+void fscache_withdraw_cache(struct fscache_cache *cache)
+{
+	struct fscache_object *object;
+	LIST_HEAD(object_list);
+
+	_enter("");
+
+	printk(KERN_NOTICE "FS-Cache: Withdrawing cache \"%s\"\n",
+	       cache->tag->name);
+
+	/* make the cache unavailable for cookie acquisition */
+	if (test_and_set_bit(FSCACHE_CACHE_WITHDRAWN, &cache->flags))
+		BUG();
+
+	down_write(&fscache_addremove_sem);
+	list_del_init(&cache->link);
+	cache->tag->cache = NULL;
+	up_write(&fscache_addremove_sem);
+
+	/* make sure all pages pinned by operations on behalf of the netfs are
+	 * written to disk */
+	cache->ops->sync_cache(cache);
+
+	/* dissociate all the netfs pages backed by this cache from the block
+	 * mappings in the cache */
+	cache->ops->dissociate_pages(cache);
+
+	/* we now have to destroy all the active objects pertaining to this
+	 * cache - which we do by passing them off to thread pool to dispose
+	 * of */
+	_debug("destroy");
+
+	spin_lock(&cache->object_list_lock);
+
+	while (!list_empty(&cache->object_list)) {
+		object = list_entry(cache->object_list.next,
+				    struct fscache_object, cache_link);
+		list_move_tail(&object->cache_link, &object_list);
+
+		_debug("withdraw %p", object->cookie);
+
+		spin_lock(&object->lock);
+		spin_unlock(&cache->object_list_lock);
+		fscache_raise_event(object, FSCACHE_OBJECT_EV_WITHDRAW);
+		spin_unlock(&object->lock);
+
+		cond_resched();
+		spin_lock(&cache->object_list_lock);
+	}
+
+	spin_unlock(&cache->object_list_lock);
+
+	/* wait for all extant objects to finish their outstanding operations
+	 * and go away */
+	_debug("wait for finish");
+	wait_event(fscache_clearance_wq,
+		   atomic_read(&cache->thread_usage) == 0);
+	_debug("wait for clearance");
+	wait_event(fscache_clearance_wq,
+		   list_empty(&cache->object_list));
+	_debug("cleared");
+
+	kobject_unregister(&cache->kobj);
+
+	clear_bit(FSCACHE_TAG_RESERVED, &cache->tag->flags);
+	fscache_release_cache_tag(cache->tag);
+	cache->tag = NULL;
+
+	_leave("");
+}
+EXPORT_SYMBOL(fscache_withdraw_cache);
diff --git a/fs/fscache/fsc-cookie.c b/fs/fscache/fsc-cookie.c
new file mode 100644
index 0000000..41e5c4b
--- /dev/null
+++ b/fs/fscache/fsc-cookie.c
@@ -0,0 +1,490 @@
+/* netfs cookie management
+ *
+ * Copyright (C) 2004-2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL COOKIE
+#include <linux/module.h>
+#include <linux/slab.h>
+#include "fsc-internal.h"
+
+struct kmem_cache *fscache_cookie_jar;
+
+static atomic_t fscache_object_debug_id = ATOMIC_INIT(0);
+
+static int fscache_acquire_non_index_cookie(struct fscache_cookie *cookie);
+static int fscache_alloc_object(struct fscache_cache *cache,
+				struct fscache_cookie *cookie);
+static int fscache_attach_object(struct fscache_cookie *cookie,
+				 struct fscache_object *object);
+
+/*
+ * initialise an cookie jar slab element prior to any use
+ */
+void fscache_cookie_init_once(struct kmem_cache *cachep, void *_cookie)
+{
+	struct fscache_cookie *cookie = _cookie;
+
+	memset(cookie, 0, sizeof(*cookie));
+	spin_lock_init(&cookie->lock);
+	INIT_HLIST_HEAD(&cookie->backing_objects);
+}
+
+/*
+ * request a cookie to represent an object (index, datafile, xattr, etc)
+ * - parent specifies the parent object
+ *   - the top level index cookie for each netfs is stored in the fscache_netfs
+ *     struct upon registration
+ * - def points to the definition
+ * - the netfs_data will be passed to the functions pointed to in *def
+ * - all attached caches will be searched to see if they contain this object
+ * - index objects aren't stored on disk until there's a dependent file that
+ *   needs storing
+ * - other objects are stored in a selected cache immediately, and all the
+ *   indices forming the path to it are instantiated if necessary
+ * - we never let on to the netfs about errors
+ *   - we may set a negative cookie pointer, but that's okay
+ */
+struct fscache_cookie *__fscache_acquire_cookie(struct fscache_cookie *parent,
+						const struct fscache_cookie_def *def,
+						void *netfs_data)
+{
+	struct fscache_cookie *cookie;
+
+	BUG_ON(!def);
+
+	_enter("{%s},{%s},%p",
+	       parent ? (char *) parent->def->name : "<no-parent>",
+	       def->name, netfs_data);
+
+	fscache_stat(&fscache_n_acquires);
+
+	/* if there's no parent cookie, then we don't create one here either */
+	if (!parent) {
+		fscache_stat(&fscache_n_acquires_null);
+		_leave(" [no parent]");
+		return NULL;
+	}
+
+	/* validate the definition */
+	BUG_ON(!def->get_key);
+	BUG_ON(!def->name[0]);
+
+	BUG_ON(def->type == FSCACHE_COOKIE_TYPE_INDEX &&
+	       parent->def->type != FSCACHE_COOKIE_TYPE_INDEX);
+
+	/* allocate and initialise a cookie */
+	cookie = kmem_cache_alloc(fscache_cookie_jar, GFP_KERNEL);
+	if (!cookie) {
+		fscache_stat(&fscache_n_acquires_oom);
+		_leave(" [ENOMEM]");
+		return NULL;
+	}
+
+	atomic_set(&cookie->usage, 1);
+	atomic_set(&cookie->n_children, 0);
+
+	atomic_inc(&parent->usage);
+	atomic_inc(&parent->n_children);
+
+	cookie->def		= def;
+	cookie->parent		= parent;
+	cookie->netfs_data	= netfs_data;
+	cookie->flags		= 0;
+
+	switch (cookie->def->type) {
+	case FSCACHE_COOKIE_TYPE_INDEX:
+		fscache_stat(&fscache_n_cookie_index);
+		break;
+	case FSCACHE_COOKIE_TYPE_DATAFILE:
+		fscache_stat(&fscache_n_cookie_data);
+		break;
+	default:
+		fscache_stat(&fscache_n_cookie_special);
+		break;
+	}
+
+	/* if the object is an index then we need do nothing more here - we
+	 * create indices on disk when we need them as an index may exist in
+	 * multiple caches */
+	if (cookie->def->type != FSCACHE_COOKIE_TYPE_INDEX) {
+		if (fscache_acquire_non_index_cookie(cookie) < 0) {
+			atomic_dec(&parent->n_children);
+			__fscache_cookie_put(cookie);
+			fscache_stat(&fscache_n_acquires_nobufs);
+			_leave(" = NULL");
+			return NULL;
+		}
+	}
+
+	fscache_stat(&fscache_n_acquires_ok);
+	_leave(" = %p", cookie);
+	return cookie;
+}
+EXPORT_SYMBOL(__fscache_acquire_cookie);
+
+/*
+ * acquire a non-index cookie
+ * - this must make sure the index chain is instantiated and instantiate the
+ *   object representation too
+ */
+static int fscache_acquire_non_index_cookie(struct fscache_cookie *cookie)
+{
+	struct fscache_object *object;
+	struct fscache_cache *cache;
+	uint64_t i_size;
+	int ret;
+
+	_enter("");
+
+	cookie->flags = 1 << FSCACHE_COOKIE_UNAVAILABLE;
+
+	/* now we need to see whether the backing objects for this cookie yet
+	 * exist, if not there'll be nothing to search */
+	down_read(&fscache_addremove_sem);
+
+	if (list_empty(&fscache_cache_list)) {
+		up_read(&fscache_addremove_sem);
+		_leave(" = 0 [no caches]");
+		return 0;
+	}
+
+	/* select a cache in which to store the object */
+	cache = fscache_select_cache_for_object(cookie->parent);
+	if (!cache) {
+		up_read(&fscache_addremove_sem);
+		fscache_stat(&fscache_n_acquires_no_cache);
+		_leave(" = -ENOMEDIUM [no cache]");
+		return -ENOMEDIUM;
+	}
+
+	_debug("cache %s", cache->tag->name);
+
+	cookie->flags =
+		(1 << FSCACHE_COOKIE_LOOKING_UP) |
+		(1 << FSCACHE_COOKIE_CREATING) |
+		(1 << FSCACHE_COOKIE_NO_DATA_YET);
+
+	/* ask the cache to allocate objects for this cookie and its parent
+	 * chain */
+	ret = fscache_alloc_object(cache, cookie);
+	if (ret < 0) {
+		up_read(&fscache_addremove_sem);
+		_leave(" = %d", ret);
+		return ret;
+	}
+
+	/* initiate the process of looking up all the objects in the chain */
+	cookie->def->get_attr(cookie->netfs_data, &i_size);
+
+	spin_lock(&cookie->lock);
+	if (hlist_empty(&cookie->backing_objects)) {
+		spin_unlock(&cookie->lock);
+		goto unavailable;
+	}
+
+	object = hlist_entry(cookie->backing_objects.first,
+			     struct fscache_object, cookie_link);
+
+	fscache_set_store_limit(object, i_size);
+	fscache_enqueue_object(object);
+	spin_unlock(&cookie->lock);
+
+	/* we may be required to wait for lookup to complete at this point */
+	if (!fscache_defer_lookup) {
+		_debug("non-deferred lookup %p", &cookie->flags);
+		wait_on_bit(&cookie->flags, FSCACHE_COOKIE_LOOKING_UP,
+			    fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+		_debug("complete");
+		if (test_bit(FSCACHE_COOKIE_UNAVAILABLE, &cookie->flags))
+			goto unavailable;
+	}
+
+	up_read(&fscache_addremove_sem);
+	_leave(" = 0 [deferred]");
+	return 0;
+
+unavailable:
+	up_read(&fscache_addremove_sem);
+	_leave(" = -ENOBUFS");
+	return -ENOBUFS;
+}
+
+/*
+ * recursively allocate cache object records for a cookie/cache combination
+ * - caller must be holding the addremove sem
+ */
+static int fscache_alloc_object(struct fscache_cache *cache,
+				struct fscache_cookie *cookie)
+{
+	struct fscache_object *object;
+	struct hlist_node *_n;
+	int ret;
+
+	_enter("%p,%p{%s}", cache, cookie, cookie->def->name);
+
+	spin_lock(&cookie->lock);
+	hlist_for_each_entry(object, _n, &cookie->backing_objects,
+			     cookie_link) {
+		if (object->cache == cache)
+			goto object_already_extant;
+	}
+	spin_unlock(&cookie->lock);
+
+	/* ask the cache to allocate an object (we may end up with duplicate
+	 * objects at this stage, but we sort that out later) */
+	object = cache->ops->alloc_object(cache, cookie);
+	if (IS_ERR(object)) {
+		fscache_stat(&fscache_n_object_no_alloc);
+		ret = PTR_ERR(object);
+		goto error;
+	}
+
+	fscache_stat(&fscache_n_object_alloc);
+
+	object->debug_id = atomic_inc_return(&fscache_object_debug_id);
+
+	_debug("ALLOC OBJ%x: %s {%lx}",
+	       object->debug_id, cookie->def->name, object->events);
+
+	ret = fscache_alloc_object(cache, cookie->parent);
+	if (ret < 0)
+		goto error_put;
+
+	/* only attach if we managed to allocate all we needed, otherwise
+	 * discard the object we just allocated and instead use the one
+	 * attached to the cookie */
+	if (fscache_attach_object(cookie, object) < 0)
+		cache->ops->put_object(object);
+
+	_leave(" = 0");
+	return 0;
+
+object_already_extant:
+	ret = -ENOBUFS;
+	if (object->state >= FSCACHE_OBJECT_DYING) {
+		spin_unlock(&cookie->lock);
+		goto error;
+	}
+	spin_unlock(&cookie->lock);
+	_leave(" = 0 [found]");
+	return 0;
+
+error_put:
+	cache->ops->put_object(object);
+error:
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * attach a cache object to a cookie
+ */
+static int fscache_attach_object(struct fscache_cookie *cookie,
+				 struct fscache_object *object)
+{
+	struct fscache_object *p;
+	struct fscache_cache *cache = object->cache;
+	struct hlist_node *_n;
+	int ret;
+
+	_enter("{%s},{OBJ%x}", cookie->def->name, object->debug_id);
+
+	spin_lock(&cookie->lock);
+
+	/* there may be multiple initial creations of this object, but we only
+	 * want one */
+	ret = -EEXIST;
+	hlist_for_each_entry(p, _n, &cookie->backing_objects, cookie_link) {
+		if (p->cache == object->cache) {
+			if (p->state >= FSCACHE_OBJECT_DYING)
+				ret = -ENOBUFS;
+			goto cant_attach_object;
+		}
+	}
+
+	/* pin the parent object */
+	spin_lock_nested(&cookie->parent->lock, 1);
+	hlist_for_each_entry(p, _n, &cookie->parent->backing_objects,
+			     cookie_link) {
+		if (p->cache == object->cache) {
+			if (p->state >= FSCACHE_OBJECT_DYING) {
+				ret = -ENOBUFS;
+				spin_unlock(&cookie->parent->lock);
+				goto cant_attach_object;
+			}
+			object->parent = p;
+			spin_lock(&p->lock);
+			p->n_children++;
+			spin_unlock(&p->lock);
+			break;
+		}
+	}
+	spin_unlock(&cookie->parent->lock);
+
+	/* attach to the cache's object list */
+	if (list_empty(&object->cache_link)) {
+		spin_lock(&cache->object_list_lock);
+		list_add(&object->cache_link, &cache->object_list);
+		spin_unlock(&cache->object_list_lock);
+	}
+
+	/* attach to the cookie */
+	object->cookie = cookie;
+	atomic_inc(&cookie->usage);
+	hlist_add_head(&object->cookie_link, &cookie->backing_objects);
+	ret = 0;
+
+cant_attach_object:
+	spin_unlock(&cookie->lock);
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * update the index entries backing a cookie
+ */
+void __fscache_update_cookie(struct fscache_cookie *cookie)
+{
+	struct fscache_object *object;
+	struct hlist_node *_p;
+
+	fscache_stat(&fscache_n_updates);
+
+	if (!cookie) {
+		fscache_stat(&fscache_n_updates_null);
+		_leave(" [no cookie]");
+		return;
+	}
+
+	_enter("{%s}", cookie->def->name);
+
+	BUG_ON(!cookie->def->get_aux);
+
+	spin_lock(&cookie->lock);
+
+	/* update the index entry on disk in each cache backing this cookie */
+	hlist_for_each_entry(object, _p,
+			     &cookie->backing_objects, cookie_link) {
+		fscache_raise_event(object, FSCACHE_OBJECT_EV_UPDATE);
+	}
+
+	spin_unlock(&cookie->lock);
+	_leave("");
+}
+EXPORT_SYMBOL(__fscache_update_cookie);
+
+/*
+ * release a cookie back to the cache
+ * - the object will be marked as recyclable on disk if retire is true
+ * - all dependents of this cookie must have already been unregistered
+ *   (indices/files/pages)
+ */
+void __fscache_relinquish_cookie(struct fscache_cookie *cookie, int retire)
+{
+	struct fscache_cache *cache;
+	struct fscache_object *object;
+	unsigned long event;
+
+	fscache_stat(&fscache_n_relinquishes);
+
+	if (!cookie) {
+		fscache_stat(&fscache_n_relinquishes_null);
+		_leave(" [no cookie]");
+		return;
+	}
+
+	_enter("%p{%s,%p},%d",
+	       cookie, cookie->def->name, cookie->netfs_data, retire);
+
+	if (atomic_read(&cookie->n_children) != 0) {
+		printk(KERN_ERR "FS-Cache: Cookie '%s' still has children\n",
+		       cookie->def->name);
+		BUG();
+	}
+
+	/* wait for the cookie to finish being instantiated (or to fail) */
+	if (test_bit(FSCACHE_COOKIE_CREATING, &cookie->flags)) {
+		fscache_stat(&fscache_n_relinquishes_waitcrt);
+		wait_on_bit(&cookie->flags, FSCACHE_COOKIE_CREATING,
+			    fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+	}
+
+	event = retire ? FSCACHE_OBJECT_EV_RETIRE : FSCACHE_OBJECT_EV_RELEASE;
+
+	/* detach pointers back to the netfs */
+	spin_lock(&cookie->lock);
+
+	cookie->netfs_data	= NULL;
+	cookie->def		= NULL;
+
+	/* break links with all the active objects */
+	while (!hlist_empty(&cookie->backing_objects)) {
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object,
+				     cookie_link);
+
+		_debug("RELEASE OBJ%x", object->debug_id);
+
+		/* detach each cache object from the object cookie */
+		spin_lock(&object->lock);
+		hlist_del_init(&object->cookie_link);
+
+		cache = object->cache;
+		object->cookie = NULL;
+		fscache_raise_event(object, event);
+		spin_unlock(&object->lock);
+
+		if (atomic_dec_and_test(&cookie->usage))
+			/* the cookie refcount shouldn't be reduced to 0 yet */
+			BUG();
+	}
+
+	spin_unlock(&cookie->lock);
+
+	if (cookie->parent) {
+		ASSERTCMP(atomic_read(&cookie->parent->usage), >, 0);
+		ASSERTCMP(atomic_read(&cookie->parent->n_children), >, 0);
+		atomic_dec(&cookie->parent->n_children);
+	}
+
+	/* finally dispose of the cookie */
+	ASSERTCMP(atomic_read(&cookie->usage), >, 0);
+	fscache_cookie_put(cookie);
+
+	_leave("");
+}
+EXPORT_SYMBOL(__fscache_relinquish_cookie);
+
+/*
+ * destroy a cookie
+ */
+void __fscache_cookie_put(struct fscache_cookie *cookie)
+{
+	struct fscache_cookie *parent;
+
+	_enter("%p", cookie);
+
+	for (;;) {
+		_debug("FREE COOKIE %p", cookie);
+		parent = cookie->parent;
+		BUG_ON(!hlist_empty(&cookie->backing_objects));
+		kmem_cache_free(fscache_cookie_jar, cookie);
+
+		if (!parent)
+			break;
+
+		cookie = parent;
+		BUG_ON(atomic_read(&cookie->usage) <= 0);
+		if (!atomic_dec_and_test(&cookie->usage))
+			break;
+	}
+
+	_leave("");
+}
diff --git a/fs/fscache/fsc-fsdef.c b/fs/fscache/fsc-fsdef.c
new file mode 100644
index 0000000..e52fc61
--- /dev/null
+++ b/fs/fscache/fsc-fsdef.c
@@ -0,0 +1,112 @@
+/* Filesystem index definition
+ *
+ * Copyright (C) 2004-2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL CACHE
+#include <linux/module.h>
+#include "fsc-internal.h"
+
+static uint16_t fscache_fsdef_netfs_get_key(const void *cookie_netfs_data,
+					    void *buffer, uint16_t bufmax);
+
+static uint16_t fscache_fsdef_netfs_get_aux(const void *cookie_netfs_data,
+					    void *buffer, uint16_t bufmax);
+
+static
+enum fscache_checkaux fscache_fsdef_netfs_check_aux(void *cookie_netfs_data,
+						    const void *data,
+						    uint16_t datalen);
+
+struct fscache_cookie_def fscache_fsdef_netfs_def = {
+	.name		= "FSDEF.netfs",
+	.type		= FSCACHE_COOKIE_TYPE_INDEX,
+	.get_key	= fscache_fsdef_netfs_get_key,
+	.get_aux	= fscache_fsdef_netfs_get_aux,
+	.check_aux	= fscache_fsdef_netfs_check_aux,
+};
+
+static struct fscache_cookie_def fscache_fsdef_index_def = {
+	.name		= ".FS-Cache",
+	.type		= FSCACHE_COOKIE_TYPE_INDEX,
+};
+
+struct fscache_cookie fscache_fsdef_index = {
+	.usage		= ATOMIC_INIT(1),
+	.lock		= __SPIN_LOCK_UNLOCKED(fscache_fsdef_index.lock),
+	.backing_objects = HLIST_HEAD_INIT,
+	.def		= &fscache_fsdef_index_def,
+};
+EXPORT_SYMBOL(fscache_fsdef_index);
+
+/*
+ * get the key data for an FSDEF index record
+ */
+static uint16_t fscache_fsdef_netfs_get_key(const void *cookie_netfs_data,
+					    void *buffer, uint16_t bufmax)
+{
+	const struct fscache_netfs *netfs = cookie_netfs_data;
+	unsigned klen;
+
+	_enter("{%s.%u},", netfs->name, netfs->version);
+
+	klen = strlen(netfs->name);
+	if (klen > bufmax)
+		return 0;
+
+	memcpy(buffer, netfs->name, klen);
+	return klen;
+}
+
+/*
+ * get the auxilliary data for an FSDEF index record
+ */
+static uint16_t fscache_fsdef_netfs_get_aux(const void *cookie_netfs_data,
+					    void *buffer, uint16_t bufmax)
+{
+	const struct fscache_netfs *netfs = cookie_netfs_data;
+	unsigned dlen;
+
+	_enter("{%s.%u},", netfs->name, netfs->version);
+
+	dlen = sizeof(uint32_t);
+	if (dlen > bufmax)
+		return 0;
+
+	memcpy(buffer, &netfs->version, dlen);
+	return dlen;
+}
+
+/*
+ * check that the version stored in the auxilliary data is correct
+ */
+static
+enum fscache_checkaux fscache_fsdef_netfs_check_aux(void *cookie_netfs_data,
+						    const void *data,
+						    uint16_t datalen)
+{
+	struct fscache_netfs *netfs = cookie_netfs_data;
+	uint32_t version;
+
+	_enter("{%s},,%hu", netfs->name, datalen);
+
+	if (datalen != sizeof(version)) {
+		_leave(" = OBSOLETE [dl=%d v=%zu]", datalen, sizeof(version));
+		return FSCACHE_CHECKAUX_OBSOLETE;
+	}
+
+	memcpy(&version, data, sizeof(version));
+	if (version != netfs->version) {
+		_leave(" = OBSOLETE [ver=%x net=%x]", version, netfs->version);
+		return FSCACHE_CHECKAUX_OBSOLETE;
+	}
+
+	_leave(" = OKAY");
+	return FSCACHE_CHECKAUX_OKAY;
+}
diff --git a/fs/fscache/fsc-internal.h b/fs/fscache/fsc-internal.h
new file mode 100644
index 0000000..295dc95
--- /dev/null
+++ b/fs/fscache/fsc-internal.h
@@ -0,0 +1,376 @@
+/* Internal definitions for FS-Cache
+ *
+ * Copyright (C) 2004-2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+/*
+ * Lock order, in the order in which multiple locks should be obtained:
+ * - fscache_addremove_sem
+ * - cookie->lock
+ * - cookie->parent->lock
+ * - cache->object_list_lock
+ * - object->lock
+ * - object->parent->lock
+ * - fscache_thread_lock
+ *
+ */
+
+#include <linux/fscache-cache.h>
+#include <linux/sched.h>
+
+#define FSCACHE_MIN_THREADS	4
+#define FSCACHE_MAX_THREADS	32
+
+/*
+ * fsc-cache.c
+ */
+extern struct list_head fscache_cache_list;
+extern struct rw_semaphore fscache_addremove_sem;
+extern wait_queue_head_t fscache_clearance_wq;
+
+extern struct fscache_cache *fscache_select_cache_for_object(
+	struct fscache_cookie *);
+
+/*
+ * fsc-cookie.c
+ */
+extern struct kmem_cache *fscache_cookie_jar;
+
+extern void fscache_cookie_init_once(struct kmem_cache *, void *);
+extern void __fscache_cookie_put(struct fscache_cookie *);
+
+/*
+ * fsc-fsdef.c
+ */
+extern struct fscache_cookie fscache_fsdef_index;
+extern struct fscache_cookie_def fscache_fsdef_netfs_def;
+
+/*
+ * fsc-main.c
+ */
+extern unsigned fscache_defer_lookup;
+extern unsigned fscache_defer_create;
+extern unsigned fscache_debug;
+extern struct kset fscache_kset;
+
+extern int fscache_wait_bit(void *);
+extern int fscache_wait_bit_interruptible(void *);
+
+/*
+ * fsc-object.c
+ */
+extern void fscache_object_state_machine(struct fscache_object *);
+extern void fscache_withdrawing_object(struct fscache_cache *,
+				       struct fscache_object *);
+
+/*
+ * fsc-stats.c
+ */
+#ifdef CONFIG_FSCACHE_STATS
+extern atomic_t fscache_n_ops_processed[FSCACHE_MAX_THREADS];
+extern atomic_t fscache_n_objs_processed[FSCACHE_MAX_THREADS];
+
+extern atomic_t fscache_n_op_pend;
+extern atomic_t fscache_n_op_run;
+extern atomic_t fscache_n_op_enqueue;
+extern atomic_t fscache_n_op_requeue;
+extern atomic_t fscache_n_op_release;
+
+extern atomic_t fscache_n_attr_changed;
+extern atomic_t fscache_n_attr_changed_ok;
+extern atomic_t fscache_n_attr_changed_nobufs;
+extern atomic_t fscache_n_attr_changed_nomem;
+extern atomic_t fscache_n_attr_changed_calls;
+
+extern atomic_t fscache_n_allocs;
+extern atomic_t fscache_n_allocs_ok;
+extern atomic_t fscache_n_allocs_wait;
+extern atomic_t fscache_n_allocs_nobufs;
+extern atomic_t fscache_n_alloc_ops;
+extern atomic_t fscache_n_alloc_op_waits;
+
+extern atomic_t fscache_n_retrievals;
+extern atomic_t fscache_n_retrievals_ok;
+extern atomic_t fscache_n_retrievals_wait;
+extern atomic_t fscache_n_retrievals_nodata;
+extern atomic_t fscache_n_retrievals_nobufs;
+extern atomic_t fscache_n_retrievals_intr;
+extern atomic_t fscache_n_retrievals_nomem;
+extern atomic_t fscache_n_retrieval_ops;
+extern atomic_t fscache_n_retrieval_op_waits;
+
+extern atomic_t fscache_n_stores;
+extern atomic_t fscache_n_stores_ok;
+extern atomic_t fscache_n_stores_again;
+extern atomic_t fscache_n_stores_nobufs;
+extern atomic_t fscache_n_stores_oom;
+extern atomic_t fscache_n_store_ops;
+extern atomic_t fscache_n_store_calls;
+
+extern atomic_t fscache_n_marks;
+extern atomic_t fscache_n_uncaches;
+
+extern atomic_t fscache_n_acquires;
+extern atomic_t fscache_n_acquires_null;
+extern atomic_t fscache_n_acquires_no_cache;
+extern atomic_t fscache_n_acquires_ok;
+extern atomic_t fscache_n_acquires_nobufs;
+extern atomic_t fscache_n_acquires_oom;
+
+extern atomic_t fscache_n_updates;
+extern atomic_t fscache_n_updates_null;
+extern atomic_t fscache_n_updates_run;
+
+extern atomic_t fscache_n_relinquishes;
+extern atomic_t fscache_n_relinquishes_null;
+extern atomic_t fscache_n_relinquishes_waitcrt;
+
+extern atomic_t fscache_n_cookie_index;
+extern atomic_t fscache_n_cookie_data;
+extern atomic_t fscache_n_cookie_special;
+
+extern atomic_t fscache_n_object_alloc;
+extern atomic_t fscache_n_object_no_alloc;
+extern atomic_t fscache_n_object_lookups;
+extern atomic_t fscache_n_object_lookups_negative;
+extern atomic_t fscache_n_object_lookups_positive;
+extern atomic_t fscache_n_object_created;
+extern atomic_t fscache_n_object_avail;
+extern atomic_t fscache_n_object_boosted;
+
+static inline void fscache_stat(atomic_t *stat)
+{
+	atomic_inc(stat);
+}
+#else
+
+#define fscache_stat(stat) do {} while (0)
+#endif
+
+#ifdef CONFIG_FSCACHE_HISTOGRAM
+extern atomic_t fscache_obj_instantiate_histogram[HZ];
+extern atomic_t fscache_objs_histogram[HZ];
+extern atomic_t fscache_ops_histogram[HZ];
+extern atomic_t fscache_retrieval_delay_histogram[HZ];
+extern atomic_t fscache_retrieval_histogram[HZ];
+
+static inline void fscache_hist(atomic_t histogram[], unsigned long start_jif)
+{
+	unsigned long jif = jiffies - start_jif;
+	if (jif >= HZ)
+		jif = HZ - 1;
+	atomic_inc(&histogram[jif]);
+}
+
+#else
+#define fscache_hist(hist, start_jif) do {} while (0)
+#endif
+
+#ifdef CONFIG_FSCACHE_PROC
+extern int __init fscache_proc_init(void);
+extern void fscache_proc_cleanup(void);
+#else
+#define fscache_proc_init()	(0)
+#define fscache_proc_cleanup()	do {} while (0)
+#endif
+
+/*
+ * fsc-threads.c
+ */
+extern void fscache_start_operations(struct fscache_object *);
+extern void fscache_enqueue_object(struct fscache_object *);
+extern void fscache_enqueue_dependents(struct fscache_object *);
+extern void fscache_dequeue_object(struct fscache_object *);
+extern void fscache_boost_object(struct fscache_object *);
+extern int fscache_init_threads(void);
+extern void fscache_kill_threads(void);
+
+/*
+ * raise an event on an object
+ * - if the event is not masked for that object, then the object is
+ *   queued for attention by the thread pool.
+ */
+static inline void fscache_raise_event(struct fscache_object *object,
+				       unsigned event)
+{
+	if (!test_and_set_bit(event, &object->events) &&
+	    test_bit(event, &object->event_mask))
+		fscache_enqueue_object(object);
+}
+
+/*
+ * drop a reference to a cookie
+ */
+static inline void fscache_cookie_put(struct fscache_cookie *cookie)
+{
+	BUG_ON(atomic_read(&cookie->usage) <= 0);
+	if (atomic_dec_and_test(&cookie->usage))
+		__fscache_cookie_put(cookie);
+}
+
+/*
+ * get an extra reference to a netfs retrieval context
+ */
+static inline
+void *fscache_get_context(struct fscache_cookie *cookie, void *context)
+{
+	if (cookie->def->get_context)
+		cookie->def->get_context(cookie->netfs_data, context);
+	return context;
+}
+
+/*
+ * release a reference to a netfs retrieval context
+ */
+static inline
+void fscache_put_context(struct fscache_cookie *cookie, void *context)
+{
+	if (cookie->def->put_context)
+		cookie->def->put_context(cookie->netfs_data, context);
+}
+
+/*****************************************************************************/
+/*
+ * debug tracing
+ */
+#define dbgprintk(FMT, ...) \
+	printk(KERN_DEBUG "[%-6.6s] "FMT"\n", current->comm, ##__VA_ARGS__)
+
+/* make sure we maintain the format strings, even when debugging is disabled */
+static inline __attribute__((format(printf, 1, 2)))
+void _dbprintk(const char *fmt, ...)
+{
+}
+
+#define kenter(FMT, ...) dbgprintk("==> %s("FMT")", __FUNCTION__, ##__VA_ARGS__)
+#define kleave(FMT, ...) dbgprintk("<== %s()"FMT"", __FUNCTION__, ##__VA_ARGS__)
+#define kdebug(FMT, ...) dbgprintk(FMT, ##__VA_ARGS__)
+
+#define kjournal(FMT, ...) _dbprintk(FMT, ##__VA_ARGS__)
+
+#ifdef __KDEBUG
+#define _enter(FMT, ...) kenter(FMT, ##__VA_ARGS__)
+#define _leave(FMT, ...) kleave(FMT, ##__VA_ARGS__)
+#define _debug(FMT, ...) kdebug(FMT, ##__VA_ARGS__)
+
+#elif defined(CONFIG_FSCACHE_DEBUG)
+#define _enter(FMT, ...)			\
+do {						\
+	if (__do_kdebug(ENTER))			\
+		kenter(FMT, ##__VA_ARGS__);	\
+} while (0)
+
+#define _leave(FMT, ...)			\
+do {						\
+	if (__do_kdebug(LEAVE))			\
+		kleave(FMT, ##__VA_ARGS__);	\
+} while (0)
+
+#define _debug(FMT, ...)			\
+do {						\
+	if (__do_kdebug(DEBUG))			\
+		kdebug(FMT, ##__VA_ARGS__);	\
+} while (0)
+
+#else
+#define _enter(FMT, ...) _dbprintk("==> %s("FMT")", __FUNCTION__, ##__VA_ARGS__)
+#define _leave(FMT, ...) _dbprintk("<== %s()"FMT"", __FUNCTION__, ##__VA_ARGS__)
+#define _debug(FMT, ...) _dbprintk(FMT, ##__VA_ARGS__)
+#endif
+
+/*
+ * determine whether a particular optional debugging point should be logged
+ * - we need to go through three steps to persuade cpp to correctly join the
+ *   shorthand in FSCACHE_DEBUG_LEVEL with its prefix
+ */
+#define ____do_kdebug(LEVEL, POINT) \
+	unlikely((fscache_debug & \
+		  (FSCACHE_POINT_##POINT << (FSCACHE_DEBUG_ ## LEVEL * 3))))
+#define ___do_kdebug(LEVEL, POINT) \
+	____do_kdebug(LEVEL, POINT)
+#define __do_kdebug(POINT) \
+	___do_kdebug(FSCACHE_DEBUG_LEVEL, POINT)
+
+#define FSCACHE_DEBUG_CACHE	0
+#define FSCACHE_DEBUG_COOKIE	1
+#define FSCACHE_DEBUG_PAGE	2
+#define FSCACHE_DEBUG_THREAD	3
+
+#define FSCACHE_POINT_ENTER	1
+#define FSCACHE_POINT_LEAVE	2
+#define FSCACHE_POINT_DEBUG	4
+
+#ifndef FSCACHE_DEBUG_LEVEL
+#define FSCACHE_DEBUG_LEVEL CACHE
+#endif
+
+/*
+ * assertions
+ */
+#if 1 /* defined(__KDEBUGALL) */
+
+#define ASSERT(X)							\
+do {									\
+	if (unlikely(!(X))) {						\
+		printk(KERN_ERR "\n");					\
+		printk(KERN_ERR "FS-Cache: Assertion failed\n");	\
+		BUG();							\
+	}								\
+} while (0)
+
+#define ASSERTCMP(X, OP, Y)						\
+do {									\
+	if (unlikely(!((X) OP (Y)))) {					\
+		printk(KERN_ERR "\n");					\
+		printk(KERN_ERR "FS-Cache: Assertion failed\n");	\
+		printk(KERN_ERR "%lx " #OP " %lx is false\n",		\
+		       (unsigned long)(X), (unsigned long)(Y));		\
+		BUG();							\
+	}								\
+} while (0)
+
+#define ASSERTIF(C, X)							\
+do {									\
+	if (unlikely((C) && !(X))) {					\
+		printk(KERN_ERR "\n");					\
+		printk(KERN_ERR "FS-Cache: Assertion failed\n");	\
+		BUG();							\
+	}								\
+} while (0)
+
+#define ASSERTIFCMP(C, X, OP, Y)					\
+do {									\
+	if (unlikely((C) && !((X) OP (Y)))) {				\
+		printk(KERN_ERR "\n");					\
+		printk(KERN_ERR "FS-Cache: Assertion failed\n");	\
+		printk(KERN_ERR "%lx " #OP " %lx is false\n",		\
+		       (unsigned long)(X), (unsigned long)(Y));		\
+		BUG();							\
+	}								\
+} while (0)
+
+#else
+
+#define ASSERT(X)				\
+do {						\
+} while (0)
+
+#define ASSERTCMP(X, OP, Y)			\
+do {						\
+} while (0)
+
+#define ASSERTIF(C, X)				\
+do {						\
+} while (0)
+
+#define ASSERTIFCMP(C, X, OP, Y)		\
+do {						\
+} while (0)
+
+#endif /* assert or not */
diff --git a/fs/fscache/fsc-main.c b/fs/fscache/fsc-main.c
new file mode 100644
index 0000000..8ccced8
--- /dev/null
+++ b/fs/fscache/fsc-main.c
@@ -0,0 +1,133 @@
+/* General filesystem local caching manager
+ *
+ * Copyright (C) 2004-2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL CACHE
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include "fsc-internal.h"
+
+MODULE_DESCRIPTION("FS Cache Manager");
+MODULE_AUTHOR("Red Hat, Inc.");
+MODULE_LICENSE("GPL");
+
+unsigned fscache_defer_lookup = 1;
+module_param_named(defer_lookup, fscache_defer_lookup, uint,
+		   S_IWUSR | S_IRUGO);
+MODULE_PARM_DESC(fscache_defer_lookup,
+		 "Defer cookie lookup to background thread");
+
+unsigned fscache_defer_create = 1;
+module_param_named(defer_create, fscache_defer_create, uint,
+		   S_IWUSR | S_IRUGO);
+MODULE_PARM_DESC(fscache_defer_create,
+		 "Defer cookie creation to background thread");
+
+unsigned fscache_debug;
+module_param_named(debug, fscache_debug, uint,
+		   S_IWUSR | S_IRUGO);
+MODULE_PARM_DESC(fscache_debug,
+		 "FS-Cache debugging mask");
+
+static struct sysfs_ops fscache_sysfs_ops = {
+};
+
+static struct kobj_type fscache_ktype = {
+	.sysfs_ops	= &fscache_sysfs_ops,
+};
+
+struct kset fscache_kset = {
+	.kobj.k_name	= "fscache",
+	.kobj.kset	= &fs_subsys,
+	.ktype		= &fscache_ktype,
+};
+
+/*
+ * initialise the fs caching module
+ */
+static int __init fscache_init(void)
+{
+	int ret;
+
+	ret = fscache_proc_init();
+	if (ret < 0)
+		goto error_proc;
+
+	ret = fscache_init_threads();
+	if (ret < 0)
+		goto error_init_threads;
+
+	fscache_cookie_jar = kmem_cache_create("fscache_cookie_jar",
+					       sizeof(struct fscache_cookie),
+					       0,
+					       0,
+					       fscache_cookie_init_once);
+	if (!fscache_cookie_jar) {
+		printk(KERN_NOTICE
+		       "FS-Cache: Failed to allocate a cookie jar\n");
+		ret = -ENOMEM;
+		goto error_cookie_jar;
+	}
+
+	ret = kset_register(&fscache_kset);
+	if (ret < 0)
+		goto error_kset;
+
+	printk(KERN_NOTICE "FS-Cache: Loaded\n");
+	return 0;
+
+error_kset:
+	kmem_cache_destroy(fscache_cookie_jar);
+error_cookie_jar:
+	fscache_kill_threads();
+error_init_threads:
+	fscache_proc_cleanup();
+error_proc:
+	return ret;
+}
+
+fs_initcall(fscache_init);
+
+/*
+ * clean up on module removal
+ */
+static void __exit fscache_exit(void)
+{
+	_enter("");
+
+	kset_unregister(&fscache_kset);
+	kmem_cache_destroy(fscache_cookie_jar);
+	fscache_kill_threads();
+	fscache_proc_cleanup();
+	printk(KERN_NOTICE "FS-Cache: Unloaded\n");
+}
+
+module_exit(fscache_exit);
+
+/*
+ * wait_on_bit() sleep function for uninterruptible waiting
+ */
+int fscache_wait_bit(void *flags)
+{
+	schedule();
+	return 0;
+}
+
+/*
+ * wait_on_bit() sleep function for interruptible waiting
+ */
+int fscache_wait_bit_interruptible(void *flags)
+{
+	schedule();
+	return signal_pending(current);
+}
diff --git a/fs/fscache/fsc-manage.c b/fs/fscache/fsc-manage.c
new file mode 100644
index 0000000..8b2802a
--- /dev/null
+++ b/fs/fscache/fsc-manage.c
@@ -0,0 +1,257 @@
+/* Manage cache objects
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL PAGE
+#include <linux/module.h>
+#include <linux/fscache-cache.h>
+#include <linux/buffer_head.h>
+#include <linux/pagevec.h>
+#include "fsc-internal.h"
+
+/*
+ * wait for a slow operation to float to the front of the op queue, then
+ * effectively suspend processing on this object till the calling process
+ * context has completed the operation
+ * - this prevents kfscached being hogged by a really slow op
+ */
+#if 0
+static int fscache_slow_op(struct fscache_object *object,
+			   struct fscache_operation *op)
+{
+	int ret;
+
+	kenter("");
+
+	if (test_and_clear_bit(FSCACHE_OP_WAITING, &op->flags))
+		wake_up_bit(&op->flags, FSCACHE_OP_WAITING);
+
+	if (test_bit(FSCACHE_OP_IN_PROGRESS, &op->flags))
+		ret = -EINPROGRESS;
+	else
+		ret = 0;
+
+	kleave(" = %d", ret);
+	return ret;
+}
+#endif
+
+/*
+ * reserve space for an object
+ */
+int __fscache_reserve_space(struct fscache_cookie *cookie, loff_t size)
+{
+#if 0
+	struct fscache_operation *op;
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p,%llu,", cookie, size);
+
+	ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
+
+	op = kzalloc(sizeof(*op), GFP_KERNEL);
+	if (!op) {
+		_leave(" = -ENOMEM");
+		return -ENOMEM;
+	}
+
+	op->processor = fscache_slow_op;
+	__set_bit(FSCACHE_OP_WAITING, &op->flags);
+	__set_bit(FSCACHE_OP_IN_PROGRESS, &op->flags);
+
+	spin_lock(&cookie->lock);
+
+	ret = -ENOBUFS;
+	if (hlist_empty(&cookie->backing_objects))
+		goto error;
+	object = hlist_entry(cookie->backing_objects.first,
+			     struct fscache_object, cookie_link);
+	if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
+		goto error;
+
+	ret = -EOPNOTSUPP;
+	if (!object->cache->ops->reserve_space)
+		goto error;
+
+	/* prevent the file from being uncached whilst we access it and exclude
+	 * write attempts on pages
+	 */
+	ret = -ENOBUFS;
+	if (!object->cache->ops->grab_object(object))
+		goto error;
+	list_add_tail(&op->link, &object->operations);
+
+	if (!test_and_set_bit(FSCACHE_OBJECT_BUSY, &object->flags))
+		queue_work(fscache_workqueue, &object->work);
+
+	spin_unlock(&cookie->lock);
+
+	/* wait for the operation queue to be whittled down */
+	if (wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
+			fscache_wait_bit_interruptible, TASK_INTERRUPTIBLE
+			) < 0) {
+		/* we were interrupted */
+		spin_lock(&cookie->lock);
+		if (!test_and_clear_bit(FSCACHE_OP_WAITING, &op->flags))
+			queue_work(fscache_workqueue, &object->work);
+		if (!test_and_clear_bit(FSCACHE_OP_IN_PROGRESS, &op->flags))
+			BUG();
+		ret = -ERESTARTSYS;
+		goto out_locked;
+	}
+
+	/* okay - we're now in a position to make a reservation */
+	ret = -ENOBUFS;
+	if (test_bit(FSCACHE_OBJECT_ABORTED, &object->flags) ||
+	    test_bit(FSCACHE_IOERROR, &object->cache->flags))
+		goto out;
+
+	ret = -ERESTARTSYS;
+	if (signal_pending(current))
+		goto out;
+
+	/* ask the cache to honour the operation */
+	ret = object->cache->ops->reserve_space(object, size);
+
+out:
+	spin_lock(&cookie->lock);
+out_locked:
+	if (!test_and_clear_bit(FSCACHE_OP_IN_PROGRESS, &op->flags))
+		BUG();
+	queue_work(fscache_workqueue, &object->work);
+	spin_unlock(&cookie->lock);
+
+	kleave(" = %d", ret);
+	return ret;
+
+error:
+	spin_unlock(&cookie->lock);
+	kfree(op);
+	kleave(" = %d", ret);
+	return ret;
+#endif
+	kleave(" = -ENOBUFS");
+	return -ENOBUFS;
+}
+EXPORT_SYMBOL(__fscache_reserve_space);
+
+#if 0
+/*
+ * pin an object into the cache
+ */
+int __fscache_pin_cookie(struct fscache_cookie *cookie)
+{
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p", cookie);
+
+	if (hlist_empty(&cookie->backing_objects)) {
+		_leave(" = -ENOBUFS");
+		return -ENOBUFS;
+	}
+
+	/* not supposed to use this for indexes */
+	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
+
+	/* prevent the file from being uncached whilst we access it and exclude
+	 * read and write attempts on pages
+	 */
+	down_write(&cookie->sem);
+
+	ret = -ENOBUFS;
+	if (!hlist_empty(&cookie->backing_objects)) {
+		/* get and pin the backing object */
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
+			goto out;
+
+		if (!object->cache->ops->pin_object) {
+			ret = -EOPNOTSUPP;
+			goto out;
+		}
+
+		/* prevent the cache from being withdrawn */
+		if (fscache_operation_lock(object)) {
+			if (object->cache->ops->grab_object(object)) {
+				/* ask the cache to honour the operation */
+				ret = object->cache->ops->pin_object(object);
+
+				object->cache->ops->put_object(object);
+			}
+
+			fscache_operation_unlock(object);
+		}
+	}
+
+out:
+	up_write(&cookie->sem);
+	_leave(" = %d", ret);
+	return ret;
+}
+EXPORT_SYMBOL(__fscache_pin_cookie);
+
+/*
+ * unpin an object into the cache
+ */
+void __fscache_unpin_cookie(struct fscache_cookie *cookie)
+{
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p", cookie);
+
+	if (hlist_empty(&cookie->backing_objects)) {
+		_leave(" [no obj]");
+		return;
+	}
+
+	/* not supposed to use this for indexes */
+	BUG_ON(cookie->def->type == FSCACHE_COOKIE_TYPE_INDEX);
+
+	/* prevent the file from being uncached whilst we access it and exclude
+	 * read and write attempts on pages
+	 */
+	down_write(&cookie->sem);
+
+	ret = -ENOBUFS;
+	if (!hlist_empty(&cookie->backing_objects)) {
+		/* get and unpin the backing object */
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+
+		if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
+			goto out;
+
+		if (!object->cache->ops->unpin_object)
+			goto out;
+
+		/* prevent the cache from being withdrawn */
+		if (fscache_operation_lock(object)) {
+			if (object->cache->ops->grab_object(object)) {
+				/* ask the cache to honour the operation */
+				object->cache->ops->unpin_object(object);
+
+				object->cache->ops->put_object(object);
+			}
+
+			fscache_operation_unlock(object);
+		}
+	}
+
+out:
+	up_write(&cookie->sem);
+	_leave("");
+}
+EXPORT_SYMBOL(__fscache_unpin_cookie);
+#endif
diff --git a/fs/fscache/fsc-object.c b/fs/fscache/fsc-object.c
new file mode 100644
index 0000000..796637a
--- /dev/null
+++ b/fs/fscache/fsc-object.c
@@ -0,0 +1,583 @@
+/* FS-Cache object state machine handler
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL COOKIE
+#include <linux/module.h>
+#include "fsc-internal.h"
+
+const char *fscache_object_states[] = {
+	[FSCACHE_OBJECT_INIT]		= "OBJECT_INIT",
+	[FSCACHE_OBJECT_LOOKING_UP]	= "OBJECT_LOOKING_UP",
+	[FSCACHE_OBJECT_CREATING]	= "OBJECT_CREATING",
+	[FSCACHE_OBJECT_AVAILABLE]	= "OBJECT_AVAILABLE",
+	[FSCACHE_OBJECT_ACTIVE]		= "OBJECT_ACTIVE",
+	[FSCACHE_OBJECT_UPDATING]	= "OBJECT_UPDATING",
+	[FSCACHE_OBJECT_DYING]		= "OBJECT_DYING",
+	[FSCACHE_OBJECT_LC_DYING]	= "OBJECT_LC_DYING",
+	[FSCACHE_OBJECT_ABORT_INIT]	= "OBJECT_ABORT_INIT",
+	[FSCACHE_OBJECT_RELEASING]	= "OBJECT_RELEASING",
+	[FSCACHE_OBJECT_RECYCLING]	= "OBJECT_RECYCLING",
+	[FSCACHE_OBJECT_WITHDRAWING]	= "OBJECT_WITHDRAWING",
+	[FSCACHE_OBJECT_DEAD]		= "OBJECT_DEAD",
+};
+EXPORT_SYMBOL(fscache_object_states);
+
+static void fscache_check_object_parent(struct fscache_object *);
+static void fscache_lookup_object(struct fscache_object *);
+static void fscache_object_available(struct fscache_object *);
+static void fscache_release_object(struct fscache_object *);
+static void fscache_withdraw_object(struct fscache_object *);
+
+/*
+ * object state machine processor
+ * - initiates parent lookup
+ * - does object lookup
+ * - does object creation
+ * - does object recycling and retirement
+ * - does object withdrawal
+ */
+void fscache_object_state_machine(struct fscache_object *object)
+{
+	ASSERT(object != NULL);
+
+	_enter("{OBJ%x,%s,%lx}",
+	       object->debug_id, fscache_object_states[object->state],
+	       object->events);
+
+	switch (object->state) {
+	case FSCACHE_OBJECT_INIT:
+		object->event_mask =
+			ULONG_MAX & ~(1 << FSCACHE_OBJECT_EV_CLEARED);
+		fscache_check_object_parent(object);
+		goto done;
+
+	case FSCACHE_OBJECT_LOOKING_UP:
+		fscache_lookup_object(object);
+		goto lookup_transit;
+
+	case FSCACHE_OBJECT_CREATING:
+		fscache_lookup_object(object);
+		goto lookup_transit;
+
+	case FSCACHE_OBJECT_AVAILABLE:
+		fscache_object_available(object);
+		goto active_transit;
+
+	case FSCACHE_OBJECT_ACTIVE:
+		goto active_transit;
+
+	case FSCACHE_OBJECT_UPDATING:
+		clear_bit(FSCACHE_OBJECT_EV_UPDATE, &object->events);
+		fscache_stat(&fscache_n_updates_run);
+		object->cache->ops->update_object(object);
+		goto active_transit;
+
+		/* object started dying during lookup */
+	case FSCACHE_OBJECT_LC_DYING:
+		object->event_mask &= ~(1 << FSCACHE_OBJECT_EV_UPDATE);
+		object->cache->ops->lookup_complete(object);
+
+		object->state = FSCACHE_OBJECT_DYING;
+		spin_lock(&object->lock);
+		if (test_and_clear_bit(FSCACHE_COOKIE_CREATING,
+				       &object->cookie->flags))
+			wake_up_bit(&object->cookie->flags,
+				    FSCACHE_COOKIE_CREATING);
+		spin_unlock(&object->lock);
+
+		/* wait for completion of active accessors */
+	case FSCACHE_OBJECT_DYING:
+	dying:
+		clear_bit(FSCACHE_OBJECT_EV_CLEARED, &object->events);
+		spin_lock(&object->lock);
+		_debug("dying OBJ%x {%d,%d}",
+		       object->debug_id, object->n_ops, object->n_children);
+		if (object->n_ops == 0 && object->n_children == 0) {
+			object->event_mask &=
+				~(1 << FSCACHE_OBJECT_EV_CLEARED);
+			object->event_mask |=
+				(1 << FSCACHE_OBJECT_EV_WITHDRAW) |
+				(1 << FSCACHE_OBJECT_EV_RETIRE) |
+				(1 << FSCACHE_OBJECT_EV_RELEASE) |
+				(1 << FSCACHE_OBJECT_EV_ERROR);
+		} else {
+			object->event_mask &=
+				~((1 << FSCACHE_OBJECT_EV_WITHDRAW) |
+				  (1 << FSCACHE_OBJECT_EV_RETIRE) |
+				  (1 << FSCACHE_OBJECT_EV_RELEASE) |
+				  (1 << FSCACHE_OBJECT_EV_ERROR));
+			object->event_mask |=
+				1 << FSCACHE_OBJECT_EV_CLEARED;
+		}
+		spin_unlock(&object->lock);
+		fscache_enqueue_dependents(object);
+		goto terminal_transit;
+
+		/* handle an abort during the init state */
+	case FSCACHE_OBJECT_ABORT_INIT:
+		_debug("handle abort init %lx", object->events);
+		object->event_mask &= ~(1 << FSCACHE_OBJECT_EV_UPDATE);
+		fscache_dequeue_object(object);
+
+		object->state = FSCACHE_OBJECT_DYING;
+		spin_lock(&object->lock);
+		if (test_and_clear_bit(FSCACHE_COOKIE_CREATING,
+				       &object->cookie->flags))
+			wake_up_bit(&object->cookie->flags,
+				    FSCACHE_COOKIE_CREATING);
+		spin_unlock(&object->lock);
+		goto dying;
+
+	case FSCACHE_OBJECT_RELEASING:
+	case FSCACHE_OBJECT_RECYCLING:
+		object->event_mask &=
+			~((1 << FSCACHE_OBJECT_EV_WITHDRAW) |
+			  (1 << FSCACHE_OBJECT_EV_RETIRE) |
+			  (1 << FSCACHE_OBJECT_EV_RELEASE) |
+			  (1 << FSCACHE_OBJECT_EV_ERROR));
+		fscache_release_object(object);
+		object->state = FSCACHE_OBJECT_DEAD;
+		goto terminal_transit;
+
+	case FSCACHE_OBJECT_WITHDRAWING:
+		object->event_mask &=
+			~((1 << FSCACHE_OBJECT_EV_WITHDRAW) |
+			  (1 << FSCACHE_OBJECT_EV_RETIRE) |
+			  (1 << FSCACHE_OBJECT_EV_RELEASE) |
+			  (1 << FSCACHE_OBJECT_EV_ERROR));
+		fscache_withdraw_object(object);
+		object->state = FSCACHE_OBJECT_DEAD;
+		goto terminal_transit;
+
+	case FSCACHE_OBJECT_DEAD:
+		printk(KERN_ERR "FS-Cache:"
+		       " Unexpected event in dead state %lx\n",
+		       object->events & object->event_mask);
+		BUG();
+
+	default:
+		printk(KERN_ERR "FS-Cache: Unknown object state %u\n",
+		       object->state);
+		BUG();
+	}
+
+	/* determine the transition from a lookup state */
+lookup_transit:
+	switch (fls(object->events & object->event_mask) - 1) {
+	case FSCACHE_OBJECT_EV_WITHDRAW:
+	case FSCACHE_OBJECT_EV_RETIRE:
+	case FSCACHE_OBJECT_EV_RELEASE:
+	case FSCACHE_OBJECT_EV_ERROR:
+		object->state = FSCACHE_OBJECT_LC_DYING;
+		break;
+	case FSCACHE_OBJECT_EV_REQUEUE:
+		break;
+	case -1:
+		break; /* sleep until event */
+	default:
+		goto unsupported_event;
+	}
+	goto done;
+
+	/* determine the transition from an active state */
+active_transit:
+	switch (fls(object->events & object->event_mask) - 1) {
+	case FSCACHE_OBJECT_EV_WITHDRAW:
+	case FSCACHE_OBJECT_EV_RETIRE:
+	case FSCACHE_OBJECT_EV_RELEASE:
+	case FSCACHE_OBJECT_EV_ERROR:
+		object->state = FSCACHE_OBJECT_DYING;
+		break;
+	case FSCACHE_OBJECT_EV_UPDATE:
+		object->state = FSCACHE_OBJECT_UPDATING;
+		break;
+	case -1:
+		object->state = FSCACHE_OBJECT_ACTIVE;
+		break; /* sleep until event */
+	default:
+		goto unsupported_event;
+	}
+	goto done;
+
+	/* determine the transition from a terminal state */
+terminal_transit:
+	switch (fls(object->events & object->event_mask) - 1) {
+	case FSCACHE_OBJECT_EV_WITHDRAW:
+		object->state = FSCACHE_OBJECT_WITHDRAWING;
+		break;
+	case FSCACHE_OBJECT_EV_RETIRE:
+		object->state = FSCACHE_OBJECT_RECYCLING;
+		break;
+	case FSCACHE_OBJECT_EV_RELEASE:
+		object->state = FSCACHE_OBJECT_RELEASING;
+		break;
+	case FSCACHE_OBJECT_EV_ERROR:
+		object->state = FSCACHE_OBJECT_WITHDRAWING;
+		break;
+	case FSCACHE_OBJECT_EV_CLEARED:
+		object->state = FSCACHE_OBJECT_DYING;
+		break;
+	case -1:
+		break; /* sleep until event */
+	default:
+		goto unsupported_event;
+	}
+
+done:
+	_leave(" [->%s]", fscache_object_states[object->state]);
+	return;
+
+unsupported_event:
+	printk(KERN_ERR "FS-Cache:"
+	       " Unsupported event %lx [mask %lx] in state %s\n",
+	       object->events, object->event_mask,
+	       fscache_object_states[object->state]);
+	BUG();
+}
+
+/*
+ * check the specified object's parent to see if we can make use of it
+ * immediately to do a creation
+ * - we may need to start the process of creating a parent and we need to wait
+ *   for the parent's lookup and creation to complete if it's not there yet
+ * - an object's cookie is pinned until we clear FSCACHE_COOKIE_CREATING on the
+ *   leaf-most cookies of the object and all its children
+ */
+static void fscache_check_object_parent(struct fscache_object *object)
+{
+	struct fscache_object *parent;
+
+	_enter("");
+	ASSERT(object->cookie != NULL);
+	ASSERT(object->cookie->parent != NULL);
+	ASSERT(list_empty(&object->work_link));
+
+	if (object->events & ((1 << FSCACHE_OBJECT_EV_ERROR) |
+			      (1 << FSCACHE_OBJECT_EV_RELEASE) |
+			      (1 << FSCACHE_OBJECT_EV_RETIRE) |
+			      (1 << FSCACHE_OBJECT_EV_WITHDRAW))) {
+		_debug("abort init %lx", object->events);
+		object->state = FSCACHE_OBJECT_ABORT_INIT;
+		return;
+	}
+
+	spin_lock(&object->cookie->lock);
+	spin_lock_nested(&object->cookie->parent->lock, 1);
+
+	parent = object->parent;
+	if (!parent) {
+		_debug("no parent");
+		set_bit(FSCACHE_OBJECT_EV_WITHDRAW, &object->events);
+	} else {
+		spin_lock_nested(&parent->lock, 1);
+		_debug("parent %s", fscache_object_states[parent->state]);
+
+		if (parent->state >= FSCACHE_OBJECT_DYING) {
+			_debug("bad parent");
+			set_bit(FSCACHE_OBJECT_EV_WITHDRAW, &object->events);
+		} else if (parent->state < FSCACHE_OBJECT_AVAILABLE) {
+			_debug("wait");
+			object->cache->ops->grab_object(object);
+			set_bit(FSCACHE_OBJECT_WAITING, &object->flags);
+			list_add(&object->work_link, &parent->dependents);
+			atomic_inc(&object->cache->thread_usage);
+			if (parent->state == FSCACHE_OBJECT_INIT)
+				fscache_enqueue_object(parent);
+		} else {
+			_debug("go");
+			parent->n_ops++;
+			object->lookup_jif = jiffies;
+			object->state = FSCACHE_OBJECT_LOOKING_UP;
+			set_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
+		}
+
+		spin_unlock(&parent->lock);
+	}
+
+	spin_unlock(&object->cookie->parent->lock);
+	spin_unlock(&object->cookie->lock);
+	_leave("");
+}
+
+/*
+ * look up an object in its cache
+ * - we hold an "access lock" on the parent object, so the parent object cannot
+ *   be withdrawn by either party till we've finished
+ * - an object's cookie is pinned until we clear FSCACHE_COOKIE_CREATING on the
+ *   leaf-most cookies of the object and all its children
+ */
+static void fscache_lookup_object(struct fscache_object *object)
+{
+	struct fscache_cookie *cookie = object->cookie;
+	struct fscache_object *parent;
+
+	_enter("");
+
+	parent = object->parent;
+	ASSERT(parent != NULL);
+	ASSERTCMP(parent->n_ops, >, 0);
+
+	/* make sure the parent is still available */
+	ASSERTCMP(parent->state, >=, FSCACHE_OBJECT_AVAILABLE);
+
+	if (parent->state >= FSCACHE_OBJECT_DYING ||
+	    test_bit(FSCACHE_IOERROR, &object->cache->flags)) {
+		_debug("unavailable");
+		set_bit(FSCACHE_OBJECT_EV_WITHDRAW, &object->events);
+		_leave("");
+		return;
+	}
+
+	_debug("LOOKUP \"%s/%s\" in \"%s\"",
+	       parent->cookie->def->name, cookie->def->name,
+	       object->cache->tag->name);
+
+	fscache_stat(&fscache_n_object_lookups);
+	object->cache->ops->lookup_object(object);
+
+	if (test_bit(FSCACHE_OBJECT_EV_ERROR, &object->events))
+		set_bit(FSCACHE_COOKIE_UNAVAILABLE, &cookie->flags);
+
+	_leave("");
+}
+
+/**
+ * fscache_object_lookup_negative - Note negative cookie lookup
+ * @object: Object pointing to cookie to mark
+ *
+ * Note negative lookup, permitting those waiting to read data from an already
+ * existing backing object to continue as there's no data for them to read.
+ */
+void fscache_object_lookup_negative(struct fscache_object *object)
+{
+	struct fscache_cookie *cookie = object->cookie;
+
+	_enter("{OBJ%x,%s}",
+	       object->debug_id, fscache_object_states[object->state]);
+
+	if (object->state == FSCACHE_OBJECT_LOOKING_UP) {
+		fscache_stat(&fscache_n_object_lookups_negative);
+
+		/* transit here to allow write requests to begin stacking up
+		 * and read requests to begin returning ENODATA */
+		object->state = FSCACHE_OBJECT_CREATING;
+
+		set_bit(FSCACHE_COOKIE_PENDING_FILL, &cookie->flags);
+		set_bit(FSCACHE_COOKIE_NO_DATA_YET, &cookie->flags);
+
+		_debug("wake up lookup %p", &cookie->flags);
+		smp_mb__before_clear_bit();
+		clear_bit(FSCACHE_COOKIE_LOOKING_UP, &cookie->flags);
+		smp_mb__after_clear_bit();
+		wake_up_bit(&cookie->flags, FSCACHE_COOKIE_LOOKING_UP);
+		clear_bit(FSCACHE_OBJECT_SYNC, &object->flags);
+		set_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
+	} else {
+		ASSERTCMP(object->state, ==, FSCACHE_OBJECT_CREATING);
+	}
+
+	_leave("");
+}
+EXPORT_SYMBOL(fscache_object_lookup_negative);
+
+/**
+ * fscache_obtained_object - Note successful object lookup or creation
+ * @object: Object pointing to cookie to mark
+ *
+ * Note successful lookup and/or creation, permitting those waiting to write
+ * data to a backing object to continue.
+ *
+ * Note that after calling this, an object's cookie may be relinquished by the
+ * netfs, and so must be accessed with object lock held.
+ */
+void fscache_obtained_object(struct fscache_object *object)
+{
+	struct fscache_cookie *cookie = object->cookie;
+
+	_enter("{OBJ%x,%s}",
+	       object->debug_id, fscache_object_states[object->state]);
+
+	/* if we were still looking up, then we must have a positive lookup
+	 * result, in which case there may be data available */
+	if (object->state == FSCACHE_OBJECT_LOOKING_UP) {
+		fscache_stat(&fscache_n_object_lookups_positive);
+
+		clear_bit(FSCACHE_COOKIE_NO_DATA_YET, &cookie->flags);
+
+		object->state = FSCACHE_OBJECT_AVAILABLE;
+
+		smp_mb__before_clear_bit();
+		clear_bit(FSCACHE_COOKIE_LOOKING_UP, &cookie->flags);
+		smp_mb__after_clear_bit();
+		wake_up_bit(&cookie->flags, FSCACHE_COOKIE_LOOKING_UP);
+		clear_bit(FSCACHE_OBJECT_SYNC, &object->flags);
+		set_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
+	} else {
+		ASSERTCMP(object->state, ==, FSCACHE_OBJECT_CREATING);
+		fscache_stat(&fscache_n_object_created);
+
+		object->state = FSCACHE_OBJECT_AVAILABLE;
+		set_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
+		smp_wmb();
+	}
+
+	if (test_and_clear_bit(FSCACHE_COOKIE_CREATING, &cookie->flags))
+		wake_up_bit(&cookie->flags, FSCACHE_COOKIE_CREATING);
+
+	_leave("");
+}
+EXPORT_SYMBOL(fscache_obtained_object);
+
+/*
+ * handle an object that has just become available
+ */
+static void fscache_object_available(struct fscache_object *object)
+{
+	_enter("{OBJ%x}", object->debug_id);
+
+	spin_lock(&object->lock);
+	if (test_and_clear_bit(FSCACHE_COOKIE_CREATING, &object->cookie->flags))
+		wake_up_bit(&object->cookie->flags, FSCACHE_COOKIE_CREATING);
+
+	spin_lock_nested(&object->parent->lock, 1);
+	object->parent->n_ops--;
+	if (object->parent->n_ops == 0)
+		fscache_raise_event(object->parent, FSCACHE_OBJECT_EV_CLEARED);
+	spin_unlock(&object->parent->lock);
+
+	if (object->n_ops > 0)
+		fscache_start_operations(object);
+	spin_unlock(&object->lock);
+
+	object->cache->ops->lookup_complete(object);
+	fscache_enqueue_dependents(object);
+
+	if (test_bit(FSCACHE_OBJECT_BOOSTED, &object->flags))
+		fscache_hist(fscache_obj_instantiate_histogram,
+			     object->lookup_jif);
+	fscache_stat(&fscache_n_object_avail);
+
+	_leave("");
+}
+
+/*
+ * drop an object's attachments
+ */
+static void fscache_drop_object(struct fscache_object *object)
+{
+	struct fscache_object *parent = object->parent;
+	struct fscache_cache *cache = object->cache;
+
+	_enter("{OBJ%x,%d}", object->debug_id, object->n_children);
+
+	spin_lock(&cache->object_list_lock);
+	list_del_init(&object->cache_link);
+	spin_unlock(&cache->object_list_lock);
+
+	cache->ops->drop_object(object);
+
+	if (parent) {
+		_debug("release parent OBJ%x {%d}",
+		       parent->debug_id, parent->n_children);
+
+		spin_lock(&parent->lock);
+		parent->n_children--;
+		if (parent->n_children == 0)
+			fscache_raise_event(parent, FSCACHE_OBJECT_EV_CLEARED);
+		spin_unlock(&parent->lock);
+		object->parent = NULL;
+	}
+
+	/* this just shifts the object release to the fscache thread pool */
+	object->cache->ops->put_object(object);
+}
+
+/*
+ * release or recycle an object
+ */
+static void fscache_release_object(struct fscache_object *object)
+{
+	_enter("");
+
+	fscache_drop_object(object);
+
+	_leave("");
+}
+
+/*
+ * withdraw an object from active service
+ */
+static void fscache_withdraw_object(struct fscache_object *object)
+{
+	struct fscache_cookie *cookie;
+	bool detached;
+
+	_enter("");
+
+	spin_lock(&object->lock);
+	cookie = object->cookie;
+	if (cookie) {
+		/* need to get the cookie lock before the object lock, starting
+		 * from the object pointer */
+		atomic_inc(&cookie->usage);
+		spin_unlock(&object->lock);
+
+		detached = false;
+		spin_lock(&cookie->lock);
+		spin_lock(&object->lock);
+
+		if (object->cookie == cookie) {
+			hlist_del_init(&object->cookie_link);
+			object->cookie = NULL;
+			detached = true;
+		}
+		spin_unlock(&cookie->lock);
+		fscache_cookie_put(cookie);
+		if (detached)
+			fscache_cookie_put(cookie);
+	}
+
+	spin_unlock(&object->lock);
+
+	fscache_drop_object(object);
+
+	_leave("");
+}
+
+/*
+ * withdraw an object from active service at the behest of the cache
+ * - need break the links to a cached object cookie
+ * - called under two situations:
+ *   (1) recycler decides to reclaim an in-use object
+ *   (2) a cache is unmounted
+ * - have to take care as the cookie can be being relinquished by the netfs
+ *   simultaneously
+ * - the object is pinned by the caller holding a refcount on it
+ */
+void fscache_withdrawing_object(struct fscache_cache *cache,
+				struct fscache_object *object)
+{
+	bool enqueue = false;
+
+	_enter(",OBJ%x", object->debug_id);
+
+	spin_lock(&object->lock);
+	if (object->state < FSCACHE_OBJECT_WITHDRAWING) {
+		object->state = FSCACHE_OBJECT_WITHDRAWING;
+		enqueue = true;
+	}
+	spin_unlock(&object->lock);
+
+	if (enqueue)
+		fscache_enqueue_object(object);
+
+	_leave("");
+}
diff --git a/fs/fscache/fsc-page.c b/fs/fscache/fsc-page.c
new file mode 100644
index 0000000..a5834fd
--- /dev/null
+++ b/fs/fscache/fsc-page.c
@@ -0,0 +1,872 @@
+/* Cache page management and data I/O routines
+ *
+ * Copyright (C) 2004-2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL PAGE
+#include <linux/module.h>
+#include <linux/fscache-cache.h>
+#include <linux/buffer_head.h>
+#include <linux/pagevec.h>
+#include "fsc-internal.h"
+
+/*
+ * start an op running
+ */
+static void fscache_run_op(struct fscache_object *object,
+			   struct fscache_operation *op)
+{
+	object->n_in_progress++;
+	clear_bit(FSCACHE_OP_WAITING, &op->flags);
+	if (op->processor)
+		fscache_enqueue_operation(op);
+	fscache_stat(&fscache_n_op_run);
+}
+
+/*
+ * submit an exclusive operation for an object
+ * - other ops are excluded from running simultaneously with this one
+ */
+static int fscache_submit_exclusive_op(struct fscache_object *object,
+				       struct fscache_operation *op)
+{
+	int ret;
+
+	spin_lock(&object->lock);
+	ASSERTCMP(object->n_ops, >=, object->n_in_progress);
+	ASSERTCMP(object->n_ops, >=, object->n_exclusive);
+
+	ret = -ENOBUFS;
+	if (fscache_object_is_active(object)) {
+		op->object = object;
+		object->n_ops++;
+		object->n_exclusive++;	/* reads and writes must wait */
+
+		if (object->n_ops > 0) {
+			list_add_tail(&op->work_link, &object->pending_ops);
+			fscache_stat(&fscache_n_op_pend);
+		} else {
+			ASSERTCMP(object->n_in_progress, ==, 0);
+			fscache_run_op(object, op);
+		}
+
+		/* need to issue a new write op after this */
+		clear_bit(FSCACHE_OBJECT_PENDING_WRITE, &object->flags);
+		ret = 0;
+	} else if (object->state == FSCACHE_OBJECT_CREATING) {
+		op->object = object;
+		object->n_ops++;
+		object->n_exclusive++;	/* reads and writes must wait */
+		list_add_tail(&op->work_link, &object->pending_ops);
+		fscache_stat(&fscache_n_op_pend);
+		ret = 0;
+	} else {
+		/* not allowed to submit ops in any other state */
+		BUG();
+	}
+
+	spin_unlock(&object->lock);
+	return ret;
+}
+
+/*
+ * submit an operation for an object
+ */
+static int fscache_submit_op(struct fscache_object *object,
+			     struct fscache_operation *op)
+{
+	int ret;
+
+	ASSERTCMP(atomic_read(&op->usage), >, 0);
+
+	spin_lock(&object->lock);
+	ASSERTCMP(object->n_ops, >=, object->n_in_progress);
+	ASSERTCMP(object->n_ops, >=, object->n_exclusive);
+
+	if (fscache_object_is_active(object)) {
+		op->object = object;
+		object->n_ops++;
+
+		if (object->n_exclusive > 0) {
+			ASSERTCMP(object->n_in_progress, >, 0);
+			list_add_tail(&op->work_link, &object->pending_ops);
+			fscache_stat(&fscache_n_op_pend);
+		} else {
+			ASSERT(list_empty(&object->pending_ops));
+			ASSERTCMP(object->n_exclusive, ==, 0);
+			fscache_run_op(object, op);
+		}
+		ret = 0;
+	} else if (object->state == FSCACHE_OBJECT_CREATING) {
+		op->object = object;
+		object->n_ops++;
+		list_add_tail(&op->work_link, &object->pending_ops);
+		fscache_stat(&fscache_n_op_pend);
+		ret = 0;
+	} else if (!test_bit(FSCACHE_IOERROR, &object->cache->flags)) {
+		static bool once_only;
+		if (!once_only) {
+			once_only = true;
+			kdebug("no submit [OBJ%x %s]",
+			       object->debug_id,
+			       fscache_object_states[object->state]);
+			dump_stack();
+		}
+		ret = -ENOBUFS;
+	} else {
+		ret = -ENOBUFS;
+	}
+
+	spin_unlock(&object->lock);
+	return ret;
+}
+
+/*
+ * queue an object for withdrawal on error, aborting all following asynchronous
+ * operations
+ */
+static void fscache_abort_object(struct fscache_object *object)
+{
+	fscache_raise_event(object, FSCACHE_OBJECT_EV_ERROR);
+}
+
+/*
+ * actually apply the changed attributes to a cache object
+ */
+static void fscache_attr_changed_op(struct fscache_operation *op)
+{
+	struct fscache_object *object = op->object;
+
+	_enter("{OBJ%x}", object->debug_id);
+
+	fscache_stat(&fscache_n_attr_changed_calls);
+
+	if (fscache_object_is_active(object) &&
+	    object->cache->ops->attr_changed(object) < 0)
+		fscache_abort_object(object);
+
+	_leave("");
+}
+
+/*
+ * notification that the attributes on an object have changed
+ */
+int __fscache_attr_changed(struct fscache_cookie *cookie)
+{
+	struct fscache_operation *op;
+	struct fscache_object *object;
+
+	_enter("%p", cookie);
+
+	ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
+
+	fscache_stat(&fscache_n_attr_changed);
+
+	op = kzalloc(sizeof(*op), GFP_KERNEL);
+	if (!op) {
+		fscache_stat(&fscache_n_attr_changed_nomem);
+		_leave(" = -ENOMEM");
+		return -ENOMEM;
+	}
+
+	op->flags |= 1 << FSCACHE_OP_EXCLUSIVE;
+	op->processor = fscache_attr_changed_op;
+
+	spin_lock(&cookie->lock);
+
+	if (hlist_empty(&cookie->backing_objects))
+		goto nobufs;
+	object = hlist_entry(cookie->backing_objects.first,
+			     struct fscache_object, cookie_link);
+
+	if (fscache_submit_exclusive_op(object, op) < 0)
+		goto nobufs;
+	spin_unlock(&cookie->lock);
+	fscache_stat(&fscache_n_attr_changed_ok);
+	_leave(" = 0");
+	return 0;
+
+nobufs:
+	spin_unlock(&cookie->lock);
+	kfree(op);
+	fscache_stat(&fscache_n_attr_changed_nobufs);
+	_leave(" = %d", -ENOBUFS);
+	return -ENOBUFS;
+}
+EXPORT_SYMBOL(__fscache_attr_changed);
+
+/*
+ * release a retrieval op reference
+ */
+static void fscache_release_retrieval_op(struct fscache_operation *_op)
+{
+	struct fscache_retrieval *op =
+		container_of(_op, struct fscache_retrieval, op);
+
+	_enter("");
+
+	fscache_hist(fscache_retrieval_histogram, op->start_time);
+	if (op->context)
+		fscache_put_context(op->op.object->cookie, op->context);
+
+	_leave("");
+}
+
+/*
+ * allocate a retrieval op
+ */
+static struct fscache_retrieval *fscache_alloc_retrieval(
+	struct address_space *mapping,
+	fscache_rw_complete_t end_io_func,
+	void *context)
+{
+	struct fscache_retrieval *op;
+
+	/* allocate a retrieval operation and attempt to submit it */
+	op = kzalloc(sizeof(*op), GFP_NOIO);
+	if (!op) {
+		fscache_stat(&fscache_n_retrievals_nomem);
+		return NULL;
+	}
+
+	atomic_set(&op->op.usage, 1);
+	op->op.flags	= (1 << FSCACHE_OP_WAITING) | (1 << FSCACHE_OP_SYNC);
+	op->op.release	= fscache_release_retrieval_op;
+	op->mapping	= mapping;
+	op->end_io_func	= end_io_func;
+	op->context	= context;
+	op->start_time	= jiffies;
+	INIT_LIST_HEAD(&op->op.work_link);
+	INIT_LIST_HEAD(&op->to_do);
+	return op;
+}
+
+/*
+ * wait for a deferred lookup to complete
+ */
+static int fscache_wait_for_deferred_lookup(struct fscache_cookie *cookie)
+{
+	struct fscache_object *object;
+	unsigned long jif;
+
+	if (!test_bit(FSCACHE_COOKIE_LOOKING_UP, &cookie->flags))
+		return 0;
+
+	fscache_stat(&fscache_n_retrievals_wait);
+
+	/* tell the cookie dispatcher that this cookie is now being waited
+	 * upon for lookup completion */
+	spin_lock(&cookie->lock);
+	if (!hlist_empty(&cookie->backing_objects)) {
+		object = hlist_entry(cookie->backing_objects.first,
+				     struct fscache_object, cookie_link);
+		if (!test_and_set_bit(FSCACHE_OBJECT_SYNC, &object->flags))
+			fscache_boost_object(object);
+	}
+	spin_unlock(&cookie->lock);
+
+	jif = jiffies;
+	if (wait_on_bit(&cookie->flags, FSCACHE_COOKIE_LOOKING_UP,
+			fscache_wait_bit_interruptible,
+			TASK_INTERRUPTIBLE) != 0) {
+		fscache_stat(&fscache_n_retrievals_intr);
+		return -ERESTARTSYS;
+	}
+
+	ASSERT(!test_bit(FSCACHE_COOKIE_LOOKING_UP, &cookie->flags));
+
+	smp_rmb();
+	fscache_hist(fscache_retrieval_delay_histogram, jif);
+	return 0;
+}
+
+/*
+ * read a page from the cache or allocate a block in which to store it
+ * - we return:
+ *   -ENOMEM	- out of memory, nothing done
+ *   -ERESTARTSYS - interrupted
+ *   -ENOBUFS	- no backing object available in which to cache the block
+ *   -ENODATA	- no data available in the backing object for this block
+ *   0		- dispatched a read - it'll call end_io_func() when finished
+ */
+int __fscache_read_or_alloc_page(struct fscache_cookie *cookie,
+				 struct page *page,
+				 fscache_rw_complete_t end_io_func,
+				 void *context,
+				 gfp_t gfp)
+{
+	struct fscache_retrieval *op;
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p,%p,,,", cookie, page);
+
+	fscache_stat(&fscache_n_retrievals);
+
+	if (hlist_empty(&cookie->backing_objects))
+		goto nobufs;
+
+	ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
+	ASSERTCMP(page, !=, NULL);
+
+	if (fscache_wait_for_deferred_lookup(cookie) < 0)
+		return -ERESTARTSYS;
+
+	op = fscache_alloc_retrieval(page->mapping, end_io_func, context);
+	if (!op) {
+		_leave(" = -ENOMEM");
+		return -ENOMEM;
+	}
+
+	spin_lock(&cookie->lock);
+
+	if (hlist_empty(&cookie->backing_objects))
+		goto nobufs_unlock;
+	object = hlist_entry(cookie->backing_objects.first,
+			     struct fscache_object, cookie_link);
+
+	ASSERTCMP(object->state, >, FSCACHE_COOKIE_LOOKING_UP);
+
+	if (fscache_submit_op(object, &op->op) < 0)
+		goto nobufs_unlock;
+	spin_unlock(&cookie->lock);
+
+	fscache_stat(&fscache_n_retrieval_ops);
+
+	/* pin the netfs read context in case we need to do the actual netfs
+	 * read because we've encountered a cache read failure */
+	fscache_get_context(object->cookie, op->context);
+
+	if (test_bit(FSCACHE_OP_WAITING, &op->op.flags)) {
+		_debug(">>> WT");
+		fscache_stat(&fscache_n_retrieval_op_waits);
+		wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
+			    fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+		_debug("<<< GO");
+	}
+
+	/* ask the cache to honour the operation */
+	if (test_bit(FSCACHE_COOKIE_NO_DATA_YET, &object->cookie->flags)) {
+		ret = object->cache->ops->allocate_page(op, page, gfp);
+		if (ret == 0)
+			ret = -ENODATA;
+	} else {
+		ret = object->cache->ops->read_or_alloc_page(op, page, gfp);
+	}
+
+	if (ret == -ENOMEM)
+		fscache_stat(&fscache_n_retrievals_nomem);
+	else if (ret == -ERESTARTSYS)
+		fscache_stat(&fscache_n_retrievals_intr);
+	else if (ret == -ENODATA)
+		fscache_stat(&fscache_n_retrievals_nodata);
+	else if (ret < 0)
+		fscache_stat(&fscache_n_retrievals_nobufs);
+	else
+		fscache_stat(&fscache_n_retrievals_ok);
+
+	fscache_put_retrieval(op);
+	_leave(" = %d", ret);
+	return ret;
+
+nobufs_unlock:
+	spin_unlock(&cookie->lock);
+	kfree(op);
+nobufs:
+	fscache_stat(&fscache_n_retrievals_nobufs);
+	_leave(" = -ENOBUFS");
+	return -ENOBUFS;
+}
+EXPORT_SYMBOL(__fscache_read_or_alloc_page);
+
+/*
+ * read a list of page from the cache or allocate a block in which to store
+ * them
+ * - we return:
+ *   -ENOMEM	- out of memory, some pages may be being read
+ *   -ERESTARTSYS - interrupted, some pages may be being read
+ *   -ENOBUFS	- no backing object or space available in which to cache any
+ *                pages not being read
+ *   -ENODATA	- no data available in the backing object for some or all of
+ *                the pages
+ *   0		- dispatched a read on all pages
+ *
+ * end_io_func() will be called for each page read from the cache as it is
+ * finishes being read
+ *
+ * any pages for which a read is dispatched will be removed from pages and
+ * nr_pages
+ */
+int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
+				  struct address_space *mapping,
+				  struct list_head *pages,
+				  unsigned *nr_pages,
+				  fscache_rw_complete_t end_io_func,
+				  void *context,
+				  gfp_t gfp)
+{
+	fscache_pages_retrieval_func_t func;
+	struct fscache_retrieval *op;
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p,,%d,,,", cookie, *nr_pages);
+
+	fscache_stat(&fscache_n_retrievals);
+
+	if (hlist_empty(&cookie->backing_objects))
+		goto nobufs;
+
+	ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
+	ASSERTCMP(*nr_pages, >, 0);
+	ASSERT(!list_empty(pages));
+
+	if (fscache_wait_for_deferred_lookup(cookie) < 0)
+		return -ERESTARTSYS;
+
+	op = fscache_alloc_retrieval(mapping, end_io_func, context);
+	if (!op)
+		return -ENOMEM;
+
+	spin_lock(&cookie->lock);
+
+	if (hlist_empty(&cookie->backing_objects))
+		goto nobufs_unlock;
+	object = hlist_entry(cookie->backing_objects.first,
+			     struct fscache_object, cookie_link);
+
+	if (fscache_submit_op(object, &op->op) < 0)
+		goto nobufs_unlock;
+	spin_unlock(&cookie->lock);
+
+	fscache_stat(&fscache_n_retrieval_ops);
+
+	/* pin the netfs read context in case we need to do the actual netfs
+	 * read because we've encountered a cache read failure */
+	fscache_get_context(object->cookie, op->context);
+
+	if (test_bit(FSCACHE_OP_WAITING, &op->op.flags)) {
+		_debug(">>> WT");
+		fscache_stat(&fscache_n_retrieval_op_waits);
+		wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
+			    fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+		_debug("<<< GO");
+	}
+
+	/* ask the cache to honour the operation */
+	if (test_bit(FSCACHE_COOKIE_NO_DATA_YET, &object->cookie->flags))
+		func = object->cache->ops->allocate_pages;
+	else
+		func = object->cache->ops->read_or_alloc_pages;
+	ret = func(op, pages, nr_pages, gfp);
+
+	if (ret == -ENOMEM)
+		fscache_stat(&fscache_n_retrievals_nomem);
+	else if (ret == -ERESTARTSYS)
+		fscache_stat(&fscache_n_retrievals_intr);
+	else if (ret == -ENODATA)
+		fscache_stat(&fscache_n_retrievals_nodata);
+	else if (ret < 0)
+		fscache_stat(&fscache_n_retrievals_nobufs);
+	else
+		fscache_stat(&fscache_n_retrievals_ok);
+
+	fscache_put_retrieval(op);
+	_leave(" = %d", ret);
+	return ret;
+
+nobufs_unlock:
+	spin_unlock(&cookie->lock);
+	kfree(op);
+nobufs:
+	fscache_stat(&fscache_n_retrievals_nobufs);
+	_leave(" = -ENOBUFS");
+	return -ENOBUFS;
+}
+EXPORT_SYMBOL(__fscache_read_or_alloc_pages);
+
+/*
+ * allocate a block in the cache on which to store a page
+ * - we return:
+ *   -ENOMEM	- out of memory, nothing done
+ *   -ERESTARTSYS - interrupted
+ *   -ENOBUFS	- no backing object available in which to cache the block
+ *   0		- block allocated
+ */
+int __fscache_alloc_page(struct fscache_cookie *cookie,
+			 struct page *page,
+			 gfp_t gfp)
+{
+	struct fscache_retrieval *op;
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p,%p,,,", cookie, page);
+
+	fscache_stat(&fscache_n_allocs);
+
+	if (hlist_empty(&cookie->backing_objects))
+		goto nobufs;
+
+	ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
+	ASSERTCMP(page, !=, NULL);
+
+	if (fscache_wait_for_deferred_lookup(cookie) < 0)
+		return -ERESTARTSYS;
+
+	op = fscache_alloc_retrieval(page->mapping, NULL, NULL);
+	if (!op)
+		return -ENOMEM;
+
+	spin_lock(&cookie->lock);
+
+	if (hlist_empty(&cookie->backing_objects))
+		goto nobufs_unlock;
+	object = hlist_entry(cookie->backing_objects.first,
+			     struct fscache_object, cookie_link);
+
+	if (fscache_submit_op(object, &op->op) < 0)
+		goto nobufs_unlock;
+	spin_unlock(&cookie->lock);
+
+	fscache_stat(&fscache_n_alloc_ops);
+
+	if (test_bit(FSCACHE_OP_WAITING, &op->op.flags)) {
+		_debug(">>> WT");
+		fscache_stat(&fscache_n_alloc_op_waits);
+		wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
+			    fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+		_debug("<<< GO");
+	}
+
+	/* ask the cache to honour the operation */
+	ret = object->cache->ops->allocate_page(op, page, gfp);
+
+	if (ret < 0)
+		fscache_stat(&fscache_n_allocs_nobufs);
+	else
+		fscache_stat(&fscache_n_allocs_ok);
+
+	fscache_put_retrieval(op);
+	_leave(" = %d", ret);
+	return ret;
+
+nobufs_unlock:
+	spin_unlock(&cookie->lock);
+	kfree(op);
+nobufs:
+	fscache_stat(&fscache_n_allocs_nobufs);
+	_leave(" = -ENOBUFS");
+	return -ENOBUFS;
+}
+EXPORT_SYMBOL(__fscache_alloc_page);
+
+/*
+ * release a write op reference
+ */
+static void fscache_release_write_op(struct fscache_operation *_op)
+{
+	_enter("");
+}
+
+/*
+ * store a page in the cache in the background
+ */
+static void fscache_write_op(struct fscache_operation *_op)
+{
+	struct fscache_storage *op =
+		container_of(_op, struct fscache_storage, op);
+	struct fscache_object *object = op->op.object;
+	struct page *page;
+	unsigned n;
+	void *results[1];
+	int ret;
+
+	_enter("{%d}", atomic_read(&op->op.usage));
+
+	if (!fscache_object_is_active(object)) {
+		fscache_put_operation(&op->op);
+		_leave("");
+		return;
+	}
+
+	fscache_stat(&fscache_n_store_calls);
+
+	/* find a page to store */
+	spin_lock(&object->lock);
+
+	page = NULL;
+	n = radix_tree_gang_lookup(&object->stores, results, 0, 1);
+	if (n == 1) {
+		page = results[0];
+		_debug("gang %d [%lx]", n, page->index);
+		if (page->index <= op->store_limit)
+			radix_tree_delete(&object->stores, page->index);
+		else
+			goto superseded;
+	} else {
+		goto superseded;
+	}
+
+	spin_unlock(&object->lock);
+
+	if (page) {
+		ret = object->cache->ops->write_page(op, page);
+		end_page_fscache_write(page);
+		page_cache_release(page);
+		if (ret < 0) {
+			fscache_abort_object(object);
+			fscache_put_operation(&op->op);
+		} else {
+			fscache_enqueue_operation(&op->op);
+		}
+	}
+
+	_leave("");
+	return;
+
+superseded:
+	/* this writer is going away and there aren't any more things to
+	 * write */
+	_debug("cease");
+	clear_bit(FSCACHE_OBJECT_PENDING_WRITE, &object->flags);
+	spin_unlock(&object->lock);
+	fscache_put_operation(&op->op);
+	_leave("");
+}
+
+/*
+ * request a page be stored in the cache
+ * - returns:
+ *   -ENOMEM	- out of memory, nothing done
+ *   -ENOBUFS	- no backing object available in which to cache the page
+ *   0		- dispatched a write - it'll call end_io_func() when finished
+ *
+ * if the cookie still has a backing object at this point, that object can be
+ * in one of a few states with respect to storage processing:
+ *
+ *  (1) negative lookup, object not yet created (FSCACHE_COOKIE_CREATING is
+ *      set)
+ *
+ *	(a) no writes yet (set FSCACHE_COOKIE_PENDING_FILL and queue deferred
+ *	    fill op)
+ *
+ *	(b) writes deferred till post-creation (mark page for writing and
+ *	    return immediately)
+ *
+ *  (2) negative lookup, object created, initial fill being made from netfs
+ *      (FSCACHE_COOKIE_INITIAL_FILL is set)
+ *
+ *	(a) fill point not yet reached this page (mark page for writing and
+ *          return)
+ *
+ *	(b) fill point passed this page (queue op to store this page)
+ *
+ *  (3) object extant (queue op to store this page)
+ *
+ * any other state is invalid
+ */
+int __fscache_write_page(struct fscache_cookie *cookie,
+			 struct page *page,
+			 gfp_t gfp)
+{
+	struct fscache_storage *op;
+	struct fscache_object *object;
+	int ret;
+
+	_enter("%p,%x,", cookie, (u32) page->flags);
+
+	ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
+	ASSERT(PageFsCache(page));
+
+	fscache_stat(&fscache_n_stores);
+
+	op = kzalloc(sizeof(*op), GFP_NOIO);
+	if (!op) {
+		fscache_stat(&fscache_n_stores_oom);
+		_leave(" = -ENOMEM");
+		return -ENOMEM;
+	}
+
+	ret = radix_tree_preload(gfp & ~__GFP_HIGHMEM);
+	if (ret < 0) {
+		kfree(op);
+		fscache_stat(&fscache_n_stores_oom);
+		_leave(" = %d", ret);
+		return ret;
+	}
+
+	ret = -ENOBUFS;
+	spin_lock(&cookie->lock);
+
+	if (hlist_empty(&cookie->backing_objects))
+		goto nobufs;
+	object = hlist_entry(cookie->backing_objects.first,
+			     struct fscache_object, cookie_link);
+	if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
+		goto nobufs;
+
+	/* add the page to the pending-storage radix tree on the backing
+	 * object */
+	spin_lock(&object->lock);
+
+	_debug("store limit %llx", (unsigned long long) object->store_limit);
+
+	ret = radix_tree_insert(&object->stores, page->index, page);
+	if (ret < 0) {
+		if (ret == -EEXIST)
+			goto already_queued;
+		_debug("insert failed %d", ret);
+		ret = -ENOBUFS;
+		goto nobufs_unlock;
+	}
+
+	page_cache_get(page);
+	if (TestSetPageFsCacheWrite(page))
+		BUG();
+
+	/* we only want one writer at a time, but we do need to queue new
+	 * writers after exclusive ops */
+	if (test_and_set_bit(FSCACHE_OBJECT_PENDING_WRITE, &object->flags))
+		goto already_pending;
+
+	spin_unlock(&object->lock);
+
+	op->op.processor = fscache_write_op;
+	op->op.release =  fscache_release_write_op;
+	op->store_limit = object->store_limit;
+	INIT_LIST_HEAD(&op->op.work_link);
+	atomic_set(&op->op.usage, 1);
+
+	if (fscache_submit_op(object, &op->op) < 0) {
+		radix_tree_delete(&object->stores, page->index);
+		end_page_fscache_write(page);
+		page_cache_release(page);
+		ret = -ENOBUFS;
+		goto nobufs;
+	}
+
+	spin_unlock(&cookie->lock);
+	radix_tree_preload_end();
+	fscache_stat(&fscache_n_store_ops);
+	fscache_stat(&fscache_n_stores_ok);
+	_leave(" = 0");
+	return 0;
+
+already_queued:
+	fscache_stat(&fscache_n_stores_again);
+already_pending:
+	spin_unlock(&object->lock);
+	spin_unlock(&cookie->lock);
+	radix_tree_preload_end();
+	kfree(op);
+	fscache_stat(&fscache_n_stores_ok);
+	_leave(" = 0");
+	return 0;
+
+nobufs_unlock:
+	spin_unlock(&object->lock);
+nobufs:
+	spin_unlock(&cookie->lock);
+	radix_tree_preload_end();
+	kfree(op);
+	fscache_stat(&fscache_n_stores_nobufs);
+	_leave(" = -ENOBUFS");
+	return -ENOBUFS;
+}
+EXPORT_SYMBOL(__fscache_write_page);
+
+/*
+ * remove a page from the cache
+ */
+void __fscache_uncache_page(struct fscache_cookie *cookie, struct page *page)
+{
+	struct fscache_object *object;
+
+	_enter(",%p", page);
+
+	ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
+	ASSERTCMP(page, !=, NULL);
+
+	fscache_stat(&fscache_n_uncaches);
+
+	/* cache withdrawal may beat us to it */
+	if (!PageFsCache(page))
+		goto done;
+
+	/* get the object */
+	spin_lock(&cookie->lock);
+
+	if (hlist_empty(&cookie->backing_objects)) {
+		ClearPageFsCache(page);
+		goto done_unlock;
+	}
+
+	object = hlist_entry(cookie->backing_objects.first,
+			     struct fscache_object, cookie_link);
+
+	/* there might now be stuff on disk we could read */
+	clear_bit(FSCACHE_COOKIE_NO_DATA_YET, &cookie->flags);
+
+	/* only invoke the cache backend if we managed to mark the page
+	 * uncached here; this deals with synchronisation vs withdrawal */
+	if (TestClearPageFsCache(page) &&
+	    object->cache->ops->uncache_page) {
+		/* the cache backend releases the cookie lock */
+		object->cache->ops->uncache_page(object, page);
+		goto done;
+	}
+
+done_unlock:
+	spin_unlock(&cookie->lock);
+done:
+	_leave("");
+}
+EXPORT_SYMBOL(__fscache_uncache_page);
+
+/**
+ * fscache_mark_pages_cached - Mark pages as being cached
+ * @op: The retrieval op pages are being marked for
+ * @pagevec: The pages to be marked
+ *
+ * Mark a bunch of netfs pages as being cached.  After this is called,
+ * the netfs must call fscache_uncache_page() to remove the mark.
+ */
+void fscache_mark_pages_cached(struct fscache_retrieval *op,
+			       struct pagevec *pagevec)
+{
+	struct fscache_cookie *cookie = op->op.object->cookie;
+	unsigned long loop;
+
+#ifdef CONFIG_FSCACHE_STATS
+	atomic_add(pagevec->nr, &fscache_n_marks);
+#endif
+
+	for (loop = 0; loop < pagevec->nr; loop++) {
+		struct page *page = pagevec->pages[loop];
+
+		_debug("- mark %p{%lx}", page, page->index);
+		if (TestSetPageFsCache(page)) {
+			static bool once_only = false;
+			if (!once_only) {
+				once_only = true;
+				printk(KERN_WARNING "FS-Cache:"
+				       " Cookie type %s marked page %lx"
+				       " multiple times\n",
+				       cookie->def->name, page->index);
+			}
+		}
+	}
+
+	if (cookie->def->mark_pages_cached)
+		cookie->def->mark_pages_cached(cookie->netfs_data,
+					       op->mapping, pagevec);
+	pagevec_reinit(pagevec);
+}
+EXPORT_SYMBOL(fscache_mark_pages_cached);
diff --git a/fs/fscache/fsc-proc.c b/fs/fscache/fsc-proc.c
new file mode 100644
index 0000000..85fedb7
--- /dev/null
+++ b/fs/fscache/fsc-proc.c
@@ -0,0 +1,398 @@
+/* FS-Cache statistics viewing interface
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL THREAD
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include "fsc-internal.h"
+
+struct fscache_proc {
+	unsigned			nlines;
+	const struct seq_operations	*ops;
+};
+
+struct proc_dir_entry *proc_fscache;
+EXPORT_SYMBOL(proc_fscache);
+
+static int fscache_proc_open(struct inode *inode, struct file *file);
+static void *fscache_proc_start(struct seq_file *m, loff_t *pos);
+static void fscache_proc_stop(struct seq_file *m, void *v);
+static void *fscache_proc_next(struct seq_file *m, void *v, loff_t *pos);
+
+static const struct file_operations fscache_proc_fops = {
+	.open		= fscache_proc_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+#ifdef CONFIG_FSCACHE_STATS
+static int fscache_stats_show(struct seq_file *m, void *v);
+static int fscache_pool_show(struct seq_file *m, void *v);
+
+static const struct seq_operations fscache_stats_ops = {
+	.start		= fscache_proc_start,
+	.stop		= fscache_proc_stop,
+	.next		= fscache_proc_next,
+	.show		= fscache_stats_show,
+};
+
+static const struct fscache_proc fscache_stats = {
+	.nlines		= 16,
+	.ops		= &fscache_stats_ops,
+};
+
+static const struct seq_operations fscache_pool_ops = {
+	.start		= fscache_proc_start,
+	.stop		= fscache_proc_stop,
+	.next		= fscache_proc_next,
+	.show		= fscache_pool_show,
+};
+
+static const struct fscache_proc fscache_pool = {
+	.nlines		= FSCACHE_MAX_THREADS + 2,
+	.ops		= &fscache_pool_ops,
+};
+#endif
+
+#ifdef CONFIG_FSCACHE_HISTOGRAM
+static int fscache_histogram_show(struct seq_file *m, void *v);
+
+static const struct seq_operations fscache_histogram_ops = {
+	.start		= fscache_proc_start,
+	.stop		= fscache_proc_stop,
+	.next		= fscache_proc_next,
+	.show		= fscache_histogram_show,
+};
+
+static const struct fscache_proc fscache_histogram = {
+	.nlines		= HZ + 1,
+	.ops		= &fscache_histogram_ops,
+};
+#endif
+
+#define FSC_DESC(SELECT, N) ((void *) (unsigned long) (((SELECT) << 16) | (N)))
+
+/*
+ * initialise the /proc/fs/fscache/ directory
+ */
+int __init fscache_proc_init(void)
+{
+	struct proc_dir_entry *p;
+
+	_enter("");
+
+	proc_fscache = proc_mkdir("fs/fscache", NULL);
+	if (!proc_fscache)
+		goto error_dir;
+	proc_fscache->owner = THIS_MODULE;
+
+#ifdef CONFIG_FSCACHE_STATS
+	p = create_proc_entry("stats", 0, proc_fscache);
+	if (!p)
+		goto error_stats;
+	p->proc_fops = &fscache_proc_fops;
+	p->owner = THIS_MODULE;
+	p->data = (void *) &fscache_stats;
+
+	p = create_proc_entry("pool", 0, proc_fscache);
+	if (!p)
+		goto error_pool;
+	p->proc_fops = &fscache_proc_fops;
+	p->owner = THIS_MODULE;
+	p->data = (void *) &fscache_pool;
+#endif
+
+#ifdef CONFIG_FSCACHE_HISTOGRAM
+	p = create_proc_entry("histogram", 0, proc_fscache);
+	if (!p)
+		goto error_histogram;
+	p->proc_fops = &fscache_proc_fops;
+	p->owner = THIS_MODULE;
+	p->data = (void *) &fscache_histogram;
+#endif
+
+	_leave(" = 0");
+	return 0;
+
+#ifdef CONFIG_FSCACHE_HISTOGRAM
+error_histogram:
+#endif
+#ifdef CONFIG_FSCACHE_STATS
+	remove_proc_entry("pool", proc_fscache);
+error_pool:
+	remove_proc_entry("stats", proc_fscache);
+error_stats:
+#endif
+	remove_proc_entry("fs/fscache", NULL);
+error_dir:
+	_leave(" = -ENOMEM");
+	return -ENOMEM;
+}
+
+/*
+ * clean up the /proc/fs/fscache/ directory
+ */
+void fscache_proc_cleanup(void)
+{
+#ifdef CONFIG_FSCACHE_HISTOGRAM
+	remove_proc_entry("histogram", proc_fscache);
+#endif
+#ifdef CONFIG_FSCACHE_STATS
+	remove_proc_entry("pool", proc_fscache);
+	remove_proc_entry("stats", proc_fscache);
+#endif
+	remove_proc_entry("fs/fscache", NULL);
+}
+
+/*
+ * open "/proc/fs/fscache/XXX" which provide statistics summaries
+ */
+static int fscache_proc_open(struct inode *inode, struct file *file)
+{
+	const struct fscache_proc *proc = PDE(inode)->data;
+	struct seq_file *m;
+	int ret;
+
+	ret = seq_open(file, proc->ops);
+	if (ret == 0) {
+		m = file->private_data;
+		m->private = (void *) proc;
+	}
+	return ret;
+}
+
+/*
+ * set up the iterator to start reading from the first line
+ */
+static void *fscache_proc_start(struct seq_file *m, loff_t *_pos)
+{
+	if (*_pos == 0)
+		*_pos = 1;
+	return (void *)(unsigned long) *_pos;
+}
+
+/*
+ * move to the next line
+ */
+static void *fscache_proc_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	const struct fscache_proc *proc = m->private;
+
+	(*pos)++;
+	return *pos > proc->nlines ? NULL : (void *)(unsigned long) *pos;
+}
+
+/*
+ * clean up after reading
+ */
+static void fscache_proc_stop(struct seq_file *m, void *v)
+{
+}
+
+#ifdef CONFIG_FSCACHE_STATS
+/*
+ * display the general statistics
+ */
+static int fscache_stats_show(struct seq_file *m, void *v)
+{
+	unsigned long line = (unsigned long) v;
+
+	switch (line) {
+	case 1:
+		seq_puts(m, "FS-Cache statistics\n");
+		break;
+
+	case 2:
+		seq_printf(m, "Cookies: idx=%u dat=%u spc=%u\n",
+			   atomic_read(&fscache_n_cookie_index),
+			   atomic_read(&fscache_n_cookie_data),
+			   atomic_read(&fscache_n_cookie_special));
+		break;
+
+	case 3:
+		seq_printf(m, "Objects: alc=%u nal=%u avl=%u\n",
+			   atomic_read(&fscache_n_object_alloc),
+			   atomic_read(&fscache_n_object_no_alloc),
+			   atomic_read(&fscache_n_object_avail));
+		break;
+
+	case 4:
+		seq_printf(m, "Pages  : mrk=%u unc=%u\n",
+			   atomic_read(&fscache_n_marks),
+			   atomic_read(&fscache_n_uncaches));
+		break;
+
+	case 5:
+		seq_printf(m, "Acquire: n=%u nul=%u noc=%u ok=%u nbf=%u"
+			   " oom=%u\n",
+			   atomic_read(&fscache_n_acquires),
+			   atomic_read(&fscache_n_acquires_null),
+			   atomic_read(&fscache_n_acquires_no_cache),
+			   atomic_read(&fscache_n_acquires_ok),
+			   atomic_read(&fscache_n_acquires_nobufs),
+			   atomic_read(&fscache_n_acquires_oom));
+		break;
+
+	case 6:
+		seq_printf(m, "Lookups: n=%u neg=%u pos=%u crt=%u bst=%u\n",
+			   atomic_read(&fscache_n_object_lookups),
+			   atomic_read(&fscache_n_object_lookups_negative),
+			   atomic_read(&fscache_n_object_lookups_positive),
+			   atomic_read(&fscache_n_object_created),
+			   atomic_read(&fscache_n_object_boosted));
+		break;
+
+	case 7:
+		seq_printf(m, "Updates: n=%u nul=%u run=%u\n",
+			   atomic_read(&fscache_n_updates),
+			   atomic_read(&fscache_n_updates_null),
+			   atomic_read(&fscache_n_updates_run));
+		break;
+
+	case 8:
+		seq_printf(m, "Relinqs: n=%u nul=%u wcr=%u\n",
+			   atomic_read(&fscache_n_relinquishes),
+			   atomic_read(&fscache_n_relinquishes_null),
+			   atomic_read(&fscache_n_relinquishes_waitcrt));
+		break;
+
+	case 9:
+		seq_printf(m, "AttrChg: n=%u ok=%u nbf=%u oom=%u run=%u\n",
+			   atomic_read(&fscache_n_attr_changed),
+			   atomic_read(&fscache_n_attr_changed_ok),
+			   atomic_read(&fscache_n_attr_changed_nobufs),
+			   atomic_read(&fscache_n_attr_changed_nomem),
+			   atomic_read(&fscache_n_attr_changed_calls));
+		break;
+
+	case 10:
+		seq_printf(m, "Allocs : n=%u ok=%u wt=%u nbf=%u\n",
+			   atomic_read(&fscache_n_allocs),
+			   atomic_read(&fscache_n_allocs_ok),
+			   atomic_read(&fscache_n_allocs_wait),
+			   atomic_read(&fscache_n_allocs_nobufs));
+		break;
+	case 11:
+		seq_printf(m, "Allocs : ops=%u owt=%u\n",
+			   atomic_read(&fscache_n_alloc_ops),
+			   atomic_read(&fscache_n_alloc_op_waits));
+		break;
+
+	case 12:
+		seq_printf(m, "Retrvls: n=%u ok=%u wt=%u nod=%u nbf=%u"
+			   " int=%u oom=%u\n",
+			   atomic_read(&fscache_n_retrievals),
+			   atomic_read(&fscache_n_retrievals_ok),
+			   atomic_read(&fscache_n_retrievals_wait),
+			   atomic_read(&fscache_n_retrievals_nodata),
+			   atomic_read(&fscache_n_retrievals_nobufs),
+			   atomic_read(&fscache_n_retrievals_intr),
+			   atomic_read(&fscache_n_retrievals_nomem));
+		break;
+	case 13:
+		seq_printf(m, "Retrvls: ops=%u owt=%u\n",
+			   atomic_read(&fscache_n_retrieval_ops),
+			   atomic_read(&fscache_n_retrieval_op_waits));
+		break;
+
+	case 14:
+		seq_printf(m, "Stores : n=%u ok=%u agn=%u nbf=%u oom=%u\n",
+			   atomic_read(&fscache_n_stores),
+			   atomic_read(&fscache_n_stores_ok),
+			   atomic_read(&fscache_n_stores_again),
+			   atomic_read(&fscache_n_stores_nobufs),
+			   atomic_read(&fscache_n_stores_oom));
+		break;
+	case 15:
+		seq_printf(m, "Stores : ops=%u run=%u\n",
+			   atomic_read(&fscache_n_store_ops),
+			   atomic_read(&fscache_n_store_calls));
+		break;
+
+	case 16:
+		seq_printf(m, "Ops    : pend=%u run=%u enq=%u req=%u rel=%u\n",
+			   atomic_read(&fscache_n_op_pend),
+			   atomic_read(&fscache_n_op_run),
+			   atomic_read(&fscache_n_op_enqueue),
+			   atomic_read(&fscache_n_op_requeue),
+			   atomic_read(&fscache_n_op_release));
+		break;
+
+	default:
+		break;
+	}
+	return 0;
+}
+
+/*
+ * display the per-pool-thread statistics
+ */
+static int fscache_pool_show(struct seq_file *m, void *v)
+{
+	unsigned line = (unsigned long) v;
+	unsigned x, y;
+
+	switch (line) {
+	case 1:
+		seq_puts(m, "THREAD  OPERS RUN OBJS RUN\n");
+		return 0;
+	case 2:
+		seq_puts(m, "======= ========= =========\n");
+		return 0;
+	case 3 ... FSCACHE_MAX_THREADS + 2:
+		x = atomic_read(&fscache_n_ops_processed[line - 3]);
+		y = atomic_read(&fscache_n_objs_processed[line - 3]);
+		if (x != 0 || y != 0)
+			seq_printf(m, "kfsc%02ud %9u %9u\n", line - 3, x, y);
+	default:
+		return 0;
+	}
+}
+#endif
+
+#ifdef CONFIG_FSCACHE_HISTOGRAM
+/*
+ * display the time-taken histogram
+ */
+static int fscache_histogram_show(struct seq_file *m, void *v)
+{
+	unsigned long index;
+	unsigned n[5], t;
+
+	switch ((unsigned long) v) {
+	case 1:
+		seq_puts(m, "JIFS  SECS  OBJ INST  OP RUNS   OBJ RUNS "
+			 " RETRV DLY RETRIEVLS\n");
+		return 0;
+	case 2:
+		seq_puts(m, "===== ===== ========= ========= ========="
+			 " ========= =========\n");
+		return 0;
+	default:
+		index = (unsigned long) v - 3;
+		n[0] = atomic_read(&fscache_obj_instantiate_histogram[index]);
+		n[1] = atomic_read(&fscache_ops_histogram[index]);
+		n[2] = atomic_read(&fscache_objs_histogram[index]);
+		n[3] = atomic_read(&fscache_retrieval_delay_histogram[index]);
+		n[4] = atomic_read(&fscache_retrieval_histogram[index]);
+		if (!(n[0] | n[1] | n[2] | n[3] | n[4]))
+			return 0;
+
+		t = (index * 1000) / HZ;
+
+		seq_printf(m, "%4lu  0.%03u %9u %9u %9u %9u %9u\n",
+			   index, t, n[0], n[1], n[2], n[3], n[4]);
+		return 0;
+	}
+}
+#endif
diff --git a/fs/fscache/fsc-stats.c b/fs/fscache/fsc-stats.c
new file mode 100644
index 0000000..15adbda
--- /dev/null
+++ b/fs/fscache/fsc-stats.c
@@ -0,0 +1,103 @@
+/* FS-Cache statistics
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL THREAD
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include "fsc-internal.h"
+
+/*
+ * operation counters
+ */
+#ifdef CONFIG_FSCACHE_STATS
+atomic_t fscache_n_op_pend;
+atomic_t fscache_n_op_run;
+atomic_t fscache_n_op_enqueue;
+atomic_t fscache_n_op_requeue;
+atomic_t fscache_n_op_release;
+
+atomic_t fscache_n_attr_changed;
+atomic_t fscache_n_attr_changed_ok;
+atomic_t fscache_n_attr_changed_nobufs;
+atomic_t fscache_n_attr_changed_nomem;
+atomic_t fscache_n_attr_changed_calls;
+
+atomic_t fscache_n_allocs;
+atomic_t fscache_n_allocs_ok;
+atomic_t fscache_n_allocs_wait;
+atomic_t fscache_n_allocs_nobufs;
+atomic_t fscache_n_alloc_ops;
+atomic_t fscache_n_alloc_op_waits;
+
+atomic_t fscache_n_retrievals;
+atomic_t fscache_n_retrievals_ok;
+atomic_t fscache_n_retrievals_wait;
+atomic_t fscache_n_retrievals_nodata;
+atomic_t fscache_n_retrievals_nobufs;
+atomic_t fscache_n_retrievals_intr;
+atomic_t fscache_n_retrievals_nomem;
+atomic_t fscache_n_retrieval_ops;
+atomic_t fscache_n_retrieval_op_waits;
+
+atomic_t fscache_n_stores;
+atomic_t fscache_n_stores_ok;
+atomic_t fscache_n_stores_again;
+atomic_t fscache_n_stores_nobufs;
+atomic_t fscache_n_stores_oom;
+atomic_t fscache_n_store_ops;
+atomic_t fscache_n_store_calls;
+
+atomic_t fscache_n_marks;
+atomic_t fscache_n_uncaches;
+
+atomic_t fscache_n_acquires;
+atomic_t fscache_n_acquires_null;
+atomic_t fscache_n_acquires_no_cache;
+atomic_t fscache_n_acquires_ok;
+atomic_t fscache_n_acquires_nobufs;
+atomic_t fscache_n_acquires_oom;
+
+atomic_t fscache_n_updates;
+atomic_t fscache_n_updates_null;
+atomic_t fscache_n_updates_run;
+
+atomic_t fscache_n_relinquishes;
+atomic_t fscache_n_relinquishes_null;
+atomic_t fscache_n_relinquishes_waitcrt;
+
+atomic_t fscache_n_cookie_index;
+atomic_t fscache_n_cookie_data;
+atomic_t fscache_n_cookie_special;
+
+atomic_t fscache_n_object_alloc;
+atomic_t fscache_n_object_no_alloc;
+atomic_t fscache_n_object_lookups;
+atomic_t fscache_n_object_lookups_negative;
+atomic_t fscache_n_object_lookups_positive;
+atomic_t fscache_n_object_created;
+atomic_t fscache_n_object_avail;
+atomic_t fscache_n_object_boosted;
+
+/*
+ * the number of operations and objects processed by each thread in the pool
+ */
+atomic_t fscache_n_ops_processed[FSCACHE_MAX_THREADS];
+atomic_t fscache_n_objs_processed[FSCACHE_MAX_THREADS];
+#endif
+
+#ifdef CONFIG_FSCACHE_HISTOGRAM
+atomic_t fscache_obj_instantiate_histogram[HZ];
+atomic_t fscache_objs_histogram[HZ];
+atomic_t fscache_ops_histogram[HZ];
+atomic_t fscache_retrieval_delay_histogram[HZ];
+atomic_t fscache_retrieval_histogram[HZ];
+#endif
diff --git a/fs/fscache/fsc-threads.c b/fs/fscache/fsc-threads.c
new file mode 100644
index 0000000..4853176
--- /dev/null
+++ b/fs/fscache/fsc-threads.c
@@ -0,0 +1,676 @@
+/* FS-Cache worker thread pool manager
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL THREAD
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include "fsc-internal.h"
+
+static LIST_HEAD(fscache_async_object_fifo);
+static LIST_HEAD(fscache_sync_object_fifo);
+static LIST_HEAD(fscache_async_op_fifo);
+static LIST_HEAD(fscache_sync_op_fifo);
+
+static struct task_struct *fscache_threads[FSCACHE_MAX_THREADS];
+
+static unsigned fscache_n_threads = 21;
+module_param_named(n_threads, fscache_n_threads, uint, S_IWUSR | S_IRUGO);
+MODULE_PARM_DESC(fscache_n_threads, "FS-Cache thread pool size");
+
+static DEFINE_SPINLOCK(fscache_object_lock);
+static DEFINE_SPINLOCK(fscache_operation_lock);
+static DEFINE_MUTEX(fscache_thread_mutex);
+static DECLARE_WAIT_QUEUE_HEAD(fscache_async_obj_threads);
+static DECLARE_WAIT_QUEUE_HEAD(fscache_sync_obj_threads);
+static DECLARE_WAIT_QUEUE_HEAD(fscache_async_op_threads);
+static DECLARE_WAIT_QUEUE_HEAD(fscache_sync_op_threads);
+
+/**
+ * fscache_enqueue_operation - Enqueue an operation for processing
+ * @op: The operation to enqueue
+ *
+ * Enqueue an operation for processing by the FS-Cache thread pool.
+ */
+void fscache_enqueue_operation(struct fscache_operation *op)
+{
+	unsigned long flags;
+	unsigned wake = 0;
+
+	_enter("{OBJ%x}", op->object->debug_id);
+
+	ASSERT(op->processor != NULL);
+	ASSERTCMP(op->object->state, >=, FSCACHE_OBJECT_AVAILABLE);
+	ASSERTCMP(atomic_read(&op->usage), >, 0);
+
+	spin_lock_irqsave(&fscache_operation_lock, flags);
+
+	if (list_empty(&op->work_link) &&
+	    !test_bit(FSCACHE_OP_REQUEUE, &op->flags)) {
+		if (!test_bit(FSCACHE_OP_LOCK, &op->flags)) {
+			atomic_inc(&op->usage);
+			if (test_bit(FSCACHE_OP_SYNC, &op->flags)) {
+				list_add_tail(&op->work_link,
+					      &fscache_sync_op_fifo);
+				wake = 1;
+			} else {
+				list_add_tail(&op->work_link,
+					      &fscache_async_op_fifo);
+				wake = 2;
+			}
+			fscache_stat(&fscache_n_op_enqueue);
+		} else {
+			set_bit(FSCACHE_OP_REQUEUE, &op->flags);
+			fscache_stat(&fscache_n_op_requeue);
+		}
+	}
+
+	spin_unlock_irqrestore(&fscache_operation_lock, flags);
+	if (wake) {
+		_debug("wake %u", wake);
+		switch (wake) {
+		case 1:
+			wake_up(&fscache_sync_op_threads);
+			break;
+		case 2:
+			wake_up(&fscache_async_op_threads);
+			break;
+		default:
+			break;
+		}
+	}
+}
+EXPORT_SYMBOL(fscache_enqueue_operation);
+
+/*
+ * jump start the operation processing on an object
+ * - caller must hold object->lock
+ */
+void fscache_start_operations(struct fscache_object *object)
+{
+	struct fscache_operation *op;
+	bool stop = false;
+
+	while (!list_empty(&object->pending_ops) && !stop) {
+		op = list_entry(object->pending_ops.next,
+				struct fscache_operation, work_link);
+
+		if (test_bit(FSCACHE_OP_EXCLUSIVE, &op->flags)) {
+			if (object->n_in_progress > 0)
+				break;
+			stop = true;
+		}
+		list_del_init(&op->work_link);
+		object->n_in_progress++;
+
+		if (test_and_clear_bit(FSCACHE_OP_WAITING, &op->flags))
+			wake_up_bit(&op->flags, FSCACHE_OP_WAITING);
+		if (op->processor)
+			fscache_enqueue_operation(op);
+	}
+
+	ASSERTCMP(object->n_in_progress, <=, object->n_ops);
+
+	_debug("woke %d ops on OBJ%x",
+	       object->n_in_progress, object->debug_id);
+}
+
+/*
+ * release an operation
+ * - queues pending ops if this is the last in-progress op
+ */
+void fscache_put_operation(struct fscache_operation *op)
+{
+	struct fscache_object *object;
+
+	_enter("{%d}", atomic_read(&op->usage));
+
+	ASSERTCMP(atomic_read(&op->usage), >, 0);
+
+	if (!atomic_dec_and_test(&op->usage))
+		return;
+
+	_debug("PUT OP");
+	fscache_stat(&fscache_n_op_release);
+
+	if (op->release)
+		op->release(op);
+
+	object = op->object;
+	spin_lock(&object->lock);
+	if (test_bit(FSCACHE_OP_EXCLUSIVE, &op->flags)) {
+		ASSERTCMP(object->n_exclusive, >, 0);
+		object->n_exclusive--;
+	}
+
+	ASSERTCMP(object->n_in_progress, >, 0);
+	object->n_in_progress--;
+	if (object->n_in_progress == 0)
+		fscache_start_operations(object);
+
+	ASSERTCMP(object->n_ops, >, 0);
+	object->n_ops--;
+	if (object->n_ops == 0)
+		fscache_raise_event(object, FSCACHE_OBJECT_EV_CLEARED);
+
+	spin_unlock(&object->lock);
+	_leave("");
+}
+EXPORT_SYMBOL(fscache_put_operation);
+
+/*
+ * add object to queue
+ * - caller must hold fscache_thread_lock
+ */
+static unsigned __fscache_enqueue_object(struct fscache_object *object)
+{
+	if (test_bit(FSCACHE_OBJECT_SYNC, &object->flags)) {
+		list_add_tail(&object->work_link, &fscache_sync_object_fifo);
+		return 1;
+	} else {
+		list_add_tail(&object->work_link, &fscache_async_object_fifo);
+		return 2;
+	}
+}
+
+/*
+ * enqueue an object for metadata-type processing
+ */
+void fscache_enqueue_object(struct fscache_object *object)
+{
+	struct fscache_cache *cache;
+	unsigned wake = 0;
+
+	_enter("{OBJ%x}", object->debug_id);
+
+	if (list_empty(&object->work_link) &&
+	    !test_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events)) {
+		spin_lock_irq(&fscache_object_lock);
+		if (!test_bit(FSCACHE_OBJECT_LOCK, &object->flags)) {
+			if (list_empty(&object->work_link)) {
+				_debug("add");
+				cache = object->cache;
+				atomic_inc(&cache->thread_usage);
+				cache->ops->grab_object(object);
+				wake = __fscache_enqueue_object(object);
+			}
+		} else {
+			_debug("defer");
+			set_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
+		}
+		spin_unlock_irq(&fscache_object_lock);
+		if (wake) {
+			_debug("wake %u", wake);
+			switch (wake) {
+			case 1:
+				wake_up(&fscache_sync_obj_threads);
+				break;
+			case 2:
+				wake_up(&fscache_async_obj_threads);
+				break;
+			default:
+				break;
+			}
+		}
+	}
+}
+
+/*
+ * enqueue the dependents of an object for metadata-type processing
+ * - the caller must hold the object's lock
+ * - this may cause an already locked object to wind up being processed again
+ */
+void fscache_enqueue_dependents(struct fscache_object *object)
+{
+	struct fscache_object *dep;
+	unsigned wake = 0;
+
+	_enter("{%p}", object);
+
+	if (list_empty(&object->dependents))
+		return;
+
+	spin_lock_irq(&fscache_object_lock);
+
+	while (!list_empty(&object->dependents)) {
+		dep = list_entry(object->dependents.next,
+				 struct fscache_object, work_link);
+		list_del(&dep->work_link);
+
+		clear_bit(FSCACHE_OBJECT_WAITING, &object->flags);
+
+		/* sort onto appropriate lists */
+		wake |= __fscache_enqueue_object(dep);
+
+		if (!list_empty(&object->dependents) && need_resched()) {
+			spin_unlock_irq(&fscache_object_lock);
+			cond_resched();
+			spin_lock_irq(&fscache_object_lock);
+		}
+	}
+
+	spin_unlock_irq(&fscache_object_lock);
+	if (wake) {
+		_debug("wake %u", wake);
+		if (wake & 1)
+			wake_up(&fscache_sync_obj_threads);
+		if (wake & 2)
+			wake_up(&fscache_async_obj_threads);
+	}
+}
+
+/*
+ * remove an object from whatever queue it's waiting on
+ */
+void fscache_dequeue_object(struct fscache_object *object)
+{
+	_enter("{OBJ%x}", object->debug_id);
+
+	if (!list_empty(&object->dependents)) {
+		spin_lock_irq(&fscache_object_lock);
+		list_del_init(&object->work_link);
+		spin_unlock_irq(&fscache_object_lock);
+	}
+	_leave("");
+}
+
+/*
+ * boost an object that's being waited upon by moving it to the priority queue
+ */
+void fscache_boost_object(struct fscache_object *object)
+{
+	set_bit(FSCACHE_OBJECT_BOOSTED, &object->flags);
+	if (!test_bit(FSCACHE_OBJECT_WAITING, &object->flags)) {
+		fscache_stat(&fscache_n_object_boosted);
+		spin_lock_irq(&fscache_object_lock);
+		if (!list_empty(&object->work_link))
+			list_move_tail(&object->work_link,
+				       &fscache_sync_object_fifo);
+		spin_unlock_irq(&fscache_object_lock);
+	}
+}
+
+/*
+ * object dispatcher
+ * - slow threads take from the object FIFO by preference
+ * - called with fscache_thread_lock locked, which it drops
+ */
+static void fscache_dispatch_object(unsigned thread)
+{
+	struct fscache_object *object;
+	struct fscache_cache *cache = NULL;
+	unsigned long start;
+	bool sync;
+
+	spin_lock_irq(&fscache_object_lock);
+
+	if (list_empty(&fscache_async_object_fifo) &&
+	    list_empty(&fscache_sync_object_fifo)) {
+		spin_unlock_irq(&fscache_object_lock);
+		return;
+	}
+
+	fscache_stat(&fscache_n_objs_processed[thread]);
+
+	sync = !list_empty(&fscache_sync_object_fifo);
+	if (sync) {
+		object = list_entry(fscache_sync_object_fifo.next,
+				    struct fscache_object, work_link);
+		set_user_nice(current, -1);
+	} else {
+		object = list_entry(fscache_async_object_fifo.next,
+				    struct fscache_object, work_link);
+		set_user_nice(current, 1);
+	}
+
+	/* lock the object so that it's only processed by one thread at
+	 * once */
+	_debug("LK OBJ%x", object->debug_id);
+	list_del_init(&object->work_link);
+	if (test_and_set_bit(FSCACHE_OBJECT_LOCK, &object->flags)) {
+		_debug("OBJ%x already locked {%s,%lx}\n",
+		       object->debug_id,
+		       fscache_object_states[object->state],
+		       object->events & object->event_mask);
+		set_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
+		spin_unlock_irq(&fscache_object_lock);
+		return;
+	}
+	do {
+		spin_unlock_irq(&fscache_object_lock);
+
+		do {
+			clear_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
+			start = jiffies;
+			fscache_object_state_machine(object);
+			fscache_hist(fscache_objs_histogram, start);
+		} while (object->events & object->event_mask);
+
+		spin_lock_irq(&fscache_object_lock);
+	} while (object->events & object->event_mask);
+
+	/* unlock the object */
+	_debug("UN OBJ%x", object->debug_id);
+	if (!test_and_clear_bit(FSCACHE_OBJECT_LOCK, &object->flags))
+		BUG();
+
+	spin_unlock_irq(&fscache_object_lock);
+
+	if (object) {
+		/* must do the wake up outside the thread pool lock to avoid a
+		 * circular lock dependency against a __wake_up() lock in
+		 * CacheFiles */
+		wake_up_bit(&object->lock, FSCACHE_OBJECT_LOCK);
+		cache = object->cache;
+		cache->ops->put_object(object);
+		_debug("%d REMAIN", atomic_read(&cache->thread_usage));
+		if (atomic_dec_and_test(&cache->thread_usage)) {
+			_debug("ALL GONE");
+			wake_up_all(&fscache_clearance_wq);
+		}
+	}
+}
+
+/*
+ * operation dispatcher
+ * - all threads can take from the fast FIFO
+ * - called with fscache_thread_lock locked, which it drops
+ */
+static void fscache_dispatch_operation(unsigned thread, struct list_head *queue)
+{
+	struct fscache_operation *op;
+	unsigned long start;
+
+	spin_lock_irq(&fscache_operation_lock);
+
+	if (list_empty(queue)) {
+		spin_unlock_irq(&fscache_operation_lock);
+		return;
+	}
+
+	fscache_stat(&fscache_n_ops_processed[thread]);
+
+	op = list_entry(queue->next, struct fscache_operation, work_link);
+
+	/* lock the operation so that it's only processed by one thread
+	 * at once */
+	_debug("LK OP OBJ%x", op->object->debug_id);
+	if (test_and_set_bit(FSCACHE_OP_LOCK, &op->flags)) {
+		printk(KERN_ERR "FS-Cache: OP on OBJ%x already locked\n",
+		       op->object->debug_id);
+		BUG();
+	}
+	list_del_init(&op->work_link);
+	spin_unlock_irq(&fscache_operation_lock);
+
+	ASSERT(op->processor != NULL);
+	start = jiffies;
+	op->processor(op);
+	fscache_hist(fscache_ops_histogram, start);
+
+	/* unlock the op and requeue if requested */
+	spin_lock_irq(&fscache_operation_lock);
+	_debug("UN OP OBJ%x", op->object->debug_id);
+	clear_bit(FSCACHE_OP_LOCK, &op->flags);
+
+	if (test_and_clear_bit(FSCACHE_OP_REQUEUE, &op->flags)) {
+		list_add(&op->work_link, queue);
+		op = NULL;
+	}
+
+	spin_unlock_irq(&fscache_operation_lock);
+
+	if (op)
+		fscache_put_operation(op);
+}
+
+/*
+ * thread dispatcher
+ */
+static void fscache_dispatch(unsigned thread, unsigned level)
+{
+	_enter("");
+
+	switch (level) {
+	case 2:
+		/* threads 2, 5, 8, 11, ... do object processing then sync
+		 * object processing, then async ops then sync ops */
+		if (!list_empty(&fscache_async_object_fifo) ||
+		    !list_empty(&fscache_sync_object_fifo)) {
+			fscache_dispatch_object(thread);
+			break;
+		}
+	case 1:
+		/* threads 1, 4, 7, 10, ... do async ops then sync ops */
+		if (!list_empty(&fscache_async_op_fifo)) {
+			fscache_dispatch_operation(thread,
+						   &fscache_async_op_fifo);
+			break;
+		}
+		if (!list_empty(&fscache_sync_op_fifo)) {
+			fscache_dispatch_operation(thread,
+						   &fscache_sync_op_fifo);
+			break;
+		}
+		break;
+	default:
+		/* threads 0, 3, 6, 9, ... do sync ops only */
+		if (!list_empty(&fscache_sync_op_fifo)) {
+			fscache_dispatch_operation(thread,
+						   &fscache_sync_op_fifo);
+			break;
+		}
+		break;
+	}
+
+	_leave("");
+}
+
+/*
+ * operation dispatcher thread (where thread ID % 3 == 2) dedicated to doing
+ * asynchronous and synchronous object processing and asynchronous and
+ * synchronous operation processing in that order
+ */
+static int kfscached_type_2(unsigned thread)
+{
+	DECLARE_WAITQUEUE(myself_async_obj, current);
+	DECLARE_WAITQUEUE(myself_sync_obj, current);
+	DECLARE_WAITQUEUE(myself_async_op, current);
+	DECLARE_WAITQUEUE(myself_sync_op, current);
+
+	/* set_user_nice(current, -4); */
+
+	do {
+		while (!list_empty(&fscache_async_object_fifo) ||
+		       !list_empty(&fscache_sync_object_fifo) ||
+		       !list_empty(&fscache_async_op_fifo) ||
+		       !list_empty(&fscache_sync_op_fifo)
+		       ) {
+			fscache_dispatch(thread, 2);
+			cond_resched();
+		}
+
+		if (thread / 3 & 1) {
+			add_wait_queue(&fscache_async_obj_threads,
+				       &myself_async_obj);
+			add_wait_queue_tail(&fscache_sync_obj_threads,
+					    &myself_sync_obj);
+		} else {
+			add_wait_queue_tail(&fscache_async_obj_threads,
+					    &myself_async_obj);
+			add_wait_queue(&fscache_sync_obj_threads,
+				       &myself_sync_obj);
+		}
+		add_wait_queue_tail(&fscache_async_op_threads,
+				    &myself_async_op);
+		add_wait_queue_tail(&fscache_sync_op_threads, &myself_sync_op);
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (list_empty(&fscache_async_object_fifo) &&
+		    list_empty(&fscache_sync_object_fifo) &&
+		    list_empty(&fscache_async_op_fifo) &&
+		    list_empty(&fscache_sync_op_fifo) &&
+		    !kthread_should_stop())
+			schedule();
+		remove_wait_queue(&fscache_async_obj_threads,
+				  &myself_async_obj);
+		remove_wait_queue(&fscache_sync_obj_threads, &myself_sync_obj);
+		remove_wait_queue(&fscache_async_op_threads, &myself_async_op);
+		remove_wait_queue(&fscache_sync_op_threads, &myself_sync_op);
+		__set_current_state(TASK_RUNNING);
+	} while (!kthread_should_stop());
+
+	return 0;
+}
+
+/*
+ * operation dispatcher thread (where thread ID % 3 == 1) dedicated to doing
+ * asynchronous and synchronous operation processing only
+ */
+static int kfscached_type_1(unsigned thread)
+{
+	DECLARE_WAITQUEUE(myself_async_op, current);
+	DECLARE_WAITQUEUE(myself_sync_op, current);
+
+	do {
+		while (!list_empty(&fscache_async_op_fifo) ||
+		       !list_empty(&fscache_sync_op_fifo)) {
+			fscache_dispatch(thread, 1);
+			cond_resched();
+		}
+
+		add_wait_queue(&fscache_async_op_threads, &myself_async_op);
+		add_wait_queue_tail(&fscache_sync_op_threads, &myself_sync_op);
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (list_empty(&fscache_async_op_fifo) &&
+		    list_empty(&fscache_sync_op_fifo) &&
+		    !kthread_should_stop())
+			schedule();
+		remove_wait_queue(&fscache_async_op_threads, &myself_async_op);
+		remove_wait_queue(&fscache_sync_op_threads, &myself_sync_op);
+		__set_current_state(TASK_RUNNING);
+	} while (!kthread_should_stop());
+
+	return 0;
+}
+
+/*
+ * operation dispatcher thread (where thread ID % 3 == 0) dedicated to doing
+ * synchronous operation processing only
+ */
+static int kfscached(void *_thread)
+{
+	unsigned thread = (unsigned long) _thread;
+
+	DECLARE_WAITQUEUE(myself_sync_op, current);
+
+	switch (thread % 3) {
+	case 2:
+		return kfscached_type_2(thread);
+	case 1:
+		return kfscached_type_1(thread);
+	default:
+		break;
+	}
+
+	set_user_nice(current, 1);
+
+	do {
+		while (!list_empty(&fscache_sync_op_fifo)) {
+			fscache_dispatch(thread, 0);
+			cond_resched();
+		}
+
+		add_wait_queue(&fscache_sync_op_threads, &myself_sync_op);
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (list_empty(&fscache_sync_op_fifo) &&
+		    !kthread_should_stop())
+			schedule();
+		remove_wait_queue(&fscache_sync_op_threads, &myself_sync_op);
+		__set_current_state(TASK_RUNNING);
+	} while (!kthread_should_stop());
+
+	return 0;
+}
+
+/*
+ * initialise the worker thread pool
+ */
+int fscache_init_threads(void)
+{
+	static bool inited = false;
+	struct task_struct *t;
+	unsigned long loop, max;
+	int ret;
+
+	_enter("");
+
+	mutex_lock(&fscache_thread_mutex);
+
+	if (!inited) {
+		max = fscache_n_threads;
+		if (max < FSCACHE_MIN_THREADS)
+			max = FSCACHE_MIN_THREADS;
+		else if (max > FSCACHE_MAX_THREADS)
+			max = FSCACHE_MAX_THREADS;
+
+		for (loop = 0; loop < max; loop++) {
+			t = kthread_create(kfscached, (void *) loop,
+					   "kfsc%02ud", loop);
+			if (IS_ERR(t))
+				goto failed;
+			fscache_threads[loop] = t;
+			wake_up_process(t);
+		}
+
+		inited = true;
+	}
+	mutex_unlock(&fscache_thread_mutex);
+	_leave(" = 0");
+	return 0;
+
+failed:
+	ret = PTR_ERR(t);
+	printk(KERN_ERR "FS-Cache: Unable to create kfscached threads (%d)\n",
+	       ret);
+
+	while (loop-- > 0) {
+		if (fscache_threads[loop]) {
+			kthread_stop(fscache_threads[loop]);
+			fscache_threads[loop] = NULL;
+		}
+	}
+
+	mutex_unlock(&fscache_thread_mutex);
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * kill the pool of running threads
+ */
+void fscache_kill_threads(void)
+{
+	unsigned int loop;
+
+	_enter("");
+	mutex_lock(&fscache_thread_mutex);
+
+	for (loop = 0; loop < FSCACHE_MAX_THREADS; loop++)
+		if (fscache_threads[loop])
+			kthread_stop(fscache_threads[loop]);
+
+	BUG_ON(!list_empty(&fscache_async_object_fifo));
+	BUG_ON(!list_empty(&fscache_sync_object_fifo));
+	BUG_ON(!list_empty(&fscache_async_op_fifo));
+	BUG_ON(!list_empty(&fscache_sync_op_fifo));
+
+	mutex_unlock(&fscache_thread_mutex);
+	_leave("");
+}
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
new file mode 100644
index 0000000..176797f
--- /dev/null
+++ b/include/linux/fscache-cache.h
@@ -0,0 +1,433 @@
+/* General filesystem caching backing cache interface
+ *
+ * Copyright (C) 2004-2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * NOTE!!! See:
+ *
+ *	Documentation/filesystems/caching/backend-api.txt
+ *
+ * for a description of the cache backend interface declared here.
+ */
+
+#ifndef _LINUX_FSCACHE_CACHE_H
+#define _LINUX_FSCACHE_CACHE_H
+
+#include <linux/fscache.h>
+
+#define NR_MAXCACHES BITS_PER_LONG
+
+struct fscache_cache;
+struct fscache_cache_ops;
+struct fscache_object;
+
+#ifdef CONFIG_FSCACHE_PROC
+extern struct proc_dir_entry *proc_fscache;
+#endif
+
+/*
+ * cache tag definition
+ */
+struct fscache_cache_tag {
+	struct list_head	link;
+	struct fscache_cache	*cache;		/* cache referred to by this tag */
+	unsigned long		flags;
+#define FSCACHE_TAG_RESERVED	0		/* T if tag is reserved for a cache */
+	atomic_t		usage;
+	char			name[0];	/* tag name */
+};
+
+/*
+ * cache definition
+ */
+struct fscache_cache {
+	const struct fscache_cache_ops *ops;
+	struct fscache_cache_tag *tag;		/* tag representing this cache */
+	struct kobject		kobj;		/* system representation of this cache */
+	struct list_head	link;		/* link in list of caches */
+	size_t			max_index_size;	/* maximum size of index data */
+	char			identifier[32];	/* cache label */
+
+	/* node management */
+	struct list_head	object_list;	/* list of data/index objects */
+	spinlock_t		object_list_lock;
+	atomic_t		thread_usage;	/* no. of threads working this cache */
+	struct fscache_object	*fsdef;		/* object for the fsdef index */
+	unsigned long		flags;
+#define FSCACHE_IOERROR		0	/* cache stopped on I/O error */
+#define FSCACHE_CACHE_WITHDRAWN	1	/* cache has been withdrawn */
+};
+
+/*
+ * asynchronous operation being applied to or waiting to be applied to a cache
+ * object
+ * - slow operations are done in the context of the process that issued them,
+ *   not in the context of kfscached
+ */
+struct fscache_operation {
+	struct list_head	work_link;	/* link in worker thread FIFO or
+						 * link in object->pending_ops */
+	struct fscache_object	*object;	/* object to be operated upon */
+
+	unsigned long		flags;
+#define FSCACHE_OP_WAITING	0	/* cleared when op is woken */
+#define FSCACHE_OP_SYNC		1	/* synchronous operation */
+#define FSCACHE_OP_EXCLUSIVE	2	/* exclusive op, other ops must wait */
+#define FSCACHE_OP_LOCK		3	/* thread pool processing lock */
+#define FSCACHE_OP_REQUEUE	4	/* op needs more processing */
+
+	atomic_t		usage;
+
+	/* operation processor callback
+	 * - can be NULL if FSCACHE_OP_WAITING is going to be used to perform
+	 *   the op in a non-pool thread */
+	void (*processor)(struct fscache_operation *op);
+
+	/* operation releaser */
+	void (*release)(struct fscache_operation *op);
+};
+
+extern void fscache_enqueue_operation(struct fscache_operation *);
+extern void fscache_put_operation(struct fscache_operation *);
+
+/*
+ * data read operation
+ */
+struct fscache_retrieval {
+	struct fscache_operation op;
+	struct address_space	*mapping;	/* netfs pages */
+	fscache_rw_complete_t	end_io_func;	/* function to call on I/O completion */
+	void			*context;	/* netfs read context (pinned) */
+	struct list_head	to_do;		/* list of things to be done by the backend */
+	unsigned long		start_time;	/* time at which retrieval started */
+};
+
+typedef int (*fscache_page_retrieval_func_t)(struct fscache_retrieval *op,
+					     struct page *page,
+					     gfp_t gfp);
+
+typedef int (*fscache_pages_retrieval_func_t)(struct fscache_retrieval *op,
+					      struct list_head *pages,
+					      unsigned *nr_pages,
+					      gfp_t gfp);
+
+/**
+ * fscache_get_retrieval - Get an extra reference on a retrieval operation
+ * @op: The retrieval operation to get a reference on
+ *
+ * Get an extra reference on a retrieval operation.
+ */
+static inline
+struct fscache_retrieval *fscache_get_retrieval(struct fscache_retrieval *op)
+{
+	atomic_inc(&op->op.usage);
+	return op;
+}
+
+/**
+ * fscache_enqueue_retrieval - Enqueue a retrieval operation for processing
+ * @op: The retrieval operation affected
+ *
+ * Enqueue a retrieval operation for processing by the FS-Cache thread pool.
+ */
+static inline void fscache_enqueue_retrieval(struct fscache_retrieval *op)
+{
+	fscache_enqueue_operation(&op->op);
+}
+
+/**
+ * fscache_put_retrieval - Drop a reference to a retrieval operation
+ * @op: The retrieval operation affected
+ *
+ * Drop a reference to a retrieval operation.
+ */
+static inline void fscache_put_retrieval(struct fscache_retrieval *op)
+{
+	fscache_put_operation(&op->op);
+}
+
+/*
+ * cached page storage work item
+ * - used to do three things:
+ *   - batch writes to the cache
+ *   - do cache writes asynchronously
+ *   - defer writes until cache object lookup completion
+ */
+struct fscache_storage {
+	struct fscache_operation op;
+	pgoff_t			store_limit;	/* don't write more than this */
+};
+
+/*
+ * cache operations
+ */
+struct fscache_cache_ops {
+	/* name of cache provider */
+	const char *name;
+
+	/* allocate an object record for a cookie */
+	struct fscache_object *(*alloc_object)(struct fscache_cache *cache,
+					       struct fscache_cookie *cookie);
+
+	/* look up the object for a cookie */
+	void (*lookup_object)(struct fscache_object *object);
+
+	/* finished looking up */
+	void (*lookup_complete)(struct fscache_object *object);
+
+	/* increment the usage count on this object (may fail if unmounting) */
+	struct fscache_object *(*grab_object)(struct fscache_object *object);
+
+	/* pin an object in the cache */
+	int (*pin_object)(struct fscache_object *object);
+
+	/* unpin an object in the cache */
+	void (*unpin_object)(struct fscache_object *object);
+
+	/* store the updated auxilliary data on an object */
+	void (*update_object)(struct fscache_object *object);
+
+	/* discard the resources pinned by an object and effect retirement if
+	 * necessary */
+	void (*drop_object)(struct fscache_object *object);
+
+	/* dispose of a reference to an object */
+	void (*put_object)(struct fscache_object *object);
+
+	/* sync a cache */
+	void (*sync_cache)(struct fscache_cache *cache);
+
+	/* notification that the attributes of a non-index object (such as
+	 * i_size) have changed */
+	int (*attr_changed)(struct fscache_object *object);
+
+	/* reserve space for an object's data and associated metadata */
+	int (*reserve_space)(struct fscache_object *object, loff_t i_size);
+
+	/* request a backing block for a page be read or allocated in the
+	 * cache */
+	fscache_page_retrieval_func_t read_or_alloc_page;
+
+	/* request backing blocks for a list of pages be read or allocated in
+	 * the cache */
+	fscache_pages_retrieval_func_t read_or_alloc_pages;
+
+	/* request a backing block for a page be allocated in the cache so that
+	 * it can be written directly */
+	fscache_page_retrieval_func_t allocate_page;
+
+	/* request backing blocks for pages be allocated in the cache so that
+	 * they can be written directly */
+	fscache_pages_retrieval_func_t allocate_pages;
+
+	/* write a page to its backing block in the cache */
+	int (*write_page)(struct fscache_storage *op, struct page *page);
+
+	/* detach backing block from a page (optional)
+	 * - must release the cookie lock before returning
+	 * - may sleep
+	 */
+	void (*uncache_page)(struct fscache_object *object,
+			     struct page *page);
+
+	/* dissociate a cache from all the pages it was backing */
+	void (*dissociate_pages)(struct fscache_cache *cache);
+};
+
+/*
+ * data file or index object cookie
+ * - a file will only appear in one cache
+ * - a request to cache a file may or may not be honoured, subject to
+ *   constraints such as disk space
+ * - indices are created on disk just-in-time
+ */
+struct fscache_cookie {
+	atomic_t			usage;		/* number of users of this cookie */
+	atomic_t			n_children;	/* number of children of this cookie */
+	spinlock_t			lock;
+	struct hlist_head		backing_objects; /* object(s) backing this file/index */
+	const struct fscache_cookie_def	*def;		/* definition */
+	struct fscache_cookie		*parent;	/* parent of this entry */
+	void				*netfs_data;	/* back pointer to netfs */
+	unsigned long			flags;
+#define FSCACHE_COOKIE_LOOKING_UP	0	/* T if non-index cookie being looked up still */
+#define FSCACHE_COOKIE_CREATING		1	/* T if non-index object being created still */
+#define FSCACHE_COOKIE_NO_DATA_YET	2	/* T if new object with no cached data yet */
+#define FSCACHE_COOKIE_PENDING_FILL	3	/* T if pending initial fill on object */
+#define FSCACHE_COOKIE_FILLING		4	/* T if filling object incrementally */
+#define FSCACHE_COOKIE_UNAVAILABLE	5	/* T if cookie is unavailable (error, etc) */
+};
+
+extern struct fscache_cookie fscache_fsdef_index;
+
+/*
+ * on-disk cache file or index handle
+ */
+struct fscache_object {
+	enum {
+		FSCACHE_OBJECT_INIT,		/* object in initial unbound state */
+		FSCACHE_OBJECT_LOOKING_UP,	/* looking up object */
+		FSCACHE_OBJECT_CREATING,	/* creating object */
+
+		/* active states */
+		FSCACHE_OBJECT_AVAILABLE,	/* cleaning up object after creation */
+		FSCACHE_OBJECT_ACTIVE,		/* object is usable */
+		FSCACHE_OBJECT_UPDATING,	/* object is updating */
+
+		/* terminal states */
+		FSCACHE_OBJECT_DYING,		/* object waiting for accessors to finish */
+		FSCACHE_OBJECT_LC_DYING,	/* object cleaning up after lookup/create */
+		FSCACHE_OBJECT_ABORT_INIT,	/* abort the init state */
+		FSCACHE_OBJECT_RELEASING,	/* releasing object */
+		FSCACHE_OBJECT_RECYCLING,	/* retiring object */
+		FSCACHE_OBJECT_WITHDRAWING,	/* withdrawing object */
+		FSCACHE_OBJECT_DEAD,		/* object is now dead */
+	} state;
+
+	int			debug_id;	/* debugging ID */
+	int			n_children;	/* number of child objects */
+	int			n_ops;		/* number of ops outstanding on object */
+	int			n_in_progress;	/* number of ops in progress */
+	int			n_exclusive;	/* number of exclusive ops queued */
+	spinlock_t		lock;		/* state and operations lock */
+
+	unsigned long		lookup_jif;	/* time at which lookup started */
+	unsigned long		event_mask;	/* events this object is interested in */
+	unsigned long		events;		/* events to be processed by this object
+						 * (order is important - using fls) */
+#define FSCACHE_OBJECT_EV_REQUEUE	0	/* T if object should be requeued */
+#define FSCACHE_OBJECT_EV_UPDATE	1	/* T if object should be updated */
+#define FSCACHE_OBJECT_EV_CLEARED	2	/* T if accessors all gone */
+#define FSCACHE_OBJECT_EV_ERROR		3	/* T if fatal error occurred during processing */
+#define FSCACHE_OBJECT_EV_RELEASE	4	/* T if netfs requested object release */
+#define FSCACHE_OBJECT_EV_RETIRE	5	/* T if netfs requested object retirement */
+#define FSCACHE_OBJECT_EV_WITHDRAW	6	/* T if cache requested object withdrawal */
+
+	unsigned long		flags;
+#define FSCACHE_OBJECT_LOCK		0	/* T if object is busy being processed */
+#define FSCACHE_OBJECT_SYNC		1	/* T if object is has waiters */
+#define FSCACHE_OBJECT_PENDING_WRITE	2	/* T if object has pending write */
+#define FSCACHE_OBJECT_WAITING		3	/* T if object is waiting on its parent */
+#define FSCACHE_OBJECT_BOOSTED		4	/* T if object was boosted to priority queue */
+
+	struct list_head	cache_link;	/* link in cache->object_list */
+	struct hlist_node	cookie_link;	/* link in cookie->backing_objects */
+	struct fscache_cache	*cache;		/* cache that supplied this object */
+	struct fscache_cookie	*cookie;	/* netfs's file/index object */
+	struct fscache_object	*parent;	/* parent object */
+	struct list_head	work_link;	/* link in worker thread FIFO */
+	struct list_head	dependents;	/* FIFO of dependent objects */
+	struct list_head	pending_ops;	/* unstarted operations on this object */
+	struct radix_tree_root	stores;		/* data to be stored */
+	pgoff_t			store_limit;	/* current storage limit */
+};
+
+extern const char *fscache_object_states[];
+
+#define fscache_object_is_active(obj)			      \
+	(!test_bit(FSCACHE_IOERROR, &(obj)->cache->flags) &&  \
+	 (obj)->state >= FSCACHE_OBJECT_AVAILABLE &&	      \
+	 (obj)->state < FSCACHE_OBJECT_DYING)
+
+/**
+ * fscache_object_init - Initialise a cache object description
+ * @object: Object description
+ *
+ * Initialise a cache object description to its basic values.
+ *
+ * See Documentation/filesystems/caching/backend-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_object_init(struct fscache_object *object)
+{
+	object->state = FSCACHE_OBJECT_INIT;
+	spin_lock_init(&object->lock);
+	INIT_LIST_HEAD(&object->cache_link);
+	INIT_HLIST_NODE(&object->cookie_link);
+	INIT_LIST_HEAD(&object->work_link);
+	INIT_LIST_HEAD(&object->dependents);
+	INIT_LIST_HEAD(&object->pending_ops);
+	INIT_RADIX_TREE(&object->stores, GFP_NOFS);
+	object->n_children = 0;
+	object->n_ops = object->n_in_progress = object->n_exclusive = 0;
+	object->events = object->event_mask = 0;
+	object->flags = 0;
+	object->store_limit = 0;
+	object->cache = NULL;
+	object->cookie = NULL;
+}
+
+extern void fscache_object_lookup_negative(struct fscache_object *object);
+extern void fscache_obtained_object(struct fscache_object *object);
+
+/**
+ * fscache_object_lookup_error - Note an object encountered an error
+ * @object: The object on which the error was encountered
+ *
+ * Note that an object encountered a fatal error (usually an I/O error) and
+ * that it should be withdrawn as soon as possible.
+ */
+static inline void fscache_object_lookup_error(struct fscache_object *object)
+{
+	set_bit(FSCACHE_OBJECT_EV_ERROR, &object->events);
+}
+
+/**
+ * fscache_set_store_limit - Set the maximum size to be stored in an object
+ * @object: The object to set the maximum on
+ * @i_size: The limit to set in bytes
+ *
+ * Set the maximum size an object is permitted to reach, implying the highest
+ * byte that may be written.  Intended to be called by the attr_changed() op.
+ *
+ * See Documentation/filesystems/caching/backend-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_set_store_limit(struct fscache_object *object, loff_t i_size)
+{
+	object->store_limit = i_size >> PAGE_SHIFT;
+	if (i_size & ~PAGE_MASK)
+		object->store_limit++;
+}
+
+/**
+ * fscache_end_io - End a retrieval operation on a page
+ * @op: The FS-Cache operation covering the retrieval
+ * @page: The page that was to be fetched
+ * @error: The error code (0 if successful)
+ *
+ * Note the end of an operation to retrieve a page, as covered by a particular
+ * operation record.
+ */
+static inline void fscache_end_io(struct fscache_retrieval *op,
+				  struct page *page, int error)
+{
+	op->end_io_func(page, op->context, error);
+}
+
+/*
+ * out-of-line cache backend functions
+ */
+extern void fscache_init_cache(struct fscache_cache *cache,
+			       const struct fscache_cache_ops *ops,
+			       const char *idfmt,
+			       ...) __attribute__ ((format (printf, 3, 4)));
+
+extern int fscache_add_cache(struct fscache_cache *cache,
+			     struct fscache_object *fsdef,
+			     const char *tagname);
+extern void fscache_withdraw_cache(struct fscache_cache *cache);
+
+extern void fscache_io_error(struct fscache_cache *cache);
+
+extern void fscache_mark_pages_cached(struct fscache_retrieval *op,
+				      struct pagevec *pagevec);
+
+#endif /* _LINUX_FSCACHE_CACHE_H */
diff --git a/include/linux/fscache.h b/include/linux/fscache.h
new file mode 100644
index 0000000..91da5e8
--- /dev/null
+++ b/include/linux/fscache.h
@@ -0,0 +1,593 @@
+/* General filesystem caching interface
+ *
+ * Copyright (C) 2004-2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * NOTE!!! See:
+ *
+ *	Documentation/filesystems/caching/netfs-api.txt
+ *
+ * for a description of the network filesystem interface declared here.
+ */
+
+#ifndef _LINUX_FSCACHE_H
+#define _LINUX_FSCACHE_H
+
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/pagemap.h>
+#include <linux/pagevec.h>
+
+#if defined(CONFIG_FSCACHE) || defined(CONFIG_FSCACHE_MODULE)
+#define fscache_available() (1)
+#define fscache_cookie_valid(cookie) (cookie)
+#else
+#define fscache_available() (0)
+#define fscache_cookie_valid(cookie) (0)
+#endif
+
+
+/* pattern used to fill dead space in an index entry */
+#define FSCACHE_INDEX_DEADFILL_PATTERN 0x79
+
+struct pagevec;
+struct fscache_cache_tag;
+struct fscache_cookie;
+struct fscache_netfs;
+struct fscache_netfs_operations;
+
+typedef void (*fscache_rw_complete_t)(struct page *page,
+				      void *context,
+				      int error);
+
+/* result of index entry consultation */
+enum fscache_checkaux {
+	FSCACHE_CHECKAUX_OKAY,		/* entry okay as is */
+	FSCACHE_CHECKAUX_NEEDS_UPDATE,	/* entry requires update */
+	FSCACHE_CHECKAUX_OBSOLETE,	/* entry requires deletion */
+};
+
+/*
+ * fscache cookie definition
+ */
+struct fscache_cookie_def {
+	/* name of cookie type */
+	char name[16];
+
+	/* cookie type */
+	uint8_t type;
+#define FSCACHE_COOKIE_TYPE_INDEX	0
+#define FSCACHE_COOKIE_TYPE_DATAFILE	1
+
+	/* select the cache into which to insert an entry in this index
+	 * - optional
+	 * - should return a cache identifier or NULL to cause the cache to be
+	 *   inherited from the parent if possible or the first cache picked
+	 *   for a non-index file if not
+	 */
+	struct fscache_cache_tag *(*select_cache)(
+		const void *parent_netfs_data,
+		const void *cookie_netfs_data);
+
+	/* get an index key
+	 * - should store the key data in the buffer
+	 * - should return the amount of amount stored
+	 * - not permitted to return an error
+	 * - the netfs data from the cookie being used as the source is
+	 *   presented
+	 */
+	uint16_t (*get_key)(const void *cookie_netfs_data,
+			    void *buffer,
+			    uint16_t bufmax);
+
+	/* get certain file attributes from the netfs data
+	 * - this function can be absent for an index
+	 * - not permitted to return an error
+	 * - the netfs data from the cookie being used as the source is
+	 *   presented
+	 */
+	void (*get_attr)(const void *cookie_netfs_data, uint64_t *size);
+
+	/* get the auxilliary data from netfs data
+	 * - this function can be absent if the index carries no state data
+	 * - should store the auxilliary data in the buffer
+	 * - should return the amount of amount stored
+	 * - not permitted to return an error
+	 * - the netfs data from the cookie being used as the source is
+	 *   presented
+	 */
+	uint16_t (*get_aux)(const void *cookie_netfs_data,
+			    void *buffer,
+			    uint16_t bufmax);
+
+	/* consult the netfs about the state of an object
+	 * - this function can be absent if the index carries no state data
+	 * - the netfs data from the cookie being used as the target is
+	 *   presented, as is the auxilliary data
+	 */
+	enum fscache_checkaux (*check_aux)(void *cookie_netfs_data,
+					   const void *data,
+					   uint16_t datalen);
+
+	/* get an extra reference on a read context
+	 * - this function can be absent if the completion function doesn't
+	 *   require a context
+	 */
+	void (*get_context)(void *cookie_netfs_data, void *context);
+
+	/* release an extra reference on a read context
+	 * - this function can be absent if the completion function doesn't
+	 *   require a context
+	 */
+	void (*put_context)(void *cookie_netfs_data, void *context);
+
+	/* indicate pages that now have cache metadata retained
+	 * - this function should mark the specified pages as now being cached
+	 * - the pages will have been marked with PG_fscache before this is
+	 *   called, so this is optional
+	 */
+	void (*mark_pages_cached)(void *cookie_netfs_data,
+				  struct address_space *mapping,
+				  struct pagevec *cached_pvec);
+
+	/* indicate the cookie is no longer cached
+	 * - this function is called when the backing store currently caching
+	 *   a cookie is removed
+	 * - the netfs should use this to clean up any markers indicating
+	 *   cached pages
+	 * - this is mandatory for any object that may have data
+	 */
+	void (*now_uncached)(void *cookie_netfs_data);
+};
+
+/*
+ * netfs operations pointer (currently there aren't any ops)
+ */
+struct fscache_netfs_operations {
+};
+
+/*
+ * fscache cached network filesystem type
+ * - name, version and ops must be filled in before registration
+ * - all other fields will be set during registration
+ */
+struct fscache_netfs {
+	uint32_t			version;	/* indexing version */
+	const char			*name;		/* filesystem name */
+	struct fscache_cookie		*primary_index;
+	const struct fscache_netfs_operations *ops;
+	struct list_head		link;		/* internal link */
+};
+
+/*
+ * slow-path functions for when there is actually caching available, and the
+ * netfs does actually have a valid token
+ * - these are not to be called directly
+ * - these are undefined symbols when FS-Cache is not configured and the
+ *   optimiser takes care of not using them
+ */
+extern int __fscache_register_netfs(struct fscache_netfs *);
+extern void __fscache_unregister_netfs(struct fscache_netfs *);
+extern struct fscache_cache_tag *__fscache_lookup_cache_tag(const char *);
+extern void __fscache_release_cache_tag(struct fscache_cache_tag *);
+extern struct fscache_cookie *__fscache_acquire_cookie(
+	struct fscache_cookie *,
+	const struct fscache_cookie_def *,
+	void *);
+extern void __fscache_relinquish_cookie(struct fscache_cookie *, int);
+extern void __fscache_update_cookie(struct fscache_cookie *);
+extern int __fscache_pin_cookie(struct fscache_cookie *);
+extern void __fscache_unpin_cookie(struct fscache_cookie *);
+extern int __fscache_attr_changed(struct fscache_cookie *);
+extern int __fscache_reserve_space(struct fscache_cookie *, loff_t);
+extern int __fscache_read_or_alloc_page(struct fscache_cookie *,
+					struct page *,
+					fscache_rw_complete_t,
+					void *,
+					gfp_t);
+extern int __fscache_read_or_alloc_pages(struct fscache_cookie *,
+					 struct address_space *,
+					 struct list_head *,
+					 unsigned *,
+					 fscache_rw_complete_t,
+					 void *,
+					 gfp_t);
+extern int __fscache_alloc_page(struct fscache_cookie *, struct page *, gfp_t);
+extern int __fscache_write_page(struct fscache_cookie *, struct page *, gfp_t);
+
+extern int __fscache_write_pages(struct fscache_cookie *,
+				 struct pagevec *,
+				 fscache_rw_complete_t,
+				 void *,
+				 gfp_t);
+extern void __fscache_uncache_page(struct fscache_cookie *, struct page *);
+extern void __fscache_uncache_pages(struct fscache_cookie *, struct pagevec *);
+
+/**
+ * fscache_register_netfs - Register a filesystem as desiring caching services
+ * @netfs: The description of the filesystem
+ *
+ * Register a filesystem as desiring caching services if they're available.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_register_netfs(struct fscache_netfs *netfs)
+{
+	if (fscache_available())
+		return __fscache_register_netfs(netfs);
+	else
+		return 0;
+}
+
+/**
+ * fscache_unregister_netfs - Indicate that a filesystem no longer desires
+ * caching services
+ * @netfs: The description of the filesystem
+ *
+ * Indicate that a filesystem no longer desires caching services for the
+ * moment.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_unregister_netfs(struct fscache_netfs *netfs)
+{
+	if (fscache_available())
+		__fscache_unregister_netfs(netfs);
+}
+
+/**
+ * fscache_lookup_cache_tag - Look up a cache tag
+ * @name: The name of the tag to search for
+ *
+ * Acquire a specific cache referral tag that can be used to select a specific
+ * cache in which to cache an index.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name)
+{
+	if (fscache_available())
+		return __fscache_lookup_cache_tag(name);
+	else
+		return NULL;
+}
+
+/**
+ * fscache_release_cache_tag - Release a cache tag
+ * @tag: The tag to release
+ *
+ * Release a reference to a cache referral tag previously looked up.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_release_cache_tag(struct fscache_cache_tag *tag)
+{
+	if (fscache_available())
+		__fscache_release_cache_tag(tag);
+}
+
+/**
+ * fscache_acquire_cookie - Acquire a cookie to represent a cache object
+ * @parent: The cookie that's to be the parent of this one
+ * @def: A description of the cache object, including callback operations
+ * @netfs_data: An arbitrary piece of data to be kept in the cookie to
+ * represent the cache object to the netfs
+ *
+ * This function is used to inform FS-Cache about part of an index hierarchy
+ * that can be used to locate files.  This is done by requesting a cookie for
+ * each index in the path to the file.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+struct fscache_cookie *fscache_acquire_cookie(
+	struct fscache_cookie *parent,
+	const struct fscache_cookie_def *def,
+	void *netfs_data)
+{
+	if (fscache_cookie_valid(parent))
+		return __fscache_acquire_cookie(parent, def, netfs_data);
+	else
+		return NULL;
+}
+
+/**
+ * fscache_relinquish_cookie - Return the cookie to the cache, maybe discarding
+ * it
+ * @cookie: The cookie being returned
+ * @retire: True if the cache object the cookie represents is to be discarded
+ *
+ * This function returns a cookie to the cache, forcibly discarding the
+ * associated cache object if retire is set to true.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_relinquish_cookie(struct fscache_cookie *cookie, int retire)
+{
+	if (fscache_cookie_valid(cookie))
+		__fscache_relinquish_cookie(cookie, retire);
+}
+
+/**
+ * fscache_update_cookie - Request that a cache object be updated
+ * @cookie: The cookie representing the cache object
+ *
+ * Request an update of the index data for the cache object associated with the
+ * cookie.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_update_cookie(struct fscache_cookie *cookie)
+{
+	if (fscache_cookie_valid(cookie))
+		__fscache_update_cookie(cookie);
+}
+
+/**
+ * fscache_pin_cookie - Pin a data-storage cache object in its cache
+ * @cookie: The cookie representing the cache object
+ *
+ * Permit data-storage cache objects to be pinned in the cache.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_pin_cookie(struct fscache_cookie *cookie)
+{
+	if (fscache_cookie_valid(cookie))
+		return __fscache_pin_cookie(cookie);
+	else
+		return -ENOBUFS;
+}
+
+/**
+ * fscache_pin_cookie - Unpin a data-storage cache object in its cache
+ * @cookie: The cookie representing the cache object
+ *
+ * Permit data-storage cache objects to be unpinned from the cache.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_unpin_cookie(struct fscache_cookie *cookie)
+{
+	if (fscache_cookie_valid(cookie))
+		__fscache_unpin_cookie(cookie);
+}
+
+/**
+ * fscache_attr_changed - Notify cache that an object's attributes changed
+ * @cookie: The cookie representing the cache object
+ *
+ * Send a notification to the cache indicating that an object's attributes have
+ * changed.  This includes the data size.  These attributes will be obtained
+ * through the get_attr() cookie definition op.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_attr_changed(struct fscache_cookie *cookie)
+{
+	if (fscache_cookie_valid(cookie))
+		return __fscache_attr_changed(cookie);
+	else
+		return -ENOBUFS;
+}
+
+/**
+ * fscache_reserve_space - Reserve data space for a cached object
+ * @cookie: The cookie representing the cache object
+ * @i_size: The amount of space to be reserved
+ *
+ * Reserve an amount of space in the cache for the cache object attached to a
+ * cookie so that a write to that object within the space can always be
+ * honoured.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size)
+{
+	if (fscache_cookie_valid(cookie))
+		return __fscache_reserve_space(cookie, size);
+	else
+		return -ENOBUFS;
+}
+
+/**
+ * fscache_read_or_alloc_page - Read a page from the cache or allocate a block
+ * in which to store it
+ * @cookie: The cookie representing the cache object
+ * @page: The netfs page to fill if possible
+ * @end_io_func: The callback to invoke when and if the page is filled
+ * @context: An arbitrary piece of data to pass on to end_io_func()
+ * @gfp: The conditions under which memory allocation should be made
+ *
+ * Read a page from the cache, or if that's not possible make a potential
+ * one-block reservation in the cache into which the page may be stored once
+ * fetched from the server.
+ *
+ * If the page is not backed by the cache object, or if it there's some reason
+ * it can't be, -ENOBUFS will be returned and nothing more will be done for
+ * that page.
+ *
+ * Else, if that page is backed by the cache, a read will be initiated directly
+ * to the netfs's page and 0 will be returned by this function.  The
+ * end_io_func() callback will be invoked when the operation terminates on a
+ * completion or failure.  Note that the callback may be invoked before the
+ * return.
+ *
+ * Else, if the page is unbacked, -ENODATA is returned and a block may have
+ * been allocated in the cache.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_read_or_alloc_page(struct fscache_cookie *cookie,
+			       struct page *page,
+			       fscache_rw_complete_t end_io_func,
+			       void *context,
+			       gfp_t gfp)
+{
+	if (fscache_cookie_valid(cookie))
+		return __fscache_read_or_alloc_page(cookie, page, end_io_func,
+						    context, gfp);
+	else
+		return -ENOBUFS;
+}
+
+/**
+ * fscache_read_or_alloc_pages - Read pages from the cache and/or allocate
+ * blocks in which to store them
+ * @cookie: The cookie representing the cache object
+ * @mapping: The netfs inode mapping to which the pages will be attached
+ * @pages: A list of potential netfs pages to be filled
+ * @end_io_func: The callback to invoke when and if each page is filled
+ * @context: An arbitrary piece of data to pass on to end_io_func()
+ * @gfp: The conditions under which memory allocation should be made
+ *
+ * Read a set of pages from the cache, or if that's not possible, attempt to
+ * make a potential one-block reservation for each page in the cache into which
+ * that page may be stored once fetched from the server.
+ *
+ * If some pages are not backed by the cache object, or if it there's some
+ * reason they can't be, -ENOBUFS will be returned and nothing more will be
+ * done for that pages.
+ *
+ * Else, if some of the pages are backed by the cache, a read will be initiated
+ * directly to the netfs's page and 0 will be returned by this function.  The
+ * end_io_func() callback will be invoked when the operation terminates on a
+ * completion or failure.  Note that the callback may be invoked before the
+ * return.
+ *
+ * Else, if a page is unbacked, -ENODATA is returned and a block may have
+ * been allocated in the cache.
+ *
+ * Because the function may want to return all of -ENOBUFS, -ENODATA and 0 in
+ * regard to different pages, the return values are prioritised in that order.
+ * Any pages submitted for reading are removed from the pages list.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
+				struct address_space *mapping,
+				struct list_head *pages,
+				unsigned *nr_pages,
+				fscache_rw_complete_t end_io_func,
+				void *context,
+				gfp_t gfp)
+{
+	if (fscache_cookie_valid(cookie))
+		return __fscache_read_or_alloc_pages(cookie, mapping, pages,
+						     nr_pages, end_io_func,
+						     context, gfp);
+	else
+		return -ENOBUFS;
+}
+
+/**
+ * fscache_alloc_page - Allocate a block in which to store a page
+ * @cookie: The cookie representing the cache object
+ * @page: The netfs page to allocate a page for
+ * @gfp: The conditions under which memory allocation should be made
+ *
+ * Request Allocation a block in the cache in which to store a netfs page
+ * without retrieving any contents from the cache.
+ *
+ * If the page is not backed by a file then -ENOBUFS will be returned and
+ * nothing more will be done, and no reservation will be made.
+ *
+ * Else, a block will be allocated if one wasn't already, and 0 will be
+ * returned
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_alloc_page(struct fscache_cookie *cookie,
+		       struct page *page,
+		       gfp_t gfp)
+{
+	if (fscache_cookie_valid(cookie))
+		return __fscache_alloc_page(cookie, page, gfp);
+	else
+		return -ENOBUFS;
+}
+
+/**
+ * fscache_write_page - Request storage of a page in the cache
+ * @cookie: The cookie representing the cache object
+ * @page: The netfs page to store
+ * @gfp: The conditions under which memory allocation should be made
+ *
+ * Request the contents of the netfs page be written into the cache.  This
+ * request may be ignored if no cache block is currently allocated, in which
+ * case it will return -ENOBUFS.
+ *
+ * If a cache block was already allocated, a write will be initiated and 0 will
+ * be returned.  The PG_fscache_write page bit is set immediately and will then
+ * be cleared at the completion of the write to indicate the success or failure
+ * of the operation.  Note that the completion may happen before the return.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_write_page(struct fscache_cookie *cookie,
+		       struct page *page,
+		       gfp_t gfp)
+{
+	if (fscache_cookie_valid(cookie))
+		return __fscache_write_page(cookie, page, gfp);
+	else
+		return -ENOBUFS;
+}
+
+/**
+ * fscache_uncache_page - Indicate that caching is no longer required on a page
+ * @cookie: The cookie representing the cache object
+ * @page: The netfs page that was being cached.
+ *
+ * Tell the cache that we no longer want a page to be cached and that it should
+ * remove any knowledge of the netfs page it may have.
+ *
+ * Note that this cannot cancel any outstanding I/O operations between this
+ * page and the cache.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_uncache_page(struct fscache_cookie *cookie,
+			  struct page *page)
+{
+	if (fscache_cookie_valid(cookie))
+		__fscache_uncache_page(cookie, page);
+}
+
+#endif /* _LINUX_FSCACHE_H */


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 13/28] CacheFiles: Add missing copy_page export for ia64 [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (11 preceding siblings ...)
  2007-12-05 19:39 ` [PATCH 12/28] FS-Cache: Generic filesystem caching facility " David Howells
@ 2007-12-05 19:39 ` David Howells
  2007-12-05 19:39 ` [PATCH 14/28] CacheFiles: Be consistent about the use of mapping vs file->f_mapping in Ext3 " David Howells
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:39 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

This one-line patch fixes the missing export of copy_page introduced
by the cachefile patches.  This patch is not yet upstream, but is required
for cachefile on ia64.  It will be pushed upstream when cachefile goes
upstream.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/ia64/kernel/ia64_ksyms.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/kernel/ia64_ksyms.c b/arch/ia64/kernel/ia64_ksyms.c
index bd17190..20c3546 100644
--- a/arch/ia64/kernel/ia64_ksyms.c
+++ b/arch/ia64/kernel/ia64_ksyms.c
@@ -43,6 +43,7 @@ EXPORT_SYMBOL(__do_clear_user);
 EXPORT_SYMBOL(__strlen_user);
 EXPORT_SYMBOL(__strncpy_from_user);
 EXPORT_SYMBOL(__strnlen_user);
+EXPORT_SYMBOL(copy_page);
 
 /* from arch/ia64/lib */
 extern void __divsi3(void);


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 14/28] CacheFiles: Be consistent about the use of mapping vs file->f_mapping in Ext3 [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (12 preceding siblings ...)
  2007-12-05 19:39 ` [PATCH 13/28] CacheFiles: Add missing copy_page export for ia64 " David Howells
@ 2007-12-05 19:39 ` David Howells
  2007-12-05 19:39 ` [PATCH 15/28] CacheFiles: Add a hook to write a single page of data to an inode " David Howells
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:39 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Change all the usages of file->f_mapping in ext3_*write_end() functions to use
the mapping argument directly.  This has two consequences:

 (*) Consistency.  Without this patch sometimes one is used and sometimes the
     other is.

 (*) A NULL file pointer can be passed.  This feature is then made use of by
     the generic hook in the next patch, which is used by CacheFiles to write
     pages to a file without setting up a file struct.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/ext3/inode.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 9b162cd..bc918d3 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1227,7 +1227,7 @@ static int ext3_generic_write_end(struct file *file,
 				loff_t pos, unsigned len, unsigned copied,
 				struct page *page, void *fsdata)
 {
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
 
@@ -1252,7 +1252,7 @@ static int ext3_ordered_write_end(struct file *file,
 				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;
 	unsigned from, to;
 	int ret = 0, ret2;
 
@@ -1293,7 +1293,7 @@ static int ext3_writeback_write_end(struct file *file,
 				struct page *page, void *fsdata)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = mapping->host;
 	int ret = 0, ret2;
 	loff_t new_i_size;
 


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 15/28] CacheFiles: Add a hook to write a single page of data to an inode [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (13 preceding siblings ...)
  2007-12-05 19:39 ` [PATCH 14/28] CacheFiles: Be consistent about the use of mapping vs file->f_mapping in Ext3 " David Howells
@ 2007-12-05 19:39 ` David Howells
  2007-12-05 19:39 ` [PATCH 16/28] CacheFiles: Permit the page lock state to be monitored " David Howells
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:39 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Add an address space operation to write one single page of data to an inode at
a page-aligned location (thus permitting the implementation to be highly
optimised).  The data source is a single page.

This is used by CacheFiles to store the contents of netfs pages into their
backing file pages.

Supply a generic implementation for this that uses the write_begin() and
write_end() address_space operations to bind a copy directly into the page
cache.

Hook the Ext2 and Ext3 operations to the generic implementation.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/ext2/inode.c    |    2 ++
 fs/ext3/inode.c    |    3 +++
 include/linux/fs.h |    7 ++++++
 mm/filemap.c       |   61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index b1ab32a..cfa56e6 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -796,6 +796,7 @@ const struct address_space_operations ext2_aops = {
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
 	.migratepage		= buffer_migrate_page,
+	.write_one_page		= generic_file_buffered_write_one_page,
 };
 
 const struct address_space_operations ext2_aops_xip = {
@@ -814,6 +815,7 @@ const struct address_space_operations ext2_nobh_aops = {
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
 	.migratepage		= buffer_migrate_page,
+	.write_one_page		= generic_file_buffered_write_one_page,
 };
 
 /*
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index bc918d3..435c684 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1780,6 +1780,7 @@ static const struct address_space_operations ext3_ordered_aops = {
 	.releasepage	= ext3_releasepage,
 	.direct_IO	= ext3_direct_IO,
 	.migratepage	= buffer_migrate_page,
+	.write_one_page	= generic_file_buffered_write_one_page,
 };
 
 static const struct address_space_operations ext3_writeback_aops = {
@@ -1794,6 +1795,7 @@ static const struct address_space_operations ext3_writeback_aops = {
 	.releasepage	= ext3_releasepage,
 	.direct_IO	= ext3_direct_IO,
 	.migratepage	= buffer_migrate_page,
+	.write_one_page	= generic_file_buffered_write_one_page,
 };
 
 static const struct address_space_operations ext3_journalled_aops = {
@@ -1807,6 +1809,7 @@ static const struct address_space_operations ext3_journalled_aops = {
 	.bmap		= ext3_bmap,
 	.invalidatepage	= ext3_invalidatepage,
 	.releasepage	= ext3_releasepage,
+	.write_one_page	= generic_file_buffered_write_one_page,
 };
 
 void ext3_set_aops(struct inode *inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 850d3fc..a3c3369 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -479,6 +479,11 @@ struct address_space_operations {
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *);
 	int (*launder_page) (struct page *);
+	/* write the contents of the source page over the page at the specified
+	 * index in the target address space (the source page does not need to
+	 * be related to the target address space) */
+	int (*write_one_page)(struct address_space *, pgoff_t, struct page *);
+
 };
 
 /*
@@ -1801,6 +1806,8 @@ extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
 		unsigned long *, loff_t, loff_t *, size_t, size_t);
 extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *,
 		unsigned long, loff_t, loff_t *, size_t, ssize_t);
+extern int generic_file_buffered_write_one_page(struct address_space *,
+						pgoff_t, struct page *);
 extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos);
 extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
 extern void do_generic_mapping_read(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a8e5b8..bea1ba6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2321,6 +2321,67 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
 
+/**
+ * generic_file_buffered_write_one_page - Write a single page of data to an
+ *	inode
+ * @mapping - The address space of the target inode
+ * @index - The target page in the target inode to fill
+ * @source - The data to write into the target page
+ *
+ * Write the data from the source page to the page in the nominated address
+ * space at the @index specified.  Note that the file will not be extended if
+ * the page crosses the EOF marker, in which case only the first part of the
+ * page will be written.
+ *
+ * The @source page does not need to have any association with the file or the
+ * target page offset.
+ */
+int generic_file_buffered_write_one_page(struct address_space *mapping,
+					 pgoff_t index,
+					 struct page *source)
+{
+	const struct address_space_operations *a_ops = mapping->a_ops;
+	struct page *page;
+	unsigned len;
+	loff_t isize, pos;
+	void *fsdata;
+	int ret;
+
+	pos = index;
+	pos <<= PAGE_CACHE_SHIFT;
+
+	len = PAGE_CACHE_SIZE;
+	isize = i_size_read(mapping->host);
+	if ((isize >> PAGE_CACHE_SHIFT) == index)
+		len = isize & (PAGE_CACHE_SIZE - 1);
+
+	ret = pagecache_write_begin(NULL, mapping, pos, len,
+				    AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+	if (ret < 0)
+		goto sync;
+
+	copy_highpage(page, source);
+
+	ret = pagecache_write_end(NULL, mapping, pos, len, len, page, fsdata);
+	if (ret < 0)
+		goto sync;
+
+	balance_dirty_pages_ratelimited(mapping);
+	cond_resched();
+
+sync:
+	/* the caller must handle O_SYNC themselves, but we handle S_SYNC and
+	 * MS_SYNCHRONOUS here */
+	if (unlikely(IS_SYNC(mapping->host)) && !a_ops->writepage)
+		ret = generic_osync_inode(mapping->host, mapping,
+					     OSYNC_METADATA | OSYNC_DATA);
+
+	/* the caller must handle O_DIRECT for themselves */
+
+	return ret;
+}
+EXPORT_SYMBOL(generic_file_buffered_write_one_page);
+
 static ssize_t
 __generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
 				unsigned long nr_segs, loff_t *ppos)


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 16/28] CacheFiles: Permit the page lock state to be monitored [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (14 preceding siblings ...)
  2007-12-05 19:39 ` [PATCH 15/28] CacheFiles: Add a hook to write a single page of data to an inode " David Howells
@ 2007-12-05 19:39 ` David Howells
  2007-12-05 19:39 ` [PATCH 17/28] CacheFiles: Export things for CacheFiles " David Howells
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:39 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Add a function to install a monitor on the page lock waitqueue for a particular
page, thus allowing the page being unlocked to be detected.

This is used by CacheFiles to detect read completion on a page in the backing
filesystem so that it can then copy the data to the waiting netfs page.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/pagemap.h |    5 +++++
 mm/filemap.c            |   18 ++++++++++++++++++
 2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 6a1b317..21c35e2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -223,6 +223,11 @@ static inline void wait_on_page_fscache_write(struct page *page)
 extern void end_page_fscache_write(struct page *page);
 
 /*
+ * Add an arbitrary waiter to a page's wait queue
+ */
+extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter);
+
+/*
  * Fault a userspace page into pagetables.  Return non-zero on a fault.
  *
  * This assumes that two userspace pages are always sufficient.  That's
diff --git a/mm/filemap.c b/mm/filemap.c
index bea1ba6..6872d1b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -521,6 +521,24 @@ void fastcall wait_on_page_bit(struct page *page, int bit_nr)
 EXPORT_SYMBOL(wait_on_page_bit);
 
 /**
+ * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
+ * @page - Page defining the wait queue of interest
+ * @waiter - Waiter to add to the queue
+ *
+ * Add an arbitrary @waiter to the wait queue for the nominated @page.
+ */
+void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
+{
+	wait_queue_head_t *q = page_waitqueue(page);
+	unsigned long flags;
+
+	spin_lock_irqsave(&q->lock, flags);
+	__add_wait_queue(q, waiter);
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL_GPL(add_page_wait_queue);
+
+/**
  * unlock_page - unlock a locked page
  * @page: the page
  *


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 17/28] CacheFiles: Export things for CacheFiles [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (15 preceding siblings ...)
  2007-12-05 19:39 ` [PATCH 16/28] CacheFiles: Permit the page lock state to be monitored " David Howells
@ 2007-12-05 19:39 ` David Howells
  2007-12-05 19:39 ` [PATCH 18/28] CacheFiles: A cache that backs onto a mounted filesystem " David Howells
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:39 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Export a number of functions for CacheFiles's use.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/super.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index ceaf2e3..cd199ae 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -266,6 +266,7 @@ int fsync_super(struct super_block *sb)
 	__fsync_super(sb);
 	return sync_blockdev(sb->s_bdev);
 }
+EXPORT_SYMBOL_GPL(fsync_super);
 
 /**
  *	generic_shutdown_super	-	common helper for ->kill_sb()


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 18/28] CacheFiles: A cache that backs onto a mounted filesystem [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (16 preceding siblings ...)
  2007-12-05 19:39 ` [PATCH 17/28] CacheFiles: Export things for CacheFiles " David Howells
@ 2007-12-05 19:39 ` David Howells
  2007-12-05 19:39 ` [PATCH 19/28] NFS: Use local caching " David Howells
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:39 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Add an FS-Cache cache-backend that permits a mounted filesystem to be used as a
backing store for the cache.


CacheFiles uses a userspace daemon to do some of the cache management - such as
reaping stale nodes and culling.  This is called cachefilesd and lives in
/sbin.  The source for the daemon can be downloaded from:

	http://people.redhat.com/~dhowells/cachefs/cachefilesd.c

And an example configuration from:

	http://people.redhat.com/~dhowells/cachefs/cachefilesd.conf

The filesystem and data integrity of the cache are only as good as those of the
filesystem providing the backing services.  Note that CacheFiles does not
attempt to journal anything since the journalling interfaces of the various
filesystems are very specific in nature.

CacheFiles creates a proc-file - "/proc/fs/cachefiles" - that is used for
communication with the daemon.  Only one thing may have this open at once, and
whilst it is open, a cache is at least partially in existence.  The daemon
opens this and sends commands down it to control the cache.

CacheFiles is currently limited to a single cache.

CacheFiles attempts to maintain at least a certain percentage of free space on
the filesystem, shrinking the cache by culling the objects it contains to make
space if necessary - see the "Cache Culling" section.  This means it can be
placed on the same medium as a live set of data, and will expand to make use of
spare space and automatically contract when the set of data requires more
space.


============
REQUIREMENTS
============

The use of CacheFiles and its daemon requires the following features to be
available in the system and in the cache filesystem:

	- dnotify.

	- extended attributes (xattrs).

	- openat() and friends.

	- bmap() support on files in the filesystem (FIBMAP ioctl).

	- The use of bmap() to detect a partial page at the end of the file.

It is strongly recommended that the "dir_index" option is enabled on Ext3
filesystems being used as a cache.


=============
CONFIGURATION
=============

The cache is configured by a script in /etc/cachefilesd.conf.  These commands
set up cache ready for use.  The following script commands are available:

 (*) brun <N>%
 (*) bcull <N>%
 (*) bstop <N>%

	Configure the culling limits.  Optional.  See the section on culling
	The defaults are 7%, 5% and 1% respectively.

 (*) dir <path>

	Specify the directory containing the root of the cache.  Mandatory.

 (*) tag <name>

	Specify a tag to FS-Cache to use in distinguishing multiple caches.
	Optional.  The default is "CacheFiles".

 (*) debug <mask>

	Specify a numeric bitmask to control debugging in the kernel module.
	Optional.  The default is zero (all off).


==================
STARTING THE CACHE
==================

The cache is started by running the daemon.  The daemon opens the cache proc
file, configures the cache and tells it to begin caching.  At that point the
cache binds to fscache and the cache becomes live.

The daemon is run as follows:

	/sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]

The flags are:

 (*) -d

	Increase the debugging level.  This can be specified multiple times and
	is cumulative with itself.

 (*) -s

	Send messages to stderr instead of syslog.

 (*) -n

	Don't daemonise and go into background.

 (*) -f <configfile>

	Use an alternative configuration file rather than the default one.


===============
THINGS TO AVOID
===============

Do not mount other things within the cache as this will cause problems.  The
kernel module contains its own very cut-down path walking facility that ignores
mountpoints, but the daemon can't avoid them.

Do not create, rename or unlink files and directories in the cache whilst the
cache is active, as this may cause the state to become uncertain.

Renaming files in the cache might make objects appear to be other objects (the
filename is part of the lookup key).

Do not change or remove the extended attributes attached to cache files by the
cache as this will cause the cache state management to get confused.

Do not create files or directories in the cache, lest the cache get confused or
serve incorrect data.

Do not chmod files in the cache.  The module creates things with minimal
permissions to prevent random users being able to access them directly.


=============
CACHE CULLING
=============

The cache may need culling occasionally to make space.  This involves
discarding objects from the cache that have been used less recently than
anything else.  Culling is based on the access time of data objects.  Empty
directories are culled if not in use.

Cache culling is done on the basis of the percentage of blocks available in the
underlying filesystem.  There are three "limits":

 (*) brun

     If the amount of available space in the cache rises above this limit, then
     culling is turned off.

 (*) bcull

     If the amount of available space in the cache falls below this limit, then
     culling is started.

 (*) bstop

     If the amount of available space in the cache falls below this limit, then
     no further allocation of disk space is permitted until culling has raised
     the amount above this limit again.

These must be configured thusly:

	0 <= bstop < bcull < brun < 100

Note that these are percentages of available space, and do _not_ appear as 100
minus the percentage displayed by the "df" program.

The userspace daemon scans the cache to build up a table of cullable objects.
These are then culled in least recently used order.  A new scan of the cache is
started as soon as space is made in the table.  Objects will be skipped if
their atimes have changed or if the kernel module says it is still using them.


===============
CACHE STRUCTURE
===============

The CacheFiles module will create two directories in the directory it was
given:

 (*) cache/

 (*) graveyard/

The active cache objects all reside in the first directory.  The CacheFiles
kernel module moves any retired or culled objects that it can't simply unlink
to the graveyard from which the daemon will actually delete them.

The daemon uses dnotify to monitor the graveyard directory, and will delete
anything that appears therein.


The module represents index objects as directories with the filename "I..." or
"J...".  Note that the "cache/" directory is itself a special index.

Data objects are represented as files if they have no children, or directories
if they do.  Their filenames all begin "D..." or "E...".  If represented as a
directory, data objects will have a file in the directory called "data" that
actually holds the data.

Special objects are similar to data objects, except their filenames begin
"S..." or "T...".


If an object has children, then it will be represented as a directory.
Immediately in the representative directory are a collection of directories
named for hash values of the child object keys with an '@' prepended.  Into
this directory, if possible, will be placed the representations of the child
objects:

	INDEX     INDEX      INDEX                             DATA FILES
	========= ========== ================================= ================
	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry


If the key is so long that it exceeds NAME_MAX with the decorations added on to
it, then it will be cut into pieces, the first few of which will be used to
make a nest of directories, and the last one of which will be the objects
inside the last directory.  The names of the intermediate directories will have
'+' prepended:

	J1223/@23/+xy...z/+kl...m/Epqr


Note that keys are raw data, and not only may they exceed NAME_MAX in size,
they may also contain things like '/' and NUL characters, and so they may not
be suitable for turning directly into a filename.

To handle this, CacheFiles will use a suitably printable filename directly and
"base-64" encode ones that aren't directly suitable.  The two versions of
object filenames indicate the encoding:

	OBJECT TYPE	PRINTABLE	ENCODED
	===============	===============	===============
	Index		"I..."		"J..."
	Data		"D..."		"E..."
	Special		"S..."		"T..."

Intermediate directories are always "@" or "+" as appropriate.


Each object in the cache has an extended attribute label that holds the object
type ID (required to distinguish special objects) and the auxiliary data from
the netfs.  The latter is used to detect stale objects in the cache and update
or retire them.


Note that CacheFiles will erase from the cache any file it doesn't recognise or
any file of an incorrect type (such as a FIFO file or a device file).


This documentation is added by the patch to:

	Documentation/filesystems/caching/cachefiles.txt

Signed-Off-By: David Howells <dhowells@redhat.com>
---

 Documentation/filesystems/caching/cachefiles.txt |  395 ++++++++++
 fs/Kconfig                                       |    1 
 fs/Makefile                                      |    1 
 fs/cachefiles/Kconfig                            |   33 +
 fs/cachefiles/Makefile                           |   18 
 fs/cachefiles/cf-bind.c                          |  286 +++++++
 fs/cachefiles/cf-daemon.c                        |  724 +++++++++++++++++++
 fs/cachefiles/cf-interface.c                     |  445 ++++++++++++
 fs/cachefiles/cf-internal.h                      |  372 ++++++++++
 fs/cachefiles/cf-key.c                           |  159 ++++
 fs/cachefiles/cf-main.c                          |  108 +++
 fs/cachefiles/cf-namei.c                         |  739 +++++++++++++++++++
 fs/cachefiles/cf-proc.c                          |  166 ++++
 fs/cachefiles/cf-rdwr.c                          |  851 ++++++++++++++++++++++
 fs/cachefiles/cf-security.c                      |   80 ++
 fs/cachefiles/cf-xattr.c                         |  292 ++++++++
 security/security.c                              |    2 
 17 files changed, 4672 insertions(+), 0 deletions(-)

diff --git a/Documentation/filesystems/caching/cachefiles.txt b/Documentation/filesystems/caching/cachefiles.txt
new file mode 100644
index 0000000..b502cff
--- /dev/null
+++ b/Documentation/filesystems/caching/cachefiles.txt
@@ -0,0 +1,395 @@
+	       ===============================================
+	       CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
+	       ===============================================
+
+Contents:
+
+ (*) Overview.
+
+ (*) Requirements.
+
+ (*) Configuration.
+
+ (*) Starting the cache.
+
+ (*) Things to avoid.
+
+ (*) Cache culling.
+
+ (*) Cache structure.
+
+ (*) Security model and SELinux.
+
+========
+OVERVIEW
+========
+
+CacheFiles is a caching backend that's meant to use as a cache a directory on
+an already mounted filesystem of a local type (such as Ext3).
+
+CacheFiles uses a userspace daemon to do some of the cache management - such as
+reaping stale nodes and culling.  This is called cachefilesd and lives in
+/sbin.
+
+The filesystem and data integrity of the cache are only as good as those of the
+filesystem providing the backing services.  Note that CacheFiles does not
+attempt to journal anything since the journalling interfaces of the various
+filesystems are very specific in nature.
+
+CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
+to communication with the daemon.  Only one thing may have this open at once,
+and whilst it is open, a cache is at least partially in existence.  The daemon
+opens this and sends commands down it to control the cache.
+
+CacheFiles is currently limited to a single cache.
+
+CacheFiles attempts to maintain at least a certain percentage of free space on
+the filesystem, shrinking the cache by culling the objects it contains to make
+space if necessary - see the "Cache Culling" section.  This means it can be
+placed on the same medium as a live set of data, and will expand to make use of
+spare space and automatically contract when the set of data requires more
+space.
+
+
+============
+REQUIREMENTS
+============
+
+The use of CacheFiles and its daemon requires the following features to be
+available in the system and in the cache filesystem:
+
+	- dnotify.
+
+	- extended attributes (xattrs).
+
+	- openat() and friends.
+
+	- bmap() support on files in the filesystem (FIBMAP ioctl).
+
+	- The use of bmap() to detect a partial page at the end of the file.
+
+It is strongly recommended that the "dir_index" option is enabled on Ext3
+filesystems being used as a cache.
+
+
+=============
+CONFIGURATION
+=============
+
+The cache is configured by a script in /etc/cachefilesd.conf.  These commands
+set up cache ready for use.  The following script commands are available:
+
+ (*) brun <N>%
+ (*) bcull <N>%
+ (*) bstop <N>%
+ (*) frun <N>%
+ (*) fcull <N>%
+ (*) fstop <N>%
+
+	Configure the culling limits.  Optional.  See the section on culling
+	The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.
+
+	The commands beginning with a 'b' are file space (block) limits, those
+	beginning with an 'f' are file count limits.
+
+ (*) dir <path>
+
+	Specify the directory containing the root of the cache.  Mandatory.
+
+ (*) tag <name>
+
+	Specify a tag to FS-Cache to use in distinguishing multiple caches.
+	Optional.  The default is "CacheFiles".
+
+ (*) debug <mask>
+
+	Specify a numeric bitmask to control debugging in the kernel module.
+	Optional.  The default is zero (all off).  The following values can be
+	OR'd into the mask to collect various information:
+
+		1	Turn on trace of function entry (_enter() macros)
+		2	Turn on trace of function exit (_leave() macros)
+		4	Turn on trace of internal debug points (_debug())
+
+	This mask can also be set through sysfs, eg:
+
+		echo 5 >/sys/modules/cachefiles/parameters/debug
+
+
+==================
+STARTING THE CACHE
+==================
+
+The cache is started by running the daemon.  The daemon opens the cache device,
+configures the cache and tells it to begin caching.  At that point the cache
+binds to fscache and the cache becomes live.
+
+The daemon is run as follows:
+
+	/sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]
+
+The flags are:
+
+ (*) -d
+
+	Increase the debugging level.  This can be specified multiple times and
+	is cumulative with itself.
+
+ (*) -s
+
+	Send messages to stderr instead of syslog.
+
+ (*) -n
+
+	Don't daemonise and go into background.
+
+ (*) -f <configfile>
+
+	Use an alternative configuration file rather than the default one.
+
+
+===============
+THINGS TO AVOID
+===============
+
+Do not mount other things within the cache as this will cause problems.  The
+kernel module contains its own very cut-down path walking facility that ignores
+mountpoints, but the daemon can't avoid them.
+
+Do not create, rename or unlink files and directories in the cache whilst the
+cache is active, as this may cause the state to become uncertain.
+
+Renaming files in the cache might make objects appear to be other objects (the
+filename is part of the lookup key).
+
+Do not change or remove the extended attributes attached to cache files by the
+cache as this will cause the cache state management to get confused.
+
+Do not create files or directories in the cache, lest the cache get confused or
+serve incorrect data.
+
+Do not chmod files in the cache.  The module creates things with minimal
+permissions to prevent random users being able to access them directly.
+
+
+=============
+CACHE CULLING
+=============
+
+The cache may need culling occasionally to make space.  This involves
+discarding objects from the cache that have been used less recently than
+anything else.  Culling is based on the access time of data objects.  Empty
+directories are culled if not in use.
+
+Cache culling is done on the basis of the percentage of blocks and the
+percentage of files available in the underlying filesystem.  There are six
+"limits":
+
+ (*) brun
+ (*) frun
+
+     If the amount of free space and the number of available files in the cache
+     rises above both these limits, then culling is turned off.
+
+ (*) bcull
+ (*) fcull
+
+     If the amount of available space or the number of available files in the
+     cache falls below either of these limits, then culling is started.
+
+ (*) bstop
+ (*) fstop
+
+     If the amount of available space or the number of available files in the
+     cache falls below either of these limits, then no further allocation of
+     disk space or files is permitted until culling has raised things above
+     these limits again.
+
+These must be configured thusly:
+
+	0 <= bstop < bcull < brun < 100
+	0 <= fstop < fcull < frun < 100
+
+Note that these are percentages of available space and available files, and do
+_not_ appear as 100 minus the percentage displayed by the "df" program.
+
+The userspace daemon scans the cache to build up a table of cullable objects.
+These are then culled in least recently used order.  A new scan of the cache is
+started as soon as space is made in the table.  Objects will be skipped if
+their atimes have changed or if the kernel module says it is still using them.
+
+
+===============
+CACHE STRUCTURE
+===============
+
+The CacheFiles module will create two directories in the directory it was
+given:
+
+ (*) cache/
+
+ (*) graveyard/
+
+The active cache objects all reside in the first directory.  The CacheFiles
+kernel module moves any retired or culled objects that it can't simply unlink
+to the graveyard from which the daemon will actually delete them.
+
+The daemon uses dnotify to monitor the graveyard directory, and will delete
+anything that appears therein.
+
+
+The module represents index objects as directories with the filename "I..." or
+"J...".  Note that the "cache/" directory is itself a special index.
+
+Data objects are represented as files if they have no children, or directories
+if they do.  Their filenames all begin "D..." or "E...".  If represented as a
+directory, data objects will have a file in the directory called "data" that
+actually holds the data.
+
+Special objects are similar to data objects, except their filenames begin
+"S..." or "T...".
+
+
+If an object has children, then it will be represented as a directory.
+Immediately in the representative directory are a collection of directories
+named for hash values of the child object keys with an '@' prepended.  Into
+this directory, if possible, will be placed the representations of the child
+objects:
+
+	INDEX     INDEX      INDEX                             DATA FILES
+	========= ========== ================================= ================
+	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
+	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
+	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
+	cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry
+
+
+If the key is so long that it exceeds NAME_MAX with the decorations added on to
+it, then it will be cut into pieces, the first few of which will be used to
+make a nest of directories, and the last one of which will be the objects
+inside the last directory.  The names of the intermediate directories will have
+'+' prepended:
+
+	J1223/@23/+xy...z/+kl...m/Epqr
+
+
+Note that keys are raw data, and not only may they exceed NAME_MAX in size,
+they may also contain things like '/' and NUL characters, and so they may not
+be suitable for turning directly into a filename.
+
+To handle this, CacheFiles will use a suitably printable filename directly and
+"base-64" encode ones that aren't directly suitable.  The two versions of
+object filenames indicate the encoding:
+
+	OBJECT TYPE	PRINTABLE	ENCODED
+	===============	===============	===============
+	Index		"I..."		"J..."
+	Data		"D..."		"E..."
+	Special		"S..."		"T..."
+
+Intermediate directories are always "@" or "+" as appropriate.
+
+
+Each object in the cache has an extended attribute label that holds the object
+type ID (required to distinguish special objects) and the auxiliary data from
+the netfs.  The latter is used to detect stale objects in the cache and update
+or retire them.
+
+
+Note that CacheFiles will erase from the cache any file it doesn't recognise or
+any file of an incorrect type (such as a FIFO file or a device file).
+
+
+==========================
+SECURITY MODEL AND SELINUX
+==========================
+
+CacheFiles is implemented to deal properly with the LSM security features of
+the Linux kernel and the SELinux facility.
+
+One of the problems that CacheFiles faces is that it is generally acting on
+behalf of a process, and running in that process's context, and that includes a
+security context that is not appropriate for accessing the cache - either
+because the files in the cache are inaccessible to that process, or because if
+the process creates a file in the cache, that file may be inaccessible to other
+processes.
+
+The way CacheFiles works is to temporarily change the security context (fsuid,
+fsgid and actor security label) that the process acts as - without changing the
+security context of the process when it the target of an operation performed by
+some other process (so signalling and suchlike still work correctly).
+
+
+When the CacheFiles module is asked to bind to its cache, it:
+
+ (1) Finds the security label attached to the root cache directory and uses
+     that as the security label with which it will create files.  By default,
+     this is:
+
+	cachefiles_var_t
+
+ (2) Finds the security label of the process which issued the bind request
+     (presumed to be the cachefilesd daemon), which by default will be:
+
+	cachefilesd_t
+
+     and asks LSM to supply a security ID as which it should act given the
+     daemon's label.  By default, this will be:
+
+	cachefiles_kernel_t
+
+     SELinux transitions the daemon's security ID to the module's security ID
+     based on a rule of this form in the policy.
+
+	type_transition <daemon's-ID> kernel_t : process <module's-ID>;
+
+     For instance:
+
+	type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
+
+
+The module's security ID gives it permission to create, move and remove files
+and directories in the cache, to find and access directories and files in the
+cache, to set and access extended attributes on cache objects, and to read and
+write files in the cache.
+
+The daemon's security ID gives it only a very restricted set of permissions: it
+may scan directories, stat files and erase files and directories.  It may
+not read or write files in the cache, and so it is precluded from accessing the
+data cached therein; nor is it permitted to create new files in the cache.
+
+
+There are policy source files available in:
+
+	http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
+
+and later versions.  In that tarball, see the files:
+
+	cachefilesd.te
+	cachefilesd.fc
+	cachefilesd.if
+
+They are built and installed directly by the RPM.
+
+If a non-RPM based system is being used, then copy the above files to their own
+directory and run:
+
+	make -f /usr/share/selinux/devel/Makefile
+	semodule -i cachefilesd.pp
+
+You will need checkpolicy and selinux-policy-devel installed prior to the
+build.
+
+
+By default, the cache is located in /var/fscache, but if it is desirable that
+it should be elsewhere, than either the above policy files must be altered, or
+an auxiliary policy must be installed to label the alternate location of the
+cache.
+
+For instructions on how to add an auxiliary policy to enable the cache to be
+located elsewhere when SELinux is in enforcing mode, please see:
+
+	/usr/share/doc/cachefilesd-*/move-cache.txt
+
+When the cachefilesd rpm is installed; alternatively, the document can be found
+in the sources.
diff --git a/fs/Kconfig b/fs/Kconfig
index 715ab7c..215b0d6 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -659,6 +659,7 @@ config GENERIC_ACL
 menu "Caches"
 
 source "fs/fscache/Kconfig"
+source "fs/cachefiles/Kconfig"
 
 endmenu
 
diff --git a/fs/Makefile b/fs/Makefile
index 1767091..0010318 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -116,6 +116,7 @@ obj-$(CONFIG_AFS_FS)		+= afs/
 obj-$(CONFIG_BEFS_FS)		+= befs/
 obj-$(CONFIG_HOSTFS)		+= hostfs/
 obj-$(CONFIG_HPPFS)		+= hppfs/
+obj-$(CONFIG_CACHEFILES)	+= cachefiles/
 obj-$(CONFIG_DEBUG_FS)		+= debugfs/
 obj-$(CONFIG_OCFS2_FS)		+= ocfs2/
 obj-$(CONFIG_GFS2_FS)           += gfs2/
diff --git a/fs/cachefiles/Kconfig b/fs/cachefiles/Kconfig
new file mode 100644
index 0000000..ddbdd85
--- /dev/null
+++ b/fs/cachefiles/Kconfig
@@ -0,0 +1,33 @@
+
+config CACHEFILES
+	tristate "Filesystem caching on files"
+	depends on FSCACHE
+	help
+	  This permits use of a mounted filesystem as a cache for other
+	  filesystems - primarily networking filesystems - thus allowing fast
+	  local disk to enhance the speed of slower devices.
+
+	  See Documentation/filesystems/caching/cachefiles.txt for more
+	  information.
+
+config CACHEFILES_DEBUG
+	bool "Debug CacheFiles"
+	depends on CACHEFILES
+	help
+	  This permits debugging to be dynamically enabled in the filesystem
+	  caching on files module.  If this is set, the debugging output may be
+	  enabled by setting bits in /sys/modules/cachefiles/parameter/debug or
+	  by including a debugging specifier in /etc/cachefilesd.conf.
+
+config CACHEFILES_HISTOGRAM
+	bool "Gather latency information on CacheFiles"
+	depends on CACHEFILES && FSCACHE_PROC
+	help
+
+	  This option causes latency information to be gathered on CacheFiles
+	  operation and exported through file:
+
+		/proc/fs/fscache/cachefiles/histogram
+
+	  See Documentation/filesystems/caching/cachefiles.txt for more
+	  information.
diff --git a/fs/cachefiles/Makefile b/fs/cachefiles/Makefile
new file mode 100644
index 0000000..8a9c1bd
--- /dev/null
+++ b/fs/cachefiles/Makefile
@@ -0,0 +1,18 @@
+#
+# Makefile for caching in a mounted filesystem
+#
+
+cachefiles-y := \
+	cf-bind.o \
+	cf-daemon.o \
+	cf-interface.o \
+	cf-key.o \
+	cf-main.o \
+	cf-namei.o \
+	cf-rdwr.o \
+	cf-security.o \
+	cf-xattr.o
+
+cachefiles-$(CONFIG_CACHEFILES_HISTOGRAM) += cf-proc.o
+
+obj-$(CONFIG_CACHEFILES) := cachefiles.o
diff --git a/fs/cachefiles/cf-bind.c b/fs/cachefiles/cf-bind.c
new file mode 100644
index 0000000..bd0b95e
--- /dev/null
+++ b/fs/cachefiles/cf-bind.c
@@ -0,0 +1,286 @@
+/* Bind and unbind a cache from the filesystem backing it
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/mount.h>
+#include <linux/statfs.h>
+#include <linux/ctype.h>
+#include "cf-internal.h"
+
+static int cachefiles_daemon_add_cache(struct cachefiles_cache *caches);
+
+/*
+ * bind a directory as a cache
+ */
+int cachefiles_daemon_bind(struct cachefiles_cache *cache, char *args)
+{
+	_enter("{%u,%u,%u,%u,%u,%u},%s",
+	       cache->frun_percent,
+	       cache->fcull_percent,
+	       cache->fstop_percent,
+	       cache->brun_percent,
+	       cache->bcull_percent,
+	       cache->bstop_percent,
+	       args);
+
+	/* start by checking things over */
+	ASSERT(cache->fstop_percent >= 0 &&
+	       cache->fstop_percent < cache->fcull_percent &&
+	       cache->fcull_percent < cache->frun_percent &&
+	       cache->frun_percent  < 100);
+
+	ASSERT(cache->bstop_percent >= 0 &&
+	       cache->bstop_percent < cache->bcull_percent &&
+	       cache->bcull_percent < cache->brun_percent &&
+	       cache->brun_percent  < 100);
+
+	if (*args) {
+		kerror("'bind' command doesn't take an argument");
+		return -EINVAL;
+	}
+
+	if (!cache->rootdirname) {
+		kerror("No cache directory specified");
+		return -EINVAL;
+	}
+
+	/* don't permit already bound caches to be re-bound */
+	if (test_bit(CACHEFILES_READY, &cache->flags)) {
+		kerror("Cache already bound");
+		return -EBUSY;
+	}
+
+	/* make sure we have copies of the tag and dirname strings */
+	if (!cache->tag) {
+		/* the tag string is released by the fops->release()
+		 * function, so we don't release it on error here */
+		cache->tag = kstrdup("CacheFiles", GFP_KERNEL);
+		if (!cache->tag)
+			return -ENOMEM;
+	}
+
+	/* add the cache */
+	return cachefiles_daemon_add_cache(cache);
+}
+
+/*
+ * add a cache
+ */
+static int cachefiles_daemon_add_cache(struct cachefiles_cache *cache)
+{
+	struct cachefiles_object *fsdef;
+	struct nameidata nd;
+	struct kstatfs stats;
+	struct dentry *graveyard, *cachedir, *root;
+	struct task_security *saved_security;
+	int ret;
+
+	_enter("");
+
+	/* we want to work under the module's security ID */
+	ret = cachefiles_get_security_ID(cache);
+	if (ret < 0)
+		return ret;
+
+	cachefiles_begin_secure(cache, &saved_security);
+
+	/* allocate the root index object */
+	ret = -ENOMEM;
+
+	fsdef = kmem_cache_alloc(cachefiles_object_jar, GFP_KERNEL);
+	if (!fsdef)
+		goto error_root_object;
+
+	ASSERTCMP(fsdef->backer, ==, NULL);
+
+	atomic_set(&fsdef->usage, 1);
+	fsdef->type = FSCACHE_COOKIE_TYPE_INDEX;
+
+	_debug("- fsdef %p", fsdef);
+
+	/* look up the directory at the root of the cache */
+	memset(&nd, 0, sizeof(nd));
+
+	ret = path_lookup(cache->rootdirname, LOOKUP_DIRECTORY, &nd);
+	if (ret < 0)
+		goto error_open_root;
+
+	cache->mnt = mntget(nd.mnt);
+	root = dget(nd.dentry);
+	path_release(&nd);
+
+	/* check parameters */
+	ret = -EOPNOTSUPP;
+	if (!root->d_inode ||
+	    !root->d_inode->i_op ||
+	    !root->d_inode->i_op->lookup ||
+	    !root->d_inode->i_op->mkdir ||
+	    !root->d_inode->i_op->setxattr ||
+	    !root->d_inode->i_op->getxattr ||
+	    !root->d_sb ||
+	    !root->d_sb->s_op ||
+	    !root->d_sb->s_op->statfs ||
+	    !root->d_sb->s_op->sync_fs)
+		goto error_unsupported;
+
+	ret = -EROFS;
+	if (root->d_sb->s_flags & MS_RDONLY)
+		goto error_unsupported;
+
+	/* determine the security of the on-disk cache as this governs
+	 * security ID of files we create */
+	ret = cachefiles_determine_cache_security(cache, root);
+	if (ret < 0)
+		goto error_unsupported;
+
+	/* get the cache size and blocksize */
+	ret = vfs_statfs(root, &stats);
+	if (ret < 0)
+		goto error_unsupported;
+
+	ret = -ERANGE;
+	if (stats.f_bsize <= 0)
+		goto error_unsupported;
+
+	ret = -EOPNOTSUPP;
+	if (stats.f_bsize > PAGE_SIZE)
+		goto error_unsupported;
+
+	cache->bsize = stats.f_bsize;
+	cache->bshift = 0;
+	if (stats.f_bsize < PAGE_SIZE)
+		cache->bshift = PAGE_SHIFT - ilog2(stats.f_bsize);
+
+	_debug("blksize %u (shift %u)",
+	       cache->bsize, cache->bshift);
+
+	_debug("size %llu, avail %llu",
+	       (unsigned long long) stats.f_blocks,
+	       (unsigned long long) stats.f_bavail);
+
+	/* set up caching limits */
+	do_div(stats.f_files, 100);
+	cache->fstop = stats.f_files * cache->fstop_percent;
+	cache->fcull = stats.f_files * cache->fcull_percent;
+	cache->frun  = stats.f_files * cache->frun_percent;
+
+	_debug("limits {%llu,%llu,%llu} files",
+	       (unsigned long long) cache->frun,
+	       (unsigned long long) cache->fcull,
+	       (unsigned long long) cache->fstop);
+
+	stats.f_blocks >>= cache->bshift;
+	do_div(stats.f_blocks, 100);
+	cache->bstop = stats.f_blocks * cache->bstop_percent;
+	cache->bcull = stats.f_blocks * cache->bcull_percent;
+	cache->brun  = stats.f_blocks * cache->brun_percent;
+
+	_debug("limits {%llu,%llu,%llu} blocks",
+	       (unsigned long long) cache->brun,
+	       (unsigned long long) cache->bcull,
+	       (unsigned long long) cache->bstop);
+
+	/* get the cache directory and check its type */
+	cachedir = cachefiles_get_directory(cache, root, "cache");
+	if (IS_ERR(cachedir)) {
+		ret = PTR_ERR(cachedir);
+		goto error_unsupported;
+	}
+
+	fsdef->dentry = cachedir;
+
+	ret = cachefiles_check_object_type(fsdef);
+	if (ret < 0)
+		goto error_unsupported;
+
+	/* get the graveyard directory */
+	graveyard = cachefiles_get_directory(cache, root, "graveyard");
+	if (IS_ERR(graveyard)) {
+		ret = PTR_ERR(graveyard);
+		goto error_unsupported;
+	}
+
+	cache->graveyard = graveyard;
+
+	/* publish the cache */
+	fscache_init_cache(&cache->cache,
+			   &cachefiles_cache_ops,
+			   "%02x:%02x",
+			   MAJOR(fsdef->dentry->d_sb->s_dev),
+			   MINOR(fsdef->dentry->d_sb->s_dev));
+
+	ret = fscache_add_cache(&cache->cache, &fsdef->fscache, cache->tag);
+	if (ret < 0)
+		goto error_add_cache;
+
+	/* done */
+	set_bit(CACHEFILES_READY, &cache->flags);
+	dput(root);
+
+	printk(KERN_INFO "CacheFiles:"
+	       " File cache on %s registered\n",
+	       cache->cache.identifier);
+
+	/* check how much space the cache has */
+	cachefiles_has_space(cache, 0, 0);
+	cachefiles_end_secure(cache, saved_security);
+	return 0;
+
+error_add_cache:
+	dput(cache->graveyard);
+	cache->graveyard = NULL;
+error_unsupported:
+	mntput(cache->mnt);
+	cache->mnt = NULL;
+	dput(fsdef->dentry);
+	fsdef->dentry = NULL;
+	dput(root);
+error_open_root:
+	kmem_cache_free(cachefiles_object_jar, fsdef);
+error_root_object:
+	cachefiles_end_secure(cache, saved_security);
+	kerror("Failed to register: %d", ret);
+	return ret;
+}
+
+/*
+ * unbind a cache on fd release
+ */
+void cachefiles_daemon_unbind(struct cachefiles_cache *cache)
+{
+	_enter("");
+
+	if (test_bit(CACHEFILES_READY, &cache->flags)) {
+		printk(KERN_INFO "CacheFiles:"
+		       " File cache on %s unregistering\n",
+		       cache->cache.identifier);
+
+		fscache_withdraw_cache(&cache->cache);
+	}
+
+	if (cache->cache.fsdef)
+		cache->cache.ops->put_object(cache->cache.fsdef);
+
+	dput(cache->graveyard);
+	mntput(cache->mnt);
+
+	kfree(cache->rootdirname);
+	kfree(cache->tag);
+
+	_leave("");
+}
diff --git a/fs/cachefiles/cf-daemon.c b/fs/cachefiles/cf-daemon.c
new file mode 100644
index 0000000..33df56e
--- /dev/null
+++ b/fs/cachefiles/cf-daemon.c
@@ -0,0 +1,724 @@
+/* Daemon interface
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/poll.h>
+#include <linux/mount.h>
+#include <linux/statfs.h>
+#include <linux/ctype.h>
+#include "cf-internal.h"
+
+static int cachefiles_daemon_open(struct inode *, struct file *);
+static int cachefiles_daemon_release(struct inode *, struct file *);
+static ssize_t cachefiles_daemon_read(struct file *, char __user *, size_t,
+				      loff_t *);
+static ssize_t cachefiles_daemon_write(struct file *, const char __user *,
+				       size_t, loff_t *);
+static unsigned int cachefiles_daemon_poll(struct file *,
+					   struct poll_table_struct *);
+static int cachefiles_daemon_frun(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_fcull(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_fstop(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_brun(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_bcull(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_bstop(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_cull(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_debug(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_dir(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_tag(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_inuse(struct cachefiles_cache *, char *);
+
+static unsigned long cachefiles_open;
+
+const struct file_operations cachefiles_daemon_fops = {
+	.owner		= THIS_MODULE,
+	.open		= cachefiles_daemon_open,
+	.release	= cachefiles_daemon_release,
+	.read		= cachefiles_daemon_read,
+	.write		= cachefiles_daemon_write,
+	.poll		= cachefiles_daemon_poll,
+};
+
+struct cachefiles_daemon_cmd {
+	char name[8];
+	int (*handler)(struct cachefiles_cache *cache, char *args);
+};
+
+static const struct cachefiles_daemon_cmd cachefiles_daemon_cmds[] = {
+	{ "bind",	cachefiles_daemon_bind	},
+	{ "brun",	cachefiles_daemon_brun	},
+	{ "bcull",	cachefiles_daemon_bcull	},
+	{ "bstop",	cachefiles_daemon_bstop	},
+	{ "cull",	cachefiles_daemon_cull	},
+	{ "debug",	cachefiles_daemon_debug	},
+	{ "dir",	cachefiles_daemon_dir	},
+	{ "frun",	cachefiles_daemon_frun	},
+	{ "fcull",	cachefiles_daemon_fcull	},
+	{ "fstop",	cachefiles_daemon_fstop	},
+	{ "inuse",	cachefiles_daemon_inuse	},
+	{ "tag",	cachefiles_daemon_tag	},
+	{ "",		NULL			}
+};
+
+
+/*
+ * do various checks
+ */
+static int cachefiles_daemon_open(struct inode *inode, struct file *file)
+{
+	struct cachefiles_cache *cache;
+
+	_enter("");
+
+	/* only the superuser may do this */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* the cachefiles device may only be open once at a time */
+	if (xchg(&cachefiles_open, 1) == 1)
+		return -EBUSY;
+
+	/* allocate a cache record */
+	cache = kzalloc(sizeof(struct cachefiles_cache), GFP_KERNEL);
+	if (!cache) {
+		cachefiles_open = 0;
+		return -ENOMEM;
+	}
+
+	mutex_init(&cache->daemon_mutex);
+	cache->active_nodes = RB_ROOT;
+	rwlock_init(&cache->active_lock);
+	init_waitqueue_head(&cache->daemon_pollwq);
+
+	/* set default caching limits
+	 * - limit at 1% free space and/or free files
+	 * - cull below 5% free space and/or free files
+	 * - cease culling above 7% free space and/or free files
+	 */
+	cache->frun_percent = 7;
+	cache->fcull_percent = 5;
+	cache->fstop_percent = 1;
+	cache->brun_percent = 7;
+	cache->bcull_percent = 5;
+	cache->bstop_percent = 1;
+
+	file->private_data = cache;
+	cache->cachefilesd = file;
+	return 0;
+}
+
+/*
+ * release a cache
+ */
+static int cachefiles_daemon_release(struct inode *inode, struct file *file)
+{
+	struct cachefiles_cache *cache = file->private_data;
+
+	_enter("");
+
+	ASSERT(cache);
+
+	set_bit(CACHEFILES_DEAD, &cache->flags);
+
+	cachefiles_daemon_unbind(cache);
+
+	ASSERT(!cache->active_nodes.rb_node);
+
+	/* clean up the control file interface */
+	cache->cachefilesd = NULL;
+	file->private_data = NULL;
+	cachefiles_open = 0;
+
+	kfree(cache);
+
+	_leave("");
+	return 0;
+}
+
+/*
+ * read the cache state
+ */
+static ssize_t cachefiles_daemon_read(struct file *file, char __user *_buffer,
+				      size_t buflen, loff_t *pos)
+{
+	struct cachefiles_cache *cache = file->private_data;
+	char buffer[256];
+	int n;
+
+	_enter(",,%zu,", buflen);
+
+	if (!test_bit(CACHEFILES_READY, &cache->flags))
+		return 0;
+
+	/* check how much space the cache has */
+	cachefiles_has_space(cache, 0, 0);
+
+	/* summarise */
+	clear_bit(CACHEFILES_STATE_CHANGED, &cache->flags);
+
+	n = snprintf(buffer, sizeof(buffer),
+		     "cull=%c"
+		     " frun=%llx"
+		     " fcull=%llx"
+		     " fstop=%llx"
+		     " brun=%llx"
+		     " bcull=%llx"
+		     " bstop=%llx",
+		     test_bit(CACHEFILES_CULLING, &cache->flags) ? '1' : '0',
+		     (unsigned long long) cache->frun,
+		     (unsigned long long) cache->fcull,
+		     (unsigned long long) cache->fstop,
+		     (unsigned long long) cache->brun,
+		     (unsigned long long) cache->bcull,
+		     (unsigned long long) cache->bstop
+		     );
+
+	if (n > buflen)
+		return -EMSGSIZE;
+
+	if (copy_to_user(_buffer, buffer, n) != 0)
+		return -EFAULT;
+
+	return n;
+}
+
+/*
+ * command the cache
+ */
+static ssize_t cachefiles_daemon_write(struct file *file,
+				       const char __user *_data,
+				       size_t datalen,
+				       loff_t *pos)
+{
+	const struct cachefiles_daemon_cmd *cmd;
+	struct cachefiles_cache *cache = file->private_data;
+	ssize_t ret;
+	char *data, *args, *cp;
+
+	_enter(",,%zu,", datalen);
+
+	ASSERT(cache);
+
+	if (test_bit(CACHEFILES_DEAD, &cache->flags))
+		return -EIO;
+
+	if (datalen < 0 || datalen > PAGE_SIZE - 1)
+		return -EOPNOTSUPP;
+
+	/* drag the command string into the kernel so we can parse it */
+	data = kmalloc(datalen + 1, GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	ret = -EFAULT;
+	if (copy_from_user(data, _data, datalen) != 0)
+		goto error;
+
+	data[datalen] = '\0';
+
+	ret = -EINVAL;
+	if (memchr(data, '\0', datalen))
+		goto error;
+
+	/* strip any newline */
+	cp = memchr(data, '\n', datalen);
+	if (cp) {
+		if (cp == data)
+			goto error;
+
+		*cp = '\0';
+	}
+
+	/* parse the command */
+	ret = -EOPNOTSUPP;
+
+	for (args = data; *args; args++)
+		if (isspace(*args))
+			break;
+	if (*args) {
+		if (args == data)
+			goto error;
+		*args = '\0';
+		for (args++; isspace(*args); args++)
+			continue;
+	}
+
+	/* run the appropriate command handler */
+	for (cmd = cachefiles_daemon_cmds; cmd->name[0]; cmd++)
+		if (strcmp(cmd->name, data) == 0)
+			goto found_command;
+
+error:
+	kfree(data);
+	_leave(" = %zd", ret);
+	return ret;
+
+found_command:
+	mutex_lock(&cache->daemon_mutex);
+
+	ret = -EIO;
+	if (!test_bit(CACHEFILES_DEAD, &cache->flags))
+		ret = cmd->handler(cache, args);
+
+	mutex_unlock(&cache->daemon_mutex);
+
+	if (ret == 0)
+		ret = datalen;
+	goto error;
+}
+
+/*
+ * poll for culling state
+ * - use POLLOUT to indicate culling state
+ */
+static unsigned int cachefiles_daemon_poll(struct file *file,
+					   struct poll_table_struct *poll)
+{
+	struct cachefiles_cache *cache = file->private_data;
+	unsigned int mask;
+
+	poll_wait(file, &cache->daemon_pollwq, poll);
+	mask = 0;
+
+	if (test_bit(CACHEFILES_STATE_CHANGED, &cache->flags))
+		mask |= POLLIN;
+
+	if (test_bit(CACHEFILES_CULLING, &cache->flags))
+		mask |= POLLOUT;
+
+	return mask;
+}
+
+/*
+ * give a range error for cache space constraints
+ * - can be tail-called
+ */
+static int cachefiles_daemon_range_error(struct cachefiles_cache *cache,
+					 char *args)
+{
+	kerror("Free space limits must be in range"
+	       " 0%%<=stop<cull<run<100%%");
+
+	return -EINVAL;
+}
+
+/*
+ * set the percentage of files at which to stop culling
+ * - command: "frun <N>%"
+ */
+static int cachefiles_daemon_frun(struct cachefiles_cache *cache, char *args)
+{
+	unsigned long frun;
+
+	_enter(",%s", args);
+
+	if (!*args)
+		return -EINVAL;
+
+	frun = simple_strtoul(args, &args, 10);
+	if (args[0] != '%' || args[1] != '\0')
+		return -EINVAL;
+
+	if (frun <= cache->fcull_percent || frun >= 100)
+		return cachefiles_daemon_range_error(cache, args);
+
+	cache->frun_percent = frun;
+	return 0;
+}
+
+/*
+ * set the percentage of files at which to start culling
+ * - command: "fcull <N>%"
+ */
+static int cachefiles_daemon_fcull(struct cachefiles_cache *cache, char *args)
+{
+	unsigned long fcull;
+
+	_enter(",%s", args);
+
+	if (!*args)
+		return -EINVAL;
+
+	fcull = simple_strtoul(args, &args, 10);
+	if (args[0] != '%' || args[1] != '\0')
+		return -EINVAL;
+
+	if (fcull <= cache->fstop_percent || fcull >= cache->frun_percent)
+		return cachefiles_daemon_range_error(cache, args);
+
+	cache->fcull_percent = fcull;
+	return 0;
+}
+
+/*
+ * set the percentage of files at which to stop allocating
+ * - command: "fstop <N>%"
+ */
+static int cachefiles_daemon_fstop(struct cachefiles_cache *cache, char *args)
+{
+	unsigned long fstop;
+
+	_enter(",%s", args);
+
+	if (!*args)
+		return -EINVAL;
+
+	fstop = simple_strtoul(args, &args, 10);
+	if (args[0] != '%' || args[1] != '\0')
+		return -EINVAL;
+
+	if (fstop < 0 || fstop >= cache->fcull_percent)
+		return cachefiles_daemon_range_error(cache, args);
+
+	cache->fstop_percent = fstop;
+	return 0;
+}
+
+/*
+ * set the percentage of blocks at which to stop culling
+ * - command: "brun <N>%"
+ */
+static int cachefiles_daemon_brun(struct cachefiles_cache *cache, char *args)
+{
+	unsigned long brun;
+
+	_enter(",%s", args);
+
+	if (!*args)
+		return -EINVAL;
+
+	brun = simple_strtoul(args, &args, 10);
+	if (args[0] != '%' || args[1] != '\0')
+		return -EINVAL;
+
+	if (brun <= cache->bcull_percent || brun >= 100)
+		return cachefiles_daemon_range_error(cache, args);
+
+	cache->brun_percent = brun;
+	return 0;
+}
+
+/*
+ * set the percentage of blocks at which to start culling
+ * - command: "bcull <N>%"
+ */
+static int cachefiles_daemon_bcull(struct cachefiles_cache *cache, char *args)
+{
+	unsigned long bcull;
+
+	_enter(",%s", args);
+
+	if (!*args)
+		return -EINVAL;
+
+	bcull = simple_strtoul(args, &args, 10);
+	if (args[0] != '%' || args[1] != '\0')
+		return -EINVAL;
+
+	if (bcull <= cache->bstop_percent || bcull >= cache->brun_percent)
+		return cachefiles_daemon_range_error(cache, args);
+
+	cache->bcull_percent = bcull;
+	return 0;
+}
+
+/*
+ * set the percentage of blocks at which to stop allocating
+ * - command: "bstop <N>%"
+ */
+static int cachefiles_daemon_bstop(struct cachefiles_cache *cache, char *args)
+{
+	unsigned long bstop;
+
+	_enter(",%s", args);
+
+	if (!*args)
+		return -EINVAL;
+
+	bstop = simple_strtoul(args, &args, 10);
+	if (args[0] != '%' || args[1] != '\0')
+		return -EINVAL;
+
+	if (bstop < 0 || bstop >= cache->bcull_percent)
+		return cachefiles_daemon_range_error(cache, args);
+
+	cache->bstop_percent = bstop;
+	return 0;
+}
+
+/*
+ * set the cache directory
+ * - command: "dir <name>"
+ */
+static int cachefiles_daemon_dir(struct cachefiles_cache *cache, char *args)
+{
+	char *dir;
+
+	_enter(",%s", args);
+
+	if (!*args) {
+		kerror("Empty directory specified");
+		return -EINVAL;
+	}
+
+	if (cache->rootdirname) {
+		kerror("Second cache directory specified");
+		return -EEXIST;
+	}
+
+	dir = kstrdup(args, GFP_KERNEL);
+	if (!dir)
+		return -ENOMEM;
+
+	cache->rootdirname = dir;
+	return 0;
+}
+
+/*
+ * set the cache tag
+ * - command: "tag <name>"
+ */
+static int cachefiles_daemon_tag(struct cachefiles_cache *cache, char *args)
+{
+	char *tag;
+
+	_enter(",%s", args);
+
+	if (!*args) {
+		kerror("Empty tag specified");
+		return -EINVAL;
+	}
+
+	if (cache->tag)
+		return -EEXIST;
+
+	tag = kstrdup(args, GFP_KERNEL);
+	if (!tag)
+		return -ENOMEM;
+
+	cache->tag = tag;
+	return 0;
+}
+
+/*
+ * request a node in the cache be culled from the current working directory
+ * - command: "cull <name>"
+ */
+static int cachefiles_daemon_cull(struct cachefiles_cache *cache, char *args)
+{
+	struct fs_struct *fs;
+	struct dentry *dir;
+	struct task_security *saved_security;
+	int ret;
+
+	_enter(",%s", args);
+
+	if (strchr(args, '/'))
+		goto inval;
+
+	if (!test_bit(CACHEFILES_READY, &cache->flags)) {
+		kerror("cull applied to unready cache");
+		return -EIO;
+	}
+
+	if (test_bit(CACHEFILES_DEAD, &cache->flags)) {
+		kerror("cull applied to dead cache");
+		return -EIO;
+	}
+
+	/* extract the directory dentry from the cwd */
+	fs = current->fs;
+	read_lock(&fs->lock);
+	dir = dget(fs->pwd);
+	read_unlock(&fs->lock);
+
+	if (!S_ISDIR(dir->d_inode->i_mode))
+		goto notdir;
+
+	cachefiles_begin_secure(cache, &saved_security);
+	ret = cachefiles_cull(cache, dir, args);
+	cachefiles_end_secure(cache, saved_security);
+
+	dput(dir);
+	_leave(" = %d", ret);
+	return ret;
+
+notdir:
+	dput(dir);
+	kerror("cull command requires dirfd to be a directory");
+	return -ENOTDIR;
+
+inval:
+	kerror("cull command requires dirfd and filename");
+	return -EINVAL;
+}
+
+/*
+ * set debugging mode
+ * - command: "debug <mask>"
+ */
+static int cachefiles_daemon_debug(struct cachefiles_cache *cache, char *args)
+{
+	unsigned long mask;
+
+	_enter(",%s", args);
+
+	mask = simple_strtoul(args, &args, 0);
+	if (args[0] != '\0')
+		goto inval;
+
+	cachefiles_debug = mask;
+	_leave(" = 0");
+	return 0;
+
+inval:
+	kerror("debug command requires mask");
+	return -EINVAL;
+}
+
+/*
+ * find out whether an object in the current working directory is in use or not
+ * - command: "inuse <name>"
+ */
+static int cachefiles_daemon_inuse(struct cachefiles_cache *cache, char *args)
+{
+	struct fs_struct *fs;
+	struct dentry *dir;
+	struct task_security *saved_security;
+	int ret;
+
+	_enter(",%s", args);
+
+	if (strchr(args, '/'))
+		goto inval;
+
+	if (!test_bit(CACHEFILES_READY, &cache->flags)) {
+		kerror("inuse applied to unready cache");
+		return -EIO;
+	}
+
+	if (test_bit(CACHEFILES_DEAD, &cache->flags)) {
+		kerror("inuse applied to dead cache");
+		return -EIO;
+	}
+
+	/* extract the directory dentry from the cwd */
+	fs = current->fs;
+	read_lock(&fs->lock);
+	dir = dget(fs->pwd);
+	read_unlock(&fs->lock);
+
+	if (!S_ISDIR(dir->d_inode->i_mode))
+		goto notdir;
+
+	cachefiles_begin_secure(cache, &saved_security);
+	ret = cachefiles_check_in_use(cache, dir, args);
+	cachefiles_end_secure(cache, saved_security);
+
+	dput(dir);
+	_leave(" = %d", ret);
+	return ret;
+
+notdir:
+	dput(dir);
+	kerror("inuse command requires dirfd to be a directory");
+	return -ENOTDIR;
+
+inval:
+	kerror("inuse command requires dirfd and filename");
+	return -EINVAL;
+}
+
+/*
+ * see if we have space for a number of pages and/or a number of files in the
+ * cache
+ */
+int cachefiles_has_space(struct cachefiles_cache *cache,
+			 unsigned fnr, unsigned bnr)
+{
+	struct kstatfs stats;
+	int ret;
+
+	_enter("{%llu,%llu,%llu,%llu,%llu,%llu},%u,%u",
+	       (unsigned long long) cache->frun,
+	       (unsigned long long) cache->fcull,
+	       (unsigned long long) cache->fstop,
+	       (unsigned long long) cache->brun,
+	       (unsigned long long) cache->bcull,
+	       (unsigned long long) cache->bstop,
+	       fnr, bnr);
+
+	/* find out how many pages of blockdev are available */
+	memset(&stats, 0, sizeof(stats));
+
+	ret = vfs_statfs(cache->mnt->mnt_root, &stats);
+	if (ret < 0) {
+		if (ret == -EIO)
+			cachefiles_io_error(cache, "statfs failed");
+		_leave(" = %d", ret);
+		return ret;
+	}
+
+	stats.f_bavail >>= cache->bshift;
+
+	_debug("avail %llu,%llu",
+	       (unsigned long long) stats.f_ffree,
+	       (unsigned long long) stats.f_bavail);
+
+	/* see if there is sufficient space */
+	if (stats.f_ffree > fnr)
+		stats.f_ffree -= fnr;
+	else
+		stats.f_ffree = 0;
+
+	if (stats.f_bavail > bnr)
+		stats.f_bavail -= bnr;
+	else
+		stats.f_bavail = 0;
+
+	ret = -ENOBUFS;
+	if (stats.f_ffree < cache->fstop ||
+	    stats.f_bavail < cache->bstop)
+		goto begin_cull;
+
+	ret = 0;
+	if (stats.f_ffree < cache->fcull ||
+	    stats.f_bavail < cache->bcull)
+		goto begin_cull;
+
+	if (test_bit(CACHEFILES_CULLING, &cache->flags) &&
+	    stats.f_ffree >= cache->frun &&
+	    stats.f_bavail >= cache->brun &&
+	    test_and_clear_bit(CACHEFILES_CULLING, &cache->flags)
+	    ) {
+		_debug("cease culling");
+		cachefiles_state_changed(cache);
+	}
+
+	_leave(" = 0");
+	return 0;
+
+begin_cull:
+	if (!test_and_set_bit(CACHEFILES_CULLING, &cache->flags)) {
+		_debug("### CULL CACHE ###");
+		cachefiles_state_changed(cache);
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+}
diff --git a/fs/cachefiles/cf-interface.c b/fs/cachefiles/cf-interface.c
new file mode 100644
index 0000000..1dfc362
--- /dev/null
+++ b/fs/cachefiles/cf-interface.c
@@ -0,0 +1,445 @@
+/* FS-Cache interface to CacheFiles
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/mount.h>
+#include <linux/buffer_head.h>
+#include "cf-internal.h"
+
+#define list_to_page(head) (list_entry((head)->prev, struct page, lru))
+
+struct cachefiles_lookup_data {
+	struct cachefiles_xattr	*auxdata;	/* auxiliary data */
+	char			*key;		/* key path */
+};
+
+static int cachefiles_attr_changed(struct fscache_object *_object);
+
+/*
+ * allocate an object record for a cookie lookup and prepare the lookup data
+ */
+static struct fscache_object *cachefiles_alloc_object(
+	struct fscache_cache *_cache,
+	struct fscache_cookie *cookie)
+{
+	struct cachefiles_lookup_data *lookup_data;
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+	struct cachefiles_xattr *auxdata;
+	unsigned keylen, auxlen;
+	void *buffer;
+	char *key;
+
+	cache = container_of(_cache, struct cachefiles_cache, cache);
+
+	_enter("{%s},%p,", cache->cache.identifier, cookie);
+
+	lookup_data = kmalloc(sizeof(lookup_data), GFP_KERNEL);
+	if (!lookup_data)
+		goto nomem_lookup_data;
+
+	/* create a new object record and a temporary leaf image */
+	object = kmem_cache_alloc(cachefiles_object_jar, GFP_KERNEL);
+	if (!object)
+		goto nomem_object;
+
+	ASSERTCMP(object->backer, ==, NULL);
+
+	BUG_ON(test_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags));
+	atomic_set(&object->usage, 1);
+
+	fscache_object_init(&object->fscache);
+	object->fscache.cookie = cookie;
+	object->fscache.cache = &cache->cache;
+
+	object->type = cookie->def->type;
+
+	/* get hold of the raw key
+	 * - stick the length on the front and leave space on the back for the
+	 *   encoder
+	 */
+	buffer = kmalloc((2 + 512) + 3, GFP_KERNEL);
+	if (!buffer)
+		goto nomem_buffer;
+
+	keylen = cookie->def->get_key(cookie->netfs_data, buffer + 2, 512);
+	ASSERTCMP(keylen, <, 512);
+
+	*(uint16_t *)buffer = keylen;
+	((char *)buffer)[keylen + 2] = 0;
+	((char *)buffer)[keylen + 3] = 0;
+	((char *)buffer)[keylen + 4] = 0;
+
+	/* turn the raw key into something that can work with as a filename */
+	key = cachefiles_cook_key(buffer, keylen + 2, object->type);
+	if (!key)
+		goto nomem_key;
+
+	/* get hold of the auxiliary data and prepend the object type */
+	auxdata = buffer;
+	auxlen = 0;
+	if (cookie->def->get_aux) {
+		auxlen = cookie->def->get_aux(cookie->netfs_data,
+					      auxdata->data, 511);
+		ASSERTCMP(auxlen, <, 511);
+	}
+
+	auxdata->len = auxlen + 1;
+	auxdata->type = cookie->def->type;
+
+	lookup_data->auxdata = auxdata;
+	lookup_data->key = key;
+	object->lookup_data = lookup_data;
+
+	_leave(" = %p [%p]", &object->fscache, lookup_data);
+	return &object->fscache;
+
+nomem_key:
+	kfree(buffer);
+nomem_buffer:
+	BUG_ON(test_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags));
+	kmem_cache_free(cachefiles_object_jar, object);
+nomem_object:
+	kfree(lookup_data);
+nomem_lookup_data:
+	_leave(" = -ENOMEM");
+	return ERR_PTR(-ENOMEM);
+}
+
+/*
+ * attempt to look up the nominated node in this cache
+ */
+static void cachefiles_lookup_object(struct fscache_object *_object)
+{
+	struct cachefiles_lookup_data *lookup_data;
+	struct cachefiles_object *parent, *object;
+	struct cachefiles_cache *cache;
+	struct task_security *saved_security;
+	int ret;
+
+	_enter("{OBJ%x}", _object->debug_id);
+
+	cache = container_of(_object->cache, struct cachefiles_cache, cache);
+	parent = container_of(_object->parent,
+			      struct cachefiles_object, fscache);
+	object = container_of(_object, struct cachefiles_object, fscache);
+	lookup_data = object->lookup_data;
+
+	ASSERTCMP(lookup_data, !=, NULL);
+
+	/* look up the key, creating any missing bits */
+	cachefiles_begin_secure(cache, &saved_security);
+	ret = cachefiles_walk_to_object(parent, object,
+					lookup_data->key,
+					lookup_data->auxdata);
+	cachefiles_end_secure(cache, saved_security);
+
+	/* polish off by setting the attributes of non-index files */
+	if (ret == 0 &&
+	    object->fscache.cookie->def->type != FSCACHE_COOKIE_TYPE_INDEX)
+		cachefiles_attr_changed(&object->fscache);
+
+	if (ret < 0)
+		fscache_object_lookup_error(&object->fscache);
+
+	_leave(" [%d]", ret);
+}
+
+/*
+ * indication of lookup completion
+ */
+static void cachefiles_lookup_complete(struct fscache_object *_object)
+{
+	struct cachefiles_object *object;
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+
+	_enter("{OBJ%x,%p}", object->fscache.debug_id, object->lookup_data);
+
+	if (object->lookup_data) {
+		kfree(object->lookup_data->key);
+		kfree(object->lookup_data->auxdata);
+		kfree(object->lookup_data);
+		object->lookup_data = NULL;
+	}
+}
+
+/*
+ * increment the usage count on an inode object (may fail if unmounting)
+ */
+static
+struct fscache_object *cachefiles_grab_object(struct fscache_object *_object)
+{
+	struct cachefiles_object *object;
+
+	_enter("{OBJ%x}", _object->debug_id);
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+
+#ifdef CACHEFILES_DEBUG_SLAB
+	ASSERT((atomic_read(&object->usage) & 0xffff0000) != 0x6b6b0000);
+#endif
+
+	atomic_inc(&object->usage);
+	return &object->fscache;
+}
+
+/*
+ * update the auxilliary data for an object object on disk
+ */
+static void cachefiles_update_object(struct fscache_object *_object)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_xattr *auxdata;
+	struct cachefiles_cache *cache;
+	struct fscache_cookie *cookie;
+	struct task_security *saved_security;
+	unsigned auxlen;
+
+	_enter("{OBJ%x}", _object->debug_id);
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+	cache = container_of(object->fscache.cache, struct cachefiles_cache,
+			     cache);
+	cookie = object->fscache.cookie;
+
+	if (!cookie->def->get_aux) {
+		_leave(" [no aux]");
+		return;
+	}
+
+	auxdata = kmalloc(2 + 512 + 3, GFP_KERNEL);
+	if (!auxdata) {
+		_leave(" [nomem]");
+		return;
+	}
+
+	auxlen = cookie->def->get_aux(cookie->netfs_data, auxdata->data, 511);
+	ASSERTCMP(auxlen, <, 511);
+
+	auxdata->len = auxlen + 1;
+	auxdata->type = cookie->def->type;
+
+	cachefiles_begin_secure(cache, &saved_security);
+	cachefiles_update_object_xattr(object, auxdata);
+	cachefiles_end_secure(cache, saved_security);
+	kfree(auxdata);
+	_leave("");
+}
+
+/*
+ * discard the resources pinned by an object and effect retirement if
+ * requested
+ */
+static void cachefiles_drop_object(struct fscache_object *_object)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+	struct task_security *saved_security;
+
+	ASSERT(_object);
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+
+	_enter("{OBJ%x,%d}",
+	       object->fscache.debug_id, atomic_read(&object->usage));
+
+	cache = container_of(object->fscache.cache,
+			     struct cachefiles_cache, cache);
+
+#ifdef CACHEFILES_DEBUG_SLAB
+	ASSERT((atomic_read(&object->usage) & 0xffff0000) != 0x6b6b0000);
+#endif
+
+	/* delete retired objects */
+	if (object->fscache.state == FSCACHE_OBJECT_RECYCLING &&
+	    _object != cache->cache.fsdef
+	    ) {
+		_debug("- retire object OBJ%x", object->fscache.debug_id);
+		cachefiles_begin_secure(cache, &saved_security);
+		cachefiles_delete_object(cache, object);
+		cachefiles_end_secure(cache, saved_security);
+	}
+
+	/* close the filesystem stuff attached to the object */
+	if (object->backer != object->dentry)
+		dput(object->backer);
+	object->backer = NULL;
+
+	/* note that the object is now inactive */
+	if (test_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags)) {
+		write_lock(&cache->active_lock);
+		if (!test_and_clear_bit(CACHEFILES_OBJECT_ACTIVE,
+					&object->flags))
+			BUG();
+		rb_erase(&object->active_node, &cache->active_nodes);
+		wake_up_bit(&object->flags, CACHEFILES_OBJECT_ACTIVE);
+		write_unlock(&cache->active_lock);
+	}
+
+	dput(object->dentry);
+	object->dentry = NULL;
+
+	_leave("");
+}
+
+/*
+ * dispose of a reference to an object
+ */
+static void cachefiles_put_object(struct fscache_object *_object)
+{
+	struct cachefiles_object *object;
+
+	ASSERT(_object);
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+
+	_enter("{OBJ%x,%d}",
+	       object->fscache.debug_id, atomic_read(&object->usage));
+
+#ifdef CACHEFILES_DEBUG_SLAB
+	ASSERT((atomic_read(&object->usage) & 0xffff0000) != 0x6b6b0000);
+#endif
+
+	ASSERTIFCMP(object->fscache.parent,
+		    object->fscache.parent->n_children, >, 0);
+
+	if (atomic_dec_and_test(&object->usage)) {
+		_debug("- kill object OBJ%x", object->fscache.debug_id);
+
+		ASSERT(!test_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags));
+		ASSERTCMP(object->fscache.parent, ==, NULL);
+		ASSERTCMP(object->backer, ==, NULL);
+		ASSERTCMP(object->dentry, ==, NULL);
+		ASSERTCMP(object->fscache.n_ops, ==, 0);
+		ASSERTCMP(object->fscache.n_children, ==, 0);
+
+		if (object->lookup_data) {
+			kfree(object->lookup_data->key);
+			kfree(object->lookup_data->auxdata);
+			kfree(object->lookup_data);
+			object->lookup_data = NULL;
+		}
+
+		kmem_cache_free(cachefiles_object_jar, object);
+	}
+
+	_leave("");
+}
+
+/*
+ * sync a cache
+ */
+static void cachefiles_sync_cache(struct fscache_cache *_cache)
+{
+	struct cachefiles_cache *cache;
+	struct task_security *saved_security;
+	int ret;
+
+	_enter("%p", _cache);
+
+	cache = container_of(_cache, struct cachefiles_cache, cache);
+
+	/* make sure all pages pinned by operations on behalf of the netfs are
+	 * written to disc */
+	cachefiles_begin_secure(cache, &saved_security);
+	ret = fsync_super(cache->mnt->mnt_sb);
+	cachefiles_end_secure(cache, saved_security);
+
+	if (ret == -EIO)
+		cachefiles_io_error(cache,
+				    "Attempt to sync backing fs superblock"
+				    " returned error %d",
+				    ret);
+}
+
+/*
+ * notification the attributes on an object have changed
+ * - called with reads/writes excluded by FS-Cache
+ */
+static int cachefiles_attr_changed(struct fscache_object *_object)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+	struct task_security *saved_security;
+	struct iattr newattrs;
+	uint64_t ni_size;
+	loff_t oi_size;
+	int ret;
+
+	_object->cookie->def->get_attr(_object->cookie->netfs_data, &ni_size);
+
+	_enter("{OBJ%x},[%llu]",
+	       _object->debug_id, (unsigned long long) ni_size);
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+	cache = container_of(object->fscache.cache,
+			     struct cachefiles_cache, cache);
+
+	if (ni_size == object->i_size)
+		return 0;
+
+	if (!object->backer)
+		return -ENOBUFS;
+
+	ASSERT(S_ISREG(object->backer->d_inode->i_mode));
+
+	fscache_set_store_limit(&object->fscache, ni_size);
+
+	oi_size = i_size_read(object->backer->d_inode);
+	if (oi_size == ni_size)
+		return 0;
+
+	newattrs.ia_size = ni_size;
+	newattrs.ia_valid = ATTR_SIZE;
+
+	cachefiles_begin_secure(cache, &saved_security);
+	mutex_lock(&object->backer->d_inode->i_mutex);
+	ret = notify_change(object->backer, &newattrs);
+	mutex_unlock(&object->backer->d_inode->i_mutex);
+	cachefiles_end_secure(cache, saved_security);
+
+	if (ret == -EIO) {
+		fscache_set_store_limit(&object->fscache, 0);
+		cachefiles_io_error_obj(object, "Size set failed");
+		ret = -ENOBUFS;
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * dissociate a cache from all the pages it was backing
+ */
+static void cachefiles_dissociate_pages(struct fscache_cache *cache)
+{
+	_enter("");
+}
+
+const struct fscache_cache_ops cachefiles_cache_ops = {
+	.name			= "cachefiles",
+	.alloc_object		= cachefiles_alloc_object,
+	.lookup_object		= cachefiles_lookup_object,
+	.lookup_complete	= cachefiles_lookup_complete,
+	.grab_object		= cachefiles_grab_object,
+	.update_object		= cachefiles_update_object,
+	.drop_object		= cachefiles_drop_object,
+	.put_object		= cachefiles_put_object,
+	.sync_cache		= cachefiles_sync_cache,
+	.attr_changed		= cachefiles_attr_changed,
+	.read_or_alloc_page	= cachefiles_read_or_alloc_page,
+	.read_or_alloc_pages	= cachefiles_read_or_alloc_pages,
+	.allocate_page		= cachefiles_allocate_page,
+	.allocate_pages		= cachefiles_allocate_pages,
+	.write_page		= cachefiles_write_page,
+	.uncache_page		= cachefiles_uncache_page,
+	.dissociate_pages	= cachefiles_dissociate_pages,
+};
diff --git a/fs/cachefiles/cf-internal.h b/fs/cachefiles/cf-internal.h
new file mode 100644
index 0000000..c15ccb4
--- /dev/null
+++ b/fs/cachefiles/cf-internal.h
@@ -0,0 +1,372 @@
+/* General netfs cache on cache files internal defs
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/fscache-cache.h>
+#include <linux/timer.h>
+#include <linux/wait.h>
+#include <linux/workqueue.h>
+#include <linux/security.h>
+
+struct cachefiles_cache;
+struct cachefiles_object;
+
+extern unsigned cachefiles_debug;
+#define CACHEFILES_DEBUG_KENTER	1
+#define CACHEFILES_DEBUG_KLEAVE	2
+#define CACHEFILES_DEBUG_KDEBUG	4
+
+/*
+ * node records
+ */
+struct cachefiles_object {
+	struct fscache_object		fscache;	/* fscache handle */
+	struct cachefiles_lookup_data	*lookup_data;	/* cached lookup data */
+	struct dentry			*dentry;	/* the file/dir representing this object */
+	struct dentry			*backer;	/* backing file */
+	loff_t				i_size;		/* object size */
+	unsigned long			flags;
+#define CACHEFILES_OBJECT_ACTIVE	0		/* T if marked active */
+	atomic_t			usage;		/* object usage count */
+	uint8_t				type;		/* object type */
+	uint8_t				new;		/* T if object new */
+	spinlock_t			work_lock;
+	struct rb_node			active_node;	/* link in active tree (dentry is key) */
+};
+
+extern struct kmem_cache *cachefiles_object_jar;
+
+/*
+ * Cache files cache definition
+ */
+struct cachefiles_cache {
+	struct fscache_cache		cache;		/* FS-Cache record */
+	struct vfsmount			*mnt;		/* mountpoint holding the cache */
+	struct dentry			*graveyard;	/* directory into which dead objects go */
+	struct file			*cachefilesd;	/* manager daemon handle */
+	struct task_security		*cache_sec;	/* security override for accessing cache */
+	struct mutex			daemon_mutex;	/* command serialisation mutex */
+	wait_queue_head_t		daemon_pollwq;	/* poll waitqueue for daemon */
+	struct rb_root			active_nodes;	/* active nodes (can't be culled) */
+	rwlock_t			active_lock;	/* lock for active_nodes */
+	atomic_t			gravecounter;	/* graveyard uniquifier */
+	unsigned			frun_percent;	/* when to stop culling (% files) */
+	unsigned			fcull_percent;	/* when to start culling (% files) */
+	unsigned			fstop_percent;	/* when to stop allocating (% files) */
+	unsigned			brun_percent;	/* when to stop culling (% blocks) */
+	unsigned			bcull_percent;	/* when to start culling (% blocks) */
+	unsigned			bstop_percent;	/* when to stop allocating (% blocks) */
+	unsigned			bsize;		/* cache's block size */
+	unsigned			bshift;		/* min(ilog2(PAGE_SIZE / bsize), 0) */
+	uint64_t			frun;		/* when to stop culling */
+	uint64_t			fcull;		/* when to start culling */
+	uint64_t			fstop;		/* when to stop allocating */
+	sector_t			brun;		/* when to stop culling */
+	sector_t			bcull;		/* when to start culling */
+	sector_t			bstop;		/* when to stop allocating */
+	unsigned long			flags;
+#define CACHEFILES_READY		0	/* T if cache prepared */
+#define CACHEFILES_DEAD			1	/* T if cache dead */
+#define CACHEFILES_CULLING		2	/* T if cull engaged */
+#define CACHEFILES_STATE_CHANGED	3	/* T if state changed (poll trigger) */
+	char				*rootdirname;	/* name of cache root directory */
+	char				*tag;		/* cache binding tag */
+};
+
+/*
+ * backing file read tracking
+ */
+struct cachefiles_one_read {
+	wait_queue_t			monitor;	/* link into monitored waitqueue */
+	struct page			*back_page;	/* backing file page we're waiting for */
+	struct page			*netfs_page;	/* netfs page we're going to fill */
+	struct fscache_retrieval	*op;		/* retrieval op covering this */
+	struct list_head		op_link;	/* link in op's todo list */
+};
+
+/*
+ * backing file write tracking
+ */
+struct cachefiles_one_write {
+	struct page			*netfs_page;	/* netfs page to copy */
+	struct cachefiles_object	*object;
+	struct list_head		obj_link;	/* link in object's lists */
+	fscache_rw_complete_t		end_io_func;
+	void				*context;
+};
+
+/*
+ * auxiliary data xattr buffer
+ */
+struct cachefiles_xattr {
+	uint16_t			len;
+	uint8_t				type;
+	uint8_t				data[];
+};
+
+/*
+ * note change of state for daemon
+ */
+static inline void cachefiles_state_changed(struct cachefiles_cache *cache)
+{
+	set_bit(CACHEFILES_STATE_CHANGED, &cache->flags);
+	wake_up_all(&cache->daemon_pollwq);
+}
+
+/*
+ * cf-bind.c
+ */
+extern int cachefiles_daemon_bind(struct cachefiles_cache *cache, char *args);
+extern void cachefiles_daemon_unbind(struct cachefiles_cache *cache);
+
+/*
+ * cf-daemon.c
+ */
+extern const struct file_operations cachefiles_daemon_fops;
+
+extern int cachefiles_has_space(struct cachefiles_cache *cache,
+				unsigned fnr, unsigned bnr);
+
+/*
+ * cf-interface.c
+ */
+extern const struct fscache_cache_ops cachefiles_cache_ops;
+
+/*
+ * cf-key.c
+ */
+extern char *cachefiles_cook_key(const u8 *raw, int keylen, uint8_t type);
+
+/*
+ * cf-namei.c
+ */
+extern int cachefiles_delete_object(struct cachefiles_cache *cache,
+				    struct cachefiles_object *object);
+extern int cachefiles_walk_to_object(struct cachefiles_object *parent,
+				     struct cachefiles_object *object,
+				     const char *key,
+				     struct cachefiles_xattr *auxdata);
+extern struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
+					       struct dentry *dir,
+					       const char *name);
+
+extern int cachefiles_cull(struct cachefiles_cache *cache, struct dentry *dir,
+			   char *filename);
+
+extern int cachefiles_check_in_use(struct cachefiles_cache *cache,
+				   struct dentry *dir, char *filename);
+
+/*
+ * cf-proc.c
+ */
+#ifdef CONFIG_CACHEFILES_HISTOGRAM
+extern atomic_t cachefiles_lookup_histogram[HZ];
+extern atomic_t cachefiles_mkdir_histogram[HZ];
+extern atomic_t cachefiles_create_histogram[HZ];
+
+extern int __init cachefiles_proc_init(void);
+extern void cachefiles_proc_cleanup(void);
+static inline
+void cachefiles_hist(atomic_t histogram[], unsigned long start_jif)
+{
+	unsigned long jif = jiffies - start_jif;
+	if (jif >= HZ)
+		jif = HZ - 1;
+	atomic_inc(&histogram[jif]);
+}
+
+#else
+#define cachefiles_proc_init()		(0)
+#define cachefiles_proc_cleanup()	do {} while (0)
+#define cachefiles_hist(hist, start_jif) do {} while (0)
+#endif
+
+/*
+ * cf-rdwr.c
+ */
+extern int cachefiles_read_or_alloc_page(struct fscache_retrieval *,
+					 struct page *, gfp_t);
+extern int cachefiles_read_or_alloc_pages(struct fscache_retrieval *,
+					  struct list_head *, unsigned *,
+					  gfp_t);
+extern int cachefiles_allocate_page(struct fscache_retrieval *, struct page *,
+				    gfp_t);
+extern int cachefiles_allocate_pages(struct fscache_retrieval *,
+				     struct list_head *, unsigned *, gfp_t);
+extern int cachefiles_write_page(struct fscache_storage *, struct page *);
+extern void cachefiles_uncache_page(struct fscache_object *, struct page *);
+
+/*
+ * cf-security.c
+ */
+extern int cachefiles_get_security_ID(struct cachefiles_cache *cache);
+extern int cachefiles_determine_cache_security(struct cachefiles_cache *cache,
+					       struct dentry *root);
+
+static inline void cachefiles_begin_secure(struct cachefiles_cache *cache,
+					   struct task_security **_saved_sec)
+{
+	*_saved_sec = current->act_as;
+	current->act_as = get_task_security(cache->cache_sec);
+}
+
+static inline void cachefiles_end_secure(struct cachefiles_cache *cache,
+					 struct task_security *saved_sec)
+{
+	struct task_security *old_sec = current->act_as;
+	current->act_as = saved_sec;
+	put_task_security(old_sec);
+}
+
+/*
+ * cf-xattr.c
+ */
+extern int cachefiles_check_object_type(struct cachefiles_object *object);
+extern int cachefiles_set_object_xattr(struct cachefiles_object *object,
+				       struct cachefiles_xattr *auxdata);
+extern int cachefiles_update_object_xattr(struct cachefiles_object *object,
+					  struct cachefiles_xattr *auxdata);
+extern int cachefiles_check_object_xattr(struct cachefiles_object *object,
+					 struct cachefiles_xattr *auxdata);
+extern int cachefiles_remove_object_xattr(struct cachefiles_cache *cache,
+					  struct dentry *dentry);
+
+
+/*
+ * error handling
+ */
+#define kerror(FMT, ...) printk(KERN_ERR "CacheFiles: "FMT"\n", ##__VA_ARGS__)
+
+#define cachefiles_io_error(___cache, FMT, ...)		\
+do {							\
+	kerror("I/O Error: " FMT, ##__VA_ARGS__);	\
+	fscache_io_error(&(___cache)->cache);		\
+	set_bit(CACHEFILES_DEAD, &(___cache)->flags);	\
+} while (0)
+
+#define cachefiles_io_error_obj(object, FMT, ...)			\
+do {									\
+	struct cachefiles_cache *___cache;				\
+									\
+	___cache = container_of((object)->fscache.cache,		\
+				struct cachefiles_cache, cache);	\
+	cachefiles_io_error(___cache, FMT, ##__VA_ARGS__);		\
+} while (0)
+
+
+/*
+ * debug tracing
+ */
+#define dbgprintk(FMT, ...) \
+	printk(KERN_DEBUG "[%-6.6s] "FMT"\n", current->comm, ##__VA_ARGS__)
+
+/* make sure we maintain the format strings, even when debugging is disabled */
+static inline void _dbprintk(const char *fmt, ...)
+	__attribute__((format(printf, 1, 2)));
+static inline void _dbprintk(const char *fmt, ...)
+{
+}
+
+#define kenter(FMT, ...) dbgprintk("==> %s("FMT")", __FUNCTION__, ##__VA_ARGS__)
+#define kleave(FMT, ...) dbgprintk("<== %s()"FMT"", __FUNCTION__, ##__VA_ARGS__)
+#define kdebug(FMT, ...) dbgprintk(FMT, ##__VA_ARGS__)
+
+
+#if defined(__KDEBUG)
+#define _enter(FMT, ...) kenter(FMT, ##__VA_ARGS__)
+#define _leave(FMT, ...) kleave(FMT, ##__VA_ARGS__)
+#define _debug(FMT, ...) kdebug(FMT, ##__VA_ARGS__)
+
+#elif defined(CONFIG_CACHEFILES_DEBUG)
+#define _enter(FMT, ...)				\
+do {							\
+	if (cachefiles_debug & CACHEFILES_DEBUG_KENTER)	\
+		kenter(FMT, ##__VA_ARGS__);		\
+} while (0)
+
+#define _leave(FMT, ...)				\
+do {							\
+	if (cachefiles_debug & CACHEFILES_DEBUG_KLEAVE)	\
+		kleave(FMT, ##__VA_ARGS__);		\
+} while (0)
+
+#define _debug(FMT, ...)				\
+do {							\
+	if (cachefiles_debug & CACHEFILES_DEBUG_KDEBUG)	\
+		kdebug(FMT, ##__VA_ARGS__);		\
+} while (0)
+
+#else
+#define _enter(FMT, ...) _dbprintk("==> %s("FMT")", __FUNCTION__, ##__VA_ARGS__)
+#define _leave(FMT, ...) _dbprintk("<== %s()"FMT"", __FUNCTION__, ##__VA_ARGS__)
+#define _debug(FMT, ...) _dbprintk(FMT, ##__VA_ARGS__)
+#endif
+
+#if 1 /* defined(__KDEBUGALL) */
+
+#define ASSERT(X)							\
+do {									\
+	if (unlikely(!(X))) {						\
+		printk(KERN_ERR "\n");					\
+		printk(KERN_ERR "CacheFiles: Assertion failed\n");	\
+		BUG();							\
+	}								\
+} while (0)
+
+#define ASSERTCMP(X, OP, Y)						\
+do {									\
+	if (unlikely(!((X) OP (Y)))) {					\
+		printk(KERN_ERR "\n");					\
+		printk(KERN_ERR "CacheFiles: Assertion failed\n");	\
+		printk(KERN_ERR "%lx " #OP " %lx is false\n",		\
+		       (unsigned long)(X), (unsigned long)(Y));		\
+		BUG();							\
+	}								\
+} while (0)
+
+#define ASSERTIF(C, X)							\
+do {									\
+	if (unlikely((C) && !(X))) {					\
+		printk(KERN_ERR "\n");					\
+		printk(KERN_ERR "CacheFiles: Assertion failed\n");	\
+		BUG();							\
+	}								\
+} while (0)
+
+#define ASSERTIFCMP(C, X, OP, Y)					\
+do {									\
+	if (unlikely((C) && !((X) OP (Y)))) {				\
+		printk(KERN_ERR "\n");					\
+		printk(KERN_ERR "CacheFiles: Assertion failed\n");	\
+		printk(KERN_ERR "%lx " #OP " %lx is false\n",		\
+		       (unsigned long)(X), (unsigned long)(Y));		\
+		BUG();							\
+	}								\
+} while (0)
+
+#else
+
+#define ASSERT(X)				\
+do {						\
+} while (0)
+
+#define ASSERTCMP(X, OP, Y)			\
+do {						\
+} while (0)
+
+#define ASSERTIF(C, X)				\
+do {						\
+} while (0)
+
+#define ASSERTIFCMP(C, X, OP, Y)		\
+do {						\
+} while (0)
+
+#endif
diff --git a/fs/cachefiles/cf-key.c b/fs/cachefiles/cf-key.c
new file mode 100644
index 0000000..f3fa75f
--- /dev/null
+++ b/fs/cachefiles/cf-key.c
@@ -0,0 +1,159 @@
+/* Key to pathname encoder
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/slab.h>
+#include "cf-internal.h"
+
+static const char cachefiles_charmap[64] =
+	"0123456789"			/* 0 - 9 */
+	"abcdefghijklmnopqrstuvwxyz"	/* 10 - 35 */
+	"ABCDEFGHIJKLMNOPQRSTUVWXYZ"	/* 36 - 61 */
+	"_-"				/* 62 - 63 */
+	;
+
+static const char cachefiles_filecharmap[256] = {
+	/* we skip space and tab and control chars */
+	[33 ... 46] = 1,		/* '!' -> '.' */
+	/* we skip '/' as it's significant to pathwalk */
+	[48 ... 127] = 1,		/* '0' -> '~' */
+};
+
+/*
+ * turn the raw key into something cooked
+ * - the raw key should include the length in the two bytes at the front
+ * - the key may be up to 514 bytes in length (including the length word)
+ *   - "base64" encode the strange keys, mapping 3 bytes of raw to four of
+ *     cooked
+ *   - need to cut the cooked key into 252 char lengths (189 raw bytes)
+ */
+char *cachefiles_cook_key(const u8 *raw, int keylen, uint8_t type)
+{
+	unsigned char csum, ch;
+	unsigned int acc;
+	char *key;
+	int loop, len, max, seg, mark, print;
+
+	_enter(",%d", keylen);
+
+	BUG_ON(keylen < 2 || keylen > 514);
+
+	csum = raw[0] + raw[1];
+	print = 1;
+	for (loop = 2; loop < keylen; loop++) {
+		ch = raw[loop];
+		csum += ch;
+		print &= cachefiles_filecharmap[ch];
+	}
+
+	if (print) {
+		/* if the path is usable ASCII, then we render it directly */
+		max = keylen - 2;
+		max += 2;	/* two base64'd length chars on the front */
+		max += 5;	/* @checksum/M */
+		max += 3 * 2;	/* maximum number of segment dividers (".../M")
+				 * is ((514 + 251) / 252) = 3
+				 */
+		max += 1;	/* NUL on end */
+	} else {
+		/* calculate the maximum length of the cooked key */
+		keylen = (keylen + 2) / 3;
+
+		max = keylen * 4;
+		max += 5;	/* @checksum/M */
+		max += 3 * 2;	/* maximum number of segment dividers (".../M")
+				 * is ((514 + 188) / 189) = 3
+				 */
+		max += 1;	/* NUL on end */
+	}
+
+	max += 1;	/* 2nd NUL on end */
+
+	_debug("max: %d", max);
+
+	key = kmalloc(max, GFP_KERNEL);
+	if (!key)
+		return NULL;
+
+	len = 0;
+
+	/* build the cooked key */
+	sprintf(key, "@%02x%c+", (unsigned) csum, 0);
+	len = 5;
+	mark = len - 1;
+
+	if (print) {
+		acc = *(uint16_t *) raw;
+		raw += 2;
+
+		key[len + 1] = cachefiles_charmap[acc & 63];
+		acc >>= 6;
+		key[len] = cachefiles_charmap[acc & 63];
+		len += 2;
+
+		seg = 250;
+		for (loop = keylen; loop > 0; loop--) {
+			if (seg <= 0) {
+				key[len++] = '\0';
+				mark = len;
+				key[len++] = '+';
+				seg = 252;
+			}
+
+			key[len++] = *raw++;
+			ASSERT(len < max);
+		}
+
+		switch (type) {
+		case FSCACHE_COOKIE_TYPE_INDEX:		type = 'I';	break;
+		case FSCACHE_COOKIE_TYPE_DATAFILE:	type = 'D';	break;
+		default:				type = 'S';	break;
+		}
+	} else {
+		seg = 252;
+		for (loop = keylen; loop > 0; loop--) {
+			if (seg <= 0) {
+				key[len++] = '\0';
+				mark = len;
+				key[len++] = '+';
+				seg = 252;
+			}
+
+			acc = *raw++;
+			acc |= *raw++ << 8;
+			acc |= *raw++ << 16;
+
+			_debug("acc: %06x", acc);
+
+			key[len++] = cachefiles_charmap[acc & 63];
+			acc >>= 6;
+			key[len++] = cachefiles_charmap[acc & 63];
+			acc >>= 6;
+			key[len++] = cachefiles_charmap[acc & 63];
+			acc >>= 6;
+			key[len++] = cachefiles_charmap[acc & 63];
+
+			ASSERT(len < max);
+		}
+
+		switch (type) {
+		case FSCACHE_COOKIE_TYPE_INDEX:		type = 'J';	break;
+		case FSCACHE_COOKIE_TYPE_DATAFILE:	type = 'E';	break;
+		default:				type = 'T';	break;
+		}
+	}
+
+	key[mark] = type;
+	key[len++] = 0;
+	key[len] = 0;
+
+	_leave(" = %p %d", key, len);
+	return key;
+}
diff --git a/fs/cachefiles/cf-main.c b/fs/cachefiles/cf-main.c
new file mode 100644
index 0000000..eb983b7
--- /dev/null
+++ b/fs/cachefiles/cf-main.c
@@ -0,0 +1,108 @@
+/* Network filesystem caching backend to use cache files on a premounted
+ * filesystem
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/mount.h>
+#include <linux/statfs.h>
+#include <linux/sysctl.h>
+#include <linux/miscdevice.h>
+#include "cf-internal.h"
+
+unsigned cachefiles_debug;
+module_param_named(debug, cachefiles_debug, uint, S_IWUSR | S_IRUGO);
+MODULE_PARM_DESC(cachefiles_debug, "CacheFiles debugging mask");
+
+MODULE_DESCRIPTION("Mounted-filesystem based cache");
+MODULE_AUTHOR("Red Hat, Inc.");
+MODULE_LICENSE("GPL");
+
+struct kmem_cache *cachefiles_object_jar;
+
+static struct miscdevice cachefiles_dev = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= "cachefiles",
+	.fops	= &cachefiles_daemon_fops,
+};
+
+static void cachefiles_object_init_once(struct kmem_cache *cachep,
+					void *_object)
+{
+	struct cachefiles_object *object = _object;
+
+	memset(object, 0, sizeof(*object));
+	fscache_object_init(&object->fscache);
+	spin_lock_init(&object->work_lock);
+}
+
+/*
+ * initialise the fs caching module
+ */
+static int __init cachefiles_init(void)
+{
+	int ret;
+
+	ret = misc_register(&cachefiles_dev);
+	if (ret < 0)
+		goto error_dev;
+
+	/* create an object jar */
+	ret = -ENOMEM;
+	cachefiles_object_jar =
+		kmem_cache_create("cachefiles_object_jar",
+				  sizeof(struct cachefiles_object),
+				  0,
+				  SLAB_HWCACHE_ALIGN,
+				  cachefiles_object_init_once);
+	if (!cachefiles_object_jar) {
+		printk(KERN_NOTICE
+		       "CacheFiles: Failed to allocate an object jar\n");
+		goto error_object_jar;
+	}
+
+	ret = cachefiles_proc_init();
+	if (ret < 0)
+		goto error_proc;
+
+	printk(KERN_INFO "CacheFiles: Loaded\n");
+	return 0;
+
+error_proc:
+	kmem_cache_destroy(cachefiles_object_jar);
+error_object_jar:
+	misc_deregister(&cachefiles_dev);
+error_dev:
+	kerror("failed to register: %d", ret);
+	return ret;
+}
+
+fs_initcall(cachefiles_init);
+
+/*
+ * clean up on module removal
+ */
+static void __exit cachefiles_exit(void)
+{
+	printk(KERN_INFO "CacheFiles: Unloading\n");
+
+	cachefiles_proc_cleanup();
+	kmem_cache_destroy(cachefiles_object_jar);
+	misc_deregister(&cachefiles_dev);
+}
+
+module_exit(cachefiles_exit);
diff --git a/fs/cachefiles/cf-namei.c b/fs/cachefiles/cf-namei.c
new file mode 100644
index 0000000..e642145
--- /dev/null
+++ b/fs/cachefiles/cf-namei.c
@@ -0,0 +1,739 @@
+/* CacheFiles path walking and related routines
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/fsnotify.h>
+#include <linux/quotaops.h>
+#include <linux/xattr.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/security.h>
+#include "cf-internal.h"
+
+static int cachefiles_wait_bit(void *flags)
+{
+	schedule();
+	return 0;
+}
+
+/*
+ * record the fact that an object is now active
+ */
+static void cachefiles_mark_object_active(struct cachefiles_cache *cache,
+					  struct cachefiles_object *object)
+{
+	struct cachefiles_object *xobject;
+	struct rb_node **_p, *_parent = NULL;
+	struct dentry *dentry;
+
+	_enter(",%p", object);
+
+try_again:
+	write_lock(&cache->active_lock);
+
+	if (test_and_set_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags))
+		BUG();
+
+	dentry = object->dentry;
+	_p = &cache->active_nodes.rb_node;
+	while (*_p) {
+		_parent = *_p;
+		xobject = rb_entry(_parent,
+				   struct cachefiles_object, active_node);
+
+		if (xobject->dentry > dentry)
+			_p = &(*_p)->rb_left;
+		else if (xobject->dentry < dentry)
+			_p = &(*_p)->rb_right;
+		else
+			goto wait_for_old_object;
+	}
+
+	rb_link_node(&object->active_node, _parent, _p);
+	rb_insert_color(&object->active_node, &cache->active_nodes);
+
+	write_unlock(&cache->active_lock);
+	_leave("");
+	return;
+
+	/* an old object from a previous incarnation is hogging the slot - we
+	 * need to wait for it to be destroyed */
+wait_for_old_object:
+	_debug("old OBJ%x", xobject->fscache.debug_id);
+	ASSERTCMP(xobject->fscache.state, >=, FSCACHE_OBJECT_DYING);
+	atomic_inc(&xobject->usage);
+	//clear_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags);
+	write_unlock(&cache->active_lock);
+
+	_debug(">>> wait");
+	wait_on_bit(&xobject->flags, CACHEFILES_OBJECT_ACTIVE,
+		    cachefiles_wait_bit, TASK_UNINTERRUPTIBLE);
+	_debug("<<< waited");
+
+	cache->cache.ops->put_object(&xobject->fscache);
+	goto try_again;
+}
+
+/*
+ * delete an object representation from the cache
+ * - file backed objects are unlinked
+ * - directory backed objects are stuffed into the graveyard for userspace to
+ *   delete
+ * - unlocks the directory mutex
+ */
+static int cachefiles_bury_object(struct cachefiles_cache *cache,
+				  struct dentry *dir,
+				  struct dentry *rep)
+{
+	struct dentry *grave, *trap;
+	char nbuffer[8 + 8 + 1];
+	int ret;
+
+	_enter(",'%*.*s','%*.*s'",
+	       dir->d_name.len, dir->d_name.len, dir->d_name.name,
+	       rep->d_name.len, rep->d_name.len, rep->d_name.name);
+
+	/* non-directories can just be unlinked */
+	if (!S_ISDIR(rep->d_inode->i_mode)) {
+		_debug("unlink stale object");
+		ret = vfs_unlink(dir->d_inode, rep);
+
+		mutex_unlock(&dir->d_inode->i_mutex);
+
+		if (ret == -EIO)
+			cachefiles_io_error(cache, "Unlink failed");
+
+		_leave(" = %d", ret);
+		return ret;
+	}
+
+	/* directories have to be moved to the graveyard */
+	_debug("move stale object to graveyard");
+	mutex_unlock(&dir->d_inode->i_mutex);
+
+try_again:
+	/* first step is to make up a grave dentry in the graveyard */
+	sprintf(nbuffer, "%08x%08x",
+		(uint32_t) get_seconds(),
+		(uint32_t) atomic_inc_return(&cache->gravecounter));
+
+	/* do the multiway lock magic */
+	trap = lock_rename(cache->graveyard, dir);
+
+	/* do some checks before getting the grave dentry */
+	if (rep->d_parent != dir) {
+		/* the entry was probably culled when we dropped the parent dir
+		 * lock */
+		unlock_rename(cache->graveyard, dir);
+		_leave(" = 0 [culled?]");
+		return 0;
+	}
+
+	if (!S_ISDIR(cache->graveyard->d_inode->i_mode)) {
+		unlock_rename(cache->graveyard, dir);
+		cachefiles_io_error(cache, "Graveyard no longer a directory");
+		return -EIO;
+	}
+
+	if (trap == rep) {
+		unlock_rename(cache->graveyard, dir);
+		cachefiles_io_error(cache, "May not make directory loop");
+		return -EIO;
+	}
+
+	if (d_mountpoint(rep)) {
+		unlock_rename(cache->graveyard, dir);
+		cachefiles_io_error(cache, "Mountpoint in cache");
+		return -EIO;
+	}
+
+	grave = lookup_one_len(nbuffer, cache->graveyard, strlen(nbuffer));
+	if (IS_ERR(grave)) {
+		unlock_rename(cache->graveyard, dir);
+
+		if (PTR_ERR(grave) == -ENOMEM) {
+			_leave(" = -ENOMEM");
+			return -ENOMEM;
+		}
+
+		cachefiles_io_error(cache, "Lookup error %ld",
+				    PTR_ERR(grave));
+		return -EIO;
+	}
+
+	if (grave->d_inode) {
+		unlock_rename(cache->graveyard, dir);
+		dput(grave);
+		grave = NULL;
+		cond_resched();
+		goto try_again;
+	}
+
+	if (d_mountpoint(grave)) {
+		unlock_rename(cache->graveyard, dir);
+		dput(grave);
+		cachefiles_io_error(cache, "Mountpoint in graveyard");
+		return -EIO;
+	}
+
+	/* target should not be an ancestor of source */
+	if (trap == grave) {
+		unlock_rename(cache->graveyard, dir);
+		dput(grave);
+		cachefiles_io_error(cache, "May not make directory loop");
+		return -EIO;
+	}
+
+	/* attempt the rename */
+	ret = vfs_rename(dir->d_inode, rep, cache->graveyard->d_inode, grave);
+	if (ret != 0 && ret != -ENOMEM)
+		cachefiles_io_error(cache, "Rename failed with error %d", ret);
+
+	unlock_rename(cache->graveyard, dir);
+	dput(grave);
+	_leave(" = 0");
+	return 0;
+}
+
+/*
+ * delete an object representation from the cache
+ */
+int cachefiles_delete_object(struct cachefiles_cache *cache,
+			     struct cachefiles_object *object)
+{
+	struct dentry *dir;
+	int ret;
+
+	_enter(",{%p}", object->dentry);
+
+	ASSERT(object->dentry);
+	ASSERT(object->dentry->d_inode);
+	ASSERT(object->dentry->d_parent);
+
+	dir = dget_parent(object->dentry);
+
+	mutex_lock(&dir->d_inode->i_mutex);
+	ret = cachefiles_bury_object(cache, dir, object->dentry);
+
+	dput(dir);
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * walk from the parent object to the child object through the backing
+ * filesystem, creating directories as we go
+ */
+int cachefiles_walk_to_object(struct cachefiles_object *parent,
+			      struct cachefiles_object *object,
+			      const char *key,
+			      struct cachefiles_xattr *auxdata)
+{
+	struct cachefiles_cache *cache;
+	struct dentry *dir, *next = NULL;
+	unsigned long start;
+	const char *name;
+	int ret, nlen;
+
+	_enter("{%p},,%s,", parent->dentry, key);
+
+	cache = container_of(parent->fscache.cache,
+			     struct cachefiles_cache, cache);
+
+	ASSERT(parent->dentry);
+	ASSERT(parent->dentry->d_inode);
+
+	if (!(S_ISDIR(parent->dentry->d_inode->i_mode))) {
+		// TODO: convert file to dir
+		_leave("looking up in none directory");
+		return -ENOBUFS;
+	}
+
+	dir = dget(parent->dentry);
+
+advance:
+	/* attempt to transit the first directory component */
+	name = key;
+	nlen = strlen(key);
+
+	/* key ends in a double NUL */
+	key = key + nlen + 1;
+	if (!*key)
+		key = NULL;
+
+lookup_again:
+	/* search the current directory for the element name */
+	_debug("lookup '%s'", name);
+
+	mutex_lock(&dir->d_inode->i_mutex);
+
+	start = jiffies;
+	next = lookup_one_len(name, dir, nlen);
+	cachefiles_hist(cachefiles_lookup_histogram, start);
+	if (IS_ERR(next))
+		goto lookup_error;
+
+	_debug("next -> %p %s", next, next->d_inode ? "positive" : "negative");
+
+	if (!key)
+		object->new = !next->d_inode;
+
+	/* if this element of the path doesn't exist, then the lookup phase
+	 * failed, and we can release any readers in the certain knowledge that
+	 * there's nothing for them to actually read */
+	if (!next->d_inode)
+		fscache_object_lookup_negative(&object->fscache);
+
+	/* we need to create the object if it's negative */
+	if (key || object->type == FSCACHE_COOKIE_TYPE_INDEX) {
+		/* index objects and intervening tree levels must be subdirs */
+		if (!next->d_inode) {
+			ret = cachefiles_has_space(cache, 1, 0);
+			if (ret < 0)
+				goto create_error;
+
+			start = jiffies;
+			ret = vfs_mkdir(dir->d_inode, next, 0);
+			cachefiles_hist(cachefiles_mkdir_histogram, start);
+			if (ret < 0)
+				goto create_error;
+
+			ASSERT(next->d_inode);
+
+			_debug("mkdir -> %p{%p{ino=%lu}}",
+			       next, next->d_inode, next->d_inode->i_ino);
+
+		} else if (!S_ISDIR(next->d_inode->i_mode)) {
+			kerror("inode %lu is not a directory",
+			       next->d_inode->i_ino);
+			ret = -ENOBUFS;
+			goto error;
+		}
+
+	} else {
+		/* non-index objects start out life as files */
+		if (!next->d_inode) {
+			ret = cachefiles_has_space(cache, 1, 0);
+			if (ret < 0)
+				goto create_error;
+
+			start = jiffies;
+			ret = vfs_create(dir->d_inode, next, S_IFREG, NULL);
+			cachefiles_hist(cachefiles_create_histogram, start);
+			if (ret < 0)
+				goto create_error;
+
+			ASSERT(next->d_inode);
+
+			_debug("create -> %p{%p{ino=%lu}}",
+			       next, next->d_inode, next->d_inode->i_ino);
+
+		} else if (!S_ISDIR(next->d_inode->i_mode) &&
+			   !S_ISREG(next->d_inode->i_mode)
+			   ) {
+			kerror("inode %lu is not a file or directory",
+			       next->d_inode->i_ino);
+			ret = -ENOBUFS;
+			goto error;
+		}
+	}
+
+	/* process the next component */
+	if (key) {
+		_debug("advance");
+		mutex_unlock(&dir->d_inode->i_mutex);
+		dput(dir);
+		dir = next;
+		next = NULL;
+		goto advance;
+	}
+
+	/* we've found the object we were looking for */
+	object->dentry = next;
+
+	/* if we've found that the terminal object exists, then we need to
+	 * check its attributes and delete it if it's out of date */
+	if (!object->new) {
+		_debug("validate '%*.*s'",
+		       next->d_name.len, next->d_name.len, next->d_name.name);
+
+		ret = cachefiles_check_object_xattr(object, auxdata);
+		if (ret == -ESTALE) {
+			/* delete the object (the deleter drops the directory
+			 * mutex) */
+			object->dentry = NULL;
+
+			ret = cachefiles_bury_object(cache, dir, next);
+			dput(next);
+			next = NULL;
+
+			if (ret < 0)
+				goto delete_error;
+
+			_debug("redo lookup");
+			goto lookup_again;
+		}
+	}
+
+	/* note that we're now using this object */
+	cachefiles_mark_object_active(cache, object);
+
+	mutex_unlock(&dir->d_inode->i_mutex);
+	dput(dir);
+	dir = NULL;
+
+	_debug("=== OBTAINED_OBJECT ===");
+
+	if (object->new) {
+		/* attach data to a newly constructed terminal object */
+		ret = cachefiles_set_object_xattr(object, auxdata);
+		if (ret < 0)
+			goto check_error;
+	} else {
+		/* always update the atime on an object we've just looked up
+		 * (this is used to keep track of culling, and atimes are only
+		 * updated by read, write and readdir but not lookup or
+		 * open) */
+		touch_atime(cache->mnt, next);
+	}
+
+	/* open a file interface onto a data file */
+	if (object->type != FSCACHE_COOKIE_TYPE_INDEX) {
+		if (S_ISREG(object->dentry->d_inode->i_mode)) {
+			const struct address_space_operations *aops;
+
+			ret = -EPERM;
+			aops = object->dentry->d_inode->i_mapping->a_ops;
+			if (!aops->bmap ||
+			    !aops->write_one_page)
+				goto check_error;
+
+			object->backer = object->dentry;
+		} else {
+			BUG(); // TODO: open file in data-class subdir
+		}
+	}
+
+	object->new = 0;
+	fscache_obtained_object(&object->fscache);
+
+	_leave(" = 0 [%lu]", object->dentry->d_inode->i_ino);
+	return 0;
+
+create_error:
+	_debug("create error %d", ret);
+	if (ret == -EIO)
+		cachefiles_io_error(cache, "Create/mkdir failed");
+	goto error;
+
+check_error:
+	_debug("check error %d", ret);
+	write_lock(&cache->active_lock);
+	rb_erase(&object->active_node, &cache->active_nodes);
+	write_unlock(&cache->active_lock);
+
+	dput(object->dentry);
+	object->dentry = NULL;
+	goto error_out;
+
+delete_error:
+	_debug("delete error %d", ret);
+	goto error_out2;
+
+lookup_error:
+	_debug("lookup error %ld", PTR_ERR(next));
+	ret = PTR_ERR(next);
+	if (ret == -EIO)
+		cachefiles_io_error(cache, "Lookup failed");
+	next = NULL;
+error:
+	mutex_unlock(&dir->d_inode->i_mutex);
+	dput(next);
+error_out2:
+	dput(dir);
+error_out:
+	if (ret == -ENOSPC)
+		ret = -ENOBUFS;
+
+	_leave(" = error %d", -ret);
+	return ret;
+}
+
+/*
+ * get a subdirectory
+ */
+struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
+					struct dentry *dir,
+					const char *dirname)
+{
+	struct dentry *subdir;
+	unsigned long start;
+	int ret;
+
+	_enter(",,%s", dirname);
+
+	/* search the current directory for the element name */
+	mutex_lock(&dir->d_inode->i_mutex);
+
+	start = jiffies;
+	subdir = lookup_one_len(dirname, dir, strlen(dirname));
+	cachefiles_hist(cachefiles_lookup_histogram, start);
+	if (IS_ERR(subdir)) {
+		if (PTR_ERR(subdir) == -ENOMEM)
+			goto nomem_d_alloc;
+		goto lookup_error;
+	}
+
+	_debug("subdir -> %p %s",
+	       subdir, subdir->d_inode ? "positive" : "negative");
+
+	/* we need to create the subdir if it doesn't exist yet */
+	if (!subdir->d_inode) {
+		ret = cachefiles_has_space(cache, 1, 0);
+		if (ret < 0)
+			goto mkdir_error;
+
+		_debug("attempt mkdir");
+
+		ret = vfs_mkdir(dir->d_inode, subdir, 0700);
+		if (ret < 0)
+			goto mkdir_error;
+
+		ASSERT(subdir->d_inode);
+
+		_debug("mkdir -> %p{%p{ino=%lu}}",
+		       subdir,
+		       subdir->d_inode,
+		       subdir->d_inode->i_ino);
+	}
+
+	mutex_unlock(&dir->d_inode->i_mutex);
+
+	/* we need to make sure the subdir is a directory */
+	ASSERT(subdir->d_inode);
+
+	if (!S_ISDIR(subdir->d_inode->i_mode)) {
+		kerror("%s is not a directory", dirname);
+		ret = -EIO;
+		goto check_error;
+	}
+
+	ret = -EPERM;
+	if (!subdir->d_inode->i_op ||
+	    !subdir->d_inode->i_op->setxattr ||
+	    !subdir->d_inode->i_op->getxattr ||
+	    !subdir->d_inode->i_op->lookup ||
+	    !subdir->d_inode->i_op->mkdir ||
+	    !subdir->d_inode->i_op->create ||
+	    !subdir->d_inode->i_op->rename ||
+	    !subdir->d_inode->i_op->rmdir ||
+	    !subdir->d_inode->i_op->unlink)
+		goto check_error;
+
+	_leave(" = [%lu]", subdir->d_inode->i_ino);
+	return subdir;
+
+check_error:
+	dput(subdir);
+	_leave(" = %d [check]", ret);
+	return ERR_PTR(ret);
+
+mkdir_error:
+	mutex_unlock(&dir->d_inode->i_mutex);
+	dput(subdir);
+	kerror("mkdir %s failed with error %d", dirname, ret);
+	return ERR_PTR(ret);
+
+lookup_error:
+	mutex_unlock(&dir->d_inode->i_mutex);
+	ret = PTR_ERR(subdir);
+	kerror("Lookup %s failed with error %d", dirname, ret);
+	return ERR_PTR(ret);
+
+nomem_d_alloc:
+	mutex_unlock(&dir->d_inode->i_mutex);
+	_leave(" = -ENOMEM");
+	return ERR_PTR(-ENOMEM);
+}
+
+/*
+ * find out if an object is in use or not
+ * - if finds object and it's not in use:
+ *   - returns a pointer to the object and a reference on it
+ *   - returns with the directory locked
+ */
+static struct dentry *cachefiles_check_active(struct cachefiles_cache *cache,
+					      struct dentry *dir,
+					      char *filename)
+{
+	struct cachefiles_object *object;
+	struct rb_node *_n;
+	struct dentry *victim;
+	unsigned long start;
+	int ret;
+
+	_enter(",%*.*s/,%s",
+	       dir->d_name.len, dir->d_name.len, dir->d_name.name, filename);
+
+	/* look up the victim */
+	mutex_lock_nested(&dir->d_inode->i_mutex, 1);
+
+	start = jiffies;
+	victim = lookup_one_len(filename, dir, strlen(filename));
+	cachefiles_hist(cachefiles_lookup_histogram, start);
+	if (IS_ERR(victim))
+		goto lookup_error;
+
+	_debug("victim -> %p %s",
+	       victim, victim->d_inode ? "positive" : "negative");
+
+	/* if the object is no longer there then we probably retired the object
+	 * at the netfs's request whilst the cull was in progress
+	 */
+	if (!victim->d_inode) {
+		mutex_unlock(&dir->d_inode->i_mutex);
+		dput(victim);
+		_leave(" = -ENOENT [absent]");
+		return ERR_PTR(-ENOENT);
+	}
+
+	/* check to see if we're using this object */
+	read_lock(&cache->active_lock);
+
+	_n = cache->active_nodes.rb_node;
+
+	while (_n) {
+		object = rb_entry(_n, struct cachefiles_object, active_node);
+
+		if (object->dentry > victim)
+			_n = _n->rb_left;
+		else if (object->dentry < victim)
+			_n = _n->rb_right;
+		else
+			goto object_in_use;
+	}
+
+	read_unlock(&cache->active_lock);
+
+	_leave(" = %p", victim);
+	return victim;
+
+object_in_use:
+	read_unlock(&cache->active_lock);
+	mutex_unlock(&dir->d_inode->i_mutex);
+	dput(victim);
+	_leave(" = -EBUSY [in use]");
+	return ERR_PTR(-EBUSY);
+
+lookup_error:
+	mutex_unlock(&dir->d_inode->i_mutex);
+	ret = PTR_ERR(victim);
+	if (ret == -ENOENT) {
+		/* file or dir now absent - probably retired by netfs */
+		_leave(" = -ESTALE [absent]");
+		return ERR_PTR(-ESTALE);
+	}
+
+	if (ret == -EIO) {
+		cachefiles_io_error(cache, "Lookup failed");
+	} else if (ret != -ENOMEM) {
+		kerror("Internal error: %d", ret);
+		ret = -EIO;
+	}
+
+	_leave(" = %d", ret);
+	return ERR_PTR(ret);
+}
+
+/*
+ * cull an object if it's not in use
+ * - called only by cache manager daemon
+ */
+int cachefiles_cull(struct cachefiles_cache *cache, struct dentry *dir,
+		    char *filename)
+{
+	struct dentry *victim;
+	int ret;
+
+	_enter(",%*.*s/,%s",
+	       dir->d_name.len, dir->d_name.len, dir->d_name.name, filename);
+
+	victim = cachefiles_check_active(cache, dir, filename);
+	if (IS_ERR(victim))
+		return PTR_ERR(victim);
+
+	_debug("victim -> %p %s",
+	       victim, victim->d_inode ? "positive" : "negative");
+
+	/* okay... the victim is not being used so we can cull it
+	 * - start by marking it as stale
+	 */
+	_debug("victim is cullable");
+
+	ret = cachefiles_remove_object_xattr(cache, victim);
+	if (ret < 0)
+		goto error_unlock;
+
+	/*  actually remove the victim (drops the dir mutex) */
+	_debug("bury");
+
+	ret = cachefiles_bury_object(cache, dir, victim);
+	if (ret < 0)
+		goto error;
+
+	dput(victim);
+	_leave(" = 0");
+	return 0;
+
+error_unlock:
+	mutex_unlock(&dir->d_inode->i_mutex);
+error:
+	dput(victim);
+	if (ret == -ENOENT) {
+		/* file or dir now absent - probably retired by netfs */
+		_leave(" = -ESTALE [absent]");
+		return -ESTALE;
+	}
+
+	if (ret != -ENOMEM) {
+		kerror("Internal error: %d", ret);
+		ret = -EIO;
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * find out if an object is in use or not
+ * - called only by cache manager daemon
+ * - returns -EBUSY or 0 to indicate whether an object is in use or not
+ */
+int cachefiles_check_in_use(struct cachefiles_cache *cache, struct dentry *dir,
+			    char *filename)
+{
+	struct dentry *victim;
+
+	_enter(",%*.*s/,%s",
+	       dir->d_name.len, dir->d_name.len, dir->d_name.name, filename);
+
+	victim = cachefiles_check_active(cache, dir, filename);
+	if (IS_ERR(victim))
+		return PTR_ERR(victim);
+
+	mutex_unlock(&dir->d_inode->i_mutex);
+	dput(victim);
+	_leave(" = 0");
+	return 0;
+}
diff --git a/fs/cachefiles/cf-proc.c b/fs/cachefiles/cf-proc.c
new file mode 100644
index 0000000..c0d5444
--- /dev/null
+++ b/fs/cachefiles/cf-proc.c
@@ -0,0 +1,166 @@
+/* CacheFiles statistics
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include "cf-internal.h"
+
+struct cachefiles_proc {
+	unsigned			nlines;
+	const struct seq_operations	*ops;
+};
+
+atomic_t cachefiles_lookup_histogram[HZ];
+atomic_t cachefiles_mkdir_histogram[HZ];
+atomic_t cachefiles_create_histogram[HZ];
+
+static struct proc_dir_entry *proc_cachefiles;
+
+static int cachefiles_proc_open(struct inode *inode, struct file *file);
+static void *cachefiles_proc_start(struct seq_file *m, loff_t *pos);
+static void cachefiles_proc_stop(struct seq_file *m, void *v);
+static void *cachefiles_proc_next(struct seq_file *m, void *v, loff_t *pos);
+static int cachefiles_histogram_show(struct seq_file *m, void *v);
+
+static const struct file_operations cachefiles_proc_fops = {
+	.open		= cachefiles_proc_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static const struct seq_operations cachefiles_histogram_ops = {
+	.start		= cachefiles_proc_start,
+	.stop		= cachefiles_proc_stop,
+	.next		= cachefiles_proc_next,
+	.show		= cachefiles_histogram_show,
+};
+
+static const struct cachefiles_proc cachefiles_histogram = {
+	.nlines		= HZ + 1,
+	.ops		= &cachefiles_histogram_ops,
+};
+
+/*
+ * initialise the /proc/fs/fscache/cachefiles/ directory
+ */
+int __init cachefiles_proc_init(void)
+{
+	struct proc_dir_entry *p;
+
+	_enter("");
+
+	proc_cachefiles = proc_mkdir("cachefiles", proc_fscache);
+	if (!proc_cachefiles)
+		goto error_dir;
+	proc_cachefiles->owner = THIS_MODULE;
+
+	p = create_proc_entry("histogram", 0, proc_cachefiles);
+	if (!p)
+		goto error_histogram;
+	p->proc_fops = &cachefiles_proc_fops;
+	p->owner = THIS_MODULE;
+	p->data = (void *) &cachefiles_histogram;
+
+	_leave(" = 0");
+	return 0;
+
+error_histogram:
+	remove_proc_entry("fs/cachefiles", NULL);
+error_dir:
+	_leave(" = -ENOMEM");
+	return -ENOMEM;
+}
+
+/*
+ * clean up the /proc/fs/fscache/cachefiles/ directory
+ */
+void cachefiles_proc_cleanup(void)
+{
+	remove_proc_entry("histogram", proc_cachefiles);
+	remove_proc_entry("cachefiles", proc_fscache);
+}
+
+/*
+ * open "/proc/fs/fscache/cachefiles/XXX" which provide statistics summaries
+ */
+static int cachefiles_proc_open(struct inode *inode, struct file *file)
+{
+	const struct cachefiles_proc *proc = PDE(inode)->data;
+	struct seq_file *m;
+	int ret;
+
+	ret = seq_open(file, proc->ops);
+	if (ret == 0) {
+		m = file->private_data;
+		m->private = (void *) proc;
+	}
+	return ret;
+}
+
+/*
+ * set up the iterator to start reading from the first line
+ */
+static void *cachefiles_proc_start(struct seq_file *m, loff_t *_pos)
+{
+	if (*_pos == 0)
+		*_pos = 1;
+	return (void *)(unsigned long) *_pos;
+}
+
+/*
+ * move to the next line
+ */
+static void *cachefiles_proc_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	const struct cachefiles_proc *proc = m->private;
+
+	(*pos)++;
+	return *pos > proc->nlines ? NULL : (void *)(unsigned long) *pos;
+}
+
+/*
+ * clean up after reading
+ */
+static void cachefiles_proc_stop(struct seq_file *m, void *v)
+{
+}
+
+/*
+ * display the time-taken histogram
+ */
+static int cachefiles_histogram_show(struct seq_file *m, void *v)
+{
+	unsigned long index;
+	unsigned x, y, z, t;
+
+	switch ((unsigned long) v) {
+	case 1:
+		seq_puts(m, "JIFS  SECS  LOOKUPS   MKDIRS    CREATES\n");
+		return 0;
+	case 2:
+		seq_puts(m, "===== ===== ========= ========= =========\n");
+		return 0;
+	default:
+		index = (unsigned long) v - 3;
+		x = atomic_read(&cachefiles_lookup_histogram[index]);
+		y = atomic_read(&cachefiles_mkdir_histogram[index]);
+		z = atomic_read(&cachefiles_create_histogram[index]);
+		if (x == 0 && y == 0 && z == 0)
+			return 0;
+
+		t = (index * 1000) / HZ;
+
+		seq_printf(m, "%4lu  0.%03u %9u %9u %9u\n", index, t, x, y, z);
+		return 0;
+	}
+}
diff --git a/fs/cachefiles/cf-rdwr.c b/fs/cachefiles/cf-rdwr.c
new file mode 100644
index 0000000..e5e45ac
--- /dev/null
+++ b/fs/cachefiles/cf-rdwr.c
@@ -0,0 +1,851 @@
+/* Storage object read/write
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include "cf-internal.h"
+
+/*
+ * detect wake up events generated by the unlocking of pages in which we're
+ * interested
+ * - we use this to detect read completion of backing pages
+ * - the caller holds the waitqueue lock
+ */
+static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,
+				  int sync, void *_key)
+{
+	struct cachefiles_one_read *monitor =
+		container_of(wait, struct cachefiles_one_read, monitor);
+	struct cachefiles_object *object;
+	struct wait_bit_key *key = _key;
+	struct page *page = wait->private;
+
+	ASSERT(key);
+
+	_enter("{%lu},%u,%d,{%p,%u}",
+	       monitor->netfs_page->index, mode, sync,
+	       key->flags, key->bit_nr);
+
+	if (key->flags != &page->flags ||
+	    key->bit_nr != PG_locked)
+		return 0;
+
+	_debug("--- monitor %p %lx ---", page, page->flags);
+
+	if (!PageUptodate(page) && !PageError(page))
+		dump_stack();
+
+	/* remove from the waitqueue */
+	list_del(&wait->task_list);
+
+	/* move onto the action list and queue for FS-Cache thread pool */
+	ASSERT(monitor->op);
+
+	object = container_of(monitor->op->op.object,
+			      struct cachefiles_object, fscache);
+
+	spin_lock(&object->work_lock);
+	list_add(&monitor->op_link, &monitor->op->to_do);
+	spin_unlock(&object->work_lock);
+
+	fscache_enqueue_retrieval(monitor->op);
+	return 0;
+}
+
+/*
+ * copy data from backing pages to netfs pages to complete a read operation
+ * - driven by FS-Cache's thread pool
+ */
+static void cachefiles_read_copier(struct fscache_operation *_op)
+{
+	struct cachefiles_one_read *monitor;
+	struct cachefiles_object *object;
+	struct fscache_retrieval *op;
+	struct pagevec pagevec;
+	int error, max;
+
+	op = container_of(_op, struct fscache_retrieval, op);
+	object = container_of(op->op.object,
+			      struct cachefiles_object, fscache);
+
+	_enter("{ino=%lu}", object->backer->d_inode->i_ino);
+
+	pagevec_init(&pagevec, 0);
+
+	max = 8;
+	spin_lock_irq(&object->work_lock);
+
+	while (!list_empty(&op->to_do)) {
+		monitor = list_entry(op->to_do.next,
+				     struct cachefiles_one_read, op_link);
+		list_del(&monitor->op_link);
+
+		spin_unlock_irq(&object->work_lock);
+
+		_debug("- copy {%lu}", monitor->back_page->index);
+
+		error = -EIO;
+		if (PageUptodate(monitor->back_page)) {
+			copy_highpage(monitor->netfs_page, monitor->back_page);
+
+			pagevec_add(&pagevec, monitor->netfs_page);
+			fscache_mark_pages_cached(monitor->op, &pagevec);
+			error = 0;
+		}
+
+		if (error)
+			cachefiles_io_error_obj(
+				object,
+				"Readpage failed on backing file %lx",
+				(unsigned long) monitor->back_page->flags);
+
+		page_cache_release(monitor->back_page);
+
+		fscache_end_io(op, monitor->netfs_page, error);
+		page_cache_release(monitor->netfs_page);
+		fscache_put_retrieval(op);
+		kfree(monitor);
+
+		/* let the thread pool have some air occasionally */
+		max--;
+		if (max < 0 || need_resched()) {
+			if (!list_empty(&op->to_do))
+				fscache_enqueue_retrieval(op);
+			_leave(" [maxed out]");
+			return;
+		}
+
+		spin_lock_irq(&object->work_lock);
+	}
+
+	spin_unlock_irq(&object->work_lock);
+	_leave("");
+}
+
+/*
+ * read the corresponding page to the given set from the backing file
+ * - an uncertain page is simply discarded, to be tried again another time
+ */
+static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
+					    struct fscache_retrieval *op,
+					    struct page *netpage,
+					    struct pagevec *pagevec)
+{
+	struct cachefiles_one_read *monitor;
+	struct address_space *bmapping;
+	struct page *newpage, *backpage;
+	int ret;
+
+	_enter("");
+
+	pagevec_reinit(pagevec);
+
+	_debug("read back %p{%lu,%d}",
+	       netpage, netpage->index, page_count(netpage));
+
+	monitor = kzalloc(sizeof(*monitor), GFP_KERNEL);
+	if (!monitor)
+		goto nomem;
+
+	monitor->netfs_page = netpage;
+	monitor->op = fscache_get_retrieval(op);
+
+	init_waitqueue_func_entry(&monitor->monitor, cachefiles_read_waiter);
+
+	/* attempt to get hold of the backing page */
+	bmapping = object->backer->d_inode->i_mapping;
+	newpage = NULL;
+
+	for (;;) {
+		backpage = find_get_page(bmapping, netpage->index);
+		if (backpage)
+			goto backing_page_already_present;
+
+		if (!newpage) {
+			newpage = page_cache_alloc_cold(bmapping);
+			if (!newpage)
+				goto nomem_monitor;
+		}
+
+		ret = add_to_page_cache(newpage, bmapping,
+					netpage->index, GFP_KERNEL);
+		if (ret == 0)
+			goto installed_new_backing_page;
+		if (ret != -EEXIST)
+			goto nomem_page;
+	}
+
+	/* we've installed a new backing page, so now we need to add it
+	 * to the LRU list and start it reading */
+installed_new_backing_page:
+	_debug("- new %p", newpage);
+
+	backpage = newpage;
+	newpage = NULL;
+
+	page_cache_get(backpage);
+	pagevec_add(pagevec, backpage);
+	__pagevec_lru_add(pagevec);
+
+read_backing_page:
+	ret = bmapping->a_ops->readpage(NULL, backpage);
+	if (ret < 0)
+		goto read_error;
+
+	/* set the monitor to transfer the data across */
+monitor_backing_page:
+	_debug("- monitor add");
+
+	/* install the monitor */
+	page_cache_get(monitor->netfs_page);
+	page_cache_get(backpage);
+	monitor->back_page = backpage;
+	monitor->monitor.private = backpage;
+	add_page_wait_queue(backpage, &monitor->monitor);
+	monitor = NULL;
+
+	/* but the page may have been read before the monitor was installed, so
+	 * the monitor may miss the event - so we have to ensure that we do get
+	 * one in such a case */
+	if (!TestSetPageLocked(backpage)) {
+		_debug("jumpstart %p {%lx}", backpage, backpage->flags);
+		unlock_page(backpage);
+	}
+	goto success;
+
+	/* if the backing page is already present, it can be in one of
+	 * three states: read in progress, read failed or read okay */
+backing_page_already_present:
+	_debug("- present");
+
+	if (newpage) {
+		page_cache_release(newpage);
+		newpage = NULL;
+	}
+
+	if (PageError(backpage))
+		goto io_error;
+
+	if (PageUptodate(backpage))
+		goto backing_page_already_uptodate;
+
+	if (TestSetPageLocked(backpage))
+		goto monitor_backing_page;
+	_debug("read %p {%lx}", backpage, backpage->flags);
+	goto read_backing_page;
+
+	/* the backing page is already up to date, attach the netfs
+	 * page to the pagecache and LRU and copy the data across */
+backing_page_already_uptodate:
+	_debug("- uptodate");
+
+	pagevec_add(pagevec, netpage);
+	fscache_mark_pages_cached(op, pagevec);
+
+	copy_highpage(netpage, backpage);
+	fscache_end_io(op, netpage, 0);
+
+success:
+	_debug("success");
+	ret = 0;
+
+out:
+	if (backpage)
+		page_cache_release(backpage);
+	if (monitor) {
+		fscache_put_retrieval(monitor->op);
+		kfree(monitor);
+	}
+	_leave(" = %d", ret);
+	return ret;
+
+read_error:
+	_debug("read error %d", ret);
+	if (ret == -ENOMEM)
+		goto out;
+io_error:
+	cachefiles_io_error_obj(object, "Page read error on backing file");
+	ret = -ENOBUFS;
+	goto out;
+
+nomem_page:
+	page_cache_release(newpage);
+nomem_monitor:
+	fscache_put_retrieval(monitor->op);
+	kfree(monitor);
+nomem:
+	_leave(" = -ENOMEM");
+	return -ENOMEM;
+}
+
+/*
+ * read a page from the cache or allocate a block in which to store it
+ * - cache withdrawal is prevented by the caller
+ * - returns -EINTR if interrupted
+ * - returns -ENOMEM if ran out of memory
+ * - returns -ENOBUFS if no buffers can be made available
+ * - returns -ENOBUFS if page is beyond EOF
+ * - if the page is backed by a block in the cache:
+ *   - a read will be started which will call the callback on completion
+ *   - 0 will be returned
+ * - else if the page is unbacked:
+ *   - the metadata will be retained
+ *   - -ENODATA will be returned
+ */
+int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,
+				  struct page *page,
+				  gfp_t gfp)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+	struct pagevec pagevec;
+	struct inode *inode;
+	sector_t block0, block;
+	unsigned shift;
+	int ret;
+
+	object = container_of(op->op.object,
+			      struct cachefiles_object, fscache);
+	cache = container_of(object->fscache.cache,
+			     struct cachefiles_cache, cache);
+
+	_enter("{%p},{%lx},,,", object, page->index);
+
+	if (!object->backer)
+		return -ENOBUFS;
+
+	inode = object->backer->d_inode;
+	ASSERT(S_ISREG(inode->i_mode));
+	ASSERT(inode->i_mapping->a_ops->bmap);
+	ASSERT(inode->i_mapping->a_ops->readpages);
+
+	/* calculate the shift required to use bmap */
+	if (inode->i_sb->s_blocksize > PAGE_SIZE)
+		return -ENOBUFS;
+
+	shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+	op->op.processor = cachefiles_read_copier;
+
+	pagevec_init(&pagevec, 0);
+
+	/* we assume the absence or presence of the first block is a good
+	 * enough indication for the page as a whole
+	 * - TODO: don't use bmap() for this as it is _not_ actually good
+	 *   enough for this as it doesn't indicate errors, but it's all we've
+	 *   got for the moment
+	 */
+	block0 = page->index;
+	block0 <<= shift;
+
+	block = inode->i_mapping->a_ops->bmap(inode->i_mapping, block0);
+	_debug("%llx -> %llx",
+	       (unsigned long long) block0,
+	       (unsigned long long) block);
+
+	if (block) {
+		/* submit the apparently valid page to the backing fs to be
+		 * read from disk */
+		ret = cachefiles_read_backing_file_one(object, op, page,
+						       &pagevec);
+	} else if (cachefiles_has_space(cache, 0, 1) == 0) {
+		/* there's space in the cache we can use */
+		pagevec_add(&pagevec, page);
+		fscache_mark_pages_cached(op, &pagevec);
+		ret = -ENODATA;
+	} else {
+		ret = -ENOBUFS;
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * read the corresponding pages to the given set from the backing file
+ * - any uncertain pages are simply discarded, to be tried again another time
+ */
+static int cachefiles_read_backing_file(struct cachefiles_object *object,
+					struct fscache_retrieval *op,
+					struct list_head *list,
+					struct pagevec *mark_pvec)
+{
+	struct cachefiles_one_read *monitor = NULL;
+	struct address_space *bmapping = object->backer->d_inode->i_mapping;
+	struct pagevec lru_pvec;
+	struct page *newpage = NULL, *netpage, *_n, *backpage = NULL;
+	int ret = 0;
+
+	_enter("");
+
+	pagevec_init(&lru_pvec, 0);
+
+	list_for_each_entry_safe(netpage, _n, list, lru) {
+		list_del(&netpage->lru);
+
+		_debug("read back %p{%lu,%d}",
+		       netpage, netpage->index, page_count(netpage));
+
+		if (!monitor) {
+			monitor = kzalloc(sizeof(*monitor), GFP_KERNEL);
+			if (!monitor)
+				goto nomem;
+
+			monitor->op = fscache_get_retrieval(op);
+			init_waitqueue_func_entry(&monitor->monitor,
+						  cachefiles_read_waiter);
+		}
+
+		for (;;) {
+			backpage = find_get_page(bmapping, netpage->index);
+			if (backpage)
+				goto backing_page_already_present;
+
+			if (!newpage) {
+				newpage = page_cache_alloc_cold(bmapping);
+				if (!newpage)
+					goto nomem;
+			}
+
+			ret = add_to_page_cache(newpage, bmapping,
+						netpage->index, GFP_KERNEL);
+			if (ret == 0)
+				goto installed_new_backing_page;
+			if (ret != -EEXIST)
+				goto nomem;
+		}
+
+		/* we've installed a new backing page, so now we need to add it
+		 * to the LRU list and start it reading */
+	installed_new_backing_page:
+		_debug("- new %p", newpage);
+
+		backpage = newpage;
+		newpage = NULL;
+
+		page_cache_get(backpage);
+		if (!pagevec_add(&lru_pvec, backpage))
+			__pagevec_lru_add(&lru_pvec);
+
+	reread_backing_page:
+		ret = bmapping->a_ops->readpage(NULL, backpage);
+		if (ret < 0)
+			goto read_error;
+
+		/* add the netfs page to the pagecache and LRU, and set the
+		 * monitor to transfer the data across */
+	monitor_backing_page:
+		_debug("- monitor add");
+
+		ret = add_to_page_cache(netpage, op->mapping, netpage->index,
+					GFP_KERNEL);
+		if (ret < 0) {
+			if (ret == -EEXIST) {
+				page_cache_release(netpage);
+				continue;
+			}
+			goto nomem;
+		}
+
+		page_cache_get(netpage);
+		if (!pagevec_add(&lru_pvec, netpage))
+			__pagevec_lru_add(&lru_pvec);
+
+		/* install a monitor */
+		page_cache_get(netpage);
+		monitor->netfs_page = netpage;
+
+		page_cache_get(backpage);
+		monitor->back_page = backpage;
+		monitor->monitor.private = backpage;
+		add_page_wait_queue(backpage, &monitor->monitor);
+		monitor = NULL;
+
+		/* but the page may have been read before the monitor was
+		 * installed, so the monitor may miss the event - so we have to
+		 * ensure that we do get one in such a case */
+		if (!TestSetPageLocked(backpage)) {
+			_debug("2unlock %p {%lx}", backpage, backpage->flags);
+			unlock_page(backpage);
+		}
+
+		page_cache_release(backpage);
+		backpage = NULL;
+
+		page_cache_release(netpage);
+		netpage = NULL;
+		continue;
+
+		/* if the backing page is already present, it can be in one of
+		 * three states: read in progress, read failed or read okay */
+	backing_page_already_present:
+		_debug("- present %p", backpage);
+
+		if (PageError(backpage))
+			goto io_error;
+
+		if (PageUptodate(backpage))
+			goto backing_page_already_uptodate;
+
+		_debug("- not ready %p{%lx}", backpage, backpage->flags);
+
+		if (TestSetPageLocked(backpage))
+			goto monitor_backing_page;
+
+		if (PageError(backpage)) {
+			_debug("error %lx", backpage->flags);
+			unlock_page(backpage);
+			goto io_error;
+		}
+
+		if (PageUptodate(backpage))
+			goto backing_page_already_uptodate_unlock;
+
+		/* we've locked a page that's neither up to date nor erroneous,
+		 * so we need to attempt to read it again */
+		goto reread_backing_page;
+
+		/* the backing page is already up to date, attach the netfs
+		 * page to the pagecache and LRU and copy the data across */
+	backing_page_already_uptodate_unlock:
+		_debug("uptodate %lx", backpage->flags);
+		unlock_page(backpage);
+	backing_page_already_uptodate:
+		_debug("- uptodate");
+
+		ret = add_to_page_cache(netpage, op->mapping, netpage->index,
+					GFP_KERNEL);
+		if (ret < 0) {
+			if (ret == -EEXIST) {
+				page_cache_release(netpage);
+				continue;
+			}
+			goto nomem;
+		}
+
+		copy_highpage(netpage, backpage);
+
+		page_cache_release(backpage);
+		backpage = NULL;
+
+		if (!pagevec_add(mark_pvec, netpage))
+			fscache_mark_pages_cached(op, mark_pvec);
+
+		page_cache_get(netpage);
+		if (!pagevec_add(&lru_pvec, netpage))
+			__pagevec_lru_add(&lru_pvec);
+
+		fscache_end_io(op, netpage, 0);
+		page_cache_release(netpage);
+		netpage = NULL;
+		continue;
+	}
+
+	netpage = NULL;
+
+	_debug("out");
+
+out:
+	/* tidy up */
+	pagevec_lru_add(&lru_pvec);
+
+	if (newpage)
+		page_cache_release(newpage);
+	if (netpage)
+		page_cache_release(netpage);
+	if (backpage)
+		page_cache_release(backpage);
+	if (monitor) {
+		fscache_put_retrieval(op);
+		kfree(monitor);
+	}
+
+	list_for_each_entry_safe(netpage, _n, list, lru) {
+		list_del(&netpage->lru);
+		page_cache_release(netpage);
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+
+nomem:
+	_debug("nomem");
+	ret = -ENOMEM;
+	goto out;
+
+read_error:
+	_debug("read error %d", ret);
+	if (ret == -ENOMEM)
+		goto out;
+io_error:
+	cachefiles_io_error_obj(object, "Page read error on backing file");
+	ret = -ENOBUFS;
+	goto out;
+}
+
+/*
+ * read a list of pages from the cache or allocate blocks in which to store
+ * them
+ */
+int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op,
+				   struct list_head *pages,
+				   unsigned *nr_pages,
+				   gfp_t gfp)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+	struct list_head backpages;
+	struct pagevec pagevec;
+	struct inode *inode;
+	struct page *page, *_n;
+	unsigned shift, nrbackpages;
+	int ret, ret2, space;
+
+	object = container_of(op->op.object,
+			      struct cachefiles_object, fscache);
+	cache = container_of(object->fscache.cache,
+			     struct cachefiles_cache, cache);
+
+	_enter("{OBJ%x,%d},,%d,,",
+	       object->fscache.debug_id, atomic_read(&op->op.usage),
+	       *nr_pages);
+
+	if (!object->backer)
+		return -ENOBUFS;
+
+	space = 1;
+	if (cachefiles_has_space(cache, 0, *nr_pages) < 0)
+		space = 0;
+
+	inode = object->backer->d_inode;
+	ASSERT(S_ISREG(inode->i_mode));
+	ASSERT(inode->i_mapping->a_ops->bmap);
+	ASSERT(inode->i_mapping->a_ops->readpages);
+
+	/* calculate the shift required to use bmap */
+	if (inode->i_sb->s_blocksize > PAGE_SIZE)
+		return -ENOBUFS;
+
+	shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+	pagevec_init(&pagevec, 0);
+
+	op->op.processor = cachefiles_read_copier;
+
+	INIT_LIST_HEAD(&backpages);
+	nrbackpages = 0;
+
+	ret = space ? -ENODATA : -ENOBUFS;
+	list_for_each_entry_safe(page, _n, pages, lru) {
+		sector_t block0, block;
+
+		/* we assume the absence or presence of the first block is a
+		 * good enough indication for the page as a whole
+		 * - TODO: don't use bmap() for this as it is _not_ actually
+		 *   good enough for this as it doesn't indicate errors, but
+		 *   it's all we've got for the moment
+		 */
+		block0 = page->index;
+		block0 <<= shift;
+
+		block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
+						      block0);
+		_debug("%llx -> %llx",
+		       (unsigned long long) block0,
+		       (unsigned long long) block);
+
+		if (block) {
+			/* we have data - add it to the list to give to the
+			 * backing fs */
+			list_move(&page->lru, &backpages);
+			(*nr_pages)--;
+			nrbackpages++;
+		} else if (space && pagevec_add(&pagevec, page) == 0) {
+			fscache_mark_pages_cached(op, &pagevec);
+			ret = -ENODATA;
+		}
+	}
+
+	if (pagevec_count(&pagevec) > 0)
+		fscache_mark_pages_cached(op, &pagevec);
+
+	if (list_empty(pages))
+		ret = 0;
+
+	/* submit the apparently valid pages to the backing fs to be read from
+	 * disk */
+	if (nrbackpages > 0) {
+		ret2 = cachefiles_read_backing_file(object, op, &backpages,
+						    &pagevec);
+		if (ret2 == -ENOMEM || ret2 == -EINTR)
+			ret = ret2;
+	}
+
+	if (pagevec_count(&pagevec) > 0)
+		fscache_mark_pages_cached(op, &pagevec);
+
+	_leave(" = %d [nr=%u%s]",
+	       ret, *nr_pages, list_empty(pages) ? " empty" : "");
+	return ret;
+}
+
+/*
+ * allocate a block in the cache in which to store a page
+ * - cache withdrawal is prevented by the caller
+ * - returns -EINTR if interrupted
+ * - returns -ENOMEM if ran out of memory
+ * - returns -ENOBUFS if no buffers can be made available
+ * - returns -ENOBUFS if page is beyond EOF
+ * - otherwise:
+ *   - the metadata will be retained
+ *   - 0 will be returned
+ */
+int cachefiles_allocate_page(struct fscache_retrieval *op,
+			     struct page *page,
+			     gfp_t gfp)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+	struct pagevec pagevec;
+	int ret;
+
+	object = container_of(op->op.object,
+			      struct cachefiles_object, fscache);
+	cache = container_of(object->fscache.cache,
+			     struct cachefiles_cache, cache);
+
+	_enter("%p,{%lx},", object, page->index);
+
+	ret = cachefiles_has_space(cache, 0, 1);
+	if (ret == 0) {
+		pagevec_init(&pagevec, 0);
+		pagevec_add(&pagevec, page);
+		fscache_mark_pages_cached(op, &pagevec);
+	} else {
+		ret = -ENOBUFS;
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * allocate blocks in the cache in which to store a set of pages
+ * - cache withdrawal is prevented by the caller
+ * - returns -EINTR if interrupted
+ * - returns -ENOMEM if ran out of memory
+ * - returns -ENOBUFS if some buffers couldn't be made available
+ * - returns -ENOBUFS if some pages are beyond EOF
+ * - otherwise:
+ *   - -ENODATA will be returned
+ * - metadata will be retained for any page marked
+ */
+int cachefiles_allocate_pages(struct fscache_retrieval *op,
+			      struct list_head *pages,
+			      unsigned *nr_pages,
+			      gfp_t gfp)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+	struct pagevec pagevec;
+	struct page *page;
+	int ret;
+
+	object = container_of(op->op.object,
+			      struct cachefiles_object, fscache);
+	cache = container_of(object->fscache.cache,
+			     struct cachefiles_cache, cache);
+
+	_enter("%p,,,%d,", object, *nr_pages);
+
+	ret = cachefiles_has_space(cache, 0, *nr_pages);
+	if (ret == 0) {
+		pagevec_init(&pagevec, 0);
+
+		list_for_each_entry(page, pages, lru) {
+			if (pagevec_add(&pagevec, page) == 0)
+				fscache_mark_pages_cached(op, &pagevec);
+		}
+
+		if (pagevec_count(&pagevec) > 0)
+			fscache_mark_pages_cached(op, &pagevec);
+		ret = -ENODATA;
+	} else {
+		ret = -ENOBUFS;
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * request a page be stored in the cache
+ * - cache withdrawal is prevented by the caller
+ * - this request may be ignored if there's no cache block available, in which
+ *   case -ENOBUFS will be returned
+ * - if the op is in progress, 0 will be returned
+ */
+int cachefiles_write_page(struct fscache_storage *op, struct page *page)
+{
+	struct cachefiles_object *object;
+	struct address_space *mapping;
+	int ret;
+
+	ASSERT(op != NULL);
+	ASSERT(page != NULL);
+
+	object = container_of(op->op.object,
+			      struct cachefiles_object, fscache);
+
+	_enter("%p,%p{%lx},,,", object, page, page->index);
+
+	if (!object->backer) {
+		_leave(" = -ENOBUFS");
+		return -ENOBUFS;
+	}
+
+	ASSERT(S_ISREG(object->backer->d_inode->i_mode));
+
+	/* copy the page to ext3 and let it store it in its own time */
+	mapping = object->backer->d_inode->i_mapping;
+	ret = -EIO;
+	if (mapping->a_ops->write_one_page) {
+		ret = mapping->a_ops->write_one_page(mapping, page->index,
+						     page);
+		_debug("write_one_page -> %d", ret);
+	}
+
+	if (ret < 0) {
+		if (ret == -EIO)
+			cachefiles_io_error_obj(
+				object, "Write page to backing file failed");
+		ret = -ENOBUFS;
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * detach a backing block from a page
+ * - cache withdrawal is prevented by the caller
+ */
+void cachefiles_uncache_page(struct fscache_object *_object, struct page *page)
+{
+	struct cachefiles_object *object;
+	struct cachefiles_cache *cache;
+
+	object = container_of(_object, struct cachefiles_object, fscache);
+	cache = container_of(object->fscache.cache,
+			     struct cachefiles_cache, cache);
+
+	_enter("%p,{%lu}", object, page->index);
+
+	spin_unlock(&object->fscache.cookie->lock);
+}
diff --git a/fs/cachefiles/cf-security.c b/fs/cachefiles/cf-security.c
new file mode 100644
index 0000000..d54ce10
--- /dev/null
+++ b/fs/cachefiles/cf-security.c
@@ -0,0 +1,80 @@
+/* CacheFiles security management
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/fs.h>
+#include <linux/cred.h>
+#include "cf-internal.h"
+
+/*
+ * determine the security context within which we access the cache from within
+ * the kernel
+ */
+int cachefiles_get_security_ID(struct cachefiles_cache *cache)
+{
+	struct task_security *sec;
+	int ret;
+
+	_enter("");
+
+	sec = get_kernel_security("cachefiles", current);
+	if (IS_ERR(sec)) {
+		ret = PTR_ERR(sec);
+		goto error;
+	}
+
+	cache->cache_sec = sec;
+	ret = 0;
+error:
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * check the security details of the on-disk cache
+ * - must be called with security imposed
+ */
+int cachefiles_determine_cache_security(struct cachefiles_cache *cache,
+					struct dentry *root)
+{
+	int ret;
+
+	_enter("");
+
+	/* use the cache root dir's security context as the basis with which
+	 * create files */
+	ret = change_create_files_as(cache->cache_sec, root->d_inode);
+	if (ret < 0) {
+		_leave(" = %d [cfa]", ret);
+		return ret;
+	}
+
+	ret = security_inode_mkdir(root->d_inode, root, 0);
+	if (ret < 0) {
+		printk(KERN_ERR "CacheFiles:"
+		       " Security denies permission to make dirs: error %d",
+		       ret);
+		goto error;
+	}
+
+	ret = security_inode_create(root->d_inode, root, 0);
+	if (ret < 0) {
+		printk(KERN_ERR "CacheFiles:"
+		       " Security denies permission to create files: error %d",
+		       ret);
+		goto error;
+	}
+
+error:
+	if (ret == -EOPNOTSUPP)
+		ret = 0;
+	_leave(" = %d", ret);
+	return ret;
+}
diff --git a/fs/cachefiles/cf-xattr.c b/fs/cachefiles/cf-xattr.c
new file mode 100644
index 0000000..7bfb8dd
--- /dev/null
+++ b/fs/cachefiles/cf-xattr.c
@@ -0,0 +1,292 @@
+/* CacheFiles extended attribute management
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/fsnotify.h>
+#include <linux/quotaops.h>
+#include <linux/xattr.h>
+#include "cf-internal.h"
+
+static const char cachefiles_xattr_cache[] =
+	XATTR_USER_PREFIX "CacheFiles.cache";
+
+/*
+ * check the type label on an object
+ * - done using xattrs
+ */
+int cachefiles_check_object_type(struct cachefiles_object *object)
+{
+	struct dentry *dentry = object->dentry;
+	char type[3], xtype[3];
+	int ret;
+
+	ASSERT(dentry);
+	ASSERT(dentry->d_inode);
+
+	if (!object->fscache.cookie)
+		strcpy(type, "C3");
+	else
+		snprintf(type, 3, "%02x", object->fscache.cookie->def->type);
+
+	_enter("%p{%s}", object, type);
+
+	/* attempt to install a type label directly */
+	ret = vfs_setxattr(dentry, cachefiles_xattr_cache, type, 2,
+			   XATTR_CREATE);
+	if (ret == 0) {
+		_debug("SET"); /* we succeeded */
+		goto error;
+	}
+
+	if (ret != -EEXIST) {
+		kerror("Can't set xattr on %*.*s [%lu] (err %d)",
+		       dentry->d_name.len, dentry->d_name.len,
+		       dentry->d_name.name, dentry->d_inode->i_ino,
+		       -ret);
+		goto error;
+	}
+
+	/* read the current type label */
+	ret = vfs_getxattr(dentry, cachefiles_xattr_cache, xtype, 3);
+	if (ret < 0) {
+		if (ret == -ERANGE)
+			goto bad_type_length;
+
+		kerror("Can't read xattr on %*.*s [%lu] (err %d)",
+		       dentry->d_name.len, dentry->d_name.len,
+		       dentry->d_name.name, dentry->d_inode->i_ino,
+		       -ret);
+		goto error;
+	}
+
+	/* check the type is what we're expecting */
+	if (ret != 2)
+		goto bad_type_length;
+
+	if (xtype[0] != type[0] || xtype[1] != type[1])
+		goto bad_type;
+
+	ret = 0;
+
+error:
+	_leave(" = %d", ret);
+	return ret;
+
+bad_type_length:
+	kerror("Cache object %lu type xattr length incorrect",
+	       dentry->d_inode->i_ino);
+	ret = -EIO;
+	goto error;
+
+bad_type:
+	xtype[2] = 0;
+	kerror("Cache object %*.*s [%lu] type %s not %s",
+	       dentry->d_name.len, dentry->d_name.len,
+	       dentry->d_name.name, dentry->d_inode->i_ino,
+	       xtype, type);
+	ret = -EIO;
+	goto error;
+}
+
+/*
+ * set the state xattr on a cache file
+ */
+int cachefiles_set_object_xattr(struct cachefiles_object *object,
+				struct cachefiles_xattr *auxdata)
+{
+	struct dentry *dentry = object->dentry;
+	int ret;
+
+	ASSERT(object->fscache.cookie);
+	ASSERT(dentry);
+
+	_enter("%p,#%d", object, auxdata->len);
+
+	/* attempt to install the cache metadata directly */
+	_debug("SET %s #%u", object->fscache.cookie->def->name, auxdata->len);
+
+	ret = vfs_setxattr(dentry, cachefiles_xattr_cache,
+			   &auxdata->type, auxdata->len,
+			   XATTR_CREATE);
+	if (ret < 0 && ret != -ENOMEM)
+		cachefiles_io_error_obj(
+			object,
+			"Failed to set xattr with error %d", ret);
+
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * update the state xattr on a cache file
+ */
+int cachefiles_update_object_xattr(struct cachefiles_object *object,
+				   struct cachefiles_xattr *auxdata)
+{
+	struct dentry *dentry = object->dentry;
+	int ret;
+
+	ASSERT(object->fscache.cookie);
+	ASSERT(dentry);
+
+	_enter("%p,#%d", object, auxdata->len);
+
+	/* attempt to install the cache metadata directly */
+	_debug("SET %s #%u", object->fscache.cookie->def->name, auxdata->len);
+
+	ret = vfs_setxattr(dentry, cachefiles_xattr_cache,
+			   &auxdata->type, auxdata->len,
+			   XATTR_REPLACE);
+	if (ret < 0 && ret != -ENOMEM)
+		cachefiles_io_error_obj(
+			object,
+			"Failed to update xattr with error %d", ret);
+
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * check the state xattr on a cache file
+ * - return -ESTALE if the object should be deleted
+ */
+int cachefiles_check_object_xattr(struct cachefiles_object *object,
+				  struct cachefiles_xattr *auxdata)
+{
+	struct cachefiles_xattr *auxbuf;
+	struct dentry *dentry = object->dentry;
+	int ret;
+
+	_enter("%p,#%d", object, auxdata->len);
+
+	ASSERT(dentry);
+	ASSERT(dentry->d_inode);
+
+	auxbuf = kmalloc(sizeof(struct cachefiles_xattr) + 512, GFP_KERNEL);
+	if (!auxbuf) {
+		_leave(" = -ENOMEM");
+		return -ENOMEM;
+	}
+
+	/* read the current type label */
+	ret = vfs_getxattr(dentry, cachefiles_xattr_cache,
+			   &auxbuf->type, 512 + 1);
+	if (ret < 0) {
+		if (ret == -ENODATA)
+			goto stale; /* no attribute - power went off
+				     * mid-cull? */
+
+		if (ret == -ERANGE)
+			goto bad_type_length;
+
+		cachefiles_io_error_obj(object,
+					"Can't read xattr on %lu (err %d)",
+					dentry->d_inode->i_ino, -ret);
+		goto error;
+	}
+
+	/* check the on-disk object */
+	if (ret < 1)
+		goto bad_type_length;
+
+	if (auxbuf->type != auxdata->type)
+		goto stale;
+
+	auxbuf->len = ret;
+
+	/* consult the netfs */
+	if (object->fscache.cookie->def->check_aux) {
+		enum fscache_checkaux result;
+		unsigned int dlen;
+
+		dlen = auxbuf->len - 1;
+
+		_debug("checkaux %s #%u",
+		       object->fscache.cookie->def->name, dlen);
+
+		result = object->fscache.cookie->def->check_aux(
+			object->fscache.cookie->netfs_data,
+			&auxbuf->data, dlen);
+
+		switch (result) {
+			/* entry okay as is */
+		case FSCACHE_CHECKAUX_OKAY:
+			goto okay;
+
+			/* entry requires update */
+		case FSCACHE_CHECKAUX_NEEDS_UPDATE:
+			break;
+
+			/* entry requires deletion */
+		case FSCACHE_CHECKAUX_OBSOLETE:
+			goto stale;
+
+		default:
+			BUG();
+		}
+
+		/* update the current label */
+		ret = vfs_setxattr(dentry, cachefiles_xattr_cache,
+				   &auxdata->type, auxdata->len,
+				   XATTR_REPLACE);
+		if (ret < 0) {
+			cachefiles_io_error_obj(object,
+						"Can't update xattr on %lu"
+						" (error %d)",
+						dentry->d_inode->i_ino, -ret);
+			goto error;
+		}
+	}
+
+okay:
+	ret = 0;
+
+error:
+	kfree(auxbuf);
+	_leave(" = %d", ret);
+	return ret;
+
+bad_type_length:
+	kerror("Cache object %lu xattr length incorrect",
+	       dentry->d_inode->i_ino);
+	ret = -EIO;
+	goto error;
+
+stale:
+	ret = -ESTALE;
+	goto error;
+}
+
+/*
+ * remove the object's xattr to mark it stale
+ */
+int cachefiles_remove_object_xattr(struct cachefiles_cache *cache,
+				   struct dentry *dentry)
+{
+	int ret;
+
+	ret = vfs_removexattr(dentry, cachefiles_xattr_cache);
+	if (ret < 0) {
+		if (ret == -ENOENT || ret == -ENODATA)
+			ret = 0;
+		else if (ret != -ENOMEM)
+			cachefiles_io_error(cache,
+					    "Can't remove xattr from %lu"
+					    " (error %d)",
+					    dentry->d_inode->i_ino, -ret);
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+}
diff --git a/security/security.c b/security/security.c
index 86d94e5..255b011 100644
--- a/security/security.c
+++ b/security/security.c
@@ -334,6 +334,7 @@ int security_inode_create(struct inode *dir, struct dentry *dentry, int mode)
 		return 0;
 	return security_ops->inode_create(dir, dentry, mode);
 }
+EXPORT_SYMBOL_GPL(security_inode_create);
 
 int security_inode_link(struct dentry *old_dentry, struct inode *dir,
 			 struct dentry *new_dentry)
@@ -364,6 +365,7 @@ int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 		return 0;
 	return security_ops->inode_mkdir(dir, dentry, mode);
 }
+EXPORT_SYMBOL_GPL(security_inode_mkdir);
 
 int security_inode_rmdir(struct inode *dir, struct dentry *dentry)
 {


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 19/28] NFS: Use local caching [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (17 preceding siblings ...)
  2007-12-05 19:39 ` [PATCH 18/28] CacheFiles: A cache that backs onto a mounted filesystem " David Howells
@ 2007-12-05 19:39 ` David Howells
  2007-12-05 19:40 ` [PATCH 20/28] NFS: Configuration and mount option changes to enable local caching on NFS " David Howells
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:39 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

The attached patch makes it possible for the NFS filesystem to make use of the
network filesystem local caching service (FS-Cache).

To be able to use this, an updated mount program is required.  This can be
obtained from:

	http://people.redhat.com/steved/fscache/util-linux/

To mount an NFS filesystem to use caching, add an "fsc" option to the mount:

	mount warthog:/ /a -o fsc

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/nfs/Makefile           |    1 
 fs/nfs/client.c           |    5 +
 fs/nfs/file.c             |   37 ++++
 fs/nfs/fscache-def.c      |  289 +++++++++++++++++++++++++++++++++
 fs/nfs/fscache.c          |  391 +++++++++++++++++++++++++++++++++++++++++++++
 fs/nfs/fscache.h          |  148 +++++++++++++++++
 fs/nfs/inode.c            |   47 +++++
 fs/nfs/read.c             |   28 +++
 fs/nfs/super.c            |    3 
 fs/nfs/sysctl.c           |    1 
 include/linux/nfs_fs.h    |    9 +
 include/linux/nfs_fs_sb.h |   18 ++
 12 files changed, 968 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index df0f41e..073d04c 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -16,3 +16,4 @@ nfs-$(CONFIG_NFS_V4)	+= nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
 			   nfs4namespace.o
 nfs-$(CONFIG_NFS_DIRECTIO) += direct.o
 nfs-$(CONFIG_SYSCTL) += sysctl.o
+nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-def.o
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 70587f3..acb2179 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -43,6 +43,7 @@
 #include "delegation.h"
 #include "iostat.h"
 #include "internal.h"
+#include "fscache.h"
 
 #define NFSDBG_FACILITY		NFSDBG_CLIENT
 
@@ -139,6 +140,8 @@ static struct nfs_client *nfs_alloc_client(const char *hostname,
 	clp->cl_state = 1 << NFS4CLNT_LEASE_EXPIRED;
 #endif
 
+	nfs_fscache_get_client_cookie(clp);
+
 	return clp;
 
 error_3:
@@ -170,6 +173,8 @@ static void nfs_free_client(struct nfs_client *clp)
 
 	nfs4_shutdown_client(clp);
 
+	nfs_fscache_release_client_cookie(clp);
+
 	/* -EIO all pending I/O */
 	if (!IS_ERR(clp->cl_rpcclient))
 		rpc_shutdown_client(clp->cl_rpcclient);
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index b3bb89f..d492cd7 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -35,6 +35,7 @@
 #include "delegation.h"
 #include "internal.h"
 #include "iostat.h"
+#include "fscache.h"
 
 #define NFSDBG_FACILITY		NFSDBG_FILE
 
@@ -352,22 +353,48 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
 	return status < 0 ? status : copied;
 }
 
+/*
+ * Partially or wholly invalidate a page
+ * - Release the private state associated with a page if undergoing complete
+ *   page invalidation
+ * - Called if either PG_private or PG_fscache set on the page
+ * - Caller holds page lock
+ */
 static void nfs_invalidate_page(struct page *page, unsigned long offset)
 {
 	if (offset != 0)
 		return;
 	/* Cancel any unstarted writes on this page */
 	nfs_wb_page_cancel(page->mapping->host, page);
+
+	nfs_fscache_invalidate_page(page, page->mapping->host);
 }
 
+/*
+ * Release the private state associated with a page
+ * - Called if either PG_private or PG_fscache set on the page
+ * - Caller holds page lock
+ * - Return true (may release) or false (may not)
+ */
 static int nfs_release_page(struct page *page, gfp_t gfp)
 {
 	/* If PagePrivate() is set, then the page is not freeable */
-	return 0;
+	if (PagePrivate(page))
+		return 0;
+	return nfs_fscache_release_page(page, gfp);
 }
 
+/*
+ * Attempt to clear the private state associated with a page when an error
+ * occurs that requires the cached contents of an inode to be written back or
+ * destroyed
+ * - Called if either PG_private or PG_fscache set on the page
+ * - Caller holds page lock
+ * - Return 0 if successful, -error otherwise
+ */
 static int nfs_launder_page(struct page *page)
 {
+	wait_on_page_fscache_write(page);
 	return nfs_wb_page(page->mapping->host, page);
 }
 
@@ -387,6 +414,11 @@ const struct address_space_operations nfs_file_aops = {
 	.launder_page = nfs_launder_page,
 };
 
+/*
+ * Notification that a PTE pointing to an NFS page is about to be made
+ * writable, implying that someone is about to modify the page through a
+ * shared-writable mapping
+ */
 static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page)
 {
 	struct file *filp = vma->vm_file;
@@ -396,6 +428,9 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page)
 	struct address_space *mapping;
 	loff_t offset;
 
+	/* make sure the cache has finished storing the page */
+	wait_on_page_fscache_write(page);
+
 	lock_page(page);
 	mapping = page->mapping;
 	if (mapping != vma->vm_file->f_path.dentry->d_inode->i_mapping) {
diff --git a/fs/nfs/fscache-def.c b/fs/nfs/fscache-def.c
new file mode 100644
index 0000000..bc20b7d
--- /dev/null
+++ b/fs/nfs/fscache-def.c
@@ -0,0 +1,289 @@
+/* NFS FS-Cache index structure definition
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/nfs_fs.h>
+#include <linux/nfs_fs_sb.h>
+#include <linux/in6.h>
+
+#include "internal.h"
+#include "fscache.h"
+
+#define NFSDBG_FACILITY		NFSDBG_FSCACHE
+
+/*
+ * Definition of the auxiliary data attached to NFS inode storage objects.
+ * This is used for coherency management.
+ */
+struct nfs_fh_auxdata {
+	struct timespec	i_mtime;
+	struct timespec	i_ctime;
+	loff_t		i_size;
+};
+
+/*
+ * Definition of the key for an NFS server index object.  The server's IP
+ * address is stored as an IPv6 address, with IPv4 addresses being wrapped
+ * appropriately.
+ */
+struct nfs_server_key {
+	uint16_t nfsversion;
+	uint16_t port;
+	union {
+		struct {
+			uint8_t		ipv6wrapper[12];
+			struct in_addr	addr;
+		} ipv4_addr;
+		struct in6_addr ipv6_addr;
+	};
+};
+
+static const struct fscache_netfs_operations nfs_cache_ops = {
+};
+
+struct fscache_netfs nfs_cache_netfs = {
+	.name		= "nfs",
+	.version	= 0,
+	.ops		= &nfs_cache_ops,
+};
+
+static const uint8_t nfs_cache_ipv6_wrapper_for_ipv4[12] = {
+	[0 ... 9]	= 0x00,
+	[10 ... 11]	= 0xff
+};
+
+/*
+ * Generate a key to describe a server in the main NFS index
+ */
+static uint16_t nfs_server_get_key(const void *cookie_netfs_data,
+				   void *buffer, uint16_t bufmax)
+{
+	const struct nfs_client *clp = cookie_netfs_data;
+	struct nfs_server_key *key = buffer;
+	uint16_t len = 0;
+
+	key->nfsversion = clp->cl_nfsversion;
+
+	switch (clp->cl_addr.sin_family) {
+	case AF_INET:
+		key->port = clp->cl_addr.sin_port;
+
+		memcpy(&key->ipv4_addr.ipv6wrapper,
+		       &nfs_cache_ipv6_wrapper_for_ipv4,
+		       sizeof(key->ipv4_addr.ipv6wrapper));
+		memcpy(&key->ipv4_addr.addr,
+		       &clp->cl_addr.sin_addr,
+		       sizeof(key->ipv4_addr.addr));
+		len = sizeof(struct nfs_server_key);
+		break;
+
+	case AF_INET6:
+		key->port = clp->cl_addr.sin_port;
+
+		memcpy(&key->ipv6_addr,
+		       &clp->cl_addr.sin_addr,
+		       sizeof(key->ipv6_addr));
+		len = sizeof(struct nfs_server_key);
+		break;
+
+	default:
+		len = 0;
+		printk(KERN_WARNING "NFS: Unknown network family '%d'\n",
+			clp->cl_addr.sin_family);
+		break;
+	}
+
+	return len;
+}
+
+/*
+ * The root index for the filesystem is defined by nfsd IP address and ports
+ */
+const struct fscache_cookie_def nfs_cache_server_index_def = {
+	.name		= "NFS.servers",
+	.type 		= FSCACHE_COOKIE_TYPE_INDEX,
+	.get_key	= nfs_server_get_key,
+};
+
+/*
+ * Generate a key to describe an NFS inode in an NFS server's index
+ */
+static uint16_t nfs_fh_get_key(const void *cookie_netfs_data,
+			       void *buffer, uint16_t bufmax)
+{
+	const struct nfs_inode *nfsi = cookie_netfs_data;
+	uint16_t nsize;
+
+	/* use the inode's NFS filehandle as the key */
+	nsize = nfsi->fh.size;
+	memcpy(buffer, nfsi->fh.data, nsize);
+	return nsize;
+}
+
+/*
+ * Get an extra reference on a read context
+ * - This function can be absent if the completion function doesn't require a
+ *   context
+ */
+static void nfs_fh_get_context(void *cookie_netfs_data, void *context)
+{
+	get_nfs_open_context(context);
+}
+
+/*
+ * Release an extra reference on a read context
+ * - This function can be absent if the completion function doesn't require a
+ *   context
+ */
+static void nfs_fh_put_context(void *cookie_netfs_data, void *context)
+{
+	if (context)
+		put_nfs_open_context(context);
+}
+
+/*
+ * Indication the cookie is no longer uncached
+ * - This function is called when the backing store currently caching a cookie
+ *   is removed
+ * - The netfs should use this to clean up any markers indicating cached pages
+ * - This is mandatory for any object that may have data
+ */
+static void nfs_fh_now_uncached(void *cookie_netfs_data)
+{
+	struct nfs_inode *nfsi = cookie_netfs_data;
+	struct pagevec pvec;
+	pgoff_t first;
+	int loop, nr_pages;
+
+	pagevec_init(&pvec, 0);
+	first = 0;
+
+	dprintk("NFS: nfs_fh_now_uncached: nfs_inode 0x%p\n", nfsi);
+
+	for (;;) {
+		/* grab a bunch of pages to clean */
+		nr_pages = pagevec_lookup(&pvec,
+					  nfsi->vfs_inode.i_mapping,
+					  first,
+					  PAGEVEC_SIZE - pagevec_count(&pvec));
+		if (!nr_pages)
+			break;
+
+		for (loop = 0; loop < nr_pages; loop++)
+			ClearPageFsCache(pvec.pages[loop]);
+
+		first = pvec.pages[nr_pages - 1]->index + 1;
+
+		pvec.nr = nr_pages;
+		pagevec_release(&pvec);
+		cond_resched();
+	}
+}
+
+/*
+ * Tet certain file attributes from the netfs data
+ * - This function can be absent for an index
+ * - Not permitted to return an error
+ * - The netfs data from the cookie being used as the source is
+ *   presented
+ */
+static void nfs_fh_get_attr(const void *cookie_netfs_data, uint64_t *size)
+{
+	const struct nfs_inode *nfsi = cookie_netfs_data;
+
+	*size = nfsi->vfs_inode.i_size;
+}
+
+/*
+ * Get the auxiliary data from netfs data
+ * - This function can be absent if the index carries no state data
+ * - Should store the auxiliary data in the buffer
+ * - Should return the amount of amount stored
+ * - Not permitted to return an error
+ * - The netfs data from the cookie being used as the source is presented
+ */
+static uint16_t nfs_fh_get_aux(const void *cookie_netfs_data,
+			       void *buffer, uint16_t bufmax)
+{
+	struct nfs_fh_auxdata auxdata;
+	const struct nfs_inode *nfsi = cookie_netfs_data;
+
+	auxdata.i_size = nfsi->vfs_inode.i_size;
+	auxdata.i_mtime = nfsi->vfs_inode.i_mtime;
+	auxdata.i_ctime = nfsi->vfs_inode.i_ctime;
+
+	if (bufmax > sizeof(auxdata))
+		bufmax = sizeof(auxdata);
+
+	memcpy(buffer, &auxdata, bufmax);
+	return bufmax;
+}
+
+/*
+ * Consult the netfs about the state of an object
+ * - This function can be absent if the index carries no state data
+ * - The netfs data from the cookie being used as the target is
+ *   presented, as is the auxiliary data
+ */
+static enum fscache_checkaux nfs_fh_check_aux(void *cookie_netfs_data,
+					      const void *data,
+					      uint16_t datalen)
+{
+	struct nfs_fh_auxdata auxdata;
+	struct nfs_inode *nfsi = cookie_netfs_data;
+
+	if (datalen > sizeof(auxdata))
+		return FSCACHE_CHECKAUX_OBSOLETE;
+
+	auxdata.i_size = nfsi->vfs_inode.i_size;
+	auxdata.i_mtime = nfsi->vfs_inode.i_mtime;
+	auxdata.i_ctime = nfsi->vfs_inode.i_ctime;
+
+	if (memcmp(data, &auxdata, datalen) != 0)
+		return FSCACHE_CHECKAUX_OBSOLETE;
+
+	return FSCACHE_CHECKAUX_OKAY;
+}
+
+/*
+ * The primary index for each server is simply made up of a series of NFS file
+ * handles
+ */
+const struct fscache_cookie_def nfs_cache_fh_index_def = {
+	.name		= "NFS.fh",
+	.type		= FSCACHE_COOKIE_TYPE_DATAFILE,
+	.get_key	= nfs_fh_get_key,
+	.get_attr	= nfs_fh_get_attr,
+	.get_aux	= nfs_fh_get_aux,
+	.check_aux	= nfs_fh_check_aux,
+	.get_context	= nfs_fh_get_context,
+	.put_context	= nfs_fh_put_context,
+	.now_uncached	= nfs_fh_now_uncached,
+};
+
+/*
+ * Register NFS for caching
+ */
+int nfs_fscache_register(void)
+{
+	return fscache_register_netfs(&nfs_cache_netfs);
+}
+
+/*
+ * Unregister NFS for caching
+ */
+void nfs_fscache_unregister(void)
+{
+	fscache_unregister_netfs(&nfs_cache_netfs);
+}
diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
new file mode 100644
index 0000000..465f961
--- /dev/null
+++ b/fs/nfs/fscache.c
@@ -0,0 +1,391 @@
+/* NFS filesystem cache interface
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/nfs_fs.h>
+#include <linux/nfs_fs_sb.h>
+#include <linux/in6.h>
+#include <linux/seq_file.h>
+
+#include "internal.h"
+#include "fscache.h"
+
+#define NFSDBG_FACILITY		NFSDBG_FSCACHE
+
+/*
+ * Get the per-client index cookie for an NFS client if the appropriate mount
+ * flag was set
+ * - We always try and get an index cookie for the client, but get filehandle
+ *   cookies on a per-superblock basis, depending on the mount flags
+ */
+void nfs_fscache_get_client_cookie(struct nfs_client *clp)
+{
+	/* create a cache index for looking up filehandles */
+	clp->fscache = fscache_acquire_cookie(nfs_cache_netfs.primary_index,
+					      &nfs_cache_server_index_def,
+					      clp);
+	dfprintk(FSCACHE, "NFS: get client cookie (0x%p/0x%p)\n",
+		 clp, clp->fscache);
+}
+
+/*
+ * Dispose of a per-client cookie
+ */
+void nfs_fscache_release_client_cookie(struct nfs_client *clp)
+{
+	dfprintk(FSCACHE, "NFS: releasing client cookie (0x%p/0x%p)\n",
+		 clp, clp->fscache);
+
+	fscache_relinquish_cookie(clp->fscache, 0);
+	clp->fscache = NULL;
+}
+
+/*
+ * Display statistics through /proc/pid/mountstats
+ */
+void nfs_fscache_show_stats(struct seq_file *m, struct nfs_server *nfss)
+{
+	seq_printf(m, "\tfscache rd: ok=%u,fail=%u,lasterr=%d\n",
+		   atomic_read(&nfss->fscache_cnt_read_ok),
+		   atomic_read(&nfss->fscache_cnt_read_fail),
+		   nfss->fscache_last_read_error);
+	seq_printf(m, "\tfscache wr: ok=%u,fail=%u,lasterr=%d\n",
+		   atomic_read(&nfss->fscache_cnt_write_ok),
+		   atomic_read(&nfss->fscache_cnt_write_fail),
+		   nfss->fscache_last_write_error);
+	seq_printf(m, "\tfscache uncache: ok=%u\n",
+		   atomic_read(&nfss->fscache_cnt_uncache));
+}
+
+/*
+ * Initialise the per-filehandle cookie for an NFS inode
+ */
+void nfs_fscache_init_fh_cookie(struct inode *inode)
+{
+	NFS_I(inode)->fscache = NULL;
+	if (S_ISREG(inode->i_mode))
+		set_bit(NFS_INO_FSCACHE, &NFS_I(inode)->flags);
+}
+
+/*
+ * Get the per-filehandle cookie for an NFS inode
+ */
+void nfs_fscache_enable_fh_cookie(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nfs_inode *nfsi = NFS_I(inode);
+
+	if (nfsi->fscache || !NFS_FSCACHE(inode))
+		return;
+
+	if ((NFS_SB(sb)->options & NFS_OPTION_FSCACHE)) {
+		nfsi->fscache = fscache_acquire_cookie(
+			NFS_SB(sb)->nfs_client->fscache,
+			&nfs_cache_fh_index_def,
+			nfsi);
+
+		dfprintk(FSCACHE, "NFS: get FH cookie (0x%p/0x%p/0x%p)\n",
+			 sb, nfsi, nfsi->fscache);
+	}
+}
+
+/*
+ * Release a per-filehandle cookie
+ */
+void nfs_fscache_release_fh_cookie(struct inode *inode)
+{
+	struct nfs_inode *nfsi = NFS_I(inode);
+
+	dfprintk(FSCACHE, "NFS: clear cookie (0x%p/0x%p)\n",
+		 nfsi, nfsi->fscache);
+
+	fscache_relinquish_cookie(nfsi->fscache, 0);
+	nfsi->fscache = NULL;
+}
+
+/*
+ * Retire a per-filehandle cookie, destroying the data attached to it
+ */
+void nfs_fscache_zap_fh_cookie(struct inode *inode)
+{
+	struct nfs_inode *nfsi = NFS_I(inode);
+
+	dfprintk(FSCACHE, "NFS: zapping cookie (0x%p/0x%p)\n",
+		 nfsi, nfsi->fscache);
+
+	fscache_relinquish_cookie(nfsi->fscache, 1);
+	nfsi->fscache = NULL;
+}
+
+/*
+ * Turn off the cache with regard to a filehandle cookie if opened for writing,
+ * invalidating all the pages in the page cache relating to the associated
+ * inode to clear the per-page caching
+ */
+void nfs_fscache_disable_fh_cookie(struct inode *inode)
+{
+	clear_bit(NFS_INO_FSCACHE, &NFS_I(inode)->flags);
+
+	if (NFS_I(inode)->fscache) {
+		dfprintk(FSCACHE,
+			 "NFS: nfsi 0x%p turning cache off\n", NFS_I(inode));
+
+		/* Need to invalidate any mapped pages that were read in before
+		 * turning off the cache.
+		 */
+		if (inode->i_mapping && inode->i_mapping->nrpages)
+			invalidate_inode_pages2(inode->i_mapping);
+
+		nfs_fscache_zap_fh_cookie(inode);
+	}
+}
+
+/*
+ * Decide if we should enable or disable the FS cache for this inode
+ * - For now, only regular files that are open read-only will be able to use
+ *   the cache
+ */
+void nfs_fscache_set_fh_cookie(struct inode *inode, struct file *filp)
+{
+	if (NFS_FSCACHE(inode)) {
+		if ((filp->f_flags & O_ACCMODE) != O_RDONLY)
+			nfs_fscache_disable_fh_cookie(inode);
+		else
+			nfs_fscache_enable_fh_cookie(inode);
+	}
+}
+
+/*
+ * Replace a per-filehandle cookie due to revalidation detecting a file having
+ * changed on the server
+ */
+void nfs_fscache_renew_fh_cookie(struct inode *inode)
+{
+	struct nfs_inode *nfsi = NFS_I(inode);
+	struct nfs_server *nfss = NFS_SERVER(inode);
+	struct fscache_cookie *old = nfsi->fscache;
+
+	if (nfsi->fscache) {
+		/* retire the current fscache cache and get a new one */
+		fscache_relinquish_cookie(nfsi->fscache, 1);
+
+		nfsi->fscache = fscache_acquire_cookie(
+			nfss->nfs_client->fscache,
+			&nfs_cache_fh_index_def,
+			nfsi);
+
+		dfprintk(FSCACHE,
+			 "NFS: revalidation new cookie (0x%p/0x%p/0x%p/0x%p)\n",
+			 nfss, nfsi, old, nfsi->fscache);
+	}
+}
+
+/*
+ * change the filesize associated with a per-filehandle cookie
+ */
+void nfs_fscache_attr_changed(struct inode *inode)
+{
+	fscache_attr_changed(NFS_I(inode)->fscache);
+}
+
+/*
+ * Handle completion of a page being read from the cache
+ * - Called in process (keventd) context
+ */
+static void nfs_readpage_from_fscache_complete(struct page *page,
+					       void *context,
+					       int error)
+{
+	dfprintk(FSCACHE,
+		 "NFS: readpage_from_fscache_complete (0x%p/0x%p/%d)\n",
+		 page, context, error);
+
+	/* if the read completes with an error, we just unlock the page and let
+	 * the VM reissue the readpage */
+	if (!error) {
+		SetPageUptodate(page);
+		unlock_page(page);
+	} else {
+		error = nfs_readpage_async(context, page->mapping->host, page);
+		if (error)
+			unlock_page(page);
+	}
+}
+
+/*
+ * Store a newly fetched page in fscache
+ * - PG_fscache must be set on the page
+ */
+void __nfs_readpage_to_fscache(struct inode *inode, struct page *page, int sync)
+{
+	struct nfs_server *nfss = NFS_SERVER(inode);
+	int ret;
+
+	dfprintk(FSCACHE,
+		 "NFS: readpage_to_fscache(fsc:%p/p:%p(i:%lx f:%lx)/%d)\n",
+		 NFS_I(inode)->fscache, page, page->index, page->flags, sync);
+
+	ret = fscache_write_page(NFS_I(inode)->fscache, page, GFP_KERNEL);
+	dfprintk(FSCACHE,
+		 "NFS:     readpage_to_fscache: p:%p(i:%lu f:%lx) ret %d\n",
+		 page, page->index, page->flags, ret);
+
+	if (ret != 0) {
+		fscache_uncache_page(NFS_I(inode)->fscache, page);
+		atomic_inc(&nfss->fscache_cnt_uncache);
+		atomic_inc(&nfss->fscache_cnt_write_fail);
+		nfss->fscache_last_write_error = ret;
+	} else {
+		atomic_inc(&nfss->fscache_cnt_write_ok);
+	}
+}
+
+/*
+ * Retrieve a page from fscache
+ */
+int __nfs_readpage_from_fscache(struct nfs_open_context *ctx,
+				struct inode *inode, struct page *page)
+{
+	struct nfs_server *nfss = NFS_SERVER(inode);
+	int ret;
+
+	dfprintk(FSCACHE,
+		 "NFS: readpage_from_fscache(fsc:%p/p:%p(i:%lx f:%lx)/0x%p)\n",
+		 NFS_I(inode)->fscache, page, page->index, page->flags, inode);
+
+	ret = fscache_read_or_alloc_page(NFS_I(inode)->fscache,
+					 page,
+					 nfs_readpage_from_fscache_complete,
+					 ctx,
+					 GFP_KERNEL);
+
+	switch (ret) {
+	case 0: /* read BIO submitted (page in fscache) */
+		dfprintk(FSCACHE,
+			 "NFS:    readpage_from_fscache: BIO submitted\n");
+		atomic_inc(&nfss->fscache_cnt_read_ok);
+		return ret;
+
+	case -ENOBUFS: /* inode not in cache */
+	case -ENODATA: /* page not in cache */
+		atomic_inc(&nfss->fscache_cnt_read_fail);
+		dfprintk(FSCACHE,
+			 "NFS:    readpage_from_fscache %d\n", ret);
+		return 1;
+
+	default:
+		dfprintk(FSCACHE, "NFS:    readpage_from_fscache %d\n", ret);
+		atomic_inc(&nfss->fscache_cnt_read_fail);
+		nfss->fscache_last_read_error = ret;
+	}
+	return ret;
+}
+
+/*
+ * Retrieve a set of pages from fscache
+ */
+int __nfs_readpages_from_fscache(struct nfs_open_context *ctx,
+				 struct inode *inode,
+				 struct address_space *mapping,
+				 struct list_head *pages,
+				 unsigned *nr_pages)
+{
+	struct nfs_server *nfss = NFS_SERVER(inode);
+	int ret, npages = *nr_pages;
+
+	dfprintk(FSCACHE, "NFS: nfs_getpages_from_fscache (0x%p/%u/0x%p)\n",
+		 NFS_I(inode)->fscache, npages, inode);
+
+	ret = fscache_read_or_alloc_pages(NFS_I(inode)->fscache,
+					  mapping, pages, nr_pages,
+					  nfs_readpage_from_fscache_complete,
+					  ctx,
+					  mapping_gfp_mask(mapping));
+
+	switch (ret) {
+	case 0: /* read submitted to the cache for all pages */
+		BUG_ON(!list_empty(pages));
+		BUG_ON(*nr_pages != 0);
+		dfprintk(FSCACHE,
+			 "NFS: nfs_getpages_from_fscache: submitted\n");
+
+		atomic_add(npages, &nfss->fscache_cnt_read_ok);
+		return ret;
+
+	case -ENOBUFS: /* some pages aren't cached and can't be */
+	case -ENODATA: /* some pages aren't cached */
+		dfprintk(FSCACHE,
+			 "NFS: nfs_getpages_from_fscache: no page: %d\n", ret);
+		atomic_add(*nr_pages, &nfss->fscache_cnt_read_fail);
+		return 1;
+
+	default:
+		dfprintk(FSCACHE,
+			 "NFS: nfs_getpages_from_fscache: ret  %d\n", ret);
+		nfss->fscache_last_read_error = ret;
+	}
+
+	return ret;
+}
+
+/*
+ * Release the caching state associated with a page, if the page isn't busy
+ * interacting with the cache
+ * - Returns true (can release page) or false (page busy)
+ */
+int nfs_fscache_release_page(struct page *page, gfp_t gfp)
+{
+	struct nfs_server *nfss = NFS_SERVER(page->mapping->host);
+
+	if (PageFsCacheWrite(page)) {
+		if (!(gfp & __GFP_WAIT))
+			return 0;
+		wait_on_page_fscache_write(page);
+	}
+
+	if (PageFsCache(page)) {
+		struct nfs_inode *nfsi = NFS_I(page->mapping->host);
+
+		BUG_ON(!nfsi->fscache);
+
+		dfprintk(FSCACHE, "NFS: fscache releasepage (0x%p/0x%p/0x%p)\n",
+			 nfsi->fscache, page, nfsi);
+
+		fscache_uncache_page(nfsi->fscache, page);
+		atomic_inc(&nfss->fscache_cnt_uncache);
+	}
+
+	return 1;
+}
+
+/*
+ * Release the caching state associated with a page if undergoing complete page
+ * invalidation
+ */
+void __nfs_fscache_invalidate_page(struct page *page, struct inode *inode)
+{
+	struct nfs_server *nfss = NFS_SERVER(inode);
+	struct nfs_inode *nfsi = NFS_I(inode);
+
+	BUG_ON(!nfsi->fscache);
+
+	dfprintk(FSCACHE, "NFS: fscache invalidatepage (0x%p/0x%p/0x%p)\n",
+		 nfsi->fscache, page, nfsi);
+
+	wait_on_page_fscache_write(page);
+
+	BUG_ON(!PageLocked(page));
+	fscache_uncache_page(nfsi->fscache, page);
+	atomic_inc(&nfss->fscache_cnt_uncache);
+}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
new file mode 100644
index 0000000..144fb58
--- /dev/null
+++ b/fs/nfs/fscache.h
@@ -0,0 +1,148 @@
+/* NFS filesystem cache interface definitions
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _NFS_FSCACHE_H
+#define _NFS_FSCACHE_H
+
+#include <linux/nfs_fs.h>
+#include <linux/nfs_mount.h>
+#include <linux/nfs4_mount.h>
+
+#ifdef CONFIG_NFS_FSCACHE
+#include <linux/fscache.h>
+
+/*
+ * fscache-def.c
+ */
+extern struct fscache_netfs nfs_cache_netfs;
+extern const struct fscache_cookie_def nfs_cache_server_index_def;
+extern const struct fscache_cookie_def nfs_cache_fh_index_def;
+
+extern int nfs_fscache_register(void);
+extern void nfs_fscache_unregister(void);
+
+/*
+ * fscache.c
+ */
+extern void nfs_fscache_get_client_cookie(struct nfs_client *);
+extern void nfs_fscache_release_client_cookie(struct nfs_client *);
+extern void nfs_fscache_show_stats(struct seq_file *, struct nfs_server *);
+extern void nfs_fscache_init_fh_cookie(struct inode *);
+extern void nfs_fscache_enable_fh_cookie(struct inode *);
+extern void nfs_fscache_release_fh_cookie(struct inode *);
+extern void nfs_fscache_zap_fh_cookie(struct inode *);
+extern void nfs_fscache_disable_fh_cookie(struct inode *);
+extern void nfs_fscache_set_fh_cookie(struct inode *, struct file *);
+extern void nfs_fscache_renew_fh_cookie(struct inode *);
+extern void nfs_fscache_attr_changed(struct inode *);
+extern void __nfs_readpage_to_fscache(struct inode *, struct page *, int);
+extern int __nfs_readpage_from_fscache(struct nfs_open_context *,
+				       struct inode *, struct page *);
+extern int __nfs_readpages_from_fscache(struct nfs_open_context *,
+					struct inode *, struct address_space *,
+					struct list_head *, unsigned *);
+extern void __nfs_fscache_invalidate_page(struct page *, struct inode *);
+extern int nfs_fscache_release_page(struct page *, gfp_t);
+
+/*
+ * release the caching state associated with a page if undergoing complete page
+ * invalidation
+ */
+static inline void nfs_fscache_invalidate_page(struct page *page,
+					       struct inode *inode)
+{
+	if (PageFsCache(page))
+		__nfs_fscache_invalidate_page(page, inode);
+}
+
+/*
+ * store a newly fetched page in fscache
+ */
+static inline void nfs_readpage_to_fscache(struct inode *inode,
+					   struct page *page,
+					   int sync)
+{
+	if (PageFsCache(page))
+		__nfs_readpage_to_fscache(inode, page, sync);
+}
+
+/*
+ * retrieve a page from fscache
+ */
+static inline int nfs_readpage_from_fscache(struct nfs_open_context *ctx,
+					    struct inode *inode,
+					    struct page *page)
+{
+	if (NFS_I(inode)->fscache)
+		return __nfs_readpage_from_fscache(ctx, inode, page);
+	return -ENOBUFS;
+}
+
+/*
+ * retrieve a set of pages from fscache
+ */
+static inline int nfs_readpages_from_fscache(struct nfs_open_context *ctx,
+					     struct inode *inode,
+					     struct address_space *mapping,
+					     struct list_head *pages,
+					     unsigned *nr_pages)
+{
+	if (NFS_I(inode)->fscache)
+		return __nfs_readpages_from_fscache(ctx, inode, mapping, pages,
+						    nr_pages);
+	return -ENOBUFS;
+}
+
+#else /* CONFIG_NFS_FSCACHE */
+static inline int nfs_fscache_register(void) { return 0; }
+static inline void nfs_fscache_unregister(void) {}
+static inline void nfs_fscache_get_client_cookie(struct nfs_client *clp) {}
+static inline void nfs4_fscache_get_client_cookie(struct nfs_client *clp) {}
+static inline void nfs_fscache_release_client_cookie(struct nfs_client *clp) {}
+static inline void nfs_fscache_show_stats(struct seq_file *m,
+					  struct nfs_server *nfss) {}
+
+static inline void nfs_fscache_init_fh_cookie(struct inode *inode) {}
+static inline void nfs_fscache_enable_fh_cookie(struct inode *inode) {}
+static inline void nfs_fscache_attr_changed(struct inode *inode) {}
+static inline void nfs_fscache_release_fh_cookie(struct inode *inode) {}
+static inline void nfs_fscache_zap_fh_cookie(struct inode *inode) {}
+static inline void nfs_fscache_renew_fh_cookie(struct inode *inode) {}
+static inline void nfs_fscache_disable_fh_cookie(struct inode *inode) {}
+static inline void nfs_fscache_set_fh_cookie(struct inode *inode,
+					     struct file *filp) {}
+static inline void nfs_fscache_install_vm_ops(struct inode *inode,
+					      struct vm_area_struct *vma) {}
+static inline int nfs_fscache_release_page(struct page *page, gfp_t gfp)
+{
+	return 1; /* True: may release page */
+}
+static inline void nfs_fscache_invalidate_page(struct page *page,
+					       struct inode *inode) {}
+static inline void nfs_readpage_to_fscache(struct inode *inode,
+					   struct page *page, int sync) {}
+static inline int nfs_readpage_from_fscache(struct nfs_open_context *ctx,
+					    struct inode *inode,
+					    struct page *page)
+{
+	return -ENOBUFS;
+}
+static inline int nfs_readpages_from_fscache(struct nfs_open_context *ctx,
+					     struct inode *inode,
+					     struct address_space *mapping,
+					     struct list_head *pages,
+					     unsigned *nr_pages)
+{
+	return -ENOBUFS;
+}
+
+#endif /* CONFIG_NFS_FSCACHE */
+#endif /* _NFS_FSCACHE_H */
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index db5d96d..49726b1 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -46,6 +46,7 @@
 #include "delegation.h"
 #include "iostat.h"
 #include "internal.h"
+#include "fscache.h"
 
 #define NFSDBG_FACILITY		NFSDBG_VFS
 
@@ -111,6 +112,7 @@ void nfs_clear_inode(struct inode *inode)
 	BUG_ON(!list_empty(&NFS_I(inode)->open_files));
 	nfs_zap_acl_cache(inode);
 	nfs_access_zap_cache(inode);
+	nfs_fscache_release_fh_cookie(inode);
 }
 
 /**
@@ -330,6 +332,8 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
 		memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
 		nfsi->access_cache = RB_ROOT;
 
+		nfs_fscache_init_fh_cookie(inode);
+
 		unlock_new_inode(inode);
 	} else
 		nfs_refresh_inode(inode, fattr);
@@ -613,6 +617,7 @@ int nfs_open(struct inode *inode, struct file *filp)
 	ctx->mode = filp->f_mode;
 	nfs_file_set_open_context(filp, ctx);
 	put_nfs_open_context(ctx);
+	nfs_fscache_set_fh_cookie(inode, filp);
 	return 0;
 }
 
@@ -673,7 +678,13 @@ __nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
 			 (long long)NFS_FILEID(inode), status);
 		goto out;
 	}
-	spin_unlock(&inode->i_lock);
+	if (nfsi->cache_validity & NFS_INO_INVALID_FSCACHE_ATTR) {
+		nfsi->cache_validity &= ~NFS_INO_INVALID_FSCACHE_ATTR;
+		spin_unlock(&inode->i_lock);
+		nfs_fscache_attr_changed(inode);
+	} else {
+		spin_unlock(&inode->i_lock);
+	}
 
 	if (nfsi->cache_validity & NFS_INO_INVALID_ACL)
 		nfs_zap_acl_cache(inode);
@@ -729,6 +740,7 @@ static int nfs_invalidate_mapping_nolock(struct inode *inode, struct address_spa
 		memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
 	spin_unlock(&inode->i_lock);
 	nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
+	nfs_fscache_renew_fh_cookie(inode);
 	dfprintk(PAGECACHE, "NFS: (%s/%Ld) data cache invalidated\n",
 			inode->i_sb->s_id, (long long)NFS_FILEID(inode));
 	return 0;
@@ -904,7 +916,13 @@ int nfs_refresh_inode(struct inode *inode, struct nfs_fattr *fattr)
 	else
 		status = nfs_check_inode_attributes(inode, fattr);
 
-	spin_unlock(&inode->i_lock);
+	if (nfsi->cache_validity & NFS_INO_INVALID_FSCACHE_ATTR) {
+		nfsi->cache_validity &= ~NFS_INO_INVALID_FSCACHE_ATTR;
+		spin_unlock(&inode->i_lock);
+		nfs_fscache_attr_changed(inode);
+	} else {
+		spin_unlock(&inode->i_lock);
+	}
 	return status;
 }
 
@@ -925,12 +943,19 @@ int nfs_refresh_inode(struct inode *inode, struct nfs_fattr *fattr)
 int nfs_post_op_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
+	bool update_fscache = false;
 
 	spin_lock(&inode->i_lock);
 	nfsi->cache_validity |= NFS_INO_INVALID_ATTR|NFS_INO_REVAL_PAGECACHE;
 	if (S_ISDIR(inode->i_mode))
 		nfsi->cache_validity |= NFS_INO_INVALID_DATA;
+	if (nfsi->cache_validity & NFS_INO_INVALID_FSCACHE_ATTR) {
+		nfsi->cache_validity &= ~NFS_INO_INVALID_FSCACHE_ATTR;
+		update_fscache = true;
+	}
 	spin_unlock(&inode->i_lock);
+	if (update_fscache)
+		nfs_fscache_attr_changed(inode);
 	return nfs_refresh_inode(inode, fattr);
 }
 
@@ -1018,7 +1043,8 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 		if (!timespec_equal(&inode->i_mtime, &fattr->mtime)) {
 			dprintk("NFS: mtime change on server for file %s/%ld\n",
 					inode->i_sb->s_id, inode->i_ino);
-			invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA;
+			invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA |
+				NFS_INO_INVALID_FSCACHE_ATTR;
 			nfsi->cache_change_attribute = now;
 		}
 		/* If ctime has changed we should definitely clear access+acl caches */
@@ -1027,7 +1053,9 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 	} else if (nfsi->change_attr != fattr->change_attr) {
 		dprintk("NFS: change_attr change on server for file %s/%ld\n",
 				inode->i_sb->s_id, inode->i_ino);
-		invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA|NFS_INO_INVALID_ACCESS|NFS_INO_INVALID_ACL;
+		invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA |
+			NFS_INO_INVALID_ACCESS|NFS_INO_INVALID_ACL |
+			NFS_INO_INVALID_FSCACHE_ATTR;
 		nfsi->cache_change_attribute = now;
 	}
 
@@ -1039,13 +1067,13 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 		 * the file grown beyond our last write? */
 		if (nfsi->npages == 0 || new_isize > cur_isize) {
 			inode->i_size = new_isize;
-			invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA;
+			invalid |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA |
+				NFS_INO_INVALID_FSCACHE_ATTR;
 		}
 		dprintk("NFS: isize change on server for file %s/%ld\n",
 				inode->i_sb->s_id, inode->i_ino);
 	}
 
-
 	memcpy(&inode->i_mtime, &fattr->mtime, sizeof(inode->i_mtime));
 	memcpy(&inode->i_ctime, &fattr->ctime, sizeof(inode->i_ctime));
 	memcpy(&inode->i_atime, &fattr->atime, sizeof(inode->i_atime));
@@ -1214,6 +1242,10 @@ static int __init init_nfs_fs(void)
 {
 	int err;
 
+	err = nfs_fscache_register();
+	if (err < 0)
+		goto out6;
+
 	err = nfs_fs_proc_init();
 	if (err)
 		goto out5;
@@ -1260,6 +1292,8 @@ out3:
 out4:
 	nfs_fs_proc_exit();
 out5:
+	nfs_fscache_unregister();
+out6:
 	return err;
 }
 
@@ -1270,6 +1304,7 @@ static void __exit exit_nfs_fs(void)
 	nfs_destroy_readpagecache();
 	nfs_destroy_inodecache();
 	nfs_destroy_nfspagecache();
+	nfs_fscache_unregister();
 #ifdef CONFIG_PROC_FS
 	rpc_proc_unregister("nfs");
 #endif
diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index 4587a86..56e919e 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -18,12 +18,14 @@
 #include <linux/sunrpc/clnt.h>
 #include <linux/nfs_fs.h>
 #include <linux/nfs_page.h>
+#include <linux/nfs_mount.h>
 #include <linux/smp_lock.h>
 
 #include <asm/system.h>
 
 #include "internal.h"
 #include "iostat.h"
+#include "fscache.h"
 
 #define NFSDBG_FACILITY		NFSDBG_PAGECACHE
 
@@ -114,8 +116,8 @@ static void nfs_readpage_truncate_uninitialised_page(struct nfs_read_data *data)
 	}
 }
 
-static int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
-		struct page *page)
+int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
+		       struct page *page)
 {
 	LIST_HEAD(one_request);
 	struct nfs_page	*new;
@@ -142,6 +144,11 @@ static int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
 
 static void nfs_readpage_release(struct nfs_page *req)
 {
+	struct inode *d_inode = req->wb_context->path.dentry->d_inode;
+
+	if (PageUptodate(req->wb_page))
+		nfs_readpage_to_fscache(d_inode, req->wb_page, 0);
+
 	unlock_page(req->wb_page);
 
 	dprintk("NFS: read done (%s/%Ld %d@%Ld)\n",
@@ -496,8 +503,15 @@ int nfs_readpage(struct file *file, struct page *page)
 	} else
 		ctx = get_nfs_open_context(nfs_file_open_context(file));
 
+	if (!IS_SYNC(inode)) {
+		error = nfs_readpage_from_fscache(ctx, inode, page);
+		if (error == 0)
+			goto out;
+	}
+
 	error = nfs_readpage_async(ctx, inode, page);
 
+out:
 	put_nfs_open_context(ctx);
 	return error;
 out_unlock:
@@ -573,6 +587,15 @@ int nfs_readpages(struct file *filp, struct address_space *mapping,
 			return -EBADF;
 	} else
 		desc.ctx = get_nfs_open_context(nfs_file_open_context(filp));
+
+	/* attempt to read as many of the pages as possible from the cache
+	 * - this returns -ENOBUFS immediately if the cookie is negative
+	 */
+	ret = nfs_readpages_from_fscache(desc.ctx, inode, mapping,
+					 pages, &nr_pages);
+	if (ret == 0)
+		goto read_complete; /* all pages were read */
+
 	if (rsize < PAGE_CACHE_SIZE)
 		nfs_pageio_init(&pgio, inode, nfs_pagein_multi, rsize, 0);
 	else
@@ -583,6 +606,7 @@ int nfs_readpages(struct file *filp, struct address_space *mapping,
 	nfs_pageio_complete(&pgio);
 	npages = (pgio.pg_bytes_written + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	nfs_add_stats(inode, NFSIOS_READPAGES, npages);
+read_complete:
 	put_nfs_open_context(desc.ctx);
 out:
 	return ret;
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 2426e71..040d65b 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -57,6 +57,7 @@
 #include "delegation.h"
 #include "iostat.h"
 #include "internal.h"
+#include "fscache.h"
 
 #define NFSDBG_FACILITY		NFSDBG_VFS
 
@@ -549,6 +550,8 @@ static int nfs_show_stats(struct seq_file *m, struct vfsmount *mnt)
 
 	rpc_print_iostats(m, nfss->client);
 
+	nfs_fscache_show_stats(m, nfss);
+
 	return 0;
 }
 
diff --git a/fs/nfs/sysctl.c b/fs/nfs/sysctl.c
index b62481d..b3b3280 100644
--- a/fs/nfs/sysctl.c
+++ b/fs/nfs/sysctl.c
@@ -14,6 +14,7 @@
 #include <linux/nfs_fs.h>
 
 #include "callback.h"
+#include "internal.h"
 
 static const int nfs_set_port_min = 0;
 static const int nfs_set_port_max = 65535;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 2d15d4a..8a5685f 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -174,6 +174,9 @@ struct nfs_inode {
 	int			 delegation_state;
 	struct rw_semaphore	rwsem;
 #endif /* CONFIG_NFS_V4*/
+#ifdef CONFIG_NFS_FSCACHE
+	struct fscache_cookie	*fscache;
+#endif
 	struct inode		vfs_inode;
 };
 
@@ -187,6 +190,7 @@ struct nfs_inode {
 #define NFS_INO_INVALID_ACL	0x0010		/* cached acls are invalid */
 #define NFS_INO_REVAL_PAGECACHE	0x0020		/* must revalidate pagecache */
 #define NFS_INO_REVAL_FORCED	0x0040		/* force revalidation ignoring a delegation */
+#define NFS_INO_INVALID_FSCACHE_ATTR	0x0080	/* local cache attributes are invalid */
 
 /*
  * Bit offsets in flags field
@@ -195,6 +199,7 @@ struct nfs_inode {
 #define NFS_INO_ADVISE_RDPLUS	(1)		/* advise readdirplus */
 #define NFS_INO_STALE		(2)		/* possible stale inode */
 #define NFS_INO_ACL_LRU_SET	(3)		/* Inode is on the LRU list */
+#define NFS_INO_FSCACHE		(4)		/* inode can be cached by FS-Cache */
 
 static inline struct nfs_inode *NFS_I(struct inode *inode)
 {
@@ -216,6 +221,7 @@ static inline struct nfs_inode *NFS_I(struct inode *inode)
 
 #define NFS_FLAGS(inode)		(NFS_I(inode)->flags)
 #define NFS_STALE(inode)		(test_bit(NFS_INO_STALE, &NFS_FLAGS(inode)))
+#define NFS_FSCACHE(inode)	(test_bit(NFS_INO_FSCACHE, &NFS_FLAGS(inode)))
 
 #define NFS_FILEID(inode)		(NFS_I(inode)->fileid)
 
@@ -455,6 +461,8 @@ extern int  nfs_readpages(struct file *, struct address_space *,
 		struct list_head *, unsigned);
 extern int  nfs_readpage_result(struct rpc_task *, struct nfs_read_data *);
 extern void nfs_readdata_release(void *data);
+extern int  nfs_readpage_async(struct nfs_open_context *, struct inode *,
+			       struct page *);
 
 /*
  * Allocate nfs_read_data structures
@@ -545,6 +553,7 @@ extern void * nfs_root_data(void);
 #define NFSDBG_CALLBACK		0x0100
 #define NFSDBG_CLIENT		0x0200
 #define NFSDBG_MOUNT		0x0400
+#define NFSDBG_FSCACHE		0x0800
 #define NFSDBG_ALL		0xFFFF
 
 #ifdef __KERNEL__
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 0cac49b..3c8e15d 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -3,6 +3,7 @@
 
 #include <linux/list.h>
 #include <linux/backing-dev.h>
+#include <linux/fscache.h>
 
 struct nfs_iostats;
 
@@ -65,6 +66,10 @@ struct nfs_client {
 	char			cl_ipaddr[16];
 	unsigned char		cl_id_uniquifier;
 #endif
+
+#ifdef CONFIG_NFS_FSCACHE
+	struct fscache_cookie	*fscache;	/* client index cache cookie */
+#endif
 };
 
 /*
@@ -95,12 +100,25 @@ struct nfs_server {
 	unsigned int		acdirmin;
 	unsigned int		acdirmax;
 	unsigned int		namelen;
+	unsigned int		options;	/* extra options enabled by mount */
+#define NFS_OPTION_FSCACHE	0x00000001	/* - local caching enabled */
 
 	struct nfs_fsid		fsid;
 	__u64			maxfilesize;	/* maximum file size */
 	unsigned long		mount_time;	/* when this fs was mounted */
 	dev_t			s_dev;		/* superblock dev numbers */
 
+#ifdef CONFIG_NFS_FSCACHE
+	/* statistical counters for local caching */
+	atomic_t		fscache_cnt_read_ok;
+	atomic_t		fscache_cnt_read_fail;
+	atomic_t		fscache_cnt_write_ok;
+	atomic_t		fscache_cnt_write_fail;
+	atomic_t		fscache_cnt_uncache;
+	int			fscache_last_read_error;
+	int			fscache_last_write_error;
+#endif
+
 #ifdef CONFIG_NFS_V4
 	u32			attr_bitmask[2];/* V4 bitmask representing the set
 						   of attributes supported on this


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 20/28] NFS: Configuration and mount option changes to enable local caching on NFS [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (18 preceding siblings ...)
  2007-12-05 19:39 ` [PATCH 19/28] NFS: Use local caching " David Howells
@ 2007-12-05 19:40 ` David Howells
  2007-12-05 19:40 ` [PATCH 21/28] NFS: Display local caching state " David Howells
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:40 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Changes to the kernel configuration defintions and to the NFS mount options to
allow the local caching support added by the previous patch to be enabled.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/Kconfig        |    8 ++++++++
 fs/nfs/client.c   |    2 ++
 fs/nfs/internal.h |    1 +
 fs/nfs/super.c    |   14 ++++++++++++++
 4 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 215b0d6..83d1227 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1650,6 +1650,14 @@ config NFS_V4
 
 	  If unsure, say N.
 
+config NFS_FSCACHE
+	bool "Provide NFS client caching support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	depends on NFS_FS=m && FSCACHE || NFS_FS=y && FSCACHE=y
+	help
+	  Say Y here if you want NFS data to be cached locally on disc through
+	  the general filesystem cache manager
+
 config NFS_DIRECTIO
 	bool "Allow direct I/O on NFS files"
 	depends on NFS_FS
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index acb2179..be38c3c 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -575,6 +575,7 @@ static int nfs_init_server(struct nfs_server *server,
 
 	/* Initialise the client representation from the mount data */
 	server->flags = data->flags & NFS_MOUNT_FLAGMASK;
+	server->options = data->options;
 
 	if (data->rsize)
 		server->rsize = nfs_block_size(data->rsize, NULL);
@@ -931,6 +932,7 @@ static int nfs4_init_server(struct nfs_server *server,
 	/* Initialise the client representation from the mount data */
 	server->flags = data->flags & NFS_MOUNT_FLAGMASK;
 	server->caps |= NFS_CAP_ATOMIC_OPEN;
+	server->options = data->options;
 
 	if (data->rsize)
 		server->rsize = nfs_block_size(data->rsize, NULL);
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index f3acf48..ef09e00 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -35,6 +35,7 @@ struct nfs_parsed_mount_data {
 	int			acregmin, acregmax,
 				acdirmin, acdirmax;
 	int			namlen;
+	unsigned int		options;
 	unsigned int		bsize;
 	unsigned int		auth_flavor_len;
 	rpc_authflavor_t	auth_flavors[1];
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 040d65b..a5a3bd1 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -74,6 +74,7 @@ enum {
 	Opt_acl, Opt_noacl,
 	Opt_rdirplus, Opt_nordirplus,
 	Opt_sharecache, Opt_nosharecache,
+	Opt_fscache, Opt_nofscache,
 
 	/* Mount options that take integer arguments */
 	Opt_port,
@@ -123,6 +124,8 @@ static match_table_t nfs_mount_option_tokens = {
 	{ Opt_nordirplus, "nordirplus" },
 	{ Opt_sharecache, "sharecache" },
 	{ Opt_nosharecache, "nosharecache" },
+	{ Opt_fscache, "fsc" },
+	{ Opt_nofscache, "nofsc" },
 
 	{ Opt_port, "port=%u" },
 	{ Opt_rsize, "rsize=%u" },
@@ -459,6 +462,8 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss,
 	seq_printf(m, ",timeo=%lu", 10U * clp->retrans_timeo / HZ);
 	seq_printf(m, ",retrans=%u", clp->retrans_count);
 	seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client->cl_auth->au_flavor));
+	if (nfss->options & NFS_OPTION_FSCACHE)
+		seq_printf(m, ",fsc");
 }
 
 /*
@@ -697,6 +702,15 @@ static int nfs_parse_mount_options(char *raw,
 			break;
 		case Opt_nosharecache:
 			mnt->flags |= NFS_MOUNT_UNSHARED;
+			mnt->options &= ~NFS_OPTION_FSCACHE;
+			break;
+		case Opt_fscache:
+			/* sharing is mandatory with fscache */
+			mnt->options |= NFS_OPTION_FSCACHE;
+			mnt->flags &= ~NFS_MOUNT_UNSHARED;
+			break;
+		case Opt_nofscache:
+			mnt->options &= ~NFS_OPTION_FSCACHE;
 			break;
 
 		case Opt_port:


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 21/28] NFS: Display local caching state [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (19 preceding siblings ...)
  2007-12-05 19:40 ` [PATCH 20/28] NFS: Configuration and mount option changes to enable local caching on NFS " David Howells
@ 2007-12-05 19:40 ` David Howells
  2007-12-05 19:40 ` [PATCH 22/28] fcrypt endianness misannotations " David Howells
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:40 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Display the local caching state in /proc/fs/nfsfs/volumes.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/nfs/client.c  |    7 ++++---
 fs/nfs/fscache.h |   15 +++++++++++++++
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index be38c3c..91ecea3 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -1335,7 +1335,7 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
 
 	/* display header on line 1 */
 	if (v == &nfs_volume_list) {
-		seq_puts(m, "NV SERVER   PORT DEV     FSID\n");
+		seq_puts(m, "NV SERVER   PORT DEV     FSID              FSC\n");
 		return 0;
 	}
 	/* display one transport per line on subsequent lines */
@@ -1349,12 +1349,13 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
 		 (unsigned long long) server->fsid.major,
 		 (unsigned long long) server->fsid.minor);
 
-	seq_printf(m, "v%d %02x%02x%02x%02x %4hx %-7s %-17s\n",
+	seq_printf(m, "v%d %02x%02x%02x%02x %4hx %-7s %-17s %s\n",
 		   clp->cl_nfsversion,
 		   NIPQUAD(clp->cl_addr.sin_addr),
 		   ntohs(clp->cl_addr.sin_port),
 		   dev,
-		   fsid);
+		   fsid,
+		   nfs_server_fscache_state(server));
 
 	return 0;
 }
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 144fb58..9a735fc 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -53,6 +53,17 @@ extern void __nfs_fscache_invalidate_page(struct page *, struct inode *);
 extern int nfs_fscache_release_page(struct page *, gfp_t);
 
 /*
+ * indicate the client caching state as readable text
+ */
+static inline const char *nfs_server_fscache_state(struct nfs_server *server)
+{
+	if (server->nfs_client->fscache &&
+	    (server->options & NFS_OPTION_FSCACHE))
+		return "yes";
+	return "no ";
+}
+
+/*
  * release the caching state associated with a page if undergoing complete page
  * invalidation
  */
@@ -109,6 +120,10 @@ static inline void nfs4_fscache_get_client_cookie(struct nfs_client *clp) {}
 static inline void nfs_fscache_release_client_cookie(struct nfs_client *clp) {}
 static inline void nfs_fscache_show_stats(struct seq_file *m,
 					  struct nfs_server *nfss) {}
+static inline const char *nfs_server_fscache_state(struct nfs_server *server)
+{
+	return "no ";
+}
 
 static inline void nfs_fscache_init_fh_cookie(struct inode *inode) {}
 static inline void nfs_fscache_enable_fh_cookie(struct inode *inode) {}


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 22/28] fcrypt endianness misannotations [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (20 preceding siblings ...)
  2007-12-05 19:40 ` [PATCH 21/28] NFS: Display local caching state " David Howells
@ 2007-12-05 19:40 ` David Howells
  2007-12-05 19:40 ` [PATCH 23/28] AFS: Add TestSetPageError() " David Howells
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:40 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

 crypto/fcrypt.c |   88 ++++++++++++++++++++++++++++---------------------------
 1 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/crypto/fcrypt.c b/crypto/fcrypt.c
index d161949..a32cb68 100644
--- a/crypto/fcrypt.c
+++ b/crypto/fcrypt.c
@@ -51,7 +51,7 @@
 #define ROUNDS 16
 
 struct fcrypt_ctx {
-	u32 sched[ROUNDS];
+	__be32 sched[ROUNDS];
 };
 
 /* Rotate right two 32 bit numbers as a 56 bit number */
@@ -73,8 +73,8 @@ do {								\
  * /afs/transarc.com/public/afsps/afs.rel31b.export-src/rxkad/sboxes.h
  */
 #undef Z
-#define Z(x) __constant_be32_to_cpu(x << 3)
-static const u32 sbox0[256] = {
+#define Z(x) __constant_cpu_to_be32(x << 3)
+static const __be32 sbox0[256] = {
 	Z(0xea), Z(0x7f), Z(0xb2), Z(0x64), Z(0x9d), Z(0xb0), Z(0xd9), Z(0x11),
 	Z(0xcd), Z(0x86), Z(0x86), Z(0x91), Z(0x0a), Z(0xb2), Z(0x93), Z(0x06),
 	Z(0x0e), Z(0x06), Z(0xd2), Z(0x65), Z(0x73), Z(0xc5), Z(0x28), Z(0x60),
@@ -110,8 +110,8 @@ static const u32 sbox0[256] = {
 };
 
 #undef Z
-#define Z(x) __constant_be32_to_cpu((x << 27) | (x >> 5))
-static const u32 sbox1[256] = {
+#define Z(x) __constant_cpu_to_be32((x << 27) | (x >> 5))
+static const __be32 sbox1[256] = {
 	Z(0x77), Z(0x14), Z(0xa6), Z(0xfe), Z(0xb2), Z(0x5e), Z(0x8c), Z(0x3e),
 	Z(0x67), Z(0x6c), Z(0xa1), Z(0x0d), Z(0xc2), Z(0xa2), Z(0xc1), Z(0x85),
 	Z(0x6c), Z(0x7b), Z(0x67), Z(0xc6), Z(0x23), Z(0xe3), Z(0xf2), Z(0x89),
@@ -147,8 +147,8 @@ static const u32 sbox1[256] = {
 };
 
 #undef Z
-#define Z(x) __constant_be32_to_cpu(x << 11)
-static const u32 sbox2[256] = {
+#define Z(x) __constant_cpu_to_be32(x << 11)
+static const __be32 sbox2[256] = {
 	Z(0xf0), Z(0x37), Z(0x24), Z(0x53), Z(0x2a), Z(0x03), Z(0x83), Z(0x86),
 	Z(0xd1), Z(0xec), Z(0x50), Z(0xf0), Z(0x42), Z(0x78), Z(0x2f), Z(0x6d),
 	Z(0xbf), Z(0x80), Z(0x87), Z(0x27), Z(0x95), Z(0xe2), Z(0xc5), Z(0x5d),
@@ -184,8 +184,8 @@ static const u32 sbox2[256] = {
 };
 
 #undef Z
-#define Z(x) __constant_be32_to_cpu(x << 19)
-static const u32 sbox3[256] = {
+#define Z(x) __constant_cpu_to_be32(x << 19)
+static const __be32 sbox3[256] = {
 	Z(0xa9), Z(0x2a), Z(0x48), Z(0x51), Z(0x84), Z(0x7e), Z(0x49), Z(0xe2),
 	Z(0xb5), Z(0xb7), Z(0x42), Z(0x33), Z(0x7d), Z(0x5d), Z(0xa6), Z(0x12),
 	Z(0x44), Z(0x48), Z(0x6d), Z(0x28), Z(0xaa), Z(0x20), Z(0x6d), Z(0x57),
@@ -225,7 +225,7 @@ static const u32 sbox3[256] = {
  */
 #define F_ENCRYPT(R, L, sched)						\
 do {									\
-	union lc4 { u32 l; u8 c[4]; } u;				\
+	union lc4 { __be32 l; u8 c[4]; } u;				\
 	u.l = sched ^ R;						\
 	L ^= sbox0[u.c[0]] ^ sbox1[u.c[1]] ^ sbox2[u.c[2]] ^ sbox3[u.c[3]]; \
 } while(0)
@@ -237,7 +237,7 @@ static void fcrypt_encrypt(struct crypto_tfm *tfm, u8 *dst, const u8 *src)
 {
 	const struct fcrypt_ctx *ctx = crypto_tfm_ctx(tfm);
 	struct {
-		u32 l, r;
+		__be32 l, r;
 	} X;
 
 	memcpy(&X, src, sizeof(X));
@@ -269,7 +269,7 @@ static void fcrypt_decrypt(struct crypto_tfm *tfm, u8 *dst, const u8 *src)
 {
 	const struct fcrypt_ctx *ctx = crypto_tfm_ctx(tfm);
 	struct {
-		u32 l, r;
+		__be32 l, r;
 	} X;
 
 	memcpy(&X, src, sizeof(X));
@@ -328,22 +328,22 @@ static int fcrypt_setkey(struct crypto_tfm *tfm, const u8 *key, unsigned int key
 	k |= (*key) >> 1;
 
 	/* Use lower 32 bits for schedule, rotate by 11 each round (16 times) */
-	ctx->sched[0x0] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0x1] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0x2] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0x3] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0x4] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0x5] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0x6] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0x7] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0x8] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0x9] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0xa] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0xb] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0xc] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0xd] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0xe] = be32_to_cpu(k); ror56_64(k, 11);
-	ctx->sched[0xf] = be32_to_cpu(k);
+	ctx->sched[0x0] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0x1] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0x2] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0x3] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0x4] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0x5] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0x6] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0x7] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0x8] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0x9] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0xa] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0xb] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0xc] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0xd] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0xe] = cpu_to_be32(k); ror56_64(k, 11);
+	ctx->sched[0xf] = cpu_to_be32(k);
 
 	return 0;
 #else
@@ -369,22 +369,22 @@ static int fcrypt_setkey(struct crypto_tfm *tfm, const u8 *key, unsigned int key
 	lo |= (*key) >> 1;
 
 	/* Use lower 32 bits for schedule, rotate by 11 each round (16 times) */
-	ctx->sched[0x0] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0x1] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0x2] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0x3] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0x4] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0x5] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0x6] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0x7] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0x8] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0x9] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0xa] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0xb] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0xc] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0xd] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0xe] = be32_to_cpu(lo); ror56(hi, lo, 11);
-	ctx->sched[0xf] = be32_to_cpu(lo);
+	ctx->sched[0x0] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0x1] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0x2] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0x3] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0x4] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0x5] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0x6] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0x7] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0x8] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0x9] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0xa] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0xb] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0xc] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0xd] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0xe] = cpu_to_be32(lo); ror56(hi, lo, 11);
+	ctx->sched[0xf] = cpu_to_be32(lo);
 	return 0;
 #endif
 }


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 23/28] AFS: Add TestSetPageError() [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (21 preceding siblings ...)
  2007-12-05 19:40 ` [PATCH 22/28] fcrypt endianness misannotations " David Howells
@ 2007-12-05 19:40 ` David Howells
  2007-12-05 19:40 ` [PATCH 24/28] AFS: Add a function to excise a rejected write from the pagecache " David Howells
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:40 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Add a TestSetPageError() macro to the suite of page flag manipulators.  This
can be used by AFS to prevent over-excision of rejected writes from the page
cache.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/page-flags.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index fcc9e23..0350c37 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -130,6 +130,7 @@
 #define PageError(page)		test_bit(PG_error, &(page)->flags)
 #define SetPageError(page)	set_bit(PG_error, &(page)->flags)
 #define ClearPageError(page)	clear_bit(PG_error, &(page)->flags)
+#define TestSetPageError(page)	test_and_set_bit(PG_error, &(page)->flags)
 
 #define PageReferenced(page)	test_bit(PG_referenced, &(page)->flags)
 #define SetPageReferenced(page)	set_bit(PG_referenced, &(page)->flags)


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 24/28] AFS: Add a function to excise a rejected write from the pagecache [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (22 preceding siblings ...)
  2007-12-05 19:40 ` [PATCH 23/28] AFS: Add TestSetPageError() " David Howells
@ 2007-12-05 19:40 ` David Howells
  2007-12-14  4:21   ` Nick Piggin
  2007-12-17 22:54   ` David Howells
  2007-12-05 19:40 ` [PATCH 25/28] AFS: Improve handling of a rejected writeback " David Howells
                   ` (3 subsequent siblings)
  27 siblings, 2 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:40 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Add a function - cancel_rejected_write() - to excise a rejected write from the
pagecache.  This function is related to the truncation family of routines.  It
permits the pages modified by a network filesystem client (such as AFS) to be
excised and discarded from the pagecache if the attempt to write them back to
the server fails.

The dirty and writeback states of the afflicted pages are cancelled and the
pages themselves are detached for recycling.  All PTEs referring to those
pages are removed.

Note that the locking is tricky as it's very easy to deadlock against
truncate() and other routines once the pages have been unlocked as part of the
writeback process.  To this end, the PG_error flag is set, then the
PG_writeback flag is cleared, and only *then* can lock_page() be called.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/mm.h |    5 ++-
 mm/truncate.c      |   83 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 520238c..438270f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1005,12 +1005,13 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 extern unsigned long do_brk(unsigned long, unsigned long);
 
-/* filemap.c */
-extern unsigned long page_unuse(struct page *);
+/* truncate.c */
 extern void truncate_inode_pages(struct address_space *, loff_t);
 extern void truncate_inode_pages_range(struct address_space *,
 				       loff_t lstart, loff_t lend);
+extern void cancel_rejected_write(struct address_space *, pgoff_t, pgoff_t);
 
+/* filemap.c */
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
 
diff --git a/mm/truncate.c b/mm/truncate.c
index 5b7d1c5..95fc1a8 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -465,3 +465,86 @@ int invalidate_inode_pages2(struct address_space *mapping)
 	return invalidate_inode_pages2_range(mapping, 0, -1);
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
+
+/*
+ * Cancel that part of a rejected write that affects a particular page
+ */
+static void cancel_rejected_page(struct address_space *mapping,
+				 struct page *page, pgoff_t *_next)
+{
+	if (!TestSetPageError(page)) {
+		/* can't lock the page until we've cleared PG_writeback lest we
+		 * deadlock with truncate (amongst other things) */
+		end_page_writeback(page);
+		if (page->mapping == mapping) {
+			lock_page(page);
+			if (page->mapping == mapping) {
+				truncate_complete_page(mapping, page);
+				*_next = page->index + 1;
+			}
+			unlock_page(page);
+		}
+	} else if (PageWriteback(page) || PageDirty(page)) {
+		BUG();
+	}
+}
+
+/**
+ * cancel_rejected_write - Cancel a write on a contiguous set of pages
+ * @mapping: mapping affected
+ * @start: first page in set
+ * @end: last page in set
+ *
+ * Cancel a write of a contiguous set of pages when the writeback was rejected
+ * by the target medium or server.
+ *
+ * The pages in question are detached and discarded from the pagecache, and the
+ * writeback and dirty states are cleared prior to invalidation.  The caller
+ * must make sure that all the pages in the range are present in the pagecache,
+ * and the caller must hold PG_writeback on each of them.  NOTE! All the pages
+ * are locked and unlocked as part of this process, so the caller must take
+ * care to avoid deadlock.
+ *
+ * The PTEs pointing to those pages are also cleared, leading to the PTEs being
+ * reset when new pages are allocated and the contents reloaded.
+ */
+void cancel_rejected_write(struct address_space *mapping,
+			   pgoff_t start, pgoff_t end)
+{
+	struct pagevec pvec;
+	pgoff_t n;
+	int i;
+
+	BUG_ON(mapping->nrpages < end - start + 1);
+
+	/* dispose of any PTEs pointing to the affected pages */
+	unmap_mapping_range(mapping,
+			    (loff_t)start << PAGE_CACHE_SHIFT,
+			    (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT,
+			    0);
+
+	pagevec_init(&pvec, 0);
+	do {
+		cond_resched();
+		n = end - start + 1;
+		if (n > PAGEVEC_SIZE)
+			n = PAGEVEC_SIZE;
+		n = pagevec_lookup(&pvec, mapping, start, n);
+		for (i = 0; i < n; i++) {
+			struct page *page = pvec.pages[i];
+
+			if (page->index < start || page->index > end)
+				continue;
+			start++;
+			cancel_rejected_page(mapping, page, &start);
+		}
+		pagevec_release(&pvec);
+	} while (start - 1 < end);
+
+	/* dispose of any new PTEs pointing to the affected pages */
+	unmap_mapping_range(mapping,
+			    (loff_t)start << PAGE_CACHE_SHIFT,
+			    (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT,
+			    0);
+}
+EXPORT_SYMBOL_GPL(cancel_rejected_write);


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 25/28] AFS: Improve handling of a rejected writeback [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (23 preceding siblings ...)
  2007-12-05 19:40 ` [PATCH 24/28] AFS: Add a function to excise a rejected write from the pagecache " David Howells
@ 2007-12-05 19:40 ` David Howells
  2007-12-05 19:40 ` [PATCH 26/28] AF_RXRPC: Save the operation ID for debugging " David Howells
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:40 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Improve the handling of the case of a server rejecting an attempt to write back
a cached write.  AFS operates a write-back cache, so the following sequence of
events can theoretically occur:

	CLIENT 1		CLIENT 2
	=======================	=======================
	cat data >/the/file
	 (sits in pagecache)
				fs setacl -dir /the/dir/of/the/file \
					-acl system:administrators rlidka
				 (write permission removed for client 1)
	sync
	 (writeback attempt fails)

The way AFS attempts to handle this is:

 (1) The affected region will be excised and discarded on the basis that it
     can't be written back, yet we don't want it lurking in the page cache
     either.  The contents of the affected region will be reread from the
     server when called for again.

 (2) The EOF size will be set to the current server-based file size - usually
     that which it was before the affected write was made - assuming no
     conflicting write has been appended, and assuming the affected write
     extended the file.


This patch makes the following changes:

 (1) Zero-length short reads don't produce EBADMSG now just because the OpenAFS
     server puts a silly value as the size of the returned data.  This prevents
     excised pages beyond the revised EOF being reinstantiated with a surprise
     PG_error.

 (2) Writebacks can now be put into a 'rejected' state in which all further
     attempts to write them back will result in excision of the affected pages
     instead.

 (3) Preparing a page for overwriting now reads the whole page instead of just
     those parts of it that aren't to be covered by the copy to be made.  This
     handles the possibility that the copy might fail on EFAULT.  Corollary to
     this, PG_update can now be set by afs_prepare_page() on behalf of
     afs_prepare_write() rather than setting it in afs_commit_write().

 (4) In the case of a conflicting write, afs_prepare_write() will attempt to
     flush the write to the server, and will then wait for PG_writeback to go
     away - after unlocking the page.  This helps prevent deadlock against the
     writeback-rejection handler.  AOP_TRUNCATED_PAGE is then returned to the
     caller to signify that the page has been unlocked, and that it should be
     revalidated.

 (5) The writeback-rejection handler now calls cancel_rejected_write() added by
     the previous patch to excise the affected pages rather than clearing the
     PG_uptodate flag on all the pages.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/fsclient.c |    4 +
 fs/afs/internal.h |    1 
 fs/afs/write.c    |  154 ++++++++++++++++++++++++++++-------------------------
 3 files changed, 85 insertions(+), 74 deletions(-)

diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index 023b95b..04584c0 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -353,7 +353,9 @@ static int afs_deliver_fs_fetch_data(struct afs_call *call,
 
 		call->count = ntohl(call->tmp);
 		_debug("DATA length: %u", call->count);
-		if (call->count > PAGE_SIZE)
+		if ((s32) call->count < 0)
+			call->count = 0; /* access completely beyond EOF */
+		else if (call->count > PAGE_SIZE)
 			return -EBADMSG;
 		call->offset = 0;
 		call->unmarshall++;
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 5ca3625..84b90b0 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -156,6 +156,7 @@ struct afs_writeback {
 		AFS_WBACK_PENDING,		/* write pending */
 		AFS_WBACK_CONFLICTING,		/* conflicting writes posted */
 		AFS_WBACK_WRITING,		/* writing back */
+		AFS_WBACK_REJECTED,		/* the writeback was rejected */
 		AFS_WBACK_COMPLETE		/* the writeback record has been unlinked */
 	} state __attribute__((packed));
 };
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 9a849ad..add5892 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -81,18 +81,16 @@ void afs_put_writeback(struct afs_writeback *wb)
 }
 
 /*
- * partly or wholly fill a page that's under preparation for writing
+ * fill a page that's under preparation for writing
  */
 static int afs_fill_page(struct afs_vnode *vnode, struct key *key,
-			 unsigned start, unsigned len, struct page *page)
+			 unsigned len, struct page *page)
 {
 	int ret;
 
-	_enter(",,%u,%u", start, len);
+	_enter(",,%u,", len);
 
-	ASSERTCMP(start + len, <=, PAGE_SIZE);
-
-	ret = afs_vnode_fetch_data(vnode, key, start, len, page);
+	ret = afs_vnode_fetch_data(vnode, key, 0, len, page);
 	if (ret < 0) {
 		if (ret == -ENOENT) {
 			_debug("got NOENT from server"
@@ -110,18 +108,15 @@ static int afs_fill_page(struct afs_vnode *vnode, struct key *key,
  * prepare a page for being written to
  */
 static int afs_prepare_page(struct afs_vnode *vnode, struct page *page,
-			    struct key *key, unsigned offset, unsigned to)
+			    struct key *key)
 {
-	unsigned eof, tail, start, stop, len;
+	unsigned len;
 	loff_t i_size, pos;
 	void *p;
 	int ret;
 
 	_enter("");
 
-	if (offset == 0 && to == PAGE_SIZE)
-		return 0;
-
 	p = kmap_atomic(page, KM_USER0);
 
 	i_size = i_size_read(&vnode->vfs_inode);
@@ -129,10 +124,7 @@ static int afs_prepare_page(struct afs_vnode *vnode, struct page *page,
 	if (pos >= i_size) {
 		/* partial write, page beyond EOF */
 		_debug("beyond");
-		if (offset > 0)
-			memset(p, 0, offset);
-		if (to < PAGE_SIZE)
-			memset(p + to, 0, PAGE_SIZE - to);
+		memset(p, 0, PAGE_SIZE);
 		kunmap_atomic(p, KM_USER0);
 		return 0;
 	}
@@ -140,31 +132,20 @@ static int afs_prepare_page(struct afs_vnode *vnode, struct page *page,
 	if (i_size - pos >= PAGE_SIZE) {
 		/* partial write, page entirely before EOF */
 		_debug("before");
-		tail = eof = PAGE_SIZE;
+		len = PAGE_SIZE;
 	} else {
 		/* partial write, page overlaps EOF */
-		eof = i_size - pos;
-		_debug("overlap %u", eof);
-		tail = max(eof, to);
-		if (tail < PAGE_SIZE)
-			memset(p + tail, 0, PAGE_SIZE - tail);
-		if (offset > eof)
-			memset(p + eof, 0, PAGE_SIZE - eof);
+		len = i_size - pos;
+		_debug("overlap %u", len);
+		ASSERTRANGE(0, <, len, <, PAGE_SIZE);
+		memset(p + len, 0, PAGE_SIZE - len);
 	}
 
 	kunmap_atomic(p, KM_USER0);
 
-	ret = 0;
-	if (offset > 0 || eof > to) {
-		/* need to fill one or two bits that aren't going to be written
-		 * (cover both fillers in one read if there are two) */
-		start = (offset > 0) ? 0 : to;
-		stop = (eof > to) ? eof : offset;
-		len = stop - start;
-		_debug("wr=%u-%u av=0-%u rd=%u@%u",
-		       offset, to, eof, start, len);
-		ret = afs_fill_page(vnode, key, start, len, page);
-	}
+	ret = afs_fill_page(vnode, key, len, page);
+	if (ret == 0)
+		SetPageUptodate(page);
 
 	_leave(" = %d", ret);
 	return ret;
@@ -187,6 +168,8 @@ int afs_prepare_write(struct file *file, struct page *page,
 	_enter("{%x:%u},{%lx},%u,%u",
 	       vnode->fid.vid, vnode->fid.vnode, page->index, offset, to);
 
+	BUG_ON(PageError(page));
+
 	candidate = kzalloc(sizeof(*candidate), GFP_KERNEL);
 	if (!candidate)
 		return -ENOMEM;
@@ -200,7 +183,7 @@ int afs_prepare_write(struct file *file, struct page *page,
 
 	if (!PageUptodate(page)) {
 		_debug("not up to date");
-		ret = afs_prepare_page(vnode, page, key, offset, to);
+		ret = afs_prepare_page(vnode, page, key);
 		if (ret < 0) {
 			kfree(candidate);
 			_leave(" = %d [prep]", ret);
@@ -269,21 +252,41 @@ flush_conflicting_wb:
 	_debug("flush conflict");
 	if (wb->state == AFS_WBACK_PENDING)
 		wb->state = AFS_WBACK_CONFLICTING;
+	wb->usage++;
 	spin_unlock(&vnode->writeback_lock);
-	if (PageDirty(page)) {
+	if (!PageDirty(page) && !PageWriteback(page)) {
+		/* no change outstanding - just reuse the page */
+		_debug("reuse");
+		afs_put_writeback(wb);
+		set_page_private(page, 0);
+		ClearPagePrivate(page);
+		goto try_again;
+	}
+
+	kfree(candidate);
+
+	/* if we're busy writing back a conflicting write, then unlock the page
+	 * and wait for the writeback to complete - this lets the process doing
+	 * the write-out handle rejection without deadlock */
+	if (PageWriteback(page)) {
+		_debug("wait wb");
+		unlock_page(page);
+	} else {
+		/* there's a conflicting modification we have to write back and
+		 * wait for before letting the next one proceed */
+		_debug("dirty");
 		ret = afs_write_back_from_locked_page(wb, page);
 		if (ret < 0) {
-			afs_put_writeback(candidate);
+			afs_put_writeback(wb);
 			_leave(" = %d", ret);
 			return ret;
 		}
 	}
 
-	/* the page holds a ref on the writeback record */
 	afs_put_writeback(wb);
-	set_page_private(page, 0);
-	ClearPagePrivate(page);
-	goto try_again;
+	wait_on_page_writeback(page);
+	_leave(" = A_T_P");
+	return AOP_TRUNCATED_PAGE;
 }
 
 /*
@@ -310,7 +313,6 @@ int afs_commit_write(struct file *file, struct page *page,
 		spin_unlock(&vnode->writeback_lock);
 	}
 
-	SetPageUptodate(page);
 	set_page_dirty(page);
 	if (PageDirty(page))
 		_debug("dirtied");
@@ -319,38 +321,35 @@ int afs_commit_write(struct file *file, struct page *page,
 }
 
 /*
- * kill all the pages in the given range
+ * note the failure of a write, either due to an error or to a permission
+ * failure
+ * - all the pages in the affected range must have PG_writeback set
+ * - the caller must be responsible for the pages: no-one else should be trying
+ *   to note rejection
  */
-static void afs_kill_pages(struct afs_vnode *vnode, bool error,
-			   pgoff_t first, pgoff_t last)
+static void afs_write_rejected(struct afs_writeback *wb, bool error,
+			       pgoff_t first, pgoff_t last)
 {
-	struct pagevec pv;
-	unsigned count, loop;
+	struct afs_vnode *vnode = wb->vnode;
+	loff_t i_size;
 
 	_enter("{%x:%u},%lx-%lx",
 	       vnode->fid.vid, vnode->fid.vnode, first, last);
 
-	pagevec_init(&pv, 0);
-
-	do {
-		_debug("kill %lx-%lx", first, last);
-
-		count = last - first + 1;
-		if (count > PAGEVEC_SIZE)
-			count = PAGEVEC_SIZE;
-		pv.nr = find_get_pages_contig(vnode->vfs_inode.i_mapping,
-					      first, count, pv.pages);
-		ASSERTCMP(pv.nr, ==, count);
-
-		for (loop = 0; loop < count; loop++) {
-			ClearPageUptodate(pv.pages[loop]);
-			if (error)
-				SetPageError(pv.pages[loop]);
-			end_page_writeback(pv.pages[loop]);
-		}
+	spin_lock(&vnode->writeback_lock);
+	wb->state = AFS_WBACK_REJECTED;
+
+	/* wind back the file size if this write extended the file, and wasn't
+	 * followed by a conflicting write */
+	i_size = ((loff_t) wb->last) << PAGE_SHIFT;
+	i_size += wb->to_last;
+	if (i_size_read(&vnode->vfs_inode) == i_size) {
+		_debug("shorten");
+		i_size_write(&vnode->vfs_inode, vnode->status.size);
+	}
 
-		__pagevec_release(&pv);
-	} while (first < last);
+	spin_unlock(&vnode->writeback_lock);
+	cancel_rejected_write(vnode->vfs_inode.i_mapping, first, last);
 
 	_leave("");
 }
@@ -358,6 +357,7 @@ static void afs_kill_pages(struct afs_vnode *vnode, bool error,
 /*
  * synchronously write back the locked page and any subsequent non-locked dirty
  * pages also covered by the same writeback record
+ * - all pages written will be unlocked prior to returning
  */
 static int afs_write_back_from_locked_page(struct afs_writeback *wb,
 					   struct page *primary_page)
@@ -407,6 +407,7 @@ static int afs_write_back_from_locked_page(struct afs_writeback *wb,
 			if (TestSetPageLocked(page))
 				break;
 			if (!PageDirty(page) ||
+			    PageWriteback(page) ||
 			    page_private(page) != (unsigned long) wb) {
 				unlock_page(page);
 				break;
@@ -430,14 +431,22 @@ static int afs_write_back_from_locked_page(struct afs_writeback *wb,
 
 no_more:
 	/* we now have a contiguous set of dirty pages, each with writeback set
-	 * and the dirty mark cleared; the first page is locked and must remain
-	 * so, all the rest are unlocked */
+	 * and the dirty mark cleared; all the pages barring the first are now
+	 * unlocked */
 	first = primary_page->index;
 	last = first + count - 1;
+	unlock_page(primary_page);
 
 	offset = (first == wb->first) ? wb->offset_first : 0;
 	to = (last == wb->last) ? wb->to_last : PAGE_SIZE;
 
+	if (wb->state == AFS_WBACK_REJECTED) {
+		cancel_rejected_write(wb->vnode->vfs_inode.i_mapping,
+				      first, last);
+		ret = 0;
+		goto out;
+	}
+
 	_debug("write back %lx[%u..] to %lx[..%u]", first, offset, last, to);
 
 	ret = afs_vnode_store_data(wb, first, last, offset, to);
@@ -455,7 +464,7 @@ no_more:
 		case -ENOENT:
 		case -ENOMEDIUM:
 		case -ENXIO:
-			afs_kill_pages(wb->vnode, true, first, last);
+			afs_write_rejected(wb, true, first, last);
 			set_bit(AS_EIO, &wb->vnode->vfs_inode.i_mapping->flags);
 			break;
 		case -EACCES:
@@ -464,7 +473,7 @@ no_more:
 		case -EKEYEXPIRED:
 		case -EKEYREJECTED:
 		case -EKEYREVOKED:
-			afs_kill_pages(wb->vnode, false, first, last);
+			afs_write_rejected(wb, false, first, last);
 			break;
 		default:
 			break;
@@ -473,6 +482,7 @@ no_more:
 		ret = count;
 	}
 
+out:
 	_leave(" = %d", ret);
 	return ret;
 }
@@ -493,7 +503,6 @@ int afs_writepage(struct page *page, struct writeback_control *wbc)
 	ASSERT(wb != NULL);
 
 	ret = afs_write_back_from_locked_page(wb, page);
-	unlock_page(page);
 	if (ret < 0) {
 		_leave(" = %d", ret);
 		return 0;
@@ -565,7 +574,6 @@ static int afs_writepages_region(struct address_space *mapping,
 		spin_unlock(&wb->vnode->writeback_lock);
 
 		ret = afs_write_back_from_locked_page(wb, page);
-		unlock_page(page);
 		page_cache_release(page);
 		if (ret < 0) {
 			_leave(" = %d", ret);


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 26/28] AF_RXRPC: Save the operation ID for debugging [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (24 preceding siblings ...)
  2007-12-05 19:40 ` [PATCH 25/28] AFS: Improve handling of a rejected writeback " David Howells
@ 2007-12-05 19:40 ` David Howells
  2007-12-05 19:40 ` [PATCH 27/28] AFS: Implement shared-writable mmap " David Howells
  2007-12-05 19:40 ` [PATCH 28/28] FS-Cache: Make kAFS use FS-Cache " David Howells
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:40 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Save the operation ID to be used with a call that we're making for display
through /proc/net/rxrpc_calls.  This helps debugging stuck operations as we
then know what they are.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/fsclient.c       |   32 +++++++++++++++++++++++---------
 fs/afs/rxrpc.c          |    1 +
 fs/afs/vlclient.c       |    2 ++
 include/net/af_rxrpc.h  |    1 +
 net/rxrpc/af_rxrpc.c    |    3 +++
 net/rxrpc/ar-internal.h |    1 +
 net/rxrpc/ar-proc.c     |    7 ++++---
 7 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index 04584c0..a468f2d 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -287,6 +287,7 @@ int afs_fs_fetch_file_status(struct afs_server *server,
 	call->reply2 = volsync;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
+	call->operation_ID = htonl(FSFETCHSTATUS);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -316,7 +317,7 @@ static int afs_deliver_fs_fetch_data(struct afs_call *call,
 	case 0:
 		call->offset = 0;
 		call->unmarshall++;
-		if (call->operation_ID != FSFETCHDATA64) {
+		if (call->operation_ID != htonl(FSFETCHDATA64)) {
 			call->unmarshall++;
 			goto no_msw;
 		}
@@ -464,7 +465,7 @@ static int afs_fs_fetch_data64(struct afs_server *server,
 	call->reply3 = buffer;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
-	call->operation_ID = FSFETCHDATA64;
+	call->operation_ID = htonl(FSFETCHDATA64);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -509,7 +510,7 @@ int afs_fs_fetch_data(struct afs_server *server,
 	call->reply3 = buffer;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
-	call->operation_ID = FSFETCHDATA;
+	call->operation_ID = htonl(FSFETCHDATA);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -577,6 +578,7 @@ int afs_fs_give_up_callbacks(struct afs_server *server,
 
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
+	call->operation_ID = htonl(FSGIVEUPCALLBACKS);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -683,10 +685,11 @@ int afs_fs_create(struct afs_server *server,
 	call->reply4 = newcb;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
+	call->operation_ID = htonl(S_ISDIR(mode) ? FSMAKEDIR : FSCREATEFILE);
 
 	/* marshall the parameters */
 	bp = call->request;
-	*bp++ = htonl(S_ISDIR(mode) ? FSMAKEDIR : FSCREATEFILE);
+	*bp++ = call->operation_ID;
 	*bp++ = htonl(vnode->fid.vid);
 	*bp++ = htonl(vnode->fid.vnode);
 	*bp++ = htonl(vnode->fid.unique);
@@ -772,10 +775,11 @@ int afs_fs_remove(struct afs_server *server,
 	call->reply = vnode;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
+	call->operation_ID = htonl(isdir ? FSREMOVEDIR : FSREMOVEFILE);
 
 	/* marshall the parameters */
 	bp = call->request;
-	*bp++ = htonl(isdir ? FSREMOVEDIR : FSREMOVEFILE);
+	*bp++ = call->operation_ID;
 	*bp++ = htonl(vnode->fid.vid);
 	*bp++ = htonl(vnode->fid.vnode);
 	*bp++ = htonl(vnode->fid.unique);
@@ -857,6 +861,7 @@ int afs_fs_link(struct afs_server *server,
 	call->reply2 = vnode;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
+	call->operation_ID = htonl(FSLINK);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -954,6 +959,7 @@ int afs_fs_symlink(struct afs_server *server,
 	call->reply3 = newstatus;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
+	call->operation_ID = htonl(FSSYMLINK);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -1062,6 +1068,7 @@ int afs_fs_rename(struct afs_server *server,
 	call->reply2 = new_dvnode;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
+	call->operation_ID = htonl(FSRENAME);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -1178,6 +1185,7 @@ static int afs_fs_store_data64(struct afs_server *server,
 	call->last_to = to;
 	call->send_pages = true;
 	call->store_version = vnode->status.data_version + 1;
+	call->operation_ID = htonl(FSSTOREDATA64);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -1255,6 +1263,7 @@ int afs_fs_store_data(struct afs_server *server, struct afs_writeback *wb,
 	call->last_to = to;
 	call->send_pages = true;
 	call->store_version = vnode->status.data_version + 1;
+	call->operation_ID = htonl(FSSTOREDATA);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -1303,7 +1312,8 @@ static int afs_deliver_fs_store_status(struct afs_call *call,
 
 	/* unmarshall the reply once we've received all of it */
 	store_version = NULL;
-	if (call->operation_ID == FSSTOREDATA)
+	if (call->operation_ID == htonl(FSSTOREDATA) ||
+	    call->operation_ID == htonl(FSSTOREDATA64))
 		store_version = &call->store_version;
 
 	bp = call->buffer;
@@ -1365,7 +1375,7 @@ static int afs_fs_setattr_size64(struct afs_server *server, struct key *key,
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
 	call->store_version = vnode->status.data_version + 1;
-	call->operation_ID = FSSTOREDATA;
+	call->operation_ID = htonl(FSSTOREDATA64);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -1416,7 +1426,7 @@ static int afs_fs_setattr_size(struct afs_server *server, struct key *key,
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
 	call->store_version = vnode->status.data_version + 1;
-	call->operation_ID = FSSTOREDATA;
+	call->operation_ID = htonl(FSSTOREDATA);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -1462,7 +1472,7 @@ int afs_fs_setattr(struct afs_server *server, struct key *key,
 	call->reply = vnode;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
-	call->operation_ID = FSSTORESTATUS;
+	call->operation_ID = htonl(FSSTORESTATUS);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -1742,6 +1752,7 @@ int afs_fs_get_volume_status(struct afs_server *server,
 	call->reply3 = tmpbuf;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
+	call->operation_ID = htonl(FSGETVOLUMESTATUS);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -1828,6 +1839,7 @@ int afs_fs_set_lock(struct afs_server *server,
 	call->reply = vnode;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
+	call->operation_ID = htonl(FSSETLOCK);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -1861,6 +1873,7 @@ int afs_fs_extend_lock(struct afs_server *server,
 	call->reply = vnode;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
+	call->operation_ID = htonl(FSEXTENDLOCK);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -1893,6 +1906,7 @@ int afs_fs_release_lock(struct afs_server *server,
 	call->reply = vnode;
 	call->service_id = FS_SERVICE;
 	call->port = htons(AFS_FS_PORT);
+	call->operation_ID = htonl(FSRELEASELOCK);
 
 	/* marshall the parameters */
 	bp = call->request;
diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index bde3f19..348caf4 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -336,6 +336,7 @@ int afs_make_call(struct in_addr *addr, struct afs_call *call, gfp_t gfp,
 
 	/* create a call */
 	rxcall = rxrpc_kernel_begin_call(afs_socket, &srx, call->key,
+					 ntohl(call->operation_ID),
 					 (unsigned long) call, gfp);
 	call->key = NULL;
 	if (IS_ERR(rxcall)) {
diff --git a/fs/afs/vlclient.c b/fs/afs/vlclient.c
index 36c1306..60ddd4f 100644
--- a/fs/afs/vlclient.c
+++ b/fs/afs/vlclient.c
@@ -170,6 +170,7 @@ int afs_vl_get_entry_by_name(struct in_addr *addr,
 	call->reply = entry;
 	call->service_id = VL_SERVICE;
 	call->port = htons(AFS_VL_PORT);
+	call->operation_ID = htonl(VLGETENTRYBYNAME);
 
 	/* marshall the parameters */
 	bp = call->request;
@@ -206,6 +207,7 @@ int afs_vl_get_entry_by_id(struct in_addr *addr,
 	call->reply = entry;
 	call->service_id = VL_SERVICE;
 	call->port = htons(AFS_VL_PORT);
+	call->operation_ID = htonl(VLGETENTRYBYID);
 
 	/* marshall the parameters */
 	bp = call->request;
diff --git a/include/net/af_rxrpc.h b/include/net/af_rxrpc.h
index 00c2eaa..7e99733 100644
--- a/include/net/af_rxrpc.h
+++ b/include/net/af_rxrpc.h
@@ -38,6 +38,7 @@ extern void rxrpc_kernel_intercept_rx_messages(struct socket *,
 extern struct rxrpc_call *rxrpc_kernel_begin_call(struct socket *,
 						  struct sockaddr_rxrpc *,
 						  struct key *,
+						  u32,
 						  unsigned long,
 						  gfp_t);
 extern int rxrpc_kernel_send_data(struct rxrpc_call *, struct msghdr *,
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index d638945..bd59665 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -253,6 +253,7 @@ static struct rxrpc_transport *rxrpc_name_to_transport(struct socket *sock,
  * @sock: The socket on which to make the call
  * @srx: The address of the peer to contact (defaults to socket setting)
  * @key: The security context to use (defaults to socket setting)
+ * @operation_ID: The operation ID for this call (debugging only)
  * @user_call_ID: The ID to use
  *
  * Allow a kernel service to begin a call on the nominated socket.  This just
@@ -265,6 +266,7 @@ static struct rxrpc_transport *rxrpc_name_to_transport(struct socket *sock,
 struct rxrpc_call *rxrpc_kernel_begin_call(struct socket *sock,
 					   struct sockaddr_rxrpc *srx,
 					   struct key *key,
+					   u32 operation_ID,
 					   unsigned long user_call_ID,
 					   gfp_t gfp)
 {
@@ -313,6 +315,7 @@ struct rxrpc_call *rxrpc_kernel_begin_call(struct socket *sock,
 	call = rxrpc_get_client_call(rx, trans, bundle, user_call_ID, true,
 				     gfp);
 	rxrpc_put_bundle(trans, bundle);
+	call->op_id = operation_ID;
 out:
 	rxrpc_put_transport(trans);
 	release_sock(&rx->sk);
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 58aaf89..f362e7e 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -367,6 +367,7 @@ struct rxrpc_call {
 		RXRPC_CALL_DEAD,		/* - call is dead */
 	} state;
 	int			debug_id;	/* debug ID for printks */
+	u32			op_id;		/* operation ID (for debugging only) */
 	u8			channel;	/* connection channel occupied by this call */
 
 	/* transmission-phase ACK management */
diff --git a/net/rxrpc/ar-proc.c b/net/rxrpc/ar-proc.c
index 2e83ce3..521b826 100644
--- a/net/rxrpc/ar-proc.c
+++ b/net/rxrpc/ar-proc.c
@@ -53,8 +53,8 @@ static int rxrpc_call_seq_show(struct seq_file *seq, void *v)
 	if (v == &rxrpc_calls) {
 		seq_puts(seq,
 			 "Proto Local                  Remote                "
-			 " SvID ConnID   CallID   End Use State    Abort   "
-			 " UserID\n");
+			 " SvID ConnID   CallID   OpID     End Use State   "
+			 " Abort    UserID\n");
 		return 0;
 	}
 
@@ -70,13 +70,14 @@ static int rxrpc_call_seq_show(struct seq_file *seq, void *v)
 		ntohs(trans->peer->srx.transport.sin.sin_port));
 
 	seq_printf(seq,
-		   "UDP   %-22.22s %-22.22s %4x %08x %08x %s %3u"
+		   "UDP   %-22.22s %-22.22s %4x %08x %08x %08x %s %3u"
 		   " %-8.8s %08x %lx\n",
 		   lbuff,
 		   rbuff,
 		   ntohs(call->conn->service_id),
 		   ntohl(call->conn->cid),
 		   ntohl(call->call_id),
+		   call->op_id,
 		   call->conn->in_clientflag ? "Svc" : "Clt",
 		   atomic_read(&call->usage),
 		   rxrpc_call_states[call->state],


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 27/28] AFS: Implement shared-writable mmap [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (25 preceding siblings ...)
  2007-12-05 19:40 ` [PATCH 26/28] AF_RXRPC: Save the operation ID for debugging " David Howells
@ 2007-12-05 19:40 ` David Howells
  2007-12-05 19:40 ` [PATCH 28/28] FS-Cache: Make kAFS use FS-Cache " David Howells
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:40 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

Implement shared-writable mmap for AFS.

The key with which to access the file is obtained from the VMA at the point
where the PTE is made writable by the page_mkwrite() VMA op and cached in the
affected page.

If there's an outstanding write on the page made with a different key, then
page_mkwrite() will flush it before attaching a record of the new key.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/file.c     |   20 +++++++++++++++++++-
 fs/afs/internal.h |    1 +
 fs/afs/write.c    |   35 +++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 1 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 525f7c5..1323df4 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -22,6 +22,7 @@ static int afs_readpage(struct file *file, struct page *page);
 static void afs_invalidatepage(struct page *page, unsigned long offset);
 static int afs_releasepage(struct page *page, gfp_t gfp_flags);
 static int afs_launder_page(struct page *page);
+static int afs_mmap(struct file *file, struct vm_area_struct *vma);
 
 const struct file_operations afs_file_operations = {
 	.open		= afs_open,
@@ -31,7 +32,7 @@ const struct file_operations afs_file_operations = {
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
 	.aio_write	= afs_file_write,
-	.mmap		= generic_file_readonly_mmap,
+	.mmap		= afs_mmap,
 	.splice_read	= generic_file_splice_read,
 	.fsync		= afs_fsync,
 	.lock		= afs_lock,
@@ -56,6 +57,11 @@ const struct address_space_operations afs_fs_aops = {
 	.writepages	= afs_writepages,
 };
 
+static struct vm_operations_struct afs_file_vm_ops = {
+	.fault		= filemap_fault,
+	.page_mkwrite	= afs_page_mkwrite,
+};
+
 /*
  * open an AFS file or directory and attach a key to it
  */
@@ -295,3 +301,15 @@ static int afs_releasepage(struct page *page, gfp_t gfp_flags)
 	_leave(" = 0");
 	return 0;
 }
+
+/*
+ * memory map part of an AFS file
+ */
+static int afs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	_enter("");
+
+	file_accessed(file);
+	vma->vm_ops = &afs_file_vm_ops;
+	return 0;
+}
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 84b90b0..b3da9ab 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -742,6 +742,7 @@ extern ssize_t afs_file_write(struct kiocb *, const struct iovec *,
 			      unsigned long, loff_t);
 extern int afs_writeback_all(struct afs_vnode *);
 extern int afs_fsync(struct file *, struct dentry *, int);
+extern int afs_page_mkwrite(struct vm_area_struct *, struct page *);
 
 
 /*****************************************************************************/
diff --git a/fs/afs/write.c b/fs/afs/write.c
index add5892..8a3e9e2 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -155,6 +155,8 @@ static int afs_prepare_page(struct afs_vnode *vnode, struct page *page,
  * prepare to perform part of a write to a page
  * - the caller holds the page locked, preventing it from being written out or
  *   modified by anyone else
+ * - may be called from afs_page_mkwrite() to set up a page for modification
+ *   through shared-writable mmap
  */
 int afs_prepare_write(struct file *file, struct page *page,
 		      unsigned offset, unsigned to)
@@ -833,3 +835,36 @@ int afs_fsync(struct file *file, struct dentry *dentry, int datasync)
 	_leave(" = %d", ret);
 	return ret;
 }
+
+/*
+ * notification that a previously read-only page is about to become writable
+ * - if it returns an error, the caller will deliver a bus error signal
+ *
+ * we use this to make a record of the key with which the writeback should be
+ * performed and to flush any outstanding writes made with a different key
+ *
+ * the key to be used is attached to the struct file pinned by the VMA
+ */
+int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+	struct afs_vnode *vnode = AFS_FS_I(vma->vm_file->f_mapping->host);
+	struct key *key = vma->vm_file->private_data;
+	int ret;
+
+	_enter("{{%x:%u},%x},{%lx}",
+	       vnode->fid.vid, vnode->fid.vnode, key_serial(key), page->index);
+
+	do {
+		lock_page(page);
+		if (page->mapping == vma->vm_file->f_mapping)
+			ret = afs_prepare_write(vma->vm_file, page, 0,
+						PAGE_SIZE);
+		else
+			ret = 0; /* seems there was interference - let the
+				  * caller deal with it */
+		unlock_page(page);
+	} while (ret == AOP_TRUNCATED_PAGE);
+
+	_leave(" = %d", ret);
+	return ret;
+}


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 28/28] FS-Cache: Make kAFS use FS-Cache [try #2]
  2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
                   ` (26 preceding siblings ...)
  2007-12-05 19:40 ` [PATCH 27/28] AFS: Implement shared-writable mmap " David Howells
@ 2007-12-05 19:40 ` David Howells
  27 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-05 19:40 UTC (permalink / raw)
  To: viro, hch, Trond.Myklebust, sds, casey
  Cc: linux-kernel, selinux, linux-security-module, dhowells

The attached patch makes the kAFS filesystem in fs/afs/ use FS-Cache, and
through it any attached caches.  The kAFS filesystem will use caching
automatically if it's available.

Signed-Off-By: David Howells <dhowells@redhat.com>
---

 fs/Kconfig         |    8 +
 fs/afs/Makefile    |    3 
 fs/afs/cache.c     |  505 ++++++++++++++++++++++++++++++++++------------------
 fs/afs/cache.h     |   15 --
 fs/afs/cell.c      |   16 +-
 fs/afs/file.c      |  212 +++++++++++++---------
 fs/afs/inode.c     |   26 +--
 fs/afs/internal.h  |   53 ++---
 fs/afs/main.c      |   27 +--
 fs/afs/mntpt.c     |    4 
 fs/afs/vlocation.c |   23 +-
 fs/afs/volume.c    |   14 -
 fs/afs/write.c     |    6 -
 13 files changed, 537 insertions(+), 375 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 83d1227..7f3278f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -2120,6 +2120,14 @@ config AFS_DEBUG
 
 	  If unsure, say N.
 
+config AFS_FSCACHE
+	bool "Provide AFS client caching support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	depends on AFS_FS=m && FSCACHE || AFS_FS=y && FSCACHE=y
+	help
+	  Say Y here if you want AFS data to be cached locally on disk through
+	  the generic filesystem cache manager
+
 config 9P_FS
 	tristate "Plan 9 Resource Sharing Support (9P2000) (Experimental)"
 	depends on INET && NET_9P && EXPERIMENTAL
diff --git a/fs/afs/Makefile b/fs/afs/Makefile
index a666710..4f64b95 100644
--- a/fs/afs/Makefile
+++ b/fs/afs/Makefile
@@ -2,7 +2,10 @@
 # Makefile for Red Hat Linux AFS client.
 #
 
+afs-cache-$(CONFIG_AFS_FSCACHE) := cache.o
+
 kafs-objs := \
+	$(afs-cache-y) \
 	callback.o \
 	cell.o \
 	cmservice.o \
diff --git a/fs/afs/cache.c b/fs/afs/cache.c
index de0d7de..8e179a9 100644
--- a/fs/afs/cache.c
+++ b/fs/afs/cache.c
@@ -9,248 +9,399 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_cell_cache_match(void *target,
-						const void *entry);
-static void afs_cell_cache_update(void *source, void *entry);
-
-struct cachefs_index_def afs_cache_cell_index_def = {
-	.name			= "cell_ix",
-	.data_size		= sizeof(struct afs_cache_cell),
-	.keys[0]		= { CACHEFS_INDEX_KEYS_ASCIIZ, 64 },
-	.match			= afs_cell_cache_match,
-	.update			= afs_cell_cache_update,
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include "internal.h"
+
+static uint16_t afs_cell_cache_get_key(const void *cookie_netfs_data,
+				       void *buffer, uint16_t buflen);
+static uint16_t afs_cell_cache_get_aux(const void *cookie_netfs_data,
+				       void *buffer, uint16_t buflen);
+static enum fscache_checkaux afs_cell_cache_check_aux(void *cookie_netfs_data,
+						      const void *buffer,
+						      uint16_t buflen);
+
+static uint16_t afs_vlocation_cache_get_key(const void *cookie_netfs_data,
+					    void *buffer, uint16_t buflen);
+static uint16_t afs_vlocation_cache_get_aux(const void *cookie_netfs_data,
+					    void *buffer, uint16_t buflen);
+static enum fscache_checkaux afs_vlocation_cache_check_aux(
+	void *cookie_netfs_data, const void *buffer, uint16_t buflen);
+
+static uint16_t afs_volume_cache_get_key(const void *cookie_netfs_data,
+					 void *buffer, uint16_t buflen);
+
+static uint16_t afs_vnode_cache_get_key(const void *cookie_netfs_data,
+					void *buffer, uint16_t buflen);
+static void afs_vnode_cache_get_attr(const void *cookie_netfs_data,
+				     uint64_t *size);
+static uint16_t afs_vnode_cache_get_aux(const void *cookie_netfs_data,
+					void *buffer, uint16_t buflen);
+static enum fscache_checkaux afs_vnode_cache_check_aux(void *cookie_netfs_data,
+						       const void *buffer,
+						       uint16_t buflen);
+static void afs_vnode_cache_now_uncached(void *cookie_netfs_data);
+
+static struct fscache_netfs_operations afs_cache_ops = {
+};
+
+struct fscache_netfs afs_cache_netfs = {
+	.name			= "afs",
+	.version		= 0,
+	.ops			= &afs_cache_ops,
+};
+
+struct fscache_cookie_def afs_cell_cache_index_def = {
+	.name		= "AFS.cell",
+	.type		= FSCACHE_COOKIE_TYPE_INDEX,
+	.get_key	= afs_cell_cache_get_key,
+	.get_aux	= afs_cell_cache_get_aux,
+	.check_aux	= afs_cell_cache_check_aux,
+};
+
+struct fscache_cookie_def afs_vlocation_cache_index_def = {
+	.name			= "AFS.vldb",
+	.type			= FSCACHE_COOKIE_TYPE_INDEX,
+	.get_key		= afs_vlocation_cache_get_key,
+	.get_aux		= afs_vlocation_cache_get_aux,
+	.check_aux		= afs_vlocation_cache_check_aux,
+};
+
+struct fscache_cookie_def afs_volume_cache_index_def = {
+	.name		= "AFS.volume",
+	.type		= FSCACHE_COOKIE_TYPE_INDEX,
+	.get_key	= afs_volume_cache_get_key,
+};
+
+struct fscache_cookie_def afs_vnode_cache_index_def = {
+	.name			= "AFS.vnode",
+	.type			= FSCACHE_COOKIE_TYPE_DATAFILE,
+	.get_key		= afs_vnode_cache_get_key,
+	.get_attr		= afs_vnode_cache_get_attr,
+	.get_aux		= afs_vnode_cache_get_aux,
+	.check_aux		= afs_vnode_cache_check_aux,
+	.now_uncached		= afs_vnode_cache_now_uncached,
 };
-#endif
 
 /*
- * match a cell record obtained from the cache
+ * set the key for the index entry
  */
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_cell_cache_match(void *target,
-						const void *entry)
+static uint16_t afs_cell_cache_get_key(const void *cookie_netfs_data,
+				       void *buffer, uint16_t bufmax)
 {
-	const struct afs_cache_cell *ccell = entry;
-	struct afs_cell *cell = target;
+	const struct afs_cell *cell = cookie_netfs_data;
+	uint16_t klen;
 
-	_enter("{%s},{%s}", ccell->name, cell->name);
+	_enter("%p,%p,%u", cell, buffer, bufmax);
 
-	if (strncmp(ccell->name, cell->name, sizeof(ccell->name)) == 0) {
-		_leave(" = SUCCESS");
-		return CACHEFS_MATCH_SUCCESS;
-	}
+	klen = strlen(cell->name);
+	if (klen > bufmax)
+		return 0;
 
-	_leave(" = FAILED");
-	return CACHEFS_MATCH_FAILED;
+	memcpy(buffer, cell->name, klen);
+	return klen;
 }
-#endif
 
 /*
- * update a cell record in the cache
+ * provide new auxilliary cache data
  */
-#ifdef AFS_CACHING_SUPPORT
-static void afs_cell_cache_update(void *source, void *entry)
+static uint16_t afs_cell_cache_get_aux(const void *cookie_netfs_data,
+				       void *buffer, uint16_t bufmax)
 {
-	struct afs_cache_cell *ccell = entry;
-	struct afs_cell *cell = source;
+	const struct afs_cell *cell = cookie_netfs_data;
+	uint16_t dlen;
 
-	_enter("%p,%p", source, entry);
+	_enter("%p,%p,%u", cell, buffer, bufmax);
 
-	strncpy(ccell->name, cell->name, sizeof(ccell->name));
+	dlen = cell->vl_naddrs * sizeof(cell->vl_addrs[0]);
+	dlen = min(dlen, bufmax);
+	dlen &= ~(sizeof(cell->vl_addrs[0]) - 1);
 
-	memcpy(ccell->vl_servers,
-	       cell->vl_addrs,
-	       min(sizeof(ccell->vl_servers), sizeof(cell->vl_addrs)));
+	memcpy(buffer, cell->vl_addrs, dlen);
+	return dlen;
+}
 
+/*
+ * check that the auxilliary data indicates that the entry is still valid
+ */
+static enum fscache_checkaux afs_cell_cache_check_aux(void *cookie_netfs_data,
+						      const void *buffer,
+						      uint16_t buflen)
+{
+	_leave(" = OKAY");
+	return FSCACHE_CHECKAUX_OKAY;
 }
-#endif
-
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_vlocation_cache_match(void *target,
-						     const void *entry);
-static void afs_vlocation_cache_update(void *source, void *entry);
-
-struct cachefs_index_def afs_vlocation_cache_index_def = {
-	.name		= "vldb",
-	.data_size	= sizeof(struct afs_cache_vlocation),
-	.keys[0]	= { CACHEFS_INDEX_KEYS_ASCIIZ, 64 },
-	.match		= afs_vlocation_cache_match,
-	.update		= afs_vlocation_cache_update,
-};
-#endif
 
+/*****************************************************************************/
 /*
- * match a VLDB record stored in the cache
- * - may also load target from entry
+ * set the key for the index entry
  */
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_vlocation_cache_match(void *target,
-						     const void *entry)
+static uint16_t afs_vlocation_cache_get_key(const void *cookie_netfs_data,
+					    void *buffer, uint16_t bufmax)
 {
-	const struct afs_cache_vlocation *vldb = entry;
-	struct afs_vlocation *vlocation = target;
+	const struct afs_vlocation *vlocation = cookie_netfs_data;
+	uint16_t klen;
 
-	_enter("{%s},{%s}", vlocation->vldb.name, vldb->name);
+	_enter("{%s},%p,%u", vlocation->vldb.name, buffer, bufmax);
 
-	if (strncmp(vlocation->vldb.name, vldb->name, sizeof(vldb->name)) == 0
-	    ) {
-		if (!vlocation->valid ||
-		    vlocation->vldb.rtime == vldb->rtime
+	klen = strnlen(vlocation->vldb.name, sizeof(vlocation->vldb.name));
+	if (klen > bufmax)
+		return 0;
+
+	memcpy(buffer, vlocation->vldb.name, klen);
+
+	_leave(" = %u", klen);
+	return klen;
+}
+
+/*
+ * provide new auxilliary cache data
+ */
+static uint16_t afs_vlocation_cache_get_aux(const void *cookie_netfs_data,
+					    void *buffer, uint16_t bufmax)
+{
+	const struct afs_vlocation *vlocation = cookie_netfs_data;
+	uint16_t dlen;
+
+	_enter("{%s},%p,%u", vlocation->vldb.name, buffer, bufmax);
+
+	dlen = sizeof(struct afs_cache_vlocation);
+	dlen -= offsetof(struct afs_cache_vlocation, nservers);
+	if (dlen > bufmax)
+		return 0;
+
+	memcpy(buffer, (uint8_t *)&vlocation->vldb.nservers, dlen);
+
+	_leave(" = %u", dlen);
+	return dlen;
+}
+
+/*
+ * check that the auxilliary data indicates that the entry is still valid
+ */
+static
+enum fscache_checkaux afs_vlocation_cache_check_aux(void *cookie_netfs_data,
+						    const void *buffer,
+						    uint16_t buflen)
+{
+	const struct afs_cache_vlocation *cvldb;
+	struct afs_vlocation *vlocation = cookie_netfs_data;
+	uint16_t dlen;
+
+	_enter("{%s},%p,%u", vlocation->vldb.name, buffer, buflen);
+
+	/* check the size of the data is what we're expecting */
+	dlen = sizeof(struct afs_cache_vlocation);
+	dlen -= offsetof(struct afs_cache_vlocation, nservers);
+	if (dlen != buflen)
+		return FSCACHE_CHECKAUX_OBSOLETE;
+
+	cvldb = container_of(buffer, struct afs_cache_vlocation, nservers);
+
+	/* if what's on disk is more valid than what's in memory, then use the
+	 * VL record from the cache */
+	if (!vlocation->valid || vlocation->vldb.rtime == cvldb->rtime) {
+		memcpy((uint8_t *)&vlocation->vldb.nservers, buffer, dlen);
+		vlocation->valid = 1;
+		_leave(" = SUCCESS [c->m]");
+		return FSCACHE_CHECKAUX_OKAY;
+	}
+
+	/* need to update the cache if the cached info differs */
+	if (memcmp(&vlocation->vldb, buffer, dlen) != 0) {
+		/* delete if the volume IDs for this name differ */
+		if (memcmp(&vlocation->vldb.vid, &cvldb->vid,
+			   sizeof(cvldb->vid)) != 0
 		    ) {
-			vlocation->vldb = *vldb;
-			vlocation->valid = 1;
-			_leave(" = SUCCESS [c->m]");
-			return CACHEFS_MATCH_SUCCESS;
-		} else if (memcmp(&vlocation->vldb, vldb, sizeof(*vldb)) != 0) {
-			/* delete if VIDs for this name differ */
-			if (memcmp(&vlocation->vldb.vid,
-				   &vldb->vid,
-				   sizeof(vldb->vid)) != 0) {
-				_leave(" = DELETE");
-				return CACHEFS_MATCH_SUCCESS_DELETE;
-			}
-
-			_leave(" = UPDATE");
-			return CACHEFS_MATCH_SUCCESS_UPDATE;
-		} else {
-			_leave(" = SUCCESS");
-			return CACHEFS_MATCH_SUCCESS;
+			_leave(" = OBSOLETE");
+			return FSCACHE_CHECKAUX_OBSOLETE;
 		}
+
+		_leave(" = UPDATE");
+		return FSCACHE_CHECKAUX_NEEDS_UPDATE;
 	}
 
-	_leave(" = FAILED");
-	return CACHEFS_MATCH_FAILED;
+	_leave(" = OKAY");
+	return FSCACHE_CHECKAUX_OKAY;
 }
-#endif
 
+/*****************************************************************************/
 /*
- * update a VLDB record stored in the cache
+ * set the key for the volume index entry
  */
-#ifdef AFS_CACHING_SUPPORT
-static void afs_vlocation_cache_update(void *source, void *entry)
+static uint16_t afs_volume_cache_get_key(const void *cookie_netfs_data,
+					void *buffer, uint16_t bufmax)
 {
-	struct afs_cache_vlocation *vldb = entry;
-	struct afs_vlocation *vlocation = source;
+	const struct afs_volume *volume = cookie_netfs_data;
+	uint16_t klen;
+
+	_enter("{%u},%p,%u", volume->type, buffer, bufmax);
 
-	_enter("");
+	klen = sizeof(volume->type);
+	if (klen > bufmax)
+		return 0;
+
+	memcpy(buffer, &volume->type, sizeof(volume->type));
+
+	_leave(" = %u", klen);
+	return klen;
 
-	*vldb = vlocation->vldb;
 }
-#endif
-
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_volume_cache_match(void *target,
-						  const void *entry);
-static void afs_volume_cache_update(void *source, void *entry);
-
-struct cachefs_index_def afs_volume_cache_index_def = {
-	.name		= "volume",
-	.data_size	= sizeof(struct afs_cache_vhash),
-	.keys[0]	= { CACHEFS_INDEX_KEYS_BIN, 1 },
-	.keys[1]	= { CACHEFS_INDEX_KEYS_BIN, 1 },
-	.match		= afs_volume_cache_match,
-	.update		= afs_volume_cache_update,
-};
-#endif
 
+/*****************************************************************************/
 /*
- * match a volume hash record stored in the cache
+ * set the key for the index entry
  */
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_volume_cache_match(void *target,
-						  const void *entry)
+static uint16_t afs_vnode_cache_get_key(const void *cookie_netfs_data,
+					void *buffer, uint16_t bufmax)
 {
-	const struct afs_cache_vhash *vhash = entry;
-	struct afs_volume *volume = target;
+	const struct afs_vnode *vnode = cookie_netfs_data;
+	uint16_t klen;
 
-	_enter("{%u},{%u}", volume->type, vhash->vtype);
+	_enter("{%x,%x,%llx},%p,%u",
+	       vnode->fid.vnode, vnode->fid.unique, vnode->status.data_version,
+	       buffer, bufmax);
 
-	if (volume->type == vhash->vtype) {
-		_leave(" = SUCCESS");
-		return CACHEFS_MATCH_SUCCESS;
-	}
+	klen = sizeof(vnode->fid.vnode);
+	if (klen > bufmax)
+		return 0;
 
-	_leave(" = FAILED");
-	return CACHEFS_MATCH_FAILED;
+	memcpy(buffer, &vnode->fid.vnode, sizeof(vnode->fid.vnode));
+
+	_leave(" = %u", klen);
+	return klen;
 }
-#endif
 
 /*
- * update a volume hash record stored in the cache
+ * provide updated file attributes
  */
-#ifdef AFS_CACHING_SUPPORT
-static void afs_volume_cache_update(void *source, void *entry)
+static void afs_vnode_cache_get_attr(const void *cookie_netfs_data,
+				     uint64_t *size)
 {
-	struct afs_cache_vhash *vhash = entry;
-	struct afs_volume *volume = source;
+	const struct afs_vnode *vnode = cookie_netfs_data;
 
-	_enter("");
+	_enter("{%x,%x,%llx},",
+	       vnode->fid.vnode, vnode->fid.unique,
+	       vnode->status.data_version);
 
-	vhash->vtype = volume->type;
+	*size = vnode->status.size;
 }
-#endif
-
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_vnode_cache_match(void *target,
-						 const void *entry);
-static void afs_vnode_cache_update(void *source, void *entry);
-
-struct cachefs_index_def afs_vnode_cache_index_def = {
-	.name		= "vnode",
-	.data_size	= sizeof(struct afs_cache_vnode),
-	.keys[0]	= { CACHEFS_INDEX_KEYS_BIN, 4 },
-	.match		= afs_vnode_cache_match,
-	.update		= afs_vnode_cache_update,
-};
-#endif
 
 /*
- * match a vnode record stored in the cache
+ * provide new auxilliary cache data
  */
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_vnode_cache_match(void *target,
-						 const void *entry)
+static uint16_t afs_vnode_cache_get_aux(const void *cookie_netfs_data,
+					void *buffer, uint16_t bufmax)
 {
-	const struct afs_cache_vnode *cvnode = entry;
-	struct afs_vnode *vnode = target;
-
-	_enter("{%x,%x,%Lx},{%x,%x,%Lx}",
-	       vnode->fid.vnode,
-	       vnode->fid.unique,
-	       vnode->status.version,
-	       cvnode->vnode_id,
-	       cvnode->vnode_unique,
-	       cvnode->data_version);
-
-	if (vnode->fid.vnode != cvnode->vnode_id) {
-		_leave(" = FAILED");
-		return CACHEFS_MATCH_FAILED;
+	const struct afs_vnode *vnode = cookie_netfs_data;
+	uint16_t dlen;
+
+	_enter("{%x,%x,%Lx},%p,%u",
+	       vnode->fid.vnode, vnode->fid.unique, vnode->status.data_version,
+	       buffer, bufmax);
+
+	dlen = sizeof(vnode->fid.unique) + sizeof(vnode->status.data_version);
+	if (dlen > bufmax)
+		return 0;
+
+	memcpy(buffer, &vnode->fid.unique, sizeof(vnode->fid.unique));
+	buffer += sizeof(vnode->fid.unique);
+	memcpy(buffer, &vnode->status.data_version,
+	       sizeof(vnode->status.data_version));
+
+	_leave(" = %u", dlen);
+	return dlen;
+}
+
+/*
+ * check that the auxilliary data indicates that the entry is still valid
+ */
+static enum fscache_checkaux afs_vnode_cache_check_aux(void *cookie_netfs_data,
+						       const void *buffer,
+						       uint16_t buflen)
+{
+	struct afs_vnode *vnode = cookie_netfs_data;
+	uint16_t dlen;
+
+	_enter("{%x,%x,%llx},%p,%u",
+	       vnode->fid.vnode, vnode->fid.unique, vnode->status.data_version,
+	       buffer, buflen);
+
+	/* check the size of the data is what we're expecting */
+	dlen = sizeof(vnode->fid.unique) + sizeof(vnode->status.data_version);
+	if (dlen != buflen) {
+		_leave(" = OBSOLETE [len %hx != %hx]", dlen, buflen);
+		return FSCACHE_CHECKAUX_OBSOLETE;
 	}
 
-	if (vnode->fid.unique != cvnode->vnode_unique ||
-	    vnode->status.version != cvnode->data_version) {
-		_leave(" = DELETE");
-		return CACHEFS_MATCH_SUCCESS_DELETE;
+	if (memcmp(buffer,
+		   &vnode->fid.unique,
+		   sizeof(vnode->fid.unique)
+		   ) != 0) {
+		unsigned unique;
+
+		memcpy(&unique, buffer, sizeof(unique));
+
+		_leave(" = OBSOLETE [uniq %x != %x]",
+		       unique, vnode->fid.unique);
+		return FSCACHE_CHECKAUX_OBSOLETE;
+	}
+
+	if (memcmp(buffer + sizeof(vnode->fid.unique),
+		   &vnode->status.data_version,
+		   sizeof(vnode->status.data_version)
+		   ) != 0) {
+		afs_dataversion_t version;
+
+		memcpy(&version, buffer + sizeof(vnode->fid.unique),
+		       sizeof(version));
+
+		_leave(" = OBSOLETE [vers %llx != %llx]",
+		       version, vnode->status.data_version);
+		return FSCACHE_CHECKAUX_OBSOLETE;
 	}
 
 	_leave(" = SUCCESS");
-	return CACHEFS_MATCH_SUCCESS;
+	return FSCACHE_CHECKAUX_OKAY;
 }
-#endif
 
 /*
- * update a vnode record stored in the cache
+ * indication the cookie is no longer uncached
+ * - this function is called when the backing store currently caching a cookie
+ *   is removed
+ * - the netfs should use this to clean up any markers indicating cached pages
+ * - this is mandatory for any object that may have data
  */
-#ifdef AFS_CACHING_SUPPORT
-static void afs_vnode_cache_update(void *source, void *entry)
+static void afs_vnode_cache_now_uncached(void *cookie_netfs_data)
 {
-	struct afs_cache_vnode *cvnode = entry;
-	struct afs_vnode *vnode = source;
+	struct afs_vnode *vnode = cookie_netfs_data;
+	struct pagevec pvec;
+	pgoff_t first;
+	int loop, nr_pages;
+
+	_enter("{%x,%x,%Lx}",
+	       vnode->fid.vnode, vnode->fid.unique, vnode->status.data_version);
+
+	pagevec_init(&pvec, 0);
+	first = 0;
+
+	for (;;) {
+		/* grab a bunch of pages to clean */
+		nr_pages = pagevec_lookup(&pvec, vnode->vfs_inode.i_mapping,
+					  first,
+					  PAGEVEC_SIZE - pagevec_count(&pvec));
+		if (!nr_pages)
+			break;
 
-	_enter("");
+		for (loop = 0; loop < nr_pages; loop++)
+			ClearPageFsCache(pvec.pages[loop]);
+
+		first = pvec.pages[nr_pages - 1]->index + 1;
+
+		pvec.nr = nr_pages;
+		pagevec_release(&pvec);
+		cond_resched();
+	}
 
-	cvnode->vnode_id	= vnode->fid.vnode;
-	cvnode->vnode_unique	= vnode->fid.unique;
-	cvnode->data_version	= vnode->status.version;
+	_leave("");
 }
-#endif
diff --git a/fs/afs/cache.h b/fs/afs/cache.h
index 36a3642..b985052 100644
--- a/fs/afs/cache.h
+++ b/fs/afs/cache.h
@@ -1,6 +1,6 @@
 /* AFS local cache management interface
  *
- * Copyright (C) 2002 Red Hat, Inc. All Rights Reserved.
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
  * Written by David Howells (dhowells@redhat.com)
  *
  * This program is free software; you can redistribute it and/or
@@ -9,15 +9,4 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#ifndef AFS_CACHE_H
-#define AFS_CACHE_H
-
-#undef AFS_CACHING_SUPPORT
-
-#include <linux/mm.h>
-#ifdef AFS_CACHING_SUPPORT
-#include <linux/cachefs.h>
-#endif
-#include "types.h"
-
-#endif /* AFS_CACHE_H */
+#include <linux/fscache.h>
diff --git a/fs/afs/cell.c b/fs/afs/cell.c
index 970d38f..360bba5 100644
--- a/fs/afs/cell.c
+++ b/fs/afs/cell.c
@@ -140,12 +140,11 @@ struct afs_cell *afs_cell_create(const char *name, char *vllist)
 	if (ret < 0)
 		goto error;
 
-#ifdef AFS_CACHING_SUPPORT
-	/* put it up for caching */
-	cachefs_acquire_cookie(afs_cache_netfs.primary_index,
-			       &afs_vlocation_cache_index_def,
-			       cell,
-			       &cell->cache);
+#ifdef CONFIG_AFS_FSCACHE
+	/* put it up for caching (this never returns an error) */
+	cell->cache = fscache_acquire_cookie(afs_cache_netfs.primary_index,
+					     &afs_cell_cache_index_def,
+					     cell);
 #endif
 
 	/* add to the cell lists */
@@ -350,10 +349,7 @@ static void afs_cell_destroy(struct afs_cell *cell)
 	list_del_init(&cell->proc_link);
 	up_write(&afs_proc_cells_sem);
 
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_relinquish_cookie(cell->cache, 0);
-#endif
-
+	fscache_relinquish_cookie(cell->cache, 0);
 	key_put(cell->anonymous_key);
 	kfree(cell);
 
diff --git a/fs/afs/file.c b/fs/afs/file.c
index 1323df4..276ed86 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -24,6 +24,9 @@ static int afs_releasepage(struct page *page, gfp_t gfp_flags);
 static int afs_launder_page(struct page *page);
 static int afs_mmap(struct file *file, struct vm_area_struct *vma);
 
+static int afs_readpages(struct file *filp, struct address_space *mapping,
+			 struct list_head *pages, unsigned nr_pages);
+
 const struct file_operations afs_file_operations = {
 	.open		= afs_open,
 	.release	= afs_release,
@@ -47,6 +50,7 @@ const struct inode_operations afs_file_inode_operations = {
 
 const struct address_space_operations afs_fs_aops = {
 	.readpage	= afs_readpage,
+	.readpages	= afs_readpages,
 	.set_page_dirty	= afs_set_page_dirty,
 	.launder_page	= afs_launder_page,
 	.releasepage	= afs_releasepage,
@@ -107,37 +111,18 @@ int afs_release(struct inode *inode, struct file *file)
 /*
  * deal with notification that a page was read from the cache
  */
-#ifdef AFS_CACHING_SUPPORT
-static void afs_readpage_read_complete(void *cookie_data,
-				       struct page *page,
-				       void *data,
-				       int error)
+static void afs_file_readpage_read_complete(struct page *page,
+					    void *data,
+					    int error)
 {
-	_enter("%p,%p,%p,%d", cookie_data, page, data, error);
+	_enter("%p,%p,%d", page, data, error);
 
-	if (error)
-		SetPageError(page);
-	else
+	/* if the read completes with an error, we just unlock the page and let
+	 * the VM reissue the readpage */
+	if (!error)
 		SetPageUptodate(page);
 	unlock_page(page);
-
-}
-#endif
-
-/*
- * deal with notification that a page was written to the cache
- */
-#ifdef AFS_CACHING_SUPPORT
-static void afs_readpage_write_complete(void *cookie_data,
-					struct page *page,
-					void *data,
-					int error)
-{
-	_enter("%p,%p,%p,%d", cookie_data, page, data, error);
-
-	unlock_page(page);
 }
-#endif
 
 /*
  * AFS read page from file, directory or symlink
@@ -167,30 +152,27 @@ static int afs_readpage(struct file *file, struct page *page)
 	if (test_bit(AFS_VNODE_DELETED, &vnode->flags))
 		goto error;
 
-#ifdef AFS_CACHING_SUPPORT
 	/* is it cached? */
-	ret = cachefs_read_or_alloc_page(vnode->cache,
+	ret = fscache_read_or_alloc_page(vnode->cache,
 					 page,
 					 afs_file_readpage_read_complete,
 					 NULL,
 					 GFP_KERNEL);
-#else
-	ret = -ENOBUFS;
-#endif
-
 	switch (ret) {
-		/* read BIO submitted and wb-journal entry found */
-	case 1:
-		BUG(); // TODO - handle wb-journal match
-
 		/* read BIO submitted (page in cache) */
 	case 0:
 		break;
 
-		/* no page available in cache */
-	case -ENOBUFS:
+		/* page not yet cached */
 	case -ENODATA:
+		_debug("cache said ENODATA");
+		goto go_on;
+
+		/* page will not be cached */
+	case -ENOBUFS:
+		_debug("cache said ENOBUFS");
 	default:
+	go_on:
 		offset = page->index << PAGE_CACHE_SHIFT;
 		len = min_t(size_t, i_size_read(inode) - offset, PAGE_SIZE);
 
@@ -204,27 +186,21 @@ static int afs_readpage(struct file *file, struct page *page)
 				set_bit(AFS_VNODE_DELETED, &vnode->flags);
 				ret = -ESTALE;
 			}
-#ifdef AFS_CACHING_SUPPORT
-			cachefs_uncache_page(vnode->cache, page);
-#endif
+
+			fscache_uncache_page(vnode->cache, page);
+			BUG_ON(PageFsCache(page));
 			goto error;
 		}
 
 		SetPageUptodate(page);
 
-#ifdef AFS_CACHING_SUPPORT
-		if (cachefs_write_page(vnode->cache,
-				       page,
-				       afs_file_readpage_write_complete,
-				       NULL,
-				       GFP_KERNEL) != 0
-		    ) {
-			cachefs_uncache_page(vnode->cache, page);
-			unlock_page(page);
+		/* send the page to the cache */
+		if (PageFsCache(page) &&
+		    fscache_write_page(vnode->cache, page, GFP_KERNEL) != 0) {
+			fscache_uncache_page(vnode->cache, page);
+			BUG_ON(PageFsCache(page));
 		}
-#else
 		unlock_page(page);
-#endif
 	}
 
 	_leave(" = 0");
@@ -238,34 +214,55 @@ error:
 }
 
 /*
- * invalidate part or all of a page
+ * read a set of pages
  */
-static void afs_invalidatepage(struct page *page, unsigned long offset)
+static int afs_readpages(struct file *file, struct address_space *mapping,
+			 struct list_head *pages, unsigned nr_pages)
 {
-	int ret = 1;
+	struct afs_vnode *vnode;
+	int ret = 0;
 
-	_enter("{%lu},%lu", page->index, offset);
+	_enter(",{%lu},,%d", mapping->host->i_ino, nr_pages);
 
-	BUG_ON(!PageLocked(page));
+	vnode = AFS_FS_I(mapping->host);
+	if (vnode->flags & AFS_VNODE_DELETED) {
+		_leave(" = -ESTALE");
+		return -ESTALE;
+	}
 
-	if (PagePrivate(page)) {
-		/* We release buffers only if the entire page is being
-		 * invalidated.
-		 * The get_block cached value has been unconditionally
-		 * invalidated, so real IO is not possible anymore.
-		 */
-		if (offset == 0) {
-			BUG_ON(!PageLocked(page));
-
-			ret = 0;
-			if (!PageWriteback(page))
-				ret = page->mapping->a_ops->releasepage(page,
-									0);
-			/* possibly should BUG_ON(!ret); - neilb */
-		}
+	/* attempt to read as many of the pages as possible */
+	ret = fscache_read_or_alloc_pages(vnode->cache,
+					  mapping,
+					  pages,
+					  &nr_pages,
+					  afs_file_readpage_read_complete,
+					  NULL,
+					  mapping_gfp_mask(mapping));
+
+	switch (ret) {
+		/* all pages are being read from the cache */
+	case 0:
+		BUG_ON(!list_empty(pages));
+		BUG_ON(nr_pages != 0);
+		_leave(" = 0 [reading all]");
+		return 0;
+
+		/* there were pages that couldn't be read from the cache */
+	case -ENODATA:
+	case -ENOBUFS:
+		break;
+
+		/* other error */
+	default:
+		_leave(" = %d", ret);
+		return ret;
 	}
 
-	_leave(" = %d", ret);
+	/* load the missing pages from the network */
+	ret = read_cache_pages(mapping, pages, (void *) afs_readpage, file);
+
+	_leave(" = %d [netting]", ret);
+	return ret;
 }
 
 /*
@@ -279,27 +276,80 @@ static int afs_launder_page(struct page *page)
 }
 
 /*
- * release a page and cleanup its private data
+ * invalidate part or all of a page
+ * - release a page and clean up its private data if offset is 0 (indicating
+ *   the entire page)
+ */
+static void afs_invalidatepage(struct page *page, unsigned long offset)
+{
+	struct afs_writeback *wb = (struct afs_writeback *) page_private(page);
+	struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
+
+	_enter("{%lu},%lu", page->index, offset);
+
+	BUG_ON(!PageLocked(page));
+
+	/* we clean up only if the entire page is being invalidated */
+	if (offset == 0) {
+		if (PageFsCache(page)) {
+			wait_on_page_fscache_write(page);
+			fscache_uncache_page(vnode->cache, page);
+			ClearPageFsCache(page);
+		}
+
+		if (PagePrivate(page)) {
+			if (wb && !PageWriteback(page)) {
+				set_page_private(page, 0);
+				afs_put_writeback(wb);
+			}
+
+			if (!page_private(page))
+				ClearPagePrivate(page);
+		}
+	}
+
+	_leave("");
+}
+
+/*
+ * release a page and clean up its private state if it's not busy
+ * - return true if the page can now be released, false if not
  */
 static int afs_releasepage(struct page *page, gfp_t gfp_flags)
 {
+	struct afs_writeback *wb = (struct afs_writeback *) page_private(page);
 	struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
-	struct afs_writeback *wb;
 
 	_enter("{{%x:%u}[%lu],%lx},%x",
 	       vnode->fid.vid, vnode->fid.vnode, page->index, page->flags,
 	       gfp_flags);
 
+	/* deny if page is being written to the cache and the caller hasn't
+	 * elected to wait */
+	if (PageFsCache(page)) {
+		if (PageFsCacheWrite(page)) {
+			if (!(gfp_flags & __GFP_WAIT)) {
+				_leave(" = F [cache busy]");
+				return 0;
+			}
+			wait_on_page_fscache_write(page);
+		}
+
+		fscache_uncache_page(vnode->cache, page);
+		ClearPageFsCache(page);
+	}
+
 	if (PagePrivate(page)) {
-		wb = (struct afs_writeback *) page_private(page);
-		ASSERT(wb != NULL);
-		set_page_private(page, 0);
+		if (wb) {
+			set_page_private(page, 0);
+			afs_put_writeback(wb);
+		}
 		ClearPagePrivate(page);
-		afs_put_writeback(wb);
 	}
 
-	_leave(" = 0");
-	return 0;
+	/* indicate that the page can be released */
+	_leave(" = T");
+	return 1;
 }
 
 /*
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index d196840..2eef03f 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -61,6 +61,9 @@ static int afs_inode_map_status(struct afs_vnode *vnode, struct key *key)
 		return -EBADMSG;
 	}
 
+	if (vnode->status.size != inode->i_size)
+		fscache_attr_changed(vnode->cache);
+
 	inode->i_nlink		= vnode->status.nlink;
 	inode->i_uid		= vnode->status.owner;
 	inode->i_gid		= 0;
@@ -149,15 +152,6 @@ struct inode *afs_iget(struct super_block *sb, struct key *key,
 		return inode;
 	}
 
-#ifdef AFS_CACHING_SUPPORT
-	/* set up caching before reading the status, as fetch-status reads the
-	 * first page of symlinks to see if they're really mntpts */
-	cachefs_acquire_cookie(vnode->volume->cache,
-			       NULL,
-			       vnode,
-			       &vnode->cache);
-#endif
-
 	if (!status) {
 		/* it's a remotely extant inode */
 		set_bit(AFS_VNODE_CB_BROKEN, &vnode->flags);
@@ -183,6 +177,13 @@ struct inode *afs_iget(struct super_block *sb, struct key *key,
 		}
 	}
 
+	/* set up caching before mapping the status, as map-status reads the
+	 * first page of symlinks to see if they're really mountpoints */
+	inode->i_size = vnode->status.size;
+	vnode->cache = fscache_acquire_cookie(vnode->volume->cache,
+					      &afs_vnode_cache_index_def,
+					      vnode);
+
 	ret = afs_inode_map_status(vnode, key);
 	if (ret < 0)
 		goto bad_inode;
@@ -196,10 +197,11 @@ struct inode *afs_iget(struct super_block *sb, struct key *key,
 
 	/* failure */
 bad_inode:
+	fscache_relinquish_cookie(vnode->cache, 0);
+	vnode->cache = NULL;
 	make_bad_inode(inode);
 	unlock_new_inode(inode);
 	iput(inode);
-
 	_leave(" = %d [bad]", ret);
 	return ERR_PTR(ret);
 }
@@ -342,10 +344,8 @@ void afs_clear_inode(struct inode *inode)
 	ASSERT(list_empty(&vnode->writebacks));
 	ASSERT(!vnode->cb_promised);
 
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_relinquish_cookie(vnode->cache, 0);
+	fscache_relinquish_cookie(vnode->cache, 0);
 	vnode->cache = NULL;
-#endif
 
 	mutex_lock(&vnode->permits_lock);
 	permits = vnode->permits;
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index b3da9ab..8f8f3c5 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -21,6 +21,7 @@
 
 #include "afs.h"
 #include "afs_vl.h"
+#include "cache.h"
 
 #define AFS_CELL_MAX_ADDRS 15
 
@@ -194,9 +195,7 @@ struct afs_cell {
 	struct key		*anonymous_key;	/* anonymous user key for this cell */
 	struct list_head	proc_link;	/* /proc cell list link */
 	struct proc_dir_entry	*proc_dir;	/* /proc dir for this cell */
-#ifdef AFS_CACHING_SUPPORT
-	struct cachefs_cookie	*cache;		/* caching cookie */
-#endif
+	struct fscache_cookie	*cache;		/* caching cookie */
 
 	/* server record management */
 	rwlock_t		servers_lock;	/* active server list lock */
@@ -250,9 +249,7 @@ struct afs_vlocation {
 	struct list_head	grave;		/* link in master graveyard list */
 	struct list_head	update;		/* link in master update list */
 	struct afs_cell		*cell;		/* cell to which volume belongs */
-#ifdef AFS_CACHING_SUPPORT
-	struct cachefs_cookie	*cache;		/* caching cookie */
-#endif
+	struct fscache_cookie	*cache;		/* caching cookie */
 	struct afs_cache_vlocation vldb;	/* volume information DB record */
 	struct afs_volume	*vols[3];	/* volume access record pointer (index by type) */
 	wait_queue_head_t	waitq;		/* status change waitqueue */
@@ -303,9 +300,7 @@ struct afs_volume {
 	atomic_t		usage;
 	struct afs_cell		*cell;		/* cell to which belongs (unrefd ptr) */
 	struct afs_vlocation	*vlocation;	/* volume location */
-#ifdef AFS_CACHING_SUPPORT
-	struct cachefs_cookie	*cache;		/* caching cookie */
-#endif
+	struct fscache_cookie	*cache;		/* caching cookie */
 	afs_volid_t		vid;		/* volume ID */
 	afs_voltype_t		type;		/* type of volume */
 	char			type_force;	/* force volume type (suppress R/O -> R/W) */
@@ -334,9 +329,7 @@ struct afs_vnode {
 	struct afs_server	*server;	/* server currently supplying this file */
 	struct afs_fid		fid;		/* the file identifier for this inode */
 	struct afs_file_status	status;		/* AFS status info for this file */
-#ifdef AFS_CACHING_SUPPORT
-	struct cachefs_cookie	*cache;		/* caching cookie */
-#endif
+	struct fscache_cookie	*cache;		/* caching cookie */
 	struct afs_permits	*permits;	/* cache of permits so far obtained */
 	struct mutex		permits_lock;	/* lock for altering permits list */
 	struct mutex		validate_lock;	/* lock for validating this vnode */
@@ -429,6 +422,22 @@ struct afs_uuid {
 
 /*****************************************************************************/
 /*
+ * cache.c
+ */
+#ifdef CONFIG_AFS_FSCACHE
+extern struct fscache_netfs afs_cache_netfs;
+extern struct fscache_cookie_def afs_cell_cache_index_def;
+extern struct fscache_cookie_def afs_vlocation_cache_index_def;
+extern struct fscache_cookie_def afs_volume_cache_index_def;
+extern struct fscache_cookie_def afs_vnode_cache_index_def;
+#else
+#define afs_cell_cache_index_def	(*(struct fscache_cookie_def *) NULL)
+#define afs_vlocation_cache_index_def	(*(struct fscache_cookie_def *) NULL)
+#define afs_volume_cache_index_def	(*(struct fscache_cookie_def *) NULL)
+#define afs_vnode_cache_index_def	(*(struct fscache_cookie_def *) NULL)
+#endif
+
+/*
  * callback.c
  */
 extern void afs_init_callback_state(struct afs_server *);
@@ -447,9 +456,6 @@ extern void afs_callback_update_kill(void);
  */
 extern struct rw_semaphore afs_proc_cells_sem;
 extern struct list_head afs_proc_cells;
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_index_def afs_cache_cell_index_def;
-#endif
 
 #define afs_get_cell(C) do { atomic_inc(&(C)->usage); } while(0)
 extern int afs_cell_init(char *);
@@ -557,9 +563,6 @@ extern void afs_clear_inode(struct inode *);
  * main.c
  */
 extern struct afs_uuid afs_uuid;
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_netfs afs_cache_netfs;
-#endif
 
 /*
  * misc.c
@@ -641,10 +644,6 @@ extern int afs_get_MAC_address(u8 *, size_t);
 /*
  * vlclient.c
  */
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_index_def afs_vlocation_cache_index_def;
-#endif
-
 extern int afs_vl_get_entry_by_name(struct in_addr *, struct key *,
 				    const char *, struct afs_cache_vlocation *,
 				    const struct afs_wait_mode *);
@@ -668,12 +667,6 @@ extern void afs_vlocation_purge(void);
 /*
  * vnode.c
  */
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_index_def afs_vnode_cache_index_def;
-#endif
-
-extern struct afs_timer_ops afs_vnode_cb_timed_out_ops;
-
 static inline struct afs_vnode *AFS_FS_I(struct inode *inode)
 {
 	return container_of(inode, struct afs_vnode, vfs_inode);
@@ -715,10 +708,6 @@ extern int afs_vnode_release_lock(struct afs_vnode *, struct key *);
 /*
  * volume.c
  */
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_index_def afs_volume_cache_index_def;
-#endif
-
 #define afs_get_volume(V) do { atomic_inc(&(V)->usage); } while(0)
 
 extern void afs_put_volume(struct afs_volume *);
diff --git a/fs/afs/main.c b/fs/afs/main.c
index 0f60f6b..f04b838 100644
--- a/fs/afs/main.c
+++ b/fs/afs/main.c
@@ -1,6 +1,6 @@
 /* AFS client file system
  *
- * Copyright (C) 2002 Red Hat, Inc. All Rights Reserved.
+ * Copyright (C) 2002,5 Red Hat, Inc. All Rights Reserved.
  * Written by David Howells (dhowells@redhat.com)
  *
  * This program is free software; you can redistribute it and/or
@@ -29,18 +29,6 @@ static char *rootcell;
 module_param(rootcell, charp, 0);
 MODULE_PARM_DESC(rootcell, "root AFS cell name and VL server IP addr list");
 
-#ifdef AFS_CACHING_SUPPORT
-static struct cachefs_netfs_operations afs_cache_ops = {
-	.get_page_cookie	= afs_cache_get_page_cookie,
-};
-
-struct cachefs_netfs afs_cache_netfs = {
-	.name			= "afs",
-	.version		= 0,
-	.ops			= &afs_cache_ops,
-};
-#endif
-
 struct afs_uuid afs_uuid;
 
 /*
@@ -104,10 +92,9 @@ static int __init afs_init(void)
 	if (ret < 0)
 		return ret;
 
-#ifdef AFS_CACHING_SUPPORT
+#ifdef CONFIG_AFS_FSCACHE
 	/* we want to be able to cache */
-	ret = cachefs_register_netfs(&afs_cache_netfs,
-				     &afs_cache_cell_index_def);
+	ret = fscache_register_netfs(&afs_cache_netfs);
 	if (ret < 0)
 		goto error_cache;
 #endif
@@ -142,8 +129,8 @@ error_fs:
 error_open_socket:
 error_vl_update_init:
 error_cell_init:
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_unregister_netfs(&afs_cache_netfs);
+#ifdef CONFIG_AFS_FSCACHE
+	fscache_unregister_netfs(&afs_cache_netfs);
 error_cache:
 #endif
 	afs_callback_update_kill();
@@ -175,8 +162,8 @@ static void __exit afs_exit(void)
 	afs_vlocation_purge();
 	flush_scheduled_work();
 	afs_cell_purge();
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_unregister_netfs(&afs_cache_netfs);
+#ifdef CONFIG_AFS_FSCACHE
+	fscache_unregister_netfs(&afs_cache_netfs);
 #endif
 	afs_proc_cleanup();
 	rcu_barrier();
diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
index 5ce43b6..be7b83c 100644
--- a/fs/afs/mntpt.c
+++ b/fs/afs/mntpt.c
@@ -173,9 +173,9 @@ static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
 	if (PageError(page))
 		goto error;
 
-	buf = kmap(page);
+	buf = kmap_atomic(page, KM_USER0);
 	memcpy(devname, buf, size);
-	kunmap(page);
+	kunmap_atomic(buf, KM_USER0);
 	page_cache_release(page);
 	page = NULL;
 
diff --git a/fs/afs/vlocation.c b/fs/afs/vlocation.c
index 849fc31..0b6ee24 100644
--- a/fs/afs/vlocation.c
+++ b/fs/afs/vlocation.c
@@ -281,10 +281,7 @@ static void afs_vlocation_apply_update(struct afs_vlocation *vl,
 
 	vl->vldb = *vldb;
 
-#ifdef AFS_CACHING_SUPPORT
-	/* update volume entry in local cache */
-	cachefs_update_cookie(vl->cache);
-#endif
+	fscache_update_cookie(vl->cache);
 }
 
 /*
@@ -304,12 +301,8 @@ static int afs_vlocation_fill_in_record(struct afs_vlocation *vl,
 	memset(&vldb, 0, sizeof(vldb));
 
 	/* see if we have an in-cache copy (will set vl->valid if there is) */
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_acquire_cookie(cell->cache,
-			       &afs_volume_cache_index_def,
-			       vlocation,
-			       &vl->cache);
-#endif
+	vl->cache = fscache_acquire_cookie(vl->cell->cache,
+					   &afs_vlocation_cache_index_def, vl);
 
 	if (vl->valid) {
 		/* try to update a known volume in the cell VL databases by
@@ -420,6 +413,9 @@ fill_in_record:
 	spin_unlock(&vl->lock);
 	wake_up(&vl->waitq);
 
+	/* update volume entry in local cache */
+	fscache_update_cookie(vl->cache);
+
 	/* schedule for regular updates */
 	afs_vlocation_queue_for_updates(vl);
 	goto success;
@@ -465,7 +461,7 @@ found_in_memory:
 	spin_unlock(&vl->lock);
 
 success:
-	_leave(" = %p",vl);
+	_leave(" = %p", vl);
 	return vl;
 
 error_abandon:
@@ -523,10 +519,7 @@ static void afs_vlocation_destroy(struct afs_vlocation *vl)
 {
 	_enter("%p", vl);
 
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_relinquish_cookie(vl->cache, 0);
-#endif
-
+	fscache_relinquish_cookie(vl->cache, 0);
 	afs_put_cell(vl->cell);
 	kfree(vl);
 }
diff --git a/fs/afs/volume.c b/fs/afs/volume.c
index 8bab0e3..2cc3dab 100644
--- a/fs/afs/volume.c
+++ b/fs/afs/volume.c
@@ -124,13 +124,9 @@ struct afs_volume *afs_volume_lookup(struct afs_mount_params *params)
 	}
 
 	/* attach the cache and volume location */
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_acquire_cookie(vlocation->cache,
-			       &afs_vnode_cache_index_def,
-			       volume,
-			       &volume->cache);
-#endif
-
+	volume->cache = fscache_acquire_cookie(vlocation->cache,
+					       &afs_volume_cache_index_def,
+					       volume);
 	afs_get_vlocation(vlocation);
 	volume->vlocation = vlocation;
 
@@ -194,9 +190,7 @@ void afs_put_volume(struct afs_volume *volume)
 	up_write(&vlocation->cell->vl_sem);
 
 	/* finish cleaning up the volume */
-#ifdef AFS_CACHING_SUPPORT
-	cachefs_relinquish_cookie(volume->cache, 0);
-#endif
+	fscache_relinquish_cookie(volume->cache, 0);
 	afs_put_vlocation(vlocation);
 
 	for (loop = volume->nservers - 1; loop >= 0; loop--)
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 8a3e9e2..f0af39b 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -261,7 +261,6 @@ flush_conflicting_wb:
 		_debug("reuse");
 		afs_put_writeback(wb);
 		set_page_private(page, 0);
-		ClearPagePrivate(page);
 		goto try_again;
 	}
 
@@ -694,7 +693,6 @@ void afs_pages_written_back(struct afs_vnode *vnode, struct afs_call *call)
 			end_page_writeback(page);
 			if (page_private(page) == (unsigned long) wb) {
 				set_page_private(page, 0);
-				ClearPagePrivate(page);
 				wb->usage--;
 			}
 		}
@@ -854,6 +852,10 @@ int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
 	_enter("{{%x:%u},%x},{%lx}",
 	       vnode->fid.vid, vnode->fid.vnode, key_serial(key), page->index);
 
+	/* wait for the page to be written to the cache before we allow it to
+	 * be modified */
+	wait_on_page_fscache_write(page);
+
 	do {
 		lock_page(page);
 		if (page->mapping == vma->vm_file->f_mapping)


^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-05 19:38 ` [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions " David Howells
@ 2007-12-10 16:46   ` Stephen Smalley
  2007-12-10 17:07   ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2007-12-10 16:46 UTC (permalink / raw)
  To: David Howells
  Cc: viro, hch, Trond.Myklebust, casey, linux-kernel, selinux,
	linux-security-module

On Wed, 2007-12-05 at 19:38 +0000, David Howells wrote:
> Allow kernel services to override LSM settings appropriate to the actions
> performed by a task by duplicating a security record, modifying it and then
> using task_struct::act_as to point to it when performing operations on behalf
> of a task.
> 
> This is used, for example, by CacheFiles which has to transparently access the
> cache on behalf of a process that thinks it is doing, say, NFS accesses with a
> potentially inappropriate (with respect to accessing the cache) set of
> security data.
> 
> This patch provides two LSM hooks for modifying a task security record:
> 
>  (*) security_kernel_act_as() which allows modification of the security datum
>      with which a task acts on other objects (most notably files).
> 
>  (*) security_create_files_as() which allows modification of the security
>      datum that is used to initialise the security data on a file that a task
>      creates.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  include/linux/cred.h     |   22 ++++++++++++
>  include/linux/security.h |   35 +++++++++++++++++++
>  kernel/cred.c            |   86 ++++++++++++++++++++++++++++++++++++++++++++++
>  security/dummy.c         |   15 ++++++++
>  security/security.c      |   13 +++++++
>  security/selinux/hooks.c |   45 ++++++++++++++++++++++++
>  6 files changed, 216 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> new file mode 100644
> index 0000000..c9f8906
> --- /dev/null
> +++ b/include/linux/cred.h
> @@ -0,0 +1,22 @@
> +/* Credential management
> + *
> + * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#ifndef _LINUX_CRED_H
> +#define _LINUX_CRED_H
> +
> +struct task_security;
> +struct inode;
> +
> +extern struct task_security *get_kernel_security(const char *,
> +						 struct task_struct *);
> +extern int change_create_files_as(struct task_security *, struct inode *);
> +
> +#endif /* _LINUX_CRED_H */
> diff --git a/include/linux/security.h b/include/linux/security.h
> index b7ba073..f0dce11 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -557,6 +557,18 @@ struct request_sock;
>   *	Duplicate and attach the security structure currently attached to the
>   *	p->security field.
>   *	Return 0 if operation was successful.
> + * @task_kernel_act_as:
> + *	Set the credentials for a kernel service to act as (subjective context).
> + *	@sec points to the task security record to be modified.
> + *	@service names the service making the request.
> + *	@daemon: A userspace daemon to be used as a reference.
> + *	Return 0 if successful.
> + * @task_create_files_as:
> + *	Set the file creation context in a task security record to be the same
> + *	as the objective context of the specified inode.
> + *	@sec points to the task security record to be modified.
> + *	@inode points to the inode to use as a reference.
> + *	Return 0 if successful.
>   * @task_setuid:
>   *	Check permission before setting one or more of the user identity
>   *	attributes of the current process.  The @flags parameter indicates
> @@ -1321,6 +1333,11 @@ struct security_operations {
>  	int (*task_alloc_security) (struct task_struct *p);
>  	void (*task_free_security) (struct task_security *p);
>  	int (*task_dup_security) (struct task_security *p);
> +	int (*task_kernel_act_as)(struct task_security *sec,
> +				  const char *service,
> +				  struct task_struct *daemon);
> +	int (*task_create_files_as)(struct task_security *sec,
> +				    struct inode *inode);
>  	int (*task_setuid) (uid_t id0, uid_t id1, uid_t id2, int flags);
>  	int (*task_post_setuid) (uid_t old_ruid /* or fsuid */ ,
>  				 uid_t old_euid, uid_t old_suid, int flags);
> @@ -1571,6 +1588,11 @@ int security_task_create(unsigned long clone_flags);
>  int security_task_alloc(struct task_struct *p);
>  void security_task_free(struct task_security *p);
>  int security_task_dup(struct task_security *p);
> +int security_task_kernel_act_as(struct task_security *sec,
> +				const char *service,
> +				struct task_struct *daemon);
> +int security_task_create_files_as(struct task_security *sec,
> +				  struct inode *inode);
>  int security_task_setuid(uid_t id0, uid_t id1, uid_t id2, int flags);
>  int security_task_post_setuid(uid_t old_ruid, uid_t old_euid,
>  			       uid_t old_suid, int flags);
> @@ -2054,6 +2076,19 @@ static inline int security_task_dup(struct task_security *p)
>  	return 0;
>  }
>  
> +static inline int security_task_kernel_act_as(struct task_security *sec,
> +					      const char *service,
> +					      struct task_struct *daemon)
> +{
> +	return 0;
> +}
> +
> +static inline int security_task_create_files_as(struct task_security *sec,
> +						struct inode *inode)
> +{
> +	return 0;
> +}
> +
>  static inline int security_task_setuid (uid_t id0, uid_t id1, uid_t id2,
>  					int flags)
>  {
> diff --git a/kernel/cred.c b/kernel/cred.c
> index ddf98c6..eac1d0c 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -11,6 +11,7 @@
>  #include <linux/module.h>
>  #include <linux/sched.h>
>  #include <linux/key.h>
> +#include <linux/keyctl.h>
>  #include <linux/init_task.h>
>  #include <linux/security.h>
>  
> @@ -137,3 +138,88 @@ void put_task_security(struct task_security *sec)
>  	}
>  }
>  EXPORT_SYMBOL(put_task_security);
> +
> +/**
> + * get_kernel_security - Get a task security record for a named kernel service
> + * @service: The name of the service
> + * @daemon: A userspace daemon to be used as a reference
> + *
> + * Get a task security record for a specific kernel service.  This can then be
> + * used to override a task's own security so that work can be done on behalf of
> + * that task that requires a different security context.
> + *
> + * @daemon is used to provide a base for the security record, but can be NULL.
> + * If @daemon is supplied, then the security data will be derived from that;
> + * otherwise they'll be set to 0 and no groups, full capabilities and no keys.
> + *
> + * @daemon is also passd to the LSM module as a base from which to initialise
> + * any MAC controls.
> + *
> + * The caller may change these controls afterwards if desired.
> + */
> +struct task_security *get_kernel_security(const char *service,
> +					  struct task_struct *daemon)
> +{
> +	const struct task_security *dsec;
> +	struct task_security *sec;
> +	int ret;
> +
> +	sec = kzalloc(sizeof *sec, GFP_KERNEL);
> +	if (!sec)
> +		return ERR_PTR(-ENOMEM);
> +
> +	if (daemon) {
> +		rcu_read_lock();
> +		dsec = rcu_dereference(daemon->sec);
> +		*sec = *dsec;
> +		get_group_info(sec->group_info);
> +		get_uid(sec->user);
> +		rcu_read_unlock();
> +#ifdef CONFIG_KEYS
> +		sec->request_key_auth = NULL;
> +		sec->thread_keyring = NULL;
> +		sec->tgsec = NULL;
> +#endif
> +	} else {
> +		sec->keep_capabilities = 0;
> +		sec->cap_inheritable = CAP_INIT_INH_SET;
> +		sec->cap_permitted = CAP_FULL_SET;
> +		sec->cap_effective = CAP_INIT_EFF_SET;
> +		sec->user = &root_user;
> +		get_uid(sec->user);
> +		sec->group_info = &init_groups;
> +		get_group_info(sec->group_info);
> +	}
> +
> +	atomic_set(&sec->usage, 1);
> +	spin_lock_init(&sec->lock);
> +#ifdef CONFIG_KEYS
> +	sec->jit_keyring = KEY_REQKEY_DEFL_THREAD_KEYRING;
> +#endif
> +
> +	ret = security_task_kernel_act_as(sec, service, daemon);
> +	if (ret < 0) {
> +		put_task_security(sec);
> +		return ERR_PTR(ret);
> +	}
> +
> +	return sec;
> +}
> +EXPORT_SYMBOL(get_kernel_security);
> +
> +/**
> + * change_create_files_as - Change the file create context in a security record
> + * @sec: The security record to alter
> + * @inode: The inode to take the context from
> + *
> + * Change the file creation context in a security record to be the same as the
> + * object context of the specified inode, so that the new inodes have the same
> + * MAC context as that inode.
> + */
> +int change_create_files_as(struct task_security *sec, struct inode *inode)
> +{
> +	sec->fsuid = inode->i_uid;
> +	sec->fsgid = inode->i_gid;
> +	return security_task_create_files_as(sec, inode);
> +}
> +EXPORT_SYMBOL(change_create_files_as);
> diff --git a/security/dummy.c b/security/dummy.c
> index 526549d..1c8bad6 100644
> --- a/security/dummy.c
> +++ b/security/dummy.c
> @@ -495,6 +495,19 @@ static int dummy_task_dup_security(struct task_security *p)
>  	return 0;
>  }
>  
> +static int dummy_task_kernel_act_as(struct task_security *sec,
> +				    const char *service,
> +				    struct task_struct *daemon)
> +{
> +	return 0;
> +}
> +
> +static int dummy_task_create_files_as(struct task_security *sec,
> +				      struct inode *inode)
> +{
> +	return 0;
> +}
> +
>  static int dummy_task_setuid (uid_t id0, uid_t id1, uid_t id2, int flags)
>  {
>  	return 0;
> @@ -1063,6 +1076,8 @@ void security_fixup_ops (struct security_operations *ops)
>  	set_to_dummy_if_null(ops, task_alloc_security);
>  	set_to_dummy_if_null(ops, task_free_security);
>  	set_to_dummy_if_null(ops, task_dup_security);
> +	set_to_dummy_if_null(ops, task_kernel_act_as);
> +	set_to_dummy_if_null(ops, task_create_files_as);
>  	set_to_dummy_if_null(ops, task_setuid);
>  	set_to_dummy_if_null(ops, task_post_setuid);
>  	set_to_dummy_if_null(ops, task_setgid);
> diff --git a/security/security.c b/security/security.c
> index 92d66d6..86d94e5 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -583,6 +583,19 @@ int security_task_dup(struct task_security *sec)
>  	return security_ops->task_dup_security(sec);
>  }
>  
> +int security_task_kernel_act_as(struct task_security *sec,
> +				const char *service,
> +				struct task_struct *daemon)
> +{
> +	return security_ops->task_kernel_act_as(sec, service, daemon);
> +}
> +
> +int security_task_create_files_as(struct task_security *sec,
> +				  struct inode *inode)
> +{
> +	return security_ops->task_create_files_as(sec, inode);
> +}
> +
>  int security_task_setuid(uid_t id0, uid_t id1, uid_t id2, int flags)
>  {
>  	return security_ops->task_setuid(id0, id1, id2, flags);
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 20a6b55..2416f54 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -2846,6 +2846,49 @@ static int selinux_task_dup_security(struct task_security *sec)
>  	return 0;
>  }
>  
> +/*
> + * get the security data for a kernel service, deriving the subjective context
> + * from the security data of a userspace daemon if one supplied
> + * - all the creation contexts are set to unlabelled
> + */
> +static int selinux_task_kernel_act_as(struct task_security *sec,
> +				      const char *service,
> +				      struct task_struct *daemon)
> +{
> +	struct task_security_struct *tsec, *dtsec;
> +	u32 ksid;
> +	int ret;
> +
> +	tsec = sec->security;
> +	dtsec = daemon ? daemon->sec->security : init_task.sec->security;
> +
> +	ret = security_transition_sid(dtsec->sid, SECINITSID_KERNEL,
> +				      SECCLASS_PROCESS, &ksid);
> +	if (ret < 0)
> +		return ret;
> +
> +	tsec->sid = ksid;
> +	tsec->create_sid = SECINITSID_UNLABELED;
> +	tsec->keycreate_sid = SECINITSID_UNLABELED;
> +	tsec->sockcreate_sid = SECINITSID_UNLABELED;

These SIDs need to be cleared rather than set to SECINITSID_UNLABELED.

> +	sec->security = tsec;
> +	return 0;
> +}
> +
> +/*
> + * set the file creation context in a security record to the same as the
> + * objective context of the specified inode
> + */
> +static int selinux_task_create_files_as(struct task_security *sec,
> +					struct inode *inode)
> +{
> +	struct task_security_struct *tsec = sec->security;
> +	struct inode_security_struct *isec = inode->i_security;
> +
> +	tsec->create_sid = isec->sid;
> +	return 0;
> +}
> +
>  static int selinux_task_setuid(uid_t id0, uid_t id1, uid_t id2, int flags)
>  {
>  	/* Since setuid only affects the current process, and
> @@ -4884,6 +4927,8 @@ static struct security_operations selinux_ops = {
>  	.task_alloc_security =		selinux_task_alloc_security,
>  	.task_free_security =		selinux_task_free_security,
>  	.task_dup_security =		selinux_task_dup_security,
> +	.task_kernel_act_as =		selinux_task_kernel_act_as,
> +	.task_create_files_as =		selinux_task_create_files_as,
>  	.task_setuid =			selinux_task_setuid,
>  	.task_post_setuid =		selinux_task_post_setuid,
>  	.task_setgid =			selinux_task_setgid,
> 
> 
> --
> This message was distributed to subscribers of the selinux mailing list.
> If you no longer wish to subscribe, send mail to majordomo@tycho.nsa.gov with
> the words "unsubscribe selinux" without quotes as the message.
-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-05 19:38 ` [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions " David Howells
  2007-12-10 16:46   ` Stephen Smalley
@ 2007-12-10 17:07   ` David Howells
  2007-12-10 17:23     ` Stephen Smalley
                       ` (3 more replies)
  1 sibling, 4 replies; 126+ messages in thread
From: David Howells @ 2007-12-10 17:07 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, viro, hch, Trond.Myklebust, casey, linux-kernel,
	selinux, linux-security-module


Stephen Smalley <sds@tycho.nsa.gov> wrote:

> > +	tsec->create_sid = SECINITSID_UNLABELED;
> > +	tsec->keycreate_sid = SECINITSID_UNLABELED;
> > +	tsec->sockcreate_sid = SECINITSID_UNLABELED;

Cleared means what?  Setting to 0?  Or is there some other constant I should
use for that?

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-10 17:07   ` David Howells
@ 2007-12-10 17:23     ` Stephen Smalley
  2007-12-10 21:08     ` David Howells
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2007-12-10 17:23 UTC (permalink / raw)
  To: David Howells
  Cc: viro, hch, Trond.Myklebust, casey, linux-kernel, selinux,
	linux-security-module

On Mon, 2007-12-10 at 17:07 +0000, David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > > +	tsec->create_sid = SECINITSID_UNLABELED;
> > > +	tsec->keycreate_sid = SECINITSID_UNLABELED;
> > > +	tsec->sockcreate_sid = SECINITSID_UNLABELED;
> 
> Cleared means what?  Setting to 0?  Or is there some other constant I should
> use for that?

Yes, setting to 0.

Otherwise, only other issue I have with this interface is it won't
generalize to dealing with nfsd, where we want to set the acting context
to a context we obtain from or determine based upon the client.

Why can't cachefilesd just push a context into the kernel and pass that
into the hook as the acting context, and then nfsd can do likewise using
the context provided by the client or obtained locally from exports for
ordinary clients?  Avoids the transition SID computation altogether
within the kernel and makes this more generic.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-10 17:07   ` David Howells
  2007-12-10 17:23     ` Stephen Smalley
@ 2007-12-10 21:08     ` David Howells
  2007-12-10 21:27       ` Stephen Smalley
  2007-12-10 23:36       ` David Howells
  2008-01-09 16:51     ` David Howells
  2008-01-09 17:27     ` David Howells
  3 siblings, 2 replies; 126+ messages in thread
From: David Howells @ 2007-12-10 21:08 UTC (permalink / raw)
  To: Stephen Smalley, Karl MacMillan
  Cc: dhowells, viro, hch, Trond.Myklebust, casey, linux-kernel,
	selinux, linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> Otherwise, only other issue I have with this interface is it won't
> generalize to dealing with nfsd, where we want to set the acting context
> to a context we obtain from or determine based upon the client.

Are you speaking of security_kernel_act_as() and security_create_files_as()
specifically?  Or the task_struct::act_as override pointer in general?

I don't really know how nfsd wants to obtain and set its LSM context, so it's
a bit difficult for me to make something that works for nfsd as well as
cachefiles.

> Why can't cachefilesd just push a context into the kernel and pass that
> into the hook as the acting context,

How does cachefilesd come up with such a context?  Grab it from
/etc/cachefilesd.conf?

I use to do that, but someone objected...  Possibly Karl MacMillan.

> and then nfsd can do likewise using the context provided by the client or
> obtained locally from exports for ordinary clients?  Avoids the transition
> SID computation altogether within the kernel and makes this more generic.

I seem to remember that I was told that it should be done this way, possibly
by Karl MacMillan, but I don't remember exactly.

Now it's configured by cachefilesd.te:

	type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-10 21:08     ` David Howells
@ 2007-12-10 21:27       ` Stephen Smalley
  2007-12-10 22:26         ` Casey Schaufler
  2007-12-10 23:36       ` David Howells
  1 sibling, 1 reply; 126+ messages in thread
From: Stephen Smalley @ 2007-12-10 21:27 UTC (permalink / raw)
  To: David Howells
  Cc: Karl MacMillan, viro, hch, Trond.Myklebust, casey, linux-kernel,
	selinux, linux-security-module

On Mon, 2007-12-10 at 21:08 +0000, David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > Otherwise, only other issue I have with this interface is it won't
> > generalize to dealing with nfsd, where we want to set the acting context
> > to a context we obtain from or determine based upon the client.
> 
> Are you speaking of security_kernel_act_as() and security_create_files_as()
> specifically?  Or the task_struct::act_as override pointer in general?

security_kernel_act_as()

> I don't really know how nfsd wants to obtain and set its LSM context, so it's
> a bit difficult for me to make something that works for nfsd as well as
> cachefiles.

It would get a context from the client or from a local configuration
that would map security-unaware clients to a default context, and then
want to assume that context for the particular operation.  No transition
involved.

> > Why can't cachefilesd just push a context into the kernel and pass that
> > into the hook as the acting context,
> 
> How does cachefilesd come up with such a context?  Grab it from
> /etc/cachefilesd.conf?

>From a config file whose pathname would be provided by libselinux (ala
the way in which dbusd imports contexts), or directly as a context
returned by a libselinux function.  Has to be done that way so that it
can be set differently for different policy types (strict, targeted,
mls).

Naturally, cachefiles (the kernel module) would invoke a security hook
to check whether the daemon is allowed to set the specified context.

> I use to do that, but someone objected...  Possibly Karl MacMillan.

Yes, but I think I disagreed then too.

> > and then nfsd can do likewise using the context provided by the client or
> > obtained locally from exports for ordinary clients?  Avoids the transition
> > SID computation altogether within the kernel and makes this more generic.
> 
> I seem to remember that I was told that it should be done this way, possibly
> by Karl MacMillan, but I don't remember exactly.
> 
> Now it's configured by cachefilesd.te:
> 
> 	type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;

It doesn't fit with how other users of security_kernel_act_as() will
likely want to work (they will want to just set the context to a
specified value, whether one obtained from the client or from some local
source), nor with how type transitions normally work (exec, with the
program type as the second type field).  I think it will just cause
confusion and subtle breakage.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-10 21:27       ` Stephen Smalley
@ 2007-12-10 22:26         ` Casey Schaufler
  2007-12-10 23:44           ` David Howells
  2007-12-11 18:34           ` Stephen Smalley
  0 siblings, 2 replies; 126+ messages in thread
From: Casey Schaufler @ 2007-12-10 22:26 UTC (permalink / raw)
  To: Stephen Smalley, David Howells
  Cc: Karl MacMillan, viro, hch, Trond.Myklebust, casey, linux-kernel,
	selinux, linux-security-module


--- Stephen Smalley <sds@tycho.nsa.gov> wrote:

> On Mon, 2007-12-10 at 21:08 +0000, David Howells wrote:
> > Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > 
> > > Otherwise, only other issue I have with this interface is it won't
> > > generalize to dealing with nfsd, where we want to set the acting context
> > > to a context we obtain from or determine based upon the client.
> > 
> > Are you speaking of security_kernel_act_as() and security_create_files_as()
> > specifically?  Or the task_struct::act_as override pointer in general?
> 
> security_kernel_act_as()
> 
> > I don't really know how nfsd wants to obtain and set its LSM context, so
> it's
> > a bit difficult for me to make something that works for nfsd as well as
> > cachefiles.
> 
> It would get a context from the client or from a local configuration
> that would map security-unaware clients to a default context, and then
> want to assume that context for the particular operation.  No transition
> involved.

I would expect that the operation would be more sophisticated
than that. You certainly aren't going to use what comes from
the other side without any processing, and I expect you'll have
some sort of operation on anything you pull from a config file
before you actually apply it.

> > > Why can't cachefilesd just push a context into the kernel and pass that
> > > into the hook as the acting context,
> > 
> > How does cachefilesd come up with such a context?  Grab it from
> > /etc/cachefilesd.conf?
> 
> >From a config file whose pathname would be provided by libselinux (ala
> the way in which dbusd imports contexts), or directly as a context
> returned by a libselinux function.  Has to be done that way so that it
> can be set differently for different policy types (strict, targeted,
> mls).

Unless you've got an LSM other than SELinux, of course. If
cachefilesd is going to be responsible for maintaining this
magic context there needs to be an LSM interface for it, not
just an SELinux interface.

> Naturally, cachefiles (the kernel module) would invoke a security hook
> to check whether the daemon is allowed to set the specified context.
> 
> > I use to do that, but someone objected...  Possibly Karl MacMillan.
> 
> Yes, but I think I disagreed then too.
> 
> > > and then nfsd can do likewise using the context provided by the client or
> > > obtained locally from exports for ordinary clients?  Avoids the
> transition
> > > SID computation altogether within the kernel and makes this more generic.
> > 
> > I seem to remember that I was told that it should be done this way,
> possibly
> > by Karl MacMillan, but I don't remember exactly.
> > 
> > Now it's configured by cachefilesd.te:
> > 
> > 	type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
> 
> It doesn't fit with how other users of security_kernel_act_as() will
> likely want to work (they will want to just set the context to a
> specified value, whether one obtained from the client or from some local
> source), nor with how type transitions normally work (exec, with the
> program type as the second type field).  I think it will just cause
> confusion and subtle breakage.

I think that I agree with Stephen, although I could be mirely confused.
That happens to me when interfaces are described in SELinux terms. I
still don't care much for multiple contexts, and I don't have a good
grasp of how you'll deal with Smack, or any LSM other than SELinux.
Just as Stephen mentions, I also don't see the generality that a change
of this magnitude really ought to provide.



Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-10 21:08     ` David Howells
  2007-12-10 21:27       ` Stephen Smalley
@ 2007-12-10 23:36       ` David Howells
  2007-12-10 23:46         ` Casey Schaufler
                           ` (2 more replies)
  1 sibling, 3 replies; 126+ messages in thread
From: David Howells @ 2007-12-10 23:36 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, Karl MacMillan, viro, hch, Trond.Myklebust, casey,
	linux-kernel, selinux, linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> From a config file whose pathname would be provided by libselinux (ala
> the way in which dbusd imports contexts), or directly as a context
> returned by a libselinux function.

That sounds too SELinux specific.  How do I do it so that it works for any
LSM?

Is linking against libselinux is a viable option if it's not available under
all LSM models?  Is it available under all LSM models?  Perhaps Casey can
answer this one.

> > I use to do that, but someone objected...  Possibly Karl MacMillan.
> 
> Yes, but I think I disagreed then too.

So, who's right?

> It doesn't fit with how other users of security_kernel_act_as() will
> likely want to work (they will want to just set the context to a
> specified value, whether one obtained from the client or from some local
> source), nor with how type transitions normally work (exec, with the
> program type as the second type field).  I think it will just cause
> confusion and subtle breakage.

It's causing me lots of confusion as it is.  I have been / am being told by
different people to do different things just in dealing with SELinux, and
various people are raising extra requirements or restrictions beyond that.
There doesn't seem to be a consensus.

It sounds like the best option is just to have the kernel nick the userspace
daemon's security context and use that as is, and junk all the restrictions on
what the daemon can do so that the kernel isn't too restricted.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-10 22:26         ` Casey Schaufler
@ 2007-12-10 23:44           ` David Howells
  2007-12-10 23:56             ` Casey Schaufler
  2007-12-11 18:34           ` Stephen Smalley
  1 sibling, 1 reply; 126+ messages in thread
From: David Howells @ 2007-12-10 23:44 UTC (permalink / raw)
  To: casey
  Cc: dhowells, Stephen Smalley, Karl MacMillan, viro, hch,
	Trond.Myklebust, linux-kernel, selinux, linux-security-module

Casey Schaufler <casey@schaufler-ca.com> wrote:

> That happens to me when interfaces are described in SELinux terms. I
> still don't care much for multiple contexts, and I don't have a good
> grasp of how you'll deal with Smack, or any LSM other than SELinux.

Me neither.  I understand SELinux somewhat, though it's got a lot of wibbly
bits, and WinNT's security system, but I have no experience of the other
stuff.

> Just as Stephen mentions, I also don't see the generality that a change
> of this magnitude really ought to provide.

Perhaps it should be a specific interface, solely for cachefiles's use then.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-10 23:36       ` David Howells
@ 2007-12-10 23:46         ` Casey Schaufler
  2007-12-11 19:52           ` Stephen Smalley
  2007-12-11 19:37         ` Stephen Smalley
  2007-12-11 20:42         ` David Howells
  2 siblings, 1 reply; 126+ messages in thread
From: Casey Schaufler @ 2007-12-10 23:46 UTC (permalink / raw)
  To: David Howells, Stephen Smalley
  Cc: dhowells, Karl MacMillan, viro, hch, Trond.Myklebust, casey,
	linux-kernel, selinux, linux-security-module


--- David Howells <dhowells@redhat.com> wrote:

> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > From a config file whose pathname would be provided by libselinux (ala
> > the way in which dbusd imports contexts), or directly as a context
> > returned by a libselinux function.
> 
> That sounds too SELinux specific.  How do I do it so that it works for any
> LSM?
> 
> Is linking against libselinux is a viable option if it's not available under
> all LSM models?  Is it available under all LSM models?  Perhaps Casey can
> answer this one.

Linking against libselinux is not now, nor will it ever be, a viable
option. There's just too much sophistication contained in libselinux
for us simple folk to deal with.

> > > I use to do that, but someone objected...  Possibly Karl MacMillan.
> > 
> > Yes, but I think I disagreed then too.
> 
> So, who's right?

Me! (smiley inserted here, for those in need)

> > It doesn't fit with how other users of security_kernel_act_as() will
> > likely want to work (they will want to just set the context to a
> > specified value, whether one obtained from the client or from some local
> > source), nor with how type transitions normally work (exec, with the
> > program type as the second type field).  I think it will just cause
> > confusion and subtle breakage.
> 
> It's causing me lots of confusion as it is.  I have been / am being told by
> different people to do different things just in dealing with SELinux, and
> various people are raising extra requirements or restrictions beyond that.
> There doesn't seem to be a consensus.
> 
> It sounds like the best option is just to have the kernel nick the userspace
> daemon's security context and use that as is, and junk all the restrictions
> on
> what the daemon can do so that the kernel isn't too restricted.

That would be consistant with the (perhaps archaic now) behavior
of nfsd on Unix, which did nothing but "lend it's credential" to the
underlying kernel code. I think it's a rational approach, although I
expect that in may have troubles under SELinux.


Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-10 23:44           ` David Howells
@ 2007-12-10 23:56             ` Casey Schaufler
  0 siblings, 0 replies; 126+ messages in thread
From: Casey Schaufler @ 2007-12-10 23:56 UTC (permalink / raw)
  To: David Howells, casey
  Cc: dhowells, Stephen Smalley, Karl MacMillan, viro, hch,
	Trond.Myklebust, linux-kernel, selinux, linux-security-module


--- David Howells <dhowells@redhat.com> wrote:

> Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
> > That happens to me when interfaces are described in SELinux terms. I
> > still don't care much for multiple contexts, and I don't have a good
> > grasp of how you'll deal with Smack, or any LSM other than SELinux.
> 
> Me neither.  I understand SELinux somewhat, though it's got a lot of wibbly
> bits, and WinNT's security system, but I have no experience of the other
> stuff.
> 
> > Just as Stephen mentions, I also don't see the generality that a change
> > of this magnitude really ought to provide.
> 
> Perhaps it should be a specific interface, solely for cachefiles's use then.

That would help focus things, to be sure. I don't know if that
focus will speed things up or slow them down, but I think that
attempting to accomodate SELinux/NFS, with the state that effort
is in, will only lead to tears.


Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-10 22:26         ` Casey Schaufler
  2007-12-10 23:44           ` David Howells
@ 2007-12-11 18:34           ` Stephen Smalley
  2007-12-11 19:26             ` Casey Schaufler
  1 sibling, 1 reply; 126+ messages in thread
From: Stephen Smalley @ 2007-12-11 18:34 UTC (permalink / raw)
  To: casey
  Cc: David Howells, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module

On Mon, 2007-12-10 at 14:26 -0800, Casey Schaufler wrote:
> --- Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > On Mon, 2007-12-10 at 21:08 +0000, David Howells wrote:
> > > Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > > 
> > > > Otherwise, only other issue I have with this interface is it won't
> > > > generalize to dealing with nfsd, where we want to set the acting context
> > > > to a context we obtain from or determine based upon the client.
> > > 
> > > Are you speaking of security_kernel_act_as() and security_create_files_as()
> > > specifically?  Or the task_struct::act_as override pointer in general?
> > 
> > security_kernel_act_as()
> > 
> > > I don't really know how nfsd wants to obtain and set its LSM context, so
> > it's
> > > a bit difficult for me to make something that works for nfsd as well as
> > > cachefiles.
> > 
> > It would get a context from the client or from a local configuration
> > that would map security-unaware clients to a default context, and then
> > want to assume that context for the particular operation.  No transition
> > involved.
> 
> I would expect that the operation would be more sophisticated
> than that. You certainly aren't going to use what comes from
> the other side without any processing, and I expect you'll have
> some sort of operation on anything you pull from a config file
> before you actually apply it.

Yes, that's true - the contexts would be subjected to a permission
check.  But that's separable from the act of setting it as the task's
acting security state (and needs to be separated, as the precise check
will vary depending on the situation - cachefiles is going to apply a
different sort of check than nfsd).

> > > > Why can't cachefilesd just push a context into the kernel and pass that
> > > > into the hook as the acting context,
> > > 
> > > How does cachefilesd come up with such a context?  Grab it from
> > > /etc/cachefilesd.conf?
> > 
> > >From a config file whose pathname would be provided by libselinux (ala
> > the way in which dbusd imports contexts), or directly as a context
> > returned by a libselinux function.  Has to be done that way so that it
> > can be set differently for different policy types (strict, targeted,
> > mls).
> 
> Unless you've got an LSM other than SELinux, of course. If
> cachefilesd is going to be responsible for maintaining this
> magic context there needs to be an LSM interface for it, not
> just an SELinux interface.

LSM is an in-kernel interface.  Here we are talking about a userspace
interface for obtaining the right security label to use.  There is no
equivalent to LSM in userspace as of yet.  Feel free to invent one, but
don't ask the rest of us to do it or wait for it to materialize.

> > Naturally, cachefiles (the kernel module) would invoke a security hook
> > to check whether the daemon is allowed to set the specified context.
> > 
> > > I use to do that, but someone objected...  Possibly Karl MacMillan.
> > 
> > Yes, but I think I disagreed then too.
> > 
> > > > and then nfsd can do likewise using the context provided by the client or
> > > > obtained locally from exports for ordinary clients?  Avoids the
> > transition
> > > > SID computation altogether within the kernel and makes this more generic.
> > > 
> > > I seem to remember that I was told that it should be done this way,
> > possibly
> > > by Karl MacMillan, but I don't remember exactly.
> > > 
> > > Now it's configured by cachefilesd.te:
> > > 
> > > 	type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
> > 
> > It doesn't fit with how other users of security_kernel_act_as() will
> > likely want to work (they will want to just set the context to a
> > specified value, whether one obtained from the client or from some local
> > source), nor with how type transitions normally work (exec, with the
> > program type as the second type field).  I think it will just cause
> > confusion and subtle breakage.
> 
> I think that I agree with Stephen, although I could be mirely confused.
> That happens to me when interfaces are described in SELinux terms. I
> still don't care much for multiple contexts, and I don't have a good
> grasp of how you'll deal with Smack, or any LSM other than SELinux.
> Just as Stephen mentions, I also don't see the generality that a change
> of this magnitude really ought to provide.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 18:34           ` Stephen Smalley
@ 2007-12-11 19:26             ` Casey Schaufler
  2007-12-11 19:56               ` Stephen Smalley
  0 siblings, 1 reply; 126+ messages in thread
From: Casey Schaufler @ 2007-12-11 19:26 UTC (permalink / raw)
  To: Stephen Smalley, casey
  Cc: David Howells, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module


--- Stephen Smalley <sds@tycho.nsa.gov> wrote:

> On Mon, 2007-12-10 at 14:26 -0800, Casey Schaufler wrote:
> > --- Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > 
> > > On Mon, 2007-12-10 at 21:08 +0000, David Howells wrote:
> > > > Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > > > 
> > > > > Otherwise, only other issue I have with this interface is it won't
> > > > > generalize to dealing with nfsd, where we want to set the acting
> context
> > > > > to a context we obtain from or determine based upon the client.
> > > > 
> > > > Are you speaking of security_kernel_act_as() and
> security_create_files_as()
> > > > specifically?  Or the task_struct::act_as override pointer in general?
> > > 
> > > security_kernel_act_as()
> > > 
> > > > I don't really know how nfsd wants to obtain and set its LSM context,
> so
> > > it's
> > > > a bit difficult for me to make something that works for nfsd as well as
> > > > cachefiles.
> > > 
> > > It would get a context from the client or from a local configuration
> > > that would map security-unaware clients to a default context, and then
> > > want to assume that context for the particular operation.  No transition
> > > involved.
> > 
> > I would expect that the operation would be more sophisticated
> > than that. You certainly aren't going to use what comes from
> > the other side without any processing, and I expect you'll have
> > some sort of operation on anything you pull from a config file
> > before you actually apply it.
> 
> Yes, that's true - the contexts would be subjected to a permission
> check.  But that's separable from the act of setting it as the task's
> acting security state (and needs to be separated, as the precise check
> will vary depending on the situation - cachefiles is going to apply a
> different sort of check than nfsd).
> 
> > > > > Why can't cachefilesd just push a context into the kernel and pass
> that
> > > > > into the hook as the acting context,
> > > > 
> > > > How does cachefilesd come up with such a context?  Grab it from
> > > > /etc/cachefilesd.conf?
> > > 
> > > >From a config file whose pathname would be provided by libselinux (ala
> > > the way in which dbusd imports contexts), or directly as a context
> > > returned by a libselinux function.  Has to be done that way so that it
> > > can be set differently for different policy types (strict, targeted,
> > > mls).
> > 
> > Unless you've got an LSM other than SELinux, of course. If
> > cachefilesd is going to be responsible for maintaining this
> > magic context there needs to be an LSM interface for it, not
> > just an SELinux interface.
> 
> LSM is an in-kernel interface.  Here we are talking about a userspace
> interface for obtaining the right security label to use.  There is no
> equivalent to LSM in userspace as of yet.  Feel free to invent one, but
> don't ask the rest of us to do it or wait for it to materialize.

I am much more concerned with the interfaces used to pass the
information into the kernel. I would expect that to be LSM
independent, not a call into libselinux that resolves into a
selinuxfs operation, or it's netlink equivilant. It would be
unfortunate if the userland/kernel interface became an obstacle
to cachefiles being adopted.

> ...


Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-10 23:36       ` David Howells
  2007-12-10 23:46         ` Casey Schaufler
@ 2007-12-11 19:37         ` Stephen Smalley
  2007-12-12 14:41           ` Karl MacMillan
  2007-12-12 14:53           ` David Howells
  2007-12-11 20:42         ` David Howells
  2 siblings, 2 replies; 126+ messages in thread
From: Stephen Smalley @ 2007-12-11 19:37 UTC (permalink / raw)
  To: David Howells
  Cc: Karl MacMillan, viro, hch, Trond.Myklebust, casey, linux-kernel,
	selinux, linux-security-module

On Mon, 2007-12-10 at 23:36 +0000, David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > From a config file whose pathname would be provided by libselinux (ala
> > the way in which dbusd imports contexts), or directly as a context
> > returned by a libselinux function.
> 
> That sounds too SELinux specific.  How do I do it so that it works for any
> LSM?

You can't.  There is no LSM for userspace; LSM specifically disavowed
any common userspace API, and that was one of our original
objections/concerns about it.

> Is linking against libselinux is a viable option if it's not available under
> all LSM models?  Is it available under all LSM models?  Perhaps Casey can
> answer this one.

Nope, they would all have their own libraries, if they have a library at
all.  But that isn't your problem - your kernel interface should be
generic, and your LSM hooks should be generic, but your userspace isn't
required to be.  Have a look at how many programs in the distribution
currently link against libselinux, whether directly or by dlopen'ing it.

> > > I use to do that, but someone objected...  Possibly Karl MacMillan.
> > 
> > Yes, but I think I disagreed then too.
> 
> So, who's right?

Karl isn't a maintainer of the SELinux kernel code.  And I had thought
that even he had reconsidered this idea after further discussion.

> > It doesn't fit with how other users of security_kernel_act_as() will
> > likely want to work (they will want to just set the context to a
> > specified value, whether one obtained from the client or from some local
> > source), nor with how type transitions normally work (exec, with the
> > program type as the second type field).  I think it will just cause
> > confusion and subtle breakage.
> 
> It's causing me lots of confusion as it is.  I have been / am being told by
> different people to do different things just in dealing with SELinux, and
> various people are raising extra requirements or restrictions beyond that.
> There doesn't seem to be a consensus.
> 
> It sounds like the best option is just to have the kernel nick the userspace
> daemon's security context and use that as is, and junk all the restrictions on
> what the daemon can do so that the kernel isn't too restricted.

Well, you could do that, if that meets your needs, but it doesn't sound
very optimal either.  Why are you opposed to having userspace determine
the context and write it to a cachefiles interface, and just have the
kernel authorize it (invoke a hook to check permissions between the
daemon's context and the specified label), and make it the acting
context when appropriate (invoke a different hook to set it as the
acting context)?

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-10 23:46         ` Casey Schaufler
@ 2007-12-11 19:52           ` Stephen Smalley
  0 siblings, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2007-12-11 19:52 UTC (permalink / raw)
  To: casey
  Cc: David Howells, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module

On Mon, 2007-12-10 at 15:46 -0800, Casey Schaufler wrote:
> --- David Howells <dhowells@redhat.com> wrote:
> 
> > Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > 
> > > From a config file whose pathname would be provided by libselinux (ala
> > > the way in which dbusd imports contexts), or directly as a context
> > > returned by a libselinux function.
> > 
> > That sounds too SELinux specific.  How do I do it so that it works for any
> > LSM?
> > 
> > Is linking against libselinux is a viable option if it's not available under
> > all LSM models?  Is it available under all LSM models?  Perhaps Casey can
> > answer this one.
> 
> Linking against libselinux is not now, nor will it ever be, a viable
> option. There's just too much sophistication contained in libselinux
> for us simple folk to deal with.
> 
> > > > I use to do that, but someone objected...  Possibly Karl MacMillan.
> > > 
> > > Yes, but I think I disagreed then too.
> > 
> > So, who's right?
> 
> Me! (smiley inserted here, for those in need)
> 
> > > It doesn't fit with how other users of security_kernel_act_as() will
> > > likely want to work (they will want to just set the context to a
> > > specified value, whether one obtained from the client or from some local
> > > source), nor with how type transitions normally work (exec, with the
> > > program type as the second type field).  I think it will just cause
> > > confusion and subtle breakage.
> > 
> > It's causing me lots of confusion as it is.  I have been / am being told by
> > different people to do different things just in dealing with SELinux, and
> > various people are raising extra requirements or restrictions beyond that.
> > There doesn't seem to be a consensus.
> > 
> > It sounds like the best option is just to have the kernel nick the userspace
> > daemon's security context and use that as is, and junk all the restrictions
> > on
> > what the daemon can do so that the kernel isn't too restricted.
> 
> That would be consistant with the (perhaps archaic now) behavior
> of nfsd on Unix, which did nothing but "lend it's credential" to the
> underlying kernel code. I think it's a rational approach, although I
> expect that in may have troubles under SELinux.

nfsd needs to able to set the acting label to a value determined based
on the client so that file operations performed on behalf of the client
are subjected to the right set of permission checks and new files are
labeled properly, just as it already does for uid and gid (via fsuid and
fsgid).  So merely inheriting the label from the nfsd daemon doesn't
help with that purpose.

Both nfsd and cachefiles need a way to set the acting label, so having a
common hook for both to do that makes sense.  The authorization of that
label will differ, so splitting the authorization into a separate hook
also makes sense.
 
-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 19:26             ` Casey Schaufler
@ 2007-12-11 19:56               ` Stephen Smalley
  2007-12-11 20:40                 ` Casey Schaufler
  0 siblings, 1 reply; 126+ messages in thread
From: Stephen Smalley @ 2007-12-11 19:56 UTC (permalink / raw)
  To: casey
  Cc: David Howells, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module

On Tue, 2007-12-11 at 11:26 -0800, Casey Schaufler wrote:
> --- Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > On Mon, 2007-12-10 at 14:26 -0800, Casey Schaufler wrote:
> > > --- Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > > 
> > > > On Mon, 2007-12-10 at 21:08 +0000, David Howells wrote:
> > > > > Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > > > > 
> > > > > > Otherwise, only other issue I have with this interface is it won't
> > > > > > generalize to dealing with nfsd, where we want to set the acting
> > context
> > > > > > to a context we obtain from or determine based upon the client.
> > > > > 
> > > > > Are you speaking of security_kernel_act_as() and
> > security_create_files_as()
> > > > > specifically?  Or the task_struct::act_as override pointer in general?
> > > > 
> > > > security_kernel_act_as()
> > > > 
> > > > > I don't really know how nfsd wants to obtain and set its LSM context,
> > so
> > > > it's
> > > > > a bit difficult for me to make something that works for nfsd as well as
> > > > > cachefiles.
> > > > 
> > > > It would get a context from the client or from a local configuration
> > > > that would map security-unaware clients to a default context, and then
> > > > want to assume that context for the particular operation.  No transition
> > > > involved.
> > > 
> > > I would expect that the operation would be more sophisticated
> > > than that. You certainly aren't going to use what comes from
> > > the other side without any processing, and I expect you'll have
> > > some sort of operation on anything you pull from a config file
> > > before you actually apply it.
> > 
> > Yes, that's true - the contexts would be subjected to a permission
> > check.  But that's separable from the act of setting it as the task's
> > acting security state (and needs to be separated, as the precise check
> > will vary depending on the situation - cachefiles is going to apply a
> > different sort of check than nfsd).
> > 
> > > > > > Why can't cachefilesd just push a context into the kernel and pass
> > that
> > > > > > into the hook as the acting context,
> > > > > 
> > > > > How does cachefilesd come up with such a context?  Grab it from
> > > > > /etc/cachefilesd.conf?
> > > > 
> > > > >From a config file whose pathname would be provided by libselinux (ala
> > > > the way in which dbusd imports contexts), or directly as a context
> > > > returned by a libselinux function.  Has to be done that way so that it
> > > > can be set differently for different policy types (strict, targeted,
> > > > mls).
> > > 
> > > Unless you've got an LSM other than SELinux, of course. If
> > > cachefilesd is going to be responsible for maintaining this
> > > magic context there needs to be an LSM interface for it, not
> > > just an SELinux interface.
> > 
> > LSM is an in-kernel interface.  Here we are talking about a userspace
> > interface for obtaining the right security label to use.  There is no
> > equivalent to LSM in userspace as of yet.  Feel free to invent one, but
> > don't ask the rest of us to do it or wait for it to materialize.
> 
> I am much more concerned with the interfaces used to pass the
> information into the kernel. I would expect that to be LSM
> independent, not a call into libselinux that resolves into a
> selinuxfs operation, or it's netlink equivilant. It would be
> unfortunate if the userland/kernel interface became an obstacle
> to cachefiles being adopted.

That wasn't the issue.  The interface to the cachefiles module would
just consist of cachefilesd writing a string label to some pseudo file
tell cachefiles what label to apply as the acting label for operations
performed by cachefiles.  Which isn't SELinux-specific at all.

David was asking though how cachefilesd (the userspace agent) would
obtain such a label to use.  And that may very well be LSM-specific, and
as there is no LSM userspace API, it makes sense for him to invoke a
libselinux function at present.  If a liblsm is later created and
provides a common front-end API (internally dlopen'ing the right shared
library based on some configuration, whether libselinux or libsmack or
whatever), then cachefilesd can instead call the liblsm interface, but
that doesn't exist today.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 19:56               ` Stephen Smalley
@ 2007-12-11 20:40                 ` Casey Schaufler
  0 siblings, 0 replies; 126+ messages in thread
From: Casey Schaufler @ 2007-12-11 20:40 UTC (permalink / raw)
  To: Stephen Smalley, casey
  Cc: David Howells, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module


--- Stephen Smalley <sds@tycho.nsa.gov> wrote:

> 
> > I am much more concerned with the interfaces used to pass the
> > information into the kernel. I would expect that to be LSM
> > independent, not a call into libselinux that resolves into a
> > selinuxfs operation, or it's netlink equivilant. It would be
> > unfortunate if the userland/kernel interface became an obstacle
> > to cachefiles being adopted.
> 
> That wasn't the issue.  The interface to the cachefiles module would
> just consist of cachefilesd writing a string label to some pseudo file
> tell cachefiles what label to apply as the acting label for operations
> performed by cachefiles.  Which isn't SELinux-specific at all.

Calm down. Just checking.

> David was asking though how cachefilesd (the userspace agent) would
> obtain such a label to use.  And that may very well be LSM-specific, and
> as there is no LSM userspace API, it makes sense for him to invoke a
> libselinux function at present.  If a liblsm is later created and
> provides a common front-end API (internally dlopen'ing the right shared
> library based on some configuration, whether libselinux or libsmack or
> whatever), then cachefilesd can instead call the liblsm interface, but
> that doesn't exist today.

I am certainly not in favor of adding such complexity.
I suggest that cachefilesd get the context it wants using
whatever scheme works. I think there should be a cachefiles
specific (LSM independent) scheme for getting it into the
kernel, and a LSM hooks for setting it, if that's what he
really wants to do.



Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-10 23:36       ` David Howells
  2007-12-10 23:46         ` Casey Schaufler
  2007-12-11 19:37         ` Stephen Smalley
@ 2007-12-11 20:42         ` David Howells
  2007-12-11 21:18           ` Casey Schaufler
                             ` (4 more replies)
  2 siblings, 5 replies; 126+ messages in thread
From: David Howells @ 2007-12-11 20:42 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, Karl MacMillan, viro, hch, Trond.Myklebust, casey,
	linux-kernel, selinux, linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> > That sounds too SELinux specific.  How do I do it so that it works for any
> > LSM?
> 
> You can't.  There is no LSM for userspace; LSM specifically disavowed
> any common userspace API, and that was one of our original
> objections/concerns about it.

So, basically, userspace programs (outside of security tools) aren't supposed
to need access to the security infrastructure?

> > Is linking against libselinux is a viable option if it's not available under
> > all LSM models?  Is it available under all LSM models?  Perhaps Casey can
> > answer this one.
> 
> Nope, they would all have their own libraries, if they have a library at
> all.  But that isn't your problem

It is if I have to maintain a special pieces of code for each possible LSM.
One piece for SELinux, one piece for AppArmour, one piece for Smack, one piece
for Casey's security system.  That sounds like a pain.

> - your kernel interface should be generic, and your LSM hooks should be
> generic, but your userspace isn't required to be.  Have a look at how many
> programs in the distribution currently link against libselinux, whether
> directly or by dlopen'ing it.

In /usr/bin ldd reports approximately 297 binaries link to libselinux, though
I can't say how many of those linked against it directly rather than picking
it up by contamination through a shared library.  Furthermore, I've no idea
how many more find it by dlopen.

> > > > I use to do that, but someone objected...  Possibly Karl MacMillan.
> > > 
> > > Yes, but I think I disagreed then too.
> > 
> > So, who's right?
> 
> Karl isn't a maintainer of the SELinux kernel code.  And I had thought
> that even he had reconsidered this idea after further discussion.

Not that I know of.

> > > It doesn't fit with how other users of security_kernel_act_as() will
> > > likely want to work (they will want to just set the context to a
> > > specified value, whether one obtained from the client or from some local
> > > source), nor with how type transitions normally work (exec, with the
> > > program type as the second type field).  I think it will just cause
> > > confusion and subtle breakage.
> > 
> > It's causing me lots of confusion as it is.  I have been / am being told by
> > different people to do different things just in dealing with SELinux, and
> > various people are raising extra requirements or restrictions beyond that.
> > There doesn't seem to be a consensus.
> > 
> > It sounds like the best option is just to have the kernel nick the userspace
> > daemon's security context and use that as is, and junk all the restrictions on
> > what the daemon can do so that the kernel isn't too restricted.
> 
> Well, you could do that, if that meets your needs, but it doesn't sound
> very optimal either.

True.  I'd rather have separate security for each.

> Why are you opposed to having userspace determine the context and write it
> to a cachefiles interface,

Because, from what I gather, that means my userspace program needs to do
something different, depending on the security model that's currently in force
on a system.  Furthermore, I would have to have separate code, as far as I
know, for each security model as there's no commonality in userspace.

I can't just link against libselinux.  It might not be there.  I'm not going
to tie my program to SELinux either.

Furthermore, I worked out "the right way to do this" with some apparent
SELinux person, and you seemed reasonably accepting of it.  Now I have to go
and redo all the work, having had to redo the security stuff a couple of times
already because someone objected.  *That* is the main obstacle to getting my
code accepted at the moment, I think.

How about I just stick the context in /etc/cachefilesd.conf as a textual
configuration item and have the daemon pass that as a string to the cachefiles
kernel module, which can then ask LSM if it's valid to set this context as an
override, given the daemon's own security context?  That seems entirely
reasonable to me.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 20:42         ` David Howells
@ 2007-12-11 21:18           ` Casey Schaufler
  2007-12-11 21:34           ` Stephen Smalley
                             ` (3 subsequent siblings)
  4 siblings, 0 replies; 126+ messages in thread
From: Casey Schaufler @ 2007-12-11 21:18 UTC (permalink / raw)
  To: David Howells, Stephen Smalley
  Cc: dhowells, Karl MacMillan, viro, hch, Trond.Myklebust, casey,
	linux-kernel, selinux, linux-security-module


--- David Howells <dhowells@redhat.com> wrote:

...
> 
> How about I just stick the context in /etc/cachefilesd.conf as a textual
> configuration item and have the daemon pass that as a string to the
> cachefiles
> kernel module, which can then ask LSM if it's valid to set this context as an
> override, given the daemon's own security context?  That seems entirely
> reasonable to me.

Works for Smack. I can't say definitively, but I think it will
work for SELinux. Beyond that and we're into the fuzzy bit of the
LSM.


Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 20:42         ` David Howells
  2007-12-11 21:18           ` Casey Schaufler
@ 2007-12-11 21:34           ` Stephen Smalley
  2007-12-19  3:28             ` Crispin Cowan
  2007-12-11 22:43           ` David Howells
                             ` (2 subsequent siblings)
  4 siblings, 1 reply; 126+ messages in thread
From: Stephen Smalley @ 2007-12-11 21:34 UTC (permalink / raw)
  To: David Howells
  Cc: Karl MacMillan, viro, hch, Trond.Myklebust, casey, linux-kernel,
	selinux, linux-security-module

On Tue, 2007-12-11 at 20:42 +0000, David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > > That sounds too SELinux specific.  How do I do it so that it works for any
> > > LSM?
> > 
> > You can't.  There is no LSM for userspace; LSM specifically disavowed
> > any common userspace API, and that was one of our original
> > objections/concerns about it.
> 
> So, basically, userspace programs (outside of security tools) aren't supposed
> to need access to the security infrastructure?

I didn't say that.  I said that LSM provides no standard way for
userspace to access the security infrastructure, at least beyond the
raw /proc/self/attr and xattr APIs, and I view that as a defect in LSM.
SELinux does have a userspace API, because we think there should be a
way for userspace to do that, and it is being used by userspace programs
outside of security tools.

> > > Is linking against libselinux is a viable option if it's not available under
> > > all LSM models?  Is it available under all LSM models?  Perhaps Casey can
> > > answer this one.
> > 
> > Nope, they would all have their own libraries, if they have a library at
> > all.  But that isn't your problem
> 
> It is if I have to maintain a special pieces of code for each possible LSM.
> One piece for SELinux, one piece for AppArmour, one piece for Smack, one piece
> for Casey's security system.  That sounds like a pain.

All your code has to do is invoke a function provided by libselinux.  If
at some later time a liblsm is introduced that provides a common
front-end to a libselinux, libsmack, ..., then you can use that.  But it
doesn't exist today.   But it all just becomes a simple function call
regardless.

> > - your kernel interface should be generic, and your LSM hooks should be
> > generic, but your userspace isn't required to be.  Have a look at how many
> > programs in the distribution currently link against libselinux, whether
> > directly or by dlopen'ing it.
> 
> In /usr/bin ldd reports approximately 297 binaries link to libselinux, though
> I can't say how many of those linked against it directly rather than picking
> it up by contamination through a shared library.  Furthermore, I've no idea
> how many more find it by dlopen.

The point being that userspace is already using the SELinux API directly
- cachefilesd certainly won't be the first or most crucial application
to do so.

> > > > > I use to do that, but someone objected...  Possibly Karl MacMillan.
> > > > 
> > > > Yes, but I think I disagreed then too.
> > > 
> > > So, who's right?
> > 
> > Karl isn't a maintainer of the SELinux kernel code.  And I had thought
> > that even he had reconsidered this idea after further discussion.
> 
> Not that I know of.

Well, he can speak to that.  But regardless, he isn't the maintainer.

> > > > It doesn't fit with how other users of security_kernel_act_as() will
> > > > likely want to work (they will want to just set the context to a
> > > > specified value, whether one obtained from the client or from some local
> > > > source), nor with how type transitions normally work (exec, with the
> > > > program type as the second type field).  I think it will just cause
> > > > confusion and subtle breakage.
> > > 
> > > It's causing me lots of confusion as it is.  I have been / am being told by
> > > different people to do different things just in dealing with SELinux, and
> > > various people are raising extra requirements or restrictions beyond that.
> > > There doesn't seem to be a consensus.
> > > 
> > > It sounds like the best option is just to have the kernel nick the userspace
> > > daemon's security context and use that as is, and junk all the restrictions on
> > > what the daemon can do so that the kernel isn't too restricted.
> > 
> > Well, you could do that, if that meets your needs, but it doesn't sound
> > very optimal either.
> 
> True.  I'd rather have separate security for each.
> 
> > Why are you opposed to having userspace determine the context and write it
> > to a cachefiles interface,
> 
> Because, from what I gather, that means my userspace program needs to do
> something different, depending on the security model that's currently in force
> on a system.  Furthermore, I would have to have separate code, as far as I
> know, for each security model as there's no commonality in userspace.
>
> I can't just link against libselinux.  It might not be there.  I'm not going
> to tie my program to SELinux either.

You can certainly make the linking conditional at compile time based on
a build flag, and/or you can dlopen it at runtime and fall back
gracefully if not present.  It isn't tying your program to libselinux,
just using it if present, in this case to obtain an input to supply to
your kernel module so that the "policy" of deciding what context to use
is completely determined in userspace and not hardcoded at all into the
kernel.  If init and ls and ps can do it, cachefilesd can too.

> Furthermore, I worked out "the right way to do this" with some apparent
> SELinux person, and you seemed reasonably accepting of it.  Now I have to go
> and redo all the work, having had to redo the security stuff a couple of times
> already because someone objected.  *That* is the main obstacle to getting my
> code accepted at the moment, I think.

I did express reservations about it, and I really don't think that this
is the main obstacle - the determination of the acting task security
label and how to set it is really only a small part of your patch set,
and considerably less disruptive than the rest of it.

> How about I just stick the context in /etc/cachefilesd.conf as a textual
> configuration item and have the daemon pass that as a string to the cachefiles
> kernel module, which can then ask LSM if it's valid to set this context as an
> override, given the daemon's own security context?  That seems entirely
> reasonable to me.

That mostly works, but it means that an update to policy may require an
update to /etc/cachefilesd.conf, or that switching from one policy to
another might likewise require changing that file.  Versus using a
separate policy-provided config file for the label.

BTW, as should be obvious, some LSMs aren't label-based at all, so it
would need to be optional.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 20:42         ` David Howells
  2007-12-11 21:18           ` Casey Schaufler
  2007-12-11 21:34           ` Stephen Smalley
@ 2007-12-11 22:43           ` David Howells
  2007-12-11 23:04             ` Casey Schaufler
  2007-12-19  3:28           ` Crispin Cowan
  2007-12-19 23:38           ` David Howells
  4 siblings, 1 reply; 126+ messages in thread
From: David Howells @ 2007-12-11 22:43 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, Karl MacMillan, viro, hch, Trond.Myklebust, casey,
	linux-kernel, selinux, linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> All your code has to do is invoke a function provided by libselinux.

Calling libselinux means it's a special case for a specific LSM.

I think the best way to do this, then, has to be to dlopen the appropriate LSM
library.  That way I don't need to do any conditional compilation or linking,
but can build all the bits in to cachefilesd and have the appropriate one
selected by the /etc/cachefilesd.conf.

So, what do I invoke in libselinux, how do I configure it, and how do I
integrate the config into my RPM and install it?

And then what does it give me that I can hand to the kernel (a context string
for SELinux, I presume), how do I get the kernel to make a check on it, how do
I configure the check and how do I install that config from my RPM (I presume
I just need to modify the .fc, .if and .te files that I have already)?

> That mostly works, but it means that an update to policy may require an
> update to /etc/cachefilesd.conf, or that switching from one policy to
> another might likewise require changing that file.  Versus using a
> separate policy-provided config file for the label.

Whilst that's a fair point, if it's in a config file somewhere, then someone
may want to change it or someone may want to provide a second file for a
second cache with a different security label.

> BTW, as should be obvious, some LSMs aren't label-based at all, so it
> would need to be optional.

Aargh.  In which case it might not be possible to make the SELinux context
passing from userspace -> kernel generic for all LSMs:-(

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 22:43           ` David Howells
@ 2007-12-11 23:04             ` Casey Schaufler
  2007-12-12 15:25               ` Stephen Smalley
                                 ` (2 more replies)
  0 siblings, 3 replies; 126+ messages in thread
From: Casey Schaufler @ 2007-12-11 23:04 UTC (permalink / raw)
  To: David Howells, Stephen Smalley
  Cc: dhowells, Karl MacMillan, viro, hch, Trond.Myklebust, casey,
	linux-kernel, selinux, linux-security-module


--- David Howells <dhowells@redhat.com> wrote:

> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > All your code has to do is invoke a function provided by libselinux.
> 
> Calling libselinux means it's a special case for a specific LSM.
> 
> I think the best way to do this, then, has to be to dlopen the appropriate
> LSM
> library.  That way I don't need to do any conditional compilation or linking,
> but can build all the bits in to cachefilesd and have the appropriate one
> selected by the /etc/cachefilesd.conf.
> 
> So, what do I invoke in libselinux, how do I configure it, and how do I
> integrate the config into my RPM and install it?
> 
> And then what does it give me that I can hand to the kernel (a context string
> for SELinux, I presume), how do I get the kernel to make a check on it, how
> do
> I configure the check and how do I install that config from my RPM (I presume
> I just need to modify the .fc, .if and .te files that I have already)?

That seems like an awful lot of work. I suggest that what you
put in /etc/cachefilesd.conf is a line like:

   security_context:"<whatever>"

and have your daemon pass "<whatever>" into the kernel using
a cachefile mechanism. The kernel code can call
security_secctx_to_secid("<whatever>") to determine if it's valid.
No need to invoke LSM specific code in your daemon. You may need
to have an application, say cachefileselinuxcontext, that will
read the current policy and spit out an appropriate value of
"<whatever>", but that can be separate and LSM specific without
mucking up your basic infrastructure applications. 

> > That mostly works, but it means that an update to policy may require an
> > update to /etc/cachefilesd.conf, or that switching from one policy to
> > another might likewise require changing that file.  Versus using a
> > separate policy-provided config file for the label.
> 
> Whilst that's a fair point, if it's in a config file somewhere, then someone
> may want to change it or someone may want to provide a second file for a
> second cache with a different security label.
>
> > BTW, as should be obvious, some LSMs aren't label-based at all, so it
> > would need to be optional.
> 
> Aargh.  In which case it might not be possible to make the SELinux context
> passing from userspace -> kernel generic for all LSMs:-(

For LSM's that don't use labels what you will have to pass in
won't be a label, it will be something else. But since any LSM
that wants to do networking or audit will have to deal with
secid's and secctx's the method outlined above ought to fit the
bill.



Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 19:37         ` Stephen Smalley
@ 2007-12-12 14:41           ` Karl MacMillan
  2007-12-12 14:53           ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Karl MacMillan @ 2007-12-12 14:41 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: David Howells, viro, hch, Trond.Myklebust, casey, linux-kernel,
	selinux, linux-security-module


On Tue, 2007-12-11 at 14:37 -0500, Stephen Smalley wrote:
> On Mon, 2007-12-10 at 23:36 +0000, David Howells wrote:
> > Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > 
> > > From a config file whose pathname would be provided by libselinux (ala
> > > the way in which dbusd imports contexts), or directly as a context
> > > returned by a libselinux function.
> > 
> > That sounds too SELinux specific.  How do I do it so that it works for any
> > LSM?
> 
> You can't.  There is no LSM for userspace; LSM specifically disavowed
> any common userspace API, and that was one of our original
> objections/concerns about it.
> 
> > Is linking against libselinux is a viable option if it's not available under
> > all LSM models?  Is it available under all LSM models?  Perhaps Casey can
> > answer this one.
> 
> Nope, they would all have their own libraries, if they have a library at
> all.  But that isn't your problem - your kernel interface should be
> generic, and your LSM hooks should be generic, but your userspace isn't
> required to be.  Have a look at how many programs in the distribution
> currently link against libselinux, whether directly or by dlopen'ing it.
> 
> > > > I use to do that, but someone objected...  Possibly Karl MacMillan.
> > > 
> > > Yes, but I think I disagreed then too.
> > 
> > So, who's right?
> 
> Karl isn't a maintainer of the SELinux kernel code.  And I had thought
> that even he had reconsidered this idea after further discussion.
> 

That's what I remember as well - I suggested the transition idea and
then, after discussion, agreed that it wasn't the best approach.

And, as Steve points out, it's not my call to make.

Karl


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 19:37         ` Stephen Smalley
  2007-12-12 14:41           ` Karl MacMillan
@ 2007-12-12 14:53           ` David Howells
  2007-12-12 14:59             ` Karl MacMillan
  1 sibling, 1 reply; 126+ messages in thread
From: David Howells @ 2007-12-12 14:53 UTC (permalink / raw)
  To: Karl MacMillan
  Cc: dhowells, Stephen Smalley, viro, hch, Trond.Myklebust, casey,
	linux-kernel, selinux, linux-security-module

Karl MacMillan <kmacmill@redhat.com> wrote:

> That's what I remember as well - I suggested the transition idea and
> then, after discussion, agreed that it wasn't the best approach.

Sigh.

Can you tell me then how to do it now?  I don't know very much about using
SELinux userspace stuff libraries or configuring them.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 14:53           ` David Howells
@ 2007-12-12 14:59             ` Karl MacMillan
  0 siblings, 0 replies; 126+ messages in thread
From: Karl MacMillan @ 2007-12-12 14:59 UTC (permalink / raw)
  To: David Howells
  Cc: Stephen Smalley, viro, hch, Trond.Myklebust, casey, linux-kernel,
	selinux, linux-security-module


On Wed, 2007-12-12 at 14:53 +0000, David Howells wrote:
> Karl MacMillan <kmacmill@redhat.com> wrote:
> 
> > That's what I remember as well - I suggested the transition idea and
> > then, after discussion, agreed that it wasn't the best approach.
> 
> Sigh.
> 
> Can you tell me then how to do it now?  I don't know very much about using
> SELinux userspace stuff libraries or configuring them.
> 

I don't have enough context anymore to help with design, but I can help
with the actual usage of the libraries if you have specific questions.

Karl


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 23:04             ` Casey Schaufler
@ 2007-12-12 15:25               ` Stephen Smalley
  2007-12-12 16:51                 ` Casey Schaufler
  2007-12-12 18:25               ` David Howells
  2007-12-12 18:29               ` David Howells
  2 siblings, 1 reply; 126+ messages in thread
From: Stephen Smalley @ 2007-12-12 15:25 UTC (permalink / raw)
  To: casey
  Cc: David Howells, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module

On Tue, 2007-12-11 at 15:04 -0800, Casey Schaufler wrote:
> --- David Howells <dhowells@redhat.com> wrote:
> 
> > Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > 
> > > All your code has to do is invoke a function provided by libselinux.
> > 
> > Calling libselinux means it's a special case for a specific LSM.
> > 
> > I think the best way to do this, then, has to be to dlopen the appropriate
> > LSM
> > library.  That way I don't need to do any conditional compilation or linking,
> > but can build all the bits in to cachefilesd and have the appropriate one
> > selected by the /etc/cachefilesd.conf.
> > 
> > So, what do I invoke in libselinux, how do I configure it, and how do I
> > integrate the config into my RPM and install it?
> > 
> > And then what does it give me that I can hand to the kernel (a context string
> > for SELinux, I presume), how do I get the kernel to make a check on it, how
> > do
> > I configure the check and how do I install that config from my RPM (I presume
> > I just need to modify the .fc, .if and .te files that I have already)?
> 
> That seems like an awful lot of work. I suggest that what you
> put in /etc/cachefilesd.conf is a line like:
> 
>    security_context:"<whatever>"
> 
> and have your daemon pass "<whatever>" into the kernel using
> a cachefile mechanism. The kernel code can call
> security_secctx_to_secid("<whatever>") to determine if it's valid.
> No need to invoke LSM specific code in your daemon. You may need
> to have an application, say cachefileselinuxcontext, that will
> read the current policy and spit out an appropriate value of
> "<whatever>", but that can be separate and LSM specific without
> mucking up your basic infrastructure applications. 

That sounds workable, although I think he will want a more specific hook
than security_secctx_to_secid(), or possibly a second hook call, that
would not only validate the context but authorize the use of it by the
cachefilesd process.  And then the security_task_kernel_act_as() hook
just takes the secid as input rather than the task struct of the daemon,
and applies it.  At that point, nfsd can use the same mechanism for
setting the acting SID based on the client process after doing its own
authorization.

> > > That mostly works, but it means that an update to policy may require an
> > > update to /etc/cachefilesd.conf, or that switching from one policy to
> > > another might likewise require changing that file.  Versus using a
> > > separate policy-provided config file for the label.
> > 
> > Whilst that's a fair point, if it's in a config file somewhere, then someone
> > may want to change it or someone may want to provide a second file for a
> > second cache with a different security label.
> >
> > > BTW, as should be obvious, some LSMs aren't label-based at all, so it
> > > would need to be optional.
> > 
> > Aargh.  In which case it might not be possible to make the SELinux context
> > passing from userspace -> kernel generic for all LSMs:-(
> 
> For LSM's that don't use labels what you will have to pass in
> won't be a label, it will be something else. But since any LSM
> that wants to do networking or audit will have to deal with
> secid's and secctx's the method outlined above ought to fit the
> bill.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 15:25               ` Stephen Smalley
@ 2007-12-12 16:51                 ` Casey Schaufler
  2007-12-12 18:12                   ` Stephen Smalley
  2007-12-12 18:34                   ` David Howells
  0 siblings, 2 replies; 126+ messages in thread
From: Casey Schaufler @ 2007-12-12 16:51 UTC (permalink / raw)
  To: Stephen Smalley, casey
  Cc: David Howells, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module


--- Stephen Smalley <sds@tycho.nsa.gov> wrote:

> On Tue, 2007-12-11 at 15:04 -0800, Casey Schaufler wrote:
> > --- David Howells <dhowells@redhat.com> wrote:
> > 
> > > Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > > 
> > > > All your code has to do is invoke a function provided by libselinux.
> > > 
> > > Calling libselinux means it's a special case for a specific LSM.
> > > 
> > > I think the best way to do this, then, has to be to dlopen the
> appropriate
> > > LSM
> > > library.  That way I don't need to do any conditional compilation or
> linking,
> > > but can build all the bits in to cachefilesd and have the appropriate one
> > > selected by the /etc/cachefilesd.conf.
> > > 
> > > So, what do I invoke in libselinux, how do I configure it, and how do I
> > > integrate the config into my RPM and install it?
> > > 
> > > And then what does it give me that I can hand to the kernel (a context
> string
> > > for SELinux, I presume), how do I get the kernel to make a check on it,
> how
> > > do
> > > I configure the check and how do I install that config from my RPM (I
> presume
> > > I just need to modify the .fc, .if and .te files that I have already)?
> > 
> > That seems like an awful lot of work. I suggest that what you
> > put in /etc/cachefilesd.conf is a line like:
> > 
> >    security_context:"<whatever>"
> > 
> > and have your daemon pass "<whatever>" into the kernel using
> > a cachefile mechanism. The kernel code can call
> > security_secctx_to_secid("<whatever>") to determine if it's valid.
> > No need to invoke LSM specific code in your daemon. You may need
> > to have an application, say cachefileselinuxcontext, that will
> > read the current policy and spit out an appropriate value of
> > "<whatever>", but that can be separate and LSM specific without
> > mucking up your basic infrastructure applications. 
> 
> That sounds workable, although I think he will want a more specific hook
> than security_secctx_to_secid(), or possibly a second hook call, that
> would not only validate the context but authorize the use of it by the
> cachefilesd process.

What sort of authorization are you thinking of? I would expect
that to have been done by cachefileselinuxcontext (or
cachefilesspiffylsmcontext) up in userspace. If you're going to
rely on userspace applications for policy enforcement they need
to be good enough to count on after all.

> And then the security_task_kernel_act_as() hook
> just takes the secid as input rather than the task struct of the daemon,
> and applies it.  At that point, nfsd can use the same mechanism for
> setting the acting SID based on the client process after doing its own
> authorization.

good points all, in spite of my personal distaste for secids.


Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 16:51                 ` Casey Schaufler
@ 2007-12-12 18:12                   ` Stephen Smalley
  2007-12-12 18:34                   ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2007-12-12 18:12 UTC (permalink / raw)
  To: casey
  Cc: David Howells, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module

On Wed, 2007-12-12 at 08:51 -0800, Casey Schaufler wrote:
> --- Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > On Tue, 2007-12-11 at 15:04 -0800, Casey Schaufler wrote:
> > > --- David Howells <dhowells@redhat.com> wrote:
> > > 
> > > > Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > > > 
> > > > > All your code has to do is invoke a function provided by libselinux.
> > > > 
> > > > Calling libselinux means it's a special case for a specific LSM.
> > > > 
> > > > I think the best way to do this, then, has to be to dlopen the
> > appropriate
> > > > LSM
> > > > library.  That way I don't need to do any conditional compilation or
> > linking,
> > > > but can build all the bits in to cachefilesd and have the appropriate one
> > > > selected by the /etc/cachefilesd.conf.
> > > > 
> > > > So, what do I invoke in libselinux, how do I configure it, and how do I
> > > > integrate the config into my RPM and install it?
> > > > 
> > > > And then what does it give me that I can hand to the kernel (a context
> > string
> > > > for SELinux, I presume), how do I get the kernel to make a check on it,
> > how
> > > > do
> > > > I configure the check and how do I install that config from my RPM (I
> > presume
> > > > I just need to modify the .fc, .if and .te files that I have already)?
> > > 
> > > That seems like an awful lot of work. I suggest that what you
> > > put in /etc/cachefilesd.conf is a line like:
> > > 
> > >    security_context:"<whatever>"
> > > 
> > > and have your daemon pass "<whatever>" into the kernel using
> > > a cachefile mechanism. The kernel code can call
> > > security_secctx_to_secid("<whatever>") to determine if it's valid.
> > > No need to invoke LSM specific code in your daemon. You may need
> > > to have an application, say cachefileselinuxcontext, that will
> > > read the current policy and spit out an appropriate value of
> > > "<whatever>", but that can be separate and LSM specific without
> > > mucking up your basic infrastructure applications. 
> > 
> > That sounds workable, although I think he will want a more specific hook
> > than security_secctx_to_secid(), or possibly a second hook call, that
> > would not only validate the context but authorize the use of it by the
> > cachefilesd process.
> 
> What sort of authorization are you thinking of? I would expect
> that to have been done by cachefileselinuxcontext (or
> cachefilesspiffylsmcontext) up in userspace. If you're going to
> rely on userspace applications for policy enforcement they need
> to be good enough to count on after all.

In Smack, I'd expect that you'd want to apply a CAP_MAC_OVERRIDE check.
In SELinux, we'd apply a permission check between the task's security
context and the specified security context so that we can control the
pairwise relationship between them via allow rules and constraints.

The kernel has no way of knowing whether the context was determined by
cachefileselinuxcontext or not; it only knows that some task is trying
to write some value to /cachefiles/context or whatever the kernel
interface is, and it needs to apply some authorization check there,
where that check is security-module-specific.

> > And then the security_task_kernel_act_as() hook
> > just takes the secid as input rather than the task struct of the daemon,
> > and applies it.  At that point, nfsd can use the same mechanism for
> > setting the acting SID based on the client process after doing its own
> > authorization.
> 
> good points all, in spite of my personal distaste for secids.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 23:04             ` Casey Schaufler
  2007-12-12 15:25               ` Stephen Smalley
@ 2007-12-12 18:25               ` David Howells
  2007-12-12 19:20                 ` Casey Schaufler
  2007-12-12 18:29               ` David Howells
  2 siblings, 1 reply; 126+ messages in thread
From: David Howells @ 2007-12-12 18:25 UTC (permalink / raw)
  To: casey
  Cc: dhowells, Stephen Smalley, Karl MacMillan, viro, hch,
	Trond.Myklebust, linux-kernel, selinux, linux-security-module

Casey Schaufler <casey@schaufler-ca.com> wrote:

> You may need to have an application, say cachefileselinuxcontext, that will
> read the current policy and spit out an appropriate value of "<whatever>",
> but that can be separate and LSM specific without mucking up your basic
> infrastructure applications.

What would I do with such a thing?  How would it get run?  Spat out to where?

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 23:04             ` Casey Schaufler
  2007-12-12 15:25               ` Stephen Smalley
  2007-12-12 18:25               ` David Howells
@ 2007-12-12 18:29               ` David Howells
  2007-12-12 19:33                 ` Stephen Smalley
                                   ` (2 more replies)
  2 siblings, 3 replies; 126+ messages in thread
From: David Howells @ 2007-12-12 18:29 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, casey, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> That sounds workable, although I think he will want a more specific hook
> than security_secctx_to_secid(), or possibly a second hook call, that
> would not only validate the context but authorize the use of it by the
> cachefilesd process.  And then the security_task_kernel_act_as() hook
> just takes the secid as input rather than the task struct of the daemon,
> and applies it.  At that point, nfsd can use the same mechanism for
> setting the acting SID based on the client process after doing its own
> authorization.

I thought using secids was verboten as it made things too specific.

Have you example code for the security hook you mention?  I'm not sure I
understand why security_secctx_to_secid() is not sufficient.

Or is it that I need something that takes a secctx, converts it to a secid and
authorises its use all in one go?  If it's this, why can't that be rolles into
security_task_kernel_act_as()?  That sets up a task_security struct which is
then switched in and out without consultation of the LSM.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 16:51                 ` Casey Schaufler
  2007-12-12 18:12                   ` Stephen Smalley
@ 2007-12-12 18:34                   ` David Howells
  2007-12-12 19:44                     ` Casey Schaufler
  1 sibling, 1 reply; 126+ messages in thread
From: David Howells @ 2007-12-12 18:34 UTC (permalink / raw)
  To: casey
  Cc: dhowells, Stephen Smalley, Karl MacMillan, viro, hch,
	Trond.Myklebust, linux-kernel, selinux, linux-security-module

Casey Schaufler <casey@schaufler-ca.com> wrote:

> What sort of authorization are you thinking of? I would expect
> that to have been done by cachefileselinuxcontext (or
> cachefilesspiffylsmcontext) up in userspace. If you're going to
> rely on userspace applications for policy enforcement they need
> to be good enough to count on after all.

It can't be done in userspace, otherwise someone using the cachefilesd
interface can pass an arbitrary context up.  The security context has to be
passed across the file descriptor attached to /dev/cachefiles along with the
other configuration parameters as a text string.  This fd selects the
particular cache context that a particular instance of a running daemon is
using.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 18:25               ` David Howells
@ 2007-12-12 19:20                 ` Casey Schaufler
  2007-12-12 19:29                   ` David Howells
                                     ` (2 more replies)
  0 siblings, 3 replies; 126+ messages in thread
From: Casey Schaufler @ 2007-12-12 19:20 UTC (permalink / raw)
  To: David Howells, casey
  Cc: dhowells, Stephen Smalley, Karl MacMillan, viro, hch,
	Trond.Myklebust, linux-kernel, selinux, linux-security-module


--- David Howells <dhowells@redhat.com> wrote:

> Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
> > You may need to have an application, say cachefileselinuxcontext, that will
> > read the current policy and spit out an appropriate value of "<whatever>",
> > but that can be separate and LSM specific without mucking up your basic
> > infrastructure applications.
> 
> What would I do with such a thing?  How would it get run?  Spat out to where?

Put it in /etc/init.d/cachefiles and run it at boot time. Put the
result into /etc/cachefiles.conf. Have cachefilesd read it and pass
it downward.



Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 19:20                 ` Casey Schaufler
@ 2007-12-12 19:29                   ` David Howells
  2007-12-12 19:35                   ` Stephen Smalley
  2007-12-12 22:55                   ` David Howells
  2 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-12 19:29 UTC (permalink / raw)
  To: casey
  Cc: dhowells, Stephen Smalley, Karl MacMillan, viro, hch,
	Trond.Myklebust, linux-kernel, selinux, linux-security-module

Casey Schaufler <casey@schaufler-ca.com> wrote:

> Put the result into /etc/cachefiles.conf.

Ewww.  Runtime mangling of the configuration.  I suppose it doesn't have to be
in that file with the rest of the config.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 18:29               ` David Howells
@ 2007-12-12 19:33                 ` Stephen Smalley
  2007-12-12 19:37                 ` Casey Schaufler
  2007-12-12 22:49                 ` David Howells
  2 siblings, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2007-12-12 19:33 UTC (permalink / raw)
  To: David Howells
  Cc: casey, Karl MacMillan, viro, hch, Trond.Myklebust, linux-kernel,
	selinux, linux-security-module

On Wed, 2007-12-12 at 18:29 +0000, David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > That sounds workable, although I think he will want a more specific hook
> > than security_secctx_to_secid(), or possibly a second hook call, that
> > would not only validate the context but authorize the use of it by the
> > cachefilesd process.  And then the security_task_kernel_act_as() hook
> > just takes the secid as input rather than the task struct of the daemon,
> > and applies it.  At that point, nfsd can use the same mechanism for
> > setting the acting SID based on the client process after doing its own
> > authorization.
> 
> I thought using secids was verboten as it made things too specific.

Well, that has been Casey's objection in the past to it, but he seems to
have accepted their use now for certain purposes, and they are already
entrenched in the audit and labeled networking interfaces.

> Have you example code for the security hook you mention?  I'm not sure I
> understand why security_secctx_to_secid() is not sufficient.

security_secctx_to_secid() would just validate and map a context string
to a secid.  It wouldn't perform any permission check, as the caller
might a kernel-internal user that is just mapping back and forth like
current users of security_secid_to_secctx, or it might be something that
ultimately originated from userspace but the hook has no way of knowing
why or what set of checks would be appropriate.  You'd need a more
specific hook for the authorization, one that would perform a permission
check, e.g. an avc_has_perm() call.  Which likely requires defining a
new class and permissions for your cachefiles kernel interface.

> Or is it that I need something that takes a secctx, converts it to a secid and
> authorises its use all in one go?  If it's this, why can't that be rolles into
> security_task_kernel_act_as()?  That sets up a task_security struct which is
> then switched in and out without consultation of the LSM.

I was under the impression that security_task_kernel_act_as() was being
used to switch the current task to an acting context, not to initially
set up a struct for later use.  If you go with the latter approach, then
what is the lifecycle on that struct?

BTW, it gets a little confusing with your use of task_security for the
full task security state vs our existing use of task_security_struct
within SELinux for the task's LSM security blob.  I suppose ours could
be renamed to task_selinux.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 19:20                 ` Casey Schaufler
  2007-12-12 19:29                   ` David Howells
@ 2007-12-12 19:35                   ` Stephen Smalley
  2007-12-12 22:55                   ` David Howells
  2 siblings, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2007-12-12 19:35 UTC (permalink / raw)
  To: casey
  Cc: David Howells, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module

On Wed, 2007-12-12 at 11:20 -0800, Casey Schaufler wrote:
> --- David Howells <dhowells@redhat.com> wrote:
> 
> > Casey Schaufler <casey@schaufler-ca.com> wrote:
> > 
> > > You may need to have an application, say cachefileselinuxcontext, that will
> > > read the current policy and spit out an appropriate value of "<whatever>",
> > > but that can be separate and LSM specific without mucking up your basic
> > > infrastructure applications.
> > 
> > What would I do with such a thing?  How would it get run?  Spat out to where?
> 
> Put it in /etc/init.d/cachefiles and run it at boot time. Put the
> result into /etc/cachefiles.conf. Have cachefilesd read it and pass
> it downward.

More likely, run it at build time in your .spec file to generate
cachefiles.conf, then run it again maybe upon a policy update or if the
user selects a different policy.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 18:29               ` David Howells
  2007-12-12 19:33                 ` Stephen Smalley
@ 2007-12-12 19:37                 ` Casey Schaufler
  2007-12-12 22:52                   ` David Howells
  2007-12-12 22:49                 ` David Howells
  2 siblings, 1 reply; 126+ messages in thread
From: Casey Schaufler @ 2007-12-12 19:37 UTC (permalink / raw)
  To: David Howells, Stephen Smalley
  Cc: dhowells, casey, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module


--- David Howells <dhowells@redhat.com> wrote:

> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > That sounds workable, although I think he will want a more specific hook
> > than security_secctx_to_secid(), or possibly a second hook call, that
> > would not only validate the context but authorize the use of it by the
> > cachefilesd process.  And then the security_task_kernel_act_as() hook
> > just takes the secid as input rather than the task struct of the daemon,
> > and applies it.  At that point, nfsd can use the same mechanism for
> > setting the acting SID based on the client process after doing its own
> > authorization.
> 
> I thought using secids was verboten as it made things too specific.

I have argued that in the past. I'm reasonably convinced that I have
lost that argument at least for the immediate future as audit, usb,
and networking are all dependent on them. I can't image an LSM that
manages to avoid them, at least for the short term. If secid's are
ever expundged from the kernel cachefiles will require reeducation,
but that will be a minor effort.

> Have you example code for the security hook you mention?  I'm not sure I
> understand why security_secctx_to_secid() is not sufficient.
> 
> Or is it that I need something that takes a secctx, converts it to a secid
> and
> authorises its use all in one go?
> If it's this, why can't that be rolles
> into
> security_task_kernel_act_as()?  That sets up a task_security struct which is
> then switched in and out without consultation of the LSM.

It would seem to me that security_secctx_to_secid() ought to suffice
if the application code was written correctly. Be aware that factors
outside the LSM may be important, too. As Stephen points out elsewhere,
Smack will require you have particular capabilities (CAP_MAC_OVERRIDE,
CAP_MAC_ADMIN) while a DAC LSM may require CAP_DAC_OVERRIDE. SELinux
is likely to be the odd duck in this pond in that it does not use the
capability mechanism in the way Nature intends it to be, opting to
treat "privilege" with a completely different model.


Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 18:34                   ` David Howells
@ 2007-12-12 19:44                     ` Casey Schaufler
  2007-12-12 19:49                       ` Stephen Smalley
  2007-12-12 22:32                       ` David Howells
  0 siblings, 2 replies; 126+ messages in thread
From: Casey Schaufler @ 2007-12-12 19:44 UTC (permalink / raw)
  To: David Howells, casey
  Cc: dhowells, Stephen Smalley, Karl MacMillan, viro, hch,
	Trond.Myklebust, linux-kernel, selinux, linux-security-module


--- David Howells <dhowells@redhat.com> wrote:

> Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
> > What sort of authorization are you thinking of? I would expect
> > that to have been done by cachefileselinuxcontext (or
> > cachefilesspiffylsmcontext) up in userspace. If you're going to
> > rely on userspace applications for policy enforcement they need
> > to be good enough to count on after all.
> 
> It can't be done in userspace, otherwise someone using the cachefilesd
> interface can pass an arbitrary context up.

Yes, but I would expect that interface to be protected (owned by root,
mode 0400). If /dev/cachefiles has to be publicly accessable make it
a privileged ioctl.

> The security context has to be
> passed across the file descriptor attached to /dev/cachefiles along with the
> other configuration parameters as a text string.

I got that.

> This fd selects the
> particular cache context that a particular instance of a running daemon is
> using.

Yes, but forgive me being slow, I don't see the problem.


Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 19:44                     ` Casey Schaufler
@ 2007-12-12 19:49                       ` Stephen Smalley
  2007-12-12 20:09                         ` Casey Schaufler
  2007-12-12 22:32                       ` David Howells
  1 sibling, 1 reply; 126+ messages in thread
From: Stephen Smalley @ 2007-12-12 19:49 UTC (permalink / raw)
  To: casey
  Cc: David Howells, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module

On Wed, 2007-12-12 at 11:44 -0800, Casey Schaufler wrote:
> --- David Howells <dhowells@redhat.com> wrote:
> 
> > Casey Schaufler <casey@schaufler-ca.com> wrote:
> > 
> > > What sort of authorization are you thinking of? I would expect
> > > that to have been done by cachefileselinuxcontext (or
> > > cachefilesspiffylsmcontext) up in userspace. If you're going to
> > > rely on userspace applications for policy enforcement they need
> > > to be good enough to count on after all.
> > 
> > It can't be done in userspace, otherwise someone using the cachefilesd
> > interface can pass an arbitrary context up.
> 
> Yes, but I would expect that interface to be protected (owned by root,
> mode 0400). If /dev/cachefiles has to be publicly accessable make it
> a privileged ioctl.

Uid 0 != CAP_MAC_OVERRIDE if you configure file caps and such.

> > The security context has to be
> > passed across the file descriptor attached to /dev/cachefiles along with the
> > other configuration parameters as a text string.
> 
> I got that.
> 
> > This fd selects the
> > particular cache context that a particular instance of a running daemon is
> > using.
> 
> Yes, but forgive me being slow, I don't see the problem.
> 
> 
> Casey Schaufler
> casey@schaufler-ca.com
-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 19:49                       ` Stephen Smalley
@ 2007-12-12 20:09                         ` Casey Schaufler
  2007-12-12 22:29                           ` David Howells
  0 siblings, 1 reply; 126+ messages in thread
From: Casey Schaufler @ 2007-12-12 20:09 UTC (permalink / raw)
  To: Stephen Smalley, casey
  Cc: David Howells, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module


--- Stephen Smalley <sds@tycho.nsa.gov> wrote:

> On Wed, 2007-12-12 at 11:44 -0800, Casey Schaufler wrote:
> > --- David Howells <dhowells@redhat.com> wrote:
> > 
> > > Casey Schaufler <casey@schaufler-ca.com> wrote:
> > > 
> > > > What sort of authorization are you thinking of? I would expect
> > > > that to have been done by cachefileselinuxcontext (or
> > > > cachefilesspiffylsmcontext) up in userspace. If you're going to
> > > > rely on userspace applications for policy enforcement they need
> > > > to be good enough to count on after all.
> > > 
> > > It can't be done in userspace, otherwise someone using the cachefilesd
> > > interface can pass an arbitrary context up.
> > 
> > Yes, but I would expect that interface to be protected (owned by root,
> > mode 0400). If /dev/cachefiles has to be publicly accessable make it
> > a privileged ioctl.
> 
> Uid 0 != CAP_MAC_OVERRIDE if you configure file caps and such.

Yes, but we're talking about writing the configuration information
to the kernel, not actually making any access checks with it. I
think. What I think we're talking about (and please correct me David
if I've stepped into the wrong theatre) is getting the magic
secctx that cachefiles will use instead of the secctx that the task
would have otherwise. I don't think we're talking about recomputing
it on every access, I think David is looking for the blunderbuss
secctx that he can use any time he needs one.

And it would be CAP_MAC_ADMIN, since you're changing the MAC
characteristics of the system, not doing an access check.

Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 20:09                         ` Casey Schaufler
@ 2007-12-12 22:29                           ` David Howells
  0 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-12 22:29 UTC (permalink / raw)
  To: casey
  Cc: dhowells, Stephen Smalley, Karl MacMillan, viro, hch,
	Trond.Myklebust, linux-kernel, selinux, linux-security-module

Casey Schaufler <casey@schaufler-ca.com> wrote:

> Yes, but we're talking about writing the configuration information
> to the kernel, not actually making any access checks with it. I
> think. What I think we're talking about (and please correct me David
> if I've stepped into the wrong theatre) is getting the magic
> secctx that cachefiles will use instead of the secctx that the task
> would have otherwise. I don't think we're talking about recomputing
> it on every access, I think David is looking for the blunderbuss
> secctx that he can use any time he needs one.

Indeed.

The way I do it is:

 (1) The daemon opens /dev/cachefiles to being an instance of a cache.

 (2) The daemon negotiates a security context for the module to use.

 (3) The security context is place in a task_security structure.

 (4) This task_security struct is attached temporarily to task->act_as each
     time any task attempts to access the cache through the module.

 (5) The task_security struct is discarded when the file descriptor that was
     created in (1) is closed and the cache is withdrawn at the same time.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 19:44                     ` Casey Schaufler
  2007-12-12 19:49                       ` Stephen Smalley
@ 2007-12-12 22:32                       ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-12 22:32 UTC (permalink / raw)
  To: casey
  Cc: dhowells, Stephen Smalley, Karl MacMillan, viro, hch,
	Trond.Myklebust, linux-kernel, selinux, linux-security-module

Casey Schaufler <casey@schaufler-ca.com> wrote:

> > This fd selects the
> > particular cache context that a particular instance of a running daemon is
> > using.
> 
> Yes, but forgive me being slow, I don't see the problem.

I mean that it's not particularly sensible to have an auxiliary interface (say
a separate /sys/cachefiles file) for setting the cache context as there may be
several caches instantiated simply by opening /dev/cachefiles several times
and retaining all the file descriptors involved.

I meant this as information only.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 18:29               ` David Howells
  2007-12-12 19:33                 ` Stephen Smalley
  2007-12-12 19:37                 ` Casey Schaufler
@ 2007-12-12 22:49                 ` David Howells
  2007-12-13 14:49                   ` Stephen Smalley
  2007-12-13 15:36                   ` David Howells
  2 siblings, 2 replies; 126+ messages in thread
From: David Howells @ 2007-12-12 22:49 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, casey, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> > Have you example code for the security hook you mention?  I'm not sure I
> > understand why security_secctx_to_secid() is not sufficient.
> 
> security_secctx_to_secid() would just validate and map a context string to a
> secid.

Validate as in check it's a valid string?  Okay, I see that.

> It wouldn't perform any permission check, as the caller might a
> kernel-internal user that is just mapping back and forth like current users
> of security_secid_to_secctx, or it might be something that ultimately
> originated from userspace but the hook has no way of knowing why or what set
> of checks would be appropriate.  You'd need a more specific hook for the
> authorization, one that would perform a permission check, e.g. an
> avc_has_perm() call.  Which likely requires defining a new class and
> permissions for your cachefiles kernel interface.

Hmmm...  This is sounding very not-simple.  I don't know how to do this.  I
can probably guess the kernel side by looking at how SECCLASS_KEY is done, but
it sounds like it involves changes to the userspace policy processing tools
too.

It also sounds a bit like overkill, but if it's the right way then I guess it
has to be done.

What does the security class represent in this case?  And can it be generalised
to apply to non-cachefiles stuff too?

> > Or is it that I need something that takes a secctx, converts it to a secid
> > and authorises its use all in one go?  If it's this, why can't that be
> > rolles into security_task_kernel_act_as()?  That sets up a task_security
> > struct which is then switched in and out without consultation of the LSM.
> 
> I was under the impression that security_task_kernel_act_as() was being used
> to switch the current task to an acting context, not to initially set up a
> struct for later use.

Definitely the latter.  I guess I wasn't very clear in the patch description.
It also sounds like I need to adjust the naming of certain functions.

> If you go with the latter approach, then what is the lifecycle on that
> struct?

 (1) You create a new task_security struct

 (2) You fill in the fsuid, fsgid, etc.

 (3) You request that the LSM security pointer in it be set to point to the
     context you want (at the moment this is done by attempting a transition
     from the daemon's context):

	ret = security_transition_sid(dtsec->sid, SECINITSID_KERNEL,
				      SECCLASS_PROCESS, &ksid);

     and, in the current code, it returns an error if you're not allowed to do
     that.  But instead you'd ask it to set a specific context, and it'd set
     that if you're permitted to do that, and give you an error if you're not.

 (4) You then use the task_security at will to override the task->act_as
     pointer in whatever task(s) you're operating on behalf of at the moment.

 (5) When you cease operating on the behalf of a task, you revert its act_as
     pointer and drop a reference to your task_security struct.

 (6) When the last ref to the task_security struct goes away, the LSM data
     attached to it is released, as are its groups and keyrings.

> BTW, it gets a little confusing with your use of task_security for the full
> task security state vs our existing use of task_security_struct within
> SELinux for the task's LSM security blob.

I know.  I thought about it quite a bit, but the problem is there's so much
overloading of various words (eg: security, context), and I wanted to avoid
struct cred for various other reasons.

> I suppose ours could be renamed to task_selinux.

Better still, perhaps, would be to prefix things with selinux_ to make it
namespace clean.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 19:37                 ` Casey Schaufler
@ 2007-12-12 22:52                   ` David Howells
  0 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-12 22:52 UTC (permalink / raw)
  To: casey
  Cc: dhowells, Stephen Smalley, Karl MacMillan, viro, hch,
	Trond.Myklebust, linux-kernel, selinux, linux-security-module

Casey Schaufler <casey@schaufler-ca.com> wrote:

> It would seem to me that security_secctx_to_secid() ought to suffice if the
> application code was written correctly.

That's not quite sufficient as there still needs to be a verification step to
make sure the caller is allowed to do this.

> Be aware that factors outside the LSM may be important, too. As Stephen
> points out elsewhere, Smack will require you have particular capabilities
> (CAP_MAC_OVERRIDE, CAP_MAC_ADMIN) while a DAC LSM may require
> CAP_DAC_OVERRIDE.

For what?

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 19:20                 ` Casey Schaufler
  2007-12-12 19:29                   ` David Howells
  2007-12-12 19:35                   ` Stephen Smalley
@ 2007-12-12 22:55                   ` David Howells
  2007-12-13 14:51                     ` Stephen Smalley
  2007-12-13 16:03                     ` David Howells
  2 siblings, 2 replies; 126+ messages in thread
From: David Howells @ 2007-12-12 22:55 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, casey, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> More likely, run it at build time in your .spec file to generate
> cachefiles.conf,

I don't think sticking it in cachefiles.conf is a good idea necessarily.
That has to be an administrator modifiable file.  Is there a program I could
make cachefiles run directly and capture the output of that could give me the
info I want?

> then run it again maybe upon a policy update or if the user selects a
> different policy.

How do I do that?

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 22:49                 ` David Howells
@ 2007-12-13 14:49                   ` Stephen Smalley
  2007-12-13 15:36                   ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2007-12-13 14:49 UTC (permalink / raw)
  To: David Howells
  Cc: casey, Karl MacMillan, viro, hch, Trond.Myklebust, linux-kernel,
	selinux, linux-security-module

On Wed, 2007-12-12 at 22:49 +0000, David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > > Have you example code for the security hook you mention?  I'm not sure I
> > > understand why security_secctx_to_secid() is not sufficient.
> > 
> > security_secctx_to_secid() would just validate and map a context string to a
> > secid.
> 
> Validate as in check it's a valid string?  Okay, I see that.

Yes.

> > It wouldn't perform any permission check, as the caller might a
> > kernel-internal user that is just mapping back and forth like current users
> > of security_secid_to_secctx, or it might be something that ultimately
> > originated from userspace but the hook has no way of knowing why or what set
> > of checks would be appropriate.  You'd need a more specific hook for the
> > authorization, one that would perform a permission check, e.g. an
> > avc_has_perm() call.  Which likely requires defining a new class and
> > permissions for your cachefiles kernel interface.
> 
> Hmmm...  This is sounding very not-simple.  I don't know how to do this.  I
> can probably guess the kernel side by looking at how SECCLASS_KEY is done, but
> it sounds like it involves changes to the userspace policy processing tools
> too.

Not the tools, just the policy definitions.  Dan can help with that.
You add the definitions to the policy, then there is a script to
regenerate the SELinux kernel headers that #define the SECCLASS_ and
permission values.

> It also sounds a bit like overkill, but if it's the right way then I guess it
> has to be done.
> 
> What does the security class represent in this case?  And can it be generalised
> to apply to non-cachefiles stuff too?

It is just a way of carving up the permission space, typically based on
object type, but it can essentially be arbitrary.  The check in this
case seems specific to cachefiles since it is controlling an operation
on the /dev/cachefiles interface that only applies to cachefiles
internal operations, so making a cachefiles class seems reasonable.

If this was instead just a generic "set my acting context to <value>"
operation, then it could be a generic /proc/self/attr/active interface
with an generic implementation and permission check, but here we aren't
setting the active context of the cachefilesd daemon but rather of the
cachefiles kernel module.

The other approach that I suggested a long time ago is to exempt the
cachefiles kernel module internal operations from SELinux permission
checks altogether by setting some task flag when performing those
operations and checking for that flag either on entry to the security_
static inlines, similar to the inode private flag, or within SELinux
itself.  Rationale being that the cachefiles kernel module can already
do what it wants and the SELinux permission checks are really about
controlling what userspace can do.  Then we don't have to invent a
context for the kernel module at all or worry about subtle breakage when
some permission is missing from its policy.

> > > Or is it that I need something that takes a secctx, converts it to a secid
> > > and authorises its use all in one go?  If it's this, why can't that be
> > > rolles into security_task_kernel_act_as()?  That sets up a task_security
> > > struct which is then switched in and out without consultation of the LSM.
> > 
> > I was under the impression that security_task_kernel_act_as() was being used
> > to switch the current task to an acting context, not to initially set up a
> > struct for later use.
> 
> Definitely the latter.  I guess I wasn't very clear in the patch description.
> It also sounds like I need to adjust the naming of certain functions.
> 
> > If you go with the latter approach, then what is the lifecycle on that
> > struct?
> 
>  (1) You create a new task_security struct
> 
>  (2) You fill in the fsuid, fsgid, etc.
> 
>  (3) You request that the LSM security pointer in it be set to point to the
>      context you want (at the moment this is done by attempting a transition
>      from the daemon's context):
> 
> 	ret = security_transition_sid(dtsec->sid, SECINITSID_KERNEL,
> 				      SECCLASS_PROCESS, &ksid);
> 
>      and, in the current code, it returns an error if you're not allowed to do
>      that.  But instead you'd ask it to set a specific context, and it'd set
>      that if you're permitted to do that, and give you an error if you're not.
> 
>  (4) You then use the task_security at will to override the task->act_as
>      pointer in whatever task(s) you're operating on behalf of at the moment.
> 
>  (5) When you cease operating on the behalf of a task, you revert its act_as
>      pointer and drop a reference to your task_security struct.
> 
>  (6) When the last ref to the task_security struct goes away, the LSM data
>      attached to it is released, as are its groups and keyrings.
> 
> > BTW, it gets a little confusing with your use of task_security for the full
> > task security state vs our existing use of task_security_struct within
> > SELinux for the task's LSM security blob.
> 
> I know.  I thought about it quite a bit, but the problem is there's so much
> overloading of various words (eg: security, context), and I wanted to avoid
> struct cred for various other reasons.
> 
> > I suppose ours could be renamed to task_selinux.
> 
> Better still, perhaps, would be to prefix things with selinux_ to make it
> namespace clean.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 22:55                   ` David Howells
@ 2007-12-13 14:51                     ` Stephen Smalley
  2007-12-13 16:03                     ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2007-12-13 14:51 UTC (permalink / raw)
  To: David Howells
  Cc: casey, Karl MacMillan, viro, hch, Trond.Myklebust, linux-kernel,
	selinux, linux-security-module

On Wed, 2007-12-12 at 22:55 +0000, David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > More likely, run it at build time in your .spec file to generate
> > cachefiles.conf,
> 
> I don't think sticking it in cachefiles.conf is a good idea necessarily.
> That has to be an administrator modifiable file.  Is there a program I could
> make cachefiles run directly and capture the output of that could give me the
> info I want?

Yes, we could easily make a simple program that just invokes a
libselinux function that in turn grabs the proper context from some
context configuration file under /etc/selinux/$SELINUXTYPE/contexts/ and
outputs it.  Dan can help with that.

> > then run it again maybe upon a policy update or if the user selects a
> > different policy.
> 
> How do I do that?

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 22:49                 ` David Howells
  2007-12-13 14:49                   ` Stephen Smalley
@ 2007-12-13 15:36                   ` David Howells
  2007-12-13 16:23                     ` Stephen Smalley
  2007-12-13 17:01                     ` David Howells
  1 sibling, 2 replies; 126+ messages in thread
From: David Howells @ 2007-12-13 15:36 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, casey, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> It is just a way of carving up the permission space, typically based on
> object type, but it can essentially be arbitrary.  The check in this
> case seems specific to cachefiles since it is controlling an operation
> on the /dev/cachefiles interface that only applies to cachefiles
> internal operations, so making a cachefiles class seems reasonable.

Can you specify what sort of permissions you're thinking of providing for
tasks to operate on this class?  Can an object of this class 'operate' on
other objects, or can only process-class objects do that?

How does an object of this class acquire a label?  What is an object of this
class?  Is it a "cache"?  Or were you thinking of a "module"?

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-12 22:55                   ` David Howells
  2007-12-13 14:51                     ` Stephen Smalley
@ 2007-12-13 16:03                     ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-13 16:03 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, casey, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> 
> Yes, we could easily make a simple program that just invokes a
> libselinux function that in turn grabs the proper context from some
> context configuration file under /etc/selinux/$SELINUXTYPE/contexts/ and
> outputs it.  Dan can help with that.

That sounds nicely genericisable, perhaps even for any LSM.

	/usr/bin/lsm-get-context cachefiles

It does have to be able to come up with different contexts for different
caches, but that can be controlled by changing the name supplied to it.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-13 15:36                   ` David Howells
@ 2007-12-13 16:23                     ` Stephen Smalley
  2007-12-13 17:01                     ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2007-12-13 16:23 UTC (permalink / raw)
  To: David Howells
  Cc: casey, Karl MacMillan, viro, hch, Trond.Myklebust, linux-kernel,
	selinux, linux-security-module

On Thu, 2007-12-13 at 15:36 +0000, David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > It is just a way of carving up the permission space, typically based on
> > object type, but it can essentially be arbitrary.  The check in this
> > case seems specific to cachefiles since it is controlling an operation
> > on the /dev/cachefiles interface that only applies to cachefiles
> > internal operations, so making a cachefiles class seems reasonable.
> 
> Can you specify what sort of permissions you're thinking of providing for
> tasks to operate on this class?

They would correspond with the operations provided by
the /dev/cachefiles interface, at the granularity you want to support
distinctions to be made.  Could just be a single 'setcontext' permission
if that is all you want to control distinctly, or could be a permission
per operation.

>   Can an object of this class 'operate' on
> other objects, or can only process-class objects do that?

In this case, I wouldn't expect a cachefiles object to act on anything
else.  Some objects are also used as subjects, especially in the
networking arena.

> How does an object of this class acquire a label?  What is an object of this
> class?  Is it a "cache"?  Or were you thinking of a "module"?

I was thinking the latter since the only goal was to control what
contexts could be set by a given task, but you could support per-cache
"objects" with their own labels (in which case the label would likely be
determined from the creating task).

If the latter, you don't really need a label for the object, and can
just use the supplied context/secid as the object of the permission
check, ala:
	rc = avc_has_perm(tsec->sid, secid, SECCLASS_CACHEFILES,
CACHEFILES__SETCONTEXT);

If the former, then you'd need more than one check, as you then have to
check whether the task can act on the cache in question, and then check
whether it can set the context for that cache to the specified value.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-13 15:36                   ` David Howells
  2007-12-13 16:23                     ` Stephen Smalley
@ 2007-12-13 17:01                     ` David Howells
  2007-12-13 17:27                       ` Stephen Smalley
  2007-12-13 18:04                       ` David Howells
  1 sibling, 2 replies; 126+ messages in thread
From: David Howells @ 2007-12-13 17:01 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, casey, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> They would correspond with the operations provided by the /dev/cachefiles
> interface, at the granularity you want to support distinctions to be made.

Can this be made simpler by the fact that /dev/cachefiles has its own unique
label (cachefiles_dev_t).

> Could just be a single 'setcontext' permission if that is all you want to
> control distinctly, or could be a permission per operation.

There is only one operation that makes sense to have a permission: "set
context and begin caching".

All the other operations on a file descriptor attached to /dev/cachfiles are
necessary for there to be a managed cache at all, and given that you've
managed to open /dev/cachefiles that's sufficient access for those, I think.

> If the latter, you don't really need a label for the object, and can
> just use the supplied context/secid as the object of the permission
> check, ala:
> 	rc = avc_has_perm(tsec->sid, secid, SECCLASS_CACHEFILES,
> CACHEFILES__SETCONTEXT);

Ummm.   I was under the impression that the target SID had to be a member of
target class.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-13 17:01                     ` David Howells
@ 2007-12-13 17:27                       ` Stephen Smalley
  2007-12-13 18:04                       ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2007-12-13 17:27 UTC (permalink / raw)
  To: David Howells
  Cc: casey, Karl MacMillan, viro, hch, Trond.Myklebust, linux-kernel,
	selinux, linux-security-module

On Thu, 2007-12-13 at 17:01 +0000, David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > They would correspond with the operations provided by the /dev/cachefiles
> > interface, at the granularity you want to support distinctions to be made.
> 
> Can this be made simpler by the fact that /dev/cachefiles has its own unique
> label (cachefiles_dev_t).

That lets you control who can use the interface at all, but not what
operations they can invoke on it.

> > Could just be a single 'setcontext' permission if that is all you want to
> > control distinctly, or could be a permission per operation.
> 
> There is only one operation that makes sense to have a permission: "set
> context and begin caching".
> 
> All the other operations on a file descriptor attached to /dev/cachfiles are
> necessary for there to be a managed cache at all, and given that you've
> managed to open /dev/cachefiles that's sufficient access for those, I think.

Do any of the interfaces allow a task to act on a cache other than one
it has created?  How does the task identify the desired cache?  What if
there is a conflict between multiple tasks asking for the same cache?

> > If the latter, you don't really need a label for the object, and can
> > just use the supplied context/secid as the object of the permission
> > check, ala:
> > 	rc = avc_has_perm(tsec->sid, secid, SECCLASS_CACHEFILES,
> > CACHEFILES__SETCONTEXT);
> 
> Ummm.   I was under the impression that the target SID had to be a member of
> target class.

Not necessarily.

secid is being applied as the acting context for the cachefiles kernel
module, so the above makes sense, even though there isn't really any
"object" in view here.  Abstractly, the question we are asking above is:

Can this task set the context of the cachefiles kernel module to this
value?

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-13 17:01                     ` David Howells
  2007-12-13 17:27                       ` Stephen Smalley
@ 2007-12-13 18:04                       ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-13 18:04 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, casey, Karl MacMillan, viro, hch, Trond.Myklebust,
	linux-kernel, selinux, linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> Do any of the interfaces allow a task to act on a cache other than one
> it has created?

No.

> How does the task identify the desired cache?

Each file descriptor opened creates one separate cache instance.  Any commands
sent over that filedescriptor affect only the cache instance it is attached
to; similarly, any status data you read only refers to that one cache
instance.

Closing the file descriptor makes the cache go away as far as the kernel is
concerned.  The cachefiles daemon retains its cache dev file descriptor for
the lifetime of the daemon.

> What if there is a conflict between multiple tasks asking for the same
> cache?

As far as the cache daemon is concerned, the file descriptor is its handle to
the cache so the conflict does not arise.

> secid is being applied as the acting context for the cachefiles kernel
> module, so the above makes sense, even though there isn't really any
> "object" in view here.  Abstractly, the question we are asking above is:
>
> Can this task set the context of the cachefiles kernel module to this
> value?

So the following (taken from cachefilesd.te):

	allow cachefilesd_t cachefiles_var_t : file { getattr rename unlink };

says, for example, allow:

	avc_has_perm("cachefilesd_t",
		     "cachefiles_var_t",
		     SECCLASS_FILE,
		     FILE__RENAME,
		     ...);

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 09/28] FS-Cache: Release page->private after failed readahead [try #2]
  2007-12-05 19:39 ` [PATCH 09/28] FS-Cache: Release page->private after failed readahead " David Howells
@ 2007-12-14  3:51   ` Nick Piggin
  2007-12-17 22:42   ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Nick Piggin @ 2007-12-14  3:51 UTC (permalink / raw)
  To: David Howells
  Cc: viro, hch, Trond.Myklebust, sds, casey, linux-kernel, selinux,
	linux-security-module

On Thursday 06 December 2007 06:39, David Howells wrote:
> The attached patch causes read_cache_pages() to release page-private data
> on a page for which add_to_page_cache() fails or the filler function fails.
> This permits pages with caching references associated with them to be
> cleaned up.
>
> The invalidatepage() address space op is called (indirectly) to do the
> honours.

This is pretty nasty. I would suggest either to have the function
return the number of pages that were added to pagecache, or just
open code it.

>
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
>
>  mm/readahead.c |   39 +++++++++++++++++++++++++++++++++++++--
>  1 files changed, 37 insertions(+), 2 deletions(-)
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index c9c50ca..75aa6b6 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -44,6 +44,41 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
>
>  #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
>
> +/*
> + * see if a page needs releasing upon read_cache_pages() failure
> + * - the caller of read_cache_pages() may have set PG_private before
> calling, + *   such as the NFS fs marking pages that are cached locally on
> disk, thus we + *   need to give the fs a chance to clean up in the event
> of an error + */
> +static void read_cache_pages_invalidate_page(struct address_space
> *mapping, +					     struct page *page)
> +{
> +	if (PagePrivate(page)) {
> +		if (TestSetPageLocked(page))
> +			BUG();
> +		page->mapping = mapping;
> +		do_invalidatepage(page, 0);
> +		page->mapping = NULL;
> +		unlock_page(page);
> +	}
> +	page_cache_release(page);
> +}
> +
> +/*
> + * release a list of pages, invalidating them first if need be
> + */
> +static void read_cache_pages_invalidate_pages(struct address_space
> *mapping, +					      struct list_head *pages)
> +{
> +	struct page *victim;
> +
> +	while (!list_empty(pages)) {
> +		victim = list_to_page(pages);
> +		list_del(&victim->lru);
> +		read_cache_pages_invalidate_page(mapping, victim);
> +	}
> +}
> +
>  /**
>   * read_cache_pages - populate an address space with some pages & start
> reads against them * @mapping: the address_space
> @@ -65,14 +100,14 @@ int read_cache_pages(struct address_space *mapping,
> struct list_head *pages, list_del(&page->lru);
>  		if (add_to_page_cache_lru(page, mapping,
>  					page->index, GFP_KERNEL)) {
> -			page_cache_release(page);
> +			read_cache_pages_invalidate_page(mapping, page);
>  			continue;
>  		}
>  		page_cache_release(page);
>
>  		ret = filler(data, page);
>  		if (unlikely(ret)) {
> -			put_pages_list(pages);
> +			read_cache_pages_invalidate_pages(mapping, pages);
>  			break;
>  		}
>  		task_io_account_read(PAGE_CACHE_SIZE);
>

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]
  2007-12-05 19:39 ` [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management " David Howells
@ 2007-12-14  4:08   ` Nick Piggin
  2007-12-17 22:36   ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Nick Piggin @ 2007-12-14  4:08 UTC (permalink / raw)
  To: David Howells
  Cc: viro, hch, Trond.Myklebust, sds, casey, linux-kernel, selinux,
	linux-security-module

On Thursday 06 December 2007 06:39, David Howells wrote:
> Recruit a couple of page flags to aid in cache management.  The following
> extra flags are defined:
>
>  (1) PG_fscache (PG_owner_priv_2)
>
>      The marked page is backed by a local cache and is pinning resources in
> the cache driver.
>
>  (2) PG_fscache_write (PG_owner_priv_3)
>
>      The marked page is being written to the local cache.  The page may not
> be modified whilst this is in progress.
>
> If PG_fscache is set, then things that checked for PG_private will now also
> check for that.  This includes things like truncation and page
> invalidation. The function page_has_private() had been added to detect
> this.

I'd much prefer if you would handle this in the filesystem, and have it
set PG_private whenever fscache needs to receive a callback, and DTRT
depending on whether PG_fscache etc. is set or not.

Also, this wait_on_page_fscache_write / end_page_fscache_write stuff
seems like it would belong in your fscache headers rather than generic
mm code (ditto for your PG_fscache checks in the page allocator -- you
should use their PG_owner_priv_? names for that).


> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
>
>  fs/splice.c                |    2 +-
>  include/linux/page-flags.h |   38 ++++++++++++++++++++++++++++++++++++--
>  include/linux/pagemap.h    |   11 +++++++++++
>  mm/filemap.c               |   16 ++++++++++++++++
>  mm/migrate.c               |    2 +-
>  mm/page_alloc.c            |    3 +++
>  mm/readahead.c             |    9 +++++----
>  mm/swap.c                  |    4 ++--
>  mm/swap_state.c            |    4 ++--
>  mm/truncate.c              |   10 +++++-----
>  mm/vmscan.c                |    2 +-
>  11 files changed, 83 insertions(+), 18 deletions(-)
>
> diff --git a/fs/splice.c b/fs/splice.c
> index 6bdcb61..61edad7 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -58,7 +58,7 @@ static int page_cache_pipe_buf_steal(struct
> pipe_inode_info *pipe, */
>  		wait_on_page_writeback(page);
>
> -		if (PagePrivate(page))
> +		if (page_has_private(page))
>  			try_to_release_page(page, GFP_KERNEL);
>
>  		/*
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 209d3a4..fcc9e23 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -77,25 +77,30 @@
>  #define PG_active		 6
>  #define PG_slab			 7	/* slab debug (Suparna wants this) */
>
> -#define PG_owner_priv_1		 8	/* Owner use. If pagecache, fs may use*/
> +#define PG_owner_priv_1		 8	/* Owner use. fs may use in pagecache */
>  #define PG_arch_1		 9
>  #define PG_reserved		10
>  #define PG_private		11	/* If pagecache, has fs-private data */
>
>  #define PG_writeback		12	/* Page is under writeback */
> +#define PG_owner_priv_2		13	/* Owner use. fs may use in pagecache */
>  #define PG_compound		14	/* Part of a compound page */
>  #define PG_swapcache		15	/* Swap page: swp_entry_t in private */
>
>  #define PG_mappedtodisk		16	/* Has blocks allocated on-disk */
>  #define PG_reclaim		17	/* To be reclaimed asap */
> +#define PG_owner_priv_3		18	/* Owner use. fs may use in pagecache */
>  #define PG_buddy		19	/* Page is free, on buddy lists */
>
>  /* PG_readahead is only used for file reads; PG_reclaim is only for writes
> */ #define PG_readahead		PG_reclaim /* Reminder to do async read-ahead */
>
> -/* PG_owner_priv_1 users should have descriptive aliases */
> +/* PG_owner_priv_1/2/3 users should have descriptive aliases */
>  #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
>  #define PG_pinned		PG_owner_priv_1	/* Xen pinned pagetable */
> +#define PG_fscache		PG_owner_priv_2	/* Backed by local cache */
> +#define PG_fscache_write	PG_owner_priv_3	/* Writing to local cache */
> +
>
>  #if (BITS_PER_LONG > 32)
>  /*
> @@ -199,6 +204,24 @@ static inline void SetPageUptodate(struct page *page)
>  #define TestClearPageWriteback(page) test_and_clear_bit(PG_writeback,	\
>  							&(page)->flags)
>
> +#define PageFsCache(page)	test_bit(PG_fscache, &(page)->flags)
> +#define SetPageFsCache(page)	set_bit(PG_fscache, &(page)->flags)
> +#define ClearPageFsCache(page)	clear_bit(PG_fscache, &(page)->flags)
> +#define TestSetPageFsCache(page) test_and_set_bit(PG_fscache,
> &(page)->flags) +#define TestClearPageFsCache(page)
> test_and_clear_bit(PG_fscache, \ +						      &(page)->flags)
> +
> +#define PageFsCacheWrite(page)		test_bit(PG_fscache_write, \
> +						 &(page)->flags)
> +#define SetPageFsCacheWrite(page)	set_bit(PG_fscache_write, \
> +						&(page)->flags)
> +#define ClearPageFsCacheWrite(page)	clear_bit(PG_fscache_write, \
> +						  &(page)->flags)
> +#define TestSetPageFsCacheWrite(page)	test_and_set_bit(PG_fscache_write, \
> +							 &(page)->flags)
> +#define
> TestClearPageFsCacheWrite(page)	test_and_clear_bit(PG_fscache_write, \
> +							   &(page)->flags)
> +
>  #define PageBuddy(page)		test_bit(PG_buddy, &(page)->flags)
>  #define __SetPageBuddy(page)	__set_bit(PG_buddy, &(page)->flags)
>  #define __ClearPageBuddy(page)	__clear_bit(PG_buddy, &(page)->flags)
> @@ -272,4 +295,15 @@ static inline void set_page_writeback(struct page
> *page) test_set_page_writeback(page);
>  }
>
> +/**
> + * page_has_private - Determine if page has private stuff
> + * @page: The page to be checked
> + *
> + * Determine if a page has private stuff, indicating that release routines
> + * should be invoked upon it.
> + */
> +#define page_has_private(page)			\
> +	((page)->flags & ((1 << PG_private) |	\
> +			  (1 << PG_fscache)))
> +
>  #endif	/* PAGE_FLAGS_H */
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index db8a410..6a1b317 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -212,6 +212,17 @@ static inline void wait_on_page_writeback(struct page
> *page) extern void end_page_writeback(struct page *page);
>
>  /*
> + * Wait for a page to finish being written to a local cache
> + */
> +static inline void wait_on_page_fscache_write(struct page *page)
> +{
> +	if (PageFsCacheWrite(page))
> +		wait_on_page_bit(page, PG_fscache_write);
> +}
> +
> +extern void end_page_fscache_write(struct page *page);
> +
> +/*
>   * Fault a userspace page into pagetables.  Return non-zero on a fault.
>   *
>   * This assumes that two userspace pages are always sufficient.  That's
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 188cf5f..8a8e5b8 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -560,6 +560,19 @@ void end_page_writeback(struct page *page)
>  EXPORT_SYMBOL(end_page_writeback);
>
>  /**
> + * end_page_fscache_write - End writing page to cache
> + * @page: the page
> + */
> +void end_page_fscache_write(struct page *page)
> +{
> +	if (!TestClearPageFsCacheWrite(page))
> +		BUG();
> +	smp_mb__after_clear_bit();
> +	wake_up_page(page, PG_fscache_write);
> +}
> +EXPORT_SYMBOL(end_page_fscache_write);
> +
> +/**
>   * __lock_page - get a lock on the page, assuming we need to sleep to get
> it * @page: the page to lock
>   *
> @@ -2528,6 +2541,9 @@ out:
>   * (presumably at page->private).  If the release was successful, return
> `1'. * Otherwise return zero.
>   *
> + * This may also be called if PG_fscache is set on a page, indicating that
> the + * page is known to the local caching routines.
> + *
>   * The @gfp_mask argument specifies whether I/O may be performed to
> release * this page (__GFP_IO), and whether the call may block
> (__GFP_WAIT). *
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6a207e8..ebaf557 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -545,7 +545,7 @@ static int fallback_migrate_page(struct address_space
> *mapping, * Buffers may be managed in a filesystem specific way.
>  	 * We must have no buffers or drop them.
>  	 */
> -	if (PagePrivate(page) &&
> +	if (page_has_private(page) &&
>  	    !try_to_release_page(page, GFP_KERNEL))
>  		return -EAGAIN;
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b5a58d4..ac4c341 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -230,6 +230,7 @@ static void bad_page(struct page *page)
>  	dump_stack();
>  	page->flags &= ~(1 << PG_lru	|
>  			1 << PG_private |
> +			1 << PG_fscache	|
>  			1 << PG_locked	|
>  			1 << PG_active	|
>  			1 << PG_dirty	|
> @@ -456,6 +457,7 @@ static inline int free_pages_check(struct page *page)
>  		(page->flags & (
>  			1 << PG_lru	|
>  			1 << PG_private |
> +			1 << PG_fscache	|
>  			1 << PG_locked	|
>  			1 << PG_active	|
>  			1 << PG_slab	|
> @@ -605,6 +607,7 @@ static int prep_new_page(struct page *page, int order,
> gfp_t gfp_flags) (page->flags & (
>  			1 << PG_lru	|
>  			1 << PG_private	|
> +			1 << PG_fscache	|
>  			1 << PG_locked	|
>  			1 << PG_active	|
>  			1 << PG_dirty	|
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 75aa6b6..272ffc7 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -46,14 +46,15 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
>
>  /*
>   * see if a page needs releasing upon read_cache_pages() failure
> - * - the caller of read_cache_pages() may have set PG_private before
> calling, - *   such as the NFS fs marking pages that are cached locally on
> disk, thus we - *   need to give the fs a chance to clean up in the event
> of an error + * - the caller of read_cache_pages() may have set PG_private
> or PG_fscache + *   before calling, such as the NFS fs marking pages that
> are cached locally + *   on disk, thus we need to give the fs a chance to
> clean up in the event of + *   an error
>   */
>  static void read_cache_pages_invalidate_page(struct address_space
> *mapping, struct page *page)
>  {
> -	if (PagePrivate(page)) {
> +	if (page_has_private(page)) {
>  		if (TestSetPageLocked(page))
>  			BUG();
>  		page->mapping = mapping;
> diff --git a/mm/swap.c b/mm/swap.c
> index 9ac8832..22d6e03 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -455,8 +455,8 @@ void pagevec_strip(struct pagevec *pvec)
>  	for (i = 0; i < pagevec_count(pvec); i++) {
>  		struct page *page = pvec->pages[i];
>
> -		if (PagePrivate(page) && !TestSetPageLocked(page)) {
> -			if (PagePrivate(page))
> +		if (page_has_private(page) && !TestSetPageLocked(page)) {
> +			if (page_has_private(page))
>  				try_to_release_page(page, 0);
>  			unlock_page(page);
>  		}
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index b526356..33f1e5c 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -76,7 +76,7 @@ static int __add_to_swap_cache(struct page *page,
> swp_entry_t entry,
>
>  	BUG_ON(!PageLocked(page));
>  	BUG_ON(PageSwapCache(page));
> -	BUG_ON(PagePrivate(page));
> +	BUG_ON(page_has_private(page));
>  	error = radix_tree_preload(gfp_mask);
>  	if (!error) {
>  		write_lock_irq(&swapper_space.tree_lock);
> @@ -129,7 +129,7 @@ void __delete_from_swap_cache(struct page *page)
>  	BUG_ON(!PageLocked(page));
>  	BUG_ON(!PageSwapCache(page));
>  	BUG_ON(PageWriteback(page));
> -	BUG_ON(PagePrivate(page));
> +	BUG_ON(page_has_private(page));
>
>  	radix_tree_delete(&swapper_space.page_tree, page_private(page));
>  	set_page_private(page, 0);
> diff --git a/mm/truncate.c b/mm/truncate.c
> index cadc156..5b7d1c5 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -49,7 +49,7 @@ void do_invalidatepage(struct page *page, unsigned long
> offset) static inline void truncate_partial_page(struct page *page,
> unsigned partial) {
>  	zero_user_page(page, partial, PAGE_CACHE_SIZE - partial, KM_USER0);
> -	if (PagePrivate(page))
> +	if (page_has_private(page))
>  		do_invalidatepage(page, partial);
>  }
>
> @@ -100,7 +100,7 @@ truncate_complete_page(struct address_space *mapping,
> struct page *page)
>
>  	cancel_dirty_page(page, PAGE_CACHE_SIZE);
>
> -	if (PagePrivate(page))
> +	if (page_has_private(page))
>  		do_invalidatepage(page, 0);
>
>  	remove_from_page_cache(page);
> @@ -125,7 +125,7 @@ invalidate_complete_page(struct address_space *mapping,
> struct page *page) if (page->mapping != mapping)
>  		return 0;
>
> -	if (PagePrivate(page) && !try_to_release_page(page, 0))
> +	if (page_has_private(page) && !try_to_release_page(page, 0))
>  		return 0;
>
>  	ret = remove_mapping(mapping, page);
> @@ -347,14 +347,14 @@ invalidate_complete_page2(struct address_space
> *mapping, struct page *page) if (page->mapping != mapping)
>  		return 0;
>
> -	if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
> +	if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
>  		return 0;
>
>  	write_lock_irq(&mapping->tree_lock);
>  	if (PageDirty(page))
>  		goto failed;
>
> -	BUG_ON(PagePrivate(page));
> +	BUG_ON(page_has_private(page));
>  	__remove_from_page_cache(page);
>  	write_unlock_irq(&mapping->tree_lock);
>  	ClearPageUptodate(page);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e5a9597..ff2fa2d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -578,7 +578,7 @@ static unsigned long shrink_page_list(struct list_head
> *page_list, * process address space (page_count == 1) it can be freed.
>  		 * Otherwise, leave the page on the LRU so it is swappable.
>  		 */
> -		if (PagePrivate(page)) {
> +		if (page_has_private(page)) {
>  			if (!try_to_release_page(page, sc->gfp_mask))
>  				goto activate_locked;
>  			if (!mapping && page_count(page) == 1)
>
> --

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 24/28] AFS: Add a function to excise a rejected write from the pagecache [try #2]
  2007-12-05 19:40 ` [PATCH 24/28] AFS: Add a function to excise a rejected write from the pagecache " David Howells
@ 2007-12-14  4:21   ` Nick Piggin
  2007-12-17 22:54   ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Nick Piggin @ 2007-12-14  4:21 UTC (permalink / raw)
  To: David Howells
  Cc: viro, hch, Trond.Myklebust, sds, casey, linux-kernel, selinux,
	linux-security-module

On Thursday 06 December 2007 06:40, David Howells wrote:
> Add a function - cancel_rejected_write() - to excise a rejected write from
> the pagecache.  This function is related to the truncation family of
> routines.  It permits the pages modified by a network filesystem client
> (such as AFS) to be excised and discarded from the pagecache if the attempt
> to write them back to the server fails.
>
> The dirty and writeback states of the afflicted pages are cancelled and the
> pages themselves are detached for recycling.  All PTEs referring to those
> pages are removed.
>
> Note that the locking is tricky as it's very easy to deadlock against
> truncate() and other routines once the pages have been unlocked as part of
> the writeback process.  To this end, the PG_error flag is set, then the
> PG_writeback flag is cleared, and only *then* can lock_page() be called.

This reintroduces the fault vs truncate race window, which must be fixed.

Also, it is adding a fair bit of complexity in an area where we should
instead be reducing it. I think your filesystem should not be doing
writeback caching of dirty data in the cases where it is so problematic
(or at least, disallow mmap and read on the dirty data until it has been
written back or failed).

But otherwise I guess if you really want to discard the dirty data after
a failed writeback attempt, what's wrong with just invalidate_inode_pages2?

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]
  2007-12-05 19:39 ` [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management " David Howells
  2007-12-14  4:08   ` Nick Piggin
@ 2007-12-17 22:36   ` David Howells
  2007-12-18  7:00     ` Nick Piggin
  2007-12-20 18:33     ` David Howells
  1 sibling, 2 replies; 126+ messages in thread
From: David Howells @ 2007-12-17 22:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: dhowells, viro, hch, Trond.Myklebust, sds, casey, linux-kernel,
	selinux, linux-security-module

Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> I'd much prefer if you would handle this in the filesystem, and have it
> set PG_private whenever fscache needs to receive a callback, and DTRT
> depending on whether PG_fscache etc. is set or not.

That's tricky and slower[*].  One of the things I want to do is to modify
iso9660 to do be able to do caching, but PG_private is 'owned' by the generic
buffer cache code.

[*] though perhaps not significantly.

> Also, this wait_on_page_fscache_write / end_page_fscache_write stuff
> seems like it would belong in your fscache headers rather than generic
> mm code (ditto for your PG_fscache checks in the page allocator -- you
> should use their PG_owner_priv_? names for that).

I suppose that's reasonable, though I do want to mention the PG_fscache* bits
in linux/page-flags.h so that anyone looking at those bits to select one to
use can easily see a reason they might not want to.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 09/28] FS-Cache: Release page->private after failed readahead [try #2]
  2007-12-05 19:39 ` [PATCH 09/28] FS-Cache: Release page->private after failed readahead " David Howells
  2007-12-14  3:51   ` Nick Piggin
@ 2007-12-17 22:42   ` David Howells
  2007-12-18  7:03     ` Nick Piggin
  1 sibling, 1 reply; 126+ messages in thread
From: David Howells @ 2007-12-17 22:42 UTC (permalink / raw)
  To: Nick Piggin
  Cc: dhowells, viro, hch, Trond.Myklebust, sds, casey, linux-kernel,
	selinux, linux-security-module

Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> This is pretty nasty.

Why?  If the fs doesn't set PG_private or PG_fscache on any pages before
calling read_cache_pages(), there's no difference.

Furthermore, the differences only crop up in the error handling paths.

> I would suggest either to have the function return the number of pages that
> were added to pagecache,

Which helps how?

> or just open code it.

Well, I could give an alternative read_cache_pages(), I suppose, for just this
situation, but that means there are two parallel functions which then both
need to be maintained.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 24/28] AFS: Add a function to excise a rejected write from the pagecache [try #2]
  2007-12-05 19:40 ` [PATCH 24/28] AFS: Add a function to excise a rejected write from the pagecache " David Howells
  2007-12-14  4:21   ` Nick Piggin
@ 2007-12-17 22:54   ` David Howells
  2007-12-18  7:07     ` Nick Piggin
  2007-12-20 18:49     ` David Howells
  1 sibling, 2 replies; 126+ messages in thread
From: David Howells @ 2007-12-17 22:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: dhowells, viro, hch, Trond.Myklebust, sds, casey, linux-kernel,
	selinux, linux-security-module

Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> This reintroduces the fault vs truncate race window, which must be fixed.

Hmmm...  perhaps.  I remember that cropped up in NFS, but I'm doing things a
bit differently to NFS.  Remind me again how that worked please.

> Also, it is adding a fair bit of complexity in an area where we should
> instead be reducing it. I think your filesystem should not be doing
> writeback caching of dirty data in the cases where it is so problematic
> (or at least, disallow mmap and read on the dirty data until it has been
> written back or failed).

Eh?  It's a stateless network filesystem.  There's a gap between writing to a
file (perhaps though an mmap) and the pagecache pages being written back in
which someone may change the security on a file and block the writeback.
There's nothing I can do to prevent it, so I have to instead deal with the
consequences should they arise.  See the description of patch 25 for examples.

So you say I shouldn't do any writeback caching at all?

> But otherwise I guess if you really want to discard the dirty data after
> a failed writeback attempt, what's wrong with just invalidate_inode_pages2?

Erm...  Because it deadlocks?

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]
  2007-12-17 22:36   ` David Howells
@ 2007-12-18  7:00     ` Nick Piggin
  2007-12-20 18:33     ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Nick Piggin @ 2007-12-18  7:00 UTC (permalink / raw)
  To: David Howells
  Cc: viro, hch, Trond.Myklebust, sds, casey, linux-kernel, selinux,
	linux-security-module

On Tuesday 18 December 2007 09:36, David Howells wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > I'd much prefer if you would handle this in the filesystem, and have it
> > set PG_private whenever fscache needs to receive a callback, and DTRT
> > depending on whether PG_fscache etc. is set or not.
>
> That's tricky and slower[*].  One of the things I want to do is to modify
> iso9660 to do be able to do caching, but PG_private is 'owned' by the
> generic buffer cache code.

Maybe it is harder, but it is the right way to do it. So you
should modify the filesystems rather than core code.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 09/28] FS-Cache: Release page->private after failed readahead [try #2]
  2007-12-17 22:42   ` David Howells
@ 2007-12-18  7:03     ` Nick Piggin
  0 siblings, 0 replies; 126+ messages in thread
From: Nick Piggin @ 2007-12-18  7:03 UTC (permalink / raw)
  To: David Howells
  Cc: viro, hch, Trond.Myklebust, sds, casey, linux-kernel, selinux,
	linux-security-module

On Tuesday 18 December 2007 09:42, David Howells wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > This is pretty nasty.
>
> Why?  If the fs doesn't set PG_private or PG_fscache on any pages before
> calling read_cache_pages(), there's no difference.

It is conceptually wrong.


> Furthermore, the differences only crop up in the error handling paths.

That doesn't make it any better.


> > I would suggest either to have the function return the number of pages
> > that were added to pagecache,
>
> Which helps how?

So the caller can do their own error handling / cleanup.


> > or just open code it.
>
> Well, I could give an alternative read_cache_pages(), I suppose, for just
> this situation, but that means there are two parallel functions which then
> both need to be maintained.

I would be OK with that.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 24/28] AFS: Add a function to excise a rejected write from the pagecache [try #2]
  2007-12-17 22:54   ` David Howells
@ 2007-12-18  7:07     ` Nick Piggin
  2007-12-20 18:49     ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Nick Piggin @ 2007-12-18  7:07 UTC (permalink / raw)
  To: David Howells
  Cc: viro, hch, Trond.Myklebust, sds, casey, linux-kernel, selinux,
	linux-security-module

On Tuesday 18 December 2007 09:54, David Howells wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > This reintroduces the fault vs truncate race window, which must be fixed.
>
> Hmmm...  perhaps.

What do you mean by perhaps?


> I remember that cropped up in NFS, but I'm doing things 
> a bit differently to NFS.  Remind me again how that worked please.

How what worked? NFS is using invalidate inode pages quite frequently
so it ran into the race more often.


> > Also, it is adding a fair bit of complexity in an area where we should
> > instead be reducing it. I think your filesystem should not be doing
> > writeback caching of dirty data in the cases where it is so problematic
> > (or at least, disallow mmap and read on the dirty data until it has been
> > written back or failed).
>
> Eh?  It's a stateless network filesystem.  There's a gap between writing to
> a file (perhaps though an mmap) and the pagecache pages being written back
> in which someone may change the security on a file and block the writeback.
> There's nothing I can do to prevent it, so I have to instead deal with the
> consequences should they arise.  See the description of patch 25 for
> examples.
>
> So you say I shouldn't do any writeback caching at all?

No, you could do writeback caching but disallow read of dirty data. But
yeah, a coherent mmap isn't possible then, I guess.


> > But otherwise I guess if you really want to discard the dirty data after
> > a failed writeback attempt, what's wrong with just
> > invalidate_inode_pages2?
>
> Erm...  Because it deadlocks?

Why don't you call it after calling end_page_writeback?

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 20:42         ` David Howells
                             ` (2 preceding siblings ...)
  2007-12-11 22:43           ` David Howells
@ 2007-12-19  3:28           ` Crispin Cowan
  2007-12-19 23:38           ` David Howells
  4 siblings, 0 replies; 126+ messages in thread
From: Crispin Cowan @ 2007-12-19  3:28 UTC (permalink / raw)
  To: David Howells
  Cc: Stephen Smalley, Karl MacMillan, viro, hch, Trond.Myklebust,
	casey, linux-kernel, selinux, linux-security-module,
	apparmor-dev

David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
>>> That sounds too SELinux specific.  How do I do it so that it works for any
>>> LSM?
>>>       
>> You can't.  There is no LSM for userspace; LSM specifically disavowed
>> any common userspace API, and that was one of our original
>> objections/concerns about it.
>>     
> So, basically, userspace programs (outside of security tools) aren't supposed
> to need access to the security infrastructure?
>   
That depends on your LSM model. Under SELinux, lots of programs need
access to the security infrastructure to set the label to something
other than the default label. Under AppArmor, there is no concept of a
label, and not much need for programs to access the security infrastructure.

So far, AppArmor has two API's into the kernel: change_hat() and
change_profile(). Both of them are used to change the security context
of the calling process. Naturally they are subject to policy
restrictions of what you get to change to. They are largely intended to
be used by applications servers (think of Java servlets) to change into
a security context associated with a servlet rather than putting all the
hosted applications in the application server's context.

I'm unclear on whether SELinux has an analogous facility. I've been told
that the concept is anathema, and also that there is a similar facility
in SELinux, but without the corresponding user-space components that
AppArmor provides (mod_apparmor for Apache, and a similar module for
Tomcat).

Going way up to the top of the threat for 08/28, you say you want to do
this:

David Howells wrote:
> Allow kernel services to override LSM settings appropriate to the actions
> performed by a task by duplicating a security record, modifying it and then
> using task_struct::act_as to point to it when performing operations on behalf
> of a task.
>
> This is used, for example, by CacheFiles which has to transparently access the
> cache on behalf of a process that thinks it is doing, say, NFS accesses with a
> potentially inappropriate (with respect to accessing the cache) set of
> security data.
I'm not sure, but I think that you could do this *much* easier in
AppArmor using change_profile to switch the NFS Daemon to the profile of
the requesting process. That still leaves some problems:

    * The profile of the requesting process has to exist on the NFS
      server, and it may not.
    * You need a uniform name space of profiles, and you definitely
      don't have that.

However, it seems to me that you have the same problem with SELinux:
what do you do if the domain/type of the requesting process does not
exist on the NFS server? How do you ensure a uniform name space of types?

>>> Is linking against libselinux is a viable option if it's not available under
>>> all LSM models?  Is it available under all LSM models?  Perhaps Casey can
>>> answer this one.
>>>       
>> Nope, they would all have their own libraries, if they have a library at
>> all.  But that isn't your problem
>>     
> It is if I have to maintain a special pieces of code for each possible LSM.
> One piece for SELinux, one piece for AppArmour, one piece for Smack, one piece
> for Casey's security system.  That sounds like a pain.
>   
There is little correspondence between AppArmor and SELinux, but one
point of correspondence is that an AA "profile" is quite similar to the
SELinux notion of a domain as applied to a process. In the Targeted
Policy, they are used the same way too. I could imagine a small
(growing) generic library that mapped onto SELinux and AppArmor APIs for
changing security context.

>> Why are you opposed to having userspace determine the context and write it
>> to a cachefiles interface,
>>     
> Because, from what I gather, that means my userspace program needs to do
> something different, depending on the security model that's currently in force
> on a system.
Yes, that's absolutely true: if you want to manipulate the security
system, you have to actually call it in terms that the security system
will understand.

>   Furthermore, I would have to have separate code, as far as I
> know, for each security model as there's no commonality in userspace.
>   
Yep. If there was a common user space, ten there really wouldn't need to
be separate LSMs. This is a fundamental problem: the security systems
export rather different semantics, so there cannot be a truly generic
user space, just some band aids that provide a klude to access the bits
that kind of have some overlap.

> How about I just stick the context in /etc/cachefilesd.conf as a textual
> configuration item and have the daemon pass that as a string to the cachefiles
> kernel module, which can then ask LSM if it's valid to set this context as an
> override, given the daemon's own security context?  That seems entirely
> reasonable to me.
>   
That semantically maps well to the AppArmor change_profile() API, so
conceptually it should work. It would be easier if you did that in user
space instead of in the kernel, I don't know if it causes a problem to
attempt to kind-of call change_profile() from within the kernel.
Notably, change_profile can fail, so what does your kernel module do
when the attempt to change security context fails?

Crispin

-- 
Crispin Cowan, Ph.D.               http://crispincowan.com/~crispin
CEO, Mercenary Linux		   http://mercenarylinux.com/
	       Itanium. Vista. GPLv3. Complexity at work


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 21:34           ` Stephen Smalley
@ 2007-12-19  3:28             ` Crispin Cowan
  2007-12-19  5:39               ` Casey Schaufler
  2007-12-19 14:54               ` Stephen Smalley
  0 siblings, 2 replies; 126+ messages in thread
From: Crispin Cowan @ 2007-12-19  3:28 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: David Howells, Karl MacMillan, viro, hch, Trond.Myklebust, casey,
	linux-kernel, selinux, linux-security-module, apparmor-dev

Stephen Smalley wrote:
>> It is if I have to maintain a special pieces of code for each possible LSM.
>> One piece for SELinux, one piece for AppArmour, one piece for Smack, one piece
>> for Casey's security system.  That sounds like a pain.
>>     
> All your code has to do is invoke a function provided by libselinux.  If
> at some later time a liblsm is introduced that provides a common
> front-end to a libselinux, libsmack, ..., then you can use that.  But it
> doesn't exist today.   But it all just becomes a simple function call
> regardless.
>   
libapparmor exists. It only had one API, and now it has 2, but just 2
versions on the same concept (change_hat and change_profile).

This is the API for change_hat http://man-wiki.net/index.php/2:change_hat

What does the corresponding API in SELinux look like?

Crispin

-- 
Crispin Cowan, Ph.D.               http://crispincowan.com/~crispin
CEO, Mercenary Linux		   http://mercenarylinux.com/
	       Itanium. Vista. GPLv3. Complexity at work


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-19  3:28             ` Crispin Cowan
@ 2007-12-19  5:39               ` Casey Schaufler
  2007-12-19 14:54               ` Stephen Smalley
  1 sibling, 0 replies; 126+ messages in thread
From: Casey Schaufler @ 2007-12-19  5:39 UTC (permalink / raw)
  To: Crispin Cowan, Stephen Smalley
  Cc: David Howells, Karl MacMillan, viro, hch, Trond.Myklebust, casey,
	linux-kernel, selinux, linux-security-module, apparmor-dev


--- Crispin Cowan <crispin@crispincowan.com> wrote:

> Stephen Smalley wrote:
> >> It is if I have to maintain a special pieces of code for each possible
> LSM.
> >> One piece for SELinux, one piece for AppArmour, one piece for Smack, one
> piece
> >> for Casey's security system.  That sounds like a pain.

It's probably less of a pain if you consider that Casey's
security scheme is Smack.

> > All your code has to do is invoke a function provided by libselinux.  If
> > at some later time a liblsm is introduced that provides a common
> > front-end to a libselinux, libsmack, ..., then you can use that.  But it
> > doesn't exist today.   But it all just becomes a simple function call
> > regardless.
> >   
> libapparmor exists. It only had one API, and now it has 2, but just 2
> versions on the same concept (change_hat and change_profile).
> 
> This is the API for change_hat http://man-wiki.net/index.php/2:change_hat
> 
> What does the corresponding API in SELinux look like?

The POSIX mac_set_proc(mac_t label) might work for this interface.
Sets the current process MAC attribute, if appropriate. The Smack
implementation would be pretty easy:

typedef char * mac_t;

int mac_set_proc(mac_t label)
{
        int fd;
        int rc;

        rc = strlen(label);
        if (rc > SMACK_MAX_LABEL_LEN)
                return -1;
        fd = open("/proc/self/attr/current", O_RDWR);
        if (fd < 0)
                return -1;
        rc = write(fd, label, rc);
        close(fd);
        if (rc < 0)
                return -1;
        return 0;
}


Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-19  3:28             ` Crispin Cowan
  2007-12-19  5:39               ` Casey Schaufler
@ 2007-12-19 14:54               ` Stephen Smalley
  1 sibling, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2007-12-19 14:54 UTC (permalink / raw)
  To: Crispin Cowan
  Cc: David Howells, Karl MacMillan, viro, hch, Trond.Myklebust, casey,
	linux-kernel, selinux, linux-security-module, apparmor-dev

On Tue, 2007-12-18 at 19:28 -0800, Crispin Cowan wrote:
> Stephen Smalley wrote:
> >> It is if I have to maintain a special pieces of code for each possible LSM.
> >> One piece for SELinux, one piece for AppArmour, one piece for Smack, one piece
> >> for Casey's security system.  That sounds like a pain.
> >>     
> > All your code has to do is invoke a function provided by libselinux.  If
> > at some later time a liblsm is introduced that provides a common
> > front-end to a libselinux, libsmack, ..., then you can use that.  But it
> > doesn't exist today.   But it all just becomes a simple function call
> > regardless.
> >   
> libapparmor exists. It only had one API, and now it has 2, but just 2
> versions on the same concept (change_hat and change_profile).
> 
> This is the API for change_hat http://man-wiki.net/index.php/2:change_hat
> 
> What does the corresponding API in SELinux look like?

The SELinux API for changing context is described in:
http://linux.die.net/man/3/setcon

However, setting the current context (SELinux) or profile (AppArmor) for
a userspace task doesn't really provide the functionality required here
for cachefiles, nor does it solve the (different, yet related) nfsd
problem.

cachefiles is a kernel module that needs to assume a different set of
credentials for its internal accesses to the cache files it manages when
a userspace process tries to access a file that has been cached, as the
userspace process in whose context it is operating may (and should) lack
direct permission to those cache files.  The userspace API being talked
about is simply how one configures the credentials to be used by
cachefiles kernel module for its internal accesses, and is done up front
when cachefiles is configured to create a cache.  The internal switching
of the active set of credentials is done via a kernel-internal API (or
just by switching the pointer to the credential structure previously set
up) when the cachefiles kernel module wants to access a cache file.
Further, when this internal switching occurs, we have to be careful that
there are no user-visible side effects on the current task - no change
in how others may operate on that task e.g. signal permission checks or
on how the task appears to others e.g. via /proc.

Neither change_hat nor setcon helps with that problem.

For AppArmor, I suspect that you just want the cachefiles kernel module
to act as unconfined for its internal accesses, nothing more.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-11 20:42         ` David Howells
                             ` (3 preceding siblings ...)
  2007-12-19  3:28           ` Crispin Cowan
@ 2007-12-19 23:38           ` David Howells
  4 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2007-12-19 23:38 UTC (permalink / raw)
  To: Crispin Cowan
  Cc: dhowells, Stephen Smalley, Karl MacMillan, viro, hch,
	Trond.Myklebust, casey, linux-kernel, selinux,
	linux-security-module, apparmor-dev

Crispin Cowan <crispin@crispincowan.com> wrote:

> > This is used, for example, by CacheFiles which has to transparently access
> > the cache on behalf of a process that thinks it is doing, say, NFS
> > accesses with a potentially inappropriate (with respect to accessing the
> > cache) set of security data.
> I'm not sure, but I think that you could do this *much* easier in
> AppArmor using change_profile to switch the NFS Daemon to the profile of
> the requesting process. That still leaves some problems:

NFS Daemon?  NFS quite often runs in the context of whichever process issued,
say, a read syscall.  It's at this point the cachefiles needs to run, and
using change_profile is I suspect not an option there.  Remember: you can't
change the objective profile of the aforementioned process, hence the act_as
pointer.

That said, I don't know what change_profile does.

> However, it seems to me that you have the same problem with SELinux:
> what do you do if the domain/type of the requesting process does not
> exist on the NFS server? How do you ensure a uniform name space of types?

I'm not sure what you're getting at.  The security NFS uses to access the
server is separate from the security that cachefiles uses to access the cache.

> > How about I just stick the context in /etc/cachefilesd.conf as a textual
> > configuration item and have the daemon pass that as a string to the
> > cachefiles kernel module, which can then ask LSM if it's valid to set this
> > context as an override, given the daemon's own security context?  That
> > seems entirely reasonable to me.
> >   
> That semantically maps well to the AppArmor change_profile() API, so
> conceptually it should work.

Okay.

> It would be easier if you did that in user space instead of in the kernel,

Do what in userspace?  Parse the context?  Validate the context?  Or change
the context?

> I don't know if it causes a problem to attempt to kind-of call
> change_profile() from within the kernel.  Notably, change_profile can fail,
> so what does your kernel module do when the attempt to change security
> context fails?

The cache aborts and all subsequent operations on that cache bounce and go to
the server instead.

The change of context cannot be done in userspace because to get to a
userspace capable of attempting this operation would itself require a change
of context.  Besides, it'd also be inefficient, probably horribly so, to do
all caching ops in userspace.

What I need is an LSM operation to change a task_security struct to have a
specified context.  I can then use the task_security struct in all future
cache ops on a cache by pointing task->act_as at it.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]
  2007-12-17 22:36   ` David Howells
  2007-12-18  7:00     ` Nick Piggin
@ 2007-12-20 18:33     ` David Howells
  2007-12-21  1:08       ` Nick Piggin
  2008-01-02 16:27       ` David Howells
  1 sibling, 2 replies; 126+ messages in thread
From: David Howells @ 2007-12-20 18:33 UTC (permalink / raw)
  To: Nick Piggin
  Cc: dhowells, viro, hch, Trond.Myklebust, sds, casey, linux-kernel,
	selinux, linux-security-module

Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > > I'd much prefer if you would handle this in the filesystem, and have it
> > > set PG_private whenever fscache needs to receive a callback, and DTRT
> > > depending on whether PG_fscache etc. is set or not.
> >
> > That's tricky and slower[*].  One of the things I want to do is to modify
> > iso9660 to do be able to do caching, but PG_private is 'owned' by the
> > generic buffer cache code.
> 
> Maybe it is harder, but it is the right way to do it.

You're wrong.  It would mean that PG_private is the logical disjunction of
PG_fscache and some condition not otherwise explicitly stored.  I tried that
with NFS and it was nasty.

As you can no doubt see, it means that you can't distinguish all the states
you used to be able to.

> So you should modify the filesystems rather than core code.

I think you missed what I said:

	but PG_private is 'owned' by the generic buffer cache code.

That means more of the core code would have to change - or, at least, change
more.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 24/28] AFS: Add a function to excise a rejected write from the pagecache [try #2]
  2007-12-17 22:54   ` David Howells
  2007-12-18  7:07     ` Nick Piggin
@ 2007-12-20 18:49     ` David Howells
  2007-12-21  1:11       ` Nick Piggin
  1 sibling, 1 reply; 126+ messages in thread
From: David Howells @ 2007-12-20 18:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: dhowells, viro, hch, Trond.Myklebust, sds, casey, linux-kernel,
	selinux, linux-security-module

Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > This reintroduces the fault vs truncate race window, which must be fixed.
> >
> > Hmmm...  perhaps.
> 
> What do you mean by perhaps?

I mean 'perhaps'.  I'm not sure I remember what the race was, so I can't
evaluate whether or not the same race crops up in AFS too.  So: can you
describe the race please.

> No, you could do writeback caching but disallow read of dirty data.

Someone might already have read-access via mmap at the point someone attempts
to write to an mmapped region.  That means that I'd have to revoke the read
access of the first someone before letting the write take place.

Does NFS do this?

> > > But otherwise I guess if you really want to discard the dirty data after
> > > a failed writeback attempt, what's wrong with just
> > > invalidate_inode_pages2?
> >
> > Erm...  Because it deadlocks?
> 
> Why don't you call it after calling end_page_writeback?

Because then there can be a race over who gets to flush the dead write.
Actually, this may no longer be a problem with your write_begin() changes.
I'll need to have another look at those.

Besides, I don't agree that invalidate_inode_pages2() is necessarily the right
way to do things.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]
  2007-12-20 18:33     ` David Howells
@ 2007-12-21  1:08       ` Nick Piggin
  2008-01-02 16:27       ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Nick Piggin @ 2007-12-21  1:08 UTC (permalink / raw)
  To: David Howells
  Cc: viro, hch, Trond.Myklebust, sds, casey, linux-kernel, selinux,
	linux-security-module

On Friday 21 December 2007 05:33, David Howells wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > > I'd much prefer if you would handle this in the filesystem, and have
> > > > it set PG_private whenever fscache needs to receive a callback, and
> > > > DTRT depending on whether PG_fscache etc. is set or not.
> > >
> > > That's tricky and slower[*].  One of the things I want to do is to
> > > modify iso9660 to do be able to do caching, but PG_private is 'owned'
> > > by the generic buffer cache code.
> >
> > Maybe it is harder, but it is the right way to do it.
>
> You're wrong.  It would mean that PG_private is the logical disjunction of
> PG_fscache and some condition not otherwise explicitly stored.  I tried
> that with NFS and it was nasty.
>
> As you can no doubt see, it means that you can't distinguish all the states
> you used to be able to.
>
> > So you should modify the filesystems rather than core code.
>
> I think you missed what I said:
>
> 	but PG_private is 'owned' by the generic buffer cache code.
>
> That means more of the core code would have to change - or, at least,
> change more.

Then make a PG_private2 bit and use that.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 24/28] AFS: Add a function to excise a rejected write from the pagecache [try #2]
  2007-12-20 18:49     ` David Howells
@ 2007-12-21  1:11       ` Nick Piggin
  0 siblings, 0 replies; 126+ messages in thread
From: Nick Piggin @ 2007-12-21  1:11 UTC (permalink / raw)
  To: David Howells
  Cc: viro, hch, Trond.Myklebust, sds, casey, linux-kernel, selinux,
	linux-security-module

On Friday 21 December 2007 05:49, David Howells wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > > This reintroduces the fault vs truncate race window, which must be
> > > > fixed.
> > >
> > > Hmmm...  perhaps.
> >
> > What do you mean by perhaps?
>
> I mean 'perhaps'.  I'm not sure I remember what the race was, so I can't
> evaluate whether or not the same race crops up in AFS too.  So: can you
> describe the race please.

It's in the changelogs.


> > No, you could do writeback caching but disallow read of dirty data.
>
> Someone might already have read-access via mmap at the point someone
> attempts to write to an mmapped region.  That means that I'd have to revoke
> the read access of the first someone before letting the write take place.
>
> Does NFS do this?

I don't know. But yeah I guess it's tricky. Still, if I was writing a
filesystem, I'd focus on behavioural niceness/correctness above
performance, but maybe that's not realistic.


> > > > But otherwise I guess if you really want to discard the dirty data
> > > > after a failed writeback attempt, what's wrong with just
> > > > invalidate_inode_pages2?
> > >
> > > Erm...  Because it deadlocks?
> >
> > Why don't you call it after calling end_page_writeback?
>
> Because then there can be a race over who gets to flush the dead write.

You should solve it in your filesystem.


> Actually, this may no longer be a problem with your write_begin() changes.
> I'll need to have another look at those.
>
> Besides, I don't agree that invalidate_inode_pages2() is necessarily the
> right way to do things.

Why?

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]
  2007-12-20 18:33     ` David Howells
  2007-12-21  1:08       ` Nick Piggin
@ 2008-01-02 16:27       ` David Howells
  2008-01-07 11:33         ` Nick Piggin
  2008-01-07 13:09         ` David Howells
  1 sibling, 2 replies; 126+ messages in thread
From: David Howells @ 2008-01-02 16:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: dhowells, viro, hch, Trond.Myklebust, sds, casey, linux-kernel,
	selinux, linux-security-module

Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Then make a PG_private2 bit and use that.

To what end?  Are you suggesting I should have:

	PG_private2 = PG_private | PG_fscache

That's redundant information and doesn't help anything really.

My suggestion (PG_private and PG_fscache separate and independent) is pretty
efficient to actually render into machine instructions, especially if the two
bits are placed in the lower part of the word.  On x86, the test for both bits
can be done with a single TEST instruction, and on most RISC archs, a MOVE and
a single AND will usually suffice.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]
  2008-01-02 16:27       ` David Howells
@ 2008-01-07 11:33         ` Nick Piggin
  2008-01-07 13:09         ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Nick Piggin @ 2008-01-07 11:33 UTC (permalink / raw)
  To: David Howells
  Cc: viro, hch, Trond.Myklebust, sds, casey, linux-kernel, selinux,
	linux-security-module

On Thursday 03 January 2008 03:27, David Howells wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > Then make a PG_private2 bit and use that.
>
> To what end?  Are you suggesting I should have:
>
> 	PG_private2 = PG_private | PG_fscache

No. I mean call the bit PG_private2. That way non-pagecache and
filesystems that don't use fscache can use it.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]
  2008-01-02 16:27       ` David Howells
  2008-01-07 11:33         ` Nick Piggin
@ 2008-01-07 13:09         ` David Howells
  2008-01-08  3:01           ` Nick Piggin
  2008-01-08 23:51           ` David Howells
  1 sibling, 2 replies; 126+ messages in thread
From: David Howells @ 2008-01-07 13:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: dhowells, viro, hch, Trond.Myklebust, sds, casey, linux-kernel,
	selinux, linux-security-module

Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> No. I mean call the bit PG_private2. That way non-pagecache and
> filesystems that don't use fscache can use it.

The bit is called PG_owner_priv_2, and then 'subclassed' to PG_fscache, much
like PG_owner_priv_1 is 'subclassed' to PG_checked as was recommended.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]
  2008-01-07 13:09         ` David Howells
@ 2008-01-08  3:01           ` Nick Piggin
  2008-01-08 23:51           ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Nick Piggin @ 2008-01-08  3:01 UTC (permalink / raw)
  To: David Howells
  Cc: viro, hch, Trond.Myklebust, sds, casey, linux-kernel, selinux,
	linux-security-module

On Tuesday 08 January 2008 00:09, David Howells wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > No. I mean call the bit PG_private2. That way non-pagecache and
> > filesystems that don't use fscache can use it.
>
> The bit is called PG_owner_priv_2, and then 'subclassed' to PG_fscache,
> much like PG_owner_priv_1 is 'subclassed' to PG_checked as was recommended.

It is not owner_priv if you're putting checks and tests into core
kernel pagecache code for it. owner_priv means a filesystem has it
_all_ to itself.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]
  2008-01-07 13:09         ` David Howells
  2008-01-08  3:01           ` Nick Piggin
@ 2008-01-08 23:51           ` David Howells
  2008-01-09  1:52             ` Nick Piggin
  2008-01-09 15:45             ` David Howells
  1 sibling, 2 replies; 126+ messages in thread
From: David Howells @ 2008-01-08 23:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: dhowells, viro, hch, Trond.Myklebust, sds, casey, linux-kernel,
	selinux, linux-security-module

Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > No. I mean call the bit PG_private2. That way non-pagecache and
> > > filesystems that don't use fscache can use it.
> >
> > The bit is called PG_owner_priv_2, and then 'subclassed' to PG_fscache,
> > much like PG_owner_priv_1 is 'subclassed' to PG_checked as was recommended.
> 
> It is not owner_priv if you're putting checks and tests into core
> kernel pagecache code for it. owner_priv means a filesystem has it
> _all_ to itself.

Okay, I'll change it if it makes you happy.  Bear in mind, though, you're
dictating instructions that conflict with those other people have dictated.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]
  2008-01-08 23:51           ` David Howells
@ 2008-01-09  1:52             ` Nick Piggin
  2008-01-09 15:45             ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Nick Piggin @ 2008-01-09  1:52 UTC (permalink / raw)
  To: David Howells
  Cc: viro, hch, Trond.Myklebust, sds, casey, linux-kernel, selinux,
	linux-security-module

On Wednesday 09 January 2008 10:51, David Howells wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > > No. I mean call the bit PG_private2. That way non-pagecache and
> > > > filesystems that don't use fscache can use it.
> > >
> > > The bit is called PG_owner_priv_2, and then 'subclassed' to PG_fscache,
> > > much like PG_owner_priv_1 is 'subclassed' to PG_checked as was
> > > recommended.
> >
> > It is not owner_priv if you're putting checks and tests into core
> > kernel pagecache code for it. owner_priv means a filesystem has it
> > _all_ to itself.
>
> Okay, I'll change it if it makes you happy.

It is to make everybody happy. Especially in code that everyone works
on like mm/ and fs/, you can't just have everybody following their own
slightly different conventions.


> Bear in mind, though, you're 
> dictating instructions that conflict with those other people have dictated.

Can you point me to the posts. Thanks.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]
  2008-01-08 23:51           ` David Howells
  2008-01-09  1:52             ` Nick Piggin
@ 2008-01-09 15:45             ` David Howells
  2008-01-09 23:52               ` Nick Piggin
  1 sibling, 1 reply; 126+ messages in thread
From: David Howells @ 2008-01-09 15:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: dhowells, viro, hch, Trond.Myklebust, sds, casey, linux-kernel,
	selinux, linux-security-module

Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> It is to make everybody happy. Especially in code that everyone works
> on like mm/ and fs/, you can't just have everybody following their own
> slightly different conventions.

Conventions are what people agree they are.

Anyway, I've attached a revised page flags patch if you can take a quick look
over that.

I'll drop the AFS-caching and AFS-write-fix patches for the moment and
concentrate on trying to get FS-Cache working with NFS.

So if we can agree on the two(?) things you brought up with the remaining
patches:

 (1) Make PG_fscache overload PG_private_2 rather than PG_owner_priv_2.  Would
     that make you happy with patch 10 of 28?

 (2) Then there's patch 9 of 28 - making read_cache_pages() release private
     data that's already attached to pages it then discards due to error.  Are
     you still going to require that I duplicate read_cache_pages()?  Or can
     you accept that sharing is sufficient, especially if PG_private_2 now
     exists?

David
---
FS-Cache: Recruit a couple of page flags for cache management

From: David Howells <dhowells@redhat.com>

Recruit a couple of page flags to aid in cache management.  The following extra
flags are defined:

 (1) PG_fscache (PG_private_2)

     The marked page is backed by a local cache and is pinning resources in the
     cache driver.

 (2) PG_fscache_write (PG_owner_priv_2)

     The marked page is being written to the local cache.  The page may not be
     modified whilst this is in progress.

If PG_fscache is set, then things that checked for PG_private will now also
check for that.  This includes things like truncation and page invalidation.
The function page_has_private() had been added to make the checks for both
PG_private and PG_private_2 at the same time.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/splice.c                |    2 +-
 include/linux/page-flags.h |   40 ++++++++++++++++++++++++++++++++++++++--
 include/linux/pagemap.h    |   11 +++++++++++
 mm/filemap.c               |   16 ++++++++++++++++
 mm/migrate.c               |    2 +-
 mm/page_alloc.c            |    3 +++
 mm/readahead.c             |    9 +++++----
 mm/swap.c                  |    4 ++--
 mm/swap_state.c            |    4 ++--
 mm/truncate.c              |   10 +++++-----
 mm/vmscan.c                |    2 +-
 11 files changed, 85 insertions(+), 18 deletions(-)


diff --git a/fs/splice.c b/fs/splice.c
index 6bdcb61..61edad7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -58,7 +58,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
 		 */
 		wait_on_page_writeback(page);
 
-		if (PagePrivate(page))
+		if (page_has_private(page))
 			try_to_release_page(page, GFP_KERNEL);
 
 		/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 209d3a4..364f8f9 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -77,25 +77,32 @@
 #define PG_active		 6
 #define PG_slab			 7	/* slab debug (Suparna wants this) */
 
-#define PG_owner_priv_1		 8	/* Owner use. If pagecache, fs may use*/
+#define PG_owner_priv_1		 8	/* Owner use. fs may use in pagecache */
 #define PG_arch_1		 9
 #define PG_reserved		10
 #define PG_private		11	/* If pagecache, has fs-private data */
 
 #define PG_writeback		12	/* Page is under writeback */
+#define PG_private_2		13	/* If pagecache, has fs aux data */
 #define PG_compound		14	/* Part of a compound page */
 #define PG_swapcache		15	/* Swap page: swp_entry_t in private */
 
 #define PG_mappedtodisk		16	/* Has blocks allocated on-disk */
 #define PG_reclaim		17	/* To be reclaimed asap */
+#define PG_owner_priv_2		18	/* Owner use. fs may use in pagecache */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
 #define PG_readahead		PG_reclaim /* Reminder to do async read-ahead */
 
-/* PG_owner_priv_1 users should have descriptive aliases */
+/* PG_owner_priv_1/2 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
 #define PG_pinned		PG_owner_priv_1	/* Xen pinned pagetable */
+#define PG_fscache_write	PG_owner_priv_2	/* Writing to local cache */
+
+/* PG_private_2 causes releasepage() and co to be invoked */
+#define PG_fscache		PG_private_2	/* Backed by local cache */
+
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -199,6 +206,24 @@ static inline void SetPageUptodate(struct page *page)
 #define TestClearPageWriteback(page) test_and_clear_bit(PG_writeback,	\
 							&(page)->flags)
 
+#define PageFsCache(page)	test_bit(PG_fscache, &(page)->flags)
+#define SetPageFsCache(page)	set_bit(PG_fscache, &(page)->flags)
+#define ClearPageFsCache(page)	clear_bit(PG_fscache, &(page)->flags)
+#define TestSetPageFsCache(page) test_and_set_bit(PG_fscache, &(page)->flags)
+#define TestClearPageFsCache(page) test_and_clear_bit(PG_fscache, \
+						      &(page)->flags)
+
+#define PageFsCacheWrite(page)		test_bit(PG_fscache_write, \
+						 &(page)->flags)
+#define SetPageFsCacheWrite(page)	set_bit(PG_fscache_write, \
+						&(page)->flags)
+#define ClearPageFsCacheWrite(page)	clear_bit(PG_fscache_write, \
+						  &(page)->flags)
+#define TestSetPageFsCacheWrite(page)	test_and_set_bit(PG_fscache_write, \
+							 &(page)->flags)
+#define TestClearPageFsCacheWrite(page)	test_and_clear_bit(PG_fscache_write, \
+							   &(page)->flags)
+
 #define PageBuddy(page)		test_bit(PG_buddy, &(page)->flags)
 #define __SetPageBuddy(page)	__set_bit(PG_buddy, &(page)->flags)
 #define __ClearPageBuddy(page)	__clear_bit(PG_buddy, &(page)->flags)
@@ -272,4 +297,15 @@ static inline void set_page_writeback(struct page *page)
 	test_set_page_writeback(page);
 }
 
+/**
+ * page_has_private - Determine if page has private stuff
+ * @page: The page to be checked
+ *
+ * Determine if a page has private stuff, indicating that release routines
+ * should be invoked upon it.
+ */
+#define page_has_private(page)			\
+	((page)->flags & ((1 << PG_private) |	\
+			  (1 << PG_private_2)))
+
 #endif	/* PAGE_FLAGS_H */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index db8a410..6a1b317 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -212,6 +212,17 @@ static inline void wait_on_page_writeback(struct page *page)
 extern void end_page_writeback(struct page *page);
 
 /*
+ * Wait for a page to finish being written to a local cache
+ */
+static inline void wait_on_page_fscache_write(struct page *page)
+{
+	if (PageFsCacheWrite(page))
+		wait_on_page_bit(page, PG_fscache_write);
+}
+
+extern void end_page_fscache_write(struct page *page);
+
+/*
  * Fault a userspace page into pagetables.  Return non-zero on a fault.
  *
  * This assumes that two userspace pages are always sufficient.  That's
diff --git a/mm/filemap.c b/mm/filemap.c
index f4d0cde..a86569f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -572,6 +572,19 @@ void end_page_writeback(struct page *page)
 EXPORT_SYMBOL(end_page_writeback);
 
 /**
+ * end_page_fscache_write - End writing page to cache
+ * @page: the page
+ */
+void end_page_fscache_write(struct page *page)
+{
+	if (!TestClearPageFsCacheWrite(page))
+		BUG();
+	smp_mb__after_clear_bit();
+	wake_up_page(page, PG_fscache_write);
+}
+EXPORT_SYMBOL(end_page_fscache_write);
+
+/**
  * __lock_page - get a lock on the page, assuming we need to sleep to get it
  * @page: the page to lock
  *
@@ -2540,6 +2553,9 @@ out:
  * (presumably at page->private).  If the release was successful, return `1'.
  * Otherwise return zero.
  *
+ * This may also be called if PG_fscache is set on a page, indicating that the
+ * page is known to the local caching routines.
+ *
  * The @gfp_mask argument specifies whether I/O may be performed to release
  * this page (__GFP_IO), and whether the call may block (__GFP_WAIT).
  *
diff --git a/mm/migrate.c b/mm/migrate.c
index 6a207e8..ebaf557 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -545,7 +545,7 @@ static int fallback_migrate_page(struct address_space *mapping,
 	 * Buffers may be managed in a filesystem specific way.
 	 * We must have no buffers or drop them.
 	 */
-	if (PagePrivate(page) &&
+	if (page_has_private(page) &&
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e1028fa..d0d356b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -230,6 +230,7 @@ static void bad_page(struct page *page)
 	dump_stack();
 	page->flags &= ~(1 << PG_lru	|
 			1 << PG_private |
+			1 << PG_fscache	|
 			1 << PG_locked	|
 			1 << PG_active	|
 			1 << PG_dirty	|
@@ -456,6 +457,7 @@ static inline int free_pages_check(struct page *page)
 		(page->flags & (
 			1 << PG_lru	|
 			1 << PG_private |
+			1 << PG_fscache	|
 			1 << PG_locked	|
 			1 << PG_active	|
 			1 << PG_slab	|
@@ -605,6 +607,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 		(page->flags & (
 			1 << PG_lru	|
 			1 << PG_private	|
+			1 << PG_fscache	|
 			1 << PG_locked	|
 			1 << PG_active	|
 			1 << PG_dirty	|
diff --git a/mm/readahead.c b/mm/readahead.c
index 75aa6b6..272ffc7 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -46,14 +46,15 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 /*
  * see if a page needs releasing upon read_cache_pages() failure
- * - the caller of read_cache_pages() may have set PG_private before calling,
- *   such as the NFS fs marking pages that are cached locally on disk, thus we
- *   need to give the fs a chance to clean up in the event of an error
+ * - the caller of read_cache_pages() may have set PG_private or PG_fscache
+ *   before calling, such as the NFS fs marking pages that are cached locally
+ *   on disk, thus we need to give the fs a chance to clean up in the event of
+ *   an error
  */
 static void read_cache_pages_invalidate_page(struct address_space *mapping,
 					     struct page *page)
 {
-	if (PagePrivate(page)) {
+	if (page_has_private(page)) {
 		if (TestSetPageLocked(page))
 			BUG();
 		page->mapping = mapping;
diff --git a/mm/swap.c b/mm/swap.c
index 9ac8832..22d6e03 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -455,8 +455,8 @@ void pagevec_strip(struct pagevec *pvec)
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
 
-		if (PagePrivate(page) && !TestSetPageLocked(page)) {
-			if (PagePrivate(page))
+		if (page_has_private(page) && !TestSetPageLocked(page)) {
+			if (page_has_private(page))
 				try_to_release_page(page, 0);
 			unlock_page(page);
 		}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b526356..33f1e5c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -76,7 +76,7 @@ static int __add_to_swap_cache(struct page *page, swp_entry_t entry,
 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(PageSwapCache(page));
-	BUG_ON(PagePrivate(page));
+	BUG_ON(page_has_private(page));
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
 		write_lock_irq(&swapper_space.tree_lock);
@@ -129,7 +129,7 @@ void __delete_from_swap_cache(struct page *page)
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!PageSwapCache(page));
 	BUG_ON(PageWriteback(page));
-	BUG_ON(PagePrivate(page));
+	BUG_ON(page_has_private(page));
 
 	radix_tree_delete(&swapper_space.page_tree, page_private(page));
 	set_page_private(page, 0);
diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..5b7d1c5 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -49,7 +49,7 @@ void do_invalidatepage(struct page *page, unsigned long offset)
 static inline void truncate_partial_page(struct page *page, unsigned partial)
 {
 	zero_user_page(page, partial, PAGE_CACHE_SIZE - partial, KM_USER0);
-	if (PagePrivate(page))
+	if (page_has_private(page))
 		do_invalidatepage(page, partial);
 }
 
@@ -100,7 +100,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 
 	cancel_dirty_page(page, PAGE_CACHE_SIZE);
 
-	if (PagePrivate(page))
+	if (page_has_private(page))
 		do_invalidatepage(page, 0);
 
 	remove_from_page_cache(page);
@@ -125,7 +125,7 @@ invalidate_complete_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return 0;
 
-	if (PagePrivate(page) && !try_to_release_page(page, 0))
+	if (page_has_private(page) && !try_to_release_page(page, 0))
 		return 0;
 
 	ret = remove_mapping(mapping, page);
@@ -347,14 +347,14 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return 0;
 
-	if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
+	if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
 	write_lock_irq(&mapping->tree_lock);
 	if (PageDirty(page))
 		goto failed;
 
-	BUG_ON(PagePrivate(page));
+	BUG_ON(page_has_private(page));
 	__remove_from_page_cache(page);
 	write_unlock_irq(&mapping->tree_lock);
 	ClearPageUptodate(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e5a9597..ff2fa2d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -578,7 +578,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * process address space (page_count == 1) it can be freed.
 		 * Otherwise, leave the page on the LRU so it is swappable.
 		 */
-		if (PagePrivate(page)) {
+		if (page_has_private(page)) {
 			if (!try_to_release_page(page, sc->gfp_mask))
 				goto activate_locked;
 			if (!mapping && page_count(page) == 1)

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-10 17:07   ` David Howells
  2007-12-10 17:23     ` Stephen Smalley
  2007-12-10 21:08     ` David Howells
@ 2008-01-09 16:51     ` David Howells
  2008-01-09 18:11       ` Stephen Smalley
                         ` (3 more replies)
  2008-01-09 17:27     ` David Howells
  3 siblings, 4 replies; 126+ messages in thread
From: David Howells @ 2008-01-09 16:51 UTC (permalink / raw)
  To: Stephen Smalley, kmacmill
  Cc: dhowells, casey, linux-kernel, selinux, linux-security-module


Okay.  I can:

 (1) Have cachefilesd (the daemon) pass a security context string to the
     cachefiles kernel module, which can then convert it to a secID.  It'll
     require a security_secctx_to_secid() function, but I'm fairly certain I
     have a patch to add such kicking around somewhere.

 (2) Make security_task_kernel_act_as() take a task_security struct and a
     secID and just assign the latter to the former.  I'm not sure it makes
     sense to do any checks here, other than checking that under SELinux the
     secID is of SECCLASS_PROCESS class.

However, I need to write a check that the cachefilesd daemon is permitted to
nominate the secID it did.  Can someone tell me how to do this?  The obvious
way to do this is to add another PROCESS__xxx security permit specifically for
cachefiles, but that seems like a waste of a bit when there are only two spare
bits.

	avc_has_perm(daemon_tsec->sid, nominated_sid,
		     SECCLASS_PROCESS, PROCESS__CACHEFILES_USE, NULL);

Now, I recall the addition of another security class being mentioned, which
presumably would give something like:

	avc_has_perm(daemon_tsec->sid, nominated_sid,
		     SECCLASS_CACHE, CACHE__USE_AS_OVERRIDE, NULL);

And I assume this doesn't care if one, the other or both of the two SIDs
mentioned are of SECCLASS_PROCESS rather than of SECCLASS_CACHE.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2007-12-10 17:07   ` David Howells
                       ` (2 preceding siblings ...)
  2008-01-09 16:51     ` David Howells
@ 2008-01-09 17:27     ` David Howells
  3 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2008-01-09 17:27 UTC (permalink / raw)
  Cc: dhowells, Stephen Smalley, kmacmill, casey, linux-kernel,
	selinux, linux-security-module

David Howells <dhowells@redhat.com> wrote:

> Now, I recall the addition of another security class being mentioned, which
> presumably would give something like:
> 
> 	avc_has_perm(daemon_tsec->sid, nominated_sid,
> 		     SECCLASS_CACHE, CACHE__USE_AS_OVERRIDE, NULL);

Hmmmm...  I can't see how to add a new security class.  I can see that
security classes are defined in various autogenerated header files, but
autogenerated from what?  The "This file is automatically generated.  Do not
edit." message at the top of these files seems to belie the fact they're
actually checked in to GIT as is.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-09 16:51     ` David Howells
@ 2008-01-09 18:11       ` Stephen Smalley
  2008-01-09 18:56       ` David Howells
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2008-01-09 18:11 UTC (permalink / raw)
  To: David Howells
  Cc: Daniel J Walsh, casey, linux-kernel, selinux, linux-security-module


On Wed, 2008-01-09 at 16:51 +0000, David Howells wrote:
> Okay.  I can:
> 
>  (1) Have cachefilesd (the daemon) pass a security context string to the
>      cachefiles kernel module, which can then convert it to a secID.  It'll
>      require a security_secctx_to_secid() function, but I'm fairly certain I
>      have a patch to add such kicking around somewhere.

Already planned for 2.6.25, see:
http://marc.info/?l=selinux&m=119973017423487&w=2

That is in the labeled networking tree.

>  (2) Make security_task_kernel_act_as() take a task_security struct and a
>      secID and just assign the latter to the former.  I'm not sure it makes
>      sense to do any checks here, other than checking that under SELinux the
>      secID is of SECCLASS_PROCESS class.
> 
> However, I need to write a check that the cachefilesd daemon is permitted to
> nominate the secID it did.  Can someone tell me how to do this?  The obvious
> way to do this is to add another PROCESS__xxx security permit specifically for
> cachefiles, but that seems like a waste of a bit when there are only two spare
> bits.
> 
> 	avc_has_perm(daemon_tsec->sid, nominated_sid,
> 		     SECCLASS_PROCESS, PROCESS__CACHEFILES_USE, NULL);
> 
> Now, I recall the addition of another security class being mentioned, which
> presumably would give something like:
> 
> 	avc_has_perm(daemon_tsec->sid, nominated_sid,
> 		     SECCLASS_CACHE, CACHE__USE_AS_OVERRIDE, NULL);
> 
> And I assume this doesn't care if one, the other or both of the two SIDs
> mentioned are of SECCLASS_PROCESS rather than of SECCLASS_CACHE.

Right, the latter is reasonable.
Requires adding the class and permission definition to
policy/flask/security_classes and policy/flask/access_vectors and then
regenerating the kernel headers from those files, ala:
  svn co http://oss.tresys.com/repos/refpolicy/trunk refpolicy
  cd refpolicy/policy/flask
  vi security_classes access_vectors
  <add new class to end>
  make
  make LINUX_D=/path/to/linux-2.6 tokern
 
Dan knows how to do that.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-09 16:51     ` David Howells
  2008-01-09 18:11       ` Stephen Smalley
@ 2008-01-09 18:56       ` David Howells
  2008-01-09 19:19         ` Stephen Smalley
  2008-01-10 11:09         ` David Howells
  2008-01-14 14:01       ` David Howells
  2008-01-14 14:06       ` David Howells
  3 siblings, 2 replies; 126+ messages in thread
From: David Howells @ 2008-01-09 18:56 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, Daniel J Walsh, casey, linux-kernel, selinux,
	linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> Right, the latter is reasonable.
> Requires adding the class and permission definition to
> policy/flask/security_classes and policy/flask/access_vectors and then
> regenerating the kernel headers from those files, ala:
>   svn co http://oss.tresys.com/repos/refpolicy/trunk refpolicy
>   cd refpolicy/policy/flask
>   vi security_classes access_vectors
>   <add new class to end>
>   make
>   make LINUX_D=/path/to/linux-2.6 tokern

Does this require rebuilding and updating all the SELinux rpms to know about
the new class?

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-09 18:56       ` David Howells
@ 2008-01-09 19:19         ` Stephen Smalley
  2008-01-10 11:09         ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2008-01-09 19:19 UTC (permalink / raw)
  To: David Howells
  Cc: Daniel J Walsh, casey, linux-kernel, selinux, linux-security-module


On Wed, 2008-01-09 at 18:56 +0000, David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > Right, the latter is reasonable.
> > Requires adding the class and permission definition to
> > policy/flask/security_classes and policy/flask/access_vectors and then
> > regenerating the kernel headers from those files, ala:
> >   svn co http://oss.tresys.com/repos/refpolicy/trunk refpolicy
> >   cd refpolicy/policy/flask
> >   vi security_classes access_vectors
> >   <add new class to end>
> >   make
> >   make LINUX_D=/path/to/linux-2.6 tokern
> 
> Does this require rebuilding and updating all the SELinux rpms to know about
> the new class?

Policy ultimately has to be updated in order to start writing allow
rules based on the new class/perm.  libselinux et al doesn't have to
change.

If you have a "SELinux:  policy loaded with handle_unknown=allow"
message in your /var/log/messages, then new classes/perms that are not
yet known to the policy will be allowed by default, so the operation
will be permitted by the kernel.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]
  2008-01-09 15:45             ` David Howells
@ 2008-01-09 23:52               ` Nick Piggin
  0 siblings, 0 replies; 126+ messages in thread
From: Nick Piggin @ 2008-01-09 23:52 UTC (permalink / raw)
  To: David Howells
  Cc: viro, hch, Trond.Myklebust, sds, casey, linux-kernel, selinux,
	linux-security-module

On Thursday 10 January 2008 02:45, David Howells wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > It is to make everybody happy. Especially in code that everyone works
> > on like mm/ and fs/, you can't just have everybody following their own
> > slightly different conventions.
>
> Conventions are what people agree they are.

Yes, David :)


> Anyway, I've attached a revised page flags patch if you can take a quick
> look over that.

Still has funny hunks in pagemap.h and filemap.c. At a quick glance
I guess it is better, but can you please point me to the messages
where people have objected to my suggested way of doing it?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-09 18:56       ` David Howells
  2008-01-09 19:19         ` Stephen Smalley
@ 2008-01-10 11:09         ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: David Howells @ 2008-01-10 11:09 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, Daniel J Walsh, casey, linux-kernel, selinux,
	linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> If you have a "SELinux:  policy loaded with handle_unknown=allow"
> message in your /var/log/messages, then new classes/perms that are not
> yet known to the policy will be allowed by default, so the operation
> will be permitted by the kernel.

I don't.  How do I set it?

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-09 16:51     ` David Howells
  2008-01-09 18:11       ` Stephen Smalley
  2008-01-09 18:56       ` David Howells
@ 2008-01-14 14:01       ` David Howells
  2008-01-14 14:52         ` Casey Schaufler
                           ` (2 more replies)
  2008-01-14 14:06       ` David Howells
  3 siblings, 3 replies; 126+ messages in thread
From: David Howells @ 2008-01-14 14:01 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, Daniel J Walsh, casey, linux-kernel, selinux,
	linux-security-module


Stephen Smalley <sds@tycho.nsa.gov> wrote:

> > 	avc_has_perm(daemon_tsec->sid, nominated_sid,
> > 		     SECCLASS_CACHE, CACHE__USE_AS_OVERRIDE, NULL);
> > 
> > And I assume this doesn't care if one, the other or both of the two SIDs
> > mentioned are of SECCLASS_PROCESS rather than of SECCLASS_CACHE.
> 
> Right, the latter is reasonable.

Okay...  It looks like I want four security operations/hooks for cachefiles:

 (1) Check that a daemon can nominate a secid for use by the kernel to override
     the process subjective secid.

 (2) Set the secid mentioned in (1).

 (3) Check that the kernel may create files as a particular secid (this could
     be specified indirectly by specifying an inode, which would hide the secid
     inside the LSM).

 (4) Set the fscreate secid mentioned in (3).

Now, it's possible to condense (1) and (2) into a single op, and condense (3)
and (4) into a single op.  That, however, might make the ops unusable by nfsd,
which may well want to bypass the checks or do them elsewhere.

Any thoughts?

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-09 16:51     ` David Howells
                         ` (2 preceding siblings ...)
  2008-01-14 14:01       ` David Howells
@ 2008-01-14 14:06       ` David Howells
  2008-01-15 14:58         ` Stephen Smalley
  2008-01-23 20:52         ` David Howells
  3 siblings, 2 replies; 126+ messages in thread
From: David Howells @ 2008-01-14 14:06 UTC (permalink / raw)
  Cc: dhowells, Stephen Smalley, Daniel J Walsh, casey, linux-kernel,
	selinux, linux-security-module

David Howells <dhowells@redhat.com> wrote:

> Okay...  It looks like I want four security operations/hooks for cachefiles:

FYI, I added the following vectors:

	# kernel services that need to override task security
	class kernel_service
	{
		use_as_override
		create_files_as
	}

The first allows:

	avc_has_perm(daemon_tsec->sid, nominated_sid,
		     SECCLASS_KERNEL_SERVICE,
		     KERNEL_SERVICE__USE_AS_OVERRIDE,
		     NULL);

And the second something like:

	avc_has_perm(tsec->sid, inode->sid,
		     SECCLASS_KERNEL_SERVICE,
		     KERNEL_SERVICE__CREATE_FILES_AS,
		     NULL);

Rather than specifically dedicating them to the cache, I made them general.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-14 14:01       ` David Howells
@ 2008-01-14 14:52         ` Casey Schaufler
  2008-01-14 15:19           ` David Howells
  2008-01-15 14:56         ` Stephen Smalley
  2008-01-15 16:03         ` David Howells
  2 siblings, 1 reply; 126+ messages in thread
From: Casey Schaufler @ 2008-01-14 14:52 UTC (permalink / raw)
  To: David Howells, Stephen Smalley
  Cc: dhowells, Daniel J Walsh, casey, linux-kernel, selinux,
	linux-security-module


--- David Howells <dhowells@redhat.com> wrote:

> 
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > > 	avc_has_perm(daemon_tsec->sid, nominated_sid,
> > > 		     SECCLASS_CACHE, CACHE__USE_AS_OVERRIDE, NULL);
> > > 
> > > And I assume this doesn't care if one, the other or both of the two SIDs
> > > mentioned are of SECCLASS_PROCESS rather than of SECCLASS_CACHE.
> > 
> > Right, the latter is reasonable.
> 
> Okay...  It looks like I want four security operations/hooks for cachefiles:
> 
>  (1) Check that a daemon can nominate a secid for use by the kernel to
> override
>      the process subjective secid.
> 
>  (2) Set the secid mentioned in (1).
> 
>  (3) Check that the kernel may create files as a particular secid (this could
>      be specified indirectly by specifying an inode, which would hide the
> secid
>      inside the LSM).
> 
>  (4) Set the fscreate secid mentioned in (3).
> 
> Now, it's possible to condense (1) and (2) into a single op,

Yes, and I would recommend doing so to avoid permission races.
You're going to have to deal with the case where step (2) fails
even if you have step (1), so the "test and set" mindset seems
prudent to me.

> and condense (3) and (4) into a single op.  That, however, might make
> the ops unusable by nfsd,
> which may well want to bypass the checks or do them elsewhere.

Again, I don't think you're doing yourself any favors with a separate
test operation.

On (4) are you suggesting a third attribute value? There's the secid
of the task originally, the secid you're going to use to do the access
checks, and the secid you're going to set the file to on creation.

> Any thoughts?

Let me see if I understand your current scheme.

You want a (object) secid that is used to access the task.
You want a (subject) secid that the task uses to accesses objects.
You want a (newobject) secid that an object gets on creation.
And you want them all to be distinct and settable.
Did I get that right?

Thank you.


Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-14 14:52         ` Casey Schaufler
@ 2008-01-14 15:19           ` David Howells
  0 siblings, 0 replies; 126+ messages in thread
From: David Howells @ 2008-01-14 15:19 UTC (permalink / raw)
  To: casey
  Cc: dhowells, Stephen Smalley, Daniel J Walsh, linux-kernel, selinux,
	linux-security-module

Casey Schaufler <casey@schaufler-ca.com> wrote:

> Yes, and I would recommend doing so to avoid permission races.
> You're going to have to deal with the case where step (2) fails
> even if you have step (1), so the "test and set" mindset seems
> prudent to me.

Looking at SELinux, that doesn't get rid of the permission race because there's
no locking.  This may be different for other models.

I was thinking of having steps (2) and (4) not do any checking, but rather
assume that the caller has done the checks before calling the set routines,
possibly by calling the hooks mentioned in (1) and (3).

My main problem is that I don't know how NFSd wants to do things.  I suppose
there *ought* to be rules that say what NFSd is allowed to do.

> > and condense (3) and (4) into a single op.  That, however, might make
> > the ops unusable by nfsd,
> > which may well want to bypass the checks or do them elsewhere.
> 
> Again, I don't think you're doing yourself any favors with a separate
> test operation.

It doesn't matter to cachefiles, but it might matter to other users:-/

> On (4) are you suggesting a third attribute value? There's the secid
> of the task originally, the secid you're going to use to do the access
> checks, and the secid you're going to set the file to on creation.

That's correct.  Let me summarise:

 (1) The daemon has an active process security ID (say A).  When the daemon
     nominates an override process security ID (say B) to be used by the
     kernel, the cachefiles module asks the LSM to check that A is allowed to
     nominate B for this purpose.

 (2) The cachefiles module is given a path under which its cache exists.  The
     directory at the base of this path has its own security ID (say C).
     cachefiles wants to create new files in the cache with the same security
     ID as that directory (ie. C).

     However, when cachefiles is creating files in the cache, the security of
     whatever process is doing the access will be overridden with B, so
     cachefiles asks the LSM to check that B is allowed create files as C.

     Note that this is an instantaneous check in the cache startup stage.  This
     allows caching to be aborted early if the security policy does not permit
     B to create Cs.  Technically this check is superfluous as it's re-checked
     each time a vfs_mkdir() or vfs_create() are called.

> > Any thoughts?
> 
> Let me see if I understand your current scheme.
> 
> You want a (object) secid that is used to access the task.

That depends on what you mean.  cachefilesd (the daemon) will be run with a
security label because there's a security model in place.

I don't actually need to access the daemon, but the daemon does need to do
things for which it requires permission grants.

> You want a (subject) secid that the task uses to accesses objects.

Correct.  This is used as an override by any task that accesses the cache
indirectly through the cachefiles module.

The cachefilesd daemon has its own secid with which it accesses the cache
directly.  The sets of permissions that must be granted by the module's
override subjective secid and by the daemon's subjective secid aren't
identical.

> You want a (newobject) secid that an object gets on creation.

File and directory objects, yes.  The cache is stored on disk as a collection
of files and directories, each of which needs labelling.

> And you want them all to be distinct and settable.

Well, they don't technically have to be different.  The daemon and the module
can be given the same secid, for instance.  However, that secid then grants the
daemon permission to anything the module can, and vice versa.

The third secid is a file label rather than a process label, and so may or may
not have to be different anyway, depending on the model.

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-14 14:01       ` David Howells
  2008-01-14 14:52         ` Casey Schaufler
@ 2008-01-15 14:56         ` Stephen Smalley
  2008-01-15 16:03         ` David Howells
  2 siblings, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2008-01-15 14:56 UTC (permalink / raw)
  To: David Howells
  Cc: Daniel J Walsh, casey, linux-kernel, selinux, linux-security-module


On Mon, 2008-01-14 at 14:01 +0000, David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > > 	avc_has_perm(daemon_tsec->sid, nominated_sid,
> > > 		     SECCLASS_CACHE, CACHE__USE_AS_OVERRIDE, NULL);
> > > 
> > > And I assume this doesn't care if one, the other or both of the two SIDs
> > > mentioned are of SECCLASS_PROCESS rather than of SECCLASS_CACHE.
> > 
> > Right, the latter is reasonable.
> 
> Okay...  It looks like I want four security operations/hooks for cachefiles:
> 
>  (1) Check that a daemon can nominate a secid for use by the kernel to override
>      the process subjective secid.
> 
>  (2) Set the secid mentioned in (1).
> 
>  (3) Check that the kernel may create files as a particular secid (this could
>      be specified indirectly by specifying an inode, which would hide the secid
>      inside the LSM).

I don't think this check is on the kernel per se but rather the ability
of the daemon to nominate a secid for use on files created later by the
kernel module.

>  (4) Set the fscreate secid mentioned in (3).
> 
> Now, it's possible to condense (1) and (2) into a single op, and condense (3)
> and (4) into a single op.  That, however, might make the ops unusable by nfsd,
> which may well want to bypass the checks or do them elsewhere.
> 
> Any thoughts?

I think it is fine to combine them.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-14 14:06       ` David Howells
@ 2008-01-15 14:58         ` Stephen Smalley
  2008-01-23 20:52         ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2008-01-15 14:58 UTC (permalink / raw)
  To: David Howells
  Cc: Daniel J Walsh, casey, linux-kernel, selinux, linux-security-module


On Mon, 2008-01-14 at 14:06 +0000, David Howells wrote:
> David Howells <dhowells@redhat.com> wrote:
> 
> > Okay...  It looks like I want four security operations/hooks for cachefiles:
> 
> FYI, I added the following vectors:
> 
> 	# kernel services that need to override task security
> 	class kernel_service
> 	{
> 		use_as_override
> 		create_files_as
> 	}
> 
> The first allows:
> 
> 	avc_has_perm(daemon_tsec->sid, nominated_sid,
> 		     SECCLASS_KERNEL_SERVICE,
> 		     KERNEL_SERVICE__USE_AS_OVERRIDE,
> 		     NULL);
> 
> And the second something like:
> 
> 	avc_has_perm(tsec->sid, inode->sid,
> 		     SECCLASS_KERNEL_SERVICE,
> 		     KERNEL_SERVICE__CREATE_FILES_AS,
> 		     NULL);
> 
> Rather than specifically dedicating them to the cache, I made them general.

Make sure that you or Dan submits a policy patch to register these
classes and permissions in the policy when the kernel patch is queued
for merge.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-14 14:01       ` David Howells
  2008-01-14 14:52         ` Casey Schaufler
  2008-01-15 14:56         ` Stephen Smalley
@ 2008-01-15 16:03         ` David Howells
  2008-01-15 16:08           ` Stephen Smalley
  2008-01-15 18:10           ` Casey Schaufler
  2 siblings, 2 replies; 126+ messages in thread
From: David Howells @ 2008-01-15 16:03 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, Daniel J Walsh, casey, linux-kernel, selinux,
	linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> >  (3) Check that the kernel may create files as a particular secid (this
> >      could be specified indirectly by specifying an inode, which would
> >      hide the secid inside the LSM).
> 
> I don't think this check is on the kernel per se but rather the ability
> of the daemon to nominate a secid for use on files created later by the
> kernel module.

Hmmm...  At the moment the cachefiles module works out for itself what the
file label should be by looking at the root directory it was given and
assuming the label on that is what it's going to be using.  Are you suggesting
this should be specified directly instead by the daemon?

David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-15 16:03         ` David Howells
@ 2008-01-15 16:08           ` Stephen Smalley
  2008-01-15 18:10           ` Casey Schaufler
  1 sibling, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2008-01-15 16:08 UTC (permalink / raw)
  To: David Howells
  Cc: Daniel J Walsh, casey, linux-kernel, selinux, linux-security-module


On Tue, 2008-01-15 at 16:03 +0000, David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > >  (3) Check that the kernel may create files as a particular secid (this
> > >      could be specified indirectly by specifying an inode, which would
> > >      hide the secid inside the LSM).
> > 
> > I don't think this check is on the kernel per se but rather the ability
> > of the daemon to nominate a secid for use on files created later by the
> > kernel module.
> 
> Hmmm...  At the moment the cachefiles module works out for itself what the
> file label should be by looking at the root directory it was given and
> assuming the label on that is what it's going to be using.  Are you suggesting
> this should be specified directly instead by the daemon?

No, just that however the secid is determined (whether indirectly via
specification of a directory or directly via specification of a secid),
the ability of the daemon to control what secid gets used ought to be
controlled.  Or, alternatively, the ability of the daemon to enable
caching in a given directory ought to be controlled.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-15 16:03         ` David Howells
  2008-01-15 16:08           ` Stephen Smalley
@ 2008-01-15 18:10           ` Casey Schaufler
  2008-01-15 19:15             ` Stephen Smalley
  2008-01-15 21:55             ` David Howells
  1 sibling, 2 replies; 126+ messages in thread
From: Casey Schaufler @ 2008-01-15 18:10 UTC (permalink / raw)
  To: David Howells, Stephen Smalley
  Cc: dhowells, Daniel J Walsh, casey, linux-kernel, selinux,
	linux-security-module


--- David Howells <dhowells@redhat.com> wrote:

> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > >  (3) Check that the kernel may create files as a particular secid (this
> > >      could be specified indirectly by specifying an inode, which would
> > >      hide the secid inside the LSM).
> > 
> > I don't think this check is on the kernel per se but rather the ability
> > of the daemon to nominate a secid for use on files created later by the
> > kernel module.
> 
> Hmmm...  At the moment the cachefiles module works out for itself what the
> file label should be by looking at the root directory it was given and
> assuming the label on that is what it's going to be using.  Are you
> suggesting
> this should be specified directly instead by the daemon?

Oh my. While there will be cases where the label of the file
will match the label of the containing directory, and in fact
for most label based LSMs that will usually be the case, you
certainly can't count on it. The only place that you can find
the correct label for a file with any confidence in from the
xattr (assuming the LSM uses xattrs) on the file itself. I can
imaging an LSM for which it would make sense to derive the
label from the root directory, but I know Smack isn't one of
them, and I don't think that SELinux is either, although I
would defer a definitive answer on that to Stephen.


Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-15 18:10           ` Casey Schaufler
@ 2008-01-15 19:15             ` Stephen Smalley
  2008-01-15 21:55             ` David Howells
  1 sibling, 0 replies; 126+ messages in thread
From: Stephen Smalley @ 2008-01-15 19:15 UTC (permalink / raw)
  To: casey
  Cc: David Howells, Daniel J Walsh, linux-kernel, selinux,
	linux-security-module


On Tue, 2008-01-15 at 10:10 -0800, Casey Schaufler wrote:
> --- David Howells <dhowells@redhat.com> wrote:
> 
> > Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > 
> > > >  (3) Check that the kernel may create files as a particular secid (this
> > > >      could be specified indirectly by specifying an inode, which would
> > > >      hide the secid inside the LSM).
> > > 
> > > I don't think this check is on the kernel per se but rather the ability
> > > of the daemon to nominate a secid for use on files created later by the
> > > kernel module.
> > 
> > Hmmm...  At the moment the cachefiles module works out for itself what the
> > file label should be by looking at the root directory it was given and
> > assuming the label on that is what it's going to be using.  Are you
> > suggesting
> > this should be specified directly instead by the daemon?
> 
> Oh my. While there will be cases where the label of the file
> will match the label of the containing directory, and in fact
> for most label based LSMs that will usually be the case, you
> certainly can't count on it. The only place that you can find
> the correct label for a file with any confidence in from the
> xattr (assuming the LSM uses xattrs) on the file itself. I can
> imaging an LSM for which it would make sense to derive the
> label from the root directory, but I know Smack isn't one of
> them, and I don't think that SELinux is either, although I
> would defer a definitive answer on that to Stephen.

The cache files are created by the cachefiles kernel module, not by the
userspace daemon, and the userspace daemon doesn't need to directly
read/write them at all (but I think it does need to be able to unlink
them?).  The userspace daemon merely identifies the directory where the
cache should live as part of configuring the cache when enabling it.

Hence, it is fine to use a fixed label for the cache files (systemhigh
in a MLS world), and to let the directory's label serve as the basis for
it.  Only the cachefiles kernel module directly reads and writes the
files.
 
-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-15 18:10           ` Casey Schaufler
  2008-01-15 19:15             ` Stephen Smalley
@ 2008-01-15 21:55             ` David Howells
  2008-01-15 22:23               ` Casey Schaufler
  1 sibling, 1 reply; 126+ messages in thread
From: David Howells @ 2008-01-15 21:55 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, casey, Daniel J Walsh, linux-kernel, selinux,
	linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> The cache files are created by the cachefiles kernel module, not by the
> userspace daemon, and the userspace daemon doesn't need to directly
> read/write them at all

That is correct.

> (but I think it does need to be able to unlink them?).

Indeed.

> The userspace daemon merely identifies the directory where the cache should
> live as part of configuring the cache when enabling it.

That is the way it currently works, yes.
 
> Hence, it is fine to use a fixed label for the cache files (systemhigh
> in a MLS world), and to let the directory's label serve as the basis for
> it.

That is what I currently do.  SELinux rules are provided to grant the
appropriate file accesses to the override label used by the kernel module, so
that it can't go and stamp on files with the wrong label.

> Only the cachefiles kernel module directly reads and writes the files.

Correct.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-15 21:55             ` David Howells
@ 2008-01-15 22:23               ` Casey Schaufler
  0 siblings, 0 replies; 126+ messages in thread
From: Casey Schaufler @ 2008-01-15 22:23 UTC (permalink / raw)
  To: David Howells, Stephen Smalley
  Cc: dhowells, casey, Daniel J Walsh, linux-kernel, selinux,
	linux-security-module


--- David Howells <dhowells@redhat.com> wrote:

> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > The cache files are created by the cachefiles kernel module, not by the
> > userspace daemon, and the userspace daemon doesn't need to directly
> > read/write them at all
> 
> That is correct.
> 
> > (but I think it does need to be able to unlink them?).
> 
> Indeed.
> 
> > The userspace daemon merely identifies the directory where the cache should
> > live as part of configuring the cache when enabling it.
> 
> That is the way it currently works, yes.
>  
> > Hence, it is fine to use a fixed label for the cache files (systemhigh
> > in a MLS world), and to let the directory's label serve as the basis for
> > it.
> 
> That is what I currently do.  SELinux rules are provided to grant the
> appropriate file accesses to the override label used by the kernel module, so
> that it can't go and stamp on files with the wrong label.
> 
> > Only the cachefiles kernel module directly reads and writes the files.
> 
> Correct.

Well, my bad, and thank you for clearing up my misunderstanding.


Casey Schaufler
casey@schaufler-ca.com

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-14 14:06       ` David Howells
  2008-01-15 14:58         ` Stephen Smalley
@ 2008-01-23 20:52         ` David Howells
  2008-01-23 22:03           ` James Morris
  1 sibling, 1 reply; 126+ messages in thread
From: David Howells @ 2008-01-23 20:52 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: dhowells, Daniel J Walsh, casey, linux-kernel, selinux,
	linux-security-module

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> Make sure that you or Dan submits a policy patch to register these
> classes and permissions in the policy when the kernel patch is queued
> for merge.

Do I just send the attached patch to <selinux@tycho.nsa.gov>?  Or do I need to
make a diff from a point in the tree nearer the root?  Is there anything else
I need to alter whilst I'm at it?

David
---
Index: policy/flask/security_classes
===================================================================
--- policy/flask/security_classes	(revision 2573)
+++ policy/flask/security_classes	(working copy)
@@ -109,4 +109,7 @@
 # network peer labels
 class peer
 
+# kernel services that need to override task security
+class kernel_service
+
 # FLASK
Index: policy/flask/access_vectors
===================================================================
--- policy/flask/access_vectors	(revision 2573)
+++ policy/flask/access_vectors	(working copy)
@@ -736,3 +736,10 @@
 {
 	recv
 }
+
+# kernel services that need to override task security
+class kernel_service
+{
+	use_as_override
+	create_files_as
+}

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions [try #2]
  2008-01-23 20:52         ` David Howells
@ 2008-01-23 22:03           ` James Morris
  0 siblings, 0 replies; 126+ messages in thread
From: James Morris @ 2008-01-23 22:03 UTC (permalink / raw)
  To: David Howells
  Cc: Stephen Smalley, Daniel J Walsh, casey, linux-kernel, selinux,
	linux-security-module

On Wed, 23 Jan 2008, David Howells wrote:

> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> > Make sure that you or Dan submits a policy patch to register these
> > classes and permissions in the policy when the kernel patch is queued
> > for merge.
> 
> Do I just send the attached patch to <selinux@tycho.nsa.gov>?  

That should be enough (and that list is on the cc).

>Or do I need to
> make a diff from a point in the tree nearer the root?  Is there anything else
> I need to alter whilst I'm at it?
> 
> David
> ---
> Index: policy/flask/security_classes
> ===================================================================
> --- policy/flask/security_classes	(revision 2573)
> +++ policy/flask/security_classes	(working copy)
> @@ -109,4 +109,7 @@
>  # network peer labels
>  class peer
>  
> +# kernel services that need to override task security
> +class kernel_service
> +
>  # FLASK
> Index: policy/flask/access_vectors
> ===================================================================
> --- policy/flask/access_vectors	(revision 2573)
> +++ policy/flask/access_vectors	(working copy)
> @@ -736,3 +736,10 @@
>  {
>  	recv
>  }
> +
> +# kernel services that need to override task security
> +class kernel_service
> +{
> +	use_as_override
> +	create_files_as
> +}
> -
> To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 126+ messages in thread

end of thread, other threads:[~2008-01-23 22:05 UTC | newest]

Thread overview: 126+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-12-05 19:38 [PATCH 00/28] Permit filesystem local caching [try #2] David Howells
2007-12-05 19:38 ` [PATCH 01/28] KEYS: Increase the payload size when instantiating a key " David Howells
2007-12-05 19:38 ` [PATCH 02/28] KEYS: Check starting keyring as part of search " David Howells
2007-12-05 19:38 ` [PATCH 03/28] KEYS: Allow the callout data to be passed as a blob rather than a string " David Howells
2007-12-05 19:38 ` [PATCH 04/28] KEYS: Add keyctl function to get a security label " David Howells
2007-12-05 19:38 ` [PATCH 05/28] Security: Change current->fs[ug]id to current_fs[ug]id() " David Howells
2007-12-05 19:38 ` [PATCH 06/28] SECURITY: Separate task security context from task_struct " David Howells
2007-12-05 19:38 ` [PATCH 07/28] SECURITY: De-embed task security record from task and use refcounting " David Howells
2007-12-05 19:38 ` [PATCH 08/28] SECURITY: Allow kernel services to override LSM settings for task actions " David Howells
2007-12-10 16:46   ` Stephen Smalley
2007-12-10 17:07   ` David Howells
2007-12-10 17:23     ` Stephen Smalley
2007-12-10 21:08     ` David Howells
2007-12-10 21:27       ` Stephen Smalley
2007-12-10 22:26         ` Casey Schaufler
2007-12-10 23:44           ` David Howells
2007-12-10 23:56             ` Casey Schaufler
2007-12-11 18:34           ` Stephen Smalley
2007-12-11 19:26             ` Casey Schaufler
2007-12-11 19:56               ` Stephen Smalley
2007-12-11 20:40                 ` Casey Schaufler
2007-12-10 23:36       ` David Howells
2007-12-10 23:46         ` Casey Schaufler
2007-12-11 19:52           ` Stephen Smalley
2007-12-11 19:37         ` Stephen Smalley
2007-12-12 14:41           ` Karl MacMillan
2007-12-12 14:53           ` David Howells
2007-12-12 14:59             ` Karl MacMillan
2007-12-11 20:42         ` David Howells
2007-12-11 21:18           ` Casey Schaufler
2007-12-11 21:34           ` Stephen Smalley
2007-12-19  3:28             ` Crispin Cowan
2007-12-19  5:39               ` Casey Schaufler
2007-12-19 14:54               ` Stephen Smalley
2007-12-11 22:43           ` David Howells
2007-12-11 23:04             ` Casey Schaufler
2007-12-12 15:25               ` Stephen Smalley
2007-12-12 16:51                 ` Casey Schaufler
2007-12-12 18:12                   ` Stephen Smalley
2007-12-12 18:34                   ` David Howells
2007-12-12 19:44                     ` Casey Schaufler
2007-12-12 19:49                       ` Stephen Smalley
2007-12-12 20:09                         ` Casey Schaufler
2007-12-12 22:29                           ` David Howells
2007-12-12 22:32                       ` David Howells
2007-12-12 18:25               ` David Howells
2007-12-12 19:20                 ` Casey Schaufler
2007-12-12 19:29                   ` David Howells
2007-12-12 19:35                   ` Stephen Smalley
2007-12-12 22:55                   ` David Howells
2007-12-13 14:51                     ` Stephen Smalley
2007-12-13 16:03                     ` David Howells
2007-12-12 18:29               ` David Howells
2007-12-12 19:33                 ` Stephen Smalley
2007-12-12 19:37                 ` Casey Schaufler
2007-12-12 22:52                   ` David Howells
2007-12-12 22:49                 ` David Howells
2007-12-13 14:49                   ` Stephen Smalley
2007-12-13 15:36                   ` David Howells
2007-12-13 16:23                     ` Stephen Smalley
2007-12-13 17:01                     ` David Howells
2007-12-13 17:27                       ` Stephen Smalley
2007-12-13 18:04                       ` David Howells
2007-12-19  3:28           ` Crispin Cowan
2007-12-19 23:38           ` David Howells
2008-01-09 16:51     ` David Howells
2008-01-09 18:11       ` Stephen Smalley
2008-01-09 18:56       ` David Howells
2008-01-09 19:19         ` Stephen Smalley
2008-01-10 11:09         ` David Howells
2008-01-14 14:01       ` David Howells
2008-01-14 14:52         ` Casey Schaufler
2008-01-14 15:19           ` David Howells
2008-01-15 14:56         ` Stephen Smalley
2008-01-15 16:03         ` David Howells
2008-01-15 16:08           ` Stephen Smalley
2008-01-15 18:10           ` Casey Schaufler
2008-01-15 19:15             ` Stephen Smalley
2008-01-15 21:55             ` David Howells
2008-01-15 22:23               ` Casey Schaufler
2008-01-14 14:06       ` David Howells
2008-01-15 14:58         ` Stephen Smalley
2008-01-23 20:52         ` David Howells
2008-01-23 22:03           ` James Morris
2008-01-09 17:27     ` David Howells
2007-12-05 19:39 ` [PATCH 09/28] FS-Cache: Release page->private after failed readahead " David Howells
2007-12-14  3:51   ` Nick Piggin
2007-12-17 22:42   ` David Howells
2007-12-18  7:03     ` Nick Piggin
2007-12-05 19:39 ` [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management " David Howells
2007-12-14  4:08   ` Nick Piggin
2007-12-17 22:36   ` David Howells
2007-12-18  7:00     ` Nick Piggin
2007-12-20 18:33     ` David Howells
2007-12-21  1:08       ` Nick Piggin
2008-01-02 16:27       ` David Howells
2008-01-07 11:33         ` Nick Piggin
2008-01-07 13:09         ` David Howells
2008-01-08  3:01           ` Nick Piggin
2008-01-08 23:51           ` David Howells
2008-01-09  1:52             ` Nick Piggin
2008-01-09 15:45             ` David Howells
2008-01-09 23:52               ` Nick Piggin
2007-12-05 19:39 ` [PATCH 11/28] FS-Cache: Provide an add_wait_queue_tail() function " David Howells
2007-12-05 19:39 ` [PATCH 12/28] FS-Cache: Generic filesystem caching facility " David Howells
2007-12-05 19:39 ` [PATCH 13/28] CacheFiles: Add missing copy_page export for ia64 " David Howells
2007-12-05 19:39 ` [PATCH 14/28] CacheFiles: Be consistent about the use of mapping vs file->f_mapping in Ext3 " David Howells
2007-12-05 19:39 ` [PATCH 15/28] CacheFiles: Add a hook to write a single page of data to an inode " David Howells
2007-12-05 19:39 ` [PATCH 16/28] CacheFiles: Permit the page lock state to be monitored " David Howells
2007-12-05 19:39 ` [PATCH 17/28] CacheFiles: Export things for CacheFiles " David Howells
2007-12-05 19:39 ` [PATCH 18/28] CacheFiles: A cache that backs onto a mounted filesystem " David Howells
2007-12-05 19:39 ` [PATCH 19/28] NFS: Use local caching " David Howells
2007-12-05 19:40 ` [PATCH 20/28] NFS: Configuration and mount option changes to enable local caching on NFS " David Howells
2007-12-05 19:40 ` [PATCH 21/28] NFS: Display local caching state " David Howells
2007-12-05 19:40 ` [PATCH 22/28] fcrypt endianness misannotations " David Howells
2007-12-05 19:40 ` [PATCH 23/28] AFS: Add TestSetPageError() " David Howells
2007-12-05 19:40 ` [PATCH 24/28] AFS: Add a function to excise a rejected write from the pagecache " David Howells
2007-12-14  4:21   ` Nick Piggin
2007-12-17 22:54   ` David Howells
2007-12-18  7:07     ` Nick Piggin
2007-12-20 18:49     ` David Howells
2007-12-21  1:11       ` Nick Piggin
2007-12-05 19:40 ` [PATCH 25/28] AFS: Improve handling of a rejected writeback " David Howells
2007-12-05 19:40 ` [PATCH 26/28] AF_RXRPC: Save the operation ID for debugging " David Howells
2007-12-05 19:40 ` [PATCH 27/28] AFS: Implement shared-writable mmap " David Howells
2007-12-05 19:40 ` [PATCH 28/28] FS-Cache: Make kAFS use FS-Cache " David Howells

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).