* [PATCH 02/46] fs: d_validate fixes
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-12-08  1:53   ` Dave Chinner
  2010-11-27  9:44 ` [PATCH 03/46] kernel: kmem_ptr_validate considered harmful Nick Piggin
                   ` (48 subsequent siblings)
  49 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC
  To: linux-fsdevel; +Cc: linux-kernel

d_validate has been broken for a long time.

kmem_ptr_validate does not guarantee that a pointer can be dereferenced
if it can go away at any time. Even rcu_read_lock doesn't help, because
the pointer might be queued in RCU callbacks but not executed yet.

So the parent cannot be checked, nor the name hashed. The dentry pointer
cannot be touched until it can be verified under lock. Hashing simply
cannot be used.

Instead, verify the parent/child relationship by traversing the parent's
d_subdirs list (linked through d_child). It's slow, but at this point
only ncpfs and the destaged smbfs care about it.
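
The approach described above can be illustrated outside the kernel. The
following is a minimal userspace sketch, not the kernel code: struct node,
struct dir, and validate_child() are hypothetical stand-ins, and a C11
atomic_flag spinlock plays the role of dcache_lock. The point it shows is
that the untrusted pointer is only compared by address until it has been
found on the parent's list under the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel's types. */
struct node {
	struct node *next_sibling;	/* link in the parent's child list */
	int refcount;
};

struct dir {
	atomic_flag lock;		/* plays the role of dcache_lock */
	struct node *children;		/* head of the child list */
};

static void dir_lock(struct dir *d)
{
	while (atomic_flag_test_and_set_explicit(&d->lock, memory_order_acquire))
		;			/* spin until the flag is released */
}

static void dir_unlock(struct dir *d)
{
	atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

/*
 * Return 1 and take a reference if @candidate is currently a child of
 * @parent, else return 0. @candidate is never dereferenced: only list
 * members reached from the trusted parent are touched.
 */
static int validate_child(struct dir *parent, struct node *candidate)
{
	struct node *child;
	int found = 0;

	assert(parent != NULL);

	dir_lock(parent);
	for (child = parent->children; child; child = child->next_sibling) {
		if (child == candidate) {
			child->refcount++;	/* proven live under the lock */
			found = 1;
			break;
		}
	}
	dir_unlock(parent);
	return found;
}
```

The cost is linear in the number of children, which is the slowness the
changelog accepts in exchange for never touching an unverified pointer.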

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/dcache.c |   25 +++++++------------------
 1 files changed, 7 insertions(+), 18 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index cc2b938..9d1a59d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1483,41 +1483,30 @@ out:
 }
 
 /**
- * d_validate - verify dentry provided from insecure source
+ * d_validate - verify dentry provided from insecure source (deprecated)
  * @dentry: The dentry alleged to be valid child of @dparent
  * @dparent: The parent dentry (known to be valid)
  *
  * An insecure source has sent us a dentry, here we verify it and dget() it.
  * This is used by ncpfs in its readdir implementation.
  * Zero is returned in the dentry is invalid.
+ *
+ * This function is slow for big directories, and deprecated, do not use it.
  */
- 
 int d_validate(struct dentry *dentry, struct dentry *dparent)
 {
-	struct hlist_head *base;
-	struct hlist_node *lhp;
-
-	/* Check whether the ptr might be valid at all.. */
-	if (!kmem_ptr_validate(dentry_cache, dentry))
-		goto out;
-
-	if (dentry->d_parent != dparent)
-		goto out;
+	struct dentry *child;
 
 	spin_lock(&dcache_lock);
-	base = d_hash(dparent, dentry->d_name.hash);
-	hlist_for_each(lhp,base) { 
-		/* hlist_for_each_entry_rcu() not required for d_hash list
-		 * as it is parsed under dcache_lock
-		 */
-		if (dentry == hlist_entry(lhp, struct dentry, d_hash)) {
+	list_for_each_entry(child, &dparent->d_subdirs, d_u.d_child) {
+		if (dentry == child) {
 			__dget_locked(dentry);
 			spin_unlock(&dcache_lock);
 			return 1;
 		}
 	}
 	spin_unlock(&dcache_lock);
-out:
+
 	return 0;
 }
 EXPORT_SYMBOL(d_validate);
-- 
1.7.1



* [PATCH 03/46] kernel: kmem_ptr_validate considered harmful
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
  2010-11-27  9:44 ` [PATCH 02/46] fs: d_validate fixes Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 04/46] fs: dcache documentation cleanup Nick Piggin
                   ` (47 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC
  To: linux-fsdevel; +Cc: linux-kernel

This is a nasty and error-prone API. It is no longer used, so remove it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 include/linux/slab.h |    2 --
 mm/slab.c            |   32 +-------------------------------
 mm/slob.c            |    5 -----
 mm/slub.c            |   40 ----------------------------------------
 mm/util.c            |   21 ---------------------
 5 files changed, 1 insertions(+), 99 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 59260e2..fa90866 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -106,8 +106,6 @@ int kmem_cache_shrink(struct kmem_cache *);
 void kmem_cache_free(struct kmem_cache *, void *);
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
-int kern_ptr_validate(const void *ptr, unsigned long size);
-int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
diff --git a/mm/slab.c b/mm/slab.c
index b1e40da..6107f23 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2781,7 +2781,7 @@ static void slab_put_obj(struct kmem_cache *cachep, struct slab *slabp,
 /*
  * Map pages beginning at addr to the given cache and slab. This is required
  * for the slab allocator to be able to lookup the cache and slab of a
- * virtual address for kfree, ksize, kmem_ptr_validate, and slab debugging.
+ * virtual address for kfree, ksize, and slab debugging.
  */
 static void slab_map_pages(struct kmem_cache *cache, struct slab *slab,
 			   void *addr)
@@ -3660,36 +3660,6 @@ void *kmem_cache_alloc_notrace(struct kmem_cache *cachep, gfp_t flags)
 EXPORT_SYMBOL(kmem_cache_alloc_notrace);
 #endif
 
-/**
- * kmem_ptr_validate - check if an untrusted pointer might be a slab entry.
- * @cachep: the cache we're checking against
- * @ptr: pointer to validate
- *
- * This verifies that the untrusted pointer looks sane;
- * it is _not_ a guarantee that the pointer is actually
- * part of the slab cache in question, but it at least
- * validates that the pointer can be dereferenced and
- * looks half-way sane.
- *
- * Currently only used for dentry validation.
- */
-int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr)
-{
-	unsigned long size = cachep->buffer_size;
-	struct page *page;
-
-	if (unlikely(!kern_ptr_validate(ptr, size)))
-		goto out;
-	page = virt_to_page(ptr);
-	if (unlikely(!PageSlab(page)))
-		goto out;
-	if (unlikely(page_get_cache(page) != cachep))
-		goto out;
-	return 1;
-out:
-	return 0;
-}
-
 #ifdef CONFIG_NUMA
 void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 {
diff --git a/mm/slob.c b/mm/slob.c
index 617b6d6..3588eaa 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -678,11 +678,6 @@ int kmem_cache_shrink(struct kmem_cache *d)
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
-int kmem_ptr_validate(struct kmem_cache *a, const void *b)
-{
-	return 0;
-}
-
 static unsigned int slob_ready __read_mostly;
 
 int slab_is_available(void)
diff --git a/mm/slub.c b/mm/slub.c
index 981fb73..5ed0941 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1917,17 +1917,6 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
-/* Figure out on which slab page the object resides */
-static struct page *get_object_page(const void *x)
-{
-	struct page *page = virt_to_head_page(x);
-
-	if (!PageSlab(page))
-		return NULL;
-
-	return page;
-}
-
 /*
  * Object placement in a slab is made very easy because we always start at
  * offset 0. If we tune the size of the object to the alignment then we can
@@ -2386,35 +2375,6 @@ error:
 }
 
 /*
- * Check if a given pointer is valid
- */
-int kmem_ptr_validate(struct kmem_cache *s, const void *object)
-{
-	struct page *page;
-
-	if (!kern_ptr_validate(object, s->size))
-		return 0;
-
-	page = get_object_page(object);
-
-	if (!page || s != page->slab)
-		/* No slab or wrong slab */
-		return 0;
-
-	if (!check_valid_pointer(s, page, object))
-		return 0;
-
-	/*
-	 * We could also check if the object is on the slabs freelist.
-	 * But this would be too expensive and it seems that the main
-	 * purpose of kmem_ptr_valid() is to check if the object belongs
-	 * to a certain slab.
-	 */
-	return 1;
-}
-EXPORT_SYMBOL(kmem_ptr_validate);
-
-/*
  * Determine the size of a slab object
  */
 unsigned int kmem_cache_size(struct kmem_cache *s)
diff --git a/mm/util.c b/mm/util.c
index 73dac81..f126975 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -186,27 +186,6 @@ void kzfree(const void *p)
 }
 EXPORT_SYMBOL(kzfree);
 
-int kern_ptr_validate(const void *ptr, unsigned long size)
-{
-	unsigned long addr = (unsigned long)ptr;
-	unsigned long min_addr = PAGE_OFFSET;
-	unsigned long align_mask = sizeof(void *) - 1;
-
-	if (unlikely(addr < min_addr))
-		goto out;
-	if (unlikely(addr > (unsigned long)high_memory - size))
-		goto out;
-	if (unlikely(addr & align_mask))
-		goto out;
-	if (unlikely(!kern_addr_valid(addr)))
-		goto out;
-	if (unlikely(!kern_addr_valid(addr + size - 1)))
-		goto out;
-	return 1;
-out:
-	return 0;
-}
-
 /*
  * strndup_user - duplicate an existing string from user space
  * @s: The string to duplicate
-- 
1.7.1



* [PATCH 04/46] fs: dcache documentation cleanup
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
  2010-11-27  9:44 ` [PATCH 02/46] fs: d_validate fixes Nick Piggin
  2010-11-27  9:44 ` [PATCH 03/46] kernel: kmem_ptr_validate considered harmful Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 05/46] fs: change d_delete semantics Nick Piggin
                   ` (46 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC
  To: linux-fsdevel; +Cc: linux-kernel

Remove redundant dentry locking documentation (which has also been
incorrect since dcache RCU lookup was introduced) and point to the
canonical document.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 include/linux/dcache.h |   18 ++++++------------
 1 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 6a4aea3..fff9755 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -141,22 +141,16 @@ struct dentry_operations {
 	char *(*d_dname)(struct dentry *, char *, int);
 };
 
-/* the dentry parameter passed to d_hash and d_compare is the parent
+/*
+ * Locking rules for dentry_operations callbacks are to be found in
+ * Documentation/filesystems/Locking. Keep it updated!
+ *
+ * the dentry parameter passed to d_hash and d_compare is the parent
  * directory of the entries to be compared. It is used in case these
  * functions need any directory specific information for determining
  * equivalency classes.  Using the dentry itself might not work, as it
  * might be a negative dentry which has no information associated with
- * it */
-
-/*
-locking rules:
-		big lock	dcache_lock	d_lock   may block
-d_revalidate:	no		no		no       yes
-d_hash		no		no		no       yes
-d_compare:	no		yes		yes      no
-d_delete:	no		yes		no       no
-d_release:	no		no		no       yes
-d_iput:		no		no		no       yes
+ * it.
  */
 
 /* d_flags entries */
-- 
1.7.1



* [PATCH 05/46] fs: change d_delete semantics
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (2 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 04/46] fs: dcache documentation cleanup Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 06/46] cifs: dont overwrite dentry name in d_revalidate Nick Piggin
                   ` (45 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC
  To: linux-fsdevel; +Cc: linux-kernel

Change d_delete from a dentry deletion notification to dentry caching
advice, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.

This makes fine-grained dentry locking of dput and dentry lru scanning
much simpler.
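
As an illustration of the new contract, here is a minimal userspace sketch.
It is not the kernel implementation: struct entry, struct entry_ops,
entry_put(), and demo_delete_if_removed() are hypothetical names. It follows
the rule stated in the vfs.txt hunk of this patch: on the last put, a hashed
entry asks the callback whether it should be cached; an unhashed entry, or
one the callback declines to cache, is deleted.

```c
#include <assert.h>
#include <stddef.h>

struct entry;

/* Hypothetical ops table; should_delete stands in for ->d_delete. */
struct entry_ops {
	/* Advice only: nonzero means "do not cache". Must not modify @e. */
	int (*should_delete)(const struct entry *e);
};

struct entry {
	int refcount;
	int hashed;		/* still reachable by lookup */
	int on_lru;		/* cached, reclaimable on memory pressure */
	int freed;		/* set instead of really freeing, for the demo */
	int removed;		/* filesystem-private state */
	const struct entry_ops *ops;
};

/* Constant and idempotent: reads state, changes nothing, takes no lock. */
static int demo_delete_if_removed(const struct entry *e)
{
	return e->removed != 0;
}

/* Drop one reference; on the last one, either cache or delete the entry. */
static void entry_put(struct entry *e)
{
	int delete_now;

	assert(e->refcount > 0);
	if (--e->refcount > 0)
		return;

	/* A NULL callback means "always cache a reachable entry". */
	delete_now = e->ops && e->ops->should_delete &&
		     e->ops->should_delete(e);

	if (e->hashed && !delete_now) {
		e->on_lru = 1;
		return;
	}
	e->hashed = 0;
	e->on_lru = 0;
	e->freed = 1;
}
```

Because the callback takes a const pointer and has no side effects, the
caller is free to invoke it zero, one, or several times without changing
the outcome.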

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 Documentation/filesystems/porting |    9 +++++++++
 Documentation/filesystems/vfs.txt |   27 +++++++++++++--------------
 arch/ia64/kernel/perfmon.c        |    2 +-
 fs/9p/vfs_dentry.c                |    4 ++--
 fs/afs/dir.c                      |    4 ++--
 fs/btrfs/inode.c                  |    2 +-
 fs/coda/dir.c                     |    4 ++--
 fs/configfs/dir.c                 |    2 +-
 fs/dcache.c                       |    2 --
 fs/gfs2/dentry.c                  |    2 +-
 fs/hostfs/hostfs_kern.c           |    2 +-
 fs/libfs.c                        |    2 +-
 fs/ncpfs/dir.c                    |    4 ++--
 fs/nfs/dir.c                      |    2 +-
 fs/proc/base.c                    |    2 +-
 fs/proc/generic.c                 |    2 +-
 fs/proc/proc_sysctl.c             |    2 +-
 fs/sysfs/dir.c                    |    2 +-
 include/linux/dcache.h            |    6 +++---
 net/sunrpc/rpc_pipe.c             |    2 +-
 20 files changed, 45 insertions(+), 39 deletions(-)

diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index b12c895..6d9d6c3 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -318,3 +318,12 @@ if it's zero is not *and* *never* *had* *been* enough.  Final unlink() and iput(
 may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
 free the on-disk inode, you may end up doing that while ->write_inode() is writing
 to it.
+
+---
+[mandatory]
+
+	.d_delete() now only advises the dcache as to whether or not to cache
+unreferenced dentries, and is now only called when the dentry refcount goes to
+0. Even on 0 refcount transition, it must be able to tolerate being called 0,
+1, or more times (eg. constant, idempotent).
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index ed7e5ef..83c42d0 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -841,9 +841,9 @@ defined:
 
 struct dentry_operations {
 	int (*d_revalidate)(struct dentry *, struct nameidata *);
-	int (*d_hash) (struct dentry *, struct qstr *);
-	int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
-	int (*d_delete)(struct dentry *);
+	int (*d_hash)(struct dentry *, struct qstr *);
+	int (*d_compare)(struct dentry *, struct qstr *, struct qstr *);
+	int (*d_delete)(const struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
 	char *(*d_dname)(struct dentry *, char *, int);
@@ -858,9 +858,11 @@ struct dentry_operations {
 
   d_compare: called when a dentry should be compared with another
 
-  d_delete: called when the last reference to a dentry is
-	deleted. This means no-one is using the dentry, however it is
-	still valid and in the dcache
+  d_delete: called when the last reference to a dentry is dropped and the
+	dcache is deciding whether or not to cache it. Return 1 to delete
+	immediately, or 0 to cache the dentry. Default is NULL which means to
+	always cache a reachable dentry. d_delete must be constant and
+	idempotent.
 
   d_release: called when a dentry is really deallocated
 
@@ -904,14 +906,11 @@ manipulate dentries:
 	the usage count)
 
   dput: close a handle for a dentry (decrements the usage count). If
-	the usage count drops to 0, the "d_delete" method is called
-	and the dentry is placed on the unused list if the dentry is
-	still in its parents hash list. Putting the dentry on the
-	unused list just means that if the system needs some RAM, it
-	goes through the unused list of dentries and deallocates them.
-	If the dentry has already been unhashed and the usage count
-	drops to 0, in this case the dentry is deallocated after the
-	"d_delete" method is called
+	the usage count drops to 0, and the dentry is still in its
+	parent's hash, the "d_delete" method is called to check whether
+	it should be cached. If it should not be cached, or if the dentry
+	is not hashed, it is deleted. Otherwise cached dentries are put
+	into an LRU list to be reclaimed on memory shortage.
 
   d_drop: this unhashes a dentry from its parents hash list. A
 	subsequent call to dput() will deallocate the dentry if its
diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 39e534f..d39d8a5 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -2185,7 +2185,7 @@ static const struct file_operations pfm_file_ops = {
 };
 
 static int
-pfmfs_delete_dentry(struct dentry *dentry)
+pfmfs_delete_dentry(const struct dentry *dentry)
 {
 	return 1;
 }
diff --git a/fs/9p/vfs_dentry.c b/fs/9p/vfs_dentry.c
index cbf4e50..466d2a4 100644
--- a/fs/9p/vfs_dentry.c
+++ b/fs/9p/vfs_dentry.c
@@ -51,7 +51,7 @@
  *
  */
 
-static int v9fs_dentry_delete(struct dentry *dentry)
+static int v9fs_dentry_delete(const struct dentry *dentry)
 {
 	P9_DPRINTK(P9_DEBUG_VFS, " dentry: %s (%p)\n", dentry->d_name.name,
 									dentry);
@@ -68,7 +68,7 @@ static int v9fs_dentry_delete(struct dentry *dentry)
  *
  */
 
-static int v9fs_cached_dentry_delete(struct dentry *dentry)
+static int v9fs_cached_dentry_delete(const struct dentry *dentry)
 {
 	struct inode *inode = dentry->d_inode;
 	P9_DPRINTK(P9_DEBUG_VFS, " dentry: %s (%p)\n", dentry->d_name.name,
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 5439e1b..2c18cde 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -23,7 +23,7 @@ static struct dentry *afs_lookup(struct inode *dir, struct dentry *dentry,
 static int afs_dir_open(struct inode *inode, struct file *file);
 static int afs_readdir(struct file *file, void *dirent, filldir_t filldir);
 static int afs_d_revalidate(struct dentry *dentry, struct nameidata *nd);
-static int afs_d_delete(struct dentry *dentry);
+static int afs_d_delete(const struct dentry *dentry);
 static void afs_d_release(struct dentry *dentry);
 static int afs_lookup_filldir(void *_cookie, const char *name, int nlen,
 				  loff_t fpos, u64 ino, unsigned dtype);
@@ -730,7 +730,7 @@ out_bad:
  * - called from dput() when d_count is going to 0.
  * - return 1 to request dentry be unhashed, 0 otherwise
  */
-static int afs_d_delete(struct dentry *dentry)
+static int afs_d_delete(const struct dentry *dentry)
 {
 	_enter("%s", dentry->d_name.name);
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 558cac2..e134e80 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4127,7 +4127,7 @@ struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry)
 	return inode;
 }
 
-static int btrfs_dentry_delete(struct dentry *dentry)
+static int btrfs_dentry_delete(const struct dentry *dentry)
 {
 	struct btrfs_root *root;
 
diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index 5d8b355..4cce3b0 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -47,7 +47,7 @@ static int coda_readdir(struct file *file, void *buf, filldir_t filldir);
 
 /* dentry ops */
 static int coda_dentry_revalidate(struct dentry *de, struct nameidata *nd);
-static int coda_dentry_delete(struct dentry *);
+static int coda_dentry_delete(const struct dentry *);
 
 /* support routines */
 static int coda_venus_readdir(struct file *coda_file, void *buf,
@@ -577,7 +577,7 @@ out:
  * This is the callback from dput() when d_count is going to 0.
  * We use this to unhash dentries with bad inodes.
  */
-static int coda_dentry_delete(struct dentry * dentry)
+static int coda_dentry_delete(const struct dentry * dentry)
 {
 	int flags;
 
diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
index 0b502f8..1001557 100644
--- a/fs/configfs/dir.c
+++ b/fs/configfs/dir.c
@@ -67,7 +67,7 @@ static void configfs_d_iput(struct dentry * dentry,
  * We _must_ delete our dentries on last dput, as the chain-to-parent
  * behavior is required to clear the parents of default_groups.
  */
-static int configfs_d_delete(struct dentry *dentry)
+static int configfs_d_delete(const struct dentry *dentry)
 {
 	return 1;
 }
diff --git a/fs/dcache.c b/fs/dcache.c
index 9d1a59d..14e0564 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -446,8 +446,6 @@ static void prune_one_dentry(struct dentry * dentry)
 		if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_lock))
 			return;
 
-		if (dentry->d_op && dentry->d_op->d_delete)
-			dentry->d_op->d_delete(dentry);
 		dentry_lru_del(dentry);
 		__d_drop(dentry);
 		dentry = d_kill(dentry);
diff --git a/fs/gfs2/dentry.c b/fs/gfs2/dentry.c
index 6798755..e80fea2 100644
--- a/fs/gfs2/dentry.c
+++ b/fs/gfs2/dentry.c
@@ -106,7 +106,7 @@ static int gfs2_dhash(struct dentry *dentry, struct qstr *str)
 	return 0;
 }
 
-static int gfs2_dentry_delete(struct dentry *dentry)
+static int gfs2_dentry_delete(const struct dentry *dentry)
 {
 	struct gfs2_inode *ginode;
 
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 2c0f148..cfe8bc7 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -32,7 +32,7 @@ static inline struct hostfs_inode_info *HOSTFS_I(struct inode *inode)
 
 #define FILE_HOSTFS_I(file) HOSTFS_I((file)->f_path.dentry->d_inode)
 
-static int hostfs_d_delete(struct dentry *dentry)
+static int hostfs_d_delete(const struct dentry *dentry)
 {
 	return 1;
 }
diff --git a/fs/libfs.c b/fs/libfs.c
index a3accdf..b9d25d8 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -37,7 +37,7 @@ int simple_statfs(struct dentry *dentry, struct kstatfs *buf)
  * Retaining negative dentries for an in-memory filesystem just wastes
  * memory and lookup time: arrange for them to be deleted immediately.
  */
-static int simple_delete_dentry(struct dentry *dentry)
+static int simple_delete_dentry(const struct dentry *dentry)
 {
 	return 1;
 }
diff --git a/fs/ncpfs/dir.c b/fs/ncpfs/dir.c
index f22b12e..d6e6453 100644
--- a/fs/ncpfs/dir.c
+++ b/fs/ncpfs/dir.c
@@ -76,7 +76,7 @@ const struct inode_operations ncp_dir_inode_operations =
 static int ncp_lookup_validate(struct dentry *, struct nameidata *);
 static int ncp_hash_dentry(struct dentry *, struct qstr *);
 static int ncp_compare_dentry (struct dentry *, struct qstr *, struct qstr *);
-static int ncp_delete_dentry(struct dentry *);
+static int ncp_delete_dentry(const struct dentry *);
 
 static const struct dentry_operations ncp_dentry_operations =
 {
@@ -162,7 +162,7 @@ ncp_compare_dentry(struct dentry *dentry, struct qstr *a, struct qstr *b)
  * Closing files can be safely postponed until iput() - it's done there anyway.
  */
 static int
-ncp_delete_dentry(struct dentry * dentry)
+ncp_delete_dentry(const struct dentry * dentry)
 {
 	struct inode *inode = dentry->d_inode;
 
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 8ea4a41..0f7798a 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1127,7 +1127,7 @@ out_error:
 /*
  * This is called from dput() when d_count is going to 0.
  */
-static int nfs_dentry_delete(struct dentry *dentry)
+static int nfs_dentry_delete(const struct dentry *dentry)
 {
 	dfprintk(VFS, "NFS: dentry_delete(%s/%s, %x)\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name,
diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..866a41a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1744,7 +1744,7 @@ static int pid_revalidate(struct dentry *dentry, struct nameidata *nd)
 	return 0;
 }
 
-static int pid_delete_dentry(struct dentry * dentry)
+static int pid_delete_dentry(const struct dentry * dentry)
 {
 	/* Is the task we represent dead?
 	 * If so, then don't put the dentry on the lru list,
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index dd29f03..1d607be 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -400,7 +400,7 @@ static const struct inode_operations proc_link_inode_operations = {
  * smarter: we could keep a "volatile" flag in the 
  * inode to indicate which ones to keep.
  */
-static int proc_delete_dentry(struct dentry * dentry)
+static int proc_delete_dentry(const struct dentry * dentry)
 {
 	return 1;
 }
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index b652cb0..a256d77 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -392,7 +392,7 @@ static int proc_sys_revalidate(struct dentry *dentry, struct nameidata *nd)
 	return !PROC_I(dentry->d_inode)->sysctl->unregistering;
 }
 
-static int proc_sys_delete(struct dentry *dentry)
+static int proc_sys_delete(const struct dentry *dentry)
 {
 	return !!PROC_I(dentry->d_inode)->sysctl->unregistering;
 }
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 7e54bac..27e1102 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -231,7 +231,7 @@ void release_sysfs_dirent(struct sysfs_dirent * sd)
 		goto repeat;
 }
 
-static int sysfs_dentry_delete(struct dentry *dentry)
+static int sysfs_dentry_delete(const struct dentry *dentry)
 {
 	struct sysfs_dirent *sd = dentry->d_fsdata;
 	return !!(sd->s_flags & SYSFS_FLAG_REMOVED);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index fff9755..cbfc956 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -133,9 +133,9 @@ enum dentry_d_lock_class
 
 struct dentry_operations {
 	int (*d_revalidate)(struct dentry *, struct nameidata *);
-	int (*d_hash) (struct dentry *, struct qstr *);
-	int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
-	int (*d_delete)(struct dentry *);
+	int (*d_hash)(struct dentry *, struct qstr *);
+	int (*d_compare)(struct dentry *, struct qstr *, struct qstr *);
+	int (*d_delete)(const struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
 	char *(*d_dname)(struct dentry *, char *, int);
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index 10a17a3..a0dc1a8 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -430,7 +430,7 @@ void rpc_put_mount(void)
 }
 EXPORT_SYMBOL_GPL(rpc_put_mount);
 
-static int rpc_delete_dentry(struct dentry *dentry)
+static int rpc_delete_dentry(const struct dentry *dentry)
 {
 	return 1;
 }
-- 
1.7.1



* [PATCH 06/46] cifs: dont overwrite dentry name in d_revalidate
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (3 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 05/46] fs: change d_delete semantics Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 07/46] jfs: " Nick Piggin
                   ` (44 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC
  To: linux-fsdevel; +Cc: linux-kernel

Use vfat's method for dealing with negative dentries to preserve case,
rather than overwrite dentry name in d_revalidate, which is a bit ugly
and also gets in the way of doing lock-free path walking.
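
The intent-based part of that method can be sketched in isolation. This is
a userspace illustration under stated assumptions: the DEMO_LOOKUP_* bit
values are placeholders rather than the kernel's, demo_ci_revalidate() is a
hypothetical name, and the cifs-specific timeout and positive-dentry checks
are left out. It shows a negative entry being dropped when the lookup is
the final component of a create or rename target, so the name the user
typed is used instead of the cached case.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits; the values are placeholders, not the kernel's. */
#define DEMO_LOOKUP_CONTINUE		0x1u
#define DEMO_LOOKUP_PARENT		0x2u
#define DEMO_LOOKUP_CREATE		0x4u
#define DEMO_LOOKUP_RENAME_TARGET	0x8u

/*
 * Return 1 to keep the cached entry, 0 to drop it and force a new lookup.
 * @has_inode: nonzero for a positive entry
 * @flags: lookup intent, or NULL when the caller's intent is unknown
 */
static int demo_ci_revalidate(int has_inode, const unsigned int *flags)
{
	/* Positive entries are kept in this sketch. */
	if (has_inode)
		return 1;

	/* Unknown intent could be a create, so drop the negative entry. */
	if (!flags)
		return 0;

	/* Last path component of a create or rename target: drop it. */
	if (!(*flags & (DEMO_LOOKUP_CONTINUE | DEMO_LOOKUP_PARENT)) &&
	    (*flags & (DEMO_LOOKUP_CREATE | DEMO_LOOKUP_RENAME_TARGET)))
		return 0;

	return 1;
}
```

Since the decision only reads its arguments and never writes to the cached
name, it does not conflict with lock-free readers of that name.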

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/cifs/dir.c |   43 ++++++++++++++++++++++++-------------------
 1 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 3840edd..521d841 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -656,22 +656,34 @@ lookup_out:
 static int
 cifs_d_revalidate(struct dentry *direntry, struct nameidata *nd)
 {
-	int isValid = 1;
-
 	if (direntry->d_inode) {
 		if (cifs_revalidate_dentry(direntry))
 			return 0;
-	} else {
-		cFYI(1, "neg dentry 0x%p name = %s",
-			 direntry, direntry->d_name.name);
-		if (time_after(jiffies, direntry->d_time + HZ) ||
-			!lookupCacheEnabled) {
-			d_drop(direntry);
-			isValid = 0;
-		}
+		else
+			return 1;
 	}
 
-	return isValid;
+	/*
+	 * This may be nfsd (or something), anyway, we can't see the
+	 * intent of this. So, since this can be for creation, drop it.
+	 */
+	if (!nd)
+		return 0;
+
+	/*
+	 * Drop the negative dentry, in order to make sure to use the
+	 * case sensitive name which is specified by user if this is
+	 * for creation.
+	 */
+	if (!(nd->flags & (LOOKUP_CONTINUE | LOOKUP_PARENT))) {
+		if (nd->flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET))
+			return 0;
+	}
+
+	if (time_after(jiffies, direntry->d_time + HZ) || !lookupCacheEnabled)
+		return 0;
+
+	return 1;
 }
 
 /* static int cifs_d_delete(struct dentry *direntry)
@@ -709,15 +721,8 @@ static int cifs_ci_compare(struct dentry *dentry, struct qstr *a,
 	struct nls_table *codepage = CIFS_SB(dentry->d_inode->i_sb)->local_nls;
 
 	if ((a->len == b->len) &&
-	    (nls_strnicmp(codepage, a->name, b->name, a->len) == 0)) {
-		/*
-		 * To preserve case, don't let an existing negative dentry's
-		 * case take precedence.  If a is not a negative dentry, this
-		 * should have no side effects
-		 */
-		memcpy((void *)a->name, b->name, a->len);
+	    (nls_strnicmp(codepage, a->name, b->name, a->len) == 0))
 		return 0;
-	}
 	return 1;
 }
 
-- 
1.7.1



* [PATCH 07/46] jfs: dont overwrite dentry name in d_revalidate
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (4 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 06/46] cifs: dont overwrite dentry name in d_revalidate Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 08/46] fs: change d_compare for rcu-walk Nick Piggin
                   ` (43 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC
  To: linux-fsdevel; +Cc: linux-kernel

Use vfat's method for dealing with negative dentries to preserve case,
rather than overwrite dentry name in d_revalidate, which is a bit ugly
and also gets in the way of doing lock-free path walking.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/jfs/namei.c |   43 +++++++++++++++++++++++++++++++++++--------
 1 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 231ca4a..2da1546 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -18,6 +18,7 @@
  */
 
 #include <linux/fs.h>
+#include <linux/namei.h>
 #include <linux/ctype.h>
 #include <linux/quotaops.h>
 #include <linux/exportfs.h>
@@ -1597,21 +1598,47 @@ static int jfs_ci_compare(struct dentry *dir, struct qstr *a, struct qstr *b)
 			goto out;
 	}
 	result = 0;
+out:
+	return result;
+}
 
+static int jfs_ci_revalidate(struct dentry *dentry, struct nameidata *nd)
+{
 	/*
-	 * We want creates to preserve case.  A negative dentry, a, that
-	 * has a different case than b may cause a new entry to be created
-	 * with the wrong case.  Since we can't tell if a comes from a negative
-	 * dentry, we blindly replace it with b.  This should be harmless if
-	 * a is not a negative dentry.
+	 * This is not negative dentry. Always valid.
+	 *
+	 * Note, rename() to existing directory entry will have ->d_inode,
+	 * and will use existing name which isn't specified name by user.
+	 *
+	 * We may be able to drop this positive dentry here. But dropping
+	 * positive dentry isn't good idea. So it's unsupported like
+	 * rename("filename", "FILENAME") for now.
 	 */
-	memcpy((unsigned char *)a->name, b->name, a->len);
-out:
-	return result;
+	if (dentry->d_inode)
+		return 1;
+
+	/*
+	 * This may be nfsd (or something), anyway, we can't see the
+	 * intent of this. So, since this can be for creation, drop it.
+	 */
+	if (!nd)
+		return 0;
+
+	/*
+	 * Drop the negative dentry, in order to make sure to use the
+	 * case sensitive name which is specified by user if this is
+	 * for creation.
+	 */
+	if (!(nd->flags & (LOOKUP_CONTINUE | LOOKUP_PARENT))) {
+		if (nd->flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET))
+			return 0;
+	}
+	return 1;
 }
 
 const struct dentry_operations jfs_ci_dentry_operations =
 {
 	.d_hash = jfs_ci_hash,
 	.d_compare = jfs_ci_compare,
+	.d_revalidate = jfs_ci_revalidate,
 };
-- 
1.7.1



* [PATCH 08/46] fs: change d_compare for rcu-walk
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (5 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 07/46] jfs: " Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 09/46] fs: change d_hash " Nick Piggin
                   ` (42 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC
  To: linux-fsdevel; +Cc: linux-kernel

Change d_compare so it may be called from lock-free RCU lookups. This
does put significant restrictions on what may be done from the callback;
however, there do not seem to have been any problems with in-tree
filesystems. If some strange use case pops up that _really_ cannot cope
with the rcu-walk rules, we can just add new rcu-unaware callbacks, which
would cause name lookup to drop out of rcu-walk mode.

For in-tree filesystems, this is just a mechanical change.
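
To illustrate the new calling convention, here is a minimal userspace
sketch of a case-insensitive compare callback. The struct qstr stand-in
and the name example_ci_compare are hypothetical; they only mirror the
argument order used in this patch (parent, dentry, inode, len, str, name).
The callback returns 0 on a match and 1 otherwise, and only reads its
arguments:

```c
#include <ctype.h>
#include <stddef.h>

/* Userspace stand-ins (assumptions); the kernel structs have more fields. */
struct dentry;
struct inode;
struct qstr {
	unsigned int len;
	const char *name;
};

/*
 * Hypothetical callback following the new d_compare argument order.
 * 'len' and 'str' describe the name of the existing dentry; 'name' is
 * the name being looked up. Returns 0 on match, 1 on mismatch.
 * It takes no locks and stores nothing, in line with the rcu-walk rules.
 */
int example_ci_compare(const struct dentry *parent,
		const struct dentry *dentry, const struct inode *inode,
		unsigned int len, const char *str, const struct qstr *name)
{
	unsigned int i;

	/* Unused in this sketch; a real filesystem might read ->d_sb. */
	(void)parent;
	(void)dentry;
	(void)inode;

	if (len != name->len)
		return 1;
	for (i = 0; i < len; i++) {
		if (tolower((unsigned char)str[i]) !=
		    tolower((unsigned char)name->name[i]))
			return 1;
	}
	return 0;
}
```

A caller would pass the length and name of the candidate dentry as 'len'
and 'str', and the name being looked up as 'name', as the fs/dcache.c
hunk in this patch does.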

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 Documentation/filesystems/Locking |    4 +-
 Documentation/filesystems/porting |    7 +++
 Documentation/filesystems/vfs.txt |   25 +++++++++-
 fs/adfs/dir.c                     |    8 ++-
 fs/affs/namei.c                   |   44 +++++++++++--------
 fs/cifs/dir.c                     |   11 +++--
 fs/dcache.c                       |    4 +-
 fs/fat/namei_msdos.c              |   15 ++++---
 fs/fat/namei_vfat.c               |   39 +++++++++++-----
 fs/hfs/hfs_fs.h                   |    4 +-
 fs/hfs/string.c                   |   14 +++---
 fs/hfsplus/hfsplus_fs.h           |    4 +-
 fs/hfsplus/unicode.c              |   14 +++---
 fs/hpfs/dentry.c                  |   21 +++++---
 fs/isofs/inode.c                  |   88 ++++++++++++++++++-------------------
 fs/isofs/namei.c                  |    3 +-
 fs/jfs/namei.c                    |   10 +++--
 fs/ncpfs/dir.c                    |   29 ++++++++----
 fs/ncpfs/ncplib_kernel.h          |    8 ++--
 fs/proc/proc_sysctl.c             |   12 +++---
 include/linux/dcache.h            |   12 ++---
 include/linux/ncp_fs.h            |    4 +-
 22 files changed, 228 insertions(+), 152 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index a91f308..fc5e1b7 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -11,7 +11,9 @@ be able to use diff(1).
 prototypes:
 	int (*d_revalidate)(struct dentry *, int);
 	int (*d_hash) (struct dentry *, struct qstr *);
-	int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
+	int (*d_compare)(const struct dentry *,
+			const struct dentry *, const struct inode *,
+			unsigned int, const char *, const struct qstr *);
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index 6d9d6c3..2999495 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -327,3 +327,10 @@ unreferenced dentries, and is now only called when the dentry refcount goes to
 0. Even on 0 refcount transition, it must be able to tolerate being called 0,
 1, or more times (eg. constant, idempotent).
 
+---
+[mandatory]
+
+	.d_compare() calling convention and locking rules are significantly
+changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
+look at examples of other filesystems) for guidance.
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 83c42d0..0f66e6a 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -842,7 +842,9 @@ defined:
 struct dentry_operations {
 	int (*d_revalidate)(struct dentry *, struct nameidata *);
 	int (*d_hash)(struct dentry *, struct qstr *);
-	int (*d_compare)(struct dentry *, struct qstr *, struct qstr *);
+	int (*d_compare)(const struct dentry *,
+			const struct dentry *, const struct inode *,
+			unsigned int, const char *, const struct qstr *);
 	int (*d_delete)(const struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
@@ -854,9 +856,26 @@ struct dentry_operations {
 	dcache. Most filesystems leave this as NULL, because all their
 	dentries in the dcache are valid
 
-  d_hash: called when the VFS adds a dentry to the hash table
+  d_hash: called when the VFS adds a dentry to the hash table. The first
+	dentry passed to d_hash is the parent directory that the name is
+ 	to be hashed into.
 
-  d_compare: called when a dentry should be compared with another
+  d_compare: called to compare a dentry name with a given name. The first
+	dentry is the parent of the dentry to be compared, the second is
+	the dentry itself. inode, len, and name string are properties of
+	the dentry to be compared. qstr is the name to compare it with.
+
+	Must be constant and idempotent, and should not take locks if
+	possible, and should not store into the dentry or inodes.
+	Should not dereference pointers outside the dentry or inodes without
+	lots of care (eg.  d_parent, d_inode shouldn't be used).
+
+	However, our vfsmount is pinned, and RCU held, so the dentries and
+	inodes won't disappear, neither will our sb or filesystem module.
+	->i_sb and ->d_sb may be used.
+
+	It is a tricky calling convention because it needs to be called under
+	"rcu-walk", ie. without any locks or references on things.
 
   d_delete: called when the last reference to a dentry is dropped and the
 	dcache is deciding whether or not to cache it. Return 1 to delete
diff --git a/fs/adfs/dir.c b/fs/adfs/dir.c
index f4287e4..4ed74ef 100644
--- a/fs/adfs/dir.c
+++ b/fs/adfs/dir.c
@@ -237,17 +237,19 @@ adfs_hash(struct dentry *parent, struct qstr *qstr)
  * requirements of the underlying filesystem.
  */
 static int
-adfs_compare(struct dentry *parent, struct qstr *entry, struct qstr *name)
+adfs_compare(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
 	int i;
 
-	if (entry->len != name->len)
+	if (len != name->len)
 		return 1;
 
 	for (i = 0; i < name->len; i++) {
 		char a, b;
 
-		a = entry->name[i];
+		a = str[i];
 		b = name->name[i];
 
 		if (a >= 'A' && a <= 'Z')
diff --git a/fs/affs/namei.c b/fs/affs/namei.c
index 914d1c0..a86e877 100644
--- a/fs/affs/namei.c
+++ b/fs/affs/namei.c
@@ -14,10 +14,14 @@ typedef int (*toupper_t)(int);
 
 static int	 affs_toupper(int ch);
 static int	 affs_hash_dentry(struct dentry *, struct qstr *);
-static int       affs_compare_dentry(struct dentry *, struct qstr *, struct qstr *);
+static int       affs_compare_dentry(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name);
 static int	 affs_intl_toupper(int ch);
 static int	 affs_intl_hash_dentry(struct dentry *, struct qstr *);
-static int       affs_intl_compare_dentry(struct dentry *, struct qstr *, struct qstr *);
+static int       affs_intl_compare_dentry(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name);
 
 const struct dentry_operations affs_dentry_operations = {
 	.d_hash		= affs_hash_dentry,
@@ -88,29 +92,29 @@ affs_intl_hash_dentry(struct dentry *dentry, struct qstr *qstr)
 	return __affs_hash_dentry(dentry, qstr, affs_intl_toupper);
 }
 
-static inline int
-__affs_compare_dentry(struct dentry *dentry, struct qstr *a, struct qstr *b, toupper_t toupper)
+static inline int __affs_compare_dentry(unsigned int len,
+		const char *str, const struct qstr *name, toupper_t toupper)
 {
-	const u8 *aname = a->name;
-	const u8 *bname = b->name;
-	int len;
+	const u8 *aname = str;
+	const u8 *bname = name->name;
 
-	/* 'a' is the qstr of an already existing dentry, so the name
-	 * must be valid. 'b' must be validated first.
+	/*
+	 * 'str' is the name of an already existing dentry, so the name
+	 * must be valid. 'name' must be validated first.
 	 */
 
-	if (affs_check_name(b->name,b->len))
+	if (affs_check_name(name->name, name->len))
 		return 1;
 
-	/* If the names are longer than the allowed 30 chars,
+	/*
+	 * If the names are longer than the allowed 30 chars,
 	 * the excess is ignored, so their length may differ.
 	 */
-	len = a->len;
 	if (len >= 30) {
-		if (b->len < 30)
+		if (name->len < 30)
 			return 1;
 		len = 30;
-	} else if (len != b->len)
+	} else if (len != name->len)
 		return 1;
 
 	for (; len > 0; len--)
@@ -121,14 +125,18 @@ __affs_compare_dentry(struct dentry *dentry, struct qstr *a, struct qstr *b, tou
 }
 
 static int
-affs_compare_dentry(struct dentry *dentry, struct qstr *a, struct qstr *b)
+affs_compare_dentry(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
-	return __affs_compare_dentry(dentry, a, b, affs_toupper);
+	return __affs_compare_dentry(len, str, name, affs_toupper);
 }
 static int
-affs_intl_compare_dentry(struct dentry *dentry, struct qstr *a, struct qstr *b)
+affs_intl_compare_dentry(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
-	return __affs_compare_dentry(dentry, a, b, affs_intl_toupper);
+	return __affs_compare_dentry(len, str, name, affs_intl_toupper);
 }
 
 /*
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 521d841..f3351f1 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -715,13 +715,14 @@ static int cifs_ci_hash(struct dentry *dentry, struct qstr *q)
 	return 0;
 }
 
-static int cifs_ci_compare(struct dentry *dentry, struct qstr *a,
-			   struct qstr *b)
+static int cifs_ci_compare(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
-	struct nls_table *codepage = CIFS_SB(dentry->d_inode->i_sb)->local_nls;
+	struct nls_table *codepage = CIFS_SB(inode->i_sb)->local_nls;
 
-	if ((a->len == b->len) &&
-	    (nls_strnicmp(codepage, a->name, b->name, a->len) == 0))
+	if ((name->len == len) &&
+	    (nls_strnicmp(codepage, name->name, str, len) == 0))
 		return 0;
 	return 1;
 }
diff --git a/fs/dcache.c b/fs/dcache.c
index 14e0564..a7362de 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1433,7 +1433,9 @@ struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 		 */
 		qstr = &dentry->d_name;
 		if (parent->d_op && parent->d_op->d_compare) {
-			if (parent->d_op->d_compare(parent, qstr, name))
+			if (parent->d_op->d_compare(parent,
+						dentry, dentry->d_inode,
+						qstr->len, qstr->name, name))
 				goto next;
 		} else {
 			if (qstr->len != len)
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index 3345aab..37e915c 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -164,16 +164,19 @@ static int msdos_hash(struct dentry *dentry, struct qstr *qstr)
  * Compare two msdos names. If either of the names are invalid,
  * we fall back to doing the standard name comparison.
  */
-static int msdos_cmp(struct dentry *dentry, struct qstr *a, struct qstr *b)
+static int msdos_cmp(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str,
+		const struct qstr *name)
 {
-	struct fat_mount_options *options = &MSDOS_SB(dentry->d_sb)->options;
+	struct fat_mount_options *options = &MSDOS_SB(parent->d_sb)->options;
 	unsigned char a_msdos_name[MSDOS_NAME], b_msdos_name[MSDOS_NAME];
 	int error;
 
-	error = msdos_format_name(a->name, a->len, a_msdos_name, options);
+	error = msdos_format_name(name->name, name->len, a_msdos_name, options);
 	if (error)
 		goto old_compare;
-	error = msdos_format_name(b->name, b->len, b_msdos_name, options);
+	error = msdos_format_name(str, len, b_msdos_name, options);
 	if (error)
 		goto old_compare;
 	error = memcmp(a_msdos_name, b_msdos_name, MSDOS_NAME);
@@ -182,8 +185,8 @@ out:
 
 old_compare:
 	error = 1;
-	if (a->len == b->len)
-		error = memcmp(a->name, b->name, a->len);
+	if (name->len == len)
+		error = memcmp(name->name, str, len);
 	goto out;
 }
 
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index b936703..4540e76 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -85,15 +85,18 @@ static int vfat_revalidate_ci(struct dentry *dentry, struct nameidata *nd)
 }
 
 /* returns the length of a struct qstr, ignoring trailing dots */
-static unsigned int vfat_striptail_len(struct qstr *qstr)
+static unsigned int __vfat_striptail_len(unsigned int len, const char *name)
 {
-	unsigned int len = qstr->len;
-
-	while (len && qstr->name[len - 1] == '.')
+	while (len && name[len - 1] == '.')
 		len--;
 	return len;
 }
 
+static unsigned int vfat_striptail_len(const struct qstr *qstr)
+{
+	return __vfat_striptail_len(qstr->len, qstr->name);
+}
+
 /*
  * Compute the hash for the vfat name corresponding to the dentry.
  * Note: if the name is invalid, we leave the hash code unchanged so
@@ -133,16 +136,18 @@ static int vfat_hashi(struct dentry *dentry, struct qstr *qstr)
 /*
  * Case insensitive compare of two vfat names.
  */
-static int vfat_cmpi(struct dentry *dentry, struct qstr *a, struct qstr *b)
+static int vfat_cmpi(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
-	struct nls_table *t = MSDOS_SB(dentry->d_inode->i_sb)->nls_io;
+	struct nls_table *t = MSDOS_SB(inode->i_sb)->nls_io;
 	unsigned int alen, blen;
 
 	/* A filename cannot end in '.' or we treat it like it has none */
-	alen = vfat_striptail_len(a);
-	blen = vfat_striptail_len(b);
+	alen = vfat_striptail_len(name);
+	blen = __vfat_striptail_len(len, str);
 	if (alen == blen) {
-		if (nls_strnicmp(t, a->name, b->name, alen) == 0)
+		if (nls_strnicmp(t, name->name, str, alen) == 0)
 			return 0;
 	}
 	return 1;
@@ -151,15 +156,17 @@ static int vfat_cmpi(struct dentry *dentry, struct qstr *a, struct qstr *b)
 /*
  * Case sensitive compare of two vfat names.
  */
-static int vfat_cmp(struct dentry *dentry, struct qstr *a, struct qstr *b)
+static int vfat_cmp(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
 	unsigned int alen, blen;
 
 	/* A filename cannot end in '.' or we treat it like it has none */
-	alen = vfat_striptail_len(a);
-	blen = vfat_striptail_len(b);
+	alen = vfat_striptail_len(name);
+	blen = __vfat_striptail_len(len, str);
 	if (alen == blen) {
-		if (strncmp(a->name, b->name, alen) == 0)
+		if (strncmp(name->name, str, alen) == 0)
 			return 0;
 	}
 	return 1;
@@ -780,6 +787,12 @@ static int vfat_create(struct inode *dir, struct dentry *dentry, int mode,
 	struct timespec ts;
 	int err;
 
+	/*
+	 * To preserve case, don't let an existing negative dentry's case
+	 * take precedence.
+	 */
+	memcpy((void *)dentry->d_name.name, nd->last.name, dentry->d_name.len);
+
 	lock_super(sb);
 
 	ts = CURRENT_TIME_SEC;
diff --git a/fs/hfs/hfs_fs.h b/fs/hfs/hfs_fs.h
index c8cffb8..3a815fd 100644
--- a/fs/hfs/hfs_fs.h
+++ b/fs/hfs/hfs_fs.h
@@ -216,7 +216,9 @@ extern const struct dentry_operations hfs_dentry_operations;
 extern int hfs_hash_dentry(struct dentry *, struct qstr *);
 extern int hfs_strcmp(const unsigned char *, unsigned int,
 		      const unsigned char *, unsigned int);
-extern int hfs_compare_dentry(struct dentry *, struct qstr *, struct qstr *);
+extern int hfs_compare_dentry(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name);
 
 /* trans.c */
 extern void hfs_asc2mac(struct super_block *, struct hfs_name *, struct qstr *);
diff --git a/fs/hfs/string.c b/fs/hfs/string.c
index 927a5af..712aa53 100644
--- a/fs/hfs/string.c
+++ b/fs/hfs/string.c
@@ -92,21 +92,21 @@ int hfs_strcmp(const unsigned char *s1, unsigned int len1,
  * Test for equality of two strings in the HFS filename character ordering.
  * return 1 on failure and 0 on success
  */
-int hfs_compare_dentry(struct dentry *dentry, struct qstr *s1, struct qstr *s2)
+int hfs_compare_dentry(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
 	const unsigned char *n1, *n2;
-	int len;
 
-	len = s1->len;
 	if (len >= HFS_NAMELEN) {
-		if (s2->len < HFS_NAMELEN)
+		if (name->len < HFS_NAMELEN)
 			return 1;
 		len = HFS_NAMELEN;
-	} else if (len != s2->len)
+	} else if (len != name->len)
 		return 1;
 
-	n1 = s1->name;
-	n2 = s2->name;
+	n1 = str;
+	n2 = name->name;
 	while (len--) {
 		if (caseorder[*n1++] != caseorder[*n2++])
 			return 1;
diff --git a/fs/hfsplus/hfsplus_fs.h b/fs/hfsplus/hfsplus_fs.h
index cb3653e..7a98f24 100644
--- a/fs/hfsplus/hfsplus_fs.h
+++ b/fs/hfsplus/hfsplus_fs.h
@@ -380,7 +380,9 @@ int hfsplus_strcmp(const struct hfsplus_unistr *, const struct hfsplus_unistr *)
 int hfsplus_uni2asc(struct super_block *, const struct hfsplus_unistr *, char *, int *);
 int hfsplus_asc2uni(struct super_block *, struct hfsplus_unistr *, const char *, int);
 int hfsplus_hash_dentry(struct dentry *dentry, struct qstr *str);
-int hfsplus_compare_dentry(struct dentry *dentry, struct qstr *s1, struct qstr *s2);
+int hfsplus_compare_dentry(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name);
 
 /* wrapper.c */
 int hfsplus_read_wrapper(struct super_block *);
diff --git a/fs/hfsplus/unicode.c b/fs/hfsplus/unicode.c
index b66d67d..6e3be8f 100644
--- a/fs/hfsplus/unicode.c
+++ b/fs/hfsplus/unicode.c
@@ -363,9 +363,11 @@ int hfsplus_hash_dentry(struct dentry *dentry, struct qstr *str)
  * Composed unicode characters are decomposed and case-folding is performed
  * if the appropriate bits are (un)set on the superblock.
  */
-int hfsplus_compare_dentry(struct dentry *dentry, struct qstr *s1, struct qstr *s2)
+int hfsplus_compare_dentry(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
-	struct super_block *sb = dentry->d_sb;
+	struct super_block *sb = parent->d_sb;
 	int casefold, decompose, size;
 	int dsize1, dsize2, len1, len2;
 	const u16 *dstr1, *dstr2;
@@ -375,10 +377,10 @@ int hfsplus_compare_dentry(struct dentry *dentry, struct qstr *s1, struct qstr *
 
 	casefold = test_bit(HFSPLUS_SB_CASEFOLD, &HFSPLUS_SB(sb)->flags);
 	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
-	astr1 = s1->name;
-	len1 = s1->len;
-	astr2 = s2->name;
-	len2 = s2->len;
+	astr1 = str;
+	len1 = len;
+	astr2 = name->name;
+	len2 = name->len;
 	dsize1 = dsize2 = 0;
 	dstr1 = dstr2 = NULL;
 
diff --git a/fs/hpfs/dentry.c b/fs/hpfs/dentry.c
index 67d9d36..3175ccc 100644
--- a/fs/hpfs/dentry.c
+++ b/fs/hpfs/dentry.c
@@ -34,19 +34,24 @@ static int hpfs_hash_dentry(struct dentry *dentry, struct qstr *qstr)
 	return 0;
 }
 
-static int hpfs_compare_dentry(struct dentry *dentry, struct qstr *a, struct qstr *b)
+static int hpfs_compare_dentry(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
-	unsigned al=a->len;
-	unsigned bl=b->len;
-	hpfs_adjust_length(a->name, &al);
+	unsigned al = len;
+	unsigned bl = name->len;
+
+	hpfs_adjust_length(str, &al);
 	/*hpfs_adjust_length(b->name, &bl);*/
-	/* 'a' is the qstr of an already existing dentry, so the name
-	 * must be valid. 'b' must be validated first.
+
+	/*
+	 * 'str' is the name of an already existing dentry, so the name
+	 * must be valid. 'name' must be validated first.
 	 */
 
-	if (hpfs_chk_name(b->name, &bl))
+	if (hpfs_chk_name(name->name, &bl))
 		return 1;
-	if (hpfs_compare_names(dentry->d_sb, a->name, al, b->name, bl, 0))
+	if (hpfs_compare_names(parent->d_sb, str, al, name->name, bl, 0))
 		return 1;
 	return 0;
 }
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index bfdeb82..9a7a6df 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -28,14 +28,22 @@
 
 static int isofs_hashi(struct dentry *parent, struct qstr *qstr);
 static int isofs_hash(struct dentry *parent, struct qstr *qstr);
-static int isofs_dentry_cmpi(struct dentry *dentry, struct qstr *a, struct qstr *b);
-static int isofs_dentry_cmp(struct dentry *dentry, struct qstr *a, struct qstr *b);
+static int isofs_dentry_cmpi(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name);
+static int isofs_dentry_cmp(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name);
 
 #ifdef CONFIG_JOLIET
 static int isofs_hashi_ms(struct dentry *parent, struct qstr *qstr);
 static int isofs_hash_ms(struct dentry *parent, struct qstr *qstr);
-static int isofs_dentry_cmpi_ms(struct dentry *dentry, struct qstr *a, struct qstr *b);
-static int isofs_dentry_cmp_ms(struct dentry *dentry, struct qstr *a, struct qstr *b);
+static int isofs_dentry_cmpi_ms(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name);
+static int isofs_dentry_cmp_ms(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name);
 #endif
 
 static void isofs_put_super(struct super_block *sb)
@@ -206,49 +214,31 @@ isofs_hashi_common(struct dentry *dentry, struct qstr *qstr, int ms)
 }
 
 /*
- * Case insensitive compare of two isofs names.
+ * Compare two isofs names.
  */
-static int isofs_dentry_cmpi_common(struct dentry *dentry, struct qstr *a,
-				struct qstr *b, int ms)
+static int isofs_dentry_cmp_common(
+		unsigned int len, const char *str,
+		const struct qstr *name, int ms, int ci)
 {
 	int alen, blen;
 
 	/* A filename cannot end in '.' or we treat it like it has none */
-	alen = a->len;
-	blen = b->len;
+	alen = name->len;
+	blen = len;
 	if (ms) {
-		while (alen && a->name[alen-1] == '.')
+		while (alen && name->name[alen-1] == '.')
 			alen--;
-		while (blen && b->name[blen-1] == '.')
+		while (blen && str[blen-1] == '.')
 			blen--;
 	}
 	if (alen == blen) {
-		if (strnicmp(a->name, b->name, alen) == 0)
-			return 0;
-	}
-	return 1;
-}
-
-/*
- * Case sensitive compare of two isofs names.
- */
-static int isofs_dentry_cmp_common(struct dentry *dentry, struct qstr *a,
-					struct qstr *b, int ms)
-{
-	int alen, blen;
-
-	/* A filename cannot end in '.' or we treat it like it has none */
-	alen = a->len;
-	blen = b->len;
-	if (ms) {
-		while (alen && a->name[alen-1] == '.')
-			alen--;
-		while (blen && b->name[blen-1] == '.')
-			blen--;
-	}
-	if (alen == blen) {
-		if (strncmp(a->name, b->name, alen) == 0)
-			return 0;
+		if (ci) {
+			if (strnicmp(name->name, str, alen) == 0)
+				return 0;
+		} else {
+			if (strncmp(name->name, str, alen) == 0)
+				return 0;
+		}
 	}
 	return 1;
 }
@@ -266,15 +256,19 @@ isofs_hashi(struct dentry *dentry, struct qstr *qstr)
 }
 
 static int
-isofs_dentry_cmp(struct dentry *dentry,struct qstr *a,struct qstr *b)
+isofs_dentry_cmp(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
-	return isofs_dentry_cmp_common(dentry, a, b, 0);
+	return isofs_dentry_cmp_common(len, str, name, 0, 0);
 }
 
 static int
-isofs_dentry_cmpi(struct dentry *dentry,struct qstr *a,struct qstr *b)
+isofs_dentry_cmpi(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
-	return isofs_dentry_cmpi_common(dentry, a, b, 0);
+	return isofs_dentry_cmp_common(len, str, name, 0, 1);
 }
 
 #ifdef CONFIG_JOLIET
@@ -291,15 +285,19 @@ isofs_hashi_ms(struct dentry *dentry, struct qstr *qstr)
 }
 
 static int
-isofs_dentry_cmp_ms(struct dentry *dentry,struct qstr *a,struct qstr *b)
+isofs_dentry_cmp_ms(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
-	return isofs_dentry_cmp_common(dentry, a, b, 1);
+	return isofs_dentry_cmp_common(len, str, name, 1, 0);
 }
 
 static int
-isofs_dentry_cmpi_ms(struct dentry *dentry,struct qstr *a,struct qstr *b)
+isofs_dentry_cmpi_ms(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
-	return isofs_dentry_cmpi_common(dentry, a, b, 1);
+	return isofs_dentry_cmp_common(len, str, name, 1, 1);
 }
 #endif
 
diff --git a/fs/isofs/namei.c b/fs/isofs/namei.c
index 0d23abf..e7768ac 100644
--- a/fs/isofs/namei.c
+++ b/fs/isofs/namei.c
@@ -37,7 +37,8 @@ isofs_cmp(struct dentry *dentry, const char *compare, int dlen)
 
 	qstr.name = compare;
 	qstr.len = dlen;
-	return dentry->d_op->d_compare(dentry, &dentry->d_name, &qstr);
+	return dentry->d_op->d_compare(NULL, NULL, NULL,
+			dentry->d_name.len, dentry->d_name.name, &qstr);
 }
 
 /*
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 2da1546..5571d0b 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -1587,14 +1587,16 @@ static int jfs_ci_hash(struct dentry *dir, struct qstr *this)
 	return 0;
 }
 
-static int jfs_ci_compare(struct dentry *dir, struct qstr *a, struct qstr *b)
+static int jfs_ci_compare(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
 	int i, result = 1;
 
-	if (a->len != b->len)
+	if (len != name->len)
 		goto out;
-	for (i=0; i < a->len; i++) {
-		if (tolower(a->name[i]) != tolower(b->name[i]))
+	for (i=0; i < len; i++) {
+		if (tolower(str[i]) != tolower(name->name[i]))
 			goto out;
 	}
 	result = 0;
diff --git a/fs/ncpfs/dir.c b/fs/ncpfs/dir.c
index d6e6453..c99a5dd 100644
--- a/fs/ncpfs/dir.c
+++ b/fs/ncpfs/dir.c
@@ -75,7 +75,9 @@ const struct inode_operations ncp_dir_inode_operations =
  */
 static int ncp_lookup_validate(struct dentry *, struct nameidata *);
 static int ncp_hash_dentry(struct dentry *, struct qstr *);
-static int ncp_compare_dentry (struct dentry *, struct qstr *, struct qstr *);
+static int ncp_compare_dentry(const struct dentry *,
+		const struct dentry *, const struct inode *,
+		unsigned int, const char *, const struct qstr *);
 static int ncp_delete_dentry(const struct dentry *);
 
 static const struct dentry_operations ncp_dentry_operations =
@@ -113,10 +115,10 @@ static inline int ncp_preserve_entry_case(struct inode *i, __u32 nscreator)
 
 #define ncp_preserve_case(i)	(ncp_namespace(i) != NW_NS_DOS)
 
-static inline int ncp_case_sensitive(struct dentry *dentry)
+static inline int ncp_case_sensitive(const struct inode *inode)
 {
 #ifdef CONFIG_NCPFS_NFS_NS
-	return ncp_namespace(dentry->d_inode) == NW_NS_NFS;
+	return ncp_namespace(inode) == NW_NS_NFS;
 #else
 	return 0;
 #endif /* CONFIG_NCPFS_NFS_NS */
@@ -129,12 +131,15 @@ static inline int ncp_case_sensitive(struct dentry *dentry)
 static int 
 ncp_hash_dentry(struct dentry *dentry, struct qstr *this)
 {
-	if (!ncp_case_sensitive(dentry)) {
+	struct inode *inode = dentry->d_inode;
+
+	if (!ncp_case_sensitive(inode)) {
+		struct super_block *sb = dentry->d_sb;
 		struct nls_table *t;
 		unsigned long hash;
 		int i;
 
-		t = NCP_IO_TABLE(dentry);
+		t = NCP_IO_TABLE(sb);
 		hash = init_name_hash();
 		for (i=0; i<this->len ; i++)
 			hash = partial_name_hash(ncp_tolower(t, this->name[i]),
@@ -145,15 +150,19 @@ ncp_hash_dentry(struct dentry *dentry, struct qstr *this)
 }
 
 static int
-ncp_compare_dentry(struct dentry *dentry, struct qstr *a, struct qstr *b)
+ncp_compare_dentry(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
-	if (a->len != b->len)
+	struct super_block *sb = dentry->d_sb;
+
+	if (len != name->len)
 		return 1;
 
-	if (ncp_case_sensitive(dentry))
-		return strncmp(a->name, b->name, a->len);
+	if (ncp_case_sensitive(inode))
+		return strncmp(str, name->name, len);
 
-	return ncp_strnicmp(NCP_IO_TABLE(dentry), a->name, b->name, a->len);
+	return ncp_strnicmp(NCP_IO_TABLE(sb), str, name->name, len);
 }
 
 /*
diff --git a/fs/ncpfs/ncplib_kernel.h b/fs/ncpfs/ncplib_kernel.h
index 3c57eca..244d1b7 100644
--- a/fs/ncpfs/ncplib_kernel.h
+++ b/fs/ncpfs/ncplib_kernel.h
@@ -135,7 +135,7 @@ int ncp__vol2io(struct ncp_server *, unsigned char *, unsigned int *,
 				const unsigned char *, unsigned int, int);
 
 #define NCP_ESC			':'
-#define NCP_IO_TABLE(dentry)	(NCP_SERVER((dentry)->d_inode)->nls_io)
+#define NCP_IO_TABLE(sb)	(NCP_SBP(sb)->nls_io)
 #define ncp_tolower(t, c)	nls_tolower(t, c)
 #define ncp_toupper(t, c)	nls_toupper(t, c)
 #define ncp_strnicmp(t, s1, s2, len) \
@@ -150,15 +150,15 @@ int ncp__io2vol(unsigned char *, unsigned int *,
 int ncp__vol2io(unsigned char *, unsigned int *,
 				const unsigned char *, unsigned int, int);
 
-#define NCP_IO_TABLE(dentry)	NULL
+#define NCP_IO_TABLE(sb)	NULL
 #define ncp_tolower(t, c)	tolower(c)
 #define ncp_toupper(t, c)	toupper(c)
 #define ncp_io2vol(S,m,i,n,k,U)	ncp__io2vol(m,i,n,k,U)
 #define ncp_vol2io(S,m,i,n,k,U)	ncp__vol2io(m,i,n,k,U)
 
 
-static inline int ncp_strnicmp(struct nls_table *t, const unsigned char *s1,
-		const unsigned char *s2, int len)
+static inline int ncp_strnicmp(const struct nls_table *t,
+		const unsigned char *s1, const unsigned char *s2, int len)
 {
 	while (len--) {
 		if (tolower(*s1++) != tolower(*s2++))
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index a256d77..34f1fc5 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -397,15 +397,15 @@ static int proc_sys_delete(const struct dentry *dentry)
 	return !!PROC_I(dentry->d_inode)->sysctl->unregistering;
 }
 
-static int proc_sys_compare(struct dentry *dir, struct qstr *qstr,
-			    struct qstr *name)
+static int proc_sys_compare(const struct dentry *parent,
+		const struct dentry *dentry, const struct inode *inode,
+		unsigned int len, const char *str, const struct qstr *name)
 {
-	struct dentry *dentry = container_of(qstr, struct dentry, d_name);
-	if (qstr->len != name->len)
+	if (name->len != len)
 		return 1;
-	if (memcmp(qstr->name, name->name, name->len))
+	if (memcmp(name->name, str, len))
 		return 1;
-	return !sysctl_is_seen(PROC_I(dentry->d_inode)->sysctl);
+	return !sysctl_is_seen(PROC_I(inode)->sysctl);
 }
 
 static const struct dentry_operations proc_sys_dentry_operations = {
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index cbfc956..ba1b9bd 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -134,7 +134,9 @@ enum dentry_d_lock_class
 struct dentry_operations {
 	int (*d_revalidate)(struct dentry *, struct nameidata *);
 	int (*d_hash)(struct dentry *, struct qstr *);
-	int (*d_compare)(struct dentry *, struct qstr *, struct qstr *);
+	int (*d_compare)(const struct dentry *,
+			const struct dentry *, const struct inode *,
+			unsigned int, const char *, const struct qstr *);
 	int (*d_delete)(const struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
@@ -145,12 +147,8 @@ struct dentry_operations {
  * Locking rules for dentry_operations callbacks are to be found in
  * Documentation/filesystems/Locking. Keep it updated!
  *
- * the dentry parameter passed to d_hash and d_compare is the parent
- * directory of the entries to be compared. It is used in case these
- * functions need any directory specific information for determining
- * equivalency classes.  Using the dentry itself might not work, as it
- * might be a negative dentry which has no information associated with
- * it.
+ * Further descriptions are found in Documentation/filesystems/vfs.txt.
+ * Keep it updated too!
  */
 
 /* d_flags entries */
diff --git a/include/linux/ncp_fs.h b/include/linux/ncp_fs.h
index ef66306..1c27f20 100644
--- a/include/linux/ncp_fs.h
+++ b/include/linux/ncp_fs.h
@@ -184,13 +184,13 @@ struct ncp_entry_info {
 	__u8			file_handle[6];
 };
 
-static inline struct ncp_server *NCP_SBP(struct super_block *sb)
+static inline struct ncp_server *NCP_SBP(const struct super_block *sb)
 {
 	return sb->s_fs_info;
 }
 
 #define NCP_SERVER(inode)	NCP_SBP((inode)->i_sb)
-static inline struct ncp_inode_info *NCP_FINFO(struct inode *inode)
+static inline struct ncp_inode_info *NCP_FINFO(const struct inode *inode)
 {
 	return container_of(inode, struct ncp_inode_info, vfs_inode);
 }
-- 
1.7.1


* [PATCH 09/46] fs: change d_hash for rcu-walk
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (6 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 08/46] fs: change d_compare for rcu-walk Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 10/46] hostfs: simplify locking Nick Piggin
                   ` (41 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Change d_hash so it may be called from lock-free RCU lookups. See the similar
patch for d_compare (PATCH 08/46) for details.

For in-tree filesystems, this is just a mechanical change.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
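For readers porting an out-of-tree filesystem, the new calling convention can
be sketched in userspace roughly as follows. This is only an illustration, not
kernel code: the struct definitions are cut-down stand-ins, and demo_ci_hash()
and toy_hash_step() are hypothetical names that exist nowhere in the tree. The
point it tries to show is that the parent dentry and its inode now arrive as
separate const arguments, so the callback has no need to follow
parent->d_inode itself.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct inode { int case_sensitive; };
struct dentry { const char *sb_name; };
struct qstr { const char *name; unsigned int len; unsigned int hash; };

/* Toy hash step, assumed for this sketch only. */
static unsigned int toy_hash_step(unsigned int c, unsigned int prev)
{
	return (prev + (c << 4) + (c >> 4)) * 11;
}

/*
 * New-style d_hash shape: parent dentry and inode are both const and
 * passed separately; only the qstr is written to.
 */
static int demo_ci_hash(const struct dentry *parent, const struct inode *inode,
		struct qstr *q)
{
	unsigned int hash = 0;
	unsigned int i;

	(void)parent;	/* unused here; a real fs might read parent->d_sb */
	assert(q && q->name);
	for (i = 0; i < q->len; i++) {
		unsigned char c = (unsigned char)q->name[i];

		if (!inode->case_sensitive)
			c = (unsigned char)tolower(c);
		hash = toy_hash_step(c, hash);
	}
	q->hash = hash;
	return 0;
}
```

With a case-insensitive inode, hashing "README" and "readme" through
demo_ci_hash() leaves the same value in both qstr.hash fields, which is the
property a case-folding d_hash has to provide.
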
 Documentation/filesystems/Locking |    5 +++--
 Documentation/filesystems/porting |    7 +++++++
 Documentation/filesystems/vfs.txt |    8 ++++++--
 fs/adfs/dir.c                     |    3 ++-
 fs/affs/namei.c                   |   20 ++++++++++++--------
 fs/cifs/dir.c                     |    5 +++--
 fs/cifs/readdir.c                 |    2 +-
 fs/dcache.c                       |    2 +-
 fs/ecryptfs/inode.c               |    4 ++--
 fs/fat/namei_msdos.c              |    3 ++-
 fs/fat/namei_vfat.c               |    8 +++++---
 fs/gfs2/dentry.c                  |    3 ++-
 fs/hfs/hfs_fs.h                   |    3 ++-
 fs/hfs/string.c                   |    3 ++-
 fs/hfsplus/hfsplus_fs.h           |    3 ++-
 fs/hfsplus/unicode.c              |    3 ++-
 fs/hpfs/dentry.c                  |    3 ++-
 fs/isofs/inode.c                  |   28 ++++++++++++++++++----------
 fs/jfs/namei.c                    |    3 ++-
 fs/namei.c                        |    5 +++--
 fs/ncpfs/dir.c                    |   10 +++++-----
 fs/sysv/namei.c                   |    3 ++-
 include/linux/dcache.h            |    3 ++-
 23 files changed, 88 insertions(+), 49 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index fc5e1b7..5bceb19 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -10,7 +10,8 @@ be able to use diff(1).
 --------------------------- dentry_operations --------------------------
 prototypes:
 	int (*d_revalidate)(struct dentry *, int);
-	int (*d_hash) (struct dentry *, struct qstr *);
+	int (*d_hash)(const struct dentry *, const struct inode *,
+			struct qstr *);
 	int (*d_compare)(const struct dentry *,
 			const struct dentry *, const struct inode *,
 			unsigned int, const char *, const struct qstr *);
@@ -23,7 +24,7 @@ locking rules:
 	none have BKL
 		dcache_lock	rename_lock	->d_lock	may block
 d_revalidate:	no		no		no		yes
-d_hash		no		no		no		yes
+d_hash		no		no		no		no
 d_compare:	no		yes		no		no 
 d_delete:	yes		no		yes		no
 d_release:	no		no		no		yes
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index 2999495..0c6fc97 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -334,3 +334,10 @@ unreferenced dentries, and is now only called when the dentry refcount goes to
 changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
 look at examples of other filesystems) for guidance.
 
+---
+[mandatory]
+
+	.d_hash() calling convention and locking rules are significantly
+changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
+look at examples of other filesystems) for guidance.
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 0f66e6a..47c29ca 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -841,7 +841,8 @@ defined:
 
 struct dentry_operations {
 	int (*d_revalidate)(struct dentry *, struct nameidata *);
-	int (*d_hash)(struct dentry *, struct qstr *);
+	int (*d_hash)(const struct dentry *, const struct inode *,
+			struct qstr *);
 	int (*d_compare)(const struct dentry *,
 			const struct dentry *, const struct inode *,
 			unsigned int, const char *, const struct qstr *);
@@ -858,7 +859,10 @@ struct dentry_operations {
 
   d_hash: called when the VFS adds a dentry to the hash table. The first
 	dentry passed to d_hash is the parent directory that the name is
- 	to be hashed into.
+ 	to be hashed into. The inode is the dentry's inode.
+
+	Same locking and synchronisation rules as d_compare regarding
+	what is safe to dereference etc.
 
   d_compare: called to compare a dentry name with a given name. The first
 	dentry is the parent of the dentry to be compared, the second is
diff --git a/fs/adfs/dir.c b/fs/adfs/dir.c
index 4ed74ef..a098bba 100644
--- a/fs/adfs/dir.c
+++ b/fs/adfs/dir.c
@@ -201,7 +201,8 @@ const struct file_operations adfs_dir_operations = {
 };
 
 static int
-adfs_hash(struct dentry *parent, struct qstr *qstr)
+adfs_hash(const struct dentry *parent, const struct inode *inode,
+		struct qstr *qstr)
 {
 	const unsigned int name_len = ADFS_SB(parent->d_sb)->s_namelen;
 	const unsigned char *name;
diff --git a/fs/affs/namei.c b/fs/affs/namei.c
index a86e877..91d5dcd 100644
--- a/fs/affs/namei.c
+++ b/fs/affs/namei.c
@@ -13,12 +13,14 @@
 typedef int (*toupper_t)(int);
 
 static int	 affs_toupper(int ch);
-static int	 affs_hash_dentry(struct dentry *, struct qstr *);
+static int	 affs_hash_dentry(const struct dentry *,
+		const struct inode *, struct qstr *);
 static int       affs_compare_dentry(const struct dentry *parent,
 		const struct dentry *dentry, const struct inode *inode,
 		unsigned int len, const char *str, const struct qstr *name);
 static int	 affs_intl_toupper(int ch);
-static int	 affs_intl_hash_dentry(struct dentry *, struct qstr *);
+static int	 affs_intl_hash_dentry(const struct dentry *,
+		const struct inode *, struct qstr *);
 static int       affs_intl_compare_dentry(const struct dentry *parent,
 		const struct dentry *dentry, const struct inode *inode,
 		unsigned int len, const char *str, const struct qstr *name);
@@ -62,13 +64,13 @@ affs_get_toupper(struct super_block *sb)
  * Note: the dentry argument is the parent dentry.
  */
 static inline int
-__affs_hash_dentry(struct dentry *dentry, struct qstr *qstr, toupper_t toupper)
+__affs_hash_dentry(struct qstr *qstr, toupper_t toupper)
 {
 	const u8 *name = qstr->name;
 	unsigned long hash;
 	int i;
 
-	i = affs_check_name(qstr->name,qstr->len);
+	i = affs_check_name(qstr->name, qstr->len);
 	if (i)
 		return i;
 
@@ -82,14 +84,16 @@ __affs_hash_dentry(struct dentry *dentry, struct qstr *qstr, toupper_t toupper)
 }
 
 static int
-affs_hash_dentry(struct dentry *dentry, struct qstr *qstr)
+affs_hash_dentry(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *qstr)
 {
-	return __affs_hash_dentry(dentry, qstr, affs_toupper);
+	return __affs_hash_dentry(qstr, affs_toupper);
 }
 static int
-affs_intl_hash_dentry(struct dentry *dentry, struct qstr *qstr)
+affs_intl_hash_dentry(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *qstr)
 {
-	return __affs_hash_dentry(dentry, qstr, affs_intl_toupper);
+	return __affs_hash_dentry(qstr, affs_intl_toupper);
 }
 
 static inline int __affs_compare_dentry(unsigned int len,
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index f3351f1..5227626 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -700,9 +700,10 @@ const struct dentry_operations cifs_dentry_ops = {
 /* d_delete:       cifs_d_delete,      */ /* not needed except for debugging */
 };
 
-static int cifs_ci_hash(struct dentry *dentry, struct qstr *q)
+static int cifs_ci_hash(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *q)
 {
-	struct nls_table *codepage = CIFS_SB(dentry->d_inode->i_sb)->local_nls;
+	struct nls_table *codepage = CIFS_SB(dentry->d_sb)->local_nls;
 	unsigned long hash;
 	int i;
 
diff --git a/fs/cifs/readdir.c b/fs/cifs/readdir.c
index ef7bb7b..ec5b2af 100644
--- a/fs/cifs/readdir.c
+++ b/fs/cifs/readdir.c
@@ -79,7 +79,7 @@ cifs_readdir_lookup(struct dentry *parent, struct qstr *name,
 	cFYI(1, "For %s", name->name);
 
 	if (parent->d_op && parent->d_op->d_hash)
-		parent->d_op->d_hash(parent, name);
+		parent->d_op->d_hash(parent, parent->d_inode, name);
 	else
 		name->hash = full_name_hash(name->name, name->len);
 
diff --git a/fs/dcache.c b/fs/dcache.c
index a7362de..9fd5180 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1474,7 +1474,7 @@ struct dentry *d_hash_and_lookup(struct dentry *dir, struct qstr *name)
 	 */
 	name->hash = full_name_hash(name->name, name->len);
 	if (dir->d_op && dir->d_op->d_hash) {
-		if (dir->d_op->d_hash(dir, name) < 0)
+		if (dir->d_op->d_hash(dir, dir->d_inode, name) < 0)
 			goto out;
 	}
 	dentry = d_lookup(dir, name);
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 9d1a22d..a1ed7a7 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -454,7 +454,7 @@ static struct dentry *ecryptfs_lookup(struct inode *ecryptfs_dir_inode,
 	lower_name.hash = ecryptfs_dentry->d_name.hash;
 	if (lower_dir_dentry->d_op && lower_dir_dentry->d_op->d_hash) {
 		rc = lower_dir_dentry->d_op->d_hash(lower_dir_dentry,
-						    &lower_name);
+				lower_dir_dentry->d_inode, &lower_name);
 		if (rc < 0)
 			goto out_d_drop;
 	}
@@ -489,7 +489,7 @@ static struct dentry *ecryptfs_lookup(struct inode *ecryptfs_dir_inode,
 	lower_name.hash = full_name_hash(lower_name.name, lower_name.len);
 	if (lower_dir_dentry->d_op && lower_dir_dentry->d_op->d_hash) {
 		rc = lower_dir_dentry->d_op->d_hash(lower_dir_dentry,
-						    &lower_name);
+				lower_dir_dentry->d_inode, &lower_name);
 		if (rc < 0)
 			goto out_d_drop;
 	}
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index 37e915c..c5f32db 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -148,7 +148,8 @@ static int msdos_find(struct inode *dir, const unsigned char *name, int len,
  * that the existing dentry can be used. The msdos fs routines will
  * return ENOENT or EINVAL as appropriate.
  */
-static int msdos_hash(struct dentry *dentry, struct qstr *qstr)
+static int msdos_hash(const struct dentry *dentry, const struct inode *inode,
+	       struct qstr *qstr)
 {
 	struct fat_mount_options *options = &MSDOS_SB(dentry->d_sb)->options;
 	unsigned char msdos_name[MSDOS_NAME];
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 4540e76..6e4d02d 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -103,7 +103,8 @@ static unsigned int vfat_striptail_len(const struct qstr *qstr)
  * that the existing dentry can be used. The vfat fs routines will
  * return ENOENT or EINVAL as appropriate.
  */
-static int vfat_hash(struct dentry *dentry, struct qstr *qstr)
+static int vfat_hash(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *qstr)
 {
 	qstr->hash = full_name_hash(qstr->name, vfat_striptail_len(qstr));
 	return 0;
@@ -115,9 +116,10 @@ static int vfat_hash(struct dentry *dentry, struct qstr *qstr)
  * that the existing dentry can be used. The vfat fs routines will
  * return ENOENT or EINVAL as appropriate.
  */
-static int vfat_hashi(struct dentry *dentry, struct qstr *qstr)
+static int vfat_hashi(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *qstr)
 {
-	struct nls_table *t = MSDOS_SB(dentry->d_inode->i_sb)->nls_io;
+	struct nls_table *t = MSDOS_SB(dentry->d_sb)->nls_io;
 	const unsigned char *name;
 	unsigned int len;
 	unsigned long hash;
diff --git a/fs/gfs2/dentry.c b/fs/gfs2/dentry.c
index e80fea2..50497f6 100644
--- a/fs/gfs2/dentry.c
+++ b/fs/gfs2/dentry.c
@@ -100,7 +100,8 @@ fail:
 	return 0;
 }
 
-static int gfs2_dhash(struct dentry *dentry, struct qstr *str)
+static int gfs2_dhash(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *str)
 {
 	str->hash = gfs2_disk_hash(str->name, str->len);
 	return 0;
diff --git a/fs/hfs/hfs_fs.h b/fs/hfs/hfs_fs.h
index 3a815fd..345717a 100644
--- a/fs/hfs/hfs_fs.h
+++ b/fs/hfs/hfs_fs.h
@@ -213,7 +213,8 @@ extern int hfs_part_find(struct super_block *, sector_t *, sector_t *);
 /* string.c */
 extern const struct dentry_operations hfs_dentry_operations;
 
-extern int hfs_hash_dentry(struct dentry *, struct qstr *);
+extern int hfs_hash_dentry(const struct dentry *, const struct inode *,
+		struct qstr *);
 extern int hfs_strcmp(const unsigned char *, unsigned int,
 		      const unsigned char *, unsigned int);
 extern int hfs_compare_dentry(const struct dentry *parent,
diff --git a/fs/hfs/string.c b/fs/hfs/string.c
index 712aa53..d270f56 100644
--- a/fs/hfs/string.c
+++ b/fs/hfs/string.c
@@ -51,7 +51,8 @@ static unsigned char caseorder[256] = {
 /*
  * Hash a string to an integer in a case-independent way
  */
-int hfs_hash_dentry(struct dentry *dentry, struct qstr *this)
+int hfs_hash_dentry(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *this)
 {
 	const unsigned char *name = this->name;
 	unsigned int hash, len = this->len;
diff --git a/fs/hfsplus/hfsplus_fs.h b/fs/hfsplus/hfsplus_fs.h
index 7a98f24..b9100a6 100644
--- a/fs/hfsplus/hfsplus_fs.h
+++ b/fs/hfsplus/hfsplus_fs.h
@@ -379,7 +379,8 @@ int hfsplus_strcasecmp(const struct hfsplus_unistr *, const struct hfsplus_unist
 int hfsplus_strcmp(const struct hfsplus_unistr *, const struct hfsplus_unistr *);
 int hfsplus_uni2asc(struct super_block *, const struct hfsplus_unistr *, char *, int *);
 int hfsplus_asc2uni(struct super_block *, struct hfsplus_unistr *, const char *, int);
-int hfsplus_hash_dentry(struct dentry *dentry, struct qstr *str);
+int hfsplus_hash_dentry(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *str);
 int hfsplus_compare_dentry(const struct dentry *parent,
 		const struct dentry *dentry, const struct inode *inode,
 		unsigned int len, const char *str, const struct qstr *name);
diff --git a/fs/hfsplus/unicode.c b/fs/hfsplus/unicode.c
index 6e3be8f..9baf6be 100644
--- a/fs/hfsplus/unicode.c
+++ b/fs/hfsplus/unicode.c
@@ -320,7 +320,8 @@ int hfsplus_asc2uni(struct super_block *sb, struct hfsplus_unistr *ustr,
  * Composed unicode characters are decomposed and case-folding is performed
  * if the appropriate bits are (un)set on the superblock.
  */
-int hfsplus_hash_dentry(struct dentry *dentry, struct qstr *str)
+int hfsplus_hash_dentry(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *str)
 {
 	struct super_block *sb = dentry->d_sb;
 	const char *astr;
diff --git a/fs/hpfs/dentry.c b/fs/hpfs/dentry.c
index 3175ccc..d7f1cbb 100644
--- a/fs/hpfs/dentry.c
+++ b/fs/hpfs/dentry.c
@@ -12,7 +12,8 @@
  * Note: the dentry argument is the parent dentry.
  */
 
-static int hpfs_hash_dentry(struct dentry *dentry, struct qstr *qstr)
+static int hpfs_hash_dentry(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *qstr)
 {
 	unsigned long	 hash;
 	int		 i;
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index 9a7a6df..bc77744 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -26,8 +26,10 @@
 
 #define BEQUIET
 
-static int isofs_hashi(struct dentry *parent, struct qstr *qstr);
-static int isofs_hash(struct dentry *parent, struct qstr *qstr);
+static int isofs_hashi(const struct dentry *parent, const struct inode *inode,
+		struct qstr *qstr);
+static int isofs_hash(const struct dentry *parent, const struct inode *inode,
+		struct qstr *qstr);
 static int isofs_dentry_cmpi(const struct dentry *parent,
 		const struct dentry *dentry, const struct inode *inode,
 		unsigned int len, const char *str, const struct qstr *name);
@@ -36,8 +38,10 @@ static int isofs_dentry_cmp(const struct dentry *parent,
 		unsigned int len, const char *str, const struct qstr *name);
 
 #ifdef CONFIG_JOLIET
-static int isofs_hashi_ms(struct dentry *parent, struct qstr *qstr);
-static int isofs_hash_ms(struct dentry *parent, struct qstr *qstr);
+static int isofs_hashi_ms(const struct dentry *parent, const struct inode *inode,
+		struct qstr *qstr);
+static int isofs_hash_ms(const struct dentry *parent, const struct inode *inode,
+		struct qstr *qstr);
 static int isofs_dentry_cmpi_ms(const struct dentry *parent,
 		const struct dentry *dentry, const struct inode *inode,
 		unsigned int len, const char *str, const struct qstr *name);
@@ -168,7 +172,7 @@ struct iso9660_options{
  * Compute the hash for the isofs name corresponding to the dentry.
  */
 static int
-isofs_hash_common(struct dentry *dentry, struct qstr *qstr, int ms)
+isofs_hash_common(const struct dentry *dentry, struct qstr *qstr, int ms)
 {
 	const char *name;
 	int len;
@@ -189,7 +193,7 @@ isofs_hash_common(struct dentry *dentry, struct qstr *qstr, int ms)
  * Compute the hash for the isofs name corresponding to the dentry.
  */
 static int
-isofs_hashi_common(struct dentry *dentry, struct qstr *qstr, int ms)
+isofs_hashi_common(const struct dentry *dentry, struct qstr *qstr, int ms)
 {
 	const char *name;
 	int len;
@@ -244,13 +248,15 @@ static int isofs_dentry_cmp_common(
 }
 
 static int
-isofs_hash(struct dentry *dentry, struct qstr *qstr)
+isofs_hash(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *qstr)
 {
 	return isofs_hash_common(dentry, qstr, 0);
 }
 
 static int
-isofs_hashi(struct dentry *dentry, struct qstr *qstr)
+isofs_hashi(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *qstr)
 {
 	return isofs_hashi_common(dentry, qstr, 0);
 }
@@ -273,13 +279,15 @@ isofs_dentry_cmpi(const struct dentry *parent,
 
 #ifdef CONFIG_JOLIET
 static int
-isofs_hash_ms(struct dentry *dentry, struct qstr *qstr)
+isofs_hash_ms(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *qstr)
 {
 	return isofs_hash_common(dentry, qstr, 1);
 }
 
 static int
-isofs_hashi_ms(struct dentry *dentry, struct qstr *qstr)
+isofs_hashi_ms(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *qstr)
 {
 	return isofs_hashi_common(dentry, qstr, 1);
 }
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 5571d0b..7166a1b 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -1574,7 +1574,8 @@ const struct file_operations jfs_dir_operations = {
 	.llseek		= generic_file_llseek,
 };
 
-static int jfs_ci_hash(struct dentry *dir, struct qstr *this)
+static int jfs_ci_hash(const struct dentry *dir, const struct inode *inode,
+		struct qstr *this)
 {
 	unsigned long hash;
 	int i;
diff --git a/fs/namei.c b/fs/namei.c
index 5362af9..bf6821a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -731,7 +731,8 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
 	 * to use its own hash..
 	 */
 	if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
-		int err = nd->path.dentry->d_op->d_hash(nd->path.dentry, name);
+		int err = nd->path.dentry->d_op->d_hash(nd->path.dentry,
+				nd->path.dentry->d_inode, name);
 		if (err < 0)
 			return err;
 	}
@@ -1134,7 +1135,7 @@ static struct dentry *__lookup_hash(struct qstr *name,
 	 * to use its own hash..
 	 */
 	if (base->d_op && base->d_op->d_hash) {
-		err = base->d_op->d_hash(base, name);
+		err = base->d_op->d_hash(base, inode, name);
 		dentry = ERR_PTR(err);
 		if (err < 0)
 			goto out;
diff --git a/fs/ncpfs/dir.c b/fs/ncpfs/dir.c
index c99a5dd..ecb25c6 100644
--- a/fs/ncpfs/dir.c
+++ b/fs/ncpfs/dir.c
@@ -74,7 +74,8 @@ const struct inode_operations ncp_dir_inode_operations =
  * Dentry operations routines
  */
 static int ncp_lookup_validate(struct dentry *, struct nameidata *);
-static int ncp_hash_dentry(struct dentry *, struct qstr *);
+static int ncp_hash_dentry(const struct dentry *, const struct inode *,
+		struct qstr *);
 static int ncp_compare_dentry(const struct dentry *,
 		const struct dentry *, const struct inode *,
 		unsigned int, const char *, const struct qstr *);
@@ -129,10 +130,9 @@ static inline int ncp_case_sensitive(const struct inode *inode)
  * is case-sensitive.
  */
 static int 
-ncp_hash_dentry(struct dentry *dentry, struct qstr *this)
+ncp_hash_dentry(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *this)
 {
-	struct inode *inode = dentry->d_inode;
-
 	if (!ncp_case_sensitive(inode)) {
 		struct super_block *sb = dentry->d_sb;
 		struct nls_table *t;
@@ -601,7 +601,7 @@ ncp_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
 	qname.hash = full_name_hash(qname.name, qname.len);
 
 	if (dentry->d_op && dentry->d_op->d_hash)
-		if (dentry->d_op->d_hash(dentry, &qname) != 0)
+		if (dentry->d_op->d_hash(dentry, dentry->d_inode, &qname) != 0)
 			goto end_advance;
 
 	newdent = d_lookup(dentry, &qname);
diff --git a/fs/sysv/namei.c b/fs/sysv/namei.c
index 11e7f7d..7507aeb 100644
--- a/fs/sysv/namei.c
+++ b/fs/sysv/namei.c
@@ -27,7 +27,8 @@ static int add_nondir(struct dentry *dentry, struct inode *inode)
 	return err;
 }
 
-static int sysv_hash(struct dentry *dentry, struct qstr *qstr)
+static int sysv_hash(const struct dentry *dentry, const struct inode *inode,
+		struct qstr *qstr)
 {
 	/* Truncate the name in place, avoids having to define a compare
 	   function. */
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index ba1b9bd..4ef2af7 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -133,7 +133,8 @@ enum dentry_d_lock_class
 
 struct dentry_operations {
 	int (*d_revalidate)(struct dentry *, struct nameidata *);
-	int (*d_hash)(struct dentry *, struct qstr *);
+	int (*d_hash)(const struct dentry *, const struct inode *,
+			struct qstr *);
 	int (*d_compare)(const struct dentry *,
 			const struct dentry *, const struct inode *,
 			unsigned int, const char *, const struct qstr *);
-- 
1.7.1


* [PATCH 10/46] hostfs: simplify locking
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (7 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 09/46] fs: change d_hash " Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 11/46] fs: dcache scale hash Nick Piggin
                   ` (40 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Remove dcache_lock locking from the hostfs filesystem and move it into dcache
helpers. All that should really be required is a coherent path name;
protection from concurrent modification is not provided outside path name
generation, because dcache_lock is dropped before the path is used.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
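The shape of the change (a static helper that expects the lock to be held,
plus an exported wrapper that takes and drops it) can be sketched in userspace
as below. Everything here is an assumption made for illustration: struct node,
tree_lock, node_path_locked() and node_path_raw() are hypothetical names, and
the "lock" is a single-threaded toy that only tracks whether it is held.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a spinlock: tracks "held" so contracts can be checked. */
struct toy_lock { int held; };

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

static struct toy_lock tree_lock;

struct node { const char *name; struct node *parent; };	/* root: parent == NULL */

/* Caller must hold tree_lock. Builds "/a/b" right to left at the end of buf. */
static char *node_path_locked(const struct node *n, char *buf, int buflen)
{
	char *end = buf + buflen;

	assert(tree_lock.held);
	if (buflen < 2)
		return NULL;
	*--end = '\0';
	if (!n->parent) {		/* the root itself is "/" */
		*--end = '/';
		return end;
	}
	for (; n->parent; n = n->parent) {
		size_t len = strlen(n->name);

		if ((size_t)(end - buf) < len + 1)
			return NULL;	/* path does not fit */
		end -= len;
		memcpy(end, n->name, len);
		*--end = '/';
	}
	return end;
}

/* Locked wrapper: callers never touch tree_lock directly. */
static char *node_path_raw(const struct node *n, char *buf, int buflen)
{
	char *p;

	toy_lock_acquire(&tree_lock);
	p = node_path_locked(n, buf, buflen);
	toy_lock_release(&tree_lock);
	return p;
}
```

As in the commit message, the wrapper only guarantees that the name was
coherent while it was being generated; the tree may change as soon as the
lock is released.
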
 fs/dcache.c             |   15 +++++++++++++--
 fs/hostfs/hostfs_kern.c |   24 ++++++++++--------------
 include/linux/dcache.h  |    2 +-
 3 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9fd5180..4f9ccbe 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2140,7 +2140,7 @@ char *dynamic_dname(struct dentry *dentry, char *buffer, int buflen,
 /*
  * Write full pathname from the root of the filesystem into the buffer.
  */
-char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
+static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
 {
 	char *end = buf + buflen;
 	char *retval;
@@ -2167,7 +2167,18 @@ char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
 Elong:
 	return ERR_PTR(-ENAMETOOLONG);
 }
-EXPORT_SYMBOL(__dentry_path);
+
+char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
+{
+	char *retval;
+
+	spin_lock(&dcache_lock);
+	retval = __dentry_path(dentry, buf, buflen);
+	spin_unlock(&dcache_lock);
+
+	return retval;
+}
+EXPORT_SYMBOL(dentry_path_raw);
 
 char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 {
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index cfe8bc7..39dc505 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -92,12 +92,10 @@ __uml_setup("hostfs=", hostfs_args,
 
 static char *__dentry_name(struct dentry *dentry, char *name)
 {
-	char *p = __dentry_path(dentry, name, PATH_MAX);
+	char *p = dentry_path_raw(dentry, name, PATH_MAX);
 	char *root;
 	size_t len;
 
-	spin_unlock(&dcache_lock);
-
 	root = dentry->d_sb->s_fs_info;
 	len = strlen(root);
 	if (IS_ERR(p)) {
@@ -123,25 +121,23 @@ static char *dentry_name(struct dentry *dentry)
 	if (!name)
 		return NULL;
 
-	spin_lock(&dcache_lock);
 	return __dentry_name(dentry, name); /* will unlock */
 }
 
 static char *inode_name(struct inode *ino)
 {
 	struct dentry *dentry;
-	char *name = __getname();
-	if (!name)
-		return NULL;
+	char *name;
 
-	spin_lock(&dcache_lock);
-	if (list_empty(&ino->i_dentry)) {
-		spin_unlock(&dcache_lock);
-		__putname(name);
+	dentry = d_find_alias(ino);
+	if (!dentry)
 		return NULL;
-	}
-	dentry = list_first_entry(&ino->i_dentry, struct dentry, d_alias);
-	return __dentry_name(dentry, name); /* will unlock */
+
+	name = dentry_name(dentry);
+
+	dput(dentry);
+
+	return name;
 }
 
 static char *follow_link(char *link)
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 4ef2af7..6b5760b 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -309,7 +309,7 @@ extern char *dynamic_dname(struct dentry *, char *, int, const char *, ...);
 extern char *__d_path(const struct path *path, struct path *root, char *, int);
 extern char *d_path(const struct path *, char *, int);
 extern char *d_path_with_unreachable(const struct path *, char *, int);
-extern char *__dentry_path(struct dentry *, char *, int);
+extern char *dentry_path_raw(struct dentry *, char *, int);
 extern char *dentry_path(struct dentry *, char *, int);
 
 /* Allocation counts.. */
-- 
1.7.1


* [PATCH 11/46] fs: dcache scale hash
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (8 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 10/46] hostfs: simplify locking Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-12-09  6:09   ` Dave Chinner
  2010-11-27  9:44 ` [PATCH 12/46] fs: dcache scale lru Nick Piggin
                   ` (39 subsequent siblings)
  49 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Add a new lock, dcache_hash_lock, to protect the dcache hash table from
concurrent modification. d_hash is also protected by d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
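The comment this patch adds to fs/dcache.c states that when two d_locks are
needed, the dentry at the lower address is locked first. A minimal userspace
sketch of that rule follows; it is not the kernel's code. toy_dentry,
lock_two_dentries() and the acquired[] log are hypothetical, and the toy lock
is single-threaded and exists only so the order can be observed.

```c
#include <assert.h>
#include <stdint.h>

struct toy_lock { int held; };
struct toy_dentry { struct toy_lock d_lock; };

/* Records the order in which locks were taken, for inspection. */
static const struct toy_lock *acquired[4];
static int nr_acquired;

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held && nr_acquired < 4);
	l->held = 1;
	acquired[nr_acquired++] = l;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Take both d_locks, lower address first, whatever order the caller passes. */
static void lock_two_dentries(struct toy_dentry *a, struct toy_dentry *b)
{
	assert(a != b);
	if ((uintptr_t)a > (uintptr_t)b) {
		struct toy_dentry *tmp = a;

		a = b;
		b = tmp;
	}
	toy_lock_acquire(&a->d_lock);
	toy_lock_acquire(&b->d_lock);
}

/* Release order does not matter for deadlock avoidance. */
static void unlock_two_dentries(struct toy_dentry *a, struct toy_dentry *b)
{
	toy_lock_release(&a->d_lock);
	toy_lock_release(&b->d_lock);
}
```

Because every caller ends up taking the pair in the same address order, two
paths that lock the same two dentries cannot each hold one lock while waiting
for the other.
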
 fs/dcache.c            |   38 +++++++++++++++++++++++++++-----------
 include/linux/dcache.h |    3 +++
 2 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 4f9ccbe..50c65c7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -35,12 +35,27 @@
 #include <linux/hardirq.h>
 #include "internal.h"
 
+/*
+ * Usage:
+ * dcache_hash_lock protects dcache hash table
+ *
+ * Ordering:
+ * dcache_lock
+ *   dentry->d_lock
+ *     dcache_hash_lock
+ *
+ * if (dentry1 < dentry2)
+ *   dentry1->d_lock
+ *     dentry2->d_lock
+ */
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
- __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
+EXPORT_SYMBOL(dcache_hash_lock);
 EXPORT_SYMBOL(dcache_lock);
 
 static struct kmem_cache *dentry_cache __read_mostly;
@@ -1195,7 +1210,9 @@ struct dentry *d_obtain_alias(struct inode *inode)
 	tmp->d_flags |= DCACHE_DISCONNECTED;
 	tmp->d_flags &= ~DCACHE_UNHASHED;
 	list_add(&tmp->d_alias, &inode->i_dentry);
+	spin_lock(&dcache_hash_lock);
 	hlist_add_head(&tmp->d_hash, &inode->i_sb->s_anon);
+	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&tmp->d_lock);
 
 	spin_unlock(&dcache_lock);
@@ -1581,7 +1598,9 @@ void d_rehash(struct dentry * entry)
 {
 	spin_lock(&dcache_lock);
 	spin_lock(&entry->d_lock);
+	spin_lock(&dcache_hash_lock);
 	_d_rehash(entry);
+	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&entry->d_lock);
 	spin_unlock(&dcache_lock);
 }
@@ -1661,8 +1680,6 @@ static void switch_names(struct dentry *dentry, struct dentry *target)
  */
 static void d_move_locked(struct dentry * dentry, struct dentry * target)
 {
-	struct hlist_head *list;
-
 	if (!dentry->d_inode)
 		printk(KERN_WARNING "VFS: moving negative dcache entry\n");
 
@@ -1679,14 +1696,11 @@ static void d_move_locked(struct dentry * dentry, struct dentry * target)
 	}
 
 	/* Move the dentry to the target hash queue, if on different bucket */
-	if (d_unhashed(dentry))
-		goto already_unhashed;
-
-	hlist_del_rcu(&dentry->d_hash);
-
-already_unhashed:
-	list = d_hash(target->d_parent, target->d_name.hash);
-	__d_rehash(dentry, list);
+	spin_lock(&dcache_hash_lock);
+	if (!d_unhashed(dentry))
+		hlist_del_rcu(&dentry->d_hash);
+	__d_rehash(dentry, d_hash(target->d_parent, target->d_name.hash));
+	spin_unlock(&dcache_hash_lock);
 
 	/* Unhash the target: dput() will then get rid of it */
 	__d_drop(target);
@@ -1883,7 +1897,9 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 found_lock:
 	spin_lock(&actual->d_lock);
 found:
+	spin_lock(&dcache_hash_lock);
 	_d_rehash(actual);
+	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&actual->d_lock);
 	spin_unlock(&dcache_lock);
 out_nolock:
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 6b5760b..7ce20f5 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -181,6 +181,7 @@ struct dentry_operations {
 
 #define DCACHE_CANT_MOUNT	0x0100
 
+extern spinlock_t dcache_hash_lock;
 extern spinlock_t dcache_lock;
 extern seqlock_t rename_lock;
 
@@ -204,7 +205,9 @@ static inline void __d_drop(struct dentry *dentry)
 {
 	if (!(dentry->d_flags & DCACHE_UNHASHED)) {
 		dentry->d_flags |= DCACHE_UNHASHED;
+		spin_lock(&dcache_hash_lock);
 		hlist_del_rcu(&dentry->d_hash);
+		spin_unlock(&dcache_hash_lock);
 	}
 }
 
-- 
1.7.1


* [PATCH 12/46] fs: dcache scale lru
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (9 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 11/46] fs: dcache scale hash Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-12-09  7:22   ` Dave Chinner
  2010-11-27  9:44 ` [PATCH 13/46] fs: dcache scale dentry refcount Nick Piggin
                   ` (38 subsequent siblings)
  49 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Add a new lock, dcache_lru_lock, to protect the dcache LRU list from concurrent
modification. d_lru is also protected by d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
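The nesting used by dentry_lru_add() and dentry_lru_del() in this patch (test
d_lru under d_lock, then take the LRU lock only around the list and counter
update) can be sketched in userspace like this. It is an illustration under
assumptions, not the kernel implementation: the list helpers, toy_lock,
toy_dentry, lru, nr_unused, toy_lru_add() and toy_lru_del() are all
hypothetical stand-ins, and the toy lock is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static int list_is_empty(const struct list_head *h) { return h->next == h; }

static void list_add_front(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void list_del_reinit(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Toy stand-in for a spinlock: tracks "held" so contracts can be checked. */
struct toy_lock { int held; };
static void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = 1; }
static void toy_lock_release(struct toy_lock *l) { assert(l->held); l->held = 0; }

static struct toy_lock lru_lock;	/* protects lru and nr_unused */
static struct list_head lru;		/* must be list_init()ed before use */
static long nr_unused;

struct toy_dentry { struct toy_lock d_lock; struct list_head d_lru; };

/* Caller holds d->d_lock, so this dentry's d_lru cannot change under us. */
static void toy_lru_add(struct toy_dentry *d)
{
	assert(d->d_lock.held);
	if (list_is_empty(&d->d_lru)) {
		toy_lock_acquire(&lru_lock);
		list_add_front(&d->d_lru, &lru);
		nr_unused++;
		toy_lock_release(&lru_lock);
	}
}

static void toy_lru_del(struct toy_dentry *d)
{
	assert(d->d_lock.held);
	if (!list_is_empty(&d->d_lru)) {
		toy_lock_acquire(&lru_lock);
		list_del_reinit(&d->d_lru);
		nr_unused--;
		toy_lock_release(&lru_lock);
	}
}
```

Calling toy_lru_add() twice on the same dentry leaves it on the list once,
and a second toy_lru_del() is a no-op, which mirrors the list_empty() guards
in the patch.
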
 fs/dcache.c |  112 ++++++++++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 84 insertions(+), 28 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 50c65c7..aa410b6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -37,11 +37,19 @@
 
 /*
  * Usage:
- * dcache_hash_lock protects dcache hash table
+ * dcache_hash_lock protects:
+ *   - the dcache hash table
+ * dcache_lru_lock protects:
+ *   - the dcache lru lists and counters
+ * d_lock protects:
+ *   - d_flags
+ *   - d_name
+ *   - d_lru
  *
  * Ordering:
  * dcache_lock
  *   dentry->d_lock
+ *     dcache_lru_lock
  *     dcache_hash_lock
  *
  * if (dentry1 < dentry2)
@@ -52,6 +60,7 @@ int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
+static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
@@ -148,28 +157,38 @@ static void dentry_iput(struct dentry * dentry)
 }
 
 /*
- * dentry_lru_(add|del|move_tail) must be called with dcache_lock held.
+ * dentry_lru_(add|del|move_tail) must be called with d_lock held.
  */
 static void dentry_lru_add(struct dentry *dentry)
 {
 	if (list_empty(&dentry->d_lru)) {
+		spin_lock(&dcache_lru_lock);
 		list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
 		dentry->d_sb->s_nr_dentry_unused++;
 		percpu_counter_inc(&nr_dentry_unused);
+		spin_unlock(&dcache_lru_lock);
 	}
 }
 
+static void __dentry_lru_del(struct dentry *dentry)
+{
+	list_del_init(&dentry->d_lru);
+	dentry->d_sb->s_nr_dentry_unused--;
+	percpu_counter_dec(&nr_dentry_unused);
+}
+
 static void dentry_lru_del(struct dentry *dentry)
 {
 	if (!list_empty(&dentry->d_lru)) {
-		list_del_init(&dentry->d_lru);
-		dentry->d_sb->s_nr_dentry_unused--;
-		percpu_counter_dec(&nr_dentry_unused);
+		spin_lock(&dcache_lru_lock);
+		__dentry_lru_del(dentry);
+		spin_unlock(&dcache_lru_lock);
 	}
 }
 
 static void dentry_lru_move_tail(struct dentry *dentry)
 {
+	spin_lock(&dcache_lru_lock);
 	if (list_empty(&dentry->d_lru)) {
 		list_add_tail(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
 		dentry->d_sb->s_nr_dentry_unused++;
@@ -177,6 +196,7 @@ static void dentry_lru_move_tail(struct dentry *dentry)
 	} else {
 		list_move_tail(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
 	}
+	spin_unlock(&dcache_lru_lock);
 }
 
 /**
@@ -186,6 +206,8 @@ static void dentry_lru_move_tail(struct dentry *dentry)
  * The dentry must already be unhashed and removed from the LRU.
  *
  * If this is the root of the dentry tree, return NULL.
+ *
+ * dcache_lock and d_lock must be held by caller, are dropped by d_kill.
  */
 static struct dentry *d_kill(struct dentry *dentry)
 	__releases(dentry->d_lock)
@@ -341,10 +363,19 @@ int d_invalidate(struct dentry * dentry)
 EXPORT_SYMBOL(d_invalidate);
 
 /* This should be called _only_ with dcache_lock held */
+static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
+{
+	atomic_inc(&dentry->d_count);
+	dentry_lru_del(dentry);
+	return dentry;
+}
+
 static inline struct dentry * __dget_locked(struct dentry *dentry)
 {
 	atomic_inc(&dentry->d_count);
+	spin_lock(&dentry->d_lock);
 	dentry_lru_del(dentry);
+	spin_unlock(&dentry->d_lock);
 	return dentry;
 }
 
@@ -423,7 +454,7 @@ restart:
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
 		if (!atomic_read(&dentry->d_count)) {
-			__dget_locked(dentry);
+			__dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
@@ -447,7 +478,6 @@ EXPORT_SYMBOL(d_prune_aliases);
 static void prune_one_dentry(struct dentry * dentry)
 	__releases(dentry->d_lock)
 	__releases(dcache_lock)
-	__acquires(dcache_lock)
 {
 	__d_drop(dentry);
 	dentry = d_kill(dentry);
@@ -456,15 +486,16 @@ static void prune_one_dentry(struct dentry * dentry)
 	 * Prune ancestors.  Locking is simpler than in dput(),
 	 * because dcache_lock needs to be taken anyway.
 	 */
-	spin_lock(&dcache_lock);
 	while (dentry) {
-		if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_lock))
+		spin_lock(&dcache_lock);
+		if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_lock)) {
+			spin_unlock(&dcache_lock);
 			return;
+		}
 
 		dentry_lru_del(dentry);
 		__d_drop(dentry);
 		dentry = d_kill(dentry);
-		spin_lock(&dcache_lock);
 	}
 }
 
@@ -474,21 +505,31 @@ static void shrink_dentry_list(struct list_head *list)
 
 	while (!list_empty(list)) {
 		dentry = list_entry(list->prev, struct dentry, d_lru);
-		dentry_lru_del(dentry);
+
+		if (!spin_trylock(&dentry->d_lock)) {
+			spin_unlock(&dcache_lru_lock);
+			cpu_relax();
+			spin_lock(&dcache_lru_lock);
+			continue;
+		}
+
+		__dentry_lru_del(dentry);
 
 		/*
 		 * We found an inuse dentry which was not removed from
 		 * the LRU because of laziness during lookup.  Do not free
 		 * it - just keep it off the LRU list.
 		 */
-		spin_lock(&dentry->d_lock);
 		if (atomic_read(&dentry->d_count)) {
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
+		spin_unlock(&dcache_lru_lock);
+
 		prune_one_dentry(dentry);
-		/* dentry->d_lock was dropped in prune_one_dentry() */
-		cond_resched_lock(&dcache_lock);
+		/* dcache_lock and dentry->d_lock dropped */
+		spin_lock(&dcache_lock);
+		spin_lock(&dcache_lru_lock);
 	}
 }
 
@@ -509,32 +550,36 @@ static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
 	int cnt = *count;
 
 	spin_lock(&dcache_lock);
+relock:
+	spin_lock(&dcache_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
 		dentry = list_entry(sb->s_dentry_lru.prev,
 				struct dentry, d_lru);
 		BUG_ON(dentry->d_sb != sb);
 
+		if (!spin_trylock(&dentry->d_lock)) {
+			spin_unlock(&dcache_lru_lock);
+			cpu_relax();
+			goto relock;
+		}
+
 		/*
 		 * If we are honouring the DCACHE_REFERENCED flag and the
 		 * dentry has this flag set, don't free it.  Clear the flag
 		 * and put it back on the LRU.
 		 */
-		if (flags & DCACHE_REFERENCED) {
-			spin_lock(&dentry->d_lock);
-			if (dentry->d_flags & DCACHE_REFERENCED) {
-				dentry->d_flags &= ~DCACHE_REFERENCED;
-				list_move(&dentry->d_lru, &referenced);
-				spin_unlock(&dentry->d_lock);
-				cond_resched_lock(&dcache_lock);
-				continue;
-			}
+		if (flags & DCACHE_REFERENCED &&
+				dentry->d_flags & DCACHE_REFERENCED) {
+			dentry->d_flags &= ~DCACHE_REFERENCED;
+			list_move(&dentry->d_lru, &referenced);
+			spin_unlock(&dentry->d_lock);
+		} else {
+			list_move_tail(&dentry->d_lru, &tmp);
 			spin_unlock(&dentry->d_lock);
+			if (!--cnt)
+				break;
 		}
-
-		list_move_tail(&dentry->d_lru, &tmp);
-		if (!--cnt)
-			break;
-		cond_resched_lock(&dcache_lock);
+		/* XXX: re-add cond_resched_lock when dcache_lock goes away */
 	}
 
 	*count = cnt;
@@ -542,6 +587,7 @@ static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
 
 	if (!list_empty(&referenced))
 		list_splice(&referenced, &sb->s_dentry_lru);
+	spin_unlock(&dcache_lru_lock);
 	spin_unlock(&dcache_lock);
 
 }
@@ -637,10 +683,12 @@ void shrink_dcache_sb(struct super_block *sb)
 	LIST_HEAD(tmp);
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
 		list_splice_init(&sb->s_dentry_lru, &tmp);
 		shrink_dentry_list(&tmp);
 	}
+	spin_unlock(&dcache_lru_lock);
 	spin_unlock(&dcache_lock);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
@@ -659,7 +707,9 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 
 	/* detach this root from the system */
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	dentry_lru_del(dentry);
+	spin_unlock(&dentry->d_lock);
 	__d_drop(dentry);
 	spin_unlock(&dcache_lock);
 
@@ -673,7 +723,9 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 			spin_lock(&dcache_lock);
 			list_for_each_entry(loop, &dentry->d_subdirs,
 					    d_u.d_child) {
+				spin_lock(&loop->d_lock);
 				dentry_lru_del(loop);
+				spin_unlock(&loop->d_lock);
 				__d_drop(loop);
 				cond_resched_lock(&dcache_lock);
 			}
@@ -850,6 +902,8 @@ resume:
 		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
 
+		spin_lock(&dentry->d_lock);
+
 		/* 
 		 * move only zero ref count dentries to the end 
 		 * of the unused list for prune_dcache
@@ -861,6 +915,8 @@ resume:
 			dentry_lru_del(dentry);
 		}
 
+		spin_unlock(&dentry->d_lock);
+
 		/*
 		 * We can return to the caller if we have found some (this
 		 * ensures forward progress). We'll be coming back to find
-- 
1.7.1



* [PATCH 13/46] fs: dcache scale dentry refcount
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (10 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 12/46] fs: dcache scale lru Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 14/46] fs: dcache scale d_unhashed Nick Piggin
                   ` (37 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Make d_count non-atomic and protect it with d_lock. This lets us ensure that a
dentry with a zero refcount stays at zero while d_lock is held, without needing
dcache_lock. It is also a natural fit as more dentry members come to be
protected by d_lock.
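
For illustration only, a minimal userspace sketch of a lock-protected plain
reference count, assuming a pthread mutex in place of d_lock. The names
(struct obj, obj_get, obj_put, obj_claim_if_unused) are hypothetical and not
kernel API. The point it tries to show: once the lock is held, a count
observed as zero cannot change until the lock is released.

```c
#include <assert.h>
#include <pthread.h>

struct obj {
	pthread_mutex_t lock;	/* stands in for d_lock */
	unsigned int count;	/* plain integer, only touched under lock */
};

/* Caller already holds o->lock (compare dget_dlock in the patch). */
static void obj_get_locked(struct obj *o)
{
	assert(o->count > 0);	/* a live reference must already exist */
	o->count++;
}

/* Caller does not hold the lock (compare dget in the patch). */
static void obj_get(struct obj *o)
{
	pthread_mutex_lock(&o->lock);
	obj_get_locked(o);
	pthread_mutex_unlock(&o->lock);
}

/* Drop one reference; returns 1 if it was the last one. */
static int obj_put(struct obj *o)
{
	int last;

	pthread_mutex_lock(&o->lock);
	last = (--o->count == 0);
	pthread_mutex_unlock(&o->lock);
	return last;
}

/*
 * Claim the object only if nobody references it. The check and the
 * update happen under one lock hold, so no other thread can raise the
 * count in between.
 */
static int obj_claim_if_unused(struct obj *o)
{
	int claimed = 0;

	pthread_mutex_lock(&o->lock);
	if (o->count == 0) {
		o->count = 1;
		claimed = 1;
	}
	pthread_mutex_unlock(&o->lock);
	return claimed;
}
```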

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 arch/powerpc/platforms/cell/spufs/inode.c |    2 +-
 drivers/infiniband/hw/ipath/ipath_fs.c    |    2 +-
 drivers/infiniband/hw/qib/qib_fs.c        |    2 +-
 fs/autofs4/expire.c                       |    8 +-
 fs/autofs4/root.c                         |    6 +-
 fs/ceph/dir.c                             |    4 +-
 fs/ceph/inode.c                           |    4 +-
 fs/ceph/mds_client.c                      |    2 +-
 fs/coda/dir.c                             |    2 +-
 fs/configfs/dir.c                         |    3 +-
 fs/configfs/inode.c                       |    2 +-
 fs/dcache.c                               |  106 ++++++++++++++++++++++-------
 fs/ecryptfs/inode.c                       |    2 +-
 fs/locks.c                                |    2 +-
 fs/namei.c                                |    2 +-
 fs/nfs/dir.c                              |    6 +-
 fs/nfs/unlink.c                           |    2 +-
 fs/nfsd/vfs.c                             |    5 +-
 fs/nilfs2/super.c                         |    2 +-
 include/linux/dcache.h                    |   29 ++++----
 kernel/cgroup.c                           |    2 -
 21 files changed, 126 insertions(+), 69 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index 3532b92..29a406a 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -162,7 +162,7 @@ static void spufs_prune_dir(struct dentry *dir)
 		spin_lock(&dcache_lock);
 		spin_lock(&dentry->d_lock);
 		if (!(d_unhashed(dentry)) && dentry->d_inode) {
-			dget_locked(dentry);
+			dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			simple_unlink(dir->d_inode, dentry);
diff --git a/drivers/infiniband/hw/ipath/ipath_fs.c b/drivers/infiniband/hw/ipath/ipath_fs.c
index 8c8afc7..18aee04 100644
--- a/drivers/infiniband/hw/ipath/ipath_fs.c
+++ b/drivers/infiniband/hw/ipath/ipath_fs.c
@@ -280,7 +280,7 @@ static int remove_file(struct dentry *parent, char *name)
 	spin_lock(&dcache_lock);
 	spin_lock(&tmp->d_lock);
 	if (!(d_unhashed(tmp) && tmp->d_inode)) {
-		dget_locked(tmp);
+		dget_locked_dlock(tmp);
 		__d_drop(tmp);
 		spin_unlock(&tmp->d_lock);
 		spin_unlock(&dcache_lock);
diff --git a/drivers/infiniband/hw/qib/qib_fs.c b/drivers/infiniband/hw/qib/qib_fs.c
index f99bddc..fe4b242 100644
--- a/drivers/infiniband/hw/qib/qib_fs.c
+++ b/drivers/infiniband/hw/qib/qib_fs.c
@@ -456,7 +456,7 @@ static int remove_file(struct dentry *parent, char *name)
 	spin_lock(&dcache_lock);
 	spin_lock(&tmp->d_lock);
 	if (!(d_unhashed(tmp) && tmp->d_inode)) {
-		dget_locked(tmp);
+		dget_locked_dlock(tmp);
 		__d_drop(tmp);
 		spin_unlock(&tmp->d_lock);
 		spin_unlock(&dcache_lock);
diff --git a/fs/autofs4/expire.c b/fs/autofs4/expire.c
index a796c94..413b564 100644
--- a/fs/autofs4/expire.c
+++ b/fs/autofs4/expire.c
@@ -198,7 +198,7 @@ static int autofs4_tree_busy(struct vfsmount *mnt,
 			else
 				ino_count++;
 
-			if (atomic_read(&p->d_count) > ino_count) {
+			if (p->d_count > ino_count) {
 				top_ino->last_used = jiffies;
 				dput(p);
 				return 1;
@@ -347,7 +347,7 @@ struct dentry *autofs4_expire_indirect(struct super_block *sb,
 
 			/* Path walk currently on this dentry? */
 			ino_count = atomic_read(&ino->count) + 2;
-			if (atomic_read(&dentry->d_count) > ino_count)
+			if (dentry->d_count > ino_count)
 				goto next;
 
 			/* Can we umount this guy */
@@ -369,7 +369,7 @@ struct dentry *autofs4_expire_indirect(struct super_block *sb,
 		if (!exp_leaves) {
 			/* Path walk currently on this dentry? */
 			ino_count = atomic_read(&ino->count) + 1;
-			if (atomic_read(&dentry->d_count) > ino_count)
+			if (dentry->d_count > ino_count)
 				goto next;
 
 			if (!autofs4_tree_busy(mnt, dentry, timeout, do_now)) {
@@ -383,7 +383,7 @@ struct dentry *autofs4_expire_indirect(struct super_block *sb,
 		} else {
 			/* Path walk currently on this dentry? */
 			ino_count = atomic_read(&ino->count) + 1;
-			if (atomic_read(&dentry->d_count) > ino_count)
+			if (dentry->d_count > ino_count)
 				goto next;
 
 			expired = autofs4_check_leaves(mnt, dentry, timeout, do_now);
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index d5c1401..dc7e64b 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -436,7 +436,7 @@ static struct dentry *autofs4_lookup_active(struct dentry *dentry)
 		spin_lock(&active->d_lock);
 
 		/* Already gone? */
-		if (atomic_read(&active->d_count) == 0)
+		if (active->d_count == 0)
 			goto next;
 
 		qstr = &active->d_name;
@@ -452,7 +452,7 @@ static struct dentry *autofs4_lookup_active(struct dentry *dentry)
 			goto next;
 
 		if (d_unhashed(active)) {
-			dget(active);
+			dget_dlock(active);
 			spin_unlock(&active->d_lock);
 			spin_unlock(&sbi->lookup_lock);
 			spin_unlock(&dcache_lock);
@@ -507,7 +507,7 @@ static struct dentry *autofs4_lookup_expiring(struct dentry *dentry)
 			goto next;
 
 		if (d_unhashed(expiring)) {
-			dget(expiring);
+			dget_dlock(expiring);
 			spin_unlock(&expiring->d_lock);
 			spin_unlock(&sbi->lookup_lock);
 			spin_unlock(&dcache_lock);
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 7d447af..599b011 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -149,7 +149,9 @@ more:
 		di = ceph_dentry(dentry);
 	}
 
-	atomic_inc(&dentry->d_count);
+	spin_lock(&dentry->d_lock);
+	dentry->d_count++;
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 
 	dout(" %llu (%llu) dentry %p %.*s %p\n", di->offset, filp->f_pos,
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index bf12865..bb68c79 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -879,8 +879,8 @@ static struct dentry *splice_dentry(struct dentry *dn, struct inode *in,
 	} else if (realdn) {
 		dout("dn %p (%d) spliced with %p (%d) "
 		     "inode %p ino %llx.%llx\n",
-		     dn, atomic_read(&dn->d_count),
-		     realdn, atomic_read(&realdn->d_count),
+		     dn, dn->d_count,
+		     realdn, realdn->d_count,
 		     realdn->d_inode, ceph_vinop(realdn->d_inode));
 		dput(dn);
 		dn = realdn;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 098b185..b110246 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1454,7 +1454,7 @@ retry:
 	*base = ceph_ino(temp->d_inode);
 	*plen = len;
 	dout("build_path on %p %d built %llx '%.*s'\n",
-	     dentry, atomic_read(&dentry->d_count), *base, len, path);
+	     dentry, dentry->d_count, *base, len, path);
 	return path;
 }
 
diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index 4cce3b0..9e37e8b 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -559,7 +559,7 @@ static int coda_dentry_revalidate(struct dentry *de, struct nameidata *nd)
 	if (cii->c_flags & C_FLUSH) 
 		coda_flag_inode_children(inode, C_FLUSH);
 
-	if (atomic_read(&de->d_count) > 1)
+	if (de->d_count > 1)
 		/* pretend it's valid, but don't change the flags */
 		goto out;
 
diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
index 1001557..5825780 100644
--- a/fs/configfs/dir.c
+++ b/fs/configfs/dir.c
@@ -399,8 +399,7 @@ static void remove_dir(struct dentry * d)
 	if (d->d_inode)
 		simple_rmdir(parent->d_inode,d);
 
-	pr_debug(" o %s removing done (%d)\n",d->d_name.name,
-		 atomic_read(&d->d_count));
+	pr_debug(" o %s removing done (%d)\n",d->d_name.name, d->d_count);
 
 	dput(parent);
 }
diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
index 253476d..79b3776 100644
--- a/fs/configfs/inode.c
+++ b/fs/configfs/inode.c
@@ -253,7 +253,7 @@ void configfs_drop_dentry(struct configfs_dirent * sd, struct dentry * parent)
 		spin_lock(&dcache_lock);
 		spin_lock(&dentry->d_lock);
 		if (!(d_unhashed(dentry) && dentry->d_inode)) {
-			dget_locked(dentry);
+			dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
diff --git a/fs/dcache.c b/fs/dcache.c
index aa410b6..2e04131 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -45,6 +45,7 @@
  *   - d_flags
  *   - d_name
  *   - d_lru
+ *   - d_count
  *
  * Ordering:
  * dcache_lock
@@ -119,6 +120,7 @@ static void __d_free(struct rcu_head *head)
  */
 static void d_free(struct dentry *dentry)
 {
+	BUG_ON(dentry->d_count);
 	percpu_counter_dec(&nr_dentry);
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
@@ -216,8 +218,11 @@ static struct dentry *d_kill(struct dentry *dentry)
 	struct dentry *parent;
 
 	list_del(&dentry->d_u.d_child);
-	/*drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
+	/*
+	 * dentry_iput drops the locks, at which point nobody (except
+	 * transient RCU lookups) can reach this dentry.
+	 */
 	if (IS_ROOT(dentry))
 		parent = NULL;
 	else
@@ -261,13 +266,23 @@ void dput(struct dentry *dentry)
 		return;
 
 repeat:
-	if (atomic_read(&dentry->d_count) == 1)
+	if (dentry->d_count == 1)
 		might_sleep();
-	if (!atomic_dec_and_lock(&dentry->d_count, &dcache_lock))
-		return;
-
 	spin_lock(&dentry->d_lock);
-	if (atomic_read(&dentry->d_count)) {
+	if (dentry->d_count == 1) {
+		if (!spin_trylock(&dcache_lock)) {
+			/*
+			 * Something of a livelock possibility we could avoid
+			 * by taking dcache_lock and trying again, but we
+			 * want to reduce dcache_lock anyway so this will
+			 * get improved.
+			 */
+			spin_unlock(&dentry->d_lock);
+			goto repeat;
+		}
+	}
+	dentry->d_count--;
+	if (dentry->d_count) {
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 		return;
@@ -347,7 +362,7 @@ int d_invalidate(struct dentry * dentry)
 	 * working directory or similar).
 	 */
 	spin_lock(&dentry->d_lock);
-	if (atomic_read(&dentry->d_count) > 1) {
+	if (dentry->d_count > 1) {
 		if (dentry->d_inode && S_ISDIR(dentry->d_inode->i_mode)) {
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
@@ -362,29 +377,61 @@ int d_invalidate(struct dentry * dentry)
 }
 EXPORT_SYMBOL(d_invalidate);
 
-/* This should be called _only_ with dcache_lock held */
+/* This must be called with dcache_lock and d_lock held */
 static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
 {
-	atomic_inc(&dentry->d_count);
+	dentry->d_count++;
 	dentry_lru_del(dentry);
 	return dentry;
 }
 
+/* This should be called _only_ with dcache_lock held */
 static inline struct dentry * __dget_locked(struct dentry *dentry)
 {
-	atomic_inc(&dentry->d_count);
 	spin_lock(&dentry->d_lock);
-	dentry_lru_del(dentry);
+	__dget_locked_dlock(dentry);
 	spin_unlock(&dentry->d_lock);
 	return dentry;
 }
 
+struct dentry * dget_locked_dlock(struct dentry *dentry)
+{
+	return __dget_locked_dlock(dentry);
+}
+
 struct dentry * dget_locked(struct dentry *dentry)
 {
 	return __dget_locked(dentry);
 }
 EXPORT_SYMBOL(dget_locked);
 
+struct dentry *dget_parent(struct dentry *dentry)
+{
+	struct dentry *ret;
+
+repeat:
+	spin_lock(&dentry->d_lock);
+	ret = dentry->d_parent;
+	if (!ret)
+		goto out;
+	if (dentry == ret) {
+		ret->d_count++;
+		goto out;
+	}
+	if (!spin_trylock(&ret->d_lock)) {
+		spin_unlock(&dentry->d_lock);
+		cpu_relax();
+		goto repeat;
+	}
+	BUG_ON(!ret->d_count);
+	ret->d_count++;
+	spin_unlock(&ret->d_lock);
+out:
+	spin_unlock(&dentry->d_lock);
+	return ret;
+}
+EXPORT_SYMBOL(dget_parent);
+
 /**
  * d_find_alias - grab a hashed alias of inode
  * @inode: inode in question
@@ -453,7 +500,7 @@ restart:
 	spin_lock(&dcache_lock);
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
-		if (!atomic_read(&dentry->d_count)) {
+		if (!dentry->d_count) {
 			__dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
@@ -488,7 +535,10 @@ static void prune_one_dentry(struct dentry * dentry)
 	 */
 	while (dentry) {
 		spin_lock(&dcache_lock);
-		if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_lock)) {
+		spin_lock(&dentry->d_lock);
+		dentry->d_count--;
+		if (dentry->d_count) {
+			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
 			return;
 		}
@@ -520,7 +570,7 @@ static void shrink_dentry_list(struct list_head *list)
 		 * the LRU because of laziness during lookup.  Do not free
 		 * it - just keep it off the LRU list.
 		 */
-		if (atomic_read(&dentry->d_count)) {
+		if (dentry->d_count) {
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
@@ -741,7 +791,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 		do {
 			struct inode *inode;
 
-			if (atomic_read(&dentry->d_count) != 0) {
+			if (dentry->d_count != 0) {
 				printk(KERN_ERR
 				       "BUG: Dentry %p{i=%lx,n=%s}"
 				       " still in use (%d)"
@@ -750,7 +800,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 				       dentry->d_inode ?
 				       dentry->d_inode->i_ino : 0UL,
 				       dentry->d_name.name,
-				       atomic_read(&dentry->d_count),
+				       dentry->d_count,
 				       dentry->d_sb->s_type->name,
 				       dentry->d_sb->s_id);
 				BUG();
@@ -760,7 +810,9 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 				parent = NULL;
 			else {
 				parent = dentry->d_parent;
-				atomic_dec(&parent->d_count);
+				spin_lock(&parent->d_lock);
+				parent->d_count--;
+				spin_unlock(&parent->d_lock);
 			}
 
 			list_del(&dentry->d_u.d_child);
@@ -811,7 +863,9 @@ void shrink_dcache_for_umount(struct super_block *sb)
 
 	dentry = sb->s_root;
 	sb->s_root = NULL;
-	atomic_dec(&dentry->d_count);
+	spin_lock(&dentry->d_lock);
+	dentry->d_count--;
+	spin_unlock(&dentry->d_lock);
 	shrink_dcache_for_umount_subtree(dentry);
 
 	while (!hlist_empty(&sb->s_anon)) {
@@ -908,7 +962,7 @@ resume:
 		 * move only zero ref count dentries to the end 
 		 * of the unused list for prune_dcache
 		 */
-		if (!atomic_read(&dentry->d_count)) {
+		if (!dentry->d_count) {
 			dentry_lru_move_tail(dentry);
 			found++;
 		} else {
@@ -1029,7 +1083,7 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	memcpy(dname, name->name, name->len);
 	dname[name->len] = 0;
 
-	atomic_set(&dentry->d_count, 1);
+	dentry->d_count = 1;
 	dentry->d_flags = DCACHE_UNHASHED;
 	spin_lock_init(&dentry->d_lock);
 	dentry->d_inode = NULL;
@@ -1517,7 +1571,7 @@ struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 				goto next;
 		}
 
-		atomic_inc(&dentry->d_count);
+		dentry->d_count++;
 		found = dentry;
 		spin_unlock(&dentry->d_lock);
 		break;
@@ -1614,7 +1668,7 @@ void d_delete(struct dentry * dentry)
 	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	isdir = S_ISDIR(dentry->d_inode->i_mode);
-	if (atomic_read(&dentry->d_count) == 1) {
+	if (dentry->d_count == 1) {
 		dentry->d_flags &= ~DCACHE_CANT_MOUNT;
 		dentry_iput(dentry);
 		fsnotify_nameremove(dentry, isdir);
@@ -2428,11 +2482,15 @@ resume:
 			this_parent = dentry;
 			goto repeat;
 		}
-		atomic_dec(&dentry->d_count);
+		spin_lock(&dentry->d_lock);
+		dentry->d_count--;
+		spin_unlock(&dentry->d_lock);
 	}
 	if (this_parent != root) {
 		next = this_parent->d_u.d_child.next;
-		atomic_dec(&this_parent->d_count);
+		spin_lock(&this_parent->d_lock);
+		this_parent->d_count--;
+		spin_unlock(&this_parent->d_lock);
 		this_parent = this_parent->d_parent;
 		goto resume;
 	}
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index a1ed7a7..5e5c7ec 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -260,7 +260,7 @@ int ecryptfs_lookup_and_interpose_lower(struct dentry *ecryptfs_dentry,
 				   ecryptfs_dentry->d_parent));
 	lower_inode = lower_dentry->d_inode;
 	fsstack_copy_attr_atime(ecryptfs_dir_inode, lower_dir_dentry->d_inode);
-	BUG_ON(!atomic_read(&lower_dentry->d_count));
+	BUG_ON(!lower_dentry->d_count);
 	ecryptfs_set_dentry_private(ecryptfs_dentry,
 				    kmem_cache_alloc(ecryptfs_dentry_info_cache,
 						     GFP_KERNEL));
diff --git a/fs/locks.c b/fs/locks.c
index 8729347..08415b2 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1389,7 +1389,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
 		if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
 			goto out;
 		if ((arg == F_WRLCK)
-		    && ((atomic_read(&dentry->d_count) > 1)
+		    && ((dentry->d_count > 1)
 			|| (atomic_read(&inode->i_count) > 1)))
 			goto out;
 	}
diff --git a/fs/namei.c b/fs/namei.c
index bf6821a..3081ea3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2130,7 +2130,7 @@ void dentry_unhash(struct dentry *dentry)
 	shrink_dcache_parent(dentry);
 	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
-	if (atomic_read(&dentry->d_count) == 2)
+	if (dentry->d_count == 2)
 		__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 0f7798a..34ef2dd 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1730,7 +1730,7 @@ static int nfs_unlink(struct inode *dir, struct dentry *dentry)
 
 	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
-	if (atomic_read(&dentry->d_count) > 1) {
+	if (dentry->d_count > 1) {
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 		/* Start asynchronous writeout of the inode */
@@ -1878,7 +1878,7 @@ static int nfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 	dfprintk(VFS, "NFS: rename(%s/%s -> %s/%s, ct=%d)\n",
 		 old_dentry->d_parent->d_name.name, old_dentry->d_name.name,
 		 new_dentry->d_parent->d_name.name, new_dentry->d_name.name,
-		 atomic_read(&new_dentry->d_count));
+		 new_dentry->d_count);
 
 	/*
 	 * For non-directories, check whether the target is busy and if so,
@@ -1896,7 +1896,7 @@ static int nfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 			rehash = new_dentry;
 		}
 
-		if (atomic_read(&new_dentry->d_count) > 2) {
+		if (new_dentry->d_count > 2) {
 			int err;
 
 			/* copy the target dentry's name */
diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c
index 7bdec85..8fe9eb4 100644
--- a/fs/nfs/unlink.c
+++ b/fs/nfs/unlink.c
@@ -496,7 +496,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
 
 	dfprintk(VFS, "NFS: silly-rename(%s/%s, ct=%d)\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name,
-		atomic_read(&dentry->d_count));
+		dentry->d_count);
 	nfs_inc_stats(dir, NFSIOS_SILLYRENAME);
 
 	/*
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 184938f..3a35902 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1756,8 +1756,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 		goto out_dput_new;
 
 	if (svc_msnfs(ffhp) &&
-		((atomic_read(&odentry->d_count) > 1)
-		 || (atomic_read(&ndentry->d_count) > 1))) {
+		((odentry->d_count > 1) || (ndentry->d_count > 1))) {
 			host_err = -EPERM;
 			goto out_dput_new;
 	}
@@ -1843,7 +1842,7 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
 	if (type != S_IFDIR) { /* It's UNLINK */
 #ifdef MSNFS
 		if ((fhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
-			(atomic_read(&rdentry->d_count) > 1)) {
+			(rdentry->d_count > 1)) {
 			host_err = -EPERM;
 		} else
 #endif
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index f804d41..d36fc7e 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -838,7 +838,7 @@ static int nilfs_attach_snapshot(struct super_block *s, __u64 cno,
 
 static int nilfs_tree_was_touched(struct dentry *root_dentry)
 {
-	return atomic_read(&root_dentry->d_count) > 1;
+	return root_dentry->d_count > 1;
 }
 
 /**
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 7ce20f5..06f6f58 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -87,7 +87,7 @@ full_name_hash(const unsigned char *name, unsigned int len)
 #endif
 
 struct dentry {
-	atomic_t d_count;
+	unsigned int d_count;		/* protected by d_lock */
 	unsigned int d_flags;		/* protected by d_lock */
 	spinlock_t d_lock;		/* per dentry lock */
 	int d_mounted;
@@ -329,17 +329,28 @@ extern char *dentry_path(struct dentry *, char *, int);
  *	needs and they take necessary precautions) you should hold dcache_lock
  *	and call dget_locked() instead of dget().
  */
- 
+static inline struct dentry *dget_dlock(struct dentry *dentry)
+{
+	if (dentry) {
+		BUG_ON(!dentry->d_count);
+		dentry->d_count++;
+	}
+	return dentry;
+}
 static inline struct dentry *dget(struct dentry *dentry)
 {
 	if (dentry) {
-		BUG_ON(!atomic_read(&dentry->d_count));
-		atomic_inc(&dentry->d_count);
+		spin_lock(&dentry->d_lock);
+		dget_dlock(dentry);
+		spin_unlock(&dentry->d_lock);
 	}
 	return dentry;
 }
 
 extern struct dentry * dget_locked(struct dentry *);
+extern struct dentry * dget_locked_dlock(struct dentry *);
+
+extern struct dentry *dget_parent(struct dentry *dentry);
 
 /**
  *	d_unhashed -	is dentry hashed
@@ -370,16 +381,6 @@ static inline void dont_mount(struct dentry *dentry)
 	spin_unlock(&dentry->d_lock);
 }
 
-static inline struct dentry *dget_parent(struct dentry *dentry)
-{
-	struct dentry *ret;
-
-	spin_lock(&dentry->d_lock);
-	ret = dget(dentry->d_parent);
-	spin_unlock(&dentry->d_lock);
-	return ret;
-}
-
 extern void dput(struct dentry *);
 
 static inline int d_mountpoint(struct dentry *dentry)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66a416b..7ead732 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3638,9 +3638,7 @@ again:
 	list_del(&cgrp->sibling);
 	cgroup_unlock_hierarchy(cgrp->root);
 
-	spin_lock(&cgrp->dentry->d_lock);
 	d = dget(cgrp->dentry);
-	spin_unlock(&d->d_lock);
 
 	cgroup_d_remove_dir(d);
 	dput(d);
-- 
1.7.1



* [PATCH 14/46] fs: dcache scale d_unhashed
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (11 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 13/46] fs: dcache scale dentry refcount Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 15/46] fs: dcache scale subdirs Nick Piggin
                   ` (36 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Protect the d_unhashed(dentry) condition with d_lock.
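
For illustration only, a minimal userspace sketch of checking a "hashed and
positive" condition and taking a reference under the same lock hold, assuming
a pthread mutex in place of d_lock. The names (struct obj, OBJ_UNHASHED,
obj_get_if_positive, obj_unhash) are hypothetical and not kernel API.

```c
#include <pthread.h>
#include <stddef.h>

#define OBJ_UNHASHED 0x1	/* hypothetical flag, mirrors DCACHE_UNHASHED */

struct obj {
	pthread_mutex_t lock;	/* stands in for d_lock */
	unsigned int flags;	/* guarded by lock */
	unsigned int count;	/* guarded by lock */
	void *inode;		/* NULL means a negative entry */
};

/*
 * Take a reference only if the object is hashed and positive.
 * The test and the increment share one lock hold, so the object
 * cannot be unhashed between them.
 */
static struct obj *obj_get_if_positive(struct obj *o)
{
	struct obj *ret = NULL;

	pthread_mutex_lock(&o->lock);
	if (o->inode && !(o->flags & OBJ_UNHASHED)) {
		o->count++;
		ret = o;
	}
	pthread_mutex_unlock(&o->lock);
	return ret;
}

/* Mark the object unhashed under the same lock. */
static void obj_unhash(struct obj *o)
{
	pthread_mutex_lock(&o->lock);
	o->flags |= OBJ_UNHASHED;
	pthread_mutex_unlock(&o->lock);
}
```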

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 arch/powerpc/platforms/cell/spufs/inode.c |    3 +
 drivers/usb/core/inode.c                  |    3 +
 fs/autofs4/autofs_i.h                     |   13 -----
 fs/autofs4/expire.c                       |   21 ++++++--
 fs/ceph/dir.c                             |    5 +-
 fs/configfs/configfs_internal.h           |    2 +
 fs/dcache.c                               |   83 ++++++++++++++++++++---------
 fs/libfs.c                                |   29 +++++++---
 fs/ocfs2/dcache.c                         |    5 ++-
 security/tomoyo/realpath.c                |    1 +
 10 files changed, 109 insertions(+), 56 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index 29a406a..5aef1a7 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -166,6 +166,9 @@ static void spufs_prune_dir(struct dentry *dir)
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			simple_unlink(dir->d_inode, dentry);
+			/* XXX: what is dcache_lock protecting here? Other
+			 * filesystems (IB, configfs) release dcache_lock
+			 * before unlink */
 			spin_unlock(&dcache_lock);
 			dput(dentry);
 		} else {
diff --git a/drivers/usb/core/inode.c b/drivers/usb/core/inode.c
index b690aa3..e3ab443 100644
--- a/drivers/usb/core/inode.c
+++ b/drivers/usb/core/inode.c
@@ -347,10 +347,13 @@ static int usbfs_empty (struct dentry *dentry)
 
 	list_for_each(list, &dentry->d_subdirs) {
 		struct dentry *de = list_entry(list, struct dentry, d_u.d_child);
+		spin_lock(&de->d_lock);
 		if (usbfs_positive(de)) {
+			spin_unlock(&de->d_lock);
 			spin_unlock(&dcache_lock);
 			return 0;
 		}
+		spin_unlock(&de->d_lock);
 	}
 
 	spin_unlock(&dcache_lock);
diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index 3d283ab..3912dcf 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -254,19 +254,6 @@ static inline int simple_positive(struct dentry *dentry)
 	return dentry->d_inode && !d_unhashed(dentry);
 }
 
-static inline int __simple_empty(struct dentry *dentry)
-{
-	struct dentry *child;
-	int ret = 0;
-
-	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
-		if (simple_positive(child))
-			goto out;
-	ret = 1;
-out:
-	return ret;
-}
-
 static inline void autofs4_add_expiring(struct dentry *dentry)
 {
 	struct autofs_sb_info *sbi = autofs4_sbi(dentry->d_sb);
diff --git a/fs/autofs4/expire.c b/fs/autofs4/expire.c
index 413b564..ee64020 100644
--- a/fs/autofs4/expire.c
+++ b/fs/autofs4/expire.c
@@ -160,14 +160,18 @@ static int autofs4_tree_busy(struct vfsmount *mnt,
 
 	spin_lock(&dcache_lock);
 	for (p = top; p; p = next_dentry(p, top)) {
+		spin_lock(&p->d_lock);
 		/* Negative dentry - give up */
-		if (!simple_positive(p))
+		if (!simple_positive(p)) {
+			spin_unlock(&p->d_lock);
 			continue;
+		}
 
 		DPRINTK("dentry %p %.*s",
 			p, (int) p->d_name.len, p->d_name.name);
 
-		p = dget(p);
+		p = dget_dlock(p);
+		spin_unlock(&p->d_lock);
 		spin_unlock(&dcache_lock);
 
 		/*
@@ -228,14 +232,18 @@ static struct dentry *autofs4_check_leaves(struct vfsmount *mnt,
 
 	spin_lock(&dcache_lock);
 	for (p = parent; p; p = next_dentry(p, parent)) {
+		spin_lock(&p->d_lock);
 		/* Negative dentry - give up */
-		if (!simple_positive(p))
+		if (!simple_positive(p)) {
+			spin_unlock(&p->d_lock);
 			continue;
+		}
 
 		DPRINTK("dentry %p %.*s",
 			p, (int) p->d_name.len, p->d_name.name);
 
-		p = dget(p);
+		p = dget_dlock(p);
+		spin_unlock(&p->d_lock);
 		spin_unlock(&dcache_lock);
 
 		if (d_mountpoint(p)) {
@@ -324,12 +332,15 @@ struct dentry *autofs4_expire_indirect(struct super_block *sb,
 		struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);
 
 		/* Negative dentry - give up */
+		spin_lock(&dentry->d_lock);
 		if (!simple_positive(dentry)) {
 			next = next->next;
+			spin_unlock(&dentry->d_lock);
 			continue;
 		}
 
-		dentry = dget(dentry);
+		dentry = dget_dlock(dentry);
+		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 
 		spin_lock(&sbi->fs_lock);
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 599b011..9d52531 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -135,6 +135,7 @@ more:
 			fi->at_end = 1;
 			goto out_unlock;
 		}
+		spin_lock(&dentry->d_lock);
 		if (!d_unhashed(dentry) && dentry->d_inode &&
 		    ceph_snap(dentry->d_inode) != CEPH_SNAPDIR &&
 		    ceph_ino(dentry->d_inode) != CEPH_INO_CEPH &&
@@ -144,13 +145,13 @@ more:
 		     dentry->d_name.len, dentry->d_name.name, di->offset,
 		     filp->f_pos, d_unhashed(dentry) ? " unhashed" : "",
 		     !dentry->d_inode ? " null" : "");
+		spin_unlock(&dentry->d_lock);
 		p = p->prev;
 		dentry = list_entry(p, struct dentry, d_u.d_child);
 		di = ceph_dentry(dentry);
 	}
 
-	spin_lock(&dentry->d_lock);
-	dentry->d_count++;
+	dget_dlock(dentry);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 
diff --git a/fs/configfs/configfs_internal.h b/fs/configfs/configfs_internal.h
index da6061a..e58b4c3 100644
--- a/fs/configfs/configfs_internal.h
+++ b/fs/configfs/configfs_internal.h
@@ -121,6 +121,7 @@ static inline struct config_item *configfs_get_config_item(struct dentry *dentry
 	struct config_item * item = NULL;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	if (!d_unhashed(dentry)) {
 		struct configfs_dirent * sd = dentry->d_fsdata;
 		if (sd->s_type & CONFIGFS_ITEM_LINK) {
@@ -129,6 +130,7 @@ static inline struct config_item *configfs_get_config_item(struct dentry *dentry
 		} else
 			item = config_item_get(sd->s_element);
 	}
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 
 	return item;
diff --git a/fs/dcache.c b/fs/dcache.c
index 2e04131..05b2257 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -46,6 +46,7 @@
  *   - d_name
  *   - d_lru
  *   - d_count
+ *   - d_unhashed()
  *
  * Ordering:
  * dcache_lock
@@ -53,6 +54,13 @@
  *     dcache_lru_lock
  *     dcache_hash_lock
  *
+ * If there is an ancestor relationship:
+ * dentry->d_parent->...->d_parent->d_lock
+ *   ...
+ *     dentry->d_parent->d_lock
+ *       dentry->d_lock
+ *
+ * If no ancestor relationship:
  * if (dentry1 < dentry2)
  *   dentry1->d_lock
  *     dentry2->d_lock
@@ -337,7 +345,9 @@ int d_invalidate(struct dentry * dentry)
 	 * If it's already been dropped, return OK.
 	 */
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	if (d_unhashed(dentry)) {
+		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 		return 0;
 	}
@@ -346,9 +356,11 @@ int d_invalidate(struct dentry * dentry)
 	 * to get rid of unused child entries.
 	 */
 	if (!list_empty(&dentry->d_subdirs)) {
+		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 		shrink_dcache_parent(dentry);
 		spin_lock(&dcache_lock);
+		spin_lock(&dentry->d_lock);
 	}
 
 	/*
@@ -361,7 +373,6 @@ int d_invalidate(struct dentry * dentry)
 	 * we might still populate it if it was a
 	 * working directory or similar).
 	 */
-	spin_lock(&dentry->d_lock);
 	if (dentry->d_count > 1) {
 		if (dentry->d_inode && S_ISDIR(dentry->d_inode->i_mode)) {
 			spin_unlock(&dentry->d_lock);
@@ -448,35 +459,49 @@ EXPORT_SYMBOL(dget_parent);
  * any other hashed alias over that one unless @want_discon is set,
  * in which case only return an IS_ROOT, DCACHE_DISCONNECTED alias.
  */
-
-static struct dentry * __d_find_alias(struct inode *inode, int want_discon)
+static struct dentry *___d_find_alias(struct inode *inode, int want_discon)
 {
-	struct list_head *head, *next, *tmp;
-	struct dentry *alias, *discon_alias=NULL;
+	struct dentry *alias, *discon_alias;
 
-	head = &inode->i_dentry;
-	next = inode->i_dentry.next;
-	while (next != head) {
-		tmp = next;
-		next = tmp->next;
-		prefetch(next);
-		alias = list_entry(tmp, struct dentry, d_alias);
+again:
+	discon_alias = NULL;
+	list_for_each_entry(alias, &inode->i_dentry, d_alias) {
+		spin_lock(&alias->d_lock);
  		if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
 			if (IS_ROOT(alias) &&
 			    (alias->d_flags & DCACHE_DISCONNECTED))
 				discon_alias = alias;
-			else if (!want_discon) {
-				__dget_locked(alias);
+			else if (!want_discon)
 				return alias;
-			}
 		}
+		spin_unlock(&alias->d_lock);
 	}
-	if (discon_alias)
-		__dget_locked(discon_alias);
-	return discon_alias;
+	if (discon_alias) {
+		alias = discon_alias;
+		spin_lock(&alias->d_lock);
+		if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
+			if (IS_ROOT(alias) &&
+			    (alias->d_flags & DCACHE_DISCONNECTED))
+				return alias;
+		}
+		spin_unlock(&alias->d_lock);
+		goto again;
+	}
+	return NULL;
 }
 
-struct dentry * d_find_alias(struct inode *inode)
+static struct dentry *__d_find_alias(struct inode *inode, int want_discon)
+{
+	struct dentry *alias;
+	alias = ___d_find_alias(inode, want_discon);
+	if (alias) {
+		__dget_locked_dlock(alias);
+		spin_unlock(&alias->d_lock);
+	}
+	return alias;
+}
+
+struct dentry *d_find_alias(struct inode *inode)
 {
 	struct dentry *de = NULL;
 
@@ -759,8 +784,8 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	dentry_lru_del(dentry);
-	spin_unlock(&dentry->d_lock);
 	__d_drop(dentry);
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 
 	for (;;) {
@@ -775,8 +800,8 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 					    d_u.d_child) {
 				spin_lock(&loop->d_lock);
 				dentry_lru_del(loop);
-				spin_unlock(&loop->d_lock);
 				__d_drop(loop);
+				spin_unlock(&loop->d_lock);
 				cond_resched_lock(&dcache_lock);
 			}
 			spin_unlock(&dcache_lock);
@@ -1797,7 +1822,10 @@ static void d_move_locked(struct dentry * dentry, struct dentry * target)
 	/*
 	 * XXXX: do we really need to take target->d_lock?
 	 */
-	if (target < dentry) {
+	if (d_ancestor(dentry, target)) {
+		spin_lock(&dentry->d_lock);
+		spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NESTED);
+	} else if (d_ancestor(target, dentry) || target < dentry) {
 		spin_lock(&target->d_lock);
 		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
 	} else {
@@ -2476,15 +2504,18 @@ resume:
 		struct list_head *tmp = next;
 		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
-		if (d_unhashed(dentry)||!dentry->d_inode)
+		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+		if (d_unhashed(dentry) || !dentry->d_inode) {
+			spin_unlock(&dentry->d_lock);
 			continue;
+		}
 		if (!list_empty(&dentry->d_subdirs)) {
+			spin_unlock(&dentry->d_lock);
 			this_parent = dentry;
 			goto repeat;
 		}
-		spin_lock(&dentry->d_lock);
-		dentry->d_count--;
-		spin_unlock(&dentry->d_lock);
+		dentry->d_count--;
+		spin_unlock(&dentry->d_lock);
 	}
 	if (this_parent != root) {
 		next = this_parent->d_u.d_child.next;
diff --git a/fs/libfs.c b/fs/libfs.c
index b9d25d8..433e713 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -16,6 +16,11 @@
 
 #include <asm/uaccess.h>
 
+static inline int simple_positive(struct dentry *dentry)
+{
+	return dentry->d_inode && !d_unhashed(dentry);
+}
+
 int simple_getattr(struct vfsmount *mnt, struct dentry *dentry,
 		   struct kstat *stat)
 {
@@ -100,8 +105,10 @@ loff_t dcache_dir_lseek(struct file *file, loff_t offset, int origin)
 			while (n && p != &file->f_path.dentry->d_subdirs) {
 				struct dentry *next;
 				next = list_entry(p, struct dentry, d_u.d_child);
-				if (!d_unhashed(next) && next->d_inode)
+				spin_lock(&next->d_lock);
+				if (simple_positive(next))
 					n--;
+				spin_unlock(&next->d_lock);
 				p = p->next;
 			}
 			list_add_tail(&cursor->d_u.d_child, p);
@@ -155,9 +162,13 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
 			for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
 				struct dentry *next;
 				next = list_entry(p, struct dentry, d_u.d_child);
-				if (d_unhashed(next) || !next->d_inode)
+				spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
+				if (!simple_positive(next)) {
+					spin_unlock(&next->d_lock);
 					continue;
+				}
 
+				spin_unlock(&next->d_lock);
 				spin_unlock(&dcache_lock);
 				if (filldir(dirent, next->d_name.name, 
 					    next->d_name.len, filp->f_pos, 
@@ -259,20 +270,20 @@ int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *den
 	return 0;
 }
 
-static inline int simple_positive(struct dentry *dentry)
-{
-	return dentry->d_inode && !d_unhashed(dentry);
-}
-
 int simple_empty(struct dentry *dentry)
 {
 	struct dentry *child;
 	int ret = 0;
 
 	spin_lock(&dcache_lock);
-	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
-		if (simple_positive(child))
+	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child) {
+		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+		if (simple_positive(child)) {
+			spin_unlock(&child->d_lock);
 			goto out;
+		}
+		spin_unlock(&child->d_lock);
+	}
 	ret = 1;
 out:
 	spin_unlock(&dcache_lock);
diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c
index edaded4..107d0f1 100644
--- a/fs/ocfs2/dcache.c
+++ b/fs/ocfs2/dcache.c
@@ -174,13 +174,16 @@ struct dentry *ocfs2_find_local_alias(struct inode *inode,
 	list_for_each(p, &inode->i_dentry) {
 		dentry = list_entry(p, struct dentry, d_alias);
 
+		spin_lock(&dentry->d_lock);
 		if (ocfs2_match_dentry(dentry, parent_blkno, skip_unhashed)) {
 			mlog(0, "dentry found: %.*s\n",
 			     dentry->d_name.len, dentry->d_name.name);
 
-			dget_locked(dentry);
+			dget_locked_dlock(dentry);
+			spin_unlock(&dentry->d_lock);
 			break;
 		}
+		spin_unlock(&dentry->d_lock);
 
 		dentry = NULL;
 	}
diff --git a/security/tomoyo/realpath.c b/security/tomoyo/realpath.c
index 1d0bf8f..d1e05b0 100644
--- a/security/tomoyo/realpath.c
+++ b/security/tomoyo/realpath.c
@@ -14,6 +14,7 @@
 #include <linux/slab.h>
 #include <net/sock.h>
 #include "common.h"
+#include "../../fs/internal.h"
 
 /**
  * tomoyo_encode: Convert binary string to ascii string.
-- 
1.7.1


* [PATCH 15/46] fs: dcache scale subdirs
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (12 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 14/46] fs: dcache scale d_unhashed Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 16/46] fs: scale inode alias list Nick Piggin
                   ` (35 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (e.g. using i_mutex).

Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
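As a rough illustration of the locking pattern this patch applies, the
user-space sketch below models a parent/child pair. Walking down takes the
parent's d_lock and then the child's, matching the ancestor ordering
documented in fs/dcache.c. Walking up from a child (as dput() and
prune_one_dentry() do below) cannot block on the parent's lock without
inverting that order, so it uses a trylock and backs off on failure.
Everything here is an assumption made for the sketch: struct toy_dentry and
the toy_* helpers are hypothetical, pthread mutexes stand in for spinlocks, a
counter stands in for the d_subdirs list, and a NULL d_parent marks the root
(the kernel's IS_ROOT() instead tests d_parent == dentry).

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct dentry; not the kernel type. */
struct toy_dentry {
	pthread_mutex_t d_lock;		/* models dentry->d_lock */
	struct toy_dentry *d_parent;	/* NULL marks the root in this toy */
	int nr_children;		/* stands in for the d_subdirs list */
	int linked;			/* 1 while counted in the parent */
};

/* Create a dentry; if it has a parent, link it in parent-then-child order. */
static void toy_init(struct toy_dentry *d, struct toy_dentry *parent)
{
	assert(d != NULL);
	pthread_mutex_init(&d->d_lock, NULL);
	d->d_parent = parent;
	d->nr_children = 0;
	d->linked = 0;
	if (parent) {
		pthread_mutex_lock(&parent->d_lock);	/* ancestor first */
		pthread_mutex_lock(&d->d_lock);		/* then descendant */
		parent->nr_children++;
		d->linked = 1;
		pthread_mutex_unlock(&d->d_lock);
		pthread_mutex_unlock(&parent->d_lock);
	}
}

/*
 * Unlink a child starting from the child itself. The child's lock is taken
 * first, so the parent's lock may only be tried, never waited for; on
 * failure both are released and the attempt restarts.
 * Returns 1 if the child was linked and has now been removed, else 0.
 */
static int toy_unlink_child(struct toy_dentry *child)
{
	struct toy_dentry *parent;
	int was_linked;

	assert(child != NULL);
	for (;;) {
		pthread_mutex_lock(&child->d_lock);
		parent = child->d_parent;
		if (!parent || pthread_mutex_trylock(&parent->d_lock) == 0)
			break;
		pthread_mutex_unlock(&child->d_lock);
		sched_yield();		/* plays the role of cpu_relax() */
	}

	was_linked = child->linked;
	if (parent && was_linked) {
		parent->nr_children--;
		child->linked = 0;
	}
	pthread_mutex_unlock(&child->d_lock);
	if (parent)
		pthread_mutex_unlock(&parent->d_lock);
	return was_linked;
}
```

The retry loop is the cost of keeping one regular ordering rule. The note in
the changelog about protecting ->d_child only with ->d_parent->d_lock is
about avoiding some of this double locking.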
 drivers/staging/smbfs/cache.c |    4 +
 drivers/usb/core/inode.c      |    8 +-
 fs/autofs4/autofs_i.h         |   11 +++
 fs/autofs4/expire.c           |  129 +++++++++++++--------------
 fs/autofs4/root.c             |   18 ++++-
 fs/ceph/dir.c                 |    6 +-
 fs/ceph/inode.c               |    8 ++-
 fs/coda/cache.c               |    2 +
 fs/dcache.c                   |  195 +++++++++++++++++++++++++++++++----------
 fs/libfs.c                    |   24 ++++--
 fs/ncpfs/dir.c                |    3 +
 fs/ncpfs/ncplib_kernel.h      |    4 +
 fs/notify/fsnotify.c          |    4 +-
 include/linux/dcache.h        |    1 +
 kernel/cgroup.c               |   19 ++++-
 security/selinux/selinuxfs.c  |   12 ++-
 16 files changed, 314 insertions(+), 134 deletions(-)

diff --git a/drivers/staging/smbfs/cache.c b/drivers/staging/smbfs/cache.c
index dbb9865..29afb05 100644
--- a/drivers/staging/smbfs/cache.c
+++ b/drivers/staging/smbfs/cache.c
@@ -63,6 +63,7 @@ smb_invalidate_dircache_entries(struct dentry *parent)
 	struct dentry *dentry;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
 		dentry = list_entry(next, struct dentry, d_u.d_child);
@@ -70,6 +71,7 @@ smb_invalidate_dircache_entries(struct dentry *parent)
 		smb_age_dentry(server, dentry);
 		next = next->next;
 	}
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 }
 
@@ -97,6 +99,7 @@ smb_dget_fpos(struct dentry *dentry, struct dentry *parent, unsigned long fpos)
 
 	/* If a pointer is invalid, we search the dentry. */
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
 		dent = list_entry(next, struct dentry, d_u.d_child);
@@ -111,6 +114,7 @@ smb_dget_fpos(struct dentry *dentry, struct dentry *parent, unsigned long fpos)
 	}
 	dent = NULL;
 out_unlock:
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 	return dent;
 }
diff --git a/drivers/usb/core/inode.c b/drivers/usb/core/inode.c
index e3ab443..89a0e83 100644
--- a/drivers/usb/core/inode.c
+++ b/drivers/usb/core/inode.c
@@ -344,18 +344,20 @@ static int usbfs_empty (struct dentry *dentry)
 	struct list_head *list;
 
 	spin_lock(&dcache_lock);
-
+	spin_lock(&dentry->d_lock);
 	list_for_each(list, &dentry->d_subdirs) {
 		struct dentry *de = list_entry(list, struct dentry, d_u.d_child);
-		spin_lock(&de->d_lock);
+
+		spin_lock_nested(&de->d_lock, DENTRY_D_LOCK_NESTED);
 		if (usbfs_positive(de)) {
 			spin_unlock(&de->d_lock);
+			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
 			return 0;
 		}
 		spin_unlock(&de->d_lock);
 	}
-
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 	return 1;
 }
diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index 3912dcf..9d2ae9b 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -254,6 +254,17 @@ static inline int simple_positive(struct dentry *dentry)
 	return dentry->d_inode && !d_unhashed(dentry);
 }
 
+static inline void __autofs4_add_expiring(struct dentry *dentry)
+{
+	struct autofs_sb_info *sbi = autofs4_sbi(dentry->d_sb);
+	struct autofs_info *ino = autofs4_dentry_ino(dentry);
+	if (ino) {
+		if (list_empty(&ino->expiring))
+			list_add(&ino->expiring, &sbi->expiring_list);
+	}
+	return;
+}
+
 static inline void autofs4_add_expiring(struct dentry *dentry)
 {
 	struct autofs_sb_info *sbi = autofs4_sbi(dentry->d_sb);
diff --git a/fs/autofs4/expire.c b/fs/autofs4/expire.c
index ee64020..3164802 100644
--- a/fs/autofs4/expire.c
+++ b/fs/autofs4/expire.c
@@ -91,24 +91,64 @@ done:
 }
 
 /*
- * Calculate next entry in top down tree traversal.
- * From next_mnt in namespace.c - elegant.
+ * Calculate and dget next entry in top down tree traversal.
  */
-static struct dentry *next_dentry(struct dentry *p, struct dentry *root)
+static struct dentry *get_next_positive_dentry(struct dentry *prev,
+						struct dentry *root)
 {
-	struct list_head *next = p->d_subdirs.next;
+	struct list_head *next;
+	struct dentry *p, *ret;
+
+	if (prev == NULL)
+		return dget(root);
 
+	spin_lock(&dcache_lock);
+relock:
+	p = prev;
+	spin_lock(&p->d_lock);
+again:
+	next = p->d_subdirs.next;
 	if (next == &p->d_subdirs) {
 		while (1) {
-			if (p == root)
+			struct dentry *parent;
+
+			if (p == root) {
+				spin_unlock(&p->d_lock);
+				spin_unlock(&dcache_lock);
+				dput(prev);
 				return NULL;
+			}
+
+			parent = p->d_parent;
+			if (!spin_trylock(&parent->d_lock)) {
+				spin_unlock(&p->d_lock);
+				cpu_relax();
+				goto relock;
+			}
+			spin_unlock(&p->d_lock);
 			next = p->d_u.d_child.next;
-			if (next != &p->d_parent->d_subdirs)
+			p = parent;
+			if (next != &parent->d_subdirs)
 				break;
-			p = p->d_parent;
 		}
 	}
-	return list_entry(next, struct dentry, d_u.d_child);
+	ret = list_entry(next, struct dentry, d_u.d_child);
+
+	spin_lock_nested(&ret->d_lock, DENTRY_D_LOCK_NESTED);
+	/* Negative dentry - try next */
+	if (!simple_positive(ret)) {
+		spin_unlock(&ret->d_lock);
+		p = ret;
+		goto again;
+	}
+	dget_dlock(ret);
+	spin_unlock(&ret->d_lock);
+	spin_unlock(&p->d_lock);
+	spin_unlock(&dcache_lock);
+
+	dput(prev);
+
+	return ret;
 }
 
 /*
@@ -158,22 +198,11 @@ static int autofs4_tree_busy(struct vfsmount *mnt,
 	if (!simple_positive(top))
 		return 1;
 
-	spin_lock(&dcache_lock);
-	for (p = top; p; p = next_dentry(p, top)) {
-		spin_lock(&p->d_lock);
-		/* Negative dentry - give up */
-		if (!simple_positive(p)) {
-			spin_unlock(&p->d_lock);
-			continue;
-		}
-
+	p = NULL;
+	while ((p = get_next_positive_dentry(p, top))) {
 		DPRINTK("dentry %p %.*s",
 			p, (int) p->d_name.len, p->d_name.name);
 
-		p = dget_dlock(p);
-		spin_unlock(&p->d_lock);
-		spin_unlock(&dcache_lock);
-
 		/*
 		 * Is someone visiting anywhere in the subtree ?
 		 * If there's no mount we need to check the usage
@@ -208,10 +237,7 @@ static int autofs4_tree_busy(struct vfsmount *mnt,
 				return 1;
 			}
 		}
-		dput(p);
-		spin_lock(&dcache_lock);
 	}
-	spin_unlock(&dcache_lock);
 
 	/* Timeout of a tree mount is ultimately determined by its top dentry */
 	if (!autofs4_can_expire(top, timeout, do_now))
@@ -230,36 +256,21 @@ static struct dentry *autofs4_check_leaves(struct vfsmount *mnt,
 	DPRINTK("parent %p %.*s",
 		parent, (int)parent->d_name.len, parent->d_name.name);
 
-	spin_lock(&dcache_lock);
-	for (p = parent; p; p = next_dentry(p, parent)) {
-		spin_lock(&p->d_lock);
-		/* Negative dentry - give up */
-		if (!simple_positive(p)) {
-			spin_unlock(&p->d_lock);
-			continue;
-		}
-
+	p = NULL;
+	while ((p = get_next_positive_dentry(p, parent))) {
 		DPRINTK("dentry %p %.*s",
 			p, (int) p->d_name.len, p->d_name.name);
 
-		p = dget_dlock(p);
-		spin_unlock(&p->d_lock);
-		spin_unlock(&dcache_lock);
-
 		if (d_mountpoint(p)) {
 			/* Can we umount this guy */
 			if (autofs4_mount_busy(mnt, p))
-				goto cont;
+				continue;
 
 			/* Can we expire this guy */
 			if (autofs4_can_expire(p, timeout, do_now))
 				return p;
 		}
-cont:
-		dput(p);
-		spin_lock(&dcache_lock);
 	}
-	spin_unlock(&dcache_lock);
 	return NULL;
 }
 
@@ -310,8 +321,8 @@ struct dentry *autofs4_expire_indirect(struct super_block *sb,
 {
 	unsigned long timeout;
 	struct dentry *root = sb->s_root;
+	struct dentry *dentry;
 	struct dentry *expired = NULL;
-	struct list_head *next;
 	int do_now = how & AUTOFS_EXP_IMMEDIATE;
 	int exp_leaves = how & AUTOFS_EXP_LEAVES;
 	struct autofs_info *ino;
@@ -323,26 +334,8 @@ struct dentry *autofs4_expire_indirect(struct super_block *sb,
 	now = jiffies;
 	timeout = sbi->exp_timeout;
 
-	spin_lock(&dcache_lock);
-	next = root->d_subdirs.next;
-
-	/* On exit from the loop expire is set to a dgot dentry
-	 * to expire or it's NULL */
-	while ( next != &root->d_subdirs ) {
-		struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);
-
-		/* Negative dentry - give up */
-		spin_lock(&dentry->d_lock);
-		if (!simple_positive(dentry)) {
-			next = next->next;
-			spin_unlock(&dentry->d_lock);
-			continue;
-		}
-
-		dentry = dget_dlock(dentry);
-		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
-
+	dentry = NULL;
+	while ((dentry = get_next_positive_dentry(dentry, root))) {
 		spin_lock(&sbi->fs_lock);
 		ino = autofs4_dentry_ino(dentry);
 
@@ -405,11 +398,7 @@ struct dentry *autofs4_expire_indirect(struct super_block *sb,
 		}
 next:
 		spin_unlock(&sbi->fs_lock);
-		dput(dentry);
-		spin_lock(&dcache_lock);
-		next = next->next;
 	}
-	spin_unlock(&dcache_lock);
 	return NULL;
 
 found:
@@ -420,7 +409,11 @@ found:
 	init_completion(&ino->expire_complete);
 	spin_unlock(&sbi->fs_lock);
 	spin_lock(&dcache_lock);
-	list_move(&expired->d_parent->d_subdirs, &expired->d_u.d_child);
+	spin_lock(&expired->d_parent->d_lock);
+	spin_lock_nested(&expired->d_lock, DENTRY_D_LOCK_NESTED);
+	list_move(&expired->d_parent->d_subdirs, &expired->d_u.d_child);
+	spin_unlock(&expired->d_lock);
+	spin_unlock(&expired->d_parent->d_lock);
 	spin_unlock(&dcache_lock);
 	return expired;
 }
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index dc7e64b..a185e7e 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -143,10 +143,13 @@ static int autofs4_dir_open(struct inode *inode, struct file *file)
 	 * it.
 	 */
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	if (!d_mountpoint(dentry) && list_empty(&dentry->d_subdirs)) {
+		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 		return -ENOENT;
 	}
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 
 out:
@@ -253,7 +256,9 @@ static void *autofs4_follow_link(struct dentry *dentry, struct nameidata *nd)
 	lookup_type = autofs4_need_mount(nd->flags);
 	spin_lock(&sbi->fs_lock);
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	if (!(lookup_type || ino->flags & AUTOFS_INF_PENDING)) {
+		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 		spin_unlock(&sbi->fs_lock);
 		goto follow;
@@ -266,6 +271,7 @@ static void *autofs4_follow_link(struct dentry *dentry, struct nameidata *nd)
 	 */
 	if (ino->flags & AUTOFS_INF_PENDING ||
 	    (!d_mountpoint(dentry) && list_empty(&dentry->d_subdirs))) {
+		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 		spin_unlock(&sbi->fs_lock);
 
@@ -275,6 +281,7 @@ static void *autofs4_follow_link(struct dentry *dentry, struct nameidata *nd)
 
 		goto follow;
 	}
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 	spin_unlock(&sbi->fs_lock);
 follow:
@@ -347,10 +354,12 @@ static int autofs4_revalidate(struct dentry *dentry, struct nameidata *nd)
 
 	/* Check for a non-mountpoint directory with no contents */
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	if (S_ISDIR(dentry->d_inode->i_mode) &&
 	    !d_mountpoint(dentry) && list_empty(&dentry->d_subdirs)) {
 		DPRINTK("dentry=%p %.*s, emptydir",
 			 dentry, dentry->d_name.len, dentry->d_name.name);
+		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
 
 		/* The daemon never causes a mount to trigger */
@@ -367,6 +376,7 @@ static int autofs4_revalidate(struct dentry *dentry, struct nameidata *nd)
 
 		return status;
 	}
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 
 	return 1;
@@ -776,12 +786,16 @@ static int autofs4_dir_rmdir(struct inode *dir, struct dentry *dentry)
 		return -EACCES;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&sbi->lookup_lock);
+	spin_lock(&dentry->d_lock);
 	if (!list_empty(&dentry->d_subdirs)) {
+		spin_unlock(&dentry->d_lock);
+		spin_unlock(&sbi->lookup_lock);
 		spin_unlock(&dcache_lock);
 		return -ENOTEMPTY;
 	}
-	autofs4_add_expiring(dentry);
-	spin_lock(&dentry->d_lock);
+	__autofs4_add_expiring(dentry);
+	spin_unlock(&sbi->lookup_lock);
 	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 9d52531..4c81ae5 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -112,6 +112,7 @@ static int __dcache_readdir(struct file *filp,
 	     last);
 
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 
 	/* start at beginning? */
 	if (filp->f_pos == 2 || (last &&
@@ -135,7 +136,7 @@ more:
 			fi->at_end = 1;
 			goto out_unlock;
 		}
-		spin_lock(&dentry->d_lock);
+		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
 		if (!d_unhashed(dentry) && dentry->d_inode &&
 		    ceph_snap(dentry->d_inode) != CEPH_SNAPDIR &&
 		    ceph_ino(dentry->d_inode) != CEPH_INO_CEPH &&
@@ -153,6 +154,7 @@ more:
 
 	dget_dlock(dentry);
 	spin_unlock(&dentry->d_lock);
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 
 	dout(" %llu (%llu) dentry %p %.*s %p\n", di->offset, filp->f_pos,
@@ -187,10 +189,12 @@ more:
 	}
 
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 	p = p->prev;	/* advance to next dentry */
 	goto more;
 
 out_unlock:
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 out:
 	if (last)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index bb68c79..2c69444 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -842,11 +842,13 @@ static void ceph_set_dentry_offset(struct dentry *dn)
 	spin_unlock(&inode->i_lock);
 
 	spin_lock(&dcache_lock);
-	spin_lock(&dn->d_lock);
+	spin_lock(&dir->d_lock);
+	spin_lock_nested(&dn->d_lock, DENTRY_D_LOCK_NESTED);
 	list_move(&dn->d_u.d_child, &dir->d_subdirs);
 	dout("set_dentry_offset %p %lld (%p %p)\n", dn, di->offset,
 	     dn->d_u.d_child.prev, dn->d_u.d_child.next);
 	spin_unlock(&dn->d_lock);
+	spin_unlock(&dir->d_lock);
 	spin_unlock(&dcache_lock);
 }
 
@@ -1232,9 +1234,11 @@ retry_lookup:
 		} else {
 			/* reorder parent's d_subdirs */
 			spin_lock(&dcache_lock);
-			spin_lock(&dn->d_lock);
+			spin_lock(&parent->d_lock);
+			spin_lock_nested(&dn->d_lock, DENTRY_D_LOCK_NESTED);
 			list_move(&dn->d_u.d_child, &parent->d_subdirs);
 			spin_unlock(&dn->d_lock);
+			spin_unlock(&parent->d_lock);
 			spin_unlock(&dcache_lock);
 		}
 
diff --git a/fs/coda/cache.c b/fs/coda/cache.c
index 9060f08..859393f 100644
--- a/fs/coda/cache.c
+++ b/fs/coda/cache.c
@@ -94,6 +94,7 @@ static void coda_flag_children(struct dentry *parent, int flag)
 	struct dentry *de;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 	list_for_each(child, &parent->d_subdirs)
 	{
 		de = list_entry(child, struct dentry, d_u.d_child);
@@ -102,6 +103,7 @@ static void coda_flag_children(struct dentry *parent, int flag)
 			continue;
 		coda_flag_inode(de->d_inode, flag);
 	}
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 	return; 
 }
diff --git a/fs/dcache.c b/fs/dcache.c
index 05b2257..c79f3d4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -47,6 +47,8 @@
  *   - d_lru
  *   - d_count
  *   - d_unhashed()
+ *   - d_parent and d_subdirs
+ *   - children's d_child and d_parent
  *
  * Ordering:
  * dcache_lock
@@ -217,24 +219,22 @@ static void dentry_lru_move_tail(struct dentry *dentry)
  *
  * If this is the root of the dentry tree, return NULL.
  *
- * dcache_lock and d_lock must be held by caller, are dropped by d_kill.
+ * dcache_lock and d_lock and d_parent->d_lock must be held by caller, and
+ * are dropped by d_kill.
  */
-static struct dentry *d_kill(struct dentry *dentry)
+static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
 	__releases(dentry->d_lock)
+	__releases(parent->d_lock)
 	__releases(dcache_lock)
 {
-	struct dentry *parent;
-
 	list_del(&dentry->d_u.d_child);
+	if (parent)
+		spin_unlock(&parent->d_lock);
 	dentry_iput(dentry);
 	/*
 	 * dentry_iput drops the locks, at which point nobody (except
 	 * transient RCU lookups) can reach this dentry.
 	 */
-	if (IS_ROOT(dentry))
-		parent = NULL;
-	else
-		parent = dentry->d_parent;
 	d_free(dentry);
 	return parent;
 }
@@ -270,6 +270,7 @@ static struct dentry *d_kill(struct dentry *dentry)
 
 void dput(struct dentry *dentry)
 {
+	struct dentry *parent;
 	if (!dentry)
 		return;
 
@@ -277,6 +278,10 @@ repeat:
 	if (dentry->d_count == 1)
 		might_sleep();
 	spin_lock(&dentry->d_lock);
+	if (IS_ROOT(dentry))
+		parent = NULL;
+	else
+		parent = dentry->d_parent;
 	if (dentry->d_count == 1) {
 		if (!spin_trylock(&dcache_lock)) {
 			/*
@@ -288,10 +293,17 @@ repeat:
 			spin_unlock(&dentry->d_lock);
 			goto repeat;
 		}
+		if (parent && !spin_trylock(&parent->d_lock)) {
+			spin_unlock(&dentry->d_lock);
+			spin_unlock(&dcache_lock);
+			goto repeat;
+		}
 	}
 	dentry->d_count--;
 	if (dentry->d_count) {
 		spin_unlock(&dentry->d_lock);
+		if (parent)
+			spin_unlock(&parent->d_lock);
 		spin_unlock(&dcache_lock);
 		return;
 	}
@@ -313,6 +325,8 @@ repeat:
 	dentry_lru_add(dentry);
 
  	spin_unlock(&dentry->d_lock);
+	if (parent)
+		spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 	return;
 
@@ -321,7 +335,7 @@ unhash_it:
 kill_it:
 	/* if dentry was on the d_lru list delete it from there */
 	dentry_lru_del(dentry);
-	dentry = d_kill(dentry);
+	dentry = d_kill(dentry, parent);
 	if (dentry)
 		goto repeat;
 }
@@ -547,12 +561,13 @@ EXPORT_SYMBOL(d_prune_aliases);
  * quadratic behavior of shrink_dcache_parent(), but is also expected
  * to be beneficial in reducing dentry cache fragmentation.
  */
-static void prune_one_dentry(struct dentry * dentry)
+static void prune_one_dentry(struct dentry *dentry, struct dentry *parent)
 	__releases(dentry->d_lock)
+	__releases(parent->d_lock)
 	__releases(dcache_lock)
 {
 	__d_drop(dentry);
-	dentry = d_kill(dentry);
+	dentry = d_kill(dentry, parent);
 
 	/*
 	 * Prune ancestors.  Locking is simpler than in dput(),
@@ -560,9 +575,20 @@ static void prune_one_dentry(struct dentry * dentry)
 	 */
 	while (dentry) {
 		spin_lock(&dcache_lock);
+again:
 		spin_lock(&dentry->d_lock);
+		if (IS_ROOT(dentry))
+			parent = NULL;
+		else
+			parent = dentry->d_parent;
+		if (parent && !spin_trylock(&parent->d_lock)) {
+			spin_unlock(&dentry->d_lock);
+			goto again;
+		}
 		dentry->d_count--;
 		if (dentry->d_count) {
+			if (parent)
+				spin_unlock(&parent->d_lock);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
 			return;
@@ -570,7 +596,7 @@ static void prune_one_dentry(struct dentry * dentry)
 
 		dentry_lru_del(dentry);
 		__d_drop(dentry);
-		dentry = d_kill(dentry);
+		dentry = d_kill(dentry, parent);
 	}
 }
 
@@ -579,29 +605,40 @@ static void shrink_dentry_list(struct list_head *list)
 	struct dentry *dentry;
 
 	while (!list_empty(list)) {
+		struct dentry *parent;
+
 		dentry = list_entry(list->prev, struct dentry, d_lru);
 
 		if (!spin_trylock(&dentry->d_lock)) {
+relock:
 			spin_unlock(&dcache_lru_lock);
 			cpu_relax();
 			spin_lock(&dcache_lru_lock);
 			continue;
 		}
 
-		__dentry_lru_del(dentry);
-
 		/*
 		 * We found an inuse dentry which was not removed from
 		 * the LRU because of laziness during lookup.  Do not free
 		 * it - just keep it off the LRU list.
 		 */
 		if (dentry->d_count) {
+			__dentry_lru_del(dentry);
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
+		if (IS_ROOT(dentry))
+			parent = NULL;
+		else
+			parent = dentry->d_parent;
+		if (parent && !spin_trylock(&parent->d_lock)) {
+			spin_unlock(&dentry->d_lock);
+			goto relock;
+		}
+		__dentry_lru_del(dentry);
 		spin_unlock(&dcache_lru_lock);
 
-		prune_one_dentry(dentry);
+		prune_one_dentry(dentry, parent);
 		/* dcache_lock and dentry->d_lock dropped */
 		spin_lock(&dcache_lock);
 		spin_lock(&dcache_lru_lock);
@@ -796,14 +833,16 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 			/* this is a branch with children - detach all of them
 			 * from the system in one go */
 			spin_lock(&dcache_lock);
+			spin_lock(&dentry->d_lock);
 			list_for_each_entry(loop, &dentry->d_subdirs,
 					    d_u.d_child) {
-				spin_lock(&loop->d_lock);
+				spin_lock_nested(&loop->d_lock,
+						DENTRY_D_LOCK_NESTED);
 				dentry_lru_del(loop);
 				__d_drop(loop);
 				spin_unlock(&loop->d_lock);
-				cond_resched_lock(&dcache_lock);
 			}
+			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
 
 			/* move to the first child */
@@ -831,16 +870,17 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 				BUG();
 			}
 
-			if (IS_ROOT(dentry))
+			if (IS_ROOT(dentry)) {
 				parent = NULL;
-			else {
+				list_del(&dentry->d_u.d_child);
+			} else {
 				parent = dentry->d_parent;
 				spin_lock(&parent->d_lock);
 				parent->d_count--;
+				list_del(&dentry->d_u.d_child);
 				spin_unlock(&parent->d_lock);
 			}
 
-			list_del(&dentry->d_u.d_child);
 			detached++;
 
 			inode = dentry->d_inode;
@@ -921,6 +961,7 @@ int have_submounts(struct dentry *parent)
 	spin_lock(&dcache_lock);
 	if (d_mountpoint(parent))
 		goto positive;
+	spin_lock(&this_parent->d_lock);
 repeat:
 	next = this_parent->d_subdirs.next;
 resume:
@@ -928,22 +969,34 @@ resume:
 		struct list_head *tmp = next;
 		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
+
+		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
 		/* Have we found a mount point ? */
-		if (d_mountpoint(dentry))
+		if (d_mountpoint(dentry)) {
+			spin_unlock(&dentry->d_lock);
+			spin_unlock(&this_parent->d_lock);
 			goto positive;
+		}
 		if (!list_empty(&dentry->d_subdirs)) {
+			spin_unlock(&this_parent->d_lock);
+			spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
 			this_parent = dentry;
+			spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
 			goto repeat;
 		}
+		spin_unlock(&dentry->d_lock);
 	}
 	/*
 	 * All done at this level ... ascend and resume the search.
 	 */
 	if (this_parent != parent) {
 		next = this_parent->d_u.d_child.next;
+		spin_unlock(&this_parent->d_lock);
 		this_parent = this_parent->d_parent;
+		spin_lock(&this_parent->d_lock);
 		goto resume;
 	}
+	spin_unlock(&this_parent->d_lock);
 	spin_unlock(&dcache_lock);
 	return 0; /* No mount points found in tree */
 positive:
@@ -973,6 +1026,7 @@ static int select_parent(struct dentry * parent)
 	int found = 0;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&this_parent->d_lock);
 repeat:
 	next = this_parent->d_subdirs.next;
 resume:
@@ -980,8 +1034,9 @@ resume:
 		struct list_head *tmp = next;
 		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
+		BUG_ON(this_parent == dentry);
 
-		spin_lock(&dentry->d_lock);
+		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
 
 		/* 
 		 * move only zero ref count dentries to the end 
@@ -994,33 +1049,44 @@ resume:
 			dentry_lru_del(dentry);
 		}
 
-		spin_unlock(&dentry->d_lock);
-
 		/*
 		 * We can return to the caller if we have found some (this
 		 * ensures forward progress). We'll be coming back to find
 		 * the rest.
 		 */
-		if (found && need_resched())
+		if (found && need_resched()) {
+			spin_unlock(&dentry->d_lock);
 			goto out;
+		}
 
 		/*
 		 * Descend a level if the d_subdirs list is non-empty.
 		 */
 		if (!list_empty(&dentry->d_subdirs)) {
+			spin_unlock(&this_parent->d_lock);
+			spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
 			this_parent = dentry;
+			spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
 			goto repeat;
 		}
+
+		spin_unlock(&dentry->d_lock);
 	}
 	/*
 	 * All done at this level ... ascend and resume the search.
 	 */
 	if (this_parent != parent) {
+		struct dentry *tmp;
 		next = this_parent->d_u.d_child.next;
-		this_parent = this_parent->d_parent;
+		tmp = this_parent->d_parent;
+		spin_unlock(&this_parent->d_lock);
+		BUG_ON(tmp == this_parent);
+		this_parent = tmp;
+		spin_lock(&this_parent->d_lock);
 		goto resume;
 	}
 out:
+	spin_unlock(&this_parent->d_lock);
 	spin_unlock(&dcache_lock);
 	return found;
 }
@@ -1121,18 +1187,19 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	INIT_LIST_HEAD(&dentry->d_lru);
 	INIT_LIST_HEAD(&dentry->d_subdirs);
 	INIT_LIST_HEAD(&dentry->d_alias);
+	INIT_LIST_HEAD(&dentry->d_u.d_child);
 
 	if (parent) {
-		dentry->d_parent = dget(parent);
+		spin_lock(&dcache_lock);
+		spin_lock(&parent->d_lock);
+		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+		dentry->d_parent = dget_dlock(parent);
 		dentry->d_sb = parent->d_sb;
-	} else {
-		INIT_LIST_HEAD(&dentry->d_u.d_child);
-	}
-
-	spin_lock(&dcache_lock);
-	if (parent)
 		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
-	spin_unlock(&dcache_lock);
+		spin_unlock(&dentry->d_lock);
+		spin_unlock(&parent->d_lock);
+		spin_unlock(&dcache_lock);
+	}
 
 	percpu_counter_inc(&nr_dentry);
 
@@ -1650,13 +1717,18 @@ int d_validate(struct dentry *dentry, struct dentry *dparent)
 	struct dentry *child;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dparent->d_lock);
 	list_for_each_entry(child, &dparent->d_subdirs, d_u.d_child) {
 		if (dentry == child) {
-			__dget_locked(dentry);
+			spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+			__dget_locked_dlock(dentry);
+			spin_unlock(&dentry->d_lock);
+			spin_unlock(&dparent->d_lock);
 			spin_unlock(&dcache_lock);
 			return 1;
 		}
 	}
+	spin_unlock(&dparent->d_lock);
 	spin_unlock(&dcache_lock);
 
 	return 0;
@@ -1822,15 +1894,26 @@ static void d_move_locked(struct dentry * dentry, struct dentry * target)
 	/*
 	 * XXXX: do we really need to take target->d_lock?
 	 */
-	if (d_ancestor(dentry, target)) {
-		spin_lock(&dentry->d_lock);
-		spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NESTED);
-	} else if (d_ancestor(target, dentry) || target < dentry) {
-		spin_lock(&target->d_lock);
-		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
-	} else {
-		spin_lock(&dentry->d_lock);
-		spin_lock_nested(&target->d_lock, DENTRY_D_LOCK_NESTED);
+	BUG_ON(d_ancestor(dentry, target));
+	BUG_ON(d_ancestor(target, dentry));
+
+	if (IS_ROOT(dentry) || dentry->d_parent == target->d_parent)
+		spin_lock(&target->d_parent->d_lock);
+	else {
+		if (d_ancestor(dentry->d_parent, target->d_parent)) {
+			spin_lock(&dentry->d_parent->d_lock);
+			spin_lock_nested(&target->d_parent->d_lock, DENTRY_D_LOCK_NESTED);
+		} else {
+			spin_lock(&target->d_parent->d_lock);
+			spin_lock_nested(&dentry->d_parent->d_lock, DENTRY_D_LOCK_NESTED);
+		}
+	}
+	if (target < dentry) {
+		spin_lock_nested(&target->d_lock, 2);
+		spin_lock_nested(&dentry->d_lock, 3);
+	} else {
+		spin_lock_nested(&dentry->d_lock, 2);
+		spin_lock_nested(&target->d_lock, 3);
 	}
 
 	/* Move the dentry to the target hash queue, if on different bucket */
@@ -1863,6 +1946,10 @@ static void d_move_locked(struct dentry * dentry, struct dentry * target)
 	}
 
 	list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
+	if (target->d_parent != dentry->d_parent)
+		spin_unlock(&dentry->d_parent->d_lock);
+	if (target->d_parent != target)
+		spin_unlock(&target->d_parent->d_lock);
 	spin_unlock(&target->d_lock);
 	fsnotify_d_move(dentry);
 	spin_unlock(&dentry->d_lock);
@@ -1963,6 +2050,13 @@ static void __d_materialise_dentry(struct dentry *dentry, struct dentry *anon)
 	dparent = dentry->d_parent;
 	aparent = anon->d_parent;
 
+	/* XXX: hack */
+	/* returns with anon->d_lock held! */
+	spin_lock(&aparent->d_lock);
+	spin_lock(&dparent->d_lock);
+	spin_lock(&dentry->d_lock);
+	spin_lock(&anon->d_lock);
+
 	dentry->d_parent = (aparent == anon) ? dentry : aparent;
 	list_del(&dentry->d_u.d_child);
 	if (!IS_ROOT(dentry))
@@ -1977,6 +2071,10 @@ static void __d_materialise_dentry(struct dentry *dentry, struct dentry *anon)
 	else
 		INIT_LIST_HEAD(&anon->d_u.d_child);
 
+	spin_unlock(&dentry->d_lock);
+	spin_unlock(&dparent->d_lock);
+	spin_unlock(&aparent->d_lock);
+
 	anon->d_flags &= ~DCACHE_DISCONNECTED;
 }
 
@@ -2012,7 +2110,6 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 			/* Is this an anonymous mountpoint that we could splice
 			 * into our tree? */
 			if (IS_ROOT(alias)) {
-				spin_lock(&alias->d_lock);
 				__d_materialise_dentry(dentry, alias);
 				__d_drop(alias);
 				goto found;
@@ -2497,6 +2594,7 @@ void d_genocide(struct dentry *root)
 	struct list_head *next;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&this_parent->d_lock);
 repeat:
 	next = this_parent->d_subdirs.next;
 resume:
@@ -2510,8 +2608,10 @@ resume:
 			continue;
 		}
 		if (!list_empty(&dentry->d_subdirs)) {
-			spin_unlock(&dentry->d_lock);
+			spin_unlock(&this_parent->d_lock);
+			spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
 			this_parent = dentry;
+			spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
 			goto repeat;
 		}
  		dentry->d_count--;
@@ -2519,12 +2619,13 @@ resume:
 	}
 	if (this_parent != root) {
 		next = this_parent->d_u.d_child.next;
-		spin_lock(&this_parent->d_lock);
 		this_parent->d_count--;
 		spin_unlock(&this_parent->d_lock);
 		this_parent = this_parent->d_parent;
+		spin_lock(&this_parent->d_lock);
 		goto resume;
 	}
+	spin_unlock(&this_parent->d_lock);
 	spin_unlock(&dcache_lock);
 }
 
diff --git a/fs/libfs.c b/fs/libfs.c
index 433e713..cc47949 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -81,7 +81,8 @@ int dcache_dir_close(struct inode *inode, struct file *file)
 
 loff_t dcache_dir_lseek(struct file *file, loff_t offset, int origin)
 {
-	mutex_lock(&file->f_path.dentry->d_inode->i_mutex);
+	struct dentry *dentry = file->f_path.dentry;
+	mutex_lock(&dentry->d_inode->i_mutex);
 	switch (origin) {
 		case 1:
 			offset += file->f_pos;
@@ -89,7 +90,7 @@ loff_t dcache_dir_lseek(struct file *file, loff_t offset, int origin)
 			if (offset >= 0)
 				break;
 		default:
-			mutex_unlock(&file->f_path.dentry->d_inode->i_mutex);
+			mutex_unlock(&dentry->d_inode->i_mutex);
 			return -EINVAL;
 	}
 	if (offset != file->f_pos) {
@@ -100,22 +101,25 @@ loff_t dcache_dir_lseek(struct file *file, loff_t offset, int origin)
 			loff_t n = file->f_pos - 2;
 
 			spin_lock(&dcache_lock);
+			spin_lock(&dentry->d_lock);
+			/* d_lock not required for cursor */
 			list_del(&cursor->d_u.d_child);
-			p = file->f_path.dentry->d_subdirs.next;
-			while (n && p != &file->f_path.dentry->d_subdirs) {
+			p = dentry->d_subdirs.next;
+			while (n && p != &dentry->d_subdirs) {
 				struct dentry *next;
 				next = list_entry(p, struct dentry, d_u.d_child);
-				spin_lock(&next->d_lock);
+				spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
 				if (simple_positive(next))
 					n--;
 				spin_unlock(&next->d_lock);
 				p = p->next;
 			}
 			list_add_tail(&cursor->d_u.d_child, p);
+			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
 		}
 	}
-	mutex_unlock(&file->f_path.dentry->d_inode->i_mutex);
+	mutex_unlock(&dentry->d_inode->i_mutex);
 	return offset;
 }
 
@@ -156,6 +160,7 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
 			/* fallthrough */
 		default:
 			spin_lock(&dcache_lock);
+			spin_lock(&dentry->d_lock);
 			if (filp->f_pos == 2)
 				list_move(q, &dentry->d_subdirs);
 
@@ -169,6 +174,7 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
 				}
 
 				spin_unlock(&next->d_lock);
+				spin_unlock(&dentry->d_lock);
 				spin_unlock(&dcache_lock);
 				if (filldir(dirent, next->d_name.name, 
 					    next->d_name.len, filp->f_pos, 
@@ -176,11 +182,15 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
 					    dt_type(next->d_inode)) < 0)
 					return 0;
 				spin_lock(&dcache_lock);
+				spin_lock(&dentry->d_lock);
+				spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
 				/* next is still alive */
 				list_move(q, p);
+				spin_unlock(&next->d_lock);
 				p = q;
 				filp->f_pos++;
 			}
+			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
 	}
 	return 0;
@@ -276,6 +286,7 @@ int simple_empty(struct dentry *dentry)
 	int ret = 0;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child) {
 		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
 		if (simple_positive(child)) {
@@ -286,6 +297,7 @@ int simple_empty(struct dentry *dentry)
 	}
 	ret = 1;
 out:
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 	return ret;
 }
diff --git a/fs/ncpfs/dir.c b/fs/ncpfs/dir.c
index ecb25c6..ce65f29 100644
--- a/fs/ncpfs/dir.c
+++ b/fs/ncpfs/dir.c
@@ -394,6 +394,7 @@ ncp_dget_fpos(struct dentry *dentry, struct dentry *parent, unsigned long fpos)
 
 	/* If a pointer is invalid, we search the dentry. */
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
 		dent = list_entry(next, struct dentry, d_u.d_child);
@@ -402,11 +403,13 @@ ncp_dget_fpos(struct dentry *dentry, struct dentry *parent, unsigned long fpos)
 				dget_locked(dent);
 			else
 				dent = NULL;
+			spin_unlock(&parent->d_lock);
 			spin_unlock(&dcache_lock);
 			goto out;
 		}
 		next = next->next;
 	}
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 	return NULL;
 
diff --git a/fs/ncpfs/ncplib_kernel.h b/fs/ncpfs/ncplib_kernel.h
index 244d1b7..c4b718f 100644
--- a/fs/ncpfs/ncplib_kernel.h
+++ b/fs/ncpfs/ncplib_kernel.h
@@ -194,6 +194,7 @@ ncp_renew_dentries(struct dentry *parent)
 	struct dentry *dentry;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
 		dentry = list_entry(next, struct dentry, d_u.d_child);
@@ -205,6 +206,7 @@ ncp_renew_dentries(struct dentry *parent)
 
 		next = next->next;
 	}
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 }
 
@@ -216,6 +218,7 @@ ncp_invalidate_dircache_entries(struct dentry *parent)
 	struct dentry *dentry;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
 		dentry = list_entry(next, struct dentry, d_u.d_child);
@@ -223,6 +226,7 @@ ncp_invalidate_dircache_entries(struct dentry *parent)
 		ncp_age_dentry(server, dentry);
 		next = next->next;
 	}
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 }
 
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 20dc218..aa4f25e 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -68,17 +68,19 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
 		/* run all of the children of the original inode and fix their
 		 * d_flags to indicate parental interest (their parent is the
 		 * original inode) */
+		spin_lock(&alias->d_lock);
 		list_for_each_entry(child, &alias->d_subdirs, d_u.d_child) {
 			if (!child->d_inode)
 				continue;
 
-			spin_lock(&child->d_lock);
+			spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
 			if (watched)
 				child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
 			else
 				child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
 			spin_unlock(&child->d_lock);
 		}
+		spin_unlock(&alias->d_lock);
 	}
 	spin_unlock(&dcache_lock);
 }
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 06f6f58..16d6ecf 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -337,6 +337,7 @@ static inline struct dentry *dget_dlock(struct dentry *dentry)
 	}
 	return dentry;
 }
+
 static inline struct dentry *dget(struct dentry *dentry)
 {
 	if (dentry) {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7ead732..54f3442 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -875,23 +875,31 @@ static void cgroup_clear_directory(struct dentry *dentry)
 
 	BUG_ON(!mutex_is_locked(&dentry->d_inode->i_mutex));
 	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
 	node = dentry->d_subdirs.next;
 	while (node != &dentry->d_subdirs) {
 		struct dentry *d = list_entry(node, struct dentry, d_u.d_child);
+
+		spin_lock_nested(&d->d_lock, DENTRY_D_LOCK_NESTED);
 		list_del_init(node);
 		if (d->d_inode) {
 			/* This should never be called on a cgroup
 			 * directory with child cgroups */
 			BUG_ON(d->d_inode->i_mode & S_IFDIR);
-			d = dget_locked(d);
+			dget_locked_dlock(d);
+			spin_unlock(&d->d_lock);
+			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_lock);
 			d_delete(d);
 			simple_unlink(dentry->d_inode, d);
 			dput(d);
 			spin_lock(&dcache_lock);
-		}
+			spin_lock(&dentry->d_lock);
+		} else
+			spin_unlock(&d->d_lock);
 		node = dentry->d_subdirs.next;
 	}
+	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 }
 
@@ -900,10 +908,17 @@ static void cgroup_clear_directory(struct dentry *dentry)
  */
 static void cgroup_d_remove_dir(struct dentry *dentry)
 {
+	struct dentry *parent;
+
 	cgroup_clear_directory(dentry);
 
 	spin_lock(&dcache_lock);
+	parent = dentry->d_parent;
+	spin_lock(&parent->d_lock);
+	spin_lock(&dentry->d_lock);
 	list_del_init(&dentry->d_u.d_child);
+	spin_unlock(&dentry->d_lock);
+	spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_lock);
 	remove_dir(dentry);
 }
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index 073fd5b..017ec09 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -1146,22 +1146,30 @@ static void sel_remove_entries(struct dentry *de)
 	struct list_head *node;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&de->d_lock);
 	node = de->d_subdirs.next;
 	while (node != &de->d_subdirs) {
 		struct dentry *d = list_entry(node, struct dentry, d_u.d_child);
+
+		spin_lock_nested(&d->d_lock, DENTRY_D_LOCK_NESTED);
 		list_del_init(node);
 
 		if (d->d_inode) {
-			d = dget_locked(d);
+			dget_locked_dlock(d);
+			spin_unlock(&de->d_lock);
+			spin_unlock(&d->d_lock);
 			spin_unlock(&dcache_lock);
 			d_delete(d);
 			simple_unlink(de->d_inode, d);
 			dput(d);
 			spin_lock(&dcache_lock);
-		}
+			spin_lock(&de->d_lock);
+		} else
+			spin_unlock(&d->d_lock);
 		node = de->d_subdirs.next;
 	}
 
+	spin_unlock(&de->d_lock);
 	spin_unlock(&dcache_lock);
 }
 
-- 
1.7.1



* [PATCH 16/46] fs: scale inode alias list
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (13 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 15/46] fs: dcache scale subdirs Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations Nick Piggin
                   ` (34 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
from concurrent modification. d_alias is also protected by d_lock.
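
The nesting described above can be sketched in user space as follows. This
is only an illustration and not part of the patch: the model_* names are
hypothetical, pthread mutexes stand in for kernel spinlocks, the alias list
is simplified to a singly linked list, and the outer dcache_lock that the
patch still takes is left out for brevity.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct model_inode;

struct model_dentry {
	pthread_mutex_t d_lock;            /* models dentry->d_lock */
	struct model_inode *d_inode;       /* written under both locks */
	struct model_dentry *d_alias_next; /* simplified d_alias link */
};

struct model_inode {
	struct model_dentry *i_dentry;     /* head of the alias list */
};

/* Models the global dcache_inode_lock (assumption: a plain mutex). */
static pthread_mutex_t model_dcache_inode_lock = PTHREAD_MUTEX_INITIALIZER;

static void model_dentry_init(struct model_dentry *dentry)
{
	pthread_mutex_init(&dentry->d_lock, NULL);
	dentry->d_inode = NULL;
	dentry->d_alias_next = NULL;
}

/* Attach dentry to inode: global lock first, then the per-dentry lock. */
static void model_d_instantiate(struct model_dentry *dentry,
				struct model_inode *inode)
{
	pthread_mutex_lock(&model_dcache_inode_lock);
	pthread_mutex_lock(&dentry->d_lock);
	if (inode) {
		dentry->d_alias_next = inode->i_dentry;
		inode->i_dentry = dentry;
	}
	dentry->d_inode = inode;
	pthread_mutex_unlock(&dentry->d_lock);
	pthread_mutex_unlock(&model_dcache_inode_lock);
}

/* Walk the alias list; holding the global lock keeps the list stable. */
static int model_count_aliases(struct model_inode *inode)
{
	struct model_dentry *d;
	int n = 0;

	pthread_mutex_lock(&model_dcache_inode_lock);
	for (d = inode->i_dentry; d; d = d->d_alias_next)
		n++;
	pthread_mutex_unlock(&model_dcache_inode_lock);
	return n;
}
```

In this model, writers take the global lock and then the dentry's own lock,
so a reader that walks the list needs only the global lock, while a reader
that inspects a single dentry's fields needs only that dentry's d_lock.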

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/9p/vfs_inode.c      |    2 +
 fs/affs/amigaffs.c     |    2 +
 fs/cifs/inode.c        |    3 ++
 fs/dcache.c            |   66 ++++++++++++++++++++++++++++++++++++++++++------
 fs/exportfs/expfs.c    |    4 +++
 fs/nfs/getroot.c       |    4 +++
 fs/notify/fsnotify.c   |    2 +
 fs/ocfs2/dcache.c      |    3 +-
 include/linux/dcache.h |    1 +
 9 files changed, 78 insertions(+), 9 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 34bf71b..47dfd5d 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -271,9 +271,11 @@ static struct dentry *v9fs_dentry_from_dir_inode(struct inode *inode)
 	struct dentry *dentry;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	/* Directory should have only one entry. */
 	BUG_ON(S_ISDIR(inode->i_mode) && !list_is_singular(&inode->i_dentry));
 	dentry = list_entry(inode->i_dentry.next, struct dentry, d_alias);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 	return dentry;
 }
diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index 7d0f0a3..2321cc9 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -129,6 +129,7 @@ affs_fix_dcache(struct dentry *dentry, u32 entry_ino)
 	struct list_head *head, *next;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	head = &inode->i_dentry;
 	next = head->next;
 	while (next != head) {
@@ -139,6 +140,7 @@ affs_fix_dcache(struct dentry *dentry, u32 entry_ino)
 		}
 		next = next->next;
 	}
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 }
 
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index ef3a55b..0ee1767 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -805,12 +805,15 @@ inode_has_hashed_dentries(struct inode *inode)
 	struct dentry *dentry;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		if (!d_unhashed(dentry) || IS_ROOT(dentry)) {
+			spin_unlock(&dcache_inode_lock);
 			spin_unlock(&dcache_lock);
 			return true;
 		}
 	}
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 	return false;
 }
diff --git a/fs/dcache.c b/fs/dcache.c
index c79f3d4..4c6f553 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -37,6 +37,8 @@
 
 /*
  * Usage:
+ * dcache_inode_lock protects:
+ *   - i_dentry, d_alias, d_inode
  * dcache_hash_lock protects:
  *   - the dcache hash table
  * dcache_lru_lock protects:
@@ -49,12 +51,14 @@
  *   - d_unhashed()
  *   - d_parent and d_subdirs
  *   - childrens' d_child and d_parent
+ *   - d_alias, d_inode
  *
  * Ordering:
  * dcache_lock
- *   dentry->d_lock
- *     dcache_lru_lock
- *     dcache_hash_lock
+ *   dcache_inode_lock
+ *     dentry->d_lock
+ *       dcache_lru_lock
+ *       dcache_hash_lock
  *
  * If there is an ancestor relationship:
  * dentry->d_parent->...->d_parent->d_lock
@@ -70,11 +74,13 @@
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
+__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_inode_lock);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
+EXPORT_SYMBOL(dcache_inode_lock);
 EXPORT_SYMBOL(dcache_hash_lock);
 EXPORT_SYMBOL(dcache_lock);
 
@@ -148,6 +154,7 @@ static void d_free(struct dentry *dentry)
  */
 static void dentry_iput(struct dentry * dentry)
 	__releases(dentry->d_lock)
+	__releases(dcache_inode_lock)
 	__releases(dcache_lock)
 {
 	struct inode *inode = dentry->d_inode;
@@ -155,6 +162,7 @@ static void dentry_iput(struct dentry * dentry)
 		dentry->d_inode = NULL;
 		list_del_init(&dentry->d_alias);
 		spin_unlock(&dentry->d_lock);
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 		if (!inode->i_nlink)
 			fsnotify_inoderemove(inode);
@@ -164,6 +172,7 @@ static void dentry_iput(struct dentry * dentry)
 			iput(inode);
 	} else {
 		spin_unlock(&dentry->d_lock);
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 	}
 }
@@ -225,6 +234,7 @@ static void dentry_lru_move_tail(struct dentry *dentry)
 static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
 	__releases(dentry->d_lock)
 	__releases(parent->d_lock)
+	__releases(dcache_inode_lock)
 	__releases(dcache_lock)
 {
 	list_del(&dentry->d_u.d_child);
@@ -290,13 +300,18 @@ repeat:
 			 * want to reduce dcache_lock anyway so this will
 			 * get improved.
 			 */
+drop1:
 			spin_unlock(&dentry->d_lock);
 			goto repeat;
 		}
-		if (parent && !spin_trylock(&parent->d_lock)) {
-			spin_unlock(&dentry->d_lock);
+		if (!spin_trylock(&dcache_inode_lock)) {
+drop2:
 			spin_unlock(&dcache_lock);
-			goto repeat;
+			goto drop1;
+		}
+		if (parent && !spin_trylock(&parent->d_lock)) {
+			spin_unlock(&dcache_inode_lock);
+			goto drop2;
 		}
 	}
 	dentry->d_count--;
@@ -327,6 +342,7 @@ repeat:
  	spin_unlock(&dentry->d_lock);
 	if (parent)
 		spin_unlock(&parent->d_lock);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 	return;
 
@@ -521,7 +537,9 @@ struct dentry *d_find_alias(struct inode *inode)
 
 	if (!list_empty(&inode->i_dentry)) {
 		spin_lock(&dcache_lock);
+		spin_lock(&dcache_inode_lock);
 		de = __d_find_alias(inode, 0);
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 	}
 	return de;
@@ -537,18 +555,21 @@ void d_prune_aliases(struct inode *inode)
 	struct dentry *dentry;
 restart:
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
 		if (!dentry->d_count) {
 			__dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
+			spin_unlock(&dcache_inode_lock);
 			spin_unlock(&dcache_lock);
 			dput(dentry);
 			goto restart;
 		}
 		spin_unlock(&dentry->d_lock);
 	}
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 }
 EXPORT_SYMBOL(d_prune_aliases);
@@ -564,6 +585,7 @@ EXPORT_SYMBOL(d_prune_aliases);
 static void prune_one_dentry(struct dentry *dentry, struct dentry *parent)
 	__releases(dentry->d_lock)
 	__releases(parent->d_lock)
+	__releases(dcache_inode_lock)
 	__releases(dcache_lock)
 {
 	__d_drop(dentry);
@@ -575,6 +597,7 @@ static void prune_one_dentry(struct dentry *dentry, struct dentry *parent)
 	 */
 	while (dentry) {
 		spin_lock(&dcache_lock);
+		spin_lock(&dcache_inode_lock);
 again:
 		spin_lock(&dentry->d_lock);
 		if (IS_ROOT(dentry))
@@ -590,6 +613,7 @@ again:
 			if (parent)
 				spin_unlock(&parent->d_lock);
 			spin_unlock(&dentry->d_lock);
+			spin_unlock(&dcache_inode_lock);
 			spin_unlock(&dcache_lock);
 			return;
 		}
@@ -639,8 +663,9 @@ relock:
 		spin_unlock(&dcache_lru_lock);
 
 		prune_one_dentry(dentry, parent);
-		/* dcache_lock and dentry->d_lock dropped */
+		/* dcache_lock, dcache_inode_lock and dentry->d_lock dropped */
 		spin_lock(&dcache_lock);
+		spin_lock(&dcache_inode_lock);
 		spin_lock(&dcache_lru_lock);
 	}
 }
@@ -662,6 +687,7 @@ static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
 	int cnt = *count;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 relock:
 	spin_lock(&dcache_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
@@ -700,8 +726,8 @@ relock:
 	if (!list_empty(&referenced))
 		list_splice(&referenced, &sb->s_dentry_lru);
 	spin_unlock(&dcache_lru_lock);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
-
 }
 
 /**
@@ -795,12 +821,14 @@ void shrink_dcache_sb(struct super_block *sb)
 	LIST_HEAD(tmp);
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	spin_lock(&dcache_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
 		list_splice_init(&sb->s_dentry_lru, &tmp);
 		shrink_dentry_list(&tmp);
 	}
 	spin_unlock(&dcache_lru_lock);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
@@ -1221,9 +1249,11 @@ EXPORT_SYMBOL(d_alloc_name);
 /* the caller must hold dcache_lock */
 static void __d_instantiate(struct dentry *dentry, struct inode *inode)
 {
+	spin_lock(&dentry->d_lock);
 	if (inode)
 		list_add(&dentry->d_alias, &inode->i_dentry);
 	dentry->d_inode = inode;
+	spin_unlock(&dentry->d_lock);
 	fsnotify_d_instantiate(dentry, inode);
 }
 
@@ -1246,7 +1276,9 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
 {
 	BUG_ON(!list_empty(&entry->d_alias));
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	__d_instantiate(entry, inode);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 	security_d_instantiate(entry, inode);
 }
@@ -1307,7 +1339,9 @@ struct dentry *d_instantiate_unique(struct dentry *entry, struct inode *inode)
 	BUG_ON(!list_empty(&entry->d_alias));
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	result = __d_instantiate_unique(entry, inode);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 
 	if (!result) {
@@ -1398,8 +1432,10 @@ struct dentry *d_obtain_alias(struct inode *inode)
 	tmp->d_parent = tmp; /* make sure dput doesn't croak */
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	res = __d_find_alias(inode, 0);
 	if (res) {
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 		dput(tmp);
 		goto out_iput;
@@ -1416,6 +1452,7 @@ struct dentry *d_obtain_alias(struct inode *inode)
 	hlist_add_head(&tmp->d_hash, &inode->i_sb->s_anon);
 	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&tmp->d_lock);
+	spin_unlock(&dcache_inode_lock);
 
 	spin_unlock(&dcache_lock);
 	return tmp;
@@ -1448,9 +1485,11 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 
 	if (inode && S_ISDIR(inode->i_mode)) {
 		spin_lock(&dcache_lock);
+		spin_lock(&dcache_inode_lock);
 		new = __d_find_alias(inode, 1);
 		if (new) {
 			BUG_ON(!(new->d_flags & DCACHE_DISCONNECTED));
+			spin_unlock(&dcache_inode_lock);
 			spin_unlock(&dcache_lock);
 			security_d_instantiate(new, inode);
 			d_move(new, dentry);
@@ -1458,6 +1497,7 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 		} else {
 			/* already taking dcache_lock, so d_add() by hand */
 			__d_instantiate(dentry, inode);
+			spin_unlock(&dcache_inode_lock);
 			spin_unlock(&dcache_lock);
 			security_d_instantiate(dentry, inode);
 			d_rehash(dentry);
@@ -1532,8 +1572,10 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 	 * already has a dentry.
 	 */
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	if (!S_ISDIR(inode->i_mode) || list_empty(&inode->i_dentry)) {
 		__d_instantiate(found, inode);
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 		security_d_instantiate(found, inode);
 		return found;
@@ -1545,6 +1587,7 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 	 */
 	new = list_entry(inode->i_dentry.next, struct dentry, d_alias);
 	dget_locked(new);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 	security_d_instantiate(found, inode);
 	d_move(new, found);
@@ -1763,6 +1806,7 @@ void d_delete(struct dentry * dentry)
 	 * Are we the only user?
 	 */
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	spin_lock(&dentry->d_lock);
 	isdir = S_ISDIR(dentry->d_inode->i_mode);
 	if (dentry->d_count == 1) {
@@ -1776,6 +1820,7 @@ void d_delete(struct dentry * dentry)
 		__d_drop(dentry);
 
 	spin_unlock(&dentry->d_lock);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 
 	fsnotify_nameremove(dentry, isdir);
@@ -2003,6 +2048,7 @@ struct dentry *d_ancestor(struct dentry *p1, struct dentry *p2)
  */
 static struct dentry *__d_unalias(struct dentry *dentry, struct dentry *alias)
 	__releases(dcache_lock)
+	__releases(dcache_inode_lock)
 {
 	struct mutex *m1 = NULL, *m2 = NULL;
 	struct dentry *ret;
@@ -2028,6 +2074,7 @@ out_unalias:
 	d_move_locked(alias, dentry);
 	ret = alias;
 out_err:
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 	if (m2)
 		mutex_unlock(m2);
@@ -2093,6 +2140,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 	BUG_ON(!d_unhashed(dentry));
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 
 	if (!inode) {
 		actual = dentry;
@@ -2136,6 +2184,7 @@ found:
 	_d_rehash(actual);
 	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&actual->d_lock);
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 out_nolock:
 	if (actual == dentry) {
@@ -2147,6 +2196,7 @@ out_nolock:
 	return actual;
 
 shouldnt_be_hashed:
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 	BUG();
 }
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 51b3040..84b8c46 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -48,8 +48,10 @@ find_acceptable_alias(struct dentry *result,
 		return result;
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(dentry, &result->d_inode->i_dentry, d_alias) {
 		dget_locked(dentry);
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 		if (toput)
 			dput(toput);
@@ -58,8 +60,10 @@ find_acceptable_alias(struct dentry *result,
 			return dentry;
 		}
 		spin_lock(&dcache_lock);
+		spin_lock(&dcache_inode_lock);
 		toput = dentry;
 	}
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 
 	if (toput)
diff --git a/fs/nfs/getroot.c b/fs/nfs/getroot.c
index ac7b814..850f67d 100644
--- a/fs/nfs/getroot.c
+++ b/fs/nfs/getroot.c
@@ -64,7 +64,11 @@ static int nfs_superblock_set_dummy_root(struct super_block *sb, struct inode *i
 		 * Oops, since the test for IS_ROOT() will fail.
 		 */
 		spin_lock(&dcache_lock);
+		spin_lock(&dcache_inode_lock);
+		spin_lock(&sb->s_root->d_lock);
 		list_del_init(&sb->s_root->d_alias);
+		spin_unlock(&sb->s_root->d_lock);
+		spin_unlock(&dcache_inode_lock);
 		spin_unlock(&dcache_lock);
 	}
 	return 0;
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index aa4f25e..ae769fc 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -60,6 +60,7 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
 	watched = fsnotify_inode_watches_children(inode);
 
 	spin_lock(&dcache_lock);
+	spin_lock(&dcache_inode_lock);
 	/* run all of the dentries associated with this inode.  Since this is a
 	 * directory, there damn well better only be one item on this list */
 	list_for_each_entry(alias, &inode->i_dentry, d_alias) {
@@ -82,6 +83,7 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
 		}
 		spin_unlock(&alias->d_lock);
 	}
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 }
 
diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c
index 107d0f1..d83cca4 100644
--- a/fs/ocfs2/dcache.c
+++ b/fs/ocfs2/dcache.c
@@ -170,7 +170,7 @@ struct dentry *ocfs2_find_local_alias(struct inode *inode,
 	struct dentry *dentry = NULL;
 
 	spin_lock(&dcache_lock);
-
+	spin_lock(&dcache_inode_lock);
 	list_for_each(p, &inode->i_dentry) {
 		dentry = list_entry(p, struct dentry, d_alias);
 
@@ -188,6 +188,7 @@ struct dentry *ocfs2_find_local_alias(struct inode *inode,
 		dentry = NULL;
 	}
 
+	spin_unlock(&dcache_inode_lock);
 	spin_unlock(&dcache_lock);
 
 	return dentry;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 16d6ecf..b23b64e 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -181,6 +181,7 @@ struct dentry_operations {
 
 #define DCACHE_CANT_MOUNT	0x0100
 
+extern spinlock_t dcache_inode_lock;
 extern spinlock_t dcache_hash_lock;
 extern spinlock_t dcache_lock;
 extern seqlock_t rename_lock;
-- 
1.7.1



* [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (14 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 16/46] fs: scale inode alias list Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2011-01-18 22:32   ` Yehuda Sadeh Weinraub
  2010-11-27  9:44 ` [PATCH 18/46] fs: increase d_name lock coverage Nick Piggin
                   ` (33 subsequent siblings)
  49 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

The remaining uses of dcache_lock are to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree,
and to walk in the leaf->root direction in the tree, where we don't have
a natural d_lock ordering.

This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.

Solve this instead by using the rename seqlock for multi-step read-side
operations, retrying in case of a rename so we don't walk up the wrong parent.
Concurrent dentry insertions are not serialised against.  Concurrent deletes
are tricky when walking up the directory: our parent might have been deleted
while we dropped locks, so we also need to check and retry for that.
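
The read-side retry pattern described above can be sketched in userspace C.
This is only an illustration under stated assumptions, not the kernel code:
struct seqcount, struct node and every function below are hypothetical
stand-ins (for rename_lock, struct dentry and d_move respectively), built on
C11 atomics.  Writers are assumed to be serialised by the caller, and nodes
are never freed here, which is the part RCU takes care of in the real patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for rename_lock: odd while a writer is mid-update. */
struct seqcount {
	atomic_uint sequence;
};

/* Hypothetical stand-in for a dentry: only the parent link matters here. */
struct node {
	struct node *_Atomic parent;	/* NULL at the root */
};

static void seq_init(struct seqcount *s)
{
	atomic_init(&s->sequence, 0);
}

static void node_init(struct node *n, struct node *parent)
{
	atomic_init(&n->parent, parent);
}

static unsigned seq_read_begin(struct seqcount *s)
{
	unsigned seq;

	/* Wait out any writer that is in the middle of an update. */
	while ((seq = atomic_load(&s->sequence)) & 1)
		;
	return seq;
}

static int seq_read_retry(struct seqcount *s, unsigned seq)
{
	/* Non-zero if a writer ran since seq_read_begin(). */
	return atomic_load(&s->sequence) != seq;
}

/* "Rename": move n under new_parent, bumping the sequence around it. */
static void node_move(struct seqcount *s, struct node *n,
		      struct node *new_parent)
{
	atomic_fetch_add(&s->sequence, 1);	/* sequence becomes odd */
	atomic_store(&n->parent, new_parent);
	atomic_fetch_add(&s->sequence, 1);	/* sequence is even again */
}

/* Multi-step read: walk leaf->root, redo the whole walk after a rename. */
static int node_depth(struct seqcount *s, struct node *leaf)
{
	struct node *n, *p;
	unsigned seq;
	int depth;

	assert(leaf != NULL);
	do {
		depth = 0;
		seq = seq_read_begin(s);
		for (n = leaf; (p = atomic_load(&n->parent)) != NULL; n = p)
			depth++;
	} while (seq_read_retry(s, seq));
	return depth;
}
```

As in the patch, all state computed during the walk (depth here; len and buf
in the callers converted by the diff) is reset at the top of every retry, so
a walk that raced with a rename leaves nothing behind.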

We can also use the rename lock in cases where livelock is a worry (that
use is introduced in a subsequent patch).

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 drivers/staging/pohmelfs/path_entry.c |   15 +++-
 fs/autofs4/waitq.c                    |   18 ++++-
 fs/dcache.c                           |  136 +++++++++++++++++++++++++++------
 fs/nfs/namespace.c                    |   14 +++-
 include/linux/dcache.h                |    1 +
 5 files changed, 153 insertions(+), 31 deletions(-)

diff --git a/drivers/staging/pohmelfs/path_entry.c b/drivers/staging/pohmelfs/path_entry.c
index 8ec83d2..bbe42f4 100644
--- a/drivers/staging/pohmelfs/path_entry.c
+++ b/drivers/staging/pohmelfs/path_entry.c
@@ -83,10 +83,11 @@ out:
 int pohmelfs_path_length(struct pohmelfs_inode *pi)
 {
 	struct dentry *d, *root, *first;
-	int len = 1; /* Root slash */
+	int len;
+	unsigned seq;
 
-	first = d = d_find_alias(&pi->vfs_inode);
-	if (!d) {
+	first = d_find_alias(&pi->vfs_inode);
+	if (!first) {
 		dprintk("%s: ino: %llu, mode: %o.\n", __func__, pi->ino, pi->vfs_inode.i_mode);
 		return -ENOENT;
 	}
@@ -95,6 +96,11 @@ int pohmelfs_path_length(struct pohmelfs_inode *pi)
 	root = dget(current->fs->root.dentry);
 	spin_unlock(&current->fs->lock);
 
+rename_retry:
+	len = 1; /* Root slash */
+	d = first;
+	seq = read_seqbegin(&rename_lock);
+	rcu_read_lock();
 	spin_lock(&dcache_lock);
 
 	if (!IS_ROOT(d) && d_unhashed(d))
@@ -105,6 +111,9 @@ int pohmelfs_path_length(struct pohmelfs_inode *pi)
 		d = d->d_parent;
 	}
 	spin_unlock(&dcache_lock);
+	rcu_read_unlock();
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
 
 	dput(root);
 	dput(first);
diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 2341375..4be8f77 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -186,16 +186,25 @@ static int autofs4_getpath(struct autofs_sb_info *sbi,
 {
 	struct dentry *root = sbi->sb->s_root;
 	struct dentry *tmp;
-	char *buf = *name;
+	char *buf;
 	char *p;
-	int len = 0;
-
+	int len;
+	unsigned seq;
+
+rename_retry:
+	buf = *name;
+	len = 0;
+	seq = read_seqbegin(&rename_lock);
+	rcu_read_lock();
 	spin_lock(&dcache_lock);
 	for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
 		len += tmp->d_name.len + 1;
 
 	if (!len || --len > NAME_MAX) {
 		spin_unlock(&dcache_lock);
+		rcu_read_unlock();
+		if (read_seqretry(&rename_lock, seq))
+			goto rename_retry;
 		return 0;
 	}
 
@@ -209,6 +218,9 @@ static int autofs4_getpath(struct autofs_sb_info *sbi,
 		strncpy(p, tmp->d_name.name, tmp->d_name.len);
 	}
 	spin_unlock(&dcache_lock);
+	rcu_read_unlock();
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
 
 	return len;
 }
diff --git a/fs/dcache.c b/fs/dcache.c
index 4c6f553..35420f7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -80,6 +80,7 @@ static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
+EXPORT_SYMBOL(rename_lock);
 EXPORT_SYMBOL(dcache_inode_lock);
 EXPORT_SYMBOL(dcache_hash_lock);
 EXPORT_SYMBOL(dcache_lock);
@@ -237,6 +238,7 @@ static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
 	__releases(dcache_inode_lock)
 	__releases(dcache_lock)
 {
+	dentry->d_parent = NULL;
 	list_del(&dentry->d_u.d_child);
 	if (parent)
 		spin_unlock(&parent->d_lock);
@@ -980,11 +982,15 @@ void shrink_dcache_for_umount(struct super_block *sb)
  * Return true if the parent or its subdirectories contain
  * a mount point
  */
- 
 int have_submounts(struct dentry *parent)
 {
-	struct dentry *this_parent = parent;
+	struct dentry *this_parent;
 	struct list_head *next;
+	unsigned seq;
+
+rename_retry:
+	this_parent = parent;
+	seq = read_seqbegin(&rename_lock);
 
 	spin_lock(&dcache_lock);
 	if (d_mountpoint(parent))
@@ -1018,17 +1024,37 @@ resume:
 	 * All done at this level ... ascend and resume the search.
 	 */
 	if (this_parent != parent) {
-		next = this_parent->d_u.d_child.next;
+		struct dentry *tmp;
+		struct dentry *child;
+
+		tmp = this_parent->d_parent;
+		rcu_read_lock();
 		spin_unlock(&this_parent->d_lock);
-		this_parent = this_parent->d_parent;
+		child = this_parent;
+		this_parent = tmp;
 		spin_lock(&this_parent->d_lock);
+		/* might go back up the wrong parent if we have had a rename
+		 * or deletion */
+		if (this_parent != child->d_parent ||
+				read_seqretry(&rename_lock, seq)) {
+			spin_unlock(&this_parent->d_lock);
+			spin_unlock(&dcache_lock);
+			rcu_read_unlock();
+			goto rename_retry;
+		}
+		rcu_read_unlock();
+		next = child->d_u.d_child.next;
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
 	spin_unlock(&dcache_lock);
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
 	return 0; /* No mount points found in tree */
 positive:
 	spin_unlock(&dcache_lock);
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
 	return 1;
 }
 EXPORT_SYMBOL(have_submounts);
@@ -1049,10 +1075,15 @@ EXPORT_SYMBOL(have_submounts);
  */
 static int select_parent(struct dentry * parent)
 {
-	struct dentry *this_parent = parent;
+	struct dentry *this_parent;
 	struct list_head *next;
+	unsigned seq;
 	int found = 0;
 
+rename_retry:
+	this_parent = parent;
+	seq = read_seqbegin(&rename_lock);
+
 	spin_lock(&dcache_lock);
 	spin_lock(&this_parent->d_lock);
 repeat:
@@ -1062,7 +1093,6 @@ resume:
 		struct list_head *tmp = next;
 		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
-		BUG_ON(this_parent == dentry);
 
 		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
 
@@ -1105,17 +1135,32 @@ resume:
 	 */
 	if (this_parent != parent) {
 		struct dentry *tmp;
-		next = this_parent->d_u.d_child.next;
+		struct dentry *child;
+
 		tmp = this_parent->d_parent;
+		rcu_read_lock();
 		spin_unlock(&this_parent->d_lock);
-		BUG_ON(tmp == this_parent);
+		child = this_parent;
 		this_parent = tmp;
 		spin_lock(&this_parent->d_lock);
+		/* might go back up the wrong parent if we have had a rename
+		 * or deletion */
+		if (this_parent != child->d_parent ||
+				read_seqretry(&rename_lock, seq)) {
+			spin_unlock(&this_parent->d_lock);
+			spin_unlock(&dcache_lock);
+			rcu_read_unlock();
+			goto rename_retry;
+		}
+		rcu_read_unlock();
+		next = child->d_u.d_child.next;
 		goto resume;
 	}
 out:
 	spin_unlock(&this_parent->d_lock);
 	spin_unlock(&dcache_lock);
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
 	return found;
 }
 
@@ -1615,7 +1660,7 @@ EXPORT_SYMBOL(d_add_ci);
 struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
 {
 	struct dentry * dentry = NULL;
-	unsigned long seq;
+	unsigned seq;
 
         do {
                 seq = read_seqbegin(&rename_lock);
@@ -2225,7 +2270,7 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name)
  * @buffer: pointer to the end of the buffer
  * @buflen: pointer to buffer length
  *
- * Caller holds the dcache_lock.
+ * Caller holds the rename_lock.
  *
  * If path is not reachable from the supplied root, then the value of
  * root is changed (without modifying refcounts).
@@ -2310,7 +2355,9 @@ char *__d_path(const struct path *path, struct path *root,
 
 	prepend(&res, &buflen, "\0", 1);
 	spin_lock(&dcache_lock);
+	write_seqlock(&rename_lock);
 	error = prepend_path(path, root, &res, &buflen);
+	write_sequnlock(&rename_lock);
 	spin_unlock(&dcache_lock);
 
 	if (error)
@@ -2374,10 +2421,12 @@ char *d_path(const struct path *path, char *buf, int buflen)
 
 	get_fs_root(current->fs, &root);
 	spin_lock(&dcache_lock);
+	write_seqlock(&rename_lock);
 	tmp = root;
 	error = path_with_deleted(path, &tmp, &res, &buflen);
 	if (error)
 		res = ERR_PTR(error);
+	write_sequnlock(&rename_lock);
 	spin_unlock(&dcache_lock);
 	path_put(&root);
 	return res;
@@ -2405,10 +2454,12 @@ char *d_path_with_unreachable(const struct path *path, char *buf, int buflen)
 
 	get_fs_root(current->fs, &root);
 	spin_lock(&dcache_lock);
+	write_seqlock(&rename_lock);
 	tmp = root;
 	error = path_with_deleted(path, &tmp, &res, &buflen);
 	if (!error && !path_equal(&tmp, &root))
 		error = prepend_unreachable(&res, &buflen);
+	write_sequnlock(&rename_lock);
 	spin_unlock(&dcache_lock);
 	path_put(&root);
 	if (error)
@@ -2474,7 +2525,9 @@ char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
 	char *retval;
 
 	spin_lock(&dcache_lock);
+	write_seqlock(&rename_lock);
 	retval = __dentry_path(dentry, buf, buflen);
+	write_sequnlock(&rename_lock);
 	spin_unlock(&dcache_lock);
 
 	return retval;
@@ -2487,6 +2540,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 	char *retval;
 
 	spin_lock(&dcache_lock);
+	write_seqlock(&rename_lock);
 	if (d_unlinked(dentry)) {
 		p = buf + buflen;
 		if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2494,6 +2548,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 		buflen++;
 	}
 	retval = __dentry_path(dentry, buf, buflen);
+	write_sequnlock(&rename_lock);
 	spin_unlock(&dcache_lock);
 	if (!IS_ERR(retval) && p)
 		*p = '/';	/* restore '/' overriden with '\0' */
@@ -2534,6 +2589,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 
 	error = -ENOENT;
 	spin_lock(&dcache_lock);
+	write_seqlock(&rename_lock);
 	if (!d_unlinked(pwd.dentry)) {
 		unsigned long len;
 		struct path tmp = root;
@@ -2542,6 +2598,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 
 		prepend(&cwd, &buflen, "\0", 1);
 		error = prepend_path(&pwd, &tmp, &cwd, &buflen);
+		write_sequnlock(&rename_lock);
 		spin_unlock(&dcache_lock);
 
 		if (error)
@@ -2561,8 +2618,10 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 			if (copy_to_user(buf, cwd, len))
 				error = -EFAULT;
 		}
-	} else
+	} else {
+		write_sequnlock(&rename_lock);
 		spin_unlock(&dcache_lock);
+	}
 
 out:
 	path_put(&pwd);
@@ -2590,25 +2649,25 @@ out:
 int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
 {
 	int result;
-	unsigned long seq;
+	unsigned seq;
 
 	if (new_dentry == old_dentry)
 		return 1;
 
-	/*
-	 * Need rcu_readlock to protect against the d_parent trashing
-	 * due to d_move
-	 */
-	rcu_read_lock();
 	do {
 		/* for restarting inner loop in case of seq retry */
 		seq = read_seqbegin(&rename_lock);
+		/*
+		 * Need rcu_readlock to protect against the d_parent trashing
+		 * due to d_move
+		 */
+		rcu_read_lock();
 		if (d_ancestor(old_dentry, new_dentry))
 			result = 1;
 		else
 			result = 0;
+		rcu_read_unlock();
 	} while (read_seqretry(&rename_lock, seq));
-	rcu_read_unlock();
 
 	return result;
 }
@@ -2640,9 +2699,13 @@ EXPORT_SYMBOL(path_is_under);
 
 void d_genocide(struct dentry *root)
 {
-	struct dentry *this_parent = root;
+	struct dentry *this_parent;
 	struct list_head *next;
+	unsigned seq;
 
+rename_retry:
+	this_parent = root;
+	seq = read_seqbegin(&rename_lock);
 	spin_lock(&dcache_lock);
 	spin_lock(&this_parent->d_lock);
 repeat:
@@ -2652,6 +2715,7 @@ resume:
 		struct list_head *tmp = next;
 		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
+
 		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
 		if (d_unhashed(dentry) || !dentry->d_inode) {
 			spin_unlock(&dentry->d_lock);
@@ -2664,19 +2728,43 @@ resume:
 			spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
 			goto repeat;
 		}
- 		dentry->d_count--;
+		if (!(dentry->d_flags & DCACHE_GENOCIDE)) {
+			dentry->d_flags |= DCACHE_GENOCIDE;
+			dentry->d_count--;
+		}
  		spin_unlock(&dentry->d_lock);
 	}
 	if (this_parent != root) {
-		next = this_parent->d_u.d_child.next;
-		this_parent->d_count--;
+		struct dentry *tmp;
+		struct dentry *child;
+
+		tmp = this_parent->d_parent;
+		if (!(this_parent->d_flags & DCACHE_GENOCIDE)) {
+			this_parent->d_flags |= DCACHE_GENOCIDE;
+			this_parent->d_count--;
+		}
+		rcu_read_lock();
 		spin_unlock(&this_parent->d_lock);
-		this_parent = this_parent->d_parent;
-		spin_lock(&this_parent->d_lock);
+		child = this_parent;
+		this_parent = tmp;
+ 		spin_lock(&this_parent->d_lock);
+		/* might go back up the wrong parent if we have had a rename
+		 * or deletion */
+		if (this_parent != child->d_parent ||
+				read_seqretry(&rename_lock, seq)) {
+			spin_unlock(&this_parent->d_lock);
+			spin_unlock(&dcache_lock);
+			rcu_read_unlock();
+			goto rename_retry;
+		}
+		rcu_read_unlock();
+		next = child->d_u.d_child.next;
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
 	spin_unlock(&dcache_lock);
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
 }
 
 /**
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index db6aa36..78c0ebb 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -49,11 +49,17 @@ char *nfs_path(const char *base,
 	       const struct dentry *dentry,
 	       char *buffer, ssize_t buflen)
 {
-	char *end = buffer+buflen;
+	char *end;
 	int namelen;
+	unsigned seq;
 
+rename_retry:
+	end = buffer+buflen;
 	*--end = '\0';
 	buflen--;
+
+	seq = read_seqbegin(&rename_lock);
+	rcu_read_lock();
 	spin_lock(&dcache_lock);
 	while (!IS_ROOT(dentry) && dentry != droot) {
 		namelen = dentry->d_name.len;
@@ -66,6 +72,9 @@ char *nfs_path(const char *base,
 		dentry = dentry->d_parent;
 	}
 	spin_unlock(&dcache_lock);
+	rcu_read_unlock();
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
 	if (*end != '/') {
 		if (--buflen < 0)
 			goto Elong;
@@ -83,6 +92,9 @@ char *nfs_path(const char *base,
 	return end;
 Elong_unlock:
 	spin_unlock(&dcache_lock);
+	rcu_read_unlock();
+	if (read_seqretry(&rename_lock, seq))
+		goto rename_retry;
 Elong:
 	return ERR_PTR(-ENAMETOOLONG);
 }
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index b23b64e..f9b5744 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -180,6 +180,7 @@ struct dentry_operations {
 #define DCACHE_FSNOTIFY_PARENT_WATCHED	0x0080 /* Parent inode is watched by some fsnotify listener */
 
 #define DCACHE_CANT_MOUNT	0x0100
+#define DCACHE_GENOCIDE		0x0200
 
 extern spinlock_t dcache_inode_lock;
 extern spinlock_t dcache_hash_lock;
-- 
1.7.1



* [PATCH 18/46] fs: increase d_name lock coverage
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (15 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 19/46] fs: dcache remove dcache_lock Nick Piggin
                   ` (32 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Cover d_name with d_lock in more cases, where there may be concurrent
modification to it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/dcache.c |   23 ++++++++++++++++-------
 1 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 35420f7..7d0733b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1361,16 +1361,20 @@ static struct dentry *__d_instantiate_unique(struct dentry *entry,
 	list_for_each_entry(alias, &inode->i_dentry, d_alias) {
 		struct qstr *qstr = &alias->d_name;
 
+		spin_lock(&alias->d_lock);
 		if (qstr->hash != hash)
-			continue;
+			goto next;
 		if (alias->d_parent != entry->d_parent)
-			continue;
+			goto next;
 		if (qstr->len != len)
-			continue;
+			goto next;
 		if (memcmp(qstr->name, name, len))
-			continue;
-		dget_locked(alias);
+			goto next;
+		dget_locked_dlock(alias);
+		spin_unlock(&alias->d_lock);
 		return alias;
+next:
+		spin_unlock(&alias->d_lock);
 	}
 
 	__d_instantiate(entry, inode);
@@ -2298,7 +2302,9 @@ static int prepend_path(const struct path *path, struct path *root,
 		}
 		parent = dentry->d_parent;
 		prefetch(parent);
+		spin_lock(&dentry->d_lock);
 		error = prepend_name(buffer, buflen, &dentry->d_name);
+		spin_unlock(&dentry->d_lock);
 		if (!error)
 			error = prepend(buffer, buflen, "/", 1);
 		if (error)
@@ -2506,10 +2512,13 @@ static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
 
 	while (!IS_ROOT(dentry)) {
 		struct dentry *parent = dentry->d_parent;
+		int error;
 
 		prefetch(parent);
-		if ((prepend_name(&end, &buflen, &dentry->d_name) != 0) ||
-		    (prepend(&end, &buflen, "/", 1) != 0))
+		spin_lock(&dentry->d_lock);
+		error = prepend_name(&end, &buflen, &dentry->d_name);
+		spin_unlock(&dentry->d_lock);
+		if (error != 0 || prepend(&end, &buflen, "/", 1) != 0)
 			goto Elong;
 
 		retval = end;
-- 
1.7.1



* [PATCH 19/46] fs: dcache remove dcache_lock
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (16 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 18/46] fs: increase d_name lock coverage Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 20/46] fs: dcache avoid starvation in dcache multi-step operations Nick Piggin
                   ` (31 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

dcache_lock no longer protects anything. Remove it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 Documentation/filesystems/Locking            |   16 ++--
 Documentation/filesystems/dentry-locking.txt |   38 +++---
 Documentation/filesystems/porting            |    8 +-
 arch/powerpc/platforms/cell/spufs/inode.c    |    5 +-
 drivers/infiniband/hw/ipath/ipath_fs.c       |    6 +-
 drivers/infiniband/hw/qib/qib_fs.c           |    3 -
 drivers/staging/pohmelfs/path_entry.c        |    2 -
 drivers/staging/smbfs/cache.c                |    4 -
 drivers/usb/core/inode.c                     |    3 -
 fs/9p/vfs_inode.c                            |    2 -
 fs/affs/amigaffs.c                           |    2 -
 fs/autofs4/autofs_i.h                        |    3 +
 fs/autofs4/expire.c                          |   10 +-
 fs/autofs4/root.c                            |   44 ++++----
 fs/autofs4/waitq.c                           |    7 +-
 fs/ceph/dir.c                                |    6 +-
 fs/ceph/inode.c                              |    4 -
 fs/cifs/inode.c                              |    3 -
 fs/coda/cache.c                              |    2 -
 fs/configfs/configfs_internal.h              |    2 -
 fs/configfs/inode.c                          |    6 +-
 fs/dcache.c                                  |  162 ++++----------------------
 fs/exportfs/expfs.c                          |    4 -
 fs/libfs.c                                   |    8 --
 fs/namei.c                                   |    9 +-
 fs/ncpfs/dir.c                               |   18 +--
 fs/ncpfs/ncplib_kernel.h                     |    4 -
 fs/nfs/dir.c                                 |    3 -
 fs/nfs/getroot.c                             |    2 -
 fs/nfs/namespace.c                           |    3 -
 fs/notify/fsnotify.c                         |    2 -
 fs/ocfs2/dcache.c                            |    2 -
 include/linux/dcache.h                       |    7 +-
 include/linux/fs.h                           |    6 +-
 include/linux/fsnotify.h                     |    2 -
 include/linux/fsnotify_backend.h             |   11 +-
 include/linux/namei.h                        |    1 -
 kernel/cgroup.c                              |    6 -
 mm/filemap.c                                 |    3 -
 security/selinux/selinuxfs.c                 |    4 -
 40 files changed, 119 insertions(+), 314 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 5bceb19..0dee2d7 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -22,14 +22,14 @@ prototypes:
 
 locking rules:
 	none have BKL
-		dcache_lock	rename_lock	->d_lock	may block
-d_revalidate:	no		no		no		yes
-d_hash		no		no		no		no
-d_compare:	no		yes		no		no 
-d_delete:	yes		no		yes		no
-d_release:	no		no		no		yes
-d_iput:		no		no		no		yes
-d_dname:	no		no		no		no
+		rename_lock	->d_lock	may block
+d_revalidate:	no		no		yes
+d_hash		no		no		no
+d_compare:	yes		no		no
+d_delete:	no		yes		no
+d_release:	no		no		yes
+d_iput:		no		no		yes
+d_dname:	no		no		no
 
 --------------------------- inode_operations --------------------------- 
 prototypes:
diff --git a/Documentation/filesystems/dentry-locking.txt b/Documentation/filesystems/dentry-locking.txt
index 79334ed..30b6a40 100644
--- a/Documentation/filesystems/dentry-locking.txt
+++ b/Documentation/filesystems/dentry-locking.txt
@@ -31,6 +31,7 @@ significant change is the way d_lookup traverses the hash chain, it
 doesn't acquire the dcache_lock for this and rely on RCU to ensure
 that the dentry has not been *freed*.
 
+dcache_lock no longer exists, dentry locking is explained in fs/dcache.c
 
 Dcache locking details
 ======================
@@ -50,14 +51,12 @@ Safe lock-free look-up of dcache hash table
 
 Dcache is a complex data structure with the hash table entries also
 linked together in other lists. In 2.4 kernel, dcache_lock protected
-all the lists. We applied RCU only on hash chain walking. The rest of
-the lists are still protected by dcache_lock.  Some of the important
-changes are :
+all the lists. RCU dentry hash walking works like this:
 
 1. The deletion from hash chain is done using hlist_del_rcu() macro
    which doesn't initialize next pointer of the deleted dentry and
    this allows us to walk safely lock-free while a deletion is
-   happening.
+   happening. This is a standard hlist_rcu iteration.
 
 2. Insertion of a dentry into the hash table is done using
    hlist_add_head_rcu() which take care of ordering the writes - the
@@ -66,19 +65,18 @@ changes are :
    which has since been replaced by hlist_for_each_entry_rcu(), while
    walking the hash chain. The only requirement is that all
    initialization to the dentry must be done before
-   hlist_add_head_rcu() since we don't have dcache_lock protection
-   while traversing the hash chain. This isn't different from the
-   existing code.
-
-3. The dentry looked up without holding dcache_lock by cannot be
-   returned for walking if it is unhashed. It then may have a NULL
-   d_inode or other bogosity since RCU doesn't protect the other
-   fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
-   indicate unhashed dentries and use this in conjunction with a
-   per-dentry lock (d_lock). Once looked up without the dcache_lock,
-   we acquire the per-dentry lock (d_lock) and check if the dentry is
-   unhashed. If so, the look-up is failed. If not, the reference count
-   of the dentry is increased and the dentry is returned.
+   hlist_add_head_rcu() since we don't have lock protection
+   while traversing the hash chain.
+
+3. The dentry looked up without holding locks cannot be returned for
+   walking if it is unhashed. It then may have a NULL d_inode or other
+   bogosity since RCU doesn't protect the other fields in the dentry. We
+   therefore use a flag DCACHE_UNHASHED to indicate unhashed dentries
+   and use this in conjunction with a per-dentry lock (d_lock). Once
+   looked up without locks, we acquire the per-dentry lock (d_lock) and
+   check if the dentry is unhashed. If so, the look-up is failed. If not,
+   the reference count of the dentry is increased and the dentry is
+   returned.
 
 4. Once a dentry is looked up, it must be ensured during the path walk
    for that component it doesn't go away. In pre-2.5.10 code, this was
@@ -86,10 +84,10 @@ changes are :
    In some sense, dcache_rcu path walking looks like the pre-2.5.10
    version.
 
-5. All dentry hash chain updates must take the dcache_lock as well as
-   the per-dentry lock in that order. dput() does this to ensure that
-   a dentry that has just been looked up in another CPU doesn't get
-   deleted before dget() can be done on it.
+5. All dentry hash chain updates must take the per-dentry lock (see
+   fs/dcache.c). This excludes dput() to ensure that a dentry that has
+   been looked up concurrently does not get deleted before dget() can
+   take a ref.
 
 6. There are several ways to do reference counting of RCU protected
    objects. One such example is in ipv4 route cache where deferred
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index 0c6fc97..fd353e6 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -216,7 +216,6 @@ had ->revalidate()) add calls in ->follow_link()/->readlink().
 ->d_parent changes are not protected by BKL anymore.  Read access is safe
 if at least one of the following is true:
 	* filesystem has no cross-directory rename()
-	* dcache_lock is held
 	* we know that parent had been locked (e.g. we are looking at
 ->d_parent of ->lookup() argument).
 	* we are called from ->rename().
@@ -341,3 +340,10 @@ look at examples of other filesystems) for guidance.
 changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
 look at examples of other filesystems) for guidance.
 
+---
+[mandatory]
+	dcache_lock is gone, replaced by fine grained locks. See fs/dcache.c
+for details of what locks to replace dcache_lock with in order to protect
+particular things. Most of the time, a filesystem only needs ->d_lock, which
+protects *all* the dcache state of a given dentry.
+
diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index 5aef1a7..2662b50 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -159,21 +159,18 @@ static void spufs_prune_dir(struct dentry *dir)
 
 	mutex_lock(&dir->d_inode->i_mutex);
 	list_for_each_entry_safe(dentry, tmp, &dir->d_subdirs, d_u.d_child) {
-		spin_lock(&dcache_lock);
 		spin_lock(&dentry->d_lock);
 		if (!(d_unhashed(dentry)) && dentry->d_inode) {
 			dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			simple_unlink(dir->d_inode, dentry);
-			/* XXX: what is dcache_lock protecting here? Other
+			/* XXX: what was dcache_lock protecting here? Other
 			 * filesystems (IB, configfs) release dcache_lock
 			 * before unlink */
-			spin_unlock(&dcache_lock);
 			dput(dentry);
 		} else {
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 		}
 	}
 	shrink_dcache_parent(dir);
diff --git a/drivers/infiniband/hw/ipath/ipath_fs.c b/drivers/infiniband/hw/ipath/ipath_fs.c
index 18aee04..925e882 100644
--- a/drivers/infiniband/hw/ipath/ipath_fs.c
+++ b/drivers/infiniband/hw/ipath/ipath_fs.c
@@ -277,18 +277,14 @@ static int remove_file(struct dentry *parent, char *name)
 		goto bail;
 	}
 
-	spin_lock(&dcache_lock);
 	spin_lock(&tmp->d_lock);
 	if (!(d_unhashed(tmp) && tmp->d_inode)) {
 		dget_locked_dlock(tmp);
 		__d_drop(tmp);
 		spin_unlock(&tmp->d_lock);
-		spin_unlock(&dcache_lock);
 		simple_unlink(parent->d_inode, tmp);
-	} else {
+	} else
 		spin_unlock(&tmp->d_lock);
-		spin_unlock(&dcache_lock);
-	}
 
 	ret = 0;
 bail:
diff --git a/drivers/infiniband/hw/qib/qib_fs.c b/drivers/infiniband/hw/qib/qib_fs.c
index fe4b242..49af4a6 100644
--- a/drivers/infiniband/hw/qib/qib_fs.c
+++ b/drivers/infiniband/hw/qib/qib_fs.c
@@ -453,17 +453,14 @@ static int remove_file(struct dentry *parent, char *name)
 		goto bail;
 	}
 
-	spin_lock(&dcache_lock);
 	spin_lock(&tmp->d_lock);
 	if (!(d_unhashed(tmp) && tmp->d_inode)) {
 		dget_locked_dlock(tmp);
 		__d_drop(tmp);
 		spin_unlock(&tmp->d_lock);
-		spin_unlock(&dcache_lock);
 		simple_unlink(parent->d_inode, tmp);
 	} else {
 		spin_unlock(&tmp->d_lock);
-		spin_unlock(&dcache_lock);
 	}
 
 	ret = 0;
diff --git a/drivers/staging/pohmelfs/path_entry.c b/drivers/staging/pohmelfs/path_entry.c
index bbe42f4..400a9fc 100644
--- a/drivers/staging/pohmelfs/path_entry.c
+++ b/drivers/staging/pohmelfs/path_entry.c
@@ -101,7 +101,6 @@ rename_retry:
 	d = first;
 	seq = read_seqbegin(&rename_lock);
 	rcu_read_lock();
-	spin_lock(&dcache_lock);
 
 	if (!IS_ROOT(d) && d_unhashed(d))
 		len += UNHASHED_OBSCURE_STRING_SIZE; /* Obscure " (deleted)" string */
@@ -110,7 +109,6 @@ rename_retry:
 		len += d->d_name.len + 1; /* Plus slash */
 		d = d->d_parent;
 	}
-	spin_unlock(&dcache_lock);
 	rcu_read_unlock();
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
diff --git a/drivers/staging/smbfs/cache.c b/drivers/staging/smbfs/cache.c
index 29afb05..abae450 100644
--- a/drivers/staging/smbfs/cache.c
+++ b/drivers/staging/smbfs/cache.c
@@ -62,7 +62,6 @@ smb_invalidate_dircache_entries(struct dentry *parent)
 	struct list_head *next;
 	struct dentry *dentry;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
@@ -72,7 +71,6 @@ smb_invalidate_dircache_entries(struct dentry *parent)
 		next = next->next;
 	}
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 /*
@@ -98,7 +96,6 @@ smb_dget_fpos(struct dentry *dentry, struct dentry *parent, unsigned long fpos)
 	}
 
 	/* If a pointer is invalid, we search the dentry. */
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
@@ -115,7 +112,6 @@ smb_dget_fpos(struct dentry *dentry, struct dentry *parent, unsigned long fpos)
 	dent = NULL;
 out_unlock:
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 	return dent;
 }
 
diff --git a/drivers/usb/core/inode.c b/drivers/usb/core/inode.c
index 89a0e83..1b125c2 100644
--- a/drivers/usb/core/inode.c
+++ b/drivers/usb/core/inode.c
@@ -343,7 +343,6 @@ static int usbfs_empty (struct dentry *dentry)
 {
 	struct list_head *list;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	list_for_each(list, &dentry->d_subdirs) {
 		struct dentry *de = list_entry(list, struct dentry, d_u.d_child);
@@ -352,13 +351,11 @@ static int usbfs_empty (struct dentry *dentry)
 		if (usbfs_positive(de)) {
 			spin_unlock(&de->d_lock);
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 			return 0;
 		}
 		spin_unlock(&de->d_lock);
 	}
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 	return 1;
 }
 
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 47dfd5d..1073bca 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -270,13 +270,11 @@ static struct dentry *v9fs_dentry_from_dir_inode(struct inode *inode)
 {
 	struct dentry *dentry;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	/* Directory should have only one entry. */
 	BUG_ON(S_ISDIR(inode->i_mode) && !list_is_singular(&inode->i_dentry));
 	dentry = list_entry(inode->i_dentry.next, struct dentry, d_alias);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 	return dentry;
 }
 
diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index 2321cc9..600101a 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -128,7 +128,6 @@ affs_fix_dcache(struct dentry *dentry, u32 entry_ino)
 	void *data = dentry->d_fsdata;
 	struct list_head *head, *next;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	head = &inode->i_dentry;
 	next = head->next;
@@ -141,7 +140,6 @@ affs_fix_dcache(struct dentry *dentry, u32 entry_ino)
 		next = next->next;
 	}
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 }
 
 
diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index 9d2ae9b..0fffe1c 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -16,6 +16,7 @@
 #include <linux/auto_fs4.h>
 #include <linux/auto_dev-ioctl.h>
 #include <linux/mutex.h>
+#include <linux/spinlock.h>
 #include <linux/list.h>
 
 /* This is the range of ioctl() numbers we claim as ours */
@@ -60,6 +61,8 @@ do {							\
 		current->pid, __func__, ##args);	\
 } while (0)
 
+extern spinlock_t autofs4_lock;
+
 /* Unified info structure.  This is pointed to by both the dentry and
    inode structures.  Each file in the filesystem has an instance of this
    structure.  It holds a reference to the dentry, so dentries are never
diff --git a/fs/autofs4/expire.c b/fs/autofs4/expire.c
index 3164802..7869b3a 100644
--- a/fs/autofs4/expire.c
+++ b/fs/autofs4/expire.c
@@ -102,7 +102,7 @@ static struct dentry *get_next_positive_dentry(struct dentry *prev,
 	if (prev == NULL)
 		return dget(prev);
 
-	spin_lock(&dcache_lock);
+	spin_lock(&autofs4_lock);
 relock:
 	p = prev;
 	spin_lock(&p->d_lock);
@@ -114,7 +114,7 @@ again:
 
 			if (p == root) {
 				spin_unlock(&p->d_lock);
-				spin_unlock(&dcache_lock);
+				spin_unlock(&autofs4_lock);
 				dput(prev);
 				return NULL;
 			}
@@ -144,7 +144,7 @@ again:
 	dget_dlock(ret);
 	spin_unlock(&ret->d_lock);
 	spin_unlock(&p->d_lock);
-	spin_unlock(&dcache_lock);
+	spin_unlock(&autofs4_lock);
 
 	dput(prev);
 
@@ -408,13 +408,13 @@ found:
 	ino->flags |= AUTOFS_INF_EXPIRING;
 	init_completion(&ino->expire_complete);
 	spin_unlock(&sbi->fs_lock);
-	spin_lock(&dcache_lock);
+	spin_lock(&autofs4_lock);
 	spin_lock(&expired->d_parent->d_lock);
 	spin_lock_nested(&expired->d_lock, DENTRY_D_LOCK_NESTED);
   	list_move(&expired->d_parent->d_subdirs, &expired->d_u.d_child);
 	spin_unlock(&expired->d_lock);
 	spin_unlock(&expired->d_parent->d_lock);
-	spin_unlock(&dcache_lock);
+	spin_unlock(&autofs4_lock);
 	return expired;
 }
 
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index a185e7e..3eaa251 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -23,6 +23,8 @@
 
 #include "autofs_i.h"
 
+DEFINE_SPINLOCK(autofs4_lock);
+
 static int autofs4_dir_symlink(struct inode *,struct dentry *,const char *);
 static int autofs4_dir_unlink(struct inode *,struct dentry *);
 static int autofs4_dir_rmdir(struct inode *,struct dentry *);
@@ -142,15 +144,15 @@ static int autofs4_dir_open(struct inode *inode, struct file *file)
 	 * autofs file system so just let the libfs routines handle
 	 * it.
 	 */
-	spin_lock(&dcache_lock);
+	spin_lock(&autofs4_lock);
 	spin_lock(&dentry->d_lock);
 	if (!d_mountpoint(dentry) && list_empty(&dentry->d_subdirs)) {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
+		spin_unlock(&autofs4_lock);
 		return -ENOENT;
 	}
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
+	spin_unlock(&autofs4_lock);
 
 out:
 	return dcache_dir_open(inode, file);
@@ -255,11 +257,11 @@ static void *autofs4_follow_link(struct dentry *dentry, struct nameidata *nd)
 	/* We trigger a mount for almost all flags */
 	lookup_type = autofs4_need_mount(nd->flags);
 	spin_lock(&sbi->fs_lock);
-	spin_lock(&dcache_lock);
+	spin_lock(&autofs4_lock);
 	spin_lock(&dentry->d_lock);
 	if (!(lookup_type || ino->flags & AUTOFS_INF_PENDING)) {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
+		spin_unlock(&autofs4_lock);
 		spin_unlock(&sbi->fs_lock);
 		goto follow;
 	}
@@ -272,7 +274,7 @@ static void *autofs4_follow_link(struct dentry *dentry, struct nameidata *nd)
 	if (ino->flags & AUTOFS_INF_PENDING ||
 	    (!d_mountpoint(dentry) && list_empty(&dentry->d_subdirs))) {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
+		spin_unlock(&autofs4_lock);
 		spin_unlock(&sbi->fs_lock);
 
 		status = try_to_fill_dentry(dentry, nd->flags);
@@ -282,7 +284,7 @@ static void *autofs4_follow_link(struct dentry *dentry, struct nameidata *nd)
 		goto follow;
 	}
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
+	spin_unlock(&autofs4_lock);
 	spin_unlock(&sbi->fs_lock);
 follow:
 	/*
@@ -353,14 +355,14 @@ static int autofs4_revalidate(struct dentry *dentry, struct nameidata *nd)
 		return 0;
 
 	/* Check for a non-mountpoint directory with no contents */
-	spin_lock(&dcache_lock);
+	spin_lock(&autofs4_lock);
 	spin_lock(&dentry->d_lock);
 	if (S_ISDIR(dentry->d_inode->i_mode) &&
 	    !d_mountpoint(dentry) && list_empty(&dentry->d_subdirs)) {
 		DPRINTK("dentry=%p %.*s, emptydir",
 			 dentry, dentry->d_name.len, dentry->d_name.name);
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
+		spin_unlock(&autofs4_lock);
 
 		/* The daemon never causes a mount to trigger */
 		if (oz_mode)
@@ -377,7 +379,7 @@ static int autofs4_revalidate(struct dentry *dentry, struct nameidata *nd)
 		return status;
 	}
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
+	spin_unlock(&autofs4_lock);
 
 	return 1;
 }
@@ -432,7 +434,7 @@ static struct dentry *autofs4_lookup_active(struct dentry *dentry)
 	const unsigned char *str = name->name;
 	struct list_head *p, *head;
 
-	spin_lock(&dcache_lock);
+	spin_lock(&autofs4_lock);
 	spin_lock(&sbi->lookup_lock);
 	head = &sbi->active_list;
 	list_for_each(p, head) {
@@ -465,14 +467,14 @@ static struct dentry *autofs4_lookup_active(struct dentry *dentry)
 			dget_dlock(active);
 			spin_unlock(&active->d_lock);
 			spin_unlock(&sbi->lookup_lock);
-			spin_unlock(&dcache_lock);
+			spin_unlock(&autofs4_lock);
 			return active;
 		}
 next:
 		spin_unlock(&active->d_lock);
 	}
 	spin_unlock(&sbi->lookup_lock);
-	spin_unlock(&dcache_lock);
+	spin_unlock(&autofs4_lock);
 
 	return NULL;
 }
@@ -487,7 +489,7 @@ static struct dentry *autofs4_lookup_expiring(struct dentry *dentry)
 	const unsigned char *str = name->name;
 	struct list_head *p, *head;
 
-	spin_lock(&dcache_lock);
+	spin_lock(&autofs4_lock);
 	spin_lock(&sbi->lookup_lock);
 	head = &sbi->expiring_list;
 	list_for_each(p, head) {
@@ -520,14 +522,14 @@ static struct dentry *autofs4_lookup_expiring(struct dentry *dentry)
 			dget_dlock(expiring);
 			spin_unlock(&expiring->d_lock);
 			spin_unlock(&sbi->lookup_lock);
-			spin_unlock(&dcache_lock);
+			spin_unlock(&autofs4_lock);
 			return expiring;
 		}
 next:
 		spin_unlock(&expiring->d_lock);
 	}
 	spin_unlock(&sbi->lookup_lock);
-	spin_unlock(&dcache_lock);
+	spin_unlock(&autofs4_lock);
 
 	return NULL;
 }
@@ -763,12 +765,12 @@ static int autofs4_dir_unlink(struct inode *dir, struct dentry *dentry)
 
 	dir->i_mtime = CURRENT_TIME;
 
-	spin_lock(&dcache_lock);
+	spin_lock(&autofs4_lock);
 	autofs4_add_expiring(dentry);
 	spin_lock(&dentry->d_lock);
 	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
+	spin_unlock(&autofs4_lock);
 
 	return 0;
 }
@@ -785,20 +787,20 @@ static int autofs4_dir_rmdir(struct inode *dir, struct dentry *dentry)
 	if (!autofs4_oz_mode(sbi))
 		return -EACCES;
 
-	spin_lock(&dcache_lock);
+	spin_lock(&autofs4_lock);
 	spin_lock(&sbi->lookup_lock);
 	spin_lock(&dentry->d_lock);
 	if (!list_empty(&dentry->d_subdirs)) {
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&sbi->lookup_lock);
-		spin_unlock(&dcache_lock);
+		spin_unlock(&autofs4_lock);
 		return -ENOTEMPTY;
 	}
 	__autofs4_add_expiring(dentry);
 	spin_unlock(&sbi->lookup_lock);
 	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
+	spin_unlock(&autofs4_lock);
 
 	if (atomic_dec_and_test(&ino->count)) {
 		p_ino = autofs4_dentry_ino(dentry->d_parent);
diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 4be8f77..c5f8459 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -194,14 +194,15 @@ static int autofs4_getpath(struct autofs_sb_info *sbi,
 rename_retry:
 	buf = *name;
 	len = 0;
+
 	seq = read_seqbegin(&rename_lock);
 	rcu_read_lock();
-	spin_lock(&dcache_lock);
+	spin_lock(&autofs4_lock);
 	for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
 		len += tmp->d_name.len + 1;
 
 	if (!len || --len > NAME_MAX) {
-		spin_unlock(&dcache_lock);
+		spin_unlock(&autofs4_lock);
 		rcu_read_unlock();
 		if (read_seqretry(&rename_lock, seq))
 			goto rename_retry;
@@ -217,7 +218,7 @@ rename_retry:
 		p -= tmp->d_name.len;
 		strncpy(p, tmp->d_name.name, tmp->d_name.len);
 	}
-	spin_unlock(&dcache_lock);
+	spin_unlock(&autofs4_lock);
 	rcu_read_unlock();
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 4c81ae5..e42d2a1 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -111,7 +111,6 @@ static int __dcache_readdir(struct file *filp,
 	dout("__dcache_readdir %p at %llu (last %p)\n", dir, filp->f_pos,
 	     last);
 
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 
 	/* start at beginning? */
@@ -155,7 +154,6 @@ more:
 	dget_dlock(dentry);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 
 	dout(" %llu (%llu) dentry %p %.*s %p\n", di->offset, filp->f_pos,
 	     dentry, dentry->d_name.len, dentry->d_name.name, dentry->d_inode);
@@ -181,21 +179,19 @@ more:
 
 	filp->f_pos++;
 
-	/* make sure a dentry wasn't dropped while we didn't have dcache_lock */
+	/* make sure a dentry wasn't dropped while we didn't have parent lock */
 	if (!ceph_i_test(dir, CEPH_I_COMPLETE)) {
 		dout(" lost I_COMPLETE on %p; falling back to mds\n", dir);
 		err = -EAGAIN;
 		goto out;
 	}
 
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 	p = p->prev;	/* advance to next dentry */
 	goto more;
 
 out_unlock:
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 out:
 	if (last)
 		dput(last);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 2c69444..2a48caf 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -841,7 +841,6 @@ static void ceph_set_dentry_offset(struct dentry *dn)
 	di->offset = ceph_inode(inode)->i_max_offset++;
 	spin_unlock(&inode->i_lock);
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dir->d_lock);
 	spin_lock_nested(&dn->d_lock, DENTRY_D_LOCK_NESTED);
 	list_move(&dn->d_u.d_child, &dir->d_subdirs);
@@ -849,7 +848,6 @@ static void ceph_set_dentry_offset(struct dentry *dn)
 	     dn->d_u.d_child.prev, dn->d_u.d_child.next);
 	spin_unlock(&dn->d_lock);
 	spin_unlock(&dir->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 /*
@@ -1233,13 +1231,11 @@ retry_lookup:
 			goto retry_lookup;
 		} else {
 			/* reorder parent's d_subdirs */
-			spin_lock(&dcache_lock);
 			spin_lock(&parent->d_lock);
 			spin_lock_nested(&dn->d_lock, DENTRY_D_LOCK_NESTED);
 			list_move(&dn->d_u.d_child, &parent->d_subdirs);
 			spin_unlock(&dn->d_lock);
 			spin_unlock(&parent->d_lock);
-			spin_unlock(&dcache_lock);
 		}
 
 		di = dn->d_fsdata;
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 0ee1767..ca901f0 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -804,17 +804,14 @@ inode_has_hashed_dentries(struct inode *inode)
 {
 	struct dentry *dentry;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		if (!d_unhashed(dentry) || IS_ROOT(dentry)) {
 			spin_unlock(&dcache_inode_lock);
-			spin_unlock(&dcache_lock);
 			return true;
 		}
 	}
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 	return false;
 }
 
diff --git a/fs/coda/cache.c b/fs/coda/cache.c
index 859393f..5525e1c 100644
--- a/fs/coda/cache.c
+++ b/fs/coda/cache.c
@@ -93,7 +93,6 @@ static void coda_flag_children(struct dentry *parent, int flag)
 	struct list_head *child;
 	struct dentry *de;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 	list_for_each(child, &parent->d_subdirs)
 	{
@@ -104,7 +103,6 @@ static void coda_flag_children(struct dentry *parent, int flag)
 		coda_flag_inode(de->d_inode, flag);
 	}
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 	return; 
 }
 
diff --git a/fs/configfs/configfs_internal.h b/fs/configfs/configfs_internal.h
index e58b4c3..026cf68 100644
--- a/fs/configfs/configfs_internal.h
+++ b/fs/configfs/configfs_internal.h
@@ -120,7 +120,6 @@ static inline struct config_item *configfs_get_config_item(struct dentry *dentry
 {
 	struct config_item * item = NULL;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	if (!d_unhashed(dentry)) {
 		struct configfs_dirent * sd = dentry->d_fsdata;
@@ -131,7 +130,6 @@ static inline struct config_item *configfs_get_config_item(struct dentry *dentry
 			item = config_item_get(sd->s_element);
 	}
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 
 	return item;
 }
diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
index 79b3776..fb3a55f 100644
--- a/fs/configfs/inode.c
+++ b/fs/configfs/inode.c
@@ -250,18 +250,14 @@ void configfs_drop_dentry(struct configfs_dirent * sd, struct dentry * parent)
 	struct dentry * dentry = sd->s_dentry;
 
 	if (dentry) {
-		spin_lock(&dcache_lock);
 		spin_lock(&dentry->d_lock);
 		if (!(d_unhashed(dentry) && dentry->d_inode)) {
 			dget_locked_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 			simple_unlink(parent->d_inode, dentry);
-		} else {
+		} else
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
-		}
 	}
 }
 
diff --git a/fs/dcache.c b/fs/dcache.c
index 7d0733b..c4b2c5e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -54,11 +54,10 @@
  *   - d_alias, d_inode
  *
  * Ordering:
- * dcache_lock
- *   dcache_inode_lock
- *     dentry->d_lock
- *       dcache_lru_lock
- *       dcache_hash_lock
+ * dcache_inode_lock
+ *   dentry->d_lock
+ *     dcache_lru_lock
+ *     dcache_hash_lock
  *
  * If there is an ancestor relationship:
  * dentry->d_parent->...->d_parent->d_lock
@@ -77,13 +76,11 @@ EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_inode_lock);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
 EXPORT_SYMBOL(rename_lock);
 EXPORT_SYMBOL(dcache_inode_lock);
 EXPORT_SYMBOL(dcache_hash_lock);
-EXPORT_SYMBOL(dcache_lock);
 
 static struct kmem_cache *dentry_cache __read_mostly;
 
@@ -133,7 +130,7 @@ static void __d_free(struct rcu_head *head)
 }
 
 /*
- * no dcache_lock, please.
+ * no locks, please.
  */
 static void d_free(struct dentry *dentry)
 {
@@ -156,7 +153,6 @@ static void d_free(struct dentry *dentry)
 static void dentry_iput(struct dentry * dentry)
 	__releases(dentry->d_lock)
 	__releases(dcache_inode_lock)
-	__releases(dcache_lock)
 {
 	struct inode *inode = dentry->d_inode;
 	if (inode) {
@@ -164,7 +160,6 @@ static void dentry_iput(struct dentry * dentry)
 		list_del_init(&dentry->d_alias);
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 		if (!inode->i_nlink)
 			fsnotify_inoderemove(inode);
 		if (dentry->d_op && dentry->d_op->d_iput)
@@ -174,7 +169,6 @@ static void dentry_iput(struct dentry * dentry)
 	} else {
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 	}
 }
 
@@ -229,14 +223,13 @@ static void dentry_lru_move_tail(struct dentry *dentry)
  *
  * If this is the root of the dentry tree, return NULL.
  *
- * dcache_lock and d_lock and d_parent->d_lock must be held by caller, and
- * are dropped by d_kill.
+ * dentry->d_lock, parent->d_lock and dcache_inode_lock must be held by
+ * caller, and are dropped by d_kill.
  */
 static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
 	__releases(dentry->d_lock)
 	__releases(parent->d_lock)
 	__releases(dcache_inode_lock)
-	__releases(dcache_lock)
 {
 	dentry->d_parent = NULL;
 	list_del(&dentry->d_u.d_child);
@@ -295,21 +288,10 @@ repeat:
 	else
 		parent = dentry->d_parent;
 	if (dentry->d_count == 1) {
-		if (!spin_trylock(&dcache_lock)) {
-			/*
-			 * Something of a livelock possibility we could avoid
-			 * by taking dcache_lock and trying again, but we
-			 * want to reduce dcache_lock anyway so this will
-			 * get improved.
-			 */
-drop1:
-			spin_unlock(&dentry->d_lock);
-			goto repeat;
-		}
 		if (!spin_trylock(&dcache_inode_lock)) {
 drop2:
-			spin_unlock(&dcache_lock);
-			goto drop1;
+			spin_unlock(&dentry->d_lock);
+			goto repeat;
 		}
 		if (parent && !spin_trylock(&parent->d_lock)) {
 			spin_unlock(&dcache_inode_lock);
@@ -321,7 +303,6 @@ drop2:
 		spin_unlock(&dentry->d_lock);
 		if (parent)
 			spin_unlock(&parent->d_lock);
-		spin_unlock(&dcache_lock);
 		return;
 	}
 
@@ -345,7 +326,6 @@ drop2:
 	if (parent)
 		spin_unlock(&parent->d_lock);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 	return;
 
 unhash_it:
@@ -376,11 +356,9 @@ int d_invalidate(struct dentry * dentry)
 	/*
 	 * If it's already been dropped, return OK.
 	 */
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	if (d_unhashed(dentry)) {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
 		return 0;
 	}
 	/*
@@ -389,9 +367,7 @@ int d_invalidate(struct dentry * dentry)
 	 */
 	if (!list_empty(&dentry->d_subdirs)) {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
 		shrink_dcache_parent(dentry);
-		spin_lock(&dcache_lock);
 		spin_lock(&dentry->d_lock);
 	}
 
@@ -408,19 +384,17 @@ int d_invalidate(struct dentry * dentry)
 	if (dentry->d_count > 1) {
 		if (dentry->d_inode && S_ISDIR(dentry->d_inode->i_mode)) {
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 			return -EBUSY;
 		}
 	}
 
 	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 	return 0;
 }
 EXPORT_SYMBOL(d_invalidate);
 
-/* This must be called with dcache_lock and d_lock held */
+/* This must be called with d_lock held */
 static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
 {
 	dentry->d_count++;
@@ -428,7 +402,7 @@ static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
 	return dentry;
 }
 
-/* This should be called _only_ with dcache_lock held */
+/* This takes d_lock itself, so it must be called without d_lock held */
 static inline struct dentry * __dget_locked(struct dentry *dentry)
 {
 	spin_lock(&dentry->d_lock);
@@ -538,11 +512,9 @@ struct dentry *d_find_alias(struct inode *inode)
 	struct dentry *de = NULL;
 
 	if (!list_empty(&inode->i_dentry)) {
-		spin_lock(&dcache_lock);
 		spin_lock(&dcache_inode_lock);
 		de = __d_find_alias(inode, 0);
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 	}
 	return de;
 }
@@ -556,7 +528,6 @@ void d_prune_aliases(struct inode *inode)
 {
 	struct dentry *dentry;
 restart:
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
@@ -565,14 +536,12 @@ restart:
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_inode_lock);
-			spin_unlock(&dcache_lock);
 			dput(dentry);
 			goto restart;
 		}
 		spin_unlock(&dentry->d_lock);
 	}
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 }
 EXPORT_SYMBOL(d_prune_aliases);
 
@@ -588,17 +557,14 @@ static void prune_one_dentry(struct dentry *dentry, struct dentry *parent)
 	__releases(dentry->d_lock)
 	__releases(parent->d_lock)
 	__releases(dcache_inode_lock)
-	__releases(dcache_lock)
 {
 	__d_drop(dentry);
 	dentry = d_kill(dentry, parent);
 
 	/*
-	 * Prune ancestors.  Locking is simpler than in dput(),
-	 * because dcache_lock needs to be taken anyway.
+	 * Prune ancestors.
 	 */
 	while (dentry) {
-		spin_lock(&dcache_lock);
 		spin_lock(&dcache_inode_lock);
 again:
 		spin_lock(&dentry->d_lock);
@@ -616,7 +582,6 @@ again:
 				spin_unlock(&parent->d_lock);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_inode_lock);
-			spin_unlock(&dcache_lock);
 			return;
 		}
 
@@ -665,8 +630,7 @@ relock:
 		spin_unlock(&dcache_lru_lock);
 
 		prune_one_dentry(dentry, parent);
-		/* dcache_lock, dcache_inode_lock and dentry->d_lock dropped */
-		spin_lock(&dcache_lock);
+		/* dcache_inode_lock and dentry->d_lock dropped */
 		spin_lock(&dcache_inode_lock);
 		spin_lock(&dcache_lru_lock);
 	}
@@ -688,7 +652,6 @@ static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
 	LIST_HEAD(tmp);
 	int cnt = *count;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 relock:
 	spin_lock(&dcache_lru_lock);
@@ -719,7 +682,6 @@ relock:
 			if (!--cnt)
 				break;
 		}
-		/* XXX: re-add cond_resched_lock when dcache_lock goes away */
 	}
 
 	*count = cnt;
@@ -729,7 +691,6 @@ relock:
 		list_splice(&referenced, &sb->s_dentry_lru);
 	spin_unlock(&dcache_lru_lock);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 }
 
 /**
@@ -751,7 +712,6 @@ static void prune_dcache(int count)
 
 	if (unused == 0 || count == 0)
 		return;
-	spin_lock(&dcache_lock);
 	if (count >= unused)
 		prune_ratio = 1;
 	else
@@ -788,11 +748,9 @@ static void prune_dcache(int count)
 		if (down_read_trylock(&sb->s_umount)) {
 			if ((sb->s_root != NULL) &&
 			    (!list_empty(&sb->s_dentry_lru))) {
-				spin_unlock(&dcache_lock);
 				__shrink_dcache_sb(sb, &w_count,
 						DCACHE_REFERENCED);
 				pruned -= w_count;
-				spin_lock(&dcache_lock);
 			}
 			up_read(&sb->s_umount);
 		}
@@ -808,7 +766,6 @@ static void prune_dcache(int count)
 	if (p)
 		__put_super(p);
 	spin_unlock(&sb_lock);
-	spin_unlock(&dcache_lock);
 }
 
 /**
@@ -822,7 +779,6 @@ void shrink_dcache_sb(struct super_block *sb)
 {
 	LIST_HEAD(tmp);
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	spin_lock(&dcache_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
@@ -831,7 +787,6 @@ void shrink_dcache_sb(struct super_block *sb)
 	}
 	spin_unlock(&dcache_lru_lock);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
@@ -848,12 +803,10 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 	BUG_ON(!IS_ROOT(dentry));
 
 	/* detach this root from the system */
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	dentry_lru_del(dentry);
 	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 
 	for (;;) {
 		/* descend to the first leaf in the current subtree */
@@ -862,7 +815,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 
 			/* this is a branch with children - detach all of them
 			 * from the system in one go */
-			spin_lock(&dcache_lock);
 			spin_lock(&dentry->d_lock);
 			list_for_each_entry(loop, &dentry->d_subdirs,
 					    d_u.d_child) {
@@ -873,7 +825,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 				spin_unlock(&loop->d_lock);
 			}
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 
 			/* move to the first child */
 			dentry = list_entry(dentry->d_subdirs.next,
@@ -940,8 +891,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 
 /*
  * destroy the dentries attached to a superblock on unmounting
- * - we don't need to use dentry->d_lock, and only need dcache_lock when
- *   removing the dentry from the system lists and hashes because:
+ * - we don't need to use dentry->d_lock because:
  *   - the superblock is detached from all mountings and open files, so the
  *     dentry trees will not be rearranged by the VFS
  *   - s_umount is write-locked, so the memory pressure shrinker will ignore
@@ -992,7 +942,6 @@ rename_retry:
 	this_parent = parent;
 	seq = read_seqbegin(&rename_lock);
 
-	spin_lock(&dcache_lock);
 	if (d_mountpoint(parent))
 		goto positive;
 	spin_lock(&this_parent->d_lock);
@@ -1038,7 +987,6 @@ resume:
 		if (this_parent != child->d_parent ||
 				read_seqretry(&rename_lock, seq)) {
 			spin_unlock(&this_parent->d_lock);
-			spin_unlock(&dcache_lock);
 			rcu_read_unlock();
 			goto rename_retry;
 		}
@@ -1047,12 +995,10 @@ resume:
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
-	spin_unlock(&dcache_lock);
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 	return 0; /* No mount points found in tree */
 positive:
-	spin_unlock(&dcache_lock);
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 	return 1;
@@ -1084,7 +1030,6 @@ rename_retry:
 	this_parent = parent;
 	seq = read_seqbegin(&rename_lock);
 
-	spin_lock(&dcache_lock);
 	spin_lock(&this_parent->d_lock);
 repeat:
 	next = this_parent->d_subdirs.next;
@@ -1148,7 +1093,6 @@ resume:
 		if (this_parent != child->d_parent ||
 				read_seqretry(&rename_lock, seq)) {
 			spin_unlock(&this_parent->d_lock);
-			spin_unlock(&dcache_lock);
 			rcu_read_unlock();
 			goto rename_retry;
 		}
@@ -1158,7 +1102,6 @@ resume:
 	}
 out:
 	spin_unlock(&this_parent->d_lock);
-	spin_unlock(&dcache_lock);
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 	return found;
@@ -1263,7 +1206,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	INIT_LIST_HEAD(&dentry->d_u.d_child);
 
 	if (parent) {
-		spin_lock(&dcache_lock);
 		spin_lock(&parent->d_lock);
 		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
 		dentry->d_parent = dget_dlock(parent);
@@ -1271,7 +1213,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&parent->d_lock);
-		spin_unlock(&dcache_lock);
 	}
 
 	percpu_counter_inc(&nr_dentry);
@@ -1291,7 +1232,6 @@ struct dentry *d_alloc_name(struct dentry *parent, const char *name)
 }
 EXPORT_SYMBOL(d_alloc_name);
 
-/* the caller must hold dcache_lock */
 static void __d_instantiate(struct dentry *dentry, struct inode *inode)
 {
 	spin_lock(&dentry->d_lock);
@@ -1320,11 +1260,9 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
 void d_instantiate(struct dentry *entry, struct inode * inode)
 {
 	BUG_ON(!list_empty(&entry->d_alias));
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	__d_instantiate(entry, inode);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 	security_d_instantiate(entry, inode);
 }
 EXPORT_SYMBOL(d_instantiate);
@@ -1387,11 +1325,9 @@ struct dentry *d_instantiate_unique(struct dentry *entry, struct inode *inode)
 
 	BUG_ON(!list_empty(&entry->d_alias));
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	result = __d_instantiate_unique(entry, inode);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 
 	if (!result) {
 		security_d_instantiate(entry, inode);
@@ -1480,12 +1416,11 @@ struct dentry *d_obtain_alias(struct inode *inode)
 	}
 	tmp->d_parent = tmp; /* make sure dput doesn't croak */
 
-	spin_lock(&dcache_lock);
+
 	spin_lock(&dcache_inode_lock);
 	res = __d_find_alias(inode, 0);
 	if (res) {
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 		dput(tmp);
 		goto out_iput;
 	}
@@ -1503,7 +1438,6 @@ struct dentry *d_obtain_alias(struct inode *inode)
 	spin_unlock(&tmp->d_lock);
 	spin_unlock(&dcache_inode_lock);
 
-	spin_unlock(&dcache_lock);
 	return tmp;
 
  out_iput:
@@ -1533,21 +1467,18 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 	struct dentry *new = NULL;
 
 	if (inode && S_ISDIR(inode->i_mode)) {
-		spin_lock(&dcache_lock);
 		spin_lock(&dcache_inode_lock);
 		new = __d_find_alias(inode, 1);
 		if (new) {
 			BUG_ON(!(new->d_flags & DCACHE_DISCONNECTED));
 			spin_unlock(&dcache_inode_lock);
-			spin_unlock(&dcache_lock);
 			security_d_instantiate(new, inode);
 			d_move(new, dentry);
 			iput(inode);
 		} else {
-			/* already taking dcache_lock, so d_add() by hand */
+			/* already taking dcache_inode_lock, so d_add() by hand */
 			__d_instantiate(dentry, inode);
 			spin_unlock(&dcache_inode_lock);
-			spin_unlock(&dcache_lock);
 			security_d_instantiate(dentry, inode);
 			d_rehash(dentry);
 		}
@@ -1620,12 +1551,10 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 	 * Negative dentry: instantiate it unless the inode is a directory and
 	 * already has a dentry.
 	 */
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	if (!S_ISDIR(inode->i_mode) || list_empty(&inode->i_dentry)) {
 		__d_instantiate(found, inode);
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 		security_d_instantiate(found, inode);
 		return found;
 	}
@@ -1637,7 +1566,6 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 	new = list_entry(inode->i_dentry.next, struct dentry, d_alias);
 	dget_locked(new);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 	security_d_instantiate(found, inode);
 	d_move(new, found);
 	iput(inode);
@@ -1808,7 +1736,6 @@ int d_validate(struct dentry *dentry, struct dentry *dparent)
 {
 	struct dentry *child;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dparent->d_lock);
 	list_for_each_entry(child, &dparent->d_subdirs, d_u.d_child) {
 		if (dentry == child) {
@@ -1816,12 +1743,10 @@ int d_validate(struct dentry *dentry, struct dentry *dparent)
 			__dget_locked_dlock(dentry);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dparent->d_lock);
-			spin_unlock(&dcache_lock);
 			return 1;
 		}
 	}
 	spin_unlock(&dparent->d_lock);
-	spin_unlock(&dcache_lock);
 
 	return 0;
 }
@@ -1854,7 +1779,6 @@ void d_delete(struct dentry * dentry)
 	/*
 	 * Are we the only user?
 	 */
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	spin_lock(&dentry->d_lock);
 	isdir = S_ISDIR(dentry->d_inode->i_mode);
@@ -1870,7 +1794,6 @@ void d_delete(struct dentry * dentry)
 
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 
 	fsnotify_nameremove(dentry, isdir);
 }
@@ -1897,13 +1820,11 @@ static void _d_rehash(struct dentry * entry)
  
 void d_rehash(struct dentry * entry)
 {
-	spin_lock(&dcache_lock);
 	spin_lock(&entry->d_lock);
 	spin_lock(&dcache_hash_lock);
 	_d_rehash(entry);
 	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&entry->d_lock);
-	spin_unlock(&dcache_lock);
 }
 EXPORT_SYMBOL(d_rehash);
 
@@ -1972,14 +1893,14 @@ static void switch_names(struct dentry *dentry, struct dentry *target)
  */
  
 /*
- * d_move_locked - move a dentry
+ * d_move - move a dentry
  * @dentry: entry to move
  * @target: new dentry
  *
  * Update the dcache to reflect the move of a file name. Negative
  * dcache entries should not be moved in this way.
  */
-static void d_move_locked(struct dentry * dentry, struct dentry * target)
+void d_move(struct dentry * dentry, struct dentry * target)
 {
 	if (!dentry->d_inode)
 		printk(KERN_WARNING "VFS: moving negative dcache entry\n");
@@ -2049,22 +1970,6 @@ static void d_move_locked(struct dentry * dentry, struct dentry * target)
 	spin_unlock(&dentry->d_lock);
 	write_sequnlock(&rename_lock);
 }
-
-/**
- * d_move - move a dentry
- * @dentry: entry to move
- * @target: new dentry
- *
- * Update the dcache to reflect the move of a file name. Negative
- * dcache entries should not be moved in this way.
- */
-
-void d_move(struct dentry * dentry, struct dentry * target)
-{
-	spin_lock(&dcache_lock);
-	d_move_locked(dentry, target);
-	spin_unlock(&dcache_lock);
-}
 EXPORT_SYMBOL(d_move);
 
 /**
@@ -2090,13 +1995,12 @@ struct dentry *d_ancestor(struct dentry *p1, struct dentry *p2)
  * This helper attempts to cope with remotely renamed directories
  *
  * It assumes that the caller is already holding
- * dentry->d_parent->d_inode->i_mutex and the dcache_lock
+ * dentry->d_parent->d_inode->i_mutex and the dcache_inode_lock
  *
  * Note: If ever the locking in lock_rename() changes, then please
  * remember to update this too...
  */
 static struct dentry *__d_unalias(struct dentry *dentry, struct dentry *alias)
-	__releases(dcache_lock)
 	__releases(dcache_inode_lock)
 {
 	struct mutex *m1 = NULL, *m2 = NULL;
@@ -2120,11 +2024,10 @@ static struct dentry *__d_unalias(struct dentry *dentry, struct dentry *alias)
 		goto out_err;
 	m2 = &alias->d_parent->d_inode->i_mutex;
 out_unalias:
-	d_move_locked(alias, dentry);
+	d_move(alias, dentry);
 	ret = alias;
 out_err:
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 	if (m2)
 		mutex_unlock(m2);
 	if (m1)
@@ -2188,7 +2091,6 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 
 	BUG_ON(!d_unhashed(dentry));
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 
 	if (!inode) {
@@ -2203,6 +2105,11 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 		/* Does an aliased dentry already exist? */
 		alias = __d_find_alias(inode, 0);
 		if (alias) {
+			/*
+			 * XXX: after dcache_lock removal, we no longer
+			 * guarantee a !d_unhashed alias here. Is that going to
+			 * be a problem?
+			 */
 			actual = alias;
 			/* Is this an anonymous mountpoint that we could splice
 			 * into our tree? */
@@ -2234,7 +2141,6 @@ found:
 	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&actual->d_lock);
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 out_nolock:
 	if (actual == dentry) {
 		security_d_instantiate(dentry, inode);
@@ -2246,7 +2152,6 @@ out_nolock:
 
 shouldnt_be_hashed:
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 	BUG();
 }
 EXPORT_SYMBOL_GPL(d_materialise_unique);
@@ -2360,11 +2265,9 @@ char *__d_path(const struct path *path, struct path *root,
 	int error;
 
 	prepend(&res, &buflen, "\0", 1);
-	spin_lock(&dcache_lock);
 	write_seqlock(&rename_lock);
 	error = prepend_path(path, root, &res, &buflen);
 	write_sequnlock(&rename_lock);
-	spin_unlock(&dcache_lock);
 
 	if (error)
 		return ERR_PTR(error);
@@ -2426,14 +2329,12 @@ char *d_path(const struct path *path, char *buf, int buflen)
 		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
 
 	get_fs_root(current->fs, &root);
-	spin_lock(&dcache_lock);
 	write_seqlock(&rename_lock);
 	tmp = root;
 	error = path_with_deleted(path, &tmp, &res, &buflen);
 	if (error)
 		res = ERR_PTR(error);
 	write_sequnlock(&rename_lock);
-	spin_unlock(&dcache_lock);
 	path_put(&root);
 	return res;
 }
@@ -2459,14 +2360,12 @@ char *d_path_with_unreachable(const struct path *path, char *buf, int buflen)
 		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
 
 	get_fs_root(current->fs, &root);
-	spin_lock(&dcache_lock);
 	write_seqlock(&rename_lock);
 	tmp = root;
 	error = path_with_deleted(path, &tmp, &res, &buflen);
 	if (!error && !path_equal(&tmp, &root))
 		error = prepend_unreachable(&res, &buflen);
 	write_sequnlock(&rename_lock);
-	spin_unlock(&dcache_lock);
 	path_put(&root);
 	if (error)
 		res =  ERR_PTR(error);
@@ -2533,11 +2432,9 @@ char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
 {
 	char *retval;
 
-	spin_lock(&dcache_lock);
 	write_seqlock(&rename_lock);
 	retval = __dentry_path(dentry, buf, buflen);
 	write_sequnlock(&rename_lock);
-	spin_unlock(&dcache_lock);
 
 	return retval;
 }
@@ -2548,7 +2445,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 	char *p = NULL;
 	char *retval;
 
-	spin_lock(&dcache_lock);
 	write_seqlock(&rename_lock);
 	if (d_unlinked(dentry)) {
 		p = buf + buflen;
@@ -2558,12 +2454,10 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 	}
 	retval = __dentry_path(dentry, buf, buflen);
 	write_sequnlock(&rename_lock);
-	spin_unlock(&dcache_lock);
 	if (!IS_ERR(retval) && p)
 		*p = '/';	/* restore '/' overriden with '\0' */
 	return retval;
 Elong:
-	spin_unlock(&dcache_lock);
 	return ERR_PTR(-ENAMETOOLONG);
 }
 
@@ -2597,7 +2491,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 	get_fs_root_and_pwd(current->fs, &root, &pwd);
 
 	error = -ENOENT;
-	spin_lock(&dcache_lock);
 	write_seqlock(&rename_lock);
 	if (!d_unlinked(pwd.dentry)) {
 		unsigned long len;
@@ -2608,7 +2501,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 		prepend(&cwd, &buflen, "\0", 1);
 		error = prepend_path(&pwd, &tmp, &cwd, &buflen);
 		write_sequnlock(&rename_lock);
-		spin_unlock(&dcache_lock);
 
 		if (error)
 			goto out;
@@ -2629,7 +2521,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 		}
 	} else {
 		write_sequnlock(&rename_lock);
-		spin_unlock(&dcache_lock);
 	}
 
 out:
@@ -2715,7 +2606,6 @@ void d_genocide(struct dentry *root)
 rename_retry:
 	this_parent = root;
 	seq = read_seqbegin(&rename_lock);
-	spin_lock(&dcache_lock);
 	spin_lock(&this_parent->d_lock);
 repeat:
 	next = this_parent->d_subdirs.next;
@@ -2762,7 +2652,6 @@ resume:
 		if (this_parent != child->d_parent ||
 				read_seqretry(&rename_lock, seq)) {
 			spin_unlock(&this_parent->d_lock);
-			spin_unlock(&dcache_lock);
 			rcu_read_unlock();
 			goto rename_retry;
 		}
@@ -2771,7 +2660,6 @@ resume:
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
-	spin_unlock(&dcache_lock);
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
 }
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 84b8c46..53a5c08 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -47,24 +47,20 @@ find_acceptable_alias(struct dentry *result,
 	if (acceptable(context, result))
 		return result;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(dentry, &result->d_inode->i_dentry, d_alias) {
 		dget_locked(dentry);
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 		if (toput)
 			dput(toput);
 		if (dentry != result && acceptable(context, dentry)) {
 			dput(result);
 			return dentry;
 		}
-		spin_lock(&dcache_lock);
 		spin_lock(&dcache_inode_lock);
 		toput = dentry;
 	}
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 
 	if (toput)
 		dput(toput);
diff --git a/fs/libfs.c b/fs/libfs.c
index cc47949..28b3666 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -100,7 +100,6 @@ loff_t dcache_dir_lseek(struct file *file, loff_t offset, int origin)
 			struct dentry *cursor = file->private_data;
 			loff_t n = file->f_pos - 2;
 
-			spin_lock(&dcache_lock);
 			spin_lock(&dentry->d_lock);
 			/* d_lock not required for cursor */
 			list_del(&cursor->d_u.d_child);
@@ -116,7 +115,6 @@ loff_t dcache_dir_lseek(struct file *file, loff_t offset, int origin)
 			}
 			list_add_tail(&cursor->d_u.d_child, p);
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 		}
 	}
 	mutex_unlock(&dentry->d_inode->i_mutex);
@@ -159,7 +157,6 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
 			i++;
 			/* fallthrough */
 		default:
-			spin_lock(&dcache_lock);
 			spin_lock(&dentry->d_lock);
 			if (filp->f_pos == 2)
 				list_move(q, &dentry->d_subdirs);
@@ -175,13 +172,11 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
 
 				spin_unlock(&next->d_lock);
 				spin_unlock(&dentry->d_lock);
-				spin_unlock(&dcache_lock);
 				if (filldir(dirent, next->d_name.name, 
 					    next->d_name.len, filp->f_pos, 
 					    next->d_inode->i_ino, 
 					    dt_type(next->d_inode)) < 0)
 					return 0;
-				spin_lock(&dcache_lock);
 				spin_lock(&dentry->d_lock);
 				spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
 				/* next is still alive */
@@ -191,7 +186,6 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
 				filp->f_pos++;
 			}
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 	}
 	return 0;
 }
@@ -285,7 +279,6 @@ int simple_empty(struct dentry *dentry)
 	struct dentry *child;
 	int ret = 0;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child) {
 		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
@@ -298,7 +291,6 @@ int simple_empty(struct dentry *dentry)
 	ret = 1;
 out:
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 	return ret;
 }
 
diff --git a/fs/namei.c b/fs/namei.c
index 3081ea3..39fda05 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -612,8 +612,8 @@ int follow_up(struct path *path)
 	return 1;
 }
 
-/* no need for dcache_lock, as serialization is taken care in
- * namespace.c
+/*
+ * serialization is taken care of in namespace.c
  */
 static int __follow_mount(struct path *path)
 {
@@ -645,9 +645,6 @@ static void follow_mount(struct path *path)
 	}
 }
 
-/* no need for dcache_lock, as serialization is taken care in
- * namespace.c
- */
 int follow_down(struct path *path)
 {
 	struct vfsmount *mounted;
@@ -2128,12 +2125,10 @@ void dentry_unhash(struct dentry *dentry)
 {
 	dget(dentry);
 	shrink_dcache_parent(dentry);
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	if (dentry->d_count == 2)
 		__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 int vfs_rmdir(struct inode *dir, struct dentry *dentry)
diff --git a/fs/ncpfs/dir.c b/fs/ncpfs/dir.c
index ce65f29..e6d5153 100644
--- a/fs/ncpfs/dir.c
+++ b/fs/ncpfs/dir.c
@@ -393,7 +393,6 @@ ncp_dget_fpos(struct dentry *dentry, struct dentry *parent, unsigned long fpos)
 	}
 
 	/* If a pointer is invalid, we search the dentry. */
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
@@ -404,13 +403,11 @@ ncp_dget_fpos(struct dentry *dentry, struct dentry *parent, unsigned long fpos)
 			else
 				dent = NULL;
 			spin_unlock(&parent->d_lock);
-			spin_unlock(&dcache_lock);
 			goto out;
 		}
 		next = next->next;
 	}
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 	return NULL;
 
 out:
@@ -634,21 +631,18 @@ ncp_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
 			struct inode *inode = newdent->d_inode;
 
 			/*
-			 * Inside ncpfs all uses of d_name are either for debugging,
-			 * or on functions which acquire inode mutex (mknod, creat,
-			 * lookup).  So grab i_mutex here, to be sure.  d_path
-			 * uses dcache_lock when generating path, so we should too.
-			 * And finally d_compare is protected by dentry's d_lock, so
-			 * here we go.
+			 * Inside ncpfs all uses of d_name are either for
+			 * debugging, or on functions which acquire inode mutex
+			 * (mknod, creat, lookup).  So grab i_mutex here, to be
+			 * sure.  And finally d_compare is protected by
+			 * dentry's d_lock, so here we go.
 			 */
 			if (inode)
 				mutex_lock(&inode->i_mutex);
-			spin_lock(&dcache_lock);
 			spin_lock(&newdent->d_lock);
 			memcpy((char *) newdent->d_name.name, qname.name,
-								newdent->d_name.len);
+							newdent->d_name.len);
 			spin_unlock(&newdent->d_lock);
-			spin_unlock(&dcache_lock);
 			if (inode)
 				mutex_unlock(&inode->i_mutex);
 		}
diff --git a/fs/ncpfs/ncplib_kernel.h b/fs/ncpfs/ncplib_kernel.h
index c4b718f..1220df7 100644
--- a/fs/ncpfs/ncplib_kernel.h
+++ b/fs/ncpfs/ncplib_kernel.h
@@ -193,7 +193,6 @@ ncp_renew_dentries(struct dentry *parent)
 	struct list_head *next;
 	struct dentry *dentry;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
@@ -207,7 +206,6 @@ ncp_renew_dentries(struct dentry *parent)
 		next = next->next;
 	}
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 static inline void
@@ -217,7 +215,6 @@ ncp_invalidate_dircache_entries(struct dentry *parent)
 	struct list_head *next;
 	struct dentry *dentry;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&parent->d_lock);
 	next = parent->d_subdirs.next;
 	while (next != &parent->d_subdirs) {
@@ -227,7 +224,6 @@ ncp_invalidate_dircache_entries(struct dentry *parent)
 		next = next->next;
 	}
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 struct ncp_cache_head {
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 34ef2dd..c03f2d1 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1728,11 +1728,9 @@ static int nfs_unlink(struct inode *dir, struct dentry *dentry)
 	dfprintk(VFS, "NFS: unlink(%s/%ld, %s)\n", dir->i_sb->s_id,
 		dir->i_ino, dentry->d_name.name);
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	if (dentry->d_count > 1) {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_lock);
 		/* Start asynchronous writeout of the inode */
 		write_inode_now(dentry->d_inode, 0);
 		error = nfs_sillyrename(dir, dentry);
@@ -1743,7 +1741,6 @@ static int nfs_unlink(struct inode *dir, struct dentry *dentry)
 		need_rehash = 1;
 	}
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 	error = nfs_safe_remove(dentry);
 	if (!error || error == -ENOENT) {
 		nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
diff --git a/fs/nfs/getroot.c b/fs/nfs/getroot.c
index 850f67d..b3e36c3 100644
--- a/fs/nfs/getroot.c
+++ b/fs/nfs/getroot.c
@@ -63,13 +63,11 @@ static int nfs_superblock_set_dummy_root(struct super_block *sb, struct inode *i
 		 * This again causes shrink_dcache_for_umount_subtree() to
 		 * Oops, since the test for IS_ROOT() will fail.
 		 */
-		spin_lock(&dcache_lock);
 		spin_lock(&dcache_inode_lock);
 		spin_lock(&sb->s_root->d_lock);
 		list_del_init(&sb->s_root->d_alias);
 		spin_unlock(&sb->s_root->d_lock);
 		spin_unlock(&dcache_inode_lock);
-		spin_unlock(&dcache_lock);
 	}
 	return 0;
 }
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index 78c0ebb..74aaf39 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -60,7 +60,6 @@ rename_retry:
 
 	seq = read_seqbegin(&rename_lock);
 	rcu_read_lock();
-	spin_lock(&dcache_lock);
 	while (!IS_ROOT(dentry) && dentry != droot) {
 		namelen = dentry->d_name.len;
 		buflen -= namelen + 1;
@@ -71,7 +70,6 @@ rename_retry:
 		*--end = '/';
 		dentry = dentry->d_parent;
 	}
-	spin_unlock(&dcache_lock);
 	rcu_read_unlock();
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
@@ -91,7 +89,6 @@ rename_retry:
 	memcpy(end, base, namelen);
 	return end;
 Elong_unlock:
-	spin_unlock(&dcache_lock);
 	rcu_read_unlock();
 	if (read_seqretry(&rename_lock, seq))
 		goto rename_retry;
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index ae769fc..9be6ec1 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -59,7 +59,6 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
 	/* determine if the children should tell inode about their events */
 	watched = fsnotify_inode_watches_children(inode);
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	/* run all of the dentries associated with this inode.  Since this is a
 	 * directory, there damn well better only be one item on this list */
@@ -84,7 +83,6 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
 		spin_unlock(&alias->d_lock);
 	}
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 }
 
 /* Notify this dentry's parent about a child's events. */
diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c
index d83cca4..76f2170 100644
--- a/fs/ocfs2/dcache.c
+++ b/fs/ocfs2/dcache.c
@@ -169,7 +169,6 @@ struct dentry *ocfs2_find_local_alias(struct inode *inode,
 	struct list_head *p;
 	struct dentry *dentry = NULL;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&dcache_inode_lock);
 	list_for_each(p, &inode->i_dentry) {
 		dentry = list_entry(p, struct dentry, d_alias);
@@ -189,7 +188,6 @@ struct dentry *ocfs2_find_local_alias(struct inode *inode,
 	}
 
 	spin_unlock(&dcache_inode_lock);
-	spin_unlock(&dcache_lock);
 
 	return dentry;
 }
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index f9b5744..ab07ed8 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -184,7 +184,6 @@ struct dentry_operations {
 
 extern spinlock_t dcache_inode_lock;
 extern spinlock_t dcache_hash_lock;
-extern spinlock_t dcache_lock;
 extern seqlock_t rename_lock;
 
 /**
@@ -215,11 +214,9 @@ static inline void __d_drop(struct dentry *dentry)
 
 static inline void d_drop(struct dentry *dentry)
 {
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
  	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 static inline int dname_external(struct dentry *dentry)
@@ -328,8 +325,8 @@ extern char *dentry_path(struct dentry *, char *, int);
  *	destroyed when it has references. dget() should never be
  *	called for dentries with zero reference counter. For these cases
  *	(preferably none, functions in dcache.c are sufficient for normal
- *	needs and they take necessary precautions) you should hold dcache_lock
- *	and call dget_locked() instead of dget().
+ *	needs and they take necessary precautions) you should hold d_lock
+ *	and call dget_dlock() instead of dget().
  */
 static inline struct dentry *dget_dlock(struct dentry *dentry)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c9e06cc..bf95e7e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1377,7 +1377,7 @@ struct super_block {
 #else
 	struct list_head	s_files;
 #endif
-	/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
+	/* s_dentry_lru, s_nr_dentry_unused protected by dcache.c lru locks */
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	int			s_nr_dentry_unused;	/* # of dentry on lru */
 
@@ -2445,6 +2445,10 @@ static inline ino_t parent_ino(struct dentry *dentry)
 {
 	ino_t res;
 
+	/*
+	 * Don't strictly need d_lock here? If the parent ino could change
+	 * then surely we'd have a deeper race in the caller?
+	 */
 	spin_lock(&dentry->d_lock);
 	res = dentry->d_parent->d_inode->i_ino;
 	spin_unlock(&dentry->d_lock);
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 5c185fa..76d359a 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -17,7 +17,6 @@
 
 /*
  * fsnotify_d_instantiate - instantiate a dentry for inode
- * Called with dcache_lock held.
  */
 static inline void fsnotify_d_instantiate(struct dentry *dentry,
 					  struct inode *inode)
@@ -62,7 +61,6 @@ static inline int fsnotify_perm(struct file *file, int mask)
 
 /*
  * fsnotify_d_move - dentry has been moved
- * Called with dcache_lock and dentry->d_lock held.
  */
 static inline void fsnotify_d_move(struct dentry *dentry)
 {
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 0a68f92..b1944f0 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -329,9 +329,15 @@ static inline void __fsnotify_update_dcache_flags(struct dentry *dentry)
 {
 	struct dentry *parent;
 
-	assert_spin_locked(&dcache_lock);
 	assert_spin_locked(&dentry->d_lock);
 
+	/*
+	 * Serialisation of setting PARENT_WATCHED on the dentries is provided
+	 * by d_lock. If inotify_inode_watched changes after we have taken
+	 * d_lock, the following __fsnotify_update_child_dentry_flags call will
+	 * find our entry, so it will spin until we complete here, and update
+	 * us with the new state.
+	 */
 	parent = dentry->d_parent;
 	if (parent->d_inode && fsnotify_inode_watches_children(parent->d_inode))
 		dentry->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
@@ -341,15 +347,12 @@ static inline void __fsnotify_update_dcache_flags(struct dentry *dentry)
 
 /*
  * fsnotify_d_instantiate - instantiate a dentry for inode
- * Called with dcache_lock held.
  */
 static inline void __fsnotify_d_instantiate(struct dentry *dentry, struct inode *inode)
 {
 	if (!inode)
 		return;
 
-	assert_spin_locked(&dcache_lock);
-
 	spin_lock(&dentry->d_lock);
 	__fsnotify_update_dcache_flags(dentry);
 	spin_unlock(&dentry->d_lock);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 05b441d..aec730b 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -41,7 +41,6 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
  *  - require a directory
  *  - ending slashes ok even for nonexistent files
  *  - internal "there are more path components" flag
- *  - locked when lookup done with dcache_lock held
  *  - dentry cache is untrusted; force a real lookup
  */
 #define LOOKUP_FOLLOW		 1
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 54f3442..845dc23 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -874,7 +874,6 @@ static void cgroup_clear_directory(struct dentry *dentry)
 	struct list_head *node;
 
 	BUG_ON(!mutex_is_locked(&dentry->d_inode->i_mutex));
-	spin_lock(&dcache_lock);
 	spin_lock(&dentry->d_lock);
 	node = dentry->d_subdirs.next;
 	while (node != &dentry->d_subdirs) {
@@ -889,18 +888,15 @@ static void cgroup_clear_directory(struct dentry *dentry)
 			dget_locked_dlock(d);
 			spin_unlock(&d->d_lock);
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_lock);
 			d_delete(d);
 			simple_unlink(dentry->d_inode, d);
 			dput(d);
-			spin_lock(&dcache_lock);
 			spin_lock(&dentry->d_lock);
 		} else
 			spin_unlock(&d->d_lock);
 		node = dentry->d_subdirs.next;
 	}
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 /*
@@ -912,14 +908,12 @@ static void cgroup_d_remove_dir(struct dentry *dentry)
 
 	cgroup_clear_directory(dentry);
 
-	spin_lock(&dcache_lock);
 	parent = dentry->d_parent;
 	spin_lock(&parent->d_lock);
 	spin_lock(&dentry->d_lock);
 	list_del_init(&dentry->d_u.d_child);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_lock);
 	remove_dir(dentry);
 }
 
diff --git a/mm/filemap.c b/mm/filemap.c
index ea89840..a7da042 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -102,9 +102,6 @@
  *    ->inode_lock		(zap_pte_range->set_page_dirty)
  *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
  *
- *  ->task->proc_lock
- *    ->dcache_lock		(proc_pid_lookup)
- *
  *  (code doesn't rely on that order, so you could switch it around)
  *  ->tasklist_lock             (memory_failure, collect_procs_ao)
  *    ->i_mmap_lock
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index 017ec09..2285d69 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -1145,7 +1145,6 @@ static void sel_remove_entries(struct dentry *de)
 {
 	struct list_head *node;
 
-	spin_lock(&dcache_lock);
 	spin_lock(&de->d_lock);
 	node = de->d_subdirs.next;
 	while (node != &de->d_subdirs) {
@@ -1158,11 +1157,9 @@ static void sel_remove_entries(struct dentry *de)
 			dget_locked_dlock(d);
 			spin_unlock(&de->d_lock);
 			spin_unlock(&d->d_lock);
-			spin_unlock(&dcache_lock);
 			d_delete(d);
 			simple_unlink(de->d_inode, d);
 			dput(d);
-			spin_lock(&dcache_lock);
 			spin_lock(&de->d_lock);
 		} else
 			spin_unlock(&d->d_lock);
@@ -1170,7 +1167,6 @@ static void sel_remove_entries(struct dentry *de)
 	}
 
 	spin_unlock(&de->d_lock);
-	spin_unlock(&dcache_lock);
 }
 
 #define BOOL_DIR_NAME "booleans"
-- 
1.7.1



* [PATCH 20/46] fs: dcache avoid starvation in dcache multi-step operations
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (17 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 19/46] fs: dcache remove dcache_lock Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 21/46] fs: dcache reduce dput locking Nick Piggin
                   ` (30 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Long-lived dcache "multi-step" operations that retry on the rename_lock
sequence count can be starved by heavy rename activity. If the first pass
fails, take rename_lock for writing on the retry to avoid further starvation.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
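
For readers skimming the thread, the retry scheme from the commit message can
be sketched in user space roughly as follows. This is only an illustration
under stated assumptions: the toy_* names are hypothetical, the "seqlock" is a
single-threaded stand-in for the kernel's seqlock_t (it does not spin and has
no memory barriers), and concurrent renames are simulated by a counter. The
first pass is optimistic; if a rename slipped in, the second pass holds the
write side so that no further rename can invalidate it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded stand-in for seqlock_t (hypothetical, not kernel API). */
struct toy_seqlock {
	unsigned seq;		/* bumped twice per write section */
	bool write_held;	/* stands in for the writer spinlock */
};

static unsigned toy_read_seqbegin(const struct toy_seqlock *sl)
{
	return sl->seq;
}

static bool toy_read_seqretry(const struct toy_seqlock *sl, unsigned start)
{
	return sl->seq != start;
}

static void toy_write_seqlock(struct toy_seqlock *sl)
{
	assert(!sl->write_held);
	sl->write_held = true;
	sl->seq++;
}

static void toy_write_sequnlock(struct toy_seqlock *sl)
{
	assert(sl->write_held);
	sl->seq++;
	sl->write_held = false;
}

/* Simulated rename: it can only proceed while nobody holds the write side. */
static bool toy_rename(struct toy_seqlock *sl)
{
	if (sl->write_held)
		return false;
	toy_write_seqlock(sl);
	toy_write_sequnlock(sl);
	return true;
}

/*
 * Multi-step walk. Returns the number of passes it needed.
 * *renames_pending is the number of renames still trying to run; one of
 * them attempts to slip in during each pass.
 */
static int toy_walk(struct toy_seqlock *sl, int *renames_pending)
{
	unsigned seq;
	int locked = 0;
	int passes = 0;

	seq = toy_read_seqbegin(sl);
again:
	passes++;
	/* ... the tree traversal would happen here ... */
	if (*renames_pending > 0 && toy_rename(sl))
		(*renames_pending)--;

	/* The sequence check is only meaningful on the optimistic pass. */
	if (!locked && toy_read_seqretry(sl, seq))
		goto rename_retry;
	if (locked)
		toy_write_sequnlock(sl);
	return passes;

rename_retry:
	/* First pass lost a race: exclude renames for the retry. */
	locked = 1;
	toy_write_seqlock(sl);
	goto again;
}
```

With five renames pending, toy_walk() finishes in two passes: one rename
invalidates the optimistic pass, and the remaining four are held off while the
locked pass runs. The price is that the retry serialises against all renames,
which is the trade-off the commit message accepts to bound the retry count.
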
 fs/dcache.c |   56 ++++++++++++++++++++++++++++++++++++++++++--------------
 1 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c4b2c5e..73f5552 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -937,10 +937,11 @@ int have_submounts(struct dentry *parent)
 	struct dentry *this_parent;
 	struct list_head *next;
 	unsigned seq;
+	int locked = 0;
 
-rename_retry:
-	this_parent = parent;
 	seq = read_seqbegin(&rename_lock);
+again:
+	this_parent = parent;
 
 	if (d_mountpoint(parent))
 		goto positive;
@@ -985,7 +986,7 @@ resume:
 		/* might go back up the wrong parent if we have had a rename
 		 * or deletion */
 		if (this_parent != child->d_parent ||
-				read_seqretry(&rename_lock, seq)) {
+			 (!locked && read_seqretry(&rename_lock, seq))) {
 			spin_unlock(&this_parent->d_lock);
 			rcu_read_unlock();
 			goto rename_retry;
@@ -995,13 +996,22 @@ resume:
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
-	if (read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqretry(&rename_lock, seq))
 		goto rename_retry;
+	if (locked)
+		write_sequnlock(&rename_lock);
 	return 0; /* No mount points found in tree */
 positive:
-	if (read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqretry(&rename_lock, seq))
 		goto rename_retry;
+	if (locked)
+		write_sequnlock(&rename_lock);
 	return 1;
+
+rename_retry:
+	locked = 1;
+	write_seqlock(&rename_lock);
+	goto again;
 }
 EXPORT_SYMBOL(have_submounts);
 
@@ -1025,11 +1035,11 @@ static int select_parent(struct dentry * parent)
 	struct list_head *next;
 	unsigned seq;
 	int found = 0;
+	int locked = 0;
 
-rename_retry:
-	this_parent = parent;
 	seq = read_seqbegin(&rename_lock);
-
+again:
+	this_parent = parent;
 	spin_lock(&this_parent->d_lock);
 repeat:
 	next = this_parent->d_subdirs.next;
@@ -1091,7 +1101,7 @@ resume:
 		/* might go back up the wrong parent if we have had a rename
 		 * or deletion */
 		if (this_parent != child->d_parent ||
-				read_seqretry(&rename_lock, seq)) {
+			(!locked && read_seqretry(&rename_lock, seq))) {
 			spin_unlock(&this_parent->d_lock);
 			rcu_read_unlock();
 			goto rename_retry;
@@ -1102,9 +1112,18 @@ resume:
 	}
 out:
 	spin_unlock(&this_parent->d_lock);
-	if (read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqretry(&rename_lock, seq))
 		goto rename_retry;
+	if (locked)
+		write_sequnlock(&rename_lock);
 	return found;
+
+rename_retry:
+	if (found)
+		return found;
+	locked = 1;
+	write_seqlock(&rename_lock);
+	goto again;
 }
 
 /**
@@ -2602,10 +2621,11 @@ void d_genocide(struct dentry *root)
 	struct dentry *this_parent;
 	struct list_head *next;
 	unsigned seq;
+	int locked = 0;
 
-rename_retry:
-	this_parent = root;
 	seq = read_seqbegin(&rename_lock);
+again:
+	this_parent = root;
 	spin_lock(&this_parent->d_lock);
 repeat:
 	next = this_parent->d_subdirs.next;
@@ -2650,7 +2670,7 @@ resume:
 		/* might go back up the wrong parent if we have had a rename
 		 * or deletion */
 		if (this_parent != child->d_parent ||
-				read_seqretry(&rename_lock, seq)) {
+			 (!locked && read_seqretry(&rename_lock, seq))) {
 			spin_unlock(&this_parent->d_lock);
 			rcu_read_unlock();
 			goto rename_retry;
@@ -2660,8 +2680,16 @@ resume:
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
-	if (read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqretry(&rename_lock, seq))
 		goto rename_retry;
+	if (locked)
+		write_sequnlock(&rename_lock);
+	return;
+
+rename_retry:
+	locked = 1;
+	write_seqlock(&rename_lock);
+	goto again;
 }
 
 /**
-- 
1.7.1



* [PATCH 21/46] fs: dcache reduce dput locking
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (18 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 20/46] fs: dcache avoid starvation in dcache multi-step operations Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 22/46] fs: dcache reduce locking in d_alloc Nick Piggin
                   ` (29 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

dput can run without taking the data structure locks (dcache_inode_lock and
the parent's d_lock) up front. In the common case where the dentry is not
killed, these locks are not required at all, so take them only on the kill
path.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
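
The control flow described in the commit message can be sketched in user space
roughly as below. This is a simplified model, not the kernel code: the toy_*
names are hypothetical, locks are replaced by counters that record which locks
a call would have taken, and the trylock/retry loop and the ->d_delete() hook
are left out. It only shows that the common "just drop a reference" paths touch
the dentry's own lock, while the outer locks are taken on the kill path alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy dentry (hypothetical); each child holds one reference on its parent. */
struct toy_dentry {
	int count;			/* reference count, like d_count */
	bool unhashed;			/* true once unreachable by lookup */
	bool on_lru;			/* parked on the unused list */
	bool killed;			/* freed, in this model */
	struct toy_dentry *parent;	/* NULL for a root */
};

/* Records which "locks" the calls would have taken. */
struct toy_lock_stats {
	int d_lock;		/* the dentry's own lock */
	int outer_locks;	/* dcache_inode_lock plus the parent's d_lock */
};

static void toy_dput(struct toy_dentry *dentry, struct toy_lock_stats *st)
{
	while (dentry) {
		struct toy_dentry *parent;

		st->d_lock++;
		assert(dentry->count > 0);

		/* Fast path: other users remain, just drop our reference. */
		if (dentry->count > 1) {
			dentry->count--;
			return;
		}

		/* Last reference, still reachable: keep it cached on the LRU. */
		if (!dentry->unhashed) {
			dentry->on_lru = true;
			dentry->count--;
			return;
		}

		/* Kill path: only now are the outer locks needed. */
		st->outer_locks++;
		parent = dentry->parent;
		dentry->count--;
		dentry->on_lru = false;
		dentry->killed = true;

		/* Killing a child drops its reference on the parent. */
		dentry = parent;
	}
}
```

Dropping the last reference to an unhashed child whose parent has one other
user takes the child's lock, the outer locks once, and then the parent's own
lock for the fast path; a plain reference drop on a busy dentry takes a single
lock.
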
 fs/dcache.c |   56 +++++++++++++++++++++++++-------------------------------
 1 files changed, 25 insertions(+), 31 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 73f5552..a117139 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -283,35 +283,16 @@ repeat:
 	if (dentry->d_count == 1)
 		might_sleep();
 	spin_lock(&dentry->d_lock);
-	if (IS_ROOT(dentry))
-		parent = NULL;
-	else
-		parent = dentry->d_parent;
-	if (dentry->d_count == 1) {
-		if (!spin_trylock(&dcache_inode_lock)) {
-drop2:
-			spin_unlock(&dentry->d_lock);
-			goto repeat;
-		}
-		if (parent && !spin_trylock(&parent->d_lock)) {
-			spin_unlock(&dcache_inode_lock);
-			goto drop2;
-		}
-	}
-	dentry->d_count--;
-	if (dentry->d_count) {
+	BUG_ON(!dentry->d_count);
+	if (dentry->d_count > 1) {
+		dentry->d_count--;
 		spin_unlock(&dentry->d_lock);
-		if (parent)
-			spin_unlock(&parent->d_lock);
-		return;
-	}
+ 		return;
+ 	}
 
-	/*
-	 * AV: ->d_delete() is _NOT_ allowed to block now.
-	 */
 	if (dentry->d_op && dentry->d_op->d_delete) {
 		if (dentry->d_op->d_delete(dentry))
-			goto unhash_it;
+			goto kill_it;
 	}
 
 	/* Unreachable? Get rid of it */
@@ -322,17 +303,30 @@ drop2:
 	dentry->d_flags |= DCACHE_REFERENCED;
 	dentry_lru_add(dentry);
 
- 	spin_unlock(&dentry->d_lock);
-	if (parent)
-		spin_unlock(&parent->d_lock);
-	spin_unlock(&dcache_inode_lock);
+	dentry->d_count--;
+	spin_unlock(&dentry->d_lock);
 	return;
 
-unhash_it:
-	__d_drop(dentry);
 kill_it:
+	if (!spin_trylock(&dcache_inode_lock)) {
+relock:
+		spin_unlock(&dentry->d_lock);
+		cpu_relax();
+		goto repeat;
+	}
+	if (IS_ROOT(dentry))
+		parent = NULL;
+	else
+		parent = dentry->d_parent;
+	if (parent && !spin_trylock(&parent->d_lock)) {
+		spin_unlock(&dcache_inode_lock);
+		goto relock;
+	}
+	dentry->d_count--;
 	/* if dentry was on the d_lru list delete it from there */
 	dentry_lru_del(dentry);
+	/* if it was on the hash (d_delete case), then remove it */
+	__d_drop(dentry);
 	dentry = d_kill(dentry, parent);
 	if (dentry)
 		goto repeat;
-- 
1.7.1
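
The idea in the patch above, that the outer locks are only needed when the
last reference is being dropped, can be sketched in userspace roughly like
this. Everything here is an assumption made for illustration: the reduced
struct dentry, outer_lock (playing the role of dcache_inode_lock), the
slow_paths counter and dput_sketch() are hypothetical, and the d_delete and
LRU handling of the real dput are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace spinlock, standing in for the kernel's spinlock_t. */
typedef struct { atomic_flag held; } spinlock_t;
#define SPIN_INIT { ATOMIC_FLAG_INIT }

static bool spin_trylock(spinlock_t *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

static void spin_lock(spinlock_t *l)
{
	while (!spin_trylock(l))
		;
}

static void spin_unlock(spinlock_t *l)
{
	atomic_flag_clear(&l->held);
}

/* Hypothetical, heavily reduced dentry: a lock, a count and a parent. */
struct dentry {
	spinlock_t d_lock;
	int d_count;
	struct dentry *d_parent;	/* NULL for the root in this sketch */
	bool killed;
};

static spinlock_t outer_lock = SPIN_INIT;	/* plays dcache_inode_lock */
static int slow_paths;			/* times the slow path ran */

static void dput_sketch(struct dentry *dentry)
{
	while (dentry) {
		struct dentry *parent;

		spin_lock(&dentry->d_lock);
		assert(dentry->d_count > 0);
		if (dentry->d_count > 1) {
			/* Fast path: not the last reference, d_lock is enough. */
			dentry->d_count--;
			spin_unlock(&dentry->d_lock);
			return;
		}

		/*
		 * Last reference. The outer locks are assumed to rank above
		 * d_lock, so they may only be trylocked here; on failure
		 * drop everything and start over with the same dentry.
		 */
		if (!spin_trylock(&outer_lock)) {
			spin_unlock(&dentry->d_lock);
			continue;
		}
		parent = dentry->d_parent;
		if (parent && !spin_trylock(&parent->d_lock)) {
			spin_unlock(&outer_lock);
			spin_unlock(&dentry->d_lock);
			continue;
		}
		slow_paths++;
		dentry->d_count--;
		dentry->killed = true;
		if (parent)
			spin_unlock(&parent->d_lock);
		spin_unlock(&outer_lock);
		spin_unlock(&dentry->d_lock);

		/* The dead child held a reference on its parent: drop it. */
		dentry = parent;
	}
}
```

With a child that has two references, the first dput_sketch() call touches
only the child's own lock; outer_lock is taken only on the second call.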



* [PATCH 22/46] fs: dcache reduce locking in d_alloc
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (19 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 21/46] fs: dcache reduce dput locking Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 23/46] fs: dcache reduce dcache_inode_lock Nick Piggin
                   ` (28 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/dcache.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index a117139..3e4f7c1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1220,11 +1220,13 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 
 	if (parent) {
 		spin_lock(&parent->d_lock);
-		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+		/*
+		 * don't need child lock because it is not subject
+		 * to concurrency here
+		 */
 		dentry->d_parent = dget_dlock(parent);
 		dentry->d_sb = parent->d_sb;
 		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
-		spin_unlock(&dentry->d_lock);
 		spin_unlock(&parent->d_lock);
 	}
 
-- 
1.7.1



* [PATCH 23/46] fs: dcache reduce dcache_inode_lock
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (20 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 22/46] fs: dcache reduce locking in d_alloc Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 24/46] fs: dcache rationalise dget variants Nick Piggin
                   ` (27 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

dcache_inode_lock can be avoided in d_delete() and d_materialise_unique()
in cases where it is not required.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/dcache.c |   24 ++++++++++++------------
 1 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3e4f7c1..6fe387d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1794,10 +1794,15 @@ void d_delete(struct dentry * dentry)
 	/*
 	 * Are we the only user?
 	 */
-	spin_lock(&dcache_inode_lock);
+again:
 	spin_lock(&dentry->d_lock);
 	isdir = S_ISDIR(dentry->d_inode->i_mode);
 	if (dentry->d_count == 1) {
+		if (!spin_trylock(&dcache_inode_lock)) {
+			spin_unlock(&dentry->d_lock);
+			cpu_relax();
+			goto again;
+		}
 		dentry->d_flags &= ~DCACHE_CANT_MOUNT;
 		dentry_iput(dentry);
 		fsnotify_nameremove(dentry, isdir);
@@ -1808,7 +1813,6 @@ void d_delete(struct dentry * dentry)
 		__d_drop(dentry);
 
 	spin_unlock(&dentry->d_lock);
-	spin_unlock(&dcache_inode_lock);
 
 	fsnotify_nameremove(dentry, isdir);
 }
@@ -2106,14 +2110,15 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 
 	BUG_ON(!d_unhashed(dentry));
 
-	spin_lock(&dcache_inode_lock);
-
 	if (!inode) {
 		actual = dentry;
 		__d_instantiate(dentry, NULL);
-		goto found_lock;
+		d_rehash(actual);
+		goto out_nolock;
 	}
 
+	spin_lock(&dcache_inode_lock);
+
 	if (S_ISDIR(inode->i_mode)) {
 		struct dentry *alias;
 
@@ -2145,10 +2150,9 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 	actual = __d_instantiate_unique(dentry, inode);
 	if (!actual)
 		actual = dentry;
-	else if (unlikely(!d_unhashed(actual)))
-		goto shouldnt_be_hashed;
+	else
+		BUG_ON(!d_unhashed(actual));
 
-found_lock:
 	spin_lock(&actual->d_lock);
 found:
 	spin_lock(&dcache_hash_lock);
@@ -2164,10 +2168,6 @@ out_nolock:
 
 	iput(inode);
 	return actual;
-
-shouldnt_be_hashed:
-	spin_unlock(&dcache_inode_lock);
-	BUG();
 }
 EXPORT_SYMBOL_GPL(d_materialise_unique);
 
-- 
1.7.1



* [PATCH 24/46] fs: dcache rationalise dget variants
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (21 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 23/46] fs: dcache reduce dcache_inode_lock Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 25/46] fs: dcache reduce d_parent locking Nick Piggin
                   ` (26 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

dget_locked was a shortcut to avoid the lazy lru manipulation when we already
held dcache_lock (lru manipulation was relatively cheap at that point).
However, now that the lru lock is an innermost one, we never hold it at any
caller, so the lock cost can now be avoided. We already have well working lazy
dcache LRU, so it should be fine to defer LRU manipulations to scan time.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 arch/powerpc/platforms/cell/spufs/inode.c |    2 +-
 drivers/infiniband/hw/ipath/ipath_fs.c    |    2 +-
 drivers/infiniband/hw/qib/qib_fs.c        |    2 +-
 drivers/staging/smbfs/cache.c             |    2 +-
 fs/configfs/inode.c                       |    2 +-
 fs/dcache.c                               |   34 ++++++++--------------------
 fs/exportfs/expfs.c                       |    2 +-
 fs/ncpfs/dir.c                            |    2 +-
 fs/ocfs2/dcache.c                         |    2 +-
 include/linux/dcache.h                    |   15 ++----------
 kernel/cgroup.c                           |    2 +-
 security/selinux/selinuxfs.c              |    2 +-
 12 files changed, 23 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index 2662b50..03185de 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -161,7 +161,7 @@ static void spufs_prune_dir(struct dentry *dir)
 	list_for_each_entry_safe(dentry, tmp, &dir->d_subdirs, d_u.d_child) {
 		spin_lock(&dentry->d_lock);
 		if (!(d_unhashed(dentry)) && dentry->d_inode) {
-			dget_locked_dlock(dentry);
+			dget_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			simple_unlink(dir->d_inode, dentry);
diff --git a/drivers/infiniband/hw/ipath/ipath_fs.c b/drivers/infiniband/hw/ipath/ipath_fs.c
index 925e882..31ae1b1 100644
--- a/drivers/infiniband/hw/ipath/ipath_fs.c
+++ b/drivers/infiniband/hw/ipath/ipath_fs.c
@@ -279,7 +279,7 @@ static int remove_file(struct dentry *parent, char *name)
 
 	spin_lock(&tmp->d_lock);
 	if (!(d_unhashed(tmp) && tmp->d_inode)) {
-		dget_locked_dlock(tmp);
+		dget_dlock(tmp);
 		__d_drop(tmp);
 		spin_unlock(&tmp->d_lock);
 		simple_unlink(parent->d_inode, tmp);
diff --git a/drivers/infiniband/hw/qib/qib_fs.c b/drivers/infiniband/hw/qib/qib_fs.c
index 49af4a6..df7fa25 100644
--- a/drivers/infiniband/hw/qib/qib_fs.c
+++ b/drivers/infiniband/hw/qib/qib_fs.c
@@ -455,7 +455,7 @@ static int remove_file(struct dentry *parent, char *name)
 
 	spin_lock(&tmp->d_lock);
 	if (!(d_unhashed(tmp) && tmp->d_inode)) {
-		dget_locked_dlock(tmp);
+		dget_dlock(tmp);
 		__d_drop(tmp);
 		spin_unlock(&tmp->d_lock);
 		simple_unlink(parent->d_inode, tmp);
diff --git a/drivers/staging/smbfs/cache.c b/drivers/staging/smbfs/cache.c
index abae450..9a8267f 100644
--- a/drivers/staging/smbfs/cache.c
+++ b/drivers/staging/smbfs/cache.c
@@ -102,7 +102,7 @@ smb_dget_fpos(struct dentry *dentry, struct dentry *parent, unsigned long fpos)
 		dent = list_entry(next, struct dentry, d_u.d_child);
 		if ((unsigned long)dent->d_fsdata == fpos) {
 			if (dent->d_inode)
-				dget_locked(dent);
+				dget(dent);
 			else
 				dent = NULL;
 			goto out_unlock;
diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
index fb3a55f..c83f476 100644
--- a/fs/configfs/inode.c
+++ b/fs/configfs/inode.c
@@ -252,7 +252,7 @@ void configfs_drop_dentry(struct configfs_dirent * sd, struct dentry * parent)
 	if (dentry) {
 		spin_lock(&dentry->d_lock);
 		if (!(d_unhashed(dentry) && dentry->d_inode)) {
-			dget_locked_dlock(dentry);
+			dget_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			simple_unlink(parent->d_inode, dentry);
diff --git a/fs/dcache.c b/fs/dcache.c
index 6fe387d..9231748 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -389,32 +389,17 @@ int d_invalidate(struct dentry * dentry)
 EXPORT_SYMBOL(d_invalidate);
 
 /* This must be called with d_lock held */
-static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
+static inline void __dget_dlock(struct dentry *dentry)
 {
 	dentry->d_count++;
-	dentry_lru_del(dentry);
-	return dentry;
 }
 
-/* This must be called with d_lock held */
-static inline struct dentry * __dget_locked(struct dentry *dentry)
+static inline void __dget(struct dentry *dentry)
 {
 	spin_lock(&dentry->d_lock);
-	__dget_locked_dlock(dentry);
+	__dget_dlock(dentry);
 	spin_unlock(&dentry->d_lock);
-	return dentry;
-}
-
-struct dentry * dget_locked_dlock(struct dentry *dentry)
-{
-	return __dget_locked_dlock(dentry);
-}
-
-struct dentry * dget_locked(struct dentry *dentry)
-{
-	return __dget_locked(dentry);
 }
-EXPORT_SYMBOL(dget_locked);
 
 struct dentry *dget_parent(struct dentry *dentry)
 {
@@ -495,7 +480,7 @@ static struct dentry *__d_find_alias(struct inode *inode, int want_discon)
 	struct dentry *alias;
 	alias = ___d_find_alias(inode, want_discon);
 	if (alias) {
-		__dget_locked_dlock(alias);
+		__dget_dlock(alias);
 		spin_unlock(&alias->d_lock);
 	}
 	return alias;
@@ -526,7 +511,7 @@ restart:
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
 		if (!dentry->d_count) {
-			__dget_locked_dlock(dentry);
+			__dget_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_inode_lock);
@@ -1224,7 +1209,8 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 		 * don't need child lock because it is not subject
 		 * to concurrency here
 		 */
-		dentry->d_parent = dget_dlock(parent);
+		__dget_dlock(parent);
+		dentry->d_parent = parent;
 		dentry->d_sb = parent->d_sb;
 		list_add(&dentry->d_u.d_child, &parent->d_subdirs);
 		spin_unlock(&parent->d_lock);
@@ -1323,7 +1309,7 @@ static struct dentry *__d_instantiate_unique(struct dentry *entry,
 			goto next;
 		if (memcmp(qstr->name, name, len))
 			goto next;
-		dget_locked_dlock(alias);
+		__dget_dlock(alias);
 		spin_unlock(&alias->d_lock);
 		return alias;
 next:
@@ -1579,7 +1565,7 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 	 * reference to it, move it in place and use it.
 	 */
 	new = list_entry(inode->i_dentry.next, struct dentry, d_alias);
-	dget_locked(new);
+	__dget(new);
 	spin_unlock(&dcache_inode_lock);
 	security_d_instantiate(found, inode);
 	d_move(new, found);
@@ -1755,7 +1741,7 @@ int d_validate(struct dentry *dentry, struct dentry *dparent)
 	list_for_each_entry(child, &dparent->d_subdirs, d_u.d_child) {
 		if (dentry == child) {
 			spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
-			__dget_locked_dlock(dentry);
+			__dget_dlock(dentry);
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dparent->d_lock);
 			return 1;
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 53a5c08..f06a940 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -49,7 +49,7 @@ find_acceptable_alias(struct dentry *result,
 
 	spin_lock(&dcache_inode_lock);
 	list_for_each_entry(dentry, &result->d_inode->i_dentry, d_alias) {
-		dget_locked(dentry);
+		dget(dentry);
 		spin_unlock(&dcache_inode_lock);
 		if (toput)
 			dput(toput);
diff --git a/fs/ncpfs/dir.c b/fs/ncpfs/dir.c
index e6d5153..6ecc33a 100644
--- a/fs/ncpfs/dir.c
+++ b/fs/ncpfs/dir.c
@@ -399,7 +399,7 @@ ncp_dget_fpos(struct dentry *dentry, struct dentry *parent, unsigned long fpos)
 		dent = list_entry(next, struct dentry, d_u.d_child);
 		if ((unsigned long)dent->d_fsdata == fpos) {
 			if (dent->d_inode)
-				dget_locked(dent);
+				dget(dent);
 			else
 				dent = NULL;
 			spin_unlock(&parent->d_lock);
diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c
index 76f2170..8b26b54 100644
--- a/fs/ocfs2/dcache.c
+++ b/fs/ocfs2/dcache.c
@@ -178,7 +178,7 @@ struct dentry *ocfs2_find_local_alias(struct inode *inode,
 			mlog(0, "dentry found: %.*s\n",
 			     dentry->d_name.len, dentry->d_name.name);
 
-			dget_locked_dlock(dentry);
+			dget_dlock(dentry);
 			spin_unlock(&dentry->d_lock);
 			break;
 		}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index ab07ed8..e31fc9a 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -317,23 +317,17 @@ extern char *dentry_path(struct dentry *, char *, int);
 /* Allocation counts.. */
 
 /**
- *	dget, dget_locked	-	get a reference to a dentry
+ *	dget, dget_dlock -	get a reference to a dentry
  *	@dentry: dentry to get a reference to
  *
  *	Given a dentry or %NULL pointer increment the reference count
  *	if appropriate and return the dentry. A dentry will not be 
- *	destroyed when it has references. dget() should never be
- *	called for dentries with zero reference counter. For these cases
- *	(preferably none, functions in dcache.c are sufficient for normal
- *	needs and they take necessary precautions) you should hold d_lock
- *	and call dget_dlock() instead of dget().
+ *	destroyed when it has references.
  */
 static inline struct dentry *dget_dlock(struct dentry *dentry)
 {
-	if (dentry) {
-		BUG_ON(!dentry->d_count);
+	if (dentry)
 		dentry->d_count++;
-	}
 	return dentry;
 }
 
@@ -347,9 +341,6 @@ static inline struct dentry *dget(struct dentry *dentry)
 	return dentry;
 }
 
-extern struct dentry * dget_locked(struct dentry *);
-extern struct dentry * dget_locked_dlock(struct dentry *);
-
 extern struct dentry *dget_parent(struct dentry *dentry);
 
 /**
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 845dc23..6e34f75 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -885,7 +885,7 @@ static void cgroup_clear_directory(struct dentry *dentry)
 			/* This should never be called on a cgroup
 			 * directory with child cgroups */
 			BUG_ON(d->d_inode->i_mode & S_IFDIR);
-			dget_locked_dlock(d);
+			dget_dlock(d);
 			spin_unlock(&d->d_lock);
 			spin_unlock(&dentry->d_lock);
 			d_delete(d);
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index 2285d69..43deac2 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -1154,7 +1154,7 @@ static void sel_remove_entries(struct dentry *de)
 		list_del_init(node);
 
 		if (d->d_inode) {
-			dget_locked_dlock(d);
+			dget_dlock(d);
 			spin_unlock(&de->d_lock);
 			spin_unlock(&d->d_lock);
 			d_delete(d);
-- 
1.7.1
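
The "lazy LRU" behaviour that the patch above relies on can be illustrated
with a small single-threaded sketch: taking a reference only bumps the count,
and an entry that is in use is unlinked from the LRU when the shrinker next
scans it. The struct entry type and the entry_get(), entry_put() and
lru_shrink() helpers are hypothetical names for this illustration, and the
LRU is reduced to an array plus an on_lru flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache object; the names are illustrative, not kernel API. */
struct entry {
	int count;	/* active references */
	bool on_lru;	/* still linked on the (conceptual) LRU list */
	bool freed;
};

/* Taking a reference touches only the count, never the LRU. */
static void entry_get(struct entry *e)
{
	e->count++;
}

/* Dropping the last reference parks the entry on the LRU. */
static void entry_put(struct entry *e)
{
	assert(e->count > 0);
	if (--e->count == 0)
		e->on_lru = true;
}

/*
 * Scan the LRU. Entries that gained references since they were parked
 * are only unlinked here, at scan time; idle ones are freed.
 * Returns the number of entries freed.
 */
static size_t lru_shrink(struct entry **lru, size_t n)
{
	size_t freed = 0;

	for (size_t i = 0; i < n; i++) {
		struct entry *e = lru[i];

		if (!e->on_lru)
			continue;
		e->on_lru = false;
		if (e->count)
			continue;	/* in use: lazily taken off the LRU */
		e->freed = true;
		freed++;
	}
	return freed;
}
```

The trade-off is that the get path stays cheap while the scanner occasionally
visits entries it cannot free.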



* [PATCH 25/46] fs: dcache reduce d_parent locking
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (22 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 24/46] fs: dcache rationalise dget variants Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 26/46] fs: dcache reduce prune_one_dentry locking Nick Piggin
                   ` (25 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Use RCU to simplify locking in dget_parent.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/dcache.c |   25 ++++++++++++++-----------
 1 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9231748..23f6fed 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -406,24 +406,27 @@ struct dentry *dget_parent(struct dentry *dentry)
 	struct dentry *ret;
 
 repeat:
-	spin_lock(&dentry->d_lock);
+	/*
+	 * Don't need rcu_dereference because we re-check it was correct under
+	 * the lock.
+	 */
+	rcu_read_lock();
 	ret = dentry->d_parent;
-	if (!ret)
-		goto out;
-	if (dentry == ret) {
-		ret->d_count++;
-		goto out;
-	}
-	if (!spin_trylock(&ret->d_lock)) {
-		spin_unlock(&dentry->d_lock);
-		cpu_relax();
+	if (!ret) {
+		rcu_read_unlock();
+ 		goto out;
+ 	}
+	spin_lock(&ret->d_lock);
+	if (unlikely(ret != dentry->d_parent)) {
+		spin_unlock(&ret->d_lock);
+		rcu_read_unlock();
 		goto repeat;
 	}
+	rcu_read_unlock();
 	BUG_ON(!ret->d_count);
 	ret->d_count++;
 	spin_unlock(&ret->d_lock);
 out:
-	spin_unlock(&dentry->d_lock);
 	return ret;
 }
 EXPORT_SYMBOL(dget_parent);
-- 
1.7.1
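
The pattern used by the new dget_parent above (read the parent pointer
without holding the child's lock, lock the candidate parent, then re-check
that it is still the parent) can be sketched in userspace as below. The
sketch makes some assumptions: struct node and get_parent_sketch() are
hypothetical, nothing is ever freed (in the kernel, rcu_read_lock is what
keeps the candidate's memory valid between the load and the lock), and any
writer changing child->parent is assumed to hold the old parent's lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace spinlock, standing in for the kernel's spinlock_t. */
typedef struct { atomic_flag held; } spinlock_t;
#define SPIN_INIT { ATOMIC_FLAG_INIT }

static bool spin_trylock(spinlock_t *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

static void spin_lock(spinlock_t *l)
{
	while (!spin_trylock(l))
		;
}

static void spin_unlock(spinlock_t *l)
{
	atomic_flag_clear(&l->held);
}

/* Hypothetical tree node; parent is NULL for the root in this sketch. */
struct node {
	spinlock_t lock;
	int count;			/* protected by lock */
	struct node *_Atomic parent;	/* may change under a concurrent rename */
};

static struct node *get_parent_sketch(struct node *child)
{
	for (;;) {
		/* Unlocked snapshot; it may be stale by the time we lock. */
		struct node *p = atomic_load(&child->parent);

		if (!p)
			return NULL;
		spin_lock(&p->lock);
		if (p == atomic_load(&child->parent)) {
			/* Still the parent and we hold its lock: safe to pin. */
			assert(p->count > 0);
			p->count++;
			spin_unlock(&p->lock);
			return p;
		}
		/* Lost a race with a rename: retry with the new parent. */
		spin_unlock(&p->lock);
	}
}
```

Because the child's own lock is never taken, there is no child-then-parent
lock ordering problem and so no trylock loop is needed.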



* [PATCH 26/46] fs: dcache reduce prune_one_dentry locking
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (23 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 25/46] fs: dcache reduce d_parent locking Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 27/46] fs: reduce dcache_inode_lock width in lru scanning Nick Piggin
                   ` (24 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

prune_one_dentry can avoid quite a bit of locking in the common case where
ancestors have an elevated refcount. Alternatively, we could have gone the
other way and made fewer trylocks in the case where d_count goes to zero, but
that is probably less common.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/dcache.c |   27 +++++++++++++++------------
 1 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 23f6fed..4fa27b5 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -547,26 +547,29 @@ static void prune_one_dentry(struct dentry *dentry, struct dentry *parent)
 	 * Prune ancestors.
 	 */
 	while (dentry) {
-		spin_lock(&dcache_inode_lock);
-again:
+relock:
 		spin_lock(&dentry->d_lock);
+		if (dentry->d_count > 1) {
+			dentry->d_count--;
+			spin_unlock(&dentry->d_lock);
+	 		return;
+	 	}
+		if (!spin_trylock(&dcache_inode_lock)) {
+relock2:
+			spin_unlock(&dentry->d_lock);
+			cpu_relax();
+			goto relock;
+		}
+
 		if (IS_ROOT(dentry))
 			parent = NULL;
 		else
 			parent = dentry->d_parent;
 		if (parent && !spin_trylock(&parent->d_lock)) {
-			spin_unlock(&dentry->d_lock);
-			goto again;
-		}
-		dentry->d_count--;
-		if (dentry->d_count) {
-			if (parent)
-				spin_unlock(&parent->d_lock);
-			spin_unlock(&dentry->d_lock);
 			spin_unlock(&dcache_inode_lock);
-			return;
+			goto relock2;
 		}
-
+		dentry->d_count--;
 		dentry_lru_del(dentry);
 		__d_drop(dentry);
 		dentry = d_kill(dentry, parent);
-- 
1.7.1



* [PATCH 27/46] fs: reduce dcache_inode_lock width in lru scanning
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (24 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 26/46] fs: dcache reduce prune_one_dentry locking Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 28/46] fs: use RCU in shrink_dentry_list to reduce lock nesting Nick Piggin
                   ` (23 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/dcache.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 4fa27b5..5a7c328 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -586,7 +586,7 @@ static void shrink_dentry_list(struct list_head *list)
 		dentry = list_entry(list->prev, struct dentry, d_lru);
 
 		if (!spin_trylock(&dentry->d_lock)) {
-relock:
+relock1:
 			spin_unlock(&dcache_lru_lock);
 			cpu_relax();
 			spin_lock(&dcache_lru_lock);
@@ -603,20 +603,24 @@ relock:
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
+		if (!spin_trylock(&dcache_inode_lock)) {
+relock2:
+			spin_unlock(&dentry->d_lock);
+			goto relock1;
+		}
 		if (IS_ROOT(dentry))
 			parent = NULL;
 		else
 			parent = dentry->d_parent;
 		if (parent && !spin_trylock(&parent->d_lock)) {
-			spin_unlock(&dentry->d_lock);
-			goto relock;
+			spin_unlock(&dcache_inode_lock);
+			goto relock2;
 		}
 		__dentry_lru_del(dentry);
 		spin_unlock(&dcache_lru_lock);
 
 		prune_one_dentry(dentry, parent);
 		/* dcache_inode_lock and dentry->d_lock dropped */
-		spin_lock(&dcache_inode_lock);
 		spin_lock(&dcache_lru_lock);
 	}
 }
@@ -637,7 +641,6 @@ static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
 	LIST_HEAD(tmp);
 	int cnt = *count;
 
-	spin_lock(&dcache_inode_lock);
 relock:
 	spin_lock(&dcache_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
@@ -675,7 +678,6 @@ relock:
 	if (!list_empty(&referenced))
 		list_splice(&referenced, &sb->s_dentry_lru);
 	spin_unlock(&dcache_lru_lock);
-	spin_unlock(&dcache_inode_lock);
 }
 
 /**
@@ -764,14 +766,12 @@ void shrink_dcache_sb(struct super_block *sb)
 {
 	LIST_HEAD(tmp);
 
-	spin_lock(&dcache_inode_lock);
 	spin_lock(&dcache_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
 		list_splice_init(&sb->s_dentry_lru, &tmp);
 		shrink_dentry_list(&tmp);
 	}
 	spin_unlock(&dcache_lru_lock);
-	spin_unlock(&dcache_inode_lock);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
-- 
1.7.1



* [PATCH 28/46] fs: use RCU in shrink_dentry_list to reduce lock nesting
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (25 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 27/46] fs: reduce dcache_inode_lock width in lru scanning Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:44 ` [PATCH 29/46] fs: consolidate dentry kill sequence Nick Piggin
                   ` (22 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/dcache.c |   38 +++++++++++++++++++++-----------------
 1 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5a7c328..4dbcb6c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -580,16 +580,16 @@ static void shrink_dentry_list(struct list_head *list)
 {
 	struct dentry *dentry;
 
+	rcu_read_lock();
 	while (!list_empty(list)) {
 		struct dentry *parent;
 
 		dentry = list_entry(list->prev, struct dentry, d_lru);
 
-		if (!spin_trylock(&dentry->d_lock)) {
-relock1:
-			spin_unlock(&dcache_lru_lock);
-			cpu_relax();
-			spin_lock(&dcache_lru_lock);
+		/* Don't need RCU dereference because we recheck under lock */
+		spin_lock(&dentry->d_lock);
+		if (dentry != list_entry(list->prev, struct dentry, d_lru)) {
+			spin_unlock(&dentry->d_lock);
 			continue;
 		}
 
@@ -599,14 +599,16 @@ relock1:
 		 * it - just keep it off the LRU list.
 		 */
 		if (dentry->d_count) {
-			__dentry_lru_del(dentry);
+			dentry_lru_del(dentry);
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
+
 		if (!spin_trylock(&dcache_inode_lock)) {
-relock2:
+relock:
 			spin_unlock(&dentry->d_lock);
-			goto relock1;
+			cpu_relax();
+			continue;
 		}
 		if (IS_ROOT(dentry))
 			parent = NULL;
@@ -614,15 +616,15 @@ relock2:
 			parent = dentry->d_parent;
 		if (parent && !spin_trylock(&parent->d_lock)) {
 			spin_unlock(&dcache_inode_lock);
-			goto relock2;
+			goto relock;
 		}
-		__dentry_lru_del(dentry);
-		spin_unlock(&dcache_lru_lock);
+		dentry_lru_del(dentry);
 
+		rcu_read_unlock();
 		prune_one_dentry(dentry, parent);
-		/* dcache_inode_lock and dentry->d_lock dropped */
-		spin_lock(&dcache_lru_lock);
+		rcu_read_lock();
 	}
+	rcu_read_unlock();
 }
 
 /**
@@ -671,13 +673,13 @@ relock:
 				break;
 		}
 	}
-
-	*count = cnt;
-	shrink_dentry_list(&tmp);
-
 	if (!list_empty(&referenced))
 		list_splice(&referenced, &sb->s_dentry_lru);
 	spin_unlock(&dcache_lru_lock);
+
+	shrink_dentry_list(&tmp);
+
+	*count = cnt;
 }
 
 /**
@@ -769,7 +771,9 @@ void shrink_dcache_sb(struct super_block *sb)
 	spin_lock(&dcache_lru_lock);
 	while (!list_empty(&sb->s_dentry_lru)) {
 		list_splice_init(&sb->s_dentry_lru, &tmp);
+		spin_unlock(&dcache_lru_lock);
 		shrink_dentry_list(&tmp);
+		spin_lock(&dcache_lru_lock);
 	}
 	spin_unlock(&dcache_lru_lock);
 }
-- 
1.7.1



* [PATCH 29/46] fs: consolidate dentry kill sequence
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (26 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 28/46] fs: use RCU in shrink_dentry_list to reduce lock nesting Nick Piggin
@ 2010-11-27  9:44 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 30/46] fs: icache RCU free inodes Nick Piggin
                   ` (21 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

The tricky locking for disposing of a dentry is duplicated 3 times in the
dcache (dput, pruning a dentry from the LRU, and pruning its ancestors).
Consolidate them all into a single function dentry_kill.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/dcache.c |  137 +++++++++++++++++++++++++++--------------------------------
 1 files changed, 62 insertions(+), 75 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 4dbcb6c..5abb8f2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -244,6 +244,40 @@ static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
 	return parent;
 }
 
+/*
+ * Finish off a dentry we've decided to kill.
+ * dentry->d_lock must be held, returns with it unlocked.
+ * If ref is non-zero, then decrement the refcount too.
+ * Returns dentry requiring refcount drop, or NULL if we're done.
+ */
+static inline struct dentry *dentry_kill(struct dentry *dentry, int ref)
+	__releases(dentry->d_lock)
+{
+	struct dentry *parent;
+
+	if (!spin_trylock(&dcache_inode_lock)) {
+relock:
+		spin_unlock(&dentry->d_lock);
+		cpu_relax();
+		return dentry; /* try again with same dentry */
+	}
+	if (IS_ROOT(dentry))
+		parent = NULL;
+	else
+		parent = dentry->d_parent;
+	if (parent && !spin_trylock(&parent->d_lock)) {
+		spin_unlock(&dcache_inode_lock);
+		goto relock;
+	}
+	if (ref)
+		dentry->d_count--;
+	/* if dentry was on the d_lru list delete it from there */
+	dentry_lru_del(dentry);
+	/* if it was on the hash then remove it */
+	__d_drop(dentry);
+	return d_kill(dentry, parent);
+}
+
 /* 
  * This is dput
  *
@@ -269,13 +303,9 @@ static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
  * call the dentry unlink method as well as removing it from the queues and
  * releasing its resources. If the parent dentries were scheduled for release
  * they too may now get deleted.
- *
- * no dcache lock, please.
  */
-
 void dput(struct dentry *dentry)
 {
-	struct dentry *parent;
 	if (!dentry)
 		return;
 
@@ -308,26 +338,7 @@ repeat:
 	return;
 
 kill_it:
-	if (!spin_trylock(&dcache_inode_lock)) {
-relock:
-		spin_unlock(&dentry->d_lock);
-		cpu_relax();
-		goto repeat;
-	}
-	if (IS_ROOT(dentry))
-		parent = NULL;
-	else
-		parent = dentry->d_parent;
-	if (parent && !spin_trylock(&parent->d_lock)) {
-		spin_unlock(&dcache_inode_lock);
-		goto relock;
-	}
-	dentry->d_count--;
-	/* if dentry was on the d_lru list delete it from there */
-	dentry_lru_del(dentry);
-	/* if it was on the hash (d_delete case), then remove it */
-	__d_drop(dentry);
-	dentry = d_kill(dentry, parent);
+	dentry = dentry_kill(dentry, 1);
 	if (dentry)
 		goto repeat;
 }
@@ -528,51 +539,43 @@ restart:
 EXPORT_SYMBOL(d_prune_aliases);
 
 /*
- * Throw away a dentry - free the inode, dput the parent.  This requires that
- * the LRU list has already been removed.
+ * Try to throw away a dentry - free the inode, dput the parent.
+ * Requires dentry->d_lock is held, and dentry->d_count == 0.
+ * Releases dentry->d_lock.
  *
- * Try to prune ancestors as well.  This is necessary to prevent
- * quadratic behavior of shrink_dcache_parent(), but is also expected
- * to be beneficial in reducing dentry cache fragmentation.
+ * This may fail if locks cannot be acquired; no problem, just try again.
  */
-static void prune_one_dentry(struct dentry *dentry, struct dentry *parent)
+static void try_prune_one_dentry(struct dentry *dentry)
 	__releases(dentry->d_lock)
-	__releases(parent->d_lock)
-	__releases(dcache_inode_lock)
 {
-	__d_drop(dentry);
-	dentry = d_kill(dentry, parent);
+	struct dentry *parent;
 
+	parent = dentry_kill(dentry, 0);
 	/*
-	 * Prune ancestors.
+	 * If dentry_kill returns NULL, we have nothing more to do.
+	 * if it returns the same dentry, trylocks failed. In either
+	 * case, just loop again.
+	 *
+	 * Otherwise, we need to prune ancestors too. This is necessary
+	 * to prevent quadratic behavior of shrink_dcache_parent(), but
+	 * is also expected to be beneficial in reducing dentry cache
+	 * fragmentation.
 	 */
+	if (!parent)
+		return;
+	if (parent == dentry)
+		return;
+
+	/* Prune ancestors. */
+	dentry = parent;
 	while (dentry) {
-relock:
 		spin_lock(&dentry->d_lock);
 		if (dentry->d_count > 1) {
 			dentry->d_count--;
 			spin_unlock(&dentry->d_lock);
-	 		return;
-	 	}
-		if (!spin_trylock(&dcache_inode_lock)) {
-relock2:
-			spin_unlock(&dentry->d_lock);
-			cpu_relax();
-			goto relock;
-		}
-
-		if (IS_ROOT(dentry))
-			parent = NULL;
-		else
-			parent = dentry->d_parent;
-		if (parent && !spin_trylock(&parent->d_lock)) {
-			spin_unlock(&dcache_inode_lock);
-			goto relock2;
+			return;
 		}
-		dentry->d_count--;
-		dentry_lru_del(dentry);
-		__d_drop(dentry);
-		dentry = d_kill(dentry, parent);
+		dentry = dentry_kill(dentry, 1);
 	}
 }
 
@@ -582,8 +585,6 @@ static void shrink_dentry_list(struct list_head *list)
 
 	rcu_read_lock();
 	while (!list_empty(list)) {
-		struct dentry *parent;
-
 		dentry = list_entry(list->prev, struct dentry, d_lru);
 
 		/* Don't need RCU dereference because we recheck under lock */
@@ -604,24 +605,10 @@ static void shrink_dentry_list(struct list_head *list)
 			continue;
 		}
 
-		if (!spin_trylock(&dcache_inode_lock)) {
-relock:
-			spin_unlock(&dentry->d_lock);
-			cpu_relax();
-			continue;
-		}
-		if (IS_ROOT(dentry))
-			parent = NULL;
-		else
-			parent = dentry->d_parent;
-		if (parent && !spin_trylock(&parent->d_lock)) {
-			spin_unlock(&dcache_inode_lock);
-			goto relock;
-		}
-		dentry_lru_del(dentry);
-
 		rcu_read_unlock();
-		prune_one_dentry(dentry, parent);
+
+		try_prune_one_dentry(dentry);
+
 		rcu_read_lock();
 	}
 	rcu_read_unlock();
-- 
1.7.1



* [PATCH 30/46] fs: icache RCU free inodes
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (27 preceding siblings ...)
  2010-11-27  9:44 ` [PATCH 29/46] fs: consolidate dentry kill sequence Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 31/46] fs: avoid inode RCU freeing for pseudo fs Nick Piggin
                   ` (20 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page->mapping.
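
The deferred-free pattern this patch applies to every filesystem can be
sketched in userspace as follows. This is only a toy model: toy_call_rcu(),
toy_rcu_barrier() and struct toy_inode are hypothetical stand-ins, not kernel
APIs, and the "grace period" is simply the point where the pending list is
drained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct rcu_head (illustrative only). */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical inode with an embedded rcu_head, like i_rcu in the patch. */
struct toy_inode {
	unsigned long i_ino;
	struct rcu_head i_rcu;
};

static struct rcu_head *pending;	/* callbacks waiting for the grace period */
static int freed_count;

/* Toy call_rcu(): queue the callback instead of freeing immediately. */
static void toy_call_rcu(struct rcu_head *head,
			 void (*func)(struct rcu_head *head))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Toy grace-period end: run every queued callback. */
static void toy_rcu_barrier(void)
{
	while (pending) {
		struct rcu_head *head = pending;

		pending = head->next;
		head->func(head);
	}
}

/* Recover the enclosing inode from its rcu_head, then really free it. */
static void toy_i_callback(struct rcu_head *head)
{
	struct toy_inode *inode = container_of(head, struct toy_inode, i_rcu);

	free(inode);
	freed_count++;
}

/* The destroy hook no longer frees; it only schedules the free. */
static void toy_destroy_inode(struct toy_inode *inode)
{
	toy_call_rcu(&inode->i_rcu, toy_i_callback);
}

/* Returns 1 if the free was deferred until the toy grace period ended. */
static int toy_demo(void)
{
	struct toy_inode *inode = malloc(sizeof(*inode));
	int before = freed_count;
	int deferred;

	assert(inode != NULL);
	inode->i_ino = 42;
	toy_destroy_inode(inode);

	/* Still readable here: a lockless walker could safely look at it. */
	deferred = (freed_count == before && inode->i_ino == 42);

	toy_rcu_barrier();
	return deferred && freed_count == before + 1;
}
```

The real callbacks in the diff below follow the same shape, with
kmem_cache_free() in place of free() and an extra INIT_LIST_HEAD() on
i_dentry.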

The downside of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking in, this increases to about 20%.

In cases where inode lifetimes are longer (i.e. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.
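
For readers who want to reproduce this kind of measurement, a creat/unlink
loop might look roughly like the sketch below. This is not the benchmark used
for the numbers above; creat_unlink_loop() and its parameters are assumptions
made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/*
 * Create and immediately unlink one file `iterations` times, so each
 * pass allocates and frees an inode. Returns the number of completed
 * iterations and stores the elapsed seconds in *seconds if non-NULL.
 */
static long creat_unlink_loop(const char *path, long iterations,
			      double *seconds)
{
	struct timespec start, end;
	long done = 0;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < iterations; i++) {
		int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);

		if (fd < 0)
			break;
		close(fd);
		if (unlink(path) != 0)
			break;
		done++;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	if (seconds)
		*seconds = (double)(end.tv_sec - start.tv_sec) +
			   (double)(end.tv_nsec - start.tv_nsec) / 1e9;
	return done;
}
```

Running the same iteration count on kernels with and without this patch, and
comparing the elapsed times, would give a rough view of the short-lifetime
case described above.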

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU.
However, this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.
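
The extra complexity comes from the fact that, with SLAB_DESTROY_BY_RCU, an
object's memory stays type-stable but may be reused for a different object
right away, so a lockless reader has to lock what it found and recheck its
identity. A minimal single-threaded sketch of that recheck, using hypothetical
toy_* names and a plain flag in place of a spinlock:

```c
#include <assert.h>
#include <stddef.h>

struct toy_obj {
	int locked;		/* stand-in for a per-object spinlock */
	unsigned long key;	/* identity; changes when the slot is reused */
	int live;
};

/* One-slot "type-stable" cache: this memory is always a struct toy_obj. */
static struct toy_obj slot;

static struct toy_obj *toy_alloc(unsigned long key)
{
	slot.key = key;
	slot.live = 1;
	return &slot;
}

static void toy_free(struct toy_obj *obj)
{
	obj->live = 0;		/* may be handed out again immediately */
}

/*
 * Lock an object found by a lockless lookup and confirm that it is
 * still the one we were looking for. Returns NULL if it was recycled.
 */
static struct toy_obj *toy_lock_and_validate(struct toy_obj *obj,
					     unsigned long key)
{
	obj->locked = 1;
	if (!obj->live || obj->key != key) {
		obj->locked = 0;
		return NULL;
	}
	return obj;
}

/* Returns 1 if the stale lookup is rejected and the fresh one accepted. */
static int toy_reuse_demo(void)
{
	struct toy_obj *found = toy_alloc(1);	/* reader finds object 1 */

	toy_free(found);			/* object 1 is freed ... */
	toy_alloc(2);				/* ... and the slot is reused */

	return toy_lock_and_validate(found, 1) == NULL &&
	       toy_lock_and_validate(found, 2) != NULL;
}
```

Plain call_rcu() freeing, as done in this patch, avoids the recheck because the
memory cannot be reused until the grace period has passed.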

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 Documentation/filesystems/porting         |   14 ++++++++++++++
 arch/powerpc/platforms/cell/spufs/inode.c |   10 ++++++++--
 drivers/staging/pohmelfs/inode.c          |   11 +++++++++--
 fs/9p/vfs_inode.c                         |    9 ++++++++-
 fs/adfs/super.c                           |    9 ++++++++-
 fs/affs/super.c                           |    9 ++++++++-
 fs/afs/super.c                            |   10 +++++++++-
 fs/befs/linuxvfs.c                        |   10 ++++++++--
 fs/bfs/inode.c                            |    9 ++++++++-
 fs/block_dev.c                            |    9 ++++++++-
 fs/btrfs/inode.c                          |    9 ++++++++-
 fs/ceph/inode.c                           |   11 ++++++++++-
 fs/cifs/cifsfs.c                          |    9 ++++++++-
 fs/coda/inode.c                           |    9 ++++++++-
 fs/ecryptfs/super.c                       |   12 +++++++++++-
 fs/efs/super.c                            |    9 ++++++++-
 fs/exofs/super.c                          |    9 ++++++++-
 fs/ext2/super.c                           |    9 ++++++++-
 fs/ext3/super.c                           |    9 ++++++++-
 fs/ext4/super.c                           |    9 ++++++++-
 fs/fat/inode.c                            |    9 ++++++++-
 fs/freevxfs/vxfs_inode.c                  |    9 ++++++++-
 fs/fuse/inode.c                           |    9 ++++++++-
 fs/gfs2/super.c                           |    9 ++++++++-
 fs/hfs/super.c                            |    9 ++++++++-
 fs/hfsplus/super.c                        |   10 +++++++++-
 fs/hostfs/hostfs_kern.c                   |    9 ++++++++-
 fs/hpfs/super.c                           |    9 ++++++++-
 fs/hppfs/hppfs.c                          |    9 ++++++++-
 fs/hugetlbfs/inode.c                      |    9 ++++++++-
 fs/inode.c                                |   10 +++++++++-
 fs/isofs/inode.c                          |    9 ++++++++-
 fs/jffs2/super.c                          |    9 ++++++++-
 fs/jfs/super.c                            |   10 +++++++++-
 fs/logfs/inode.c                          |    9 ++++++++-
 fs/minix/inode.c                          |    9 ++++++++-
 fs/ncpfs/inode.c                          |    9 ++++++++-
 fs/nfs/inode.c                            |    9 ++++++++-
 fs/nilfs2/super.c                         |   10 +++++++++-
 fs/ntfs/inode.c                           |    9 ++++++++-
 fs/ocfs2/dlmfs/dlmfs.c                    |    9 ++++++++-
 fs/ocfs2/super.c                          |    9 ++++++++-
 fs/openpromfs/inode.c                     |    9 ++++++++-
 fs/proc/inode.c                           |    9 ++++++++-
 fs/qnx4/inode.c                           |    9 ++++++++-
 fs/reiserfs/super.c                       |    9 ++++++++-
 fs/romfs/super.c                          |    9 ++++++++-
 fs/squashfs/super.c                       |    9 ++++++++-
 fs/sysv/inode.c                           |    9 ++++++++-
 fs/ubifs/super.c                          |   10 +++++++++-
 fs/udf/super.c                            |    9 ++++++++-
 fs/ufs/super.c                            |    9 ++++++++-
 fs/xfs/xfs_iget.c                         |   13 ++++++++++++-
 include/linux/fs.h                        |    5 ++++-
 include/linux/net.h                       |    1 -
 ipc/mqueue.c                              |    9 ++++++++-
 mm/shmem.c                                |    9 ++++++++-
 net/socket.c                              |   16 ++++++++--------
 net/sunrpc/rpc_pipe.c                     |   10 +++++++++-
 59 files changed, 483 insertions(+), 68 deletions(-)

diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index fd353e6..8f03e58 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -347,3 +347,17 @@ for details of what locks to replace dcache_lock with in order to protect
 particular things. Most of the time, a filesystem only needs ->d_lock, which
 protects *all* the dcache state of a given dentry.
 
+--
+[mandatory]
+
+	Filesystems must RCU-free their inodes, if they can have been accessed
+via rcu-walk path walk (basically, if the file can have had a path name in the
+vfs namespace).
+
+	i_dentry and i_rcu share storage in a union, and the vfs expects
+i_dentry to be reinitialized before it is freed, so an:
+
+  INIT_LIST_HEAD(&inode->i_dentry);
+
+must be done in the RCU callback.
+
diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index 03185de..856e9c3 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -71,12 +71,18 @@ spufs_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void
-spufs_destroy_inode(struct inode *inode)
+static void spufs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(spufs_inode_cache, SPUFS_I(inode));
 }
 
+static void spufs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, spufs_i_callback);
+}
+
 static void
 spufs_init_once(void *p)
 {
diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
index 61685cc..736eb41 100644
--- a/drivers/staging/pohmelfs/inode.c
+++ b/drivers/staging/pohmelfs/inode.c
@@ -826,6 +826,13 @@ const struct address_space_operations pohmelfs_aops = {
 	.set_page_dirty 	= __set_page_dirty_nobuffers,
 };
 
+static void pohmelfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(pohmelfs_inode_cache, POHMELFS_I(inode));
+}
+
 /*
  * ->detroy_inode() callback. Deletes inode from the caches
  *  and frees private data.
@@ -842,8 +849,8 @@ static void pohmelfs_destroy_inode(struct inode *inode)
 
 	dprintk("%s: pi: %p, inode: %p, ino: %llu.\n",
 		__func__, pi, &pi->vfs_inode, pi->ino);
-	kmem_cache_free(pohmelfs_inode_cache, pi);
 	atomic_long_dec(&psb->total_inodes);
+	call_rcu(&inode->i_rcu, pohmelfs_i_callback);
 }
 
 /*
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 1073bca..f6f9081 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -237,10 +237,17 @@ struct inode *v9fs_alloc_inode(struct super_block *sb)
  *
  */
 
-void v9fs_destroy_inode(struct inode *inode)
+static void v9fs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(vcookie_cache, v9fs_inode2cookie(inode));
 }
+
+void v9fs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, v9fs_i_callback);
+}
 #endif
 
 /**
diff --git a/fs/adfs/super.c b/fs/adfs/super.c
index 959dbff..47dffc5 100644
--- a/fs/adfs/super.c
+++ b/fs/adfs/super.c
@@ -240,11 +240,18 @@ static struct inode *adfs_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void adfs_destroy_inode(struct inode *inode)
+static void adfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(adfs_inode_cachep, ADFS_I(inode));
 }
 
+static void adfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, adfs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct adfs_inode_info *ei = (struct adfs_inode_info *) foo;
diff --git a/fs/affs/super.c b/fs/affs/super.c
index 0cf7f43..4c18fcf 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -95,11 +95,18 @@ static struct inode *affs_alloc_inode(struct super_block *sb)
 	return &i->vfs_inode;
 }
 
-static void affs_destroy_inode(struct inode *inode)
+static void affs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(affs_inode_cachep, AFFS_I(inode));
 }
 
+static void affs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, affs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct affs_inode_info *ei = (struct affs_inode_info *) foo;
diff --git a/fs/afs/super.c b/fs/afs/super.c
index 27201cf..f901a9d 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -498,6 +498,14 @@ static struct inode *afs_alloc_inode(struct super_block *sb)
 	return &vnode->vfs_inode;
 }
 
+static void afs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct afs_vnode *vnode = AFS_FS_I(inode);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(afs_inode_cachep, vnode);
+}
+
 /*
  * destroy an AFS inode struct
  */
@@ -511,7 +519,7 @@ static void afs_destroy_inode(struct inode *inode)
 
 	ASSERTCMP(vnode->server, ==, NULL);
 
-	kmem_cache_free(afs_inode_cachep, vnode);
+	call_rcu(&inode->i_rcu, afs_i_callback);
 	atomic_dec(&afs_count_active_inodes);
 }
 
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index aa4e7c7..de93581 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -284,12 +284,18 @@ befs_alloc_inode(struct super_block *sb)
         return &bi->vfs_inode;
 }
 
-static void
-befs_destroy_inode(struct inode *inode)
+static void befs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
         kmem_cache_free(befs_inode_cachep, BEFS_I(inode));
 }
 
+static void befs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, befs_i_callback);
+}
+
 static void init_once(void *foo)
 {
         struct befs_inode_info *bi = (struct befs_inode_info *) foo;
diff --git a/fs/bfs/inode.c b/fs/bfs/inode.c
index 76db6d7..a8e37f8 100644
--- a/fs/bfs/inode.c
+++ b/fs/bfs/inode.c
@@ -248,11 +248,18 @@ static struct inode *bfs_alloc_inode(struct super_block *sb)
 	return &bi->vfs_inode;
 }
 
-static void bfs_destroy_inode(struct inode *inode)
+static void bfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(bfs_inode_cachep, BFS_I(inode));
 }
 
+static void bfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, bfs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct bfs_inode_info *bi = foo;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4230252..771f235 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -409,13 +409,20 @@ static struct inode *bdev_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void bdev_destroy_inode(struct inode *inode)
+static void bdev_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
 	struct bdev_inode *bdi = BDEV_I(inode);
 
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(bdev_cachep, bdi);
 }
 
+static void bdev_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, bdev_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct bdev_inode *ei = (struct bdev_inode *) foo;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e134e80..c704fd1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6322,6 +6322,13 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	return inode;
 }
 
+static void btrfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+}
+
 void btrfs_destroy_inode(struct inode *inode)
 {
 	struct btrfs_ordered_extent *ordered;
@@ -6391,7 +6398,7 @@ void btrfs_destroy_inode(struct inode *inode)
 	inode_tree_del(inode);
 	btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
 free:
-	kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+	call_rcu(&inode->i_rcu, btrfs_i_callback);
 }
 
 int btrfs_drop_inode(struct inode *inode)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 2a48caf..47f8c8b 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -368,6 +368,15 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 	return &ci->vfs_inode;
 }
 
+static void ceph_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct ceph_inode_info *ci = ceph_inode(inode);
+
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(ceph_inode_cachep, ci);
+}
+
 void ceph_destroy_inode(struct inode *inode)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
@@ -407,7 +416,7 @@ void ceph_destroy_inode(struct inode *inode)
 	if (ci->i_xattrs.prealloc_blob)
 		ceph_buffer_put(ci->i_xattrs.prealloc_blob);
 
-	kmem_cache_free(ceph_inode_cachep, ci);
+	call_rcu(&inode->i_rcu, ceph_i_callback);
 }
 
 
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 9c37897..8c223e5 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -334,10 +334,17 @@ cifs_alloc_inode(struct super_block *sb)
 	return &cifs_inode->vfs_inode;
 }
 
+static void cifs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(cifs_inode_cachep, CIFS_I(inode));
+}
+
 static void
 cifs_destroy_inode(struct inode *inode)
 {
-	kmem_cache_free(cifs_inode_cachep, CIFS_I(inode));
+	call_rcu(&inode->i_rcu, cifs_i_callback);
 }
 
 static void
diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index 5ea57c8..50dc7d1 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -56,11 +56,18 @@ static struct inode *coda_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void coda_destroy_inode(struct inode *inode)
+static void coda_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(coda_inode_cachep, ITOC(inode));
 }
 
+static void coda_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, coda_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct coda_inode_info *ei = (struct coda_inode_info *) foo;
diff --git a/fs/ecryptfs/super.c b/fs/ecryptfs/super.c
index 2720178..3042fe1 100644
--- a/fs/ecryptfs/super.c
+++ b/fs/ecryptfs/super.c
@@ -62,6 +62,16 @@ out:
 	return inode;
 }
 
+static void ecryptfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct ecryptfs_inode_info *inode_info;
+	inode_info = ecryptfs_inode_to_private(inode);
+
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(ecryptfs_inode_info_cache, inode_info);
+}
+
 /**
  * ecryptfs_destroy_inode
  * @inode: The ecryptfs inode
@@ -88,7 +98,7 @@ static void ecryptfs_destroy_inode(struct inode *inode)
 		}
 	}
 	ecryptfs_destroy_crypt_stat(&inode_info->crypt_stat);
-	kmem_cache_free(ecryptfs_inode_info_cache, inode_info);
+	call_rcu(&inode->i_rcu, ecryptfs_i_callback);
 }
 
 /**
diff --git a/fs/efs/super.c b/fs/efs/super.c
index 5073a07..0f31acb 100644
--- a/fs/efs/super.c
+++ b/fs/efs/super.c
@@ -65,11 +65,18 @@ static struct inode *efs_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void efs_destroy_inode(struct inode *inode)
+static void efs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(efs_inode_cachep, INODE_INFO(inode));
 }
 
+static void efs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, efs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct efs_inode_info *ei = (struct efs_inode_info *) foo;
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 79c3ae6..8c6c466 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -150,12 +150,19 @@ static struct inode *exofs_alloc_inode(struct super_block *sb)
 	return &oi->vfs_inode;
 }
 
+static void exofs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(exofs_inode_cachep, exofs_i(inode));
+}
+
 /*
  * Remove an inode from the cache
  */
 static void exofs_destroy_inode(struct inode *inode)
 {
-	kmem_cache_free(exofs_inode_cachep, exofs_i(inode));
+	call_rcu(&inode->i_rcu, exofs_i_callback);
 }
 
 /*
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index d89e0b6..e0c6380 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -161,11 +161,18 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void ext2_destroy_inode(struct inode *inode)
+static void ext2_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(ext2_inode_cachep, EXT2_I(inode));
 }
 
+static void ext2_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, ext2_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct ext2_inode_info *ei = (struct ext2_inode_info *) foo;
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index acf8695..77ce161 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -479,6 +479,13 @@ static struct inode *ext3_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
+static void ext3_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
+}
+
 static void ext3_destroy_inode(struct inode *inode)
 {
 	if (!list_empty(&(EXT3_I(inode)->i_orphan))) {
@@ -489,7 +496,7 @@ static void ext3_destroy_inode(struct inode *inode)
 				false);
 		dump_stack();
 	}
-	kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
+	call_rcu(&inode->i_rcu, ext3_i_callback);
 }
 
 static void init_once(void *foo)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e32195d..52c10e1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -841,6 +841,13 @@ static int ext4_drop_inode(struct inode *inode)
 	return drop;
 }
 
+static void ext4_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(ext4_inode_cachep, EXT4_I(inode));
+}
+
 static void ext4_destroy_inode(struct inode *inode)
 {
 	ext4_ioend_wait(inode);
@@ -853,7 +860,7 @@ static void ext4_destroy_inode(struct inode *inode)
 				true);
 		dump_stack();
 	}
-	kmem_cache_free(ext4_inode_cachep, EXT4_I(inode));
+	call_rcu(&inode->i_rcu, ext4_i_callback);
 }
 
 static void init_once(void *foo)
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index ad6998a..8cccfeb 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -514,11 +514,18 @@ static struct inode *fat_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void fat_destroy_inode(struct inode *inode)
+static void fat_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(fat_inode_cachep, MSDOS_I(inode));
 }
 
+static void fat_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, fat_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct msdos_inode_info *ei = (struct msdos_inode_info *)foo;
diff --git a/fs/freevxfs/vxfs_inode.c b/fs/freevxfs/vxfs_inode.c
index 8c04eac..2ba6719 100644
--- a/fs/freevxfs/vxfs_inode.c
+++ b/fs/freevxfs/vxfs_inode.c
@@ -337,6 +337,13 @@ vxfs_iget(struct super_block *sbp, ino_t ino)
 	return ip;
 }
 
+static void vxfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(vxfs_inode_cachep, inode->i_private);
+}
+
 /**
  * vxfs_evict_inode - remove inode from main memory
  * @ip:		inode to discard.
@@ -350,5 +357,5 @@ vxfs_evict_inode(struct inode *ip)
 {
 	truncate_inode_pages(&ip->i_data, 0);
 	end_writeback(ip);
-	kmem_cache_free(vxfs_inode_cachep, ip->i_private);
+	call_rcu(&ip->i_rcu, vxfs_i_callback);
 }
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index cfce3ad..44e0a6c 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -99,6 +99,13 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
 	return inode;
 }
 
+static void fuse_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(fuse_inode_cachep, inode);
+}
+
 static void fuse_destroy_inode(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
@@ -106,7 +113,7 @@ static void fuse_destroy_inode(struct inode *inode)
 	BUG_ON(!list_empty(&fi->queued_writes));
 	if (fi->forget_req)
 		fuse_request_free(fi->forget_req);
-	kmem_cache_free(fuse_inode_cachep, inode);
+	call_rcu(&inode->i_rcu, fuse_i_callback);
 }
 
 void fuse_send_forget(struct fuse_conn *fc, struct fuse_req *req,
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 2b2c499..16c2eca 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1405,11 +1405,18 @@ static struct inode *gfs2_alloc_inode(struct super_block *sb)
 	return &ip->i_inode;
 }
 
-static void gfs2_destroy_inode(struct inode *inode)
+static void gfs2_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(gfs2_inode_cachep, inode);
 }
 
+static void gfs2_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, gfs2_i_callback);
+}
+
 const struct super_operations gfs2_super_ops = {
 	.alloc_inode		= gfs2_alloc_inode,
 	.destroy_inode		= gfs2_destroy_inode,
diff --git a/fs/hfs/super.c b/fs/hfs/super.c
index 4824c27..ef4ee57 100644
--- a/fs/hfs/super.c
+++ b/fs/hfs/super.c
@@ -167,11 +167,18 @@ static struct inode *hfs_alloc_inode(struct super_block *sb)
 	return i ? &i->vfs_inode : NULL;
 }
 
-static void hfs_destroy_inode(struct inode *inode)
+static void hfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(hfs_inode_cachep, HFS_I(inode));
 }
 
+static void hfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, hfs_i_callback);
+}
+
 static const struct super_operations hfs_super_operations = {
 	.alloc_inode	= hfs_alloc_inode,
 	.destroy_inode	= hfs_destroy_inode,
diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
index 52cc746..182e83a 100644
--- a/fs/hfsplus/super.c
+++ b/fs/hfsplus/super.c
@@ -488,11 +488,19 @@ static struct inode *hfsplus_alloc_inode(struct super_block *sb)
 	return i ? &i->vfs_inode : NULL;
 }
 
-static void hfsplus_destroy_inode(struct inode *inode)
+static void hfsplus_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(hfsplus_inode_cachep, HFSPLUS_I(inode));
 }
 
+static void hfsplus_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, hfsplus_i_callback);
+}
+
 #define HFSPLUS_INODE_SIZE	sizeof(struct hfsplus_inode_info)
 
 static struct dentry *hfsplus_mount(struct file_system_type *fs_type,
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 39dc505..861113f 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -247,11 +247,18 @@ static void hostfs_evict_inode(struct inode *inode)
 	}
 }
 
-static void hostfs_destroy_inode(struct inode *inode)
+static void hostfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kfree(HOSTFS_I(inode));
 }
 
+static void hostfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, hostfs_i_callback);
+}
+
 static int hostfs_show_options(struct seq_file *seq, struct vfsmount *vfs)
 {
 	const char *root_path = vfs->mnt_sb->s_fs_info;
diff --git a/fs/hpfs/super.c b/fs/hpfs/super.c
index 6c5f015..49935ba 100644
--- a/fs/hpfs/super.c
+++ b/fs/hpfs/super.c
@@ -177,11 +177,18 @@ static struct inode *hpfs_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void hpfs_destroy_inode(struct inode *inode)
+static void hpfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(hpfs_inode_cachep, hpfs_i(inode));
 }
 
+static void hpfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, hpfs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct hpfs_inode_info *ei = (struct hpfs_inode_info *) foo;
diff --git a/fs/hppfs/hppfs.c b/fs/hppfs/hppfs.c
index f702b5f..87ed48e 100644
--- a/fs/hppfs/hppfs.c
+++ b/fs/hppfs/hppfs.c
@@ -632,11 +632,18 @@ void hppfs_evict_inode(struct inode *ino)
 	mntput(ino->i_sb->s_fs_info);
 }
 
-static void hppfs_destroy_inode(struct inode *inode)
+static void hppfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kfree(HPPFS_I(inode));
 }
 
+static void hppfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, hppfs_i_callback);
+}
+
 static const struct super_operations hppfs_sbops = {
 	.alloc_inode	= hppfs_alloc_inode,
 	.destroy_inode	= hppfs_destroy_inode,
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a5fe681..9885082 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -663,11 +663,18 @@ static struct inode *hugetlbfs_alloc_inode(struct super_block *sb)
 	return &p->vfs_inode;
 }
 
+static void hugetlbfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
+}
+
 static void hugetlbfs_destroy_inode(struct inode *inode)
 {
 	hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
 	mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
-	kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
+	call_rcu(&inode->i_rcu, hugetlbfs_i_callback);
 }
 
 static const struct address_space_operations hugetlbfs_aops = {
diff --git a/fs/inode.c b/fs/inode.c
index ae2727a..26a8ac1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -270,6 +270,13 @@ void __destroy_inode(struct inode *inode)
 }
 EXPORT_SYMBOL(__destroy_inode);
 
+static void i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(inode_cachep, inode);
+}
+
 static void destroy_inode(struct inode *inode)
 {
 	BUG_ON(!list_empty(&inode->i_lru));
@@ -277,7 +284,7 @@ static void destroy_inode(struct inode *inode)
 	if (inode->i_sb->s_op->destroy_inode)
 		inode->i_sb->s_op->destroy_inode(inode);
 	else
-		kmem_cache_free(inode_cachep, (inode));
+		call_rcu(&inode->i_rcu, i_callback);
 }
 
 /*
@@ -430,6 +437,7 @@ void end_writeback(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(inode->i_state & I_CLEAR);
 	inode_sync_wait(inode);
+	/* don't need i_lock here, no concurrent mods to i_state */
 	inode->i_state = I_FREEING | I_CLEAR;
 }
 EXPORT_SYMBOL(end_writeback);
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index bc77744..9813d54 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -77,11 +77,18 @@ static struct inode *isofs_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void isofs_destroy_inode(struct inode *inode)
+static void isofs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(isofs_inode_cachep, ISOFS_I(inode));
 }
 
+static void isofs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, isofs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct iso_inode_info *ei = foo;
diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index c86041b..853b8e3 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -40,11 +40,18 @@ static struct inode *jffs2_alloc_inode(struct super_block *sb)
 	return &f->vfs_inode;
 }
 
-static void jffs2_destroy_inode(struct inode *inode)
+static void jffs2_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(jffs2_inode_cachep, JFFS2_INODE_INFO(inode));
 }
 
+static void jffs2_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, jffs2_i_callback);
+}
+
 static void jffs2_i_init_once(void *foo)
 {
 	struct jffs2_inode_info *f = foo;
diff --git a/fs/jfs/super.c b/fs/jfs/super.c
index 0669fc1..b715b0f 100644
--- a/fs/jfs/super.c
+++ b/fs/jfs/super.c
@@ -115,6 +115,14 @@ static struct inode *jfs_alloc_inode(struct super_block *sb)
 	return &jfs_inode->vfs_inode;
 }
 
+static void jfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct jfs_inode_info *ji = JFS_IP(inode);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(jfs_inode_cachep, ji);
+}
+
 static void jfs_destroy_inode(struct inode *inode)
 {
 	struct jfs_inode_info *ji = JFS_IP(inode);
@@ -128,7 +136,7 @@ static void jfs_destroy_inode(struct inode *inode)
 		ji->active_ag = -1;
 	}
 	spin_unlock_irq(&ji->ag_lock);
-	kmem_cache_free(jfs_inode_cachep, ji);
+	call_rcu(&inode->i_rcu, jfs_i_callback);
 }
 
 static int jfs_statfs(struct dentry *dentry, struct kstatfs *buf)
diff --git a/fs/logfs/inode.c b/fs/logfs/inode.c
index d8c71ec..03b8c24 100644
--- a/fs/logfs/inode.c
+++ b/fs/logfs/inode.c
@@ -141,13 +141,20 @@ struct inode *logfs_safe_iget(struct super_block *sb, ino_t ino, int *is_cached)
 	return __logfs_iget(sb, ino);
 }
 
+static void logfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(logfs_inode_cache, logfs_inode(inode));
+}
+
 static void __logfs_destroy_inode(struct inode *inode)
 {
 	struct logfs_inode *li = logfs_inode(inode);
 
 	BUG_ON(li->li_block);
 	list_del(&li->li_freeing_list);
-	kmem_cache_free(logfs_inode_cache, li);
+	call_rcu(&inode->i_rcu, logfs_i_callback);
 }
 
 static void logfs_destroy_inode(struct inode *inode)
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index fb20208..ae0b83f 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -68,11 +68,18 @@ static struct inode *minix_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void minix_destroy_inode(struct inode *inode)
+static void minix_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(minix_inode_cachep, minix_i(inode));
 }
 
+static void minix_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, minix_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct minix_inode_info *ei = (struct minix_inode_info *) foo;
diff --git a/fs/ncpfs/inode.c b/fs/ncpfs/inode.c
index 8fb93b6..60047db 100644
--- a/fs/ncpfs/inode.c
+++ b/fs/ncpfs/inode.c
@@ -58,11 +58,18 @@ static struct inode *ncp_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void ncp_destroy_inode(struct inode *inode)
+static void ncp_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(ncp_inode_cachep, NCP_FINFO(inode));
 }
 
+static void ncp_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, ncp_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct ncp_inode_info *ei = (struct ncp_inode_info *) foo;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 314f571..a10ed32 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1437,11 +1437,18 @@ struct inode *nfs_alloc_inode(struct super_block *sb)
 	return &nfsi->vfs_inode;
 }
 
-void nfs_destroy_inode(struct inode *inode)
+static void nfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(nfs_inode_cachep, NFS_I(inode));
 }
 
+void nfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, nfs_i_callback);
+}
+
 static inline void nfs4_init_once(struct nfs_inode *nfsi)
 {
 #ifdef CONFIG_NFS_V4
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index d36fc7e..e2dcc9c 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -162,10 +162,13 @@ struct inode *nilfs_alloc_inode(struct super_block *sb)
 	return &ii->vfs_inode;
 }
 
-void nilfs_destroy_inode(struct inode *inode)
+static void nilfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
 	struct nilfs_mdt_info *mdi = NILFS_MDT(inode);
 
+	INIT_LIST_HEAD(&inode->i_dentry);
+
 	if (mdi) {
 		kfree(mdi->mi_bgl); /* kfree(NULL) is safe */
 		kfree(mdi);
@@ -173,6 +176,11 @@ void nilfs_destroy_inode(struct inode *inode)
 	kmem_cache_free(nilfs_inode_cachep, NILFS_I(inode));
 }
 
+void nilfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, nilfs_i_callback);
+}
+
 static int nilfs_sync_super(struct nilfs_sb_info *sbi, int flag)
 {
 	struct the_nilfs *nilfs = sbi->s_nilfs;
diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index 93622b1..a627ed8 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -332,6 +332,13 @@ struct inode *ntfs_alloc_big_inode(struct super_block *sb)
 	return NULL;
 }
 
+static void ntfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(ntfs_big_inode_cache, NTFS_I(inode));
+}
+
 void ntfs_destroy_big_inode(struct inode *inode)
 {
 	ntfs_inode *ni = NTFS_I(inode);
@@ -340,7 +347,7 @@ void ntfs_destroy_big_inode(struct inode *inode)
 	BUG_ON(ni->page);
 	if (!atomic_dec_and_test(&ni->count))
 		BUG();
-	kmem_cache_free(ntfs_big_inode_cache, NTFS_I(inode));
+	call_rcu(&inode->i_rcu, ntfs_i_callback);
 }
 
 static inline ntfs_inode *ntfs_alloc_extent_inode(void)
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index b2df490..8c5c0ed 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -351,11 +351,18 @@ static struct inode *dlmfs_alloc_inode(struct super_block *sb)
 	return &ip->ip_vfs_inode;
 }
 
-static void dlmfs_destroy_inode(struct inode *inode)
+static void dlmfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(dlmfs_inode_cache, DLMFS_I(inode));
 }
 
+static void dlmfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, dlmfs_i_callback);
+}
+
 static void dlmfs_evict_inode(struct inode *inode)
 {
 	int status;
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index cfeab7c..17ff46f 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -569,11 +569,18 @@ static struct inode *ocfs2_alloc_inode(struct super_block *sb)
 	return &oi->vfs_inode;
 }
 
-static void ocfs2_destroy_inode(struct inode *inode)
+static void ocfs2_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(ocfs2_inode_cachep, OCFS2_I(inode));
 }
 
+static void ocfs2_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, ocfs2_i_callback);
+}
+
 static unsigned long long ocfs2_max_file_offset(unsigned int bbits,
 						unsigned int cbits)
 {
diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c
index 911e61f..a2a5bff 100644
--- a/fs/openpromfs/inode.c
+++ b/fs/openpromfs/inode.c
@@ -343,11 +343,18 @@ static struct inode *openprom_alloc_inode(struct super_block *sb)
 	return &oi->vfs_inode;
 }
 
-static void openprom_destroy_inode(struct inode *inode)
+static void openprom_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(op_inode_cachep, OP_I(inode));
 }
 
+static void openprom_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, openprom_i_callback);
+}
+
 static struct inode *openprom_iget(struct super_block *sb, ino_t ino)
 {
 	struct inode *inode;
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 3ddb606..6bcb926 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -65,11 +65,18 @@ static struct inode *proc_alloc_inode(struct super_block *sb)
 	return inode;
 }
 
-static void proc_destroy_inode(struct inode *inode)
+static void proc_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(proc_inode_cachep, PROC_I(inode));
 }
 
+static void proc_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, proc_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct proc_inode *ei = (struct proc_inode *) foo;
diff --git a/fs/qnx4/inode.c b/fs/qnx4/inode.c
index fcada42..e63b417 100644
--- a/fs/qnx4/inode.c
+++ b/fs/qnx4/inode.c
@@ -425,11 +425,18 @@ static struct inode *qnx4_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void qnx4_destroy_inode(struct inode *inode)
+static void qnx4_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(qnx4_inode_cachep, qnx4_i(inode));
 }
 
+static void qnx4_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, qnx4_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct qnx4_inode_info *ei = (struct qnx4_inode_info *) foo;
diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index b243117..2575682 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -529,11 +529,18 @@ static struct inode *reiserfs_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void reiserfs_destroy_inode(struct inode *inode)
+static void reiserfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(reiserfs_inode_cachep, REISERFS_I(inode));
 }
 
+static void reiserfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, reiserfs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct reiserfs_inode_info *ei = (struct reiserfs_inode_info *)foo;
diff --git a/fs/romfs/super.c b/fs/romfs/super.c
index 6647f90..2305e31 100644
--- a/fs/romfs/super.c
+++ b/fs/romfs/super.c
@@ -400,11 +400,18 @@ static struct inode *romfs_alloc_inode(struct super_block *sb)
 /*
  * return a spent inode to the slab cache
  */
-static void romfs_destroy_inode(struct inode *inode)
+static void romfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(romfs_inode_cachep, ROMFS_I(inode));
 }
 
+static void romfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, romfs_i_callback);
+}
+
 /*
  * get filesystem statistics
  */
diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
index 24de30b..20700b9 100644
--- a/fs/squashfs/super.c
+++ b/fs/squashfs/super.c
@@ -440,11 +440,18 @@ static struct inode *squashfs_alloc_inode(struct super_block *sb)
 }
 
 
-static void squashfs_destroy_inode(struct inode *inode)
+static void squashfs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(squashfs_inode_cachep, squashfs_i(inode));
 }
 
+static void squashfs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, squashfs_i_callback);
+}
+
 
 static struct file_system_type squashfs_fs_type = {
 	.owner = THIS_MODULE,
diff --git a/fs/sysv/inode.c b/fs/sysv/inode.c
index de44d06..0630eb9 100644
--- a/fs/sysv/inode.c
+++ b/fs/sysv/inode.c
@@ -333,11 +333,18 @@ static struct inode *sysv_alloc_inode(struct super_block *sb)
 	return &si->vfs_inode;
 }
 
-static void sysv_destroy_inode(struct inode *inode)
+static void sysv_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(sysv_inode_cachep, SYSV_I(inode));
 }
 
+static void sysv_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, sysv_i_callback);
+}
+
 static void init_once(void *p)
 {
 	struct sysv_inode_info *si = (struct sysv_inode_info *)p;
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 91fac54..6e11c29 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -272,12 +272,20 @@ static struct inode *ubifs_alloc_inode(struct super_block *sb)
 	return &ui->vfs_inode;
 };
 
+static void ubifs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct ubifs_inode *ui = ubifs_inode(inode);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(ubifs_inode_slab, ui);
+}
+
 static void ubifs_destroy_inode(struct inode *inode)
 {
 	struct ubifs_inode *ui = ubifs_inode(inode);
 
 	kfree(ui->data);
-	kmem_cache_free(ubifs_inode_slab, inode);
+	call_rcu(&inode->i_rcu, ubifs_i_callback);
 }
 
 /*
diff --git a/fs/udf/super.c b/fs/udf/super.c
index 4a5c7c6..b539d53 100644
--- a/fs/udf/super.c
+++ b/fs/udf/super.c
@@ -139,11 +139,18 @@ static struct inode *udf_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void udf_destroy_inode(struct inode *inode)
+static void udf_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(udf_inode_cachep, UDF_I(inode));
 }
 
+static void udf_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, udf_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct udf_inode_info *ei = (struct udf_inode_info *)foo;
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 2c47dae..2c61ac5 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -1412,11 +1412,18 @@ static struct inode *ufs_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void ufs_destroy_inode(struct inode *inode)
+static void ufs_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(ufs_inode_cachep, UFS_I(inode));
 }
 
+static void ufs_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, ufs_i_callback);
+}
+
 static void init_once(void *foo)
 {
 	struct ufs_inode_info *ei = (struct ufs_inode_info *) foo;
diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index 0cdd269..d7de5a3 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -91,6 +91,17 @@ xfs_inode_alloc(
 	return ip;
 }
 
+STATIC void
+xfs_inode_free_callback(
+	struct rcu_head		*head)
+{
+	struct inode		*inode = container_of(head, struct inode, i_rcu);
+	struct xfs_inode	*ip = XFS_I(inode);
+
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_zone_free(xfs_inode_zone, ip);
+}
+
 void
 xfs_inode_free(
 	struct xfs_inode	*ip)
@@ -134,7 +145,7 @@ xfs_inode_free(
 	ASSERT(!spin_is_locked(&ip->i_flags_lock));
 	ASSERT(completion_done(&ip->i_flush));
 
-	kmem_zone_free(xfs_inode_zone, ip);
+	call_rcu(&ip->i_vnode.i_rcu, xfs_inode_free_callback);
 }
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bf95e7e..280d11c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -736,7 +736,10 @@ struct inode {
 	struct list_head	i_wb_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
-	struct list_head	i_dentry;
+	union {
+		struct list_head	i_dentry;
+		struct rcu_head		i_rcu;
+	};
 	unsigned long		i_ino;
 	atomic_t		i_count;
 	unsigned int		i_nlink;
diff --git a/include/linux/net.h b/include/linux/net.h
index 16faa13..06bde49 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -120,7 +120,6 @@ enum sock_shutdown_cmd {
 struct socket_wq {
 	wait_queue_head_t	wait;
 	struct fasync_struct	*fasync_list;
-	struct rcu_head		rcu;
 } ____cacheline_aligned_in_smp;
 
 /**
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 035f439..14fb6d6 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -237,11 +237,18 @@ static struct inode *mqueue_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
-static void mqueue_destroy_inode(struct inode *inode)
+static void mqueue_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(mqueue_inode_cachep, MQUEUE_I(inode));
 }
 
+static void mqueue_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, mqueue_i_callback);
+}
+
 static void mqueue_evict_inode(struct inode *inode)
 {
 	struct mqueue_inode_info *info;
diff --git a/mm/shmem.c b/mm/shmem.c
index 47fdeeb..5ee67c9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2415,13 +2415,20 @@ static struct inode *shmem_alloc_inode(struct super_block *sb)
 	return &p->vfs_inode;
 }
 
+static void shmem_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
+}
+
 static void shmem_destroy_inode(struct inode *inode)
 {
 	if ((inode->i_mode & S_IFMT) == S_IFREG) {
 		/* only struct inode is valid if it's an inline symlink */
 		mpol_free_shared_policy(&SHMEM_I(inode)->policy);
 	}
-	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
+	call_rcu(&inode->i_rcu, shmem_i_callback);
 }
 
 static void init_once(void *foo)
diff --git a/net/socket.c b/net/socket.c
index 3ca2fd9..d2504c6 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -262,20 +262,20 @@ static struct inode *sock_alloc_inode(struct super_block *sb)
 }
 
 
-static void wq_free_rcu(struct rcu_head *head)
+static void sock_free_rcu(struct rcu_head *head)
 {
-	struct socket_wq *wq = container_of(head, struct socket_wq, rcu);
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct socket_alloc *ei = container_of(inode, struct socket_alloc,
+								vfs_inode);
 
-	kfree(wq);
+	kfree(ei->socket.wq);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(sock_inode_cachep, ei);
 }
 
 static void sock_destroy_inode(struct inode *inode)
 {
-	struct socket_alloc *ei;
-
-	ei = container_of(inode, struct socket_alloc, vfs_inode);
-	call_rcu(&ei->socket.wq->rcu, wq_free_rcu);
-	kmem_cache_free(sock_inode_cachep, ei);
+	call_rcu(&inode->i_rcu, sock_free_rcu);
 }
 
 static void init_once(void *foo)
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index a0dc1a8..2899fe2 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -162,11 +162,19 @@ rpc_alloc_inode(struct super_block *sb)
 }
 
 static void
-rpc_destroy_inode(struct inode *inode)
+rpc_i_callback(struct rcu_head *head)
 {
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
 	kmem_cache_free(rpc_inode_cachep, RPC_I(inode));
 }
 
+static void
+rpc_destroy_inode(struct inode *inode)
+{
+	call_rcu(&inode->i_rcu, rpc_i_callback);
+}
+
 static int
 rpc_pipe_open(struct inode *inode, struct file *filp)
 {
-- 
1.7.1



* [PATCH 31/46] fs: avoid inode RCU freeing for pseudo fs
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (28 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 30/46] fs: icache RCU free inodes Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 32/46] kernel: optimise seqlock Nick Piggin
                   ` (19 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Pseudo filesystems whose inodes are never put on RCU-traversed lists and are
never reachable through rcu-walk dentries do not need to RCU free their inodes.
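
The difference between the two destroy paths can be modelled in userspace.
This is only a sketch: every toy_* name is hypothetical, toy_call_rcu() merely
queues a callback, and toy_grace_period() stands in for the end of an RCU
grace period. It is not the kernel implementation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for kernel types; every toy_* name is hypothetical. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *head);
};

struct toy_inode {
	int ino;
	struct toy_rcu_head i_rcu;
};

#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct toy_rcu_head *pending;	/* callbacks awaiting a grace period */
static int freed;			/* inodes released so far */

/* Queue the callback instead of running it (models call_rcu()). */
static void toy_call_rcu(struct toy_rcu_head *head,
			 void (*func)(struct toy_rcu_head *head))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Run all queued callbacks (models the end of a grace period). */
static void toy_grace_period(void)
{
	while (pending) {
		struct toy_rcu_head *head = pending;

		pending = head->next;
		head->func(head);
	}
}

static void toy_i_callback(struct toy_rcu_head *head)
{
	struct toy_inode *inode = toy_container_of(head, struct toy_inode, i_rcu);

	free(inode);
	freed++;
}

static struct toy_inode *toy_alloc_inode(int ino)
{
	struct toy_inode *inode = calloc(1, sizeof(*inode));

	if (inode)
		inode->ino = ino;
	return inode;
}

/* Default path: the memory stays valid until the grace period ends. */
static void toy_destroy_inode_rcu(struct toy_inode *inode)
{
	toy_call_rcu(&inode->i_rcu, toy_i_callback);
}

/* Pseudo-fs path: no RCU reader can see the inode, so free it at once. */
static void toy_destroy_inode_nonrcu(struct toy_inode *inode)
{
	free(inode);
	freed++;
}
```

In this model the non-RCU path releases memory immediately, while the RCU path
holds it until the queued callback runs.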

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/inode.c          |    6 ++++++
 fs/pipe.c           |    6 +++++-
 include/linux/fs.h  |    1 +
 include/linux/net.h |    1 +
 net/socket.c        |   17 +++++++++--------
 5 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 26a8ac1..853e0c6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -255,6 +255,12 @@ static struct inode *alloc_inode(struct super_block *sb)
 	return inode;
 }
 
+void free_inode_nonrcu(struct inode *inode)
+{
+	kmem_cache_free(inode_cachep, inode);
+}
+EXPORT_SYMBOL(free_inode_nonrcu);
+
 void __destroy_inode(struct inode *inode)
 {
 	BUG_ON(inode_has_buffers(inode));
diff --git a/fs/pipe.c b/fs/pipe.c
index a8012a9..4ae1d76 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1241,6 +1241,10 @@ out:
 	return ret;
 }
 
+static const struct super_operations pipefs_ops = {
+	.destroy_inode = free_inode_nonrcu,
+};
+
 /*
  * pipefs should _never_ be mounted by userland - too much of security hassle,
  * no real gain from having the whole whorehouse mounted. So we don't need
@@ -1250,7 +1254,7 @@ out:
 static struct dentry *pipefs_mount(struct file_system_type *fs_type,
 			 int flags, const char *dev_name, void *data)
 {
-	return mount_pseudo(fs_type, "pipe:", NULL, PIPEFS_MAGIC);
+	return mount_pseudo(fs_type, "pipe:", &pipefs_ops, PIPEFS_MAGIC);
 }
 
 static struct file_system_type pipe_fs_type = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 280d11c..03937b7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2232,6 +2232,7 @@ extern void iget_failed(struct inode *);
 extern void end_writeback(struct inode *);
 extern void __destroy_inode(struct inode *);
 extern struct inode *new_inode(struct super_block *);
+extern void free_inode_nonrcu(struct inode *inode);
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
 
diff --git a/include/linux/net.h b/include/linux/net.h
index 06bde49..16faa13 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -120,6 +120,7 @@ enum sock_shutdown_cmd {
 struct socket_wq {
 	wait_queue_head_t	wait;
 	struct fasync_struct	*fasync_list;
+	struct rcu_head		rcu;
 } ____cacheline_aligned_in_smp;
 
 /**
diff --git a/net/socket.c b/net/socket.c
index d2504c6..425da53 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -262,20 +262,21 @@ static struct inode *sock_alloc_inode(struct super_block *sb)
 }
 
 
-static void sock_free_rcu(struct rcu_head *head)
+
+static void wq_free_rcu(struct rcu_head *head)
 {
-	struct inode *inode = container_of(head, struct inode, i_rcu);
-	struct socket_alloc *ei = container_of(inode, struct socket_alloc,
-								vfs_inode);
+	struct socket_wq *wq = container_of(head, struct socket_wq, rcu);
 
-	kfree(ei->socket.wq);
-	INIT_LIST_HEAD(&inode->i_dentry);
-	kmem_cache_free(sock_inode_cachep, ei);
+	kfree(wq);
 }
 
 static void sock_destroy_inode(struct inode *inode)
 {
-	call_rcu(&inode->i_rcu, sock_free_rcu);
+	struct socket_alloc *ei;
+
+	ei = container_of(inode, struct socket_alloc, vfs_inode);
+	call_rcu(&ei->socket.wq->rcu, wq_free_rcu);
+	kmem_cache_free(sock_inode_cachep, ei);
 }
 
 static void init_once(void *foo)
-- 
1.7.1



* [PATCH 32/46] kernel: optimise seqlock
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (29 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 31/46] fs: avoid inode RCU freeing for pseudo fs Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 33/46] fs: rcu-walk for path lookup Nick Piggin
                   ` (18 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Add branch annotations for the seqlock read fastpath, and introduce
__read_seqcount_begin and __read_seqcount_retry functions, which can avoid the
smp_rmb() if used carefully. These will be used by the store-free path walking
algorithm, where performance is critical and seqlocks are in use.
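
The split between the barrier and barrier-free variants can be sketched in
userspace with C11 atomics. This is an approximation under stated assumptions:
the toy_* names are hypothetical, acquire and release fences stand in for
smp_rmb() and smp_wmb(), and only the counter protocol is modelled, not the
protected data.

```c
#include <stdatomic.h>

/* Hypothetical userspace model of seqcount_t. */
struct toy_seqcount {
	atomic_uint sequence;
};

/* Like __read_seqcount_begin: wait for an even count, no read barrier. */
static unsigned toy_read_begin_nobarrier(struct toy_seqcount *s)
{
	unsigned ret;

	do {
		ret = atomic_load_explicit(&s->sequence, memory_order_relaxed);
	} while (ret & 1);	/* odd count: a writer is in progress */
	return ret;
}

/* Like read_seqcount_begin: the same load followed by a read barrier. */
static unsigned toy_read_begin(struct toy_seqcount *s)
{
	unsigned ret = toy_read_begin_nobarrier(s);

	atomic_thread_fence(memory_order_acquire);
	return ret;
}

/* Like __read_seqcount_retry: compare only, the caller orders the loads. */
static int toy_read_retry_nobarrier(struct toy_seqcount *s, unsigned start)
{
	return atomic_load_explicit(&s->sequence, memory_order_relaxed) != start;
}

/* Like read_seqcount_retry: read barrier first, then compare. */
static int toy_read_retry(struct toy_seqcount *s, unsigned start)
{
	atomic_thread_fence(memory_order_acquire);
	return toy_read_retry_nobarrier(s, start);
}

/* Writer side: the count is odd while an update is in flight. */
static void toy_write_begin(struct toy_seqcount *s)
{
	atomic_fetch_add_explicit(&s->sequence, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

static void toy_write_end(struct toy_seqcount *s)
{
	atomic_thread_fence(memory_order_release);
	atomic_fetch_add_explicit(&s->sequence, 1, memory_order_relaxed);
}
```

A caller of the barrier-free variants is expected to get equivalent ordering
from somewhere else, for example from a barrier it already issues for another
reason, and to comment where that ordering comes from.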

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 include/linux/seqlock.h |   67 ++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 60 insertions(+), 7 deletions(-)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 632205c..513c550 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -107,7 +107,7 @@ static __always_inline int read_seqretry(const seqlock_t *sl, unsigned start)
 {
 	smp_rmb();
 
-	return (sl->sequence != start);
+	return unlikely(sl->sequence != start);
 }
 
 
@@ -125,14 +125,25 @@ typedef struct seqcount {
 #define SEQCNT_ZERO { 0 }
 #define seqcount_init(x)	do { *(x) = (seqcount_t) SEQCNT_ZERO; } while (0)
 
-/* Start of read using pointer to a sequence counter only.  */
-static inline unsigned read_seqcount_begin(const seqcount_t *s)
+/**
+ * __read_seqcount_begin - begin a seq-read critical section (without barrier)
+ * @s: pointer to seqcount_t
+ * Returns: count to be passed to read_seqcount_retry
+ *
+ * __read_seqcount_begin is like read_seqcount_begin, but has no smp_rmb()
+ * barrier. Callers should ensure that smp_rmb() or equivalent ordering is
+ * provided before actually loading any of the variables that are to be
+ * protected in this critical section.
+ *
+ * Use carefully, only in critical code, and comment how the barrier is
+ * provided.
+ */
+static inline unsigned __read_seqcount_begin(const seqcount_t *s)
 {
 	unsigned ret;
 
 repeat:
 	ret = s->sequence;
-	smp_rmb();
 	if (unlikely(ret & 1)) {
 		cpu_relax();
 		goto repeat;
@@ -140,14 +151,56 @@ repeat:
 	return ret;
 }
 
-/*
- * Test if reader processed invalid data because sequence number has changed.
+/**
+ * read_seqcount_begin - begin a seq-read critical section
+ * @s: pointer to seqcount_t
+ * Returns: count to be passed to read_seqcount_retry
+ *
+ * read_seqcount_begin opens a read critical section of the given seqcount.
+ * Validity of the critical section is tested by checking read_seqcount_retry
+ * function.
+ */
+static inline unsigned read_seqcount_begin(const seqcount_t *s)
+{
+	unsigned ret = __read_seqcount_begin(s);
+	smp_rmb();
+	return ret;
+}
+
+/**
+ * __read_seqcount_retry - end a seq-read critical section (without barrier)
+ * @s: pointer to seqcount_t
+ * @start: count, from read_seqcount_begin
+ * Returns: 1 if retry is required, else 0
+ *
+ * __read_seqcount_retry is like read_seqcount_retry, but has no smp_rmb()
+ * barrier. Callers should ensure that smp_rmb() or equivalent ordering is
+ * provided before actually loading any of the variables that are to be
+ * protected in this critical section.
+ *
+ * Use carefully, only in critical code, and comment how the barrier is
+ * provided.
+ */
+static inline int __read_seqcount_retry(const seqcount_t *s, unsigned start)
+{
+	return unlikely(s->sequence != start);
+}
+
+/**
+ * read_seqcount_retry - end a seq-read critical section
+ * @s: pointer to seqcount_t
+ * @start: count, from read_seqcount_begin
+ * Returns: 1 if retry is required, else 0
+ *
+ * read_seqcount_retry closes a read critical section of the given seqcount.
+ * If the critical section was invalid, it must be ignored (and typically
+ * retried).
  */
 static inline int read_seqcount_retry(const seqcount_t *s, unsigned start)
 {
 	smp_rmb();
 
-	return s->sequence != start;
+	return __read_seqcount_retry(s, start);
 }
 
 
-- 
1.7.1



* [PATCH 33/46] fs: rcu-walk for path lookup
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (30 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 32/46] kernel: optimise seqlock Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 34/46] fs: fs_struct use seqlock Nick Piggin
                   ` (17 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the
current algorithm which is a refcount based walk, or ref-walk.

This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.

The overall design is like this:
* Take the RCU lock for the entire path walk, starting with the acquiring
  of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
  not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
  access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
  refcounts are not required for persistence. Also we are free to perform mount
  lookups, and to assume dentry mount points and mount roots are stable up and
  down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
  so we can load this tuple atomically, and also check whether any of its
  members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
  sequence after the child is found in case anything changed in the parent
  during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
  limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
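
The per-dentry sequence check on the (name, parent, inode) tuple can be
illustrated with a single-threaded sketch. The toy_* names and the integer
inode field are hypothetical, and memory barriers are left out on purpose,
so this shows the protocol only and not code that is safe under concurrency.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified dentry: d_seq guards the three fields below. */
struct toy_dentry {
	unsigned d_seq;
	const char *d_name;
	struct toy_dentry *d_parent;
	int d_inode;		/* stand-in for struct inode * */
};

struct toy_tuple {
	const char *name;
	struct toy_dentry *parent;
	int inode;
};

/* Load the tuple and return the sequence count it was loaded under. */
static unsigned toy_load_tuple(const struct toy_dentry *d, struct toy_tuple *out)
{
	unsigned seq = d->d_seq;

	out->name = d->d_name;
	out->parent = d->d_parent;
	out->inode = d->d_inode;
	return seq;
}

/* The tuple is unusable if a writer was active or has run since the load. */
static int toy_tuple_stale(const struct toy_dentry *d, unsigned seq)
{
	return (seq & 1) || d->d_seq != seq;
}

/* Writer: bump the count before and after changing name and parent. */
static void toy_rename(struct toy_dentry *d, const char *name,
		       struct toy_dentry *parent)
{
	d->d_seq++;
	d->d_name = name;
	d->d_parent = parent;
	d->d_seq++;
}
```

A reader that finds the tuple stale discards it and either reloads or falls
back to a slower path.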

When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can't drop rcu-walk gracefully and instead return -ECHILD (for
want of a better errno). This signals the path walking code to do the lookup
again with a ref-walk.
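
The -ECHILD retry can be sketched as a two-pass lookup. This is a toy model
with hypothetical names (toy_walk, toy_path_lookup and the fields of struct
toy_path); the real control flow lives in fs/namei.c.

```c
#include <errno.h>

enum toy_mode { TOY_RCU_WALK, TOY_REF_WALK };

/* Hypothetical description of a path, enough to drive the example. */
struct toy_path {
	int needs_ref_walk;	/* rcu-walk cannot be dropped gracefully */
	int exists;		/* the final component exists */
};

/* One walk attempt: rcu-walk may give up with -ECHILD, ref-walk never does. */
static int toy_walk(const struct toy_path *p, enum toy_mode mode)
{
	if (mode == TOY_RCU_WALK && p->needs_ref_walk)
		return -ECHILD;
	return p->exists ? 0 : -ENOENT;
}

/* Try rcu-walk first and redo the whole lookup with ref-walk on -ECHILD. */
static int toy_path_lookup(const struct toy_path *p, int *attempts)
{
	int err;

	*attempts = 1;
	err = toy_walk(p, TOY_RCU_WALK);
	if (err == -ECHILD) {
		*attempts = 2;
		err = toy_walk(p, TOY_REF_WALK);
	}
	return err;
}
```

Only -ECHILD triggers the second pass; any other result from the rcu-walk
attempt is returned to the caller unchanged.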

Aside from the final dentry, there are other situations that may be encountered
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).

The cases where rcu-walk cannot continue are:
* NULL dentry (ie. creat or negative lookup)
* parent with ->d_op->d_hash
* parent with d_inode->i_op->permission or ACLs
* Following links
* Dentry with ->d_op->d_revalidate

Apart from the first, it may be possible to make most of these cases RCU
walked, though that would require a bit more poking into filesystems, so it
would be better to wait until the base infrastructure is converted.
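
The list of cases above can be read as one predicate over the parent and the
candidate child. The sketch below uses hypothetical toy_* structures whose
fields loosely mirror the names in the list; it illustrates the conditions
and is not the check performed in fs/namei.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down operation tables and objects. */
struct toy_dentry_ops {
	int (*d_hash)(void);
	int (*d_revalidate)(void);
};

struct toy_inode_ops {
	int (*permission)(void);
};

struct toy_inode {
	const struct toy_inode_ops *i_op;
	bool has_acl;
	bool is_symlink;
};

struct toy_dentry {
	const struct toy_dentry_ops *d_op;
	struct toy_inode *d_inode;
};

/* Placeholder used to populate an operation slot in examples. */
static int toy_dummy_op(void)
{
	return 0;
}

/* True if rcu-walk has to stop at this step, following the list above. */
static bool toy_must_leave_rcu_walk(const struct toy_dentry *parent,
				    const struct toy_dentry *child)
{
	const struct toy_inode *dir = parent->d_inode;

	if (!child)				/* NULL dentry */
		return true;
	if (parent->d_op && parent->d_op->d_hash)
		return true;
	if (dir && ((dir->i_op && dir->i_op->permission) || dir->has_acl))
		return true;
	if (child->d_inode && child->d_inode->is_symlink)
		return true;			/* following links */
	if (child->d_op && child->d_op->d_revalidate)
		return true;
	return false;
}
```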

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 Documentation/filesystems/dentry-locking.txt |  172 -------
 Documentation/filesystems/path-lookup.txt    |  247 +++++++++
 fs/dcache.c                                  |  167 ++++++-
 fs/filesystems.c                             |    3 +
 fs/namei.c                                   |  685 ++++++++++++++++++++-----
 include/linux/dcache.h                       |   13 +-
 include/linux/namei.h                        |   15 +-
 7 files changed, 962 insertions(+), 340 deletions(-)
 delete mode 100644 Documentation/filesystems/dentry-locking.txt
 create mode 100644 Documentation/filesystems/path-lookup.txt

diff --git a/Documentation/filesystems/dentry-locking.txt b/Documentation/filesystems/dentry-locking.txt
deleted file mode 100644
index 30b6a40..0000000
--- a/Documentation/filesystems/dentry-locking.txt
+++ /dev/null
@@ -1,172 +0,0 @@
-RCU-based dcache locking model
-==============================
-
-On many workloads, the most common operation on dcache is to look up a
-dentry, given a parent dentry and the name of the child. Typically,
-for every open(), stat() etc., the dentry corresponding to the
-pathname will be looked up by walking the tree starting with the first
-component of the pathname and using that dentry along with the next
-component to look up the next level and so on. Since it is a frequent
-operation for workloads like multiuser environments and web servers,
-it is important to optimize this path.
-
-Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in
-every component during path look-up. Since 2.5.10 onwards, fast-walk
-algorithm changed this by holding the dcache_lock at the beginning and
-walking as many cached path component dentries as possible. This
-significantly decreases the number of acquisition of
-dcache_lock. However it also increases the lock hold time
-significantly and affects performance in large SMP machines. Since
-2.5.62 kernel, dcache has been using a new locking model that uses RCU
-to make dcache look-up lock-free.
-
-The current dcache locking model is not very different from the
-existing dcache locking model. Prior to 2.5.62 kernel, dcache_lock
-protected the hash chain, d_child, d_alias, d_lru lists as well as
-d_inode and several other things like mount look-up. RCU-based changes
-affect only the way the hash chain is protected. For everything else
-the dcache_lock must be taken for both traversing as well as
-updating. The hash chain updates too take the dcache_lock.  The
-significant change is the way d_lookup traverses the hash chain, it
-doesn't acquire the dcache_lock for this and rely on RCU to ensure
-that the dentry has not been *freed*.
-
-dcache_lock no longer exists, dentry locking is explained in fs/dcache.c
-
-Dcache locking details
-======================
-
-For many multi-user workloads, open() and stat() on files are very
-frequently occurring operations. Both involve walking of path names to
-find the dentry corresponding to the concerned file. In 2.4 kernel,
-dcache_lock was held during look-up of each path component. Contention
-and cache-line bouncing of this global lock caused significant
-scalability problems. With the introduction of RCU in Linux kernel,
-this was worked around by making the look-up of path components during
-path walking lock-free.
-
-
-Safe lock-free look-up of dcache hash table
-===========================================
-
-Dcache is a complex data structure with the hash table entries also
-linked together in other lists. In 2.4 kernel, dcache_lock protected
-all the lists. RCU dentry hash walking works like this:
-
-1. The deletion from hash chain is done using hlist_del_rcu() macro
-   which doesn't initialize next pointer of the deleted dentry and
-   this allows us to walk safely lock-free while a deletion is
-   happening. This is a standard hlist_rcu iteration.
-
-2. Insertion of a dentry into the hash table is done using
-   hlist_add_head_rcu() which take care of ordering the writes - the
-   writes to the dentry must be visible before the dentry is
-   inserted. This works in conjunction with hlist_for_each_rcu(),
-   which has since been replaced by hlist_for_each_entry_rcu(), while
-   walking the hash chain. The only requirement is that all
-   initialization to the dentry must be done before
-   hlist_add_head_rcu() since we don't have lock protection
-   while traversing the hash chain.
-
-3. The dentry looked up without holding locks cannot be returned for
-   walking if it is unhashed. It then may have a NULL d_inode or other
-   bogosity since RCU doesn't protect the other fields in the dentry. We
-   therefore use a flag DCACHE_UNHASHED to indicate unhashed dentries
-   and use this in conjunction with a per-dentry lock (d_lock). Once
-   looked up without locks, we acquire the per-dentry lock (d_lock) and
-   check if the dentry is unhashed. If so, the look-up is failed. If not,
-   the reference count of the dentry is increased and the dentry is
-   returned.
-
-4. Once a dentry is looked up, it must be ensured during the path walk
-   for that component it doesn't go away. In pre-2.5.10 code, this was
-   done holding a reference to the dentry. dcache_rcu does the same.
-   In some sense, dcache_rcu path walking looks like the pre-2.5.10
-   version.
-
-5. All dentry hash chain updates must take the per-dentry lock (see
-   fs/dcache.c). This excludes dput() to ensure that a dentry that has
-   been looked up concurrently does not get deleted before dget() can
-   take a ref.
-
-6. There are several ways to do reference counting of RCU protected
-   objects. One such example is in ipv4 route cache where deferred
-   freeing (using call_rcu()) is done as soon as the reference count
-   goes to zero. This cannot be done in the case of dentries because
-   tearing down of dentries require blocking (dentry_iput()) which
-   isn't supported from RCU callbacks. Instead, tearing down of
-   dentries happen synchronously in dput(), but actual freeing happens
-   later when RCU grace period is over. This allows safe lock-free
-   walking of the hash chains, but a matched dentry may have been
-   partially torn down. The checking of DCACHE_UNHASHED flag with
-   d_lock held detects such dentries and prevents them from being
-   returned from look-up.
-
-
-Maintaining POSIX rename semantics
-==================================
-
-Since look-up of dentries is lock-free, it can race against a
-concurrent rename operation. For example, during rename of file A to
-B, look-up of either A or B must succeed.  So, if look-up of B happens
-after A has been removed from the hash chain but not added to the new
-hash chain, it may fail.  Also, a comparison while the name is being
-written concurrently by a rename may result in false positive matches
-violating rename semantics.  Issues related to race with rename are
-handled as described below :
-
-1. Look-up can be done in two ways - d_lookup() which is safe from
-   simultaneous renames and __d_lookup() which is not.  If
-   __d_lookup() fails, it must be followed up by a d_lookup() to
-   correctly determine whether a dentry is in the hash table or
-   not. d_lookup() protects look-ups using a sequence lock
-   (rename_lock).
-
-2. The name associated with a dentry (d_name) may be changed if a
-   rename is allowed to happen simultaneously. To avoid memcmp() in
-   __d_lookup() go out of bounds due to a rename and false positive
-   comparison, the name comparison is done while holding the
-   per-dentry lock. This prevents concurrent renames during this
-   operation.
-
-3. Hash table walking during look-up may move to a different bucket as
-   the current dentry is moved to a different bucket due to rename.
-   But we use hlists in dcache hash table and they are
-   null-terminated.  So, even if a dentry moves to a different bucket,
-   hash chain walk will terminate. [with a list_head list, it may not
-   since termination is when the list_head in the original bucket is
-   reached].  Since we redo the d_parent check and compare name while
-   holding d_lock, lock-free look-up will not race against d_move().
-
-4. There can be a theoretical race when a dentry keeps coming back to
-   original bucket due to double moves. Due to this look-up may
-   consider that it has never moved and can end up in a infinite loop.
-   But this is not any worse that theoretical livelocks we already
-   have in the kernel.
-
-
-Important guidelines for filesystem developers related to dcache_rcu
-====================================================================
-
-1. Existing dcache interfaces (pre-2.5.62) exported to filesystem
-   don't change. Only dcache internal implementation changes. However
-   filesystems *must not* delete from the dentry hash chains directly
-   using the list macros like allowed earlier. They must use dcache
-   APIs like d_drop() or __d_drop() depending on the situation.
-
-2. d_flags is now protected by a per-dentry lock (d_lock). All access
-   to d_flags must be protected by it.
-
-3. For a hashed dentry, checking of d_count needs to be protected by
-   d_lock.
-
-
-Papers and other documentation on dcache locking
-================================================
-
-1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
-
-2. http://lse.sourceforge.net/locking/dcache/dcache.html
-
-
-
diff --git a/Documentation/filesystems/path-lookup.txt b/Documentation/filesystems/path-lookup.txt
new file mode 100644
index 0000000..5435c13
--- /dev/null
+++ b/Documentation/filesystems/path-lookup.txt
@@ -0,0 +1,247 @@
+Path walking and name lookup locking
+====================================
+
+On many workloads, the most common operation on dcache is to look up a dentry,
+given a parent dentry and the name of the child. Typically, for every open(),
+stat() etc., the dentry corresponding to the pathname will be looked up by
+walking the tree starting with the first component of the pathname and using
+that dentry along with the next component to look up the next level and so on.
+Since it is a frequent operation for workloads like multiuser environments and
+web servers, it is important to optimize this path.
+
+Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in every
+component during path look-up. Since 2.5.10 onwards, fast-walk algorithm
+changed this by holding the dcache_lock at the beginning and walking as many
+cached path component dentries as possible. This significantly decreases the
+number of acquisition of dcache_lock. However it also increases the lock hold
+time significantly and affects performance in large SMP machines. Since 2.5.62
+kernel, dcache has been using a new locking model that uses RCU to make dcache
+look-up lock-free. Since 2.6.XXX, RCU is used to make dcache look-up and a
+significant part of the path walk completely "store-free" (so no atomics or
+cacheline bouncing on common dentries).
+
+Path walking overview
+=====================
+
+A name string specifies a start (root directory, cwd, fd-relative) and a
+sequence of elements (directory entry names), which together refer to a path in
+the namespace. The elements are strings separated by '/'.
+
+Name lookups will want to find a particular path that a name string refers to
+(usually the path of the final element, or parent of final element). This is
+done by taking the path given by the name's starting point (which we know in
+advance -- eg.  current->fs->cwd) as the first parent of the lookup. Then
+iteratively for each name element, look up the child of the current parent with
+the given name and if it is not the final entry, make it the parent for the
+next lookup.
+
+The parent must of course be a directory, and we must have appropriate
+permissions on the parent inode to be able to walk into it.
+
+Making the child a parent for the next lookup requires more checks and
+procedures. Symlinks essentially substitute the symlink name for the target
+name in the name string, and require some recursive path walking.  Mount points
+must be followed into, switching from the mount point path to the root of the
+particular mounted vfsmount.
+
+Safe store-free look-up of dcache hash table
+============================================
+
+Path walking then must, broadly, do several particular things:
+- perform directory entry name lookups on (parent, name element) tuples;
+- find the start point of the walk;
+- perform permissions and validity checks on inodes;
+- traverse mount points;
+- traverse symlinks;
+- lookup and create missing parts of the path on demand.
+
+Dcache name lookup
+------------------
+In order to look up a dcache (parent, name) tuple, we take a hash on the
+tuple, use that to select a bucket in the dcache-hash table, and then compare
+entries on the hash list with our tuple.
+
+The hash lists are RCU protected, so list walking is not serialised with
+concurrent updates (insertion, deletion from the hash). This is a standard RCU
+list application with the exception of renames which will be covered below.
+
+Parent and name members of a dentry, as well as its membership in the dcache
+hash, are protected by the per-dentry d_lock spinlock. Parent, name, and inode
+members are also protected by d_seq seqlock, although this offers read-only
+protection and no durability of results so care must be taken when using d_seq
+for synchronisation.
+
+So when walking the dcache hash list, we can lock each dentry in turn, which
+then stabilises the entry, and then we can compare the parent and full name
+string without races. If there is a match, the refcount is incremented and the
+dentry can be unlocked and returned.
+
+All other operations on the dentry such as removal from the hash table must be
+performed under d_lock, so they are excluded until we have completed the
+comparison and have a valid refcount.
+
+Renames
+
+Back to the rename case. In usual RCU protected lists, the only operations that
+will happen to an object is insertion then removal from the list.  The object
+will not be reused until an RCU grace period is complete. This ensures the RCU
+list traversal primitives can run over the object without problems (see RCU
+documentation for how this works).
+
+However when a dentry is renamed, its hash value can change, requiring it to be
+moved to a new hash list. Latency would be far too high to wait for a grace
+period after removing the dentry and before inserting it in the new hash
+bucket, so the dentry is inserted on the new list right away. When the dentry's
+list pointers are updated to point to objects in the new list, this can result
+in a concurrent RCU lookup of the old list veering off into the new (incorrect)
+list and missing the remaining dentries on the list.
+
+It is no problem to walk the wrong list, because the dentry comparisons will
+not match. However it is fatal to miss a matching dentry. So a seqlock is used
+to detect when a rename has occurred, and so the lookup can be retried.
+
+         1      2      3
+        +---+  +---+  +---+
+hlist-->| N-+->| N-+->| N-+->
+head <--+-P |<-+-P |<-+-P |
+        +---+  +---+  +---+
+
+Rename of dentry 2 may require it to be deleted from the above list and
+inserted into a new list. Deleting 2 gives the following list.
+
+         1             3
+        +---+         +---+     (don't worry, the longer pointers do not
+hlist-->| N-+-------->| N-+->    impose a measurable performance overhead
+head <--+-P |<--------+-P |      on modern CPUs)
+        +---+         +---+
+          ^      2      ^
+          |    +---+    |
+          |    | N-+----+
+          +----+-P |
+               +---+
+
+This is a standard RCU-list deletion, which leaves the deleted object's
+pointers intact, so a concurrent list walker that is currently looking at
+object 2 will correctly continue to object 3 when it is time to traverse the
+next object.
+
+However, when inserting object 2 onto a new list, we end up with this:
+
+         1             3
+        +---+         +---+
+hlist-->| N-+-------->| N-+->
+head <--+-P |<--------+-P |
+        +---+         +---+
+                 2
+               +---+
+               | N-+---->
+          <----+-P |
+               +---+
+
+Because we didn't wait for a grace period, there may be a concurrent lookup
+still at 2. Now when it follows 2's 'next' pointer, it will walk off into
+another list without ever having checked object 3.
+
+A related, but distinctly different, issue is that of rename atomicity versus
+lookup operations. If a file is renamed from 'A' to 'B', a lookup must only
+find either 'A' or 'B'. So if a lookup of 'A' returns NULL, a subsequent lookup
+of 'B' must succeed (note the reverse is not true).
+
+Between deleting the dentry from the old hash list, and inserting it on the new
+hash list, a lookup may find neither 'A' nor 'B' matching the dentry. The same
+rename seqlock is also used to cover this race in much the same way, by
+retrying a negative lookup result if a rename was in progress.
+
+Seqcount based lookups
+
+Instead of using d_lock to serialise concurrent access to the dentry while
+performing the lookup, it is possible to use d_seq. d_seq protects all the
+dentry members of interest, however they may be changed concurrently. Care must
+be taken to load the members up-front, and not perform any destructive
+operations (pretty much: no non-atomic stores to shared data), and to recheck
+the seqcount when we are "done" with the operation. Retry or abort if the
+seqcount does not match.
+
+What this means is that a caller, provided they are holding RCU lock to
+protect the dentry object from disappearing, can perform a seqcount based
+lookup which does not increment the refcount on the dentry or write to
+it in any way. This returned dentry can be used for subsequent operations
+provided that d_seq is rechecked.
+
+This is useful to perform dentry lookups of intermediate path elements without
+any cacheline bouncing or lock contention. The returned dentries can be used
+to perform subsequent dcache lookups, or we can take a refcount on them by
+taking their d_lock, rechecking d_seq, and then incrementing their refcount.
+
+RCU-walk path walking design
+============================
+
+Path walking code has two distinct modes, ref-walk and rcu-walk. ref-walk
+is the traditional[*] way of performing dcache lookups using d_lock to
+serialise concurrent modifications to the dentry and take a reference count
+on it. ref-walk is simple and obvious, and may sleep, take locks, etc while
+path walking is operating on each dentry. rcu-walk uses seqcount based
+dentry lookups and can perform lookup of intermediate elements without
+performing any stores to shared data in the dentry or inode. rcu-walk can
+not be applied to all cases, eg. if the filesystem must sleep or perform
+non trivial operations, rcu-walk must be switched to ref-walk.
+
+[*] RCU is still used for the dentry hash lookup, but not the full path walk.
+
+The overall design is like this:
+* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
+* Take the RCU lock for the entire path walk, starting with the acquiring
+  of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
+  not required for dentry persistence.
+* synchronize_rcu is called when unregistering a filesystem, so we can
+  access d_ops and i_ops during rcu-walk.
+* Similarly take the vfsmount lock for the entire path walk. So now mnt
+  refcounts are not required for persistence. Also we are free to perform mount
+  lookups, and to assume dentry mount points and mount roots are stable up and
+  down the path.
+* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
+  so we can load this tuple atomically, and also check whether any of its
+  members have changed.
+* Dentry lookups (based on parent, candidate string tuple) recheck the parent
+  sequence after the child is found in case anything changed in the parent
+  during the path walk.
+* inode is also RCU protected so we can load d_inode and use the inode for
+  limited things.
+* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
+* i_op can be loaded.
+
+When we reach the destination dentry, we lock it, recheck lookup sequence,
+and increment its refcount and mountpoint refcount. RCU and vfsmount locks
+are dropped. This is termed "dropping rcu-walk". If the dentry seqcount does
+not match, we can not drop rcu-walk gracefully at the current point in the
+lookup, so instead return -ECHILD (for want of a better errno). This signals the
+path walking code to re-do the entire lookup with a ref-walk.
+
+Aside from the final dentry, there are other situations that may be encountered
+where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
+a reference on the last good dentry) and continue with a ref-walk. Again, if
+we cannot drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
+using ref-walk. But it is very important that we can continue with ref-walk
+for most cases, particularly to avoid the overhead of double lookups, and to
+gain the scalability advantages on common path elements (like cwd and root).
+
+The cases where rcu-walk cannot continue are:
+* NULL dentry (ie. creat or negative lookup)
+* parent with ->d_op->d_hash
+* parent with d_inode->i_op->permission or ACLs
+* Following links
+* Dentry with ->d_op->d_revalidate
+
+Apart from the first, it may be possible to make most of these cases RCU
+walked. Though that would require a bit more poking into filesystems, so it
+would be better to wait until the base infrastructure is converted.
+
+Papers and other documentation on dcache locking
+================================================
+
+1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
+
+2. http://lse.sourceforge.net/locking/dcache/dcache.html
+
+
+
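
Note on the "Seqcount based lookups" part of path-lookup.txt above: the
following is a minimal user-space sketch of the read side, not kernel code
and not part of the patch. It assumes C11 atomics in place of seqcount_t,
and every name in it (seq_model, dentry_model, snapshot, model_*) is
hypothetical. Writers are assumed to be serialised by some other lock, as
d_lock does for d_seq.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for seqcount_t: odd means a write is in progress. */
struct seq_model {
	atomic_uint sequence;
};

/* Hypothetical dentry holding only the tuple that d_seq covers. */
struct dentry_model {
	struct seq_model d_seq;
	const void *_Atomic parent;
	const char *_Atomic name;
	const void *_Atomic inode;
};

/* What a store-free reader carries away: the tuple plus the seq it saw. */
struct snapshot {
	const void *parent;
	const char *name;
	const void *inode;
	unsigned seq;
};

static void model_init(struct dentry_model *d, const void *parent,
		       const char *name, const void *inode)
{
	atomic_init(&d->d_seq.sequence, 0);
	atomic_init(&d->parent, parent);
	atomic_init(&d->name, name);
	atomic_init(&d->inode, inode);
}

/* Wait for an even count, then return it as the start of the read section. */
static unsigned model_read_begin(struct seq_model *s)
{
	unsigned seq;

	while ((seq = atomic_load_explicit(&s->sequence,
					   memory_order_acquire)) & 1)
		;	/* a writer is active, spin */
	return seq;
}

/* Nonzero if a writer ran since 'start', so the loaded values are suspect. */
static int model_read_retry(struct seq_model *s, unsigned start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&s->sequence, memory_order_relaxed) != start;
}

static void model_write_begin(struct seq_model *s)
{
	unsigned seq = atomic_load_explicit(&s->sequence, memory_order_relaxed);

	atomic_store_explicit(&s->sequence, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

static void model_write_end(struct seq_model *s)
{
	unsigned seq = atomic_load_explicit(&s->sequence, memory_order_relaxed);

	atomic_store_explicit(&s->sequence, seq + 1, memory_order_release);
}

/* Writer side; the caller must already exclude other writers. */
static void model_rename(struct dentry_model *d, const char *new_name)
{
	assert(new_name != NULL);
	model_write_begin(&d->d_seq);
	atomic_store_explicit(&d->name, new_name, memory_order_relaxed);
	model_write_end(&d->d_seq);
}

/* Reader side: load everything up-front, store nothing, recheck, retry. */
static struct snapshot model_snapshot(struct dentry_model *d)
{
	struct snapshot s;

	do {
		s.seq = model_read_begin(&d->d_seq);
		s.parent = atomic_load_explicit(&d->parent, memory_order_relaxed);
		s.name = atomic_load_explicit(&d->name, memory_order_relaxed);
		s.inode = atomic_load_explicit(&d->inode, memory_order_relaxed);
	} while (model_read_retry(&d->d_seq, s.seq));
	return s;
}
```

A caller that keeps the snapshot around has to call model_read_retry() with
the saved seq again before trusting it, which is the "provided that d_seq is
rechecked" condition in the text.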
diff --git a/fs/dcache.c b/fs/dcache.c
index 5abb8f2..5b59807 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -156,7 +156,9 @@ static void dentry_iput(struct dentry * dentry)
 {
 	struct inode *inode = dentry->d_inode;
 	if (inode) {
+		write_seqcount_begin(&dentry->d_seq);
 		dentry->d_inode = NULL;
+		write_seqcount_end(&dentry->d_seq);
 		list_del_init(&dentry->d_alias);
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_inode_lock);
@@ -842,7 +844,9 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 
 			inode = dentry->d_inode;
 			if (inode) {
+				write_seqcount_begin(&dentry->d_seq);
 				dentry->d_inode = NULL;
+				write_seqcount_end(&dentry->d_seq);
 				list_del_init(&dentry->d_alias);
 				if (dentry->d_op && dentry->d_op->d_iput)
 					dentry->d_op->d_iput(dentry, inode);
@@ -1188,6 +1192,7 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	dentry->d_count = 1;
 	dentry->d_flags = DCACHE_UNHASHED;
 	spin_lock_init(&dentry->d_lock);
+	seqcount_init(&dentry->d_seq);
 	dentry->d_inode = NULL;
 	dentry->d_parent = NULL;
 	dentry->d_sb = NULL;
@@ -1577,6 +1582,114 @@ err_out:
 EXPORT_SYMBOL(d_add_ci);
 
 /**
+ * __d_lookup_rcu - search for a dentry (racy, store-free)
+ * @parent: parent dentry
+ * @name: qstr of name we wish to find
+ * @seq: returns d_seq value at the point where the dentry was found
+ * @inode: returns dentry->d_inode when the inode was found valid.
+ * Returns: dentry, or NULL
+ *
+ * __d_lookup_rcu is the dcache lookup function for rcu-walk name
+ * resolution (store-free path walking) design described in
+ * Documentation/filesystems/path-lookup.txt.
+ *
+ * This is not to be used outside core vfs.
+ */
+struct dentry *__d_lookup_rcu(struct dentry *parent, struct qstr *name,
+				unsigned *seq, struct inode **inode)
+{
+	unsigned int len = name->len;
+	unsigned int hash = name->hash;
+	const unsigned char *str = name->name;
+	struct hlist_head *head = d_hash(parent,hash);
+	struct hlist_node *node;
+	struct dentry *dentry;
+
+ 	/*
+	 * Note: There is significant duplication with __d_lookup which is
+	 * required to prevent single threaded performance regressions
+	 * especially on architectures where smp_rmb (in seqcounts) are costly.
+	 * Keep the two functions in sync.
+	 */
+
+	/*
+	 * The hash list is protected using RCU.
+	 *
+	 * Carefully use d_seq when comparing a candidate dentry, to avoid
+	 * races with d_move().
+	 *
+	 * It is possible that concurrent renames can mess up our list
+	 * walk here and result in missing our dentry, resulting in the
+	 * false-negative result. d_lookup() protects against concurrent
+	 * renames using rename_lock seqlock.
+	 *
+	 * See Documentation/filesystems/path-lookup.txt for more details.
+	 */
+	hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
+		struct inode *i;
+		const char *tname;
+		int tlen;
+
+		if (dentry->d_name.hash != hash)
+			continue;
+
+seqretry:
+		/* XXX: ensure all d_parent, d_name, d_inode manipulations
+		 * happen under d_seq */
+		*seq = read_seqcount_begin(&dentry->d_seq);
+		if (dentry->d_parent != parent)
+			continue;
+		if (d_unhashed(dentry))
+			continue;
+		tlen = dentry->d_name.len;
+		tname = dentry->d_name.name;
+		i = dentry->d_inode;
+		/*
+		 * This seqcount check is required to ensure name and
+		 * len are loaded atomically, so as not to walk off the
+		 * edge of memory when walking. If we could load this
+		 * atomically some other way, we could drop this check.
+		 */
+		if (read_seqcount_retry(&dentry->d_seq, *seq))
+			goto seqretry;
+		if (parent->d_op && parent->d_op->d_compare) {
+			if (parent->d_op->d_compare(parent,
+						dentry, i,
+						tlen, tname, name))
+				continue;
+		} else {
+			if (tlen != len)
+				continue;
+			if (memcmp(tname, str, tlen))
+				continue;
+		}
+		/*
+		 * No extra seqcount check is required after the name
+		 * compare. The caller must perform a seqcount check in
+		 * order to do anything useful with the returned dentry
+		 * anyway.
+		 */
+		*inode = i;
+		return dentry;
+ 	}
+ 	return NULL;
+}
+
+int __d_rcu_to_refcount(struct dentry *dentry, unsigned seq)
+{
+	int ret = 0;
+
+	if (!read_seqcount_retry(&dentry->d_seq, seq)) {
+		if (IS_ROOT(dentry) || !d_unhashed(dentry)) {
+			ret = 1;
+			dentry->d_count++;
+		}
+	}
+
+	return ret;
+}
+
+/**
  * d_lookup - search for a dentry
  * @parent: parent dentry
  * @name: qstr of name we wish to find
@@ -1587,9 +1700,9 @@ EXPORT_SYMBOL(d_add_ci);
  * dentry is returned. The caller must use dput to free the entry when it has
  * finished using it. %NULL is returned if the dentry does not exist.
  */
-struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
+struct dentry *d_lookup(struct dentry *parent, struct qstr *name)
 {
-	struct dentry * dentry = NULL;
+	struct dentry *dentry;
 	unsigned seq;
 
         do {
@@ -1602,7 +1715,7 @@ struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
 }
 EXPORT_SYMBOL(d_lookup);
 
-/*
+/**
  * __d_lookup - search for a dentry (racy)
  * @parent: parent dentry
  * @name: qstr of name we wish to find
@@ -1617,16 +1730,23 @@ EXPORT_SYMBOL(d_lookup);
  *
  * __d_lookup callers must be commented.
  */
-struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
+struct dentry *__d_lookup(struct dentry *parent, struct qstr *name)
 {
 	unsigned int len = name->len;
 	unsigned int hash = name->hash;
 	const unsigned char *str = name->name;
 	struct hlist_head *head = d_hash(parent,hash);
-	struct dentry *found = NULL;
 	struct hlist_node *node;
+	struct dentry *found = NULL;
 	struct dentry *dentry;
 
+ 	/*
+	 * Note: There is significant duplication with __d_lookup_rcu which is
+	 * required to prevent single threaded performance regressions
+	 * especially on architectures where smp_rmb (in seqcounts) are costly.
+	 * Keep the two functions in sync.
+	 */
+
 	/*
 	 * The hash list is protected using RCU.
 	 *
@@ -1643,24 +1763,15 @@ struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 	rcu_read_lock();
 	
 	hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
-		struct qstr *qstr;
+		const char *tname;
+		int tlen;
 
 		if (dentry->d_name.hash != hash)
 			continue;
-		if (dentry->d_parent != parent)
-			continue;
 
 		spin_lock(&dentry->d_lock);
-
-		/*
-		 * Recheck the dentry after taking the lock - d_move may have
-		 * changed things. Don't bother checking the hash because
-		 * we're about to compare the whole name anyway.
-		 */
 		if (dentry->d_parent != parent)
 			goto next;
-
-		/* non-existing due to RCU? */
 		if (d_unhashed(dentry))
 			goto next;
 
@@ -1668,16 +1779,17 @@ struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 		 * It is safe to compare names since d_move() cannot
 		 * change the qstr (protected by d_lock).
 		 */
-		qstr = &dentry->d_name;
+		tlen = dentry->d_name.len;
+		tname = dentry->d_name.name;
 		if (parent->d_op && parent->d_op->d_compare) {
 			if (parent->d_op->d_compare(parent,
 						dentry, dentry->d_inode,
-						qstr->len, qstr->name, name))
+						tlen, tname, name))
 				goto next;
 		} else {
-			if (qstr->len != len)
+			if (tlen != len)
 				goto next;
-			if (memcmp(qstr->name, str, len))
+			if (memcmp(tname, str, tlen))
 				goto next;
 		}
 
@@ -1947,6 +2059,8 @@ void d_move(struct dentry * dentry, struct dentry * target)
 	list_del(&target->d_u.d_child);
 
 	/* Switch the names.. */
+	write_seqcount_begin(&dentry->d_seq);
+	write_seqcount_begin(&target->d_seq);
 	switch_names(dentry, target);
 	swap(dentry->d_name.hash, target->d_name.hash);
 
@@ -1961,6 +2075,8 @@ void d_move(struct dentry * dentry, struct dentry * target)
 		/* And add them back to the (new) parent lists */
 		list_add(&target->d_u.d_child, &target->d_parent->d_subdirs);
 	}
+	write_seqcount_end(&target->d_seq);
+	write_seqcount_end(&dentry->d_seq);
 
 	list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
 	if (target->d_parent != dentry->d_parent)
@@ -2045,11 +2161,15 @@ static void __d_materialise_dentry(struct dentry *dentry, struct dentry *anon)
 {
 	struct dentry *dparent, *aparent;
 
+	dparent = dentry->d_parent;
+	aparent = anon->d_parent;
+
+	write_seqcount_begin(&dentry->d_seq);
+	write_seqcount_begin(&anon->d_seq);
+
 	switch_names(dentry, anon);
 	swap(dentry->d_name.hash, anon->d_name.hash);
 
-	dparent = dentry->d_parent;
-	aparent = anon->d_parent;
 
 	/* XXX: hack */
 	/* returns with anon->d_lock held! */
@@ -2072,6 +2192,9 @@ static void __d_materialise_dentry(struct dentry *dentry, struct dentry *anon)
 	else
 		INIT_LIST_HEAD(&anon->d_u.d_child);
 
+	write_seqcount_end(&dentry->d_seq);
+	write_seqcount_end(&anon->d_seq);
+
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dparent->d_lock);
 	spin_unlock(&aparent->d_lock);
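
Note on __d_rcu_to_refcount() in the fs/dcache.c changes above: the sketch
below is a user-space model of that step, written under several assumptions
and not taken from the patch. dentry_model and the model_* helpers are
hypothetical names, a C11 atomic_flag stands in for d_lock, and the model
takes the lock itself, whereas in the patch the caller already holds d_lock.
The point it tries to show is that the refcount is only taken if the
sequence number saved during the store-free lookup is still current and the
dentry is still hashed (or is a root).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct dentry_model {
	atomic_flag d_lock;	/* toy spinlock standing in for d_lock */
	atomic_uint d_seq;	/* advances by 2 for every completed write */
	unsigned d_count;	/* protected by d_lock */
	bool unhashed;		/* protected by d_lock */
	bool is_root;
};

static void model_init(struct dentry_model *d)
{
	assert(d != NULL);
	atomic_flag_clear(&d->d_lock);
	atomic_init(&d->d_seq, 0);
	d->d_count = 1;
	d->unhashed = false;
	d->is_root = false;
}

static void model_lock(struct dentry_model *d)
{
	while (atomic_flag_test_and_set_explicit(&d->d_lock,
						 memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void model_unlock(struct dentry_model *d)
{
	atomic_flag_clear_explicit(&d->d_lock, memory_order_release);
}

/* Any locked modification, e.g. a rename or an unhash, bumps d_seq twice. */
static void model_modify(struct dentry_model *d, bool unhash)
{
	model_lock(d);
	atomic_fetch_add(&d->d_seq, 1);	/* models write_seqcount_begin() */
	d->unhashed = unhash;
	atomic_fetch_add(&d->d_seq, 1);	/* models write_seqcount_end() */
	model_unlock(d);
}

/* Convert a store-free lookup result into a real reference, or refuse. */
static bool model_rcu_to_refcount(struct dentry_model *d, unsigned seq)
{
	bool ok = false;

	model_lock(d);
	if (atomic_load(&d->d_seq) == seq && (d->is_root || !d->unhashed)) {
		d->d_count++;
		ok = true;
	}
	model_unlock(d);
	return ok;
}
```

When the conversion is refused, the caller in this model has no reference
and must not touch the dentry further; in the patch that case is reported
as -ECHILD.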
diff --git a/fs/filesystems.c b/fs/filesystems.c
index 68ba492..751d6b2 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -115,6 +115,9 @@ int unregister_filesystem(struct file_system_type * fs)
 		tmp = &(*tmp)->next;
 	}
 	write_unlock(&file_systems_lock);
+
+	synchronize_rcu();
+
 	return -EINVAL;
 }
 
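
Note on the -ECHILD convention used by the fs/namei.c changes below: the
permission helpers return -ECHILD when they are called in rcu-walk mode and
would need to do something rcu-walk cannot do, and path-lookup.txt says this
signals the path walking code to redo the lookup with a ref-walk. The
following user-space sketch models only that full-restart fallback, not the
more common case of dropping to ref-walk part-way through. All names
(walk_mode, component_model, walk_component, walk_path, lookup_model) are
hypothetical and none of this is code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum walk_mode { WALK_RCU, WALK_REF };

/* Hypothetical path component: only the property that matters here. */
struct component_model {
	bool has_permission_op;	/* models inode->i_op->permission being set */
};

/* One step of the walk: rcu-walk gives up where it cannot proceed safely. */
static int walk_component(const struct component_model *c, enum walk_mode mode)
{
	if (mode == WALK_RCU && c->has_permission_op)
		return -ECHILD;
	return 0;
}

static int walk_path(const struct component_model *path, size_t n,
		     enum walk_mode mode)
{
	size_t i;

	assert(path != NULL || n == 0);
	for (i = 0; i < n; i++) {
		int err = walk_component(&path[i], mode);

		if (err)
			return err;
	}
	return 0;
}

/* Try the store-free walk first and fall back to ref-walk on -ECHILD. */
static int lookup_model(const struct component_model *path, size_t n,
			int *ref_walks)
{
	int err = walk_path(path, n, WALK_RCU);

	if (err == -ECHILD) {
		(*ref_walks)++;
		err = walk_path(path, n, WALK_REF);
	}
	return err;
}
```

The ref_walks counter exists only so the fallback is observable in a test;
it is not meant to suggest any accounting in the real code.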
diff --git a/fs/namei.c b/fs/namei.c
index 39fda05..5185752 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -169,8 +169,8 @@ EXPORT_SYMBOL(putname);
 /*
  * This does basic POSIX ACL permission checking
  */
-static int acl_permission_check(struct inode *inode, int mask,
-		int (*check_acl)(struct inode *inode, int mask))
+static inline int __acl_permission_check(struct inode *inode, int mask,
+		int (*check_acl)(struct inode *inode, int mask), int rcu)
 {
 	umode_t			mode = inode->i_mode;
 
@@ -180,9 +180,13 @@ static int acl_permission_check(struct inode *inode, int mask,
 		mode >>= 6;
 	else {
 		if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl) {
-			int error = check_acl(inode, mask);
-			if (error != -EAGAIN)
-				return error;
+			if (rcu) {
+				return -ECHILD;
+			} else {
+				int error = check_acl(inode, mask);
+				if (error != -EAGAIN)
+					return error;
+			}
 		}
 
 		if (in_group_p(inode->i_gid))
@@ -197,6 +201,12 @@ static int acl_permission_check(struct inode *inode, int mask,
 	return -EACCES;
 }
 
+static inline int acl_permission_check(struct inode *inode, int mask,
+		int (*check_acl)(struct inode *inode, int mask))
+{
+	return __acl_permission_check(inode, mask, check_acl, 0);
+}
+
 /**
  * generic_permission  -  check for access rights on a Posix-like filesystem
  * @inode:	inode to check access rights for
@@ -374,6 +384,129 @@ void path_put(struct path *path)
 }
 EXPORT_SYMBOL(path_put);
 
+static int nameidata_dentry_drop_rcu(struct nameidata *nd, struct dentry *dentry)
+{
+	struct fs_struct *fs = current->fs;
+	struct dentry *parent = nd->path.dentry;
+
+	BUG_ON(!(nd->flags & LOOKUP_RCU));
+	if (nd->root.mnt) {
+		spin_lock(&fs->lock);
+		if (nd->root.mnt != fs->root.mnt ||
+				nd->root.dentry != fs->root.dentry)
+			goto err_root;
+	}
+	spin_lock(&parent->d_lock);
+	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+	if (!__d_rcu_to_refcount(dentry, nd->seq))
+		goto err;
+	BUG_ON(dentry->d_parent != parent);
+	BUG_ON(!parent->d_count);
+	parent->d_count++;
+	spin_unlock(&dentry->d_lock);
+	spin_unlock(&parent->d_lock);
+	if (nd->root.mnt) {
+		path_get(&nd->root);
+		spin_unlock(&fs->lock);
+	}
+	mntget(nd->path.mnt);
+
+	rcu_read_unlock();
+	br_read_unlock(vfsmount_lock);
+	nd->flags &= ~LOOKUP_RCU;
+	return 0;
+err:
+	spin_unlock(&dentry->d_lock);
+	spin_unlock(&parent->d_lock);
+err_root:
+	if (nd->root.mnt)
+		spin_unlock(&fs->lock);
+	return -ECHILD;
+}
+
+static int nameidata_drop_rcu(struct nameidata *nd)
+{
+	struct fs_struct *fs = current->fs;
+	struct dentry *dentry = nd->path.dentry;
+
+	BUG_ON(!(nd->flags & LOOKUP_RCU));
+	if (nd->root.mnt) {
+		spin_lock(&fs->lock);
+		if (nd->root.mnt != fs->root.mnt ||
+				nd->root.dentry != fs->root.dentry)
+			goto err_root;
+	}
+	spin_lock(&dentry->d_lock);
+	if (!__d_rcu_to_refcount(dentry, nd->seq))
+		goto err;
+	BUG_ON(nd->inode != dentry->d_inode);
+	spin_unlock(&dentry->d_lock);
+	if (nd->root.mnt) {
+		path_get(&nd->root);
+		spin_unlock(&fs->lock);
+	}
+	mntget(nd->path.mnt);
+
+	rcu_read_unlock();
+	br_read_unlock(vfsmount_lock);
+	nd->flags &= ~LOOKUP_RCU;
+	return 0;
+err:
+	spin_unlock(&dentry->d_lock);
+err_root:
+	if (nd->root.mnt)
+		spin_unlock(&fs->lock);
+	return -ECHILD;
+}
+
+static int nameidata_drop_rcu_last(struct nameidata *nd)
+{
+	struct dentry *dentry = nd->path.dentry;
+
+	BUG_ON(!(nd->flags & LOOKUP_RCU));
+	nd->flags &= ~LOOKUP_RCU;
+	nd->root.mnt = NULL;
+	spin_lock(&dentry->d_lock);
+	if (!__d_rcu_to_refcount(dentry, nd->seq))
+		goto err_unlock;
+	BUG_ON(nd->inode != dentry->d_inode);
+	spin_unlock(&dentry->d_lock);
+
+	mntget(nd->path.mnt);
+
+	rcu_read_unlock();
+	br_read_unlock(vfsmount_lock);
+
+	return 0;
+
+err_unlock:
+	spin_unlock(&dentry->d_lock);
+	rcu_read_unlock();
+	br_read_unlock(vfsmount_lock);
+	return -ECHILD;
+}
+
+static inline int try_nameidata_dentry_drop_rcu(struct nameidata *nd, struct dentry *dentry)
+{
+	if (nd->flags & LOOKUP_RCU)
+		return nameidata_dentry_drop_rcu(nd, dentry);
+	return 0;
+}
+
+static inline int try_nameidata_drop_rcu(struct nameidata *nd)
+{
+	if (nd->flags & LOOKUP_RCU)
+		return nameidata_drop_rcu(nd);
+	return 0;
+}
+
+static inline int try_nameidata_drop_rcu_last(struct nameidata *nd)
+{
+	if (likely(nd->flags & LOOKUP_RCU))
+		return nameidata_drop_rcu_last(nd);
+	return 0;
+}
+
 /**
  * release_open_intent - free up open intent resources
  * @nd: pointer to nameidata
@@ -459,26 +592,40 @@ force_reval_path(struct path *path, struct nameidata *nd)
  * short-cut DAC fails, then call ->permission() to do more
  * complete permission check.
  */
-static int exec_permission(struct inode *inode)
+static inline int __exec_permission(struct inode *inode, int rcu)
 {
 	int ret;
 
 	if (inode->i_op->permission) {
+		if (rcu)
+			return -ECHILD;
 		ret = inode->i_op->permission(inode, MAY_EXEC);
 		if (!ret)
 			goto ok;
 		return ret;
 	}
-	ret = acl_permission_check(inode, MAY_EXEC, inode->i_op->check_acl);
+	ret = __acl_permission_check(inode, MAY_EXEC, inode->i_op->check_acl, rcu);
 	if (!ret)
 		goto ok;
+	if (rcu && ret == -ECHILD)
+		return ret;
 
 	if (capable(CAP_DAC_OVERRIDE) || capable(CAP_DAC_READ_SEARCH))
 		goto ok;
 
 	return ret;
 ok:
-	return security_inode_permission(inode, MAY_EXEC);
+	return security_inode_permission(inode, MAY_EXEC); /* XXX: ok for RCU? */
+}
+
+static int exec_permission(struct inode *inode)
+{
+	return __exec_permission(inode, 0);
+}
+
+static int exec_permission_rcu(struct inode *inode)
+{
+	return __exec_permission(inode, 1);
 }
 
 static __always_inline void set_root(struct nameidata *nd)
@@ -489,8 +636,20 @@ static __always_inline void set_root(struct nameidata *nd)
 
 static int link_path_walk(const char *, struct nameidata *);
 
+static __always_inline void set_root_rcu(struct nameidata *nd)
+{
+	if (!nd->root.mnt) {
+		struct fs_struct *fs = current->fs;
+		spin_lock(&fs->lock);
+		nd->root = fs->root;
+		spin_unlock(&fs->lock);
+	}
+}
+
 static __always_inline int __vfs_follow_link(struct nameidata *nd, const char *link)
 {
+	int ret;
+
 	if (IS_ERR(link))
 		goto fail;
 
@@ -500,8 +659,10 @@ static __always_inline int __vfs_follow_link(struct nameidata *nd, const char *l
 		nd->path = nd->root;
 		path_get(&nd->root);
 	}
+	nd->inode = nd->path.dentry->d_inode;
 
-	return link_path_walk(link, nd);
+	ret = link_path_walk(link, nd);
+	return ret;
 fail:
 	path_put(&nd->path);
 	return PTR_ERR(link);
@@ -516,11 +677,12 @@ static void path_put_conditional(struct path *path, struct nameidata *nd)
 
 static inline void path_to_nameidata(struct path *path, struct nameidata *nd)
 {
-	dput(nd->path.dentry);
-	if (nd->path.mnt != path->mnt) {
-		mntput(nd->path.mnt);
-		nd->path.mnt = path->mnt;
+	if (!(nd->flags & LOOKUP_RCU)) {
+		dput(nd->path.dentry);
+		if (nd->path.mnt != path->mnt)
+			mntput(nd->path.mnt);
 	}
+	nd->path.mnt = path->mnt;
 	nd->path.dentry = path->dentry;
 }
 
@@ -535,9 +697,11 @@ __do_follow_link(struct path *path, struct nameidata *nd, void **p)
 
 	if (path->mnt != nd->path.mnt) {
 		path_to_nameidata(path, nd);
+		nd->inode = nd->path.dentry->d_inode;
 		dget(dentry);
 	}
 	mntget(path->mnt);
+
 	nd->last_type = LAST_BIND;
 	*p = dentry->d_inode->i_op->follow_link(dentry, nd);
 	error = PTR_ERR(*p);
@@ -591,6 +755,20 @@ loop:
 	return err;
 }
 
+static int follow_up_rcu(struct path *path)
+{
+	struct vfsmount *parent;
+	struct dentry *mountpoint;
+
+	parent = path->mnt->mnt_parent;
+	if (parent == path->mnt)
+		return 0;
+	mountpoint = path->mnt->mnt_mountpoint;
+	path->dentry = mountpoint;
+	path->mnt = parent;
+	return 1;
+}
+
 int follow_up(struct path *path)
 {
 	struct vfsmount *parent;
@@ -615,6 +793,21 @@ int follow_up(struct path *path)
 /*
  * serialization is taken care of in namespace.c
  */
+static void __follow_mount_rcu(struct nameidata *nd, struct path *path,
+				struct inode **inode)
+{
+	while (d_mountpoint(path->dentry)) {
+		struct vfsmount *mounted;
+		mounted = __lookup_mnt(path->mnt, path->dentry, 1);
+		if (!mounted)
+			return;
+		path->mnt = mounted;
+		path->dentry = mounted->mnt_root;
+		nd->seq = read_seqcount_begin(&path->dentry->d_seq);
+		*inode = path->dentry->d_inode;
+	}
+}
+
 static int __follow_mount(struct path *path)
 {
 	int res = 0;
@@ -660,7 +853,42 @@ int follow_down(struct path *path)
 	return 0;
 }
 
-static __always_inline void follow_dotdot(struct nameidata *nd)
+static int follow_dotdot_rcu(struct nameidata *nd)
+{
+	struct inode *inode = nd->inode;
+
+	set_root_rcu(nd);
+
+	while(1) {
+		if (nd->path.dentry == nd->root.dentry &&
+		    nd->path.mnt == nd->root.mnt) {
+			break;
+		}
+		if (nd->path.dentry != nd->path.mnt->mnt_root) {
+			struct dentry *old = nd->path.dentry;
+			struct dentry *parent = old->d_parent;
+			unsigned seq;
+
+			seq = read_seqcount_begin(&parent->d_seq);
+			if (read_seqcount_retry(&old->d_seq, nd->seq))
+				return -ECHILD;
+			inode = parent->d_inode;
+			nd->path.dentry = parent;
+			nd->seq = seq;
+			break;
+		}
+		if (!follow_up_rcu(&nd->path))
+			break;
+		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		inode = nd->path.dentry->d_inode;
+	}
+	__follow_mount_rcu(nd, &nd->path, &inode);
+	nd->inode = inode;
+
+	return 0;
+}
+
+static void follow_dotdot(struct nameidata *nd)
 {
 	set_root(nd);
 
@@ -681,6 +909,7 @@ static __always_inline void follow_dotdot(struct nameidata *nd)
 			break;
 	}
 	follow_mount(&nd->path);
+	nd->inode = nd->path.dentry->d_inode;
 }
 
 /*
@@ -718,18 +947,17 @@ static struct dentry *d_alloc_and_lookup(struct dentry *parent,
  *  It _is_ time-critical.
  */
 static int do_lookup(struct nameidata *nd, struct qstr *name,
-		     struct path *path)
+			struct path *path, struct inode **inode)
 {
 	struct vfsmount *mnt = nd->path.mnt;
-	struct dentry *dentry, *parent;
+	struct dentry *dentry, *parent = nd->path.dentry;
 	struct inode *dir;
 	/*
 	 * See if the low-level filesystem might want
 	 * to use its own hash..
 	 */
-	if (nd->path.dentry->d_op && nd->path.dentry->d_op->d_hash) {
-		int err = nd->path.dentry->d_op->d_hash(nd->path.dentry,
-				nd->path.dentry->d_inode, name);
+	if (parent->d_op && parent->d_op->d_hash) {
+		int err = parent->d_op->d_hash(parent, nd->inode, name);
 		if (err < 0)
 			return err;
 	}
@@ -739,21 +967,47 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
 	 * of a false negative due to a concurrent rename, we're going to
 	 * do the non-racy lookup, below.
 	 */
-	dentry = __d_lookup(nd->path.dentry, name);
-	if (!dentry)
-		goto need_lookup;
+	if (nd->flags & LOOKUP_RCU) {
+		unsigned seq;
+
+		dentry = __d_lookup_rcu(parent, name, &seq, inode);
+		if (!dentry) {
+			if (nameidata_drop_rcu(nd))
+				return -ECHILD;
+			goto need_lookup;
+		}
+		/* Memory barrier in read_seqcount_begin of child is enough */
+		if (__read_seqcount_retry(&parent->d_seq, nd->seq))
+			return -ECHILD;
+
+		nd->seq = seq;
+		if (dentry->d_op && dentry->d_op->d_revalidate) {
+			/* XXX: RCU chokes here */
+			if (nameidata_dentry_drop_rcu(nd, dentry))
+				return -ECHILD;
+			goto need_revalidate;
+		}
+		path->mnt = mnt;
+		path->dentry = dentry;
+		__follow_mount_rcu(nd, path, inode);
+	} else {
+		dentry = __d_lookup(parent, name);
+		if (!dentry)
+			goto need_lookup;
 found:
-	if (dentry->d_op && dentry->d_op->d_revalidate)
-		goto need_revalidate;
+		if (dentry->d_op && dentry->d_op->d_revalidate)
+			goto need_revalidate;
 done:
-	path->mnt = mnt;
-	path->dentry = dentry;
-	__follow_mount(path);
+		path->mnt = mnt;
+		path->dentry = dentry;
+		__follow_mount(path);
+		*inode = path->dentry->d_inode;
+	}
 	return 0;
 
 need_lookup:
-	parent = nd->path.dentry;
 	dir = parent->d_inode;
+	BUG_ON(nd->inode != dir);
 
 	mutex_lock(&dir->i_mutex);
 	/*
@@ -815,7 +1069,6 @@ static inline int follow_on_final(struct inode *inode, unsigned lookup_flags)
 static int link_path_walk(const char *name, struct nameidata *nd)
 {
 	struct path next;
-	struct inode *inode;
 	int err;
 	unsigned int lookup_flags = nd->flags;
 	
@@ -824,18 +1077,28 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 	if (!*name)
 		goto return_reval;
 
-	inode = nd->path.dentry->d_inode;
 	if (nd->depth)
 		lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE);
 
 	/* At this point we know we have a real path component. */
 	for(;;) {
+		struct inode *inode;
 		unsigned long hash;
 		struct qstr this;
 		unsigned int c;
 
 		nd->flags |= LOOKUP_CONTINUE;
-		err = exec_permission(inode);
+		if (nd->flags & LOOKUP_RCU) {
+			err = exec_permission_rcu(nd->inode);
+			if (err == -ECHILD) {
+				if (nameidata_drop_rcu(nd))
+					return -ECHILD;
+				goto exec_again;
+			}
+		} else {
+exec_again:
+			err = exec_permission(nd->inode);
+		}
  		if (err)
 			break;
 
@@ -866,37 +1129,44 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 		if (this.name[0] == '.') switch (this.len) {
 			default:
 				break;
-			case 2:	
+			case 2:
 				if (this.name[1] != '.')
 					break;
-				follow_dotdot(nd);
-				inode = nd->path.dentry->d_inode;
+				if (nd->flags & LOOKUP_RCU) {
+					if (follow_dotdot_rcu(nd))
+						return -ECHILD;
+				} else
+					follow_dotdot(nd);
 				/* fallthrough */
 			case 1:
 				continue;
 		}
 		/* This does the actual lookups.. */
-		err = do_lookup(nd, &this, &next);
+		err = do_lookup(nd, &this, &next, &inode);
 		if (err)
 			break;
-
 		err = -ENOENT;
-		inode = next.dentry->d_inode;
 		if (!inode)
 			goto out_dput;
 
 		if (inode->i_op->follow_link) {
+			/* XXX: RCU chokes here */
+			if (try_nameidata_dentry_drop_rcu(nd, next.dentry))
+				return -ECHILD;
+			BUG_ON(inode != next.dentry->d_inode);
 			err = do_follow_link(&next, nd);
 			if (err)
 				goto return_err;
+			nd->inode = nd->path.dentry->d_inode;
 			err = -ENOENT;
-			inode = nd->path.dentry->d_inode;
-			if (!inode)
+			if (!nd->inode)
 				break;
-		} else
+		} else {
 			path_to_nameidata(&next, nd);
+			nd->inode = inode;
+		}
 		err = -ENOTDIR; 
-		if (!inode->i_op->lookup)
+		if (!nd->inode->i_op->lookup)
 			break;
 		continue;
 		/* here ends the main loop */
@@ -911,32 +1181,39 @@ last_component:
 		if (this.name[0] == '.') switch (this.len) {
 			default:
 				break;
-			case 2:	
+			case 2:
 				if (this.name[1] != '.')
 					break;
-				follow_dotdot(nd);
-				inode = nd->path.dentry->d_inode;
+				if (nd->flags & LOOKUP_RCU) {
+					if (follow_dotdot_rcu(nd))
+						return -ECHILD;
+				} else
+					follow_dotdot(nd);
 				/* fallthrough */
 			case 1:
 				goto return_reval;
 		}
-		err = do_lookup(nd, &this, &next);
+		err = do_lookup(nd, &this, &next, &inode);
 		if (err)
 			break;
-		inode = next.dentry->d_inode;
 		if (follow_on_final(inode, lookup_flags)) {
+			if (try_nameidata_dentry_drop_rcu(nd, next.dentry))
+				return -ECHILD;
+			BUG_ON(inode != next.dentry->d_inode);
 			err = do_follow_link(&next, nd);
 			if (err)
 				goto return_err;
-			inode = nd->path.dentry->d_inode;
-		} else
+			nd->inode = nd->path.dentry->d_inode;
+		} else {
 			path_to_nameidata(&next, nd);
+			nd->inode = inode;
+		}
 		err = -ENOENT;
-		if (!inode)
+		if (!nd->inode)
 			break;
 		if (lookup_flags & LOOKUP_DIRECTORY) {
 			err = -ENOTDIR; 
-			if (!inode->i_op->lookup)
+			if (!nd->inode->i_op->lookup)
 				break;
 		}
 		goto return_base;
@@ -958,6 +1235,8 @@ return_reval:
 		 */
 		if (nd->path.dentry && nd->path.dentry->d_sb &&
 		    (nd->path.dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT)) {
+			if (try_nameidata_drop_rcu(nd))
+				return -ECHILD;
 			err = -ESTALE;
 			/* Note: we do not d_invalidate() */
 			if (!nd->path.dentry->d_op->d_revalidate(
@@ -965,16 +1244,34 @@ return_reval:
 				break;
 		}
 return_base:
+		if (try_nameidata_drop_rcu_last(nd))
+			return -ECHILD;
 		return 0;
 out_dput:
-		path_put_conditional(&next, nd);
+		if (!(nd->flags & LOOKUP_RCU))
+			path_put_conditional(&next, nd);
 		break;
 	}
-	path_put(&nd->path);
+	if (!(nd->flags & LOOKUP_RCU))
+		path_put(&nd->path);
 return_err:
 	return err;
 }
 
+static inline int path_walk_rcu(const char *name, struct nameidata *nd)
+{
+	current->total_link_count = 0;
+
+	return link_path_walk(name, nd);
+}
+
+static inline int path_walk_simple(const char *name, struct nameidata *nd)
+{
+	current->total_link_count = 0;
+
+	return link_path_walk(name, nd);
+}
+
 static int path_walk(const char *name, struct nameidata *nd)
 {
 	struct path save = nd->path;
@@ -1000,6 +1297,88 @@ static int path_walk(const char *name, struct nameidata *nd)
 	return result;
 }
 
+static void path_finish_rcu(struct nameidata *nd)
+{
+	if (nd->flags & LOOKUP_RCU) {
+		/* RCU dangling. Cancel it. */
+		nd->flags &= ~LOOKUP_RCU;
+		nd->root.mnt = NULL;
+		rcu_read_unlock();
+		br_read_unlock(vfsmount_lock);
+	}
+	if (nd->file)
+		fput(nd->file);
+}
+
+static int path_init_rcu(int dfd, const char *name, unsigned int flags, struct nameidata *nd)
+{
+	int retval = 0;
+	int fput_needed;
+	struct file *file;
+
+	nd->last_type = LAST_ROOT; /* if there are only slashes... */
+	nd->flags = flags | LOOKUP_RCU;
+	nd->depth = 0;
+	nd->root.mnt = NULL;
+	nd->file = NULL;
+
+	if (*name=='/') {
+		struct fs_struct *fs = current->fs;
+
+		br_read_lock(vfsmount_lock);
+		rcu_read_lock();
+
+		spin_lock(&fs->lock);
+		nd->root = fs->root;
+		nd->path = nd->root;
+		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		spin_unlock(&fs->lock);
+
+	} else if (dfd == AT_FDCWD) {
+		struct fs_struct *fs = current->fs;
+
+		br_read_lock(vfsmount_lock);
+		rcu_read_lock();
+
+		spin_lock(&fs->lock);
+		nd->path = fs->pwd;
+		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		spin_unlock(&fs->lock);
+	} else {
+		struct dentry *dentry;
+
+		file = fget_light(dfd, &fput_needed);
+		retval = -EBADF;
+		if (!file)
+			goto out_fail;
+
+		dentry = file->f_path.dentry;
+
+		retval = -ENOTDIR;
+		if (!S_ISDIR(dentry->d_inode->i_mode))
+			goto fput_fail;
+
+		retval = file_permission(file, MAY_EXEC);
+		if (retval)
+			goto fput_fail;
+
+		nd->path = file->f_path;
+		if (fput_needed)
+			nd->file = file;
+
+		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		br_read_lock(vfsmount_lock);
+		rcu_read_lock();
+	}
+	nd->inode = nd->path.dentry->d_inode;
+	return 0;
+
+fput_fail:
+	fput_light(file, fput_needed);
+out_fail:
+	return retval;
+}
+
 static int path_init(int dfd, const char *name, unsigned int flags, struct nameidata *nd)
 {
 	int retval = 0;
@@ -1040,6 +1419,7 @@ static int path_init(int dfd, const char *name, unsigned int flags, struct namei
 
 		fput_light(file, fput_needed);
 	}
+	nd->inode = nd->path.dentry->d_inode;
 	return 0;
 
 fput_fail:
@@ -1052,16 +1432,39 @@ out_fail:
 static int do_path_lookup(int dfd, const char *name,
 				unsigned int flags, struct nameidata *nd)
 {
-	int retval = path_init(dfd, name, flags, nd);
-	if (!retval)
-		retval = path_walk(name, nd);
-	if (unlikely(!retval && !audit_dummy_context() && nd->path.dentry &&
-				nd->path.dentry->d_inode))
-		audit_inode(name, nd->path.dentry);
+	int retval;
+
+	retval = path_init_rcu(dfd, name, flags, nd);
+	if (unlikely(retval))
+		return retval;
+	retval = path_walk_rcu(name, nd);
+	path_finish_rcu(nd);
 	if (nd->root.mnt) {
 		path_put(&nd->root);
 		nd->root.mnt = NULL;
 	}
+
+	if (unlikely(retval == -ECHILD || retval == -ESTALE)) {
+		/* slower, locked walk */
+		if (retval == -ESTALE)
+			flags |= LOOKUP_REVAL;
+		retval = path_init(dfd, name, flags, nd);
+		if (unlikely(retval))
+			return retval;
+		retval = path_walk(name, nd);
+		if (nd->root.mnt) {
+			path_put(&nd->root);
+			nd->root.mnt = NULL;
+		}
+	}
+
+	if (likely(!retval)) {
+		if (unlikely(!audit_dummy_context())) {
+			if (nd->path.dentry && nd->inode)
+				audit_inode(name, nd->path.dentry);
+		}
+	}
+
 	return retval;
 }
 
@@ -1104,10 +1507,11 @@ int vfs_path_lookup(struct dentry *dentry, struct vfsmount *mnt,
 	path_get(&nd->path);
 	nd->root = nd->path;
 	path_get(&nd->root);
+	nd->inode = nd->path.dentry->d_inode;
 
 	retval = path_walk(name, nd);
 	if (unlikely(!retval && !audit_dummy_context() && nd->path.dentry &&
-				nd->path.dentry->d_inode))
+				nd->inode))
 		audit_inode(name, nd->path.dentry);
 
 	path_put(&nd->root);
@@ -1488,6 +1892,7 @@ out_unlock:
 	mutex_unlock(&dir->d_inode->i_mutex);
 	dput(nd->path.dentry);
 	nd->path.dentry = path->dentry;
+
 	if (error)
 		return error;
 	/* Don't check for write permission, don't truncate */
@@ -1582,6 +1987,9 @@ exit:
 	return ERR_PTR(error);
 }
 
+/*
+ * Handle O_CREAT case for do_filp_open
+ */
 static struct file *do_last(struct nameidata *nd, struct path *path,
 			    int open_flag, int acc_mode,
 			    int mode, const char *pathname)
@@ -1603,42 +2011,17 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 		}
 		/* fallthrough */
 	case LAST_ROOT:
-		if (open_flag & O_CREAT)
-			goto exit;
-		/* fallthrough */
+		goto exit;
 	case LAST_BIND:
 		audit_inode(pathname, dir);
 		goto ok;
 	}
 
 	/* trailing slashes? */
-	if (nd->last.name[nd->last.len]) {
-		if (open_flag & O_CREAT)
-			goto exit;
-		nd->flags |= LOOKUP_DIRECTORY | LOOKUP_FOLLOW;
-	}
-
-	/* just plain open? */
-	if (!(open_flag & O_CREAT)) {
-		error = do_lookup(nd, &nd->last, path);
-		if (error)
-			goto exit;
-		error = -ENOENT;
-		if (!path->dentry->d_inode)
-			goto exit_dput;
-		if (path->dentry->d_inode->i_op->follow_link)
-			return NULL;
-		error = -ENOTDIR;
-		if (nd->flags & LOOKUP_DIRECTORY) {
-			if (!path->dentry->d_inode->i_op->lookup)
-				goto exit_dput;
-		}
-		path_to_nameidata(path, nd);
-		audit_inode(pathname, nd->path.dentry);
-		goto ok;
-	}
+	/* XXX: need to make !O_CREAT case do this properly */
+	if (nd->last.name[nd->last.len])
+		goto exit;
 
-	/* OK, it's O_CREAT */
 	mutex_lock(&dir->d_inode->i_mutex);
 
 	path->dentry = lookup_hash(nd);
@@ -1709,8 +2092,9 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 		return NULL;
 
 	path_to_nameidata(path, nd);
+	nd->inode = path->dentry->d_inode;
 	error = -EISDIR;
-	if (S_ISDIR(path->dentry->d_inode->i_mode))
+	if (S_ISDIR(nd->inode->i_mode))
 		goto exit;
 ok:
 	filp = finish_open(nd, open_flag, acc_mode);
@@ -1741,7 +2125,7 @@ struct file *do_filp_open(int dfd, const char *pathname,
 	struct path path;
 	int count = 0;
 	int flag = open_to_namei_flags(open_flag);
-	int force_reval = 0;
+	int flags;
 
 	if (!(open_flag & O_CREAT))
 		mode = 0;
@@ -1767,54 +2151,84 @@ struct file *do_filp_open(int dfd, const char *pathname,
 	if (open_flag & O_APPEND)
 		acc_mode |= MAY_APPEND;
 
-	/* find the parent */
-reval:
-	error = path_init(dfd, pathname, LOOKUP_PARENT, &nd);
+	flags = LOOKUP_OPEN;
+	if (open_flag & O_CREAT) {
+		flags |= LOOKUP_CREATE;
+		if (open_flag & O_EXCL)
+			flags |= LOOKUP_EXCL;
+	}
+	if (open_flag & O_DIRECTORY)
+		flags |= LOOKUP_DIRECTORY;
+	if (!(open_flag & O_NOFOLLOW))
+		flags |= LOOKUP_FOLLOW;
+
+	filp = get_empty_filp();
+	if (!filp)
+		return ERR_PTR(-ENFILE);
+
+	filp->f_flags = open_flag;
+	nd.intent.open.file = filp;
+	nd.intent.open.flags = flag;
+	nd.intent.open.create_mode = mode;
+
+	if (open_flag & O_CREAT)
+		goto creat;
+
+	/* !O_CREAT, simple open */
+	error = do_path_lookup(dfd, pathname, flags, &nd);
+	if (unlikely(error))
+		goto out_filp;
+	error = -ELOOP;
+	if (!(nd.flags & LOOKUP_FOLLOW)) {
+		if (nd.inode->i_op->follow_link)
+			goto out_path;
+	}
+	error = -ENOTDIR;
+	if (nd.flags & LOOKUP_DIRECTORY) {
+		if (!nd.inode->i_op->lookup)
+			goto out_path;
+	}
+	audit_inode(pathname, nd.path.dentry);
+	filp = finish_open(&nd, open_flag, acc_mode);
+	return filp;
+
+creat:
+	/* OK, have to create the file. Find the parent. */
+	error = path_init_rcu(dfd, pathname,
+			LOOKUP_PARENT | (flags & LOOKUP_REVAL), &nd);
 	if (error)
-		return ERR_PTR(error);
-	if (force_reval)
-		nd.flags |= LOOKUP_REVAL;
+		goto out_filp;
+	error = path_walk_rcu(pathname, &nd);
+	path_finish_rcu(&nd);
+	if (unlikely(error == -ECHILD || error == -ESTALE)) {
+		/* slower, locked walk */
+		if (error == -ESTALE) {
+reval:
+			flags |= LOOKUP_REVAL;
+		}
+		error = path_init(dfd, pathname,
+				LOOKUP_PARENT | (flags & LOOKUP_REVAL), &nd);
+		if (error)
+			goto out_filp;
 
-	current->total_link_count = 0;
-	error = link_path_walk(pathname, &nd);
-	if (error) {
-		filp = ERR_PTR(error);
-		goto out;
+		error = path_walk_simple(pathname, &nd);
 	}
-	if (unlikely(!audit_dummy_context()) && (open_flag & O_CREAT))
+	if (unlikely(error))
+		goto out_filp;
+	if (unlikely(!audit_dummy_context()))
 		audit_inode(pathname, nd.path.dentry);
 
 	/*
 	 * We have the parent and last component.
 	 */
-
-	error = -ENFILE;
-	filp = get_empty_filp();
-	if (filp == NULL)
-		goto exit_parent;
-	nd.intent.open.file = filp;
-	filp->f_flags = open_flag;
-	nd.intent.open.flags = flag;
-	nd.intent.open.create_mode = mode;
-	nd.flags &= ~LOOKUP_PARENT;
-	nd.flags |= LOOKUP_OPEN;
-	if (open_flag & O_CREAT) {
-		nd.flags |= LOOKUP_CREATE;
-		if (open_flag & O_EXCL)
-			nd.flags |= LOOKUP_EXCL;
-	}
-	if (open_flag & O_DIRECTORY)
-		nd.flags |= LOOKUP_DIRECTORY;
-	if (!(open_flag & O_NOFOLLOW))
-		nd.flags |= LOOKUP_FOLLOW;
+	nd.flags = flags;
 	filp = do_last(&nd, &path, open_flag, acc_mode, mode, pathname);
 	while (unlikely(!filp)) { /* trailing symlink */
 		struct path holder;
-		struct inode *inode = path.dentry->d_inode;
 		void *cookie;
 		error = -ELOOP;
 		/* S_ISDIR part is a temporary automount kludge */
-		if (!(nd.flags & LOOKUP_FOLLOW) && !S_ISDIR(inode->i_mode))
+		if (!(nd.flags & LOOKUP_FOLLOW) && !S_ISDIR(nd.inode->i_mode))
 			goto exit_dput;
 		if (count++ == 32)
 			goto exit_dput;
@@ -1835,36 +2249,33 @@ reval:
 			goto exit_dput;
 		error = __do_follow_link(&path, &nd, &cookie);
 		if (unlikely(error)) {
+			if (!IS_ERR(cookie) && nd.inode->i_op->put_link)
+				nd.inode->i_op->put_link(path.dentry, &nd, cookie);
 			/* nd.path had been dropped */
-			if (!IS_ERR(cookie) && inode->i_op->put_link)
-				inode->i_op->put_link(path.dentry, &nd, cookie);
-			path_put(&path);
-			release_open_intent(&nd);
-			filp = ERR_PTR(error);
-			goto out;
+			nd.path = path;
+			goto out_path;
 		}
 		holder = path;
 		nd.flags &= ~LOOKUP_PARENT;
 		filp = do_last(&nd, &path, open_flag, acc_mode, mode, pathname);
-		if (inode->i_op->put_link)
-			inode->i_op->put_link(holder.dentry, &nd, cookie);
+		if (nd.inode->i_op->put_link)
+			nd.inode->i_op->put_link(holder.dentry, &nd, cookie);
 		path_put(&holder);
 	}
 out:
 	if (nd.root.mnt)
 		path_put(&nd.root);
-	if (filp == ERR_PTR(-ESTALE) && !force_reval) {
-		force_reval = 1;
+	if (filp == ERR_PTR(-ESTALE) && !(flags & LOOKUP_REVAL))
 		goto reval;
-	}
 	return filp;
 
 exit_dput:
 	path_put_conditional(&path, &nd);
+out_path:
+	path_put(&nd.path);
+out_filp:
 	if (!IS_ERR(nd.intent.open.file))
 		release_open_intent(&nd);
-exit_parent:
-	path_put(&nd.path);
 	filp = ERR_PTR(error);
 	goto out;
 }
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index e31fc9a..5dfb6f7 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -5,6 +5,7 @@
 #include <linux/list.h>
 #include <linux/rculist.h>
 #include <linux/spinlock.h>
+#include <linux/seqlock.h>
 #include <linux/cache.h>
 #include <linux/rcupdate.h>
 
@@ -80,7 +81,7 @@ full_name_hash(const unsigned char *name, unsigned int len)
  * give reasonable cacheline footprint with larger lines without the
  * large memory footprint increase).
  */
-#ifdef CONFIG_64BIT
+#ifdef CONFIG_64BIT /* XXX update */
 #define DNAME_INLINE_LEN_MIN 32 /* 192 bytes */
 #else
 #define DNAME_INLINE_LEN_MIN 40 /* 128 bytes */
@@ -90,6 +91,7 @@ struct dentry {
 	unsigned int d_count;		/* protected by d_lock */
 	unsigned int d_flags;		/* protected by d_lock */
 	spinlock_t d_lock;		/* per dentry lock */
+	seqcount_t d_seq;		/* per dentry seqlock */
 	int d_mounted;
 	struct inode *d_inode;		/* Where the name belongs to - NULL is
 					 * negative */
@@ -296,9 +298,12 @@ extern void d_move(struct dentry *, struct dentry *);
 extern struct dentry *d_ancestor(struct dentry *, struct dentry *);
 
 /* appendix may either be NULL or be used for transname suffixes */
-extern struct dentry * d_lookup(struct dentry *, struct qstr *);
-extern struct dentry * __d_lookup(struct dentry *, struct qstr *);
-extern struct dentry * d_hash_and_lookup(struct dentry *, struct qstr *);
+extern struct dentry *d_lookup(struct dentry *, struct qstr *);
+extern struct dentry *__d_lookup(struct dentry *, struct qstr *);
+extern struct dentry *__d_lookup_rcu(struct dentry *parent, struct qstr *name,
+				unsigned *seq, struct inode **inode);
+extern int __d_rcu_to_refcount(struct dentry *dentry, unsigned seq);
+extern struct dentry *d_hash_and_lookup(struct dentry *, struct qstr *);
 
 /* validate "insecure" dentry pointer */
 extern int d_validate(struct dentry *, struct dentry *);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index aec730b..18d06ad 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -19,7 +19,10 @@ struct nameidata {
 	struct path	path;
 	struct qstr	last;
 	struct path	root;
+	struct file	*file;
+	struct inode	*inode; /* path.dentry.d_inode */
 	unsigned int	flags;
+	unsigned	seq;
 	int		last_type;
 	unsigned	depth;
 	char *saved_names[MAX_NESTED_LINKS + 1];
@@ -43,11 +46,13 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
  *  - internal "there are more path components" flag
  *  - dentry cache is untrusted; force a real lookup
  */
-#define LOOKUP_FOLLOW		 1
-#define LOOKUP_DIRECTORY	 2
-#define LOOKUP_CONTINUE		 4
-#define LOOKUP_PARENT		16
-#define LOOKUP_REVAL		64
+#define LOOKUP_FOLLOW		0x0001
+#define LOOKUP_DIRECTORY	0x0002
+#define LOOKUP_CONTINUE		0x0004
+
+#define LOOKUP_PARENT		0x0010
+#define LOOKUP_REVAL		0x0020
+#define LOOKUP_RCU		0x0040
 /*
  * Intent data
  */
-- 
1.7.1



* [PATCH 34/46] fs: fs_struct use seqlock
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (31 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 33/46] fs: rcu-walk for path lookup Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 35/46] fs: dcache remove d_mounted Nick Piggin
                   ` (16 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Use a seqlock in the fs_struct to enable us to take an atomic copy of the
complete cwd and root paths. Use this in the RCU lookup path to avoid a
thread-shared spinlock in RCU lookup operations.
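
The read side described above can be sketched in userspace C. This is only an illustration under stated assumptions: every `demo_*` name is hypothetical, C11 atomics stand in for the kernel's seqcount_t helpers, and writers are assumed to be serialized by the caller (the role fs->lock still plays for writers in this patch).

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for struct path and struct fs_struct. */
struct demo_path {
	atomic_int mnt;
	atomic_int dentry;
};

struct demo_path_snapshot {
	int mnt;
	int dentry;
};

struct demo_fs {
	atomic_uint seq;	/* odd while a write is in progress */
	struct demo_path root;
};

/* Wait until no writer is active, then return the sequence value seen. */
static unsigned demo_read_begin(struct demo_fs *fs)
{
	unsigned s;

	while ((s = atomic_load(&fs->seq)) & 1u)
		;	/* writer active, spin */
	return s;
}

/* Nonzero if a writer ran since demo_read_begin() returned 'start'. */
static int demo_read_retry(struct demo_fs *fs, unsigned start)
{
	return atomic_load(&fs->seq) != start;
}

/* Writer: the caller must serialize writers, e.g. with a lock. */
static void demo_set_root(struct demo_fs *fs, int mnt, int dentry)
{
	assert((atomic_load(&fs->seq) & 1u) == 0);
	atomic_fetch_add(&fs->seq, 1u);		/* sequence becomes odd */
	atomic_store(&fs->root.mnt, mnt);
	atomic_store(&fs->root.dentry, dentry);
	atomic_fetch_add(&fs->seq, 1u);		/* sequence becomes even */
}

/* Reader: copy both fields, retrying if a writer interfered. */
static struct demo_path_snapshot demo_get_root(struct demo_fs *fs)
{
	struct demo_path_snapshot snap;
	unsigned s;

	do {
		s = demo_read_begin(fs);
		snap.mnt = atomic_load(&fs->root.mnt);
		snap.dentry = atomic_load(&fs->root.dentry);
	} while (demo_read_retry(fs, s));
	return snap;
}
```

The reader never writes to shared memory, so threads that share one fs_struct do not bounce a lock cacheline between CPUs; the cost is that a reader may have to repeat the copy if a writer intervenes.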

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/fs_struct.c            |   10 ++++++++++
 fs/namei.c                |   36 ++++++++++++++++++++++--------------
 include/linux/fs_struct.h |    3 +++
 3 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index ed45a9c..60b8531 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -14,9 +14,11 @@ void set_fs_root(struct fs_struct *fs, struct path *path)
 	struct path old_root;
 
 	spin_lock(&fs->lock);
+	write_seqcount_begin(&fs->seq);
 	old_root = fs->root;
 	fs->root = *path;
 	path_get(path);
+	write_seqcount_end(&fs->seq);
 	spin_unlock(&fs->lock);
 	if (old_root.dentry)
 		path_put(&old_root);
@@ -31,9 +33,11 @@ void set_fs_pwd(struct fs_struct *fs, struct path *path)
 	struct path old_pwd;
 
 	spin_lock(&fs->lock);
+	write_seqcount_begin(&fs->seq);
 	old_pwd = fs->pwd;
 	fs->pwd = *path;
 	path_get(path);
+	write_seqcount_end(&fs->seq);
 	spin_unlock(&fs->lock);
 
 	if (old_pwd.dentry)
@@ -52,6 +56,7 @@ void chroot_fs_refs(struct path *old_root, struct path *new_root)
 		fs = p->fs;
 		if (fs) {
 			spin_lock(&fs->lock);
+			write_seqcount_begin(&fs->seq);
 			if (fs->root.dentry == old_root->dentry
 			    && fs->root.mnt == old_root->mnt) {
 				path_get(new_root);
@@ -64,6 +69,7 @@ void chroot_fs_refs(struct path *old_root, struct path *new_root)
 				fs->pwd = *new_root;
 				count++;
 			}
+			write_seqcount_end(&fs->seq);
 			spin_unlock(&fs->lock);
 		}
 		task_unlock(p);
@@ -88,8 +94,10 @@ void exit_fs(struct task_struct *tsk)
 		int kill;
 		task_lock(tsk);
 		spin_lock(&fs->lock);
+		write_seqcount_begin(&fs->seq);
 		tsk->fs = NULL;
 		kill = !--fs->users;
+		write_seqcount_end(&fs->seq);
 		spin_unlock(&fs->lock);
 		task_unlock(tsk);
 		if (kill)
@@ -105,6 +113,7 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old)
 		fs->users = 1;
 		fs->in_exec = 0;
 		spin_lock_init(&fs->lock);
+		seqcount_init(&fs->seq);
 		fs->umask = old->umask;
 		get_fs_root_and_pwd(old, &fs->root, &fs->pwd);
 	}
@@ -144,6 +153,7 @@ EXPORT_SYMBOL(current_umask);
 struct fs_struct init_fs = {
 	.users		= 1,
 	.lock		= __SPIN_LOCK_UNLOCKED(init_fs.lock),
+	.seq		= SEQCNT_ZERO,
 	.umask		= 0022,
 };
 
diff --git a/fs/namei.c b/fs/namei.c
index 5185752..c48d208 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -640,9 +640,12 @@ static __always_inline void set_root_rcu(struct nameidata *nd)
 {
 	if (!nd->root.mnt) {
 		struct fs_struct *fs = current->fs;
-		spin_lock(&fs->lock);
-		nd->root = fs->root;
-		spin_unlock(&fs->lock);
+		unsigned seq;
+
+		do {
+			seq = read_seqcount_begin(&fs->seq);
+			nd->root = fs->root;
+		} while (read_seqcount_retry(&fs->seq, seq));
 	}
 }
 
@@ -1324,26 +1327,31 @@ static int path_init_rcu(int dfd, const char *name, unsigned int flags, struct n
 
 	if (*name=='/') {
 		struct fs_struct *fs = current->fs;
+		unsigned seq;
 
 		br_read_lock(vfsmount_lock);
 		rcu_read_lock();
 
-		spin_lock(&fs->lock);
-		nd->root = fs->root;
-		nd->path = nd->root;
-		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
-		spin_unlock(&fs->lock);
+		do {
+			seq = read_seqcount_begin(&fs->seq);
+			nd->root = fs->root;
+			nd->path = nd->root;
+			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
+		} while (read_seqcount_retry(&fs->seq, seq));
 
 	} else if (dfd == AT_FDCWD) {
 		struct fs_struct *fs = current->fs;
+		unsigned seq;
 
 		br_read_lock(vfsmount_lock);
 		rcu_read_lock();
 
-		spin_lock(&fs->lock);
-		nd->path = fs->pwd;
-		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
-		spin_unlock(&fs->lock);
+		do {
+			seq = read_seqcount_begin(&fs->seq);
+			nd->path = fs->pwd;
+			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
+		} while (read_seqcount_retry(&fs->seq, seq));
+
 	} else {
 		struct dentry *dentry;
 
@@ -1366,7 +1374,7 @@ static int path_init_rcu(int dfd, const char *name, unsigned int flags, struct n
 		if (fput_needed)
 			nd->file = file;
 
-		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
 		br_read_lock(vfsmount_lock);
 		rcu_read_lock();
 	}
@@ -2199,7 +2207,7 @@ creat:
 	if (error)
 		goto out_filp;
 	error = path_walk_rcu(pathname, &nd);
-	path_finish_rcu(&nd);
+	path_finish_rcu(&nd); /* XXX: ok to throw out root here?? */
 	if (unlikely(error == -ECHILD || error == -ESTALE)) {
 		/* slower, locked walk */
 		if (error == -ESTALE) {
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index a42b5bf..003dc0f 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -2,10 +2,13 @@
 #define _LINUX_FS_STRUCT_H
 
 #include <linux/path.h>
+#include <linux/spinlock.h>
+#include <linux/seqlock.h>
 
 struct fs_struct {
 	int users;
 	spinlock_t lock;
+	seqcount_t seq;
 	int umask;
 	int in_exec;
 	struct path root, pwd;
-- 
1.7.1



* [PATCH 35/46] fs: dcache remove d_mounted
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (32 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 34/46] fs: fs_struct use seqlock Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 36/46] fs: dcache reduce branches in lookup path Nick Piggin
                   ` (15 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Rather than keep a d_mounted count in the dentry, set a dentry flag instead.
The flag can be cleared by checking the hash table to see if there are any
mounts left, which is not time critical because it is performed at detach time.

The mounted state of a dentry is only used to speculatively take a look in the
mount hash table if it is set -- before following the mount, vfsmount lock is
taken and mount re-checked without races.
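
As a rough illustration of the scheme described above, here is a minimal
userspace sketch (not kernel code). The model_* names, the flat mount_table
array and MAX_MOUNTS are hypothetical stand-ins for the real vfsmount hash
table, and the locking the patch uses (d_lock, vfsmount_lock) is omitted.
Only the DCACHE_MOUNTED value is taken from the diff below.

```c
#include <assert.h>
#include <stddef.h>

#define DCACHE_MOUNTED	0x0400	/* same value the patch assigns */
#define MAX_MOUNTS	8	/* hypothetical table size for this model */

struct model_dentry {
	unsigned int d_flags;
};

struct model_mount {
	struct model_dentry *mountpoint;	/* NULL means the slot is free */
};

static struct model_mount mount_table[MAX_MOUNTS];

/* Attach: record the mount and set the flag, as mnt_set_mountpoint() does. */
static int model_attach(struct model_dentry *d)
{
	assert(d != NULL);
	for (size_t i = 0; i < MAX_MOUNTS; i++) {
		if (!mount_table[i].mountpoint) {
			mount_table[i].mountpoint = d;
			d->d_flags |= DCACHE_MOUNTED;
			return (int)i;
		}
	}
	return -1;	/* table full */
}

/*
 * Detach: drop the mount, then rescan the whole table and clear the flag
 * only if no other mount still sits on the same dentry. The scan is slow,
 * but it runs only at detach time.
 */
static void model_detach(int slot)
{
	struct model_dentry *d;

	assert(slot >= 0 && slot < MAX_MOUNTS);
	d = mount_table[slot].mountpoint;
	if (!d)
		return;
	mount_table[slot].mountpoint = NULL;

	for (size_t i = 0; i < MAX_MOUNTS; i++) {
		if (mount_table[i].mountpoint == d)
			return;		/* still mounted elsewhere */
	}
	d->d_flags &= ~DCACHE_MOUNTED;
}

/* The fast-path test is a single flag check, like d_mountpoint(). */
static int model_d_mountpoint(const struct model_dentry *d)
{
	return (d->d_flags & DCACHE_MOUNTED) != 0;
}
```

With two mounts stacked on one dentry, the flag stays set after the first
detach and is cleared only after the second, which is the behaviour the
counter used to provide.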

This saves 4 bytes on 32-bit. On 64-bit it saves nothing, but it does leave a
hole I might use later (and some configs have spinlocks larger than 32 bits,
which might make use of the hole).

Autofs4 conversion and changelog by Ian Kent <raven@themaw.net>:
In autofs4, when expiring direct (or offset) mounts we need to ensure that we
block user path walks into the autofs mount, which is covered by another mount.
To do this we clear the mounted status so that follows stop before walking into
the mount and are essentially blocked until the expire is completed. The
automount daemon still finds the correct dentry for the umount due to the
follow mount logic in fs/autofs4/root.c:autofs4_follow_link(), which is set as
an inode operation for direct and offset mounts only and is called following
the lookup that stopped at the covered mount.

At the end of the expire the covering mount probably has gone away so the
mounted status need not be restored. But we need to check this and only restore
the mounted status if the expire failed.

XXX: autofs may not work right if we have other mounts go over the top of it?

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/autofs4/expire.c    |   13 +++++++++++--
 fs/dcache.c            |    1 -
 fs/namespace.c         |   29 ++++++++++++++++++++++++++---
 include/linux/dcache.h |   42 +++++++++++++++++++++---------------------
 4 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/fs/autofs4/expire.c b/fs/autofs4/expire.c
index 7869b3a..d9f9a15 100644
--- a/fs/autofs4/expire.c
+++ b/fs/autofs4/expire.c
@@ -295,7 +295,9 @@ struct dentry *autofs4_expire_direct(struct super_block *sb,
 		struct autofs_info *ino = autofs4_dentry_ino(root);
 		if (d_mountpoint(root)) {
 			ino->flags |= AUTOFS_INF_MOUNTPOINT;
-			root->d_mounted--;
+			spin_lock(&root->d_lock);
+			root->d_flags &= ~DCACHE_MOUNTED;
+			spin_unlock(&root->d_lock);
 		}
 		ino->flags |= AUTOFS_INF_EXPIRING;
 		init_completion(&ino->expire_complete);
@@ -503,7 +505,14 @@ int autofs4_do_expire_multi(struct super_block *sb, struct vfsmount *mnt,
 
 		spin_lock(&sbi->fs_lock);
 		if (ino->flags & AUTOFS_INF_MOUNTPOINT) {
-			sb->s_root->d_mounted++;
+			spin_lock(&sb->s_root->d_lock);
+			/*
+			 * If we haven't been expired away, then reset
+			 * mounted status.
+			 */
+			if (mnt->mnt_parent != mnt)
+				sb->s_root->d_flags |= DCACHE_MOUNTED;
+			spin_unlock(&sb->s_root->d_lock);
 			ino->flags &= ~AUTOFS_INF_MOUNTPOINT;
 		}
 		ino->flags &= ~AUTOFS_INF_EXPIRING;
diff --git a/fs/dcache.c b/fs/dcache.c
index 5b59807..c25be71 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1198,7 +1198,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	dentry->d_sb = NULL;
 	dentry->d_op = NULL;
 	dentry->d_fsdata = NULL;
-	dentry->d_mounted = 0;
 	INIT_HLIST_NODE(&dentry->d_hash);
 	INIT_LIST_HEAD(&dentry->d_lru);
 	INIT_LIST_HEAD(&dentry->d_subdirs);
diff --git a/fs/namespace.c b/fs/namespace.c
index 3dbfc07..39a7d50 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -492,6 +492,27 @@ static void __touch_mnt_namespace(struct mnt_namespace *ns)
 }
 
 /*
+ * Clear dentry's mounted state if it has no remaining mounts.
+ * vfsmount_lock must be held for write.
+ */
+static void dentry_reset_mounted(struct vfsmount *mnt, struct dentry *dentry)
+{
+	unsigned u;
+
+	for (u = 0; u < HASH_SIZE; u++) {
+		struct vfsmount *p;
+
+		list_for_each_entry(p, &mount_hashtable[u], mnt_hash) {
+			if (p->mnt_mountpoint == dentry)
+				return;
+		}
+	}
+	spin_lock(&dentry->d_lock);
+	dentry->d_flags &= ~DCACHE_MOUNTED;
+	spin_unlock(&dentry->d_lock);
+}
+
+/*
  * vfsmount lock must be held for write
  */
 static void detach_mnt(struct vfsmount *mnt, struct path *old_path)
@@ -502,7 +523,7 @@ static void detach_mnt(struct vfsmount *mnt, struct path *old_path)
 	mnt->mnt_mountpoint = mnt->mnt_root;
 	list_del_init(&mnt->mnt_child);
 	list_del_init(&mnt->mnt_hash);
-	old_path->dentry->d_mounted--;
+	dentry_reset_mounted(old_path->mnt, old_path->dentry);
 }
 
 /*
@@ -513,7 +534,9 @@ void mnt_set_mountpoint(struct vfsmount *mnt, struct dentry *dentry,
 {
 	child_mnt->mnt_parent = mntget(mnt);
 	child_mnt->mnt_mountpoint = dget(dentry);
-	dentry->d_mounted++;
+	spin_lock(&dentry->d_lock);
+	dentry->d_flags |= DCACHE_MOUNTED;
+	spin_unlock(&dentry->d_lock);
 }
 
 /*
@@ -1073,7 +1096,7 @@ void umount_tree(struct vfsmount *mnt, int propagate, struct list_head *kill)
 		list_del_init(&p->mnt_child);
 		if (p->mnt_parent != p) {
 			p->mnt_parent->mnt_ghosts++;
-			p->mnt_mountpoint->d_mounted--;
+			dentry_reset_mounted(p->mnt_parent, p->mnt_mountpoint);
 		}
 		change_mnt_propagation(p, MS_PRIVATE);
 	}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 5dfb6f7..2fd0b45 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -92,7 +92,6 @@ struct dentry {
 	unsigned int d_flags;		/* protected by d_lock */
 	spinlock_t d_lock;		/* per dentry lock */
 	seqcount_t d_seq;		/* per dentry seqlock */
-	int d_mounted;
 	struct inode *d_inode;		/* Where the name belongs to - NULL is
 					 * negative */
 	/*
@@ -156,33 +155,34 @@ struct dentry_operations {
 
 /* d_flags entries */
 #define DCACHE_AUTOFS_PENDING 0x0001    /* autofs: "under construction" */
-#define DCACHE_NFSFS_RENAMED  0x0002    /* this dentry has been "silly
-					 * renamed" and has to be
-					 * deleted on the last dput()
-					 */
-#define	DCACHE_DISCONNECTED 0x0004
-     /* This dentry is possibly not currently connected to the dcache tree,
-      * in which case its parent will either be itself, or will have this
-      * flag as well.  nfsd will not use a dentry with this bit set, but will
-      * first endeavour to clear the bit either by discovering that it is
-      * connected, or by performing lookup operations.   Any filesystem which
-      * supports nfsd_operations MUST have a lookup function which, if it finds
-      * a directory inode with a DCACHE_DISCONNECTED dentry, will d_move
-      * that dentry into place and return that dentry rather than the passed one,
-      * typically using d_splice_alias.
-      */
+#define DCACHE_NFSFS_RENAMED  0x0002
+     /* this dentry has been "silly renamed" and has to be deleted on the last
+      * dput() */
+
+#define	DCACHE_DISCONNECTED	0x0004
+     /* This dentry is possibly not currently connected to the dcache tree, in
+      * which case its parent will either be itself, or will have this flag as
+      * well.  nfsd will not use a dentry with this bit set, but will first
+      * endeavour to clear the bit either by discovering that it is connected,
+      * or by performing lookup operations.   Any filesystem which supports
+      * nfsd_operations MUST have a lookup function which, if it finds a
+      * directory inode with a DCACHE_DISCONNECTED dentry, will d_move that
+      * dentry into place and return that dentry rather than the passed one,
+      * typically using d_splice_alias. */
 
 #define DCACHE_REFERENCED	0x0008  /* Recently used, don't discard. */
 #define DCACHE_UNHASHED		0x0010	
-
-#define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched by inotify */
+#define DCACHE_INOTIFY_PARENT_WATCHED 0x0020
+     /* Parent inode is watched by inotify */
 
 #define DCACHE_COOKIE		0x0040	/* For use by dcookie subsystem */
-
-#define DCACHE_FSNOTIFY_PARENT_WATCHED	0x0080 /* Parent inode is watched by some fsnotify listener */
+#define DCACHE_FSNOTIFY_PARENT_WATCHED 0x0080
+     /* Parent inode is watched by some fsnotify listener */
 
 #define DCACHE_CANT_MOUNT	0x0100
 #define DCACHE_GENOCIDE		0x0200
+#define DCACHE_MOUNTED		0x0400	/* is a mountpoint */
+
 
 extern spinlock_t dcache_inode_lock;
 extern spinlock_t dcache_hash_lock;
@@ -381,7 +381,7 @@ extern void dput(struct dentry *);
 
 static inline int d_mountpoint(struct dentry *dentry)
 {
-	return dentry->d_mounted;
+	return dentry->d_flags & DCACHE_MOUNTED;
 }
 
 extern struct vfsmount *lookup_mnt(struct path *);
-- 
1.7.1



* [PATCH 36/46] fs: dcache reduce branches in lookup path
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (33 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 35/46] fs: dcache remove d_mounted Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 37/46] fs: cache optimise dentry and inode for rcu-walk Nick Piggin
                   ` (14 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.
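
The idea can be sketched in userspace roughly as follows. This is an
illustrative model only: the model_* types and the MODEL_OP_* bit values are
placeholders invented for the example, not the DCACHE_OP_* definitions this
patch adds, and only two of the four operations are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder bits for this model; not the kernel's DCACHE_OP_* values. */
#define MODEL_OP_HASH	0x1u
#define MODEL_OP_DELETE	0x2u

struct model_dentry;

struct model_dentry_ops {
	int (*d_hash)(struct model_dentry *);
	int (*d_delete)(struct model_dentry *);
};

struct model_dentry {
	unsigned int d_flags;
	const struct model_dentry_ops *d_op;
};

/* Cache "which ops exist" in d_flags once, when the ops are assigned. */
static void model_set_d_op(struct model_dentry *d,
			   const struct model_dentry_ops *op)
{
	assert(d != NULL);
	d->d_op = op;
	if (!op)
		return;
	if (op->d_hash)
		d->d_flags |= MODEL_OP_HASH;
	if (op->d_delete)
		d->d_flags |= MODEL_OP_DELETE;
}

/* Example operation: always ask for the dentry to be killed. */
static int always_delete(struct model_dentry *d)
{
	(void)d;
	return 1;
}

/* An ops table that provides d_delete but no d_hash. */
static const struct model_dentry_ops delete_only_ops = {
	NULL, always_delete
};

/*
 * Hot path: one flag test replaces "d_op && d_op->d_delete". d_op is only
 * loaded when the flag says the operation is present.
 */
static int model_should_kill(struct model_dentry *d)
{
	if (d->d_flags & MODEL_OP_DELETE)
		return d->d_op->d_delete(d);
	return 0;
}
```

Like d_set_d_op() in the diff, the model only ever sets bits, so it assumes
the operations are assigned once per dentry and not swapped afterwards.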

Patched with:

git grep -l "d_op =" | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 arch/ia64/kernel/perfmon.c    |    2 +-
 drivers/staging/autofs/root.c |    2 +-
 fs/9p/vfs_inode.c             |   26 +++++++++++++-------------
 fs/adfs/dir.c                 |    2 +-
 fs/adfs/super.c               |    2 +-
 fs/affs/namei.c               |    2 +-
 fs/affs/super.c               |    2 +-
 fs/afs/dir.c                  |    2 +-
 fs/anon_inodes.c              |    2 +-
 fs/autofs4/inode.c            |    2 +-
 fs/autofs4/root.c             |   10 +++++-----
 fs/btrfs/export.c             |    4 ++--
 fs/btrfs/inode.c              |    2 +-
 fs/ceph/dir.c                 |    6 +++---
 fs/cifs/dir.c                 |   16 ++++++++--------
 fs/cifs/inode.c               |    8 ++++----
 fs/cifs/link.c                |    4 ++--
 fs/cifs/readdir.c             |    4 ++--
 fs/coda/dir.c                 |    2 +-
 fs/configfs/dir.c             |    8 ++++----
 fs/dcache.c                   |   25 +++++++++++++++++++++----
 fs/ecryptfs/inode.c           |    2 +-
 fs/ecryptfs/main.c            |    4 ++--
 fs/fat/inode.c                |    4 ++--
 fs/fat/namei_msdos.c          |    6 +++---
 fs/fat/namei_vfat.c           |    8 ++++----
 fs/fuse/dir.c                 |    2 +-
 fs/fuse/inode.c               |    4 ++--
 fs/gfs2/export.c              |    4 ++--
 fs/gfs2/ops_fstype.c          |    2 +-
 fs/gfs2/ops_inode.c           |    2 +-
 fs/hfs/dir.c                  |    2 +-
 fs/hfs/super.c                |    2 +-
 fs/hfsplus/dir.c              |    2 +-
 fs/hfsplus/super.c            |    2 +-
 fs/hostfs/hostfs_kern.c       |    2 +-
 fs/hpfs/dentry.c              |    2 +-
 fs/isofs/inode.c              |    2 +-
 fs/isofs/namei.c              |    2 +-
 fs/jfs/namei.c                |    4 ++--
 fs/jfs/super.c                |    2 +-
 fs/libfs.c                    |    2 +-
 fs/minix/namei.c              |    2 +-
 fs/namei.c                    |   31 ++++++++++++++++++++-----------
 fs/ncpfs/dir.c                |    4 ++--
 fs/ncpfs/inode.c              |    2 +-
 fs/nfs/dir.c                  |    6 +++---
 fs/nfs/getroot.c              |    4 ++--
 fs/ocfs2/export.c             |    4 ++--
 fs/ocfs2/namei.c              |   10 +++++-----
 fs/pipe.c                     |    2 +-
 fs/proc/base.c                |   12 ++++++------
 fs/proc/generic.c             |    2 +-
 fs/proc/proc_sysctl.c         |    4 ++--
 fs/reiserfs/xattr.c           |    2 +-
 fs/sysfs/dir.c                |    2 +-
 fs/sysv/namei.c               |    2 +-
 fs/sysv/super.c               |    2 +-
 include/linux/dcache.h        |    6 ++++++
 kernel/cgroup.c               |    2 +-
 net/socket.c                  |    2 +-
 net/sunrpc/rpc_pipe.c         |    2 +-
 62 files changed, 165 insertions(+), 133 deletions(-)

diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index d39d8a5..5a24f40 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -2233,7 +2233,7 @@ pfm_alloc_file(pfm_context_t *ctx)
 	}
 	path.mnt = mntget(pfmfs_mnt);
 
-	path.dentry->d_op = &pfmfs_dentry_operations;
+	d_set_d_op(path.dentry, &pfmfs_dentry_operations);
 	d_add(path.dentry, inode);
 
 	file = alloc_file(&path, FMODE_READ, &pfm_file_ops);
diff --git a/drivers/staging/autofs/root.c b/drivers/staging/autofs/root.c
index 0fdec4b..b09adb5 100644
--- a/drivers/staging/autofs/root.c
+++ b/drivers/staging/autofs/root.c
@@ -237,7 +237,7 @@ static struct dentry *autofs_root_lookup(struct inode *dir, struct dentry *dentr
 	 *
 	 * We need to do this before we release the directory semaphore.
 	 */
-	dentry->d_op = &autofs_dentry_operations;
+	d_set_d_op(dentry, &autofs_dentry_operations);
 	dentry->d_flags |= DCACHE_AUTOFS_PENDING;
 	d_add(dentry, NULL);
 
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index f6f9081..df8bbb3 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -635,9 +635,9 @@ v9fs_create(struct v9fs_session_info *v9ses, struct inode *dir,
 	}
 
 	if (v9ses->cache)
-		dentry->d_op = &v9fs_cached_dentry_operations;
+		d_set_d_op(dentry, &v9fs_cached_dentry_operations);
 	else
-		dentry->d_op = &v9fs_dentry_operations;
+		d_set_d_op(dentry, &v9fs_dentry_operations);
 
 	d_instantiate(dentry, inode);
 	err = v9fs_fid_add(dentry, fid);
@@ -749,7 +749,7 @@ v9fs_vfs_create_dotl(struct inode *dir, struct dentry *dentry, int omode,
 				err);
 			goto error;
 		}
-		dentry->d_op = &v9fs_cached_dentry_operations;
+		d_set_d_op(dentry, &v9fs_cached_dentry_operations);
 		d_instantiate(dentry, inode);
 		err = v9fs_fid_add(dentry, fid);
 		if (err < 0)
@@ -767,7 +767,7 @@ v9fs_vfs_create_dotl(struct inode *dir, struct dentry *dentry, int omode,
 			err = PTR_ERR(inode);
 			goto error;
 		}
-		dentry->d_op = &v9fs_dentry_operations;
+		d_set_d_op(dentry, &v9fs_dentry_operations);
 		d_instantiate(dentry, inode);
 	}
 	/* Now set the ACL based on the default value */
@@ -956,7 +956,7 @@ static int v9fs_vfs_mkdir_dotl(struct inode *dir,
 				err);
 			goto error;
 		}
-		dentry->d_op = &v9fs_cached_dentry_operations;
+		d_set_d_op(dentry, &v9fs_cached_dentry_operations);
 		d_instantiate(dentry, inode);
 		err = v9fs_fid_add(dentry, fid);
 		if (err < 0)
@@ -973,7 +973,7 @@ static int v9fs_vfs_mkdir_dotl(struct inode *dir,
 			err = PTR_ERR(inode);
 			goto error;
 		}
-		dentry->d_op = &v9fs_dentry_operations;
+		d_set_d_op(dentry, &v9fs_dentry_operations);
 		d_instantiate(dentry, inode);
 	}
 	/* Now set the ACL based on the default value */
@@ -1041,9 +1041,9 @@ static struct dentry *v9fs_vfs_lookup(struct inode *dir, struct dentry *dentry,
 
 inst_out:
 	if (v9ses->cache)
-		dentry->d_op = &v9fs_cached_dentry_operations;
+		d_set_d_op(dentry, &v9fs_cached_dentry_operations);
 	else
-		dentry->d_op = &v9fs_dentry_operations;
+		d_set_d_op(dentry, &v9fs_dentry_operations);
 
 	d_add(dentry, inode);
 	return NULL;
@@ -1709,7 +1709,7 @@ v9fs_vfs_symlink_dotl(struct inode *dir, struct dentry *dentry,
 					err);
 			goto error;
 		}
-		dentry->d_op = &v9fs_cached_dentry_operations;
+		d_set_d_op(dentry, &v9fs_cached_dentry_operations);
 		d_instantiate(dentry, inode);
 		err = v9fs_fid_add(dentry, fid);
 		if (err < 0)
@@ -1722,7 +1722,7 @@ v9fs_vfs_symlink_dotl(struct inode *dir, struct dentry *dentry,
 			err = PTR_ERR(inode);
 			goto error;
 		}
-		dentry->d_op = &v9fs_dentry_operations;
+		d_set_d_op(dentry, &v9fs_dentry_operations);
 		d_instantiate(dentry, inode);
 	}
 
@@ -1856,7 +1856,7 @@ v9fs_vfs_link_dotl(struct dentry *old_dentry, struct inode *dir,
 		ihold(old_dentry->d_inode);
 	}
 
-	dentry->d_op = old_dentry->d_op;
+	d_set_d_op(dentry, old_dentry->d_op);
 	d_instantiate(dentry, old_dentry->d_inode);
 
 	return err;
@@ -1980,7 +1980,7 @@ v9fs_vfs_mknod_dotl(struct inode *dir, struct dentry *dentry, int omode,
 				err);
 			goto error;
 		}
-		dentry->d_op = &v9fs_cached_dentry_operations;
+		d_set_d_op(dentry, &v9fs_cached_dentry_operations);
 		d_instantiate(dentry, inode);
 		err = v9fs_fid_add(dentry, fid);
 		if (err < 0)
@@ -1996,7 +1996,7 @@ v9fs_vfs_mknod_dotl(struct inode *dir, struct dentry *dentry, int omode,
 			err = PTR_ERR(inode);
 			goto error;
 		}
-		dentry->d_op = &v9fs_dentry_operations;
+		d_set_d_op(dentry, &v9fs_dentry_operations);
 		d_instantiate(dentry, inode);
 	}
 	/* Now set the ACL based on the default value */
diff --git a/fs/adfs/dir.c b/fs/adfs/dir.c
index a098bba..ee00001 100644
--- a/fs/adfs/dir.c
+++ b/fs/adfs/dir.c
@@ -276,7 +276,7 @@ adfs_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd)
 	struct object_info obj;
 	int error;
 
-	dentry->d_op = &adfs_dentry_operations;	
+	d_set_d_op(dentry, &adfs_dentry_operations);
 	lock_kernel();
 	error = adfs_dir_lookup_byname(dir, &dentry->d_name, &obj);
 	if (error == 0) {
diff --git a/fs/adfs/super.c b/fs/adfs/super.c
index 47dffc5..a4041b5 100644
--- a/fs/adfs/super.c
+++ b/fs/adfs/super.c
@@ -484,7 +484,7 @@ static int adfs_fill_super(struct super_block *sb, void *data, int silent)
 		adfs_error(sb, "get root inode failed\n");
 		goto error;
 	} else
-		sb->s_root->d_op = &adfs_dentry_operations;
+		d_set_d_op(sb->s_root, &adfs_dentry_operations);
 	unlock_kernel();
 	return 0;
 
diff --git a/fs/affs/namei.c b/fs/affs/namei.c
index 91d5dcd..b69ba2d 100644
--- a/fs/affs/namei.c
+++ b/fs/affs/namei.c
@@ -238,7 +238,7 @@ affs_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd)
 		if (IS_ERR(inode))
 			return ERR_CAST(inode);
 	}
-	dentry->d_op = AFFS_SB(sb)->s_flags & SF_INTL ? &affs_intl_dentry_operations : &affs_dentry_operations;
+	d_set_d_op(dentry, AFFS_SB(sb)->s_flags & SF_INTL ? &affs_intl_dentry_operations : &affs_dentry_operations);
 	d_add(dentry, inode);
 	return NULL;
 }
diff --git a/fs/affs/super.c b/fs/affs/super.c
index 4c18fcf..d39081b 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -482,7 +482,7 @@ got_root:
 		printk(KERN_ERR "AFFS: Get root inode failed\n");
 		goto out_error;
 	}
-	sb->s_root->d_op = &affs_dentry_operations;
+	d_set_d_op(sb->s_root, &affs_dentry_operations);
 
 	pr_debug("AFFS: s_flags=%lX\n",sb->s_flags);
 	return 0;
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 2c18cde..b8bb7e7 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -581,7 +581,7 @@ static struct dentry *afs_lookup(struct inode *dir, struct dentry *dentry,
 	}
 
 success:
-	dentry->d_op = &afs_fs_dentry_operations;
+	d_set_d_op(dentry, &afs_fs_dentry_operations);
 
 	d_add(dentry, inode);
 	_leave(" = 0 { vn=%u u=%u } -> { ino=%lu v=%llu }",
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 57ce55b..aca8806 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -113,7 +113,7 @@ struct file *anon_inode_getfile(const char *name,
 	 */
 	ihold(anon_inode_inode);
 
-	path.dentry->d_op = &anon_inodefs_dentry_operations;
+	d_set_d_op(path.dentry, &anon_inodefs_dentry_operations);
 	d_instantiate(path.dentry, anon_inode_inode);
 
 	error = -ENFILE;
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index ac87e49..a7bdb9d 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -309,7 +309,7 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
 		goto fail_iput;
 	pipe = NULL;
 
-	root->d_op = &autofs4_sb_dentry_operations;
+	d_set_d_op(root, &autofs4_sb_dentry_operations);
 	root->d_fsdata = ino;
 
 	/* Can this call block? */
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index 3eaa251..5b0421f 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -571,7 +571,7 @@ static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, s
 		 * we check for the hashed dentry and return the newly
 		 * hashed dentry.
 		 */
-		dentry->d_op = &autofs4_root_dentry_operations;
+		d_set_d_op(dentry, &autofs4_root_dentry_operations);
 
 		/*
 		 * And we need to ensure that the same dentry is used for
@@ -710,9 +710,9 @@ static int autofs4_dir_symlink(struct inode *dir,
 	d_add(dentry, inode);
 
 	if (dir == dir->i_sb->s_root->d_inode)
-		dentry->d_op = &autofs4_root_dentry_operations;
+		d_set_d_op(dentry, &autofs4_root_dentry_operations);
 	else
-		dentry->d_op = &autofs4_dentry_operations;
+		d_set_d_op(dentry, &autofs4_dentry_operations);
 
 	dentry->d_fsdata = ino;
 	ino->dentry = dget(dentry);
@@ -845,9 +845,9 @@ static int autofs4_dir_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 	d_add(dentry, inode);
 
 	if (dir == dir->i_sb->s_root->d_inode)
-		dentry->d_op = &autofs4_root_dentry_operations;
+		d_set_d_op(dentry, &autofs4_root_dentry_operations);
 	else
-		dentry->d_op = &autofs4_dentry_operations;
+		d_set_d_op(dentry, &autofs4_dentry_operations);
 
 	dentry->d_fsdata = ino;
 	ino->dentry = dget(dentry);
diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 951ef09..19ff716 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -110,7 +110,7 @@ static struct dentry *btrfs_get_dentry(struct super_block *sb, u64 objectid,
 
 	dentry = d_obtain_alias(inode);
 	if (!IS_ERR(dentry))
-		dentry->d_op = &btrfs_dentry_operations;
+		d_set_d_op(dentry, &btrfs_dentry_operations);
 	return dentry;
 fail:
 	srcu_read_unlock(&fs_info->subvol_srcu, index);
@@ -225,7 +225,7 @@ static struct dentry *btrfs_get_parent(struct dentry *child)
 	key.offset = 0;
 	dentry = d_obtain_alias(btrfs_iget(root->fs_info->sb, &key, root, NULL));
 	if (!IS_ERR(dentry))
-		dentry->d_op = &btrfs_dentry_operations;
+		d_set_d_op(dentry, &btrfs_dentry_operations);
 	return dentry;
 fail:
 	btrfs_free_path(path);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c704fd1..aab3087 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4084,7 +4084,7 @@ struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry)
 	int index;
 	int ret;
 
-	dentry->d_op = &btrfs_dentry_operations;
+	d_set_d_op(dentry, &btrfs_dentry_operations);
 
 	if (dentry->d_name.len > BTRFS_NAME_LEN)
 		return ERR_PTR(-ENAMETOOLONG);
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index e42d2a1..5c89989 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -41,11 +41,11 @@ int ceph_init_dentry(struct dentry *dentry)
 		return 0;
 
 	if (ceph_snap(dentry->d_parent->d_inode) == CEPH_NOSNAP)
-		dentry->d_op = &ceph_dentry_ops;
+		d_set_d_op(dentry, &ceph_dentry_ops);
 	else if (ceph_snap(dentry->d_parent->d_inode) == CEPH_SNAPDIR)
-		dentry->d_op = &ceph_snapdir_dentry_ops;
+		d_set_d_op(dentry, &ceph_snapdir_dentry_ops);
 	else
-		dentry->d_op = &ceph_snap_dentry_ops;
+		d_set_d_op(dentry, &ceph_snap_dentry_ops);
 
 	di = kmem_cache_alloc(ceph_dentry_cachep, GFP_NOFS | __GFP_ZERO);
 	if (!di)
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 5227626..966f58c 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -135,9 +135,9 @@ static void setup_cifs_dentry(struct cifsTconInfo *tcon,
 			      struct inode *newinode)
 {
 	if (tcon->nocase)
-		direntry->d_op = &cifs_ci_dentry_ops;
+		d_set_d_op(direntry, &cifs_ci_dentry_ops);
 	else
-		direntry->d_op = &cifs_dentry_ops;
+		d_set_d_op(direntry, &cifs_dentry_ops);
 	d_instantiate(direntry, newinode);
 }
 
@@ -421,9 +421,9 @@ int cifs_mknod(struct inode *inode, struct dentry *direntry, int mode,
 		rc = cifs_get_inode_info_unix(&newinode, full_path,
 						inode->i_sb, xid);
 		if (pTcon->nocase)
-			direntry->d_op = &cifs_ci_dentry_ops;
+			d_set_d_op(direntry, &cifs_ci_dentry_ops);
 		else
-			direntry->d_op = &cifs_dentry_ops;
+			d_set_d_op(direntry, &cifs_dentry_ops);
 
 		if (rc == 0)
 			d_instantiate(direntry, newinode);
@@ -604,9 +604,9 @@ cifs_lookup(struct inode *parent_dir_inode, struct dentry *direntry,
 
 	if ((rc == 0) && (newInode != NULL)) {
 		if (pTcon->nocase)
-			direntry->d_op = &cifs_ci_dentry_ops;
+			d_set_d_op(direntry, &cifs_ci_dentry_ops);
 		else
-			direntry->d_op = &cifs_dentry_ops;
+			d_set_d_op(direntry, &cifs_dentry_ops);
 		d_add(direntry, newInode);
 		if (posix_open) {
 			filp = lookup_instantiate_filp(nd, direntry,
@@ -634,9 +634,9 @@ cifs_lookup(struct inode *parent_dir_inode, struct dentry *direntry,
 		rc = 0;
 		direntry->d_time = jiffies;
 		if (pTcon->nocase)
-			direntry->d_op = &cifs_ci_dentry_ops;
+			d_set_d_op(direntry, &cifs_ci_dentry_ops);
 		else
-			direntry->d_op = &cifs_dentry_ops;
+			d_set_d_op(direntry, &cifs_dentry_ops);
 		d_add(direntry, NULL);
 	/*	if it was once a directory (but how can we tell?) we could do
 		shrink_dcache_parent(direntry); */
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index ca901f0..76677de 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1314,9 +1314,9 @@ int cifs_mkdir(struct inode *inode, struct dentry *direntry, int mode)
 	to set uid/gid */
 			inc_nlink(inode);
 			if (pTcon->nocase)
-				direntry->d_op = &cifs_ci_dentry_ops;
+				d_set_d_op(direntry, &cifs_ci_dentry_ops);
 			else
-				direntry->d_op = &cifs_dentry_ops;
+				d_set_d_op(direntry, &cifs_dentry_ops);
 
 			cifs_unix_basic_to_fattr(&fattr, pInfo, cifs_sb);
 			cifs_fill_uniqueid(inode->i_sb, &fattr);
@@ -1358,9 +1358,9 @@ mkdir_get_info:
 						 inode->i_sb, xid, NULL);
 
 		if (pTcon->nocase)
-			direntry->d_op = &cifs_ci_dentry_ops;
+			d_set_d_op(direntry, &cifs_ci_dentry_ops);
 		else
-			direntry->d_op = &cifs_dentry_ops;
+			d_set_d_op(direntry, &cifs_dentry_ops);
 		d_instantiate(direntry, newinode);
 		 /* setting nlink not necessary except in cases where we
 		  * failed to get it from the server or was set bogus */
diff --git a/fs/cifs/link.c b/fs/cifs/link.c
index 85cdbf8..fe2f6a9 100644
--- a/fs/cifs/link.c
+++ b/fs/cifs/link.c
@@ -525,9 +525,9 @@ cifs_symlink(struct inode *inode, struct dentry *direntry, const char *symname)
 			      rc);
 		} else {
 			if (pTcon->nocase)
-				direntry->d_op = &cifs_ci_dentry_ops;
+				d_set_d_op(direntry, &cifs_ci_dentry_ops);
 			else
-				direntry->d_op = &cifs_dentry_ops;
+				d_set_d_op(direntry, &cifs_dentry_ops);
 			d_instantiate(direntry, newinode);
 		}
 	}
diff --git a/fs/cifs/readdir.c b/fs/cifs/readdir.c
index ec5b2af..9d5b00f 100644
--- a/fs/cifs/readdir.c
+++ b/fs/cifs/readdir.c
@@ -103,9 +103,9 @@ cifs_readdir_lookup(struct dentry *parent, struct qstr *name,
 	}
 
 	if (cifs_sb_master_tcon(CIFS_SB(sb))->nocase)
-		dentry->d_op = &cifs_ci_dentry_ops;
+		d_set_d_op(dentry, &cifs_ci_dentry_ops);
 	else
-		dentry->d_op = &cifs_dentry_ops;
+		d_set_d_op(dentry, &cifs_dentry_ops);
 
 	alias = d_materialise_unique(dentry, inode);
 	if (alias != NULL) {
diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index 9e37e8b..aa40c81 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -125,7 +125,7 @@ static struct dentry *coda_lookup(struct inode *dir, struct dentry *entry, struc
 		return ERR_PTR(error);
 
 exit:
-	entry->d_op = &coda_dentry_operations;
+	d_set_d_op(entry, &coda_dentry_operations);
 
 	if (inode && (type & CODA_NOCACHE))
 		coda_flag_inode(inode, C_VATTR | C_PURGE);
diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
index 5825780..8cd00c3 100644
--- a/fs/configfs/dir.c
+++ b/fs/configfs/dir.c
@@ -234,7 +234,7 @@ int configfs_make_dirent(struct configfs_dirent * parent_sd,
 	sd->s_dentry = dentry;
 	if (dentry) {
 		dentry->d_fsdata = configfs_get(sd);
-		dentry->d_op = &configfs_dentry_ops;
+		d_set_d_op(dentry, &configfs_dentry_ops);
 	}
 
 	return 0;
@@ -278,7 +278,7 @@ static int create_dir(struct config_item * k, struct dentry * p,
 		error = configfs_create(d, mode, init_dir);
 		if (!error) {
 			inc_nlink(p->d_inode);
-			(d)->d_op = &configfs_dentry_ops;
+			d_set_d_op((d), &configfs_dentry_ops);
 		} else {
 			struct configfs_dirent *sd = d->d_fsdata;
 			if (sd) {
@@ -372,7 +372,7 @@ int configfs_create_link(struct configfs_symlink *sl,
 	if (!err) {
 		err = configfs_create(dentry, mode, init_symlink);
 		if (!err)
-			dentry->d_op = &configfs_dentry_ops;
+			d_set_d_op(dentry, &configfs_dentry_ops);
 		else {
 			struct configfs_dirent *sd = dentry->d_fsdata;
 			if (sd) {
@@ -447,7 +447,7 @@ static int configfs_attach_attr(struct configfs_dirent * sd, struct dentry * den
 		return error;
 	}
 
-	dentry->d_op = &configfs_dentry_ops;
+	d_set_d_op(dentry, &configfs_dentry_ops);
 	d_rehash(dentry);
 
 	return 0;
diff --git a/fs/dcache.c b/fs/dcache.c
index c25be71..879e5b3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -322,7 +322,7 @@ repeat:
  		return;
  	}
 
-	if (dentry->d_op && dentry->d_op->d_delete) {
+	if (dentry->d_flags & DCACHE_OP_DELETE) {
 		if (dentry->d_op->d_delete(dentry))
 			goto kill_it;
 	}
@@ -1234,6 +1234,23 @@ struct dentry *d_alloc_name(struct dentry *parent, const char *name)
 }
 EXPORT_SYMBOL(d_alloc_name);
 
+void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op)
+{
+	dentry->d_op = op;
+	if (!op)
+		return;
+	if (op->d_hash)
+		dentry->d_flags |= DCACHE_OP_HASH;
+	if (op->d_compare)
+		dentry->d_flags |= DCACHE_OP_COMPARE;
+	if (op->d_revalidate)
+		dentry->d_flags |= DCACHE_OP_REVALIDATE;
+	if (op->d_delete)
+		dentry->d_flags |= DCACHE_OP_DELETE;
+
+}
+EXPORT_SYMBOL(d_set_d_op);
+
 static void __d_instantiate(struct dentry *dentry, struct inode *inode)
 {
 	spin_lock(&dentry->d_lock);
@@ -1651,7 +1668,7 @@ seqretry:
 		 */
 		if (read_seqcount_retry(&dentry->d_seq, *seq))
 			goto seqretry;
-		if (parent->d_op && parent->d_op->d_compare) {
+		if (parent->d_flags & DCACHE_OP_COMPARE) {
 			if (parent->d_op->d_compare(parent,
 						dentry, i,
 						tlen, tname, name))
@@ -1780,7 +1797,7 @@ struct dentry *__d_lookup(struct dentry *parent, struct qstr *name)
 		 */
 		tlen = dentry->d_name.len;
 		tname = dentry->d_name.name;
-		if (parent->d_op && parent->d_op->d_compare) {
+		if (parent->d_flags & DCACHE_OP_COMPARE) {
 			if (parent->d_op->d_compare(parent,
 						dentry, dentry->d_inode,
 						tlen, tname, name))
@@ -1821,7 +1838,7 @@ struct dentry *d_hash_and_lookup(struct dentry *dir, struct qstr *name)
 	 * routine may choose to leave the hash value unchanged.
 	 */
 	name->hash = full_name_hash(name->name, name->len);
-	if (dir->d_op && dir->d_op->d_hash) {
+	if (dir->d_flags & DCACHE_OP_HASH) {
 		if (dir->d_op->d_hash(dir, dir->d_inode, name) < 0)
 			goto out;
 	}
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 5e5c7ec..f91b35d 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -441,7 +441,7 @@ static struct dentry *ecryptfs_lookup(struct inode *ecryptfs_dir_inode,
 	struct qstr lower_name;
 	int rc = 0;
 
-	ecryptfs_dentry->d_op = &ecryptfs_dops;
+	d_set_d_op(ecryptfs_dentry, &ecryptfs_dops);
 	if ((ecryptfs_dentry->d_name.len == 1
 	     && !strcmp(ecryptfs_dentry->d_name.name, "."))
 	    || (ecryptfs_dentry->d_name.len == 2
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index a9dbd62..3510386 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -189,7 +189,7 @@ int ecryptfs_interpose(struct dentry *lower_dentry, struct dentry *dentry,
 	if (special_file(lower_inode->i_mode))
 		init_special_inode(inode, lower_inode->i_mode,
 				   lower_inode->i_rdev);
-	dentry->d_op = &ecryptfs_dops;
+	d_set_d_op(dentry, &ecryptfs_dops);
 	fsstack_copy_attr_all(inode, lower_inode);
 	/* This size will be overwritten for real files w/ headers and
 	 * other metadata */
@@ -594,7 +594,7 @@ static struct dentry *ecryptfs_mount(struct file_system_type *fs_type, int flags
 		deactivate_locked_super(s);
 		goto out;
 	}
-	s->s_root->d_op = &ecryptfs_dops;
+	d_set_d_op(s->s_root, &ecryptfs_dops);
 	s->s_root->d_sb = s;
 	s->s_root->d_parent = s->s_root;
 
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 8cccfeb..206351a 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -750,7 +750,7 @@ static struct dentry *fat_fh_to_dentry(struct super_block *sb,
 	 */
 	result = d_obtain_alias(inode);
 	if (!IS_ERR(result))
-		result->d_op = sb->s_root->d_op;
+		d_set_d_op(result, sb->s_root->d_op);
 	return result;
 }
 
@@ -800,7 +800,7 @@ static struct dentry *fat_get_parent(struct dentry *child)
 
 	parent = d_obtain_alias(inode);
 	if (!IS_ERR(parent))
-		parent->d_op = sb->s_root->d_op;
+		d_set_d_op(parent, sb->s_root->d_op);
 out:
 	unlock_super(sb);
 
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index c5f32db..83ac1ba 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -228,10 +228,10 @@ static struct dentry *msdos_lookup(struct inode *dir, struct dentry *dentry,
 	}
 out:
 	unlock_super(sb);
-	dentry->d_op = &msdos_dentry_operations;
+	d_set_d_op(dentry, &msdos_dentry_operations);
 	dentry = d_splice_alias(inode, dentry);
 	if (dentry)
-		dentry->d_op = &msdos_dentry_operations;
+		d_set_d_op(dentry, &msdos_dentry_operations);
 	return dentry;
 
 error:
@@ -674,7 +674,7 @@ static int msdos_fill_super(struct super_block *sb, void *data, int silent)
 	}
 
 	sb->s_flags |= MS_NOATIME;
-	sb->s_root->d_op = &msdos_dentry_operations;
+	d_set_d_op(sb->s_root, &msdos_dentry_operations);
 	unlock_super(sb);
 	return 0;
 }
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 6e4d02d..b721715 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -766,11 +766,11 @@ static struct dentry *vfat_lookup(struct inode *dir, struct dentry *dentry,
 
 out:
 	unlock_super(sb);
-	dentry->d_op = sb->s_root->d_op;
+	d_set_d_op(dentry, sb->s_root->d_op);
 	dentry->d_time = dentry->d_parent->d_inode->i_version;
 	dentry = d_splice_alias(inode, dentry);
 	if (dentry) {
-		dentry->d_op = sb->s_root->d_op;
+		d_set_d_op(dentry, sb->s_root->d_op);
 		dentry->d_time = dentry->d_parent->d_inode->i_version;
 	}
 	return dentry;
@@ -1078,9 +1078,9 @@ static int vfat_fill_super(struct super_block *sb, void *data, int silent)
 	}
 
 	if (MSDOS_SB(sb)->options.name_check != 's')
-		sb->s_root->d_op = &vfat_ci_dentry_ops;
+		d_set_d_op(sb->s_root, &vfat_ci_dentry_ops);
 	else
-		sb->s_root->d_op = &vfat_dentry_ops;
+		d_set_d_op(sb->s_root, &vfat_dentry_ops);
 
 	unlock_super(sb);
 	return 0;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index c9627c9..c9a8a42 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -347,7 +347,7 @@ static struct dentry *fuse_lookup(struct inode *dir, struct dentry *entry,
 	}
 
 	entry = newent ? newent : entry;
-	entry->d_op = &fuse_dentry_operations;
+	d_set_d_op(entry, &fuse_dentry_operations);
 	if (outarg_valid)
 		fuse_change_entry_timeout(entry, &outarg);
 	else
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 44e0a6c..a8b31da 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -626,7 +626,7 @@ static struct dentry *fuse_get_dentry(struct super_block *sb,
 
 	entry = d_obtain_alias(inode);
 	if (!IS_ERR(entry) && get_node_id(inode) != FUSE_ROOT_ID) {
-		entry->d_op = &fuse_dentry_operations;
+		d_set_d_op(entry, &fuse_dentry_operations);
 		fuse_invalidate_entry_cache(entry);
 	}
 
@@ -728,7 +728,7 @@ static struct dentry *fuse_get_parent(struct dentry *child)
 
 	parent = d_obtain_alias(inode);
 	if (!IS_ERR(parent) && get_node_id(inode) != FUSE_ROOT_ID) {
-		parent->d_op = &fuse_dentry_operations;
+		d_set_d_op(parent, &fuse_dentry_operations);
 		fuse_invalidate_entry_cache(parent);
 	}
 
diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
index 5ab3839..97012ec 100644
--- a/fs/gfs2/export.c
+++ b/fs/gfs2/export.c
@@ -130,7 +130,7 @@ static struct dentry *gfs2_get_parent(struct dentry *child)
 
 	dentry = d_obtain_alias(gfs2_lookupi(child->d_inode, &gfs2_qdotdot, 1));
 	if (!IS_ERR(dentry))
-		dentry->d_op = &gfs2_dops;
+		d_set_d_op(dentry, &gfs2_dops);
 	return dentry;
 }
 
@@ -158,7 +158,7 @@ static struct dentry *gfs2_get_dentry(struct super_block *sb,
 out_inode:
 	dentry = d_obtain_alias(inode);
 	if (!IS_ERR(dentry))
-		dentry->d_op = &gfs2_dops;
+		d_set_d_op(dentry, &gfs2_dops);
 	return dentry;
 }
 
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 3eb1393..2aeabd4 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -440,7 +440,7 @@ static int gfs2_lookup_root(struct super_block *sb, struct dentry **dptr,
 		iput(inode);
 		return -ENOMEM;
 	}
-	dentry->d_op = &gfs2_dops;
+	d_set_d_op(dentry, &gfs2_dops);
 	*dptr = dentry;
 	return 0;
 }
diff --git a/fs/gfs2/ops_inode.c b/fs/gfs2/ops_inode.c
index 12cbea7..f28f897 100644
--- a/fs/gfs2/ops_inode.c
+++ b/fs/gfs2/ops_inode.c
@@ -106,7 +106,7 @@ static struct dentry *gfs2_lookup(struct inode *dir, struct dentry *dentry,
 {
 	struct inode *inode = NULL;
 
-	dentry->d_op = &gfs2_dops;
+	d_set_d_op(dentry, &gfs2_dops);
 
 	inode = gfs2_lookupi(dir, &dentry->d_name, 0);
 	if (inode && IS_ERR(inode))
diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c
index 2b3b861..ea4aefe 100644
--- a/fs/hfs/dir.c
+++ b/fs/hfs/dir.c
@@ -25,7 +25,7 @@ static struct dentry *hfs_lookup(struct inode *dir, struct dentry *dentry,
 	struct inode *inode = NULL;
 	int res;
 
-	dentry->d_op = &hfs_dentry_operations;
+	d_set_d_op(dentry, &hfs_dentry_operations);
 
 	hfs_find_init(HFS_SB(dir->i_sb)->cat_tree, &fd);
 	hfs_cat_build_key(dir->i_sb, fd.search_key, dir->i_ino, &dentry->d_name);
diff --git a/fs/hfs/super.c b/fs/hfs/super.c
index ef4ee57..0bef62a 100644
--- a/fs/hfs/super.c
+++ b/fs/hfs/super.c
@@ -434,7 +434,7 @@ static int hfs_fill_super(struct super_block *sb, void *data, int silent)
 	if (!sb->s_root)
 		goto bail_iput;
 
-	sb->s_root->d_op = &hfs_dentry_operations;
+	d_set_d_op(sb->s_root, &hfs_dentry_operations);
 
 	/* everything's okay */
 	return 0;
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 9d59c05..ccab871 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -37,7 +37,7 @@ static struct dentry *hfsplus_lookup(struct inode *dir, struct dentry *dentry,
 
 	sb = dir->i_sb;
 
-	dentry->d_op = &hfsplus_dentry_operations;
+	d_set_d_op(dentry, &hfsplus_dentry_operations);
 	dentry->d_fsdata = NULL;
 	hfs_find_init(HFSPLUS_SB(sb)->cat_tree, &fd);
 	hfsplus_cat_build_key(sb, fd.search_key, dir->i_ino, &dentry->d_name);
diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
index 182e83a..ddf712e 100644
--- a/fs/hfsplus/super.c
+++ b/fs/hfsplus/super.c
@@ -419,7 +419,7 @@ static int hfsplus_fill_super(struct super_block *sb, void *data, int silent)
 		err = -ENOMEM;
 		goto cleanup;
 	}
-	sb->s_root->d_op = &hfsplus_dentry_operations;
+	d_set_d_op(sb->s_root, &hfsplus_dentry_operations);
 
 	str.len = sizeof(HFSP_HIDDENDIR_NAME) - 1;
 	str.name = HFSP_HIDDENDIR_NAME;
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 861113f..0bc81cf 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -612,7 +612,7 @@ struct dentry *hostfs_lookup(struct inode *ino, struct dentry *dentry,
 		goto out_put;
 
 	d_add(dentry, inode);
-	dentry->d_op = &hostfs_dentry_ops;
+	d_set_d_op(dentry, &hostfs_dentry_ops);
 	return NULL;
 
  out_put:
diff --git a/fs/hpfs/dentry.c b/fs/hpfs/dentry.c
index d7f1cbb..86f573f 100644
--- a/fs/hpfs/dentry.c
+++ b/fs/hpfs/dentry.c
@@ -64,5 +64,5 @@ static const struct dentry_operations hpfs_dentry_operations = {
 
 void hpfs_set_dentry_operations(struct dentry *dentry)
 {
-	dentry->d_op = &hpfs_dentry_operations;
+	d_set_d_op(dentry, &hpfs_dentry_operations);
 }
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index 9813d54..dc9505e 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -945,7 +945,7 @@ root_found:
 		table += 2;
 	if (opt.check == 'r')
 		table++;
-	s->s_root->d_op = &isofs_dentry_ops[table];
+	d_set_d_op(s->s_root, &isofs_dentry_ops[table]);
 
 	kfree(opt.iocharset);
 
diff --git a/fs/isofs/namei.c b/fs/isofs/namei.c
index e7768ac..2736046 100644
--- a/fs/isofs/namei.c
+++ b/fs/isofs/namei.c
@@ -172,7 +172,7 @@ struct dentry *isofs_lookup(struct inode *dir, struct dentry *dentry, struct nam
 	struct inode *inode;
 	struct page *page;
 
-	dentry->d_op = dir->i_sb->s_root->d_op;
+	d_set_d_op(dentry, dir->i_sb->s_root->d_op);
 
 	page = alloc_page(GFP_USER);
 	if (!page)
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 7166a1b..50d6b8d 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -1466,7 +1466,7 @@ static struct dentry *jfs_lookup(struct inode *dip, struct dentry *dentry, struc
 	jfs_info("jfs_lookup: name = %s", name);
 
 	if (JFS_SBI(dip->i_sb)->mntflag & JFS_OS2)
-		dentry->d_op = &jfs_ci_dentry_operations;
+		d_set_d_op(dentry, &jfs_ci_dentry_operations);
 
 	if ((name[0] == '.') && (len == 1))
 		inum = dip->i_ino;
@@ -1495,7 +1495,7 @@ static struct dentry *jfs_lookup(struct inode *dip, struct dentry *dentry, struc
 	dentry = d_splice_alias(ip, dentry);
 
 	if (dentry && (JFS_SBI(dip->i_sb)->mntflag & JFS_OS2))
-		dentry->d_op = &jfs_ci_dentry_operations;
+		d_set_d_op(dentry, &jfs_ci_dentry_operations);
 
 	return dentry;
 }
diff --git a/fs/jfs/super.c b/fs/jfs/super.c
index b715b0f..3150d76 100644
--- a/fs/jfs/super.c
+++ b/fs/jfs/super.c
@@ -525,7 +525,7 @@ static int jfs_fill_super(struct super_block *sb, void *data, int silent)
 		goto out_no_root;
 
 	if (sbi->mntflag & JFS_OS2)
-		sb->s_root->d_op = &jfs_ci_dentry_operations;
+		d_set_d_op(sb->s_root, &jfs_ci_dentry_operations);
 
 	/* logical blocks are represented by 40 bits in pxd_t, etc. */
 	sb->s_maxbytes = ((u64) sb->s_blocksize) << 40;
diff --git a/fs/libfs.c b/fs/libfs.c
index 28b3666..889311e 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -59,7 +59,7 @@ struct dentry *simple_lookup(struct inode *dir, struct dentry *dentry, struct na
 
 	if (dentry->d_name.len > NAME_MAX)
 		return ERR_PTR(-ENAMETOOLONG);
-	dentry->d_op = &simple_dentry_operations;
+	d_set_d_op(dentry, &simple_dentry_operations);
 	d_add(dentry, NULL);
 	return NULL;
 }
diff --git a/fs/minix/namei.c b/fs/minix/namei.c
index c0d35a3..1b9e077 100644
--- a/fs/minix/namei.c
+++ b/fs/minix/namei.c
@@ -23,7 +23,7 @@ static struct dentry *minix_lookup(struct inode * dir, struct dentry *dentry, st
 	struct inode * inode = NULL;
 	ino_t ino;
 
-	dentry->d_op = dir->i_sb->s_root->d_op;
+	d_set_d_op(dentry, dir->i_sb->s_root->d_op);
 
 	if (dentry->d_name.len > minix_sb(dir->i_sb)->s_namelen)
 		return ERR_PTR(-ENAMETOOLONG);
diff --git a/fs/namei.c b/fs/namei.c
index c48d208..717ab13 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -543,6 +543,17 @@ do_revalidate(struct dentry *dentry, struct nameidata *nd)
 	return dentry;
 }
 
+static inline int need_reval_dot(struct dentry *dentry)
+{
+	if (likely(!(dentry->d_flags & DCACHE_OP_REVALIDATE)))
+		return 0;
+
+	if (likely(!(dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT)))
+		return 0;
+
+	return 1;
+}
+
 /*
  * force_reval_path - force revalidation of a dentry
  *
@@ -566,10 +577,9 @@ force_reval_path(struct path *path, struct nameidata *nd)
 
 	/*
 	 * only check on filesystems where it's possible for the dentry to
-	 * become stale. It's assumed that if this flag is set then the
-	 * d_revalidate op will also be defined.
+	 * become stale.
 	 */
-	if (!(dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT))
+	if (!need_reval_dot(dentry))
 		return 0;
 
 	status = dentry->d_op->d_revalidate(dentry, nd);
@@ -959,7 +969,7 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
 	 * See if the low-level filesystem might want
 	 * to use its own hash..
 	 */
-	if (parent->d_op && parent->d_op->d_hash) {
+	if (unlikely(parent->d_flags & DCACHE_OP_HASH)) {
 		int err = parent->d_op->d_hash(parent, nd->inode, name);
 		if (err < 0)
 			return err;
@@ -984,7 +994,7 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
 			return -ECHILD;
 
 		nd->seq = seq;
-		if (dentry->d_op && dentry->d_op->d_revalidate) {
+		if (dentry->d_flags & DCACHE_OP_REVALIDATE) {
 			/* XXX: RCU chokes here */
 			if (nameidata_dentry_drop_rcu(nd, dentry))
 				return -ECHILD;
@@ -998,7 +1008,7 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
 		if (!dentry)
 			goto need_lookup;
 found:
-		if (dentry->d_op && dentry->d_op->d_revalidate)
+		if (dentry->d_flags & DCACHE_OP_REVALIDATE)
 			goto need_revalidate;
 done:
 		path->mnt = mnt;
@@ -1236,8 +1246,7 @@ return_reval:
 		 * We bypassed the ordinary revalidation routines.
 		 * We may need to check the cached dentry for staleness.
 		 */
-		if (nd->path.dentry && nd->path.dentry->d_sb &&
-		    (nd->path.dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT)) {
+		if (need_reval_dot(nd->path.dentry)) {
 			if (try_nameidata_drop_rcu(nd))
 				return -ECHILD;
 			err = -ESTALE;
@@ -1543,7 +1552,7 @@ static struct dentry *__lookup_hash(struct qstr *name,
 	 * See if the low-level filesystem might want
 	 * to use its own hash..
 	 */
-	if (base->d_op && base->d_op->d_hash) {
+	if (base->d_flags & DCACHE_OP_HASH) {
 		err = base->d_op->d_hash(base, inode, name);
 		dentry = ERR_PTR(err);
 		if (err < 0)
@@ -1557,7 +1566,7 @@ static struct dentry *__lookup_hash(struct qstr *name,
 	 */
 	dentry = d_lookup(base, name);
 
-	if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
+	if (dentry && (dentry->d_flags & DCACHE_OP_REVALIDATE))
 		dentry = do_revalidate(dentry, nd);
 
 	if (!dentry)
@@ -2011,7 +2020,7 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 		follow_dotdot(nd);
 		dir = nd->path.dentry;
 	case LAST_DOT:
-		if (nd->path.mnt->mnt_sb->s_type->fs_flags & FS_REVAL_DOT) {
+		if (need_reval_dot(dir)) {
 			if (!dir->d_op->d_revalidate(dir, nd)) {
 				error = -ESTALE;
 				goto exit;
diff --git a/fs/ncpfs/dir.c b/fs/ncpfs/dir.c
index 6ecc33a..0494dfb 100644
--- a/fs/ncpfs/dir.c
+++ b/fs/ncpfs/dir.c
@@ -655,7 +655,7 @@ ncp_fill_cache(struct file *filp, void *dirent, filldir_t filldir,
 		entry->ino = iunique(dir->i_sb, 2);
 		inode = ncp_iget(dir->i_sb, entry);
 		if (inode) {
-			newdent->d_op = &ncp_dentry_operations;
+			d_set_d_op(newdent, &ncp_dentry_operations);
 			d_instantiate(newdent, inode);
 			if (!hashed)
 				d_rehash(newdent);
@@ -911,7 +911,7 @@ static struct dentry *ncp_lookup(struct inode *dir, struct dentry *dentry, struc
 	if (inode) {
 		ncp_new_dentry(dentry);
 add_entry:
-		dentry->d_op = &ncp_dentry_operations;
+		d_set_d_op(dentry, &ncp_dentry_operations);
 		d_add(dentry, inode);
 		error = 0;
 	}
diff --git a/fs/ncpfs/inode.c b/fs/ncpfs/inode.c
index 60047db..0c75a5f 100644
--- a/fs/ncpfs/inode.c
+++ b/fs/ncpfs/inode.c
@@ -717,7 +717,7 @@ static int ncp_fill_super(struct super_block *sb, void *raw_data, int silent)
 	sb->s_root = d_alloc_root(root_inode);
         if (!sb->s_root)
 		goto out_no_root;
-	sb->s_root->d_op = &ncp_root_dentry_operations;
+	d_set_d_op(sb->s_root, &ncp_root_dentry_operations);
 	return 0;
 
 out_no_root:
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index c03f2d1..2388032 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -443,7 +443,7 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
 	if (dentry == NULL)
 		return;
 
-	dentry->d_op = NFS_PROTO(dir)->dentry_ops;
+	d_set_d_op(dentry, NFS_PROTO(dir)->dentry_ops);
 	inode = nfs_fhget(dentry->d_sb, entry->fh, entry->fattr);
 	if (IS_ERR(inode))
 		goto out;
@@ -1198,7 +1198,7 @@ static struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, stru
 	if (dentry->d_name.len > NFS_SERVER(dir)->namelen)
 		goto out;
 
-	dentry->d_op = NFS_PROTO(dir)->dentry_ops;
+	d_set_d_op(dentry, NFS_PROTO(dir)->dentry_ops);
 
 	/*
 	 * If we're doing an exclusive create, optimize away the lookup
@@ -1343,7 +1343,7 @@ static struct dentry *nfs_atomic_lookup(struct inode *dir, struct dentry *dentry
 		res = ERR_PTR(-ENAMETOOLONG);
 		goto out;
 	}
-	dentry->d_op = NFS_PROTO(dir)->dentry_ops;
+	d_set_d_op(dentry, NFS_PROTO(dir)->dentry_ops);
 
 	/* Let vfs_create() deal with O_EXCL. Instantiate, but don't hash
 	 * the dentry. */
diff --git a/fs/nfs/getroot.c b/fs/nfs/getroot.c
index b3e36c3..c3a5a11 100644
--- a/fs/nfs/getroot.c
+++ b/fs/nfs/getroot.c
@@ -121,7 +121,7 @@ struct dentry *nfs_get_root(struct super_block *sb, struct nfs_fh *mntfh)
 	security_d_instantiate(ret, inode);
 
 	if (ret->d_op == NULL)
-		ret->d_op = server->nfs_client->rpc_ops->dentry_ops;
+		d_set_d_op(ret, server->nfs_client->rpc_ops->dentry_ops);
 out:
 	nfs_free_fattr(fsinfo.fattr);
 	return ret;
@@ -228,7 +228,7 @@ struct dentry *nfs4_get_root(struct super_block *sb, struct nfs_fh *mntfh)
 	security_d_instantiate(ret, inode);
 
 	if (ret->d_op == NULL)
-		ret->d_op = server->nfs_client->rpc_ops->dentry_ops;
+		d_set_d_op(ret, server->nfs_client->rpc_ops->dentry_ops);
 
 out:
 	nfs_free_fattr(fattr);
diff --git a/fs/ocfs2/export.c b/fs/ocfs2/export.c
index 19ad145..6adafa5 100644
--- a/fs/ocfs2/export.c
+++ b/fs/ocfs2/export.c
@@ -138,7 +138,7 @@ check_gen:
 
 	result = d_obtain_alias(inode);
 	if (!IS_ERR(result))
-		result->d_op = &ocfs2_dentry_ops;
+		d_set_d_op(result, &ocfs2_dentry_ops);
 	else
 		mlog_errno(PTR_ERR(result));
 
@@ -176,7 +176,7 @@ static struct dentry *ocfs2_get_parent(struct dentry *child)
 
 	parent = d_obtain_alias(ocfs2_iget(OCFS2_SB(dir->i_sb), blkno, 0, 0));
 	if (!IS_ERR(parent))
-		parent->d_op = &ocfs2_dentry_ops;
+		d_set_d_op(parent, &ocfs2_dentry_ops);
 
 bail_unlock:
 	ocfs2_inode_unlock(dir, 0);
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index ff5744e..d14cad6 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -147,7 +147,7 @@ static struct dentry *ocfs2_lookup(struct inode *dir, struct dentry *dentry,
 	spin_unlock(&oi->ip_lock);
 
 bail_add:
-	dentry->d_op = &ocfs2_dentry_ops;
+	d_set_d_op(dentry, &ocfs2_dentry_ops);
 	ret = d_splice_alias(inode, dentry);
 
 	if (inode) {
@@ -415,7 +415,7 @@ static int ocfs2_mknod(struct inode *dir,
 		mlog_errno(status);
 		goto leave;
 	}
-	dentry->d_op = &ocfs2_dentry_ops;
+	d_set_d_op(dentry, &ocfs2_dentry_ops);
 
 	status = ocfs2_add_entry(handle, dentry, inode,
 				 OCFS2_I(inode)->ip_blkno, parent_fe_bh,
@@ -743,7 +743,7 @@ static int ocfs2_link(struct dentry *old_dentry,
 	}
 
 	ihold(inode);
-	dentry->d_op = &ocfs2_dentry_ops;
+	d_set_d_op(dentry, &ocfs2_dentry_ops);
 	d_instantiate(dentry, inode);
 
 out_commit:
@@ -1794,7 +1794,7 @@ static int ocfs2_symlink(struct inode *dir,
 		mlog_errno(status);
 		goto bail;
 	}
-	dentry->d_op = &ocfs2_dentry_ops;
+	d_set_d_op(dentry, &ocfs2_dentry_ops);
 
 	status = ocfs2_add_entry(handle, dentry, inode,
 				 le64_to_cpu(fe->i_blkno), parent_fe_bh,
@@ -2459,7 +2459,7 @@ int ocfs2_mv_orphaned_inode_to_new(struct inode *dir,
 		goto out_commit;
 	}
 
-	dentry->d_op = &ocfs2_dentry_ops;
+	d_set_d_op(dentry, &ocfs2_dentry_ops);
 	d_instantiate(dentry, inode);
 	status = 0;
 out_commit:
diff --git a/fs/pipe.c b/fs/pipe.c
index 4ae1d76..e964d09 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1004,7 +1004,7 @@ struct file *create_write_pipe(int flags)
 		goto err_inode;
 	path.mnt = mntget(pipe_mnt);
 
-	path.dentry->d_op = &pipefs_dentry_operations;
+	d_set_d_op(path.dentry, &pipefs_dentry_operations);
 	d_instantiate(path.dentry, inode);
 
 	err = -ENFILE;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 866a41a..116d4e9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1969,7 +1969,7 @@ static struct dentry *proc_fd_instantiate(struct inode *dir,
 	inode->i_op = &proc_pid_link_inode_operations;
 	inode->i_size = 64;
 	ei->op.proc_get_link = proc_fd_link;
-	dentry->d_op = &tid_fd_dentry_operations;
+	d_set_d_op(dentry, &tid_fd_dentry_operations);
 	d_add(dentry, inode);
 	/* Close the race of the process dying before we return the dentry */
 	if (tid_fd_revalidate(dentry, NULL))
@@ -2137,7 +2137,7 @@ static struct dentry *proc_fdinfo_instantiate(struct inode *dir,
 	ei->fd = fd;
 	inode->i_mode = S_IFREG | S_IRUSR;
 	inode->i_fop = &proc_fdinfo_file_operations;
-	dentry->d_op = &tid_fd_dentry_operations;
+	d_set_d_op(dentry, &tid_fd_dentry_operations);
 	d_add(dentry, inode);
 	/* Close the race of the process dying before we return the dentry */
 	if (tid_fd_revalidate(dentry, NULL))
@@ -2196,7 +2196,7 @@ static struct dentry *proc_pident_instantiate(struct inode *dir,
 	if (p->fop)
 		inode->i_fop = p->fop;
 	ei->op = p->op;
-	dentry->d_op = &pid_dentry_operations;
+	d_set_d_op(dentry, &pid_dentry_operations);
 	d_add(dentry, inode);
 	/* Close the race of the process dying before we return the dentry */
 	if (pid_revalidate(dentry, NULL))
@@ -2615,7 +2615,7 @@ static struct dentry *proc_base_instantiate(struct inode *dir,
 	if (p->fop)
 		inode->i_fop = p->fop;
 	ei->op = p->op;
-	dentry->d_op = &proc_base_dentry_operations;
+	d_set_d_op(dentry, &proc_base_dentry_operations);
 	d_add(dentry, inode);
 	error = NULL;
 out:
@@ -2926,7 +2926,7 @@ static struct dentry *proc_pid_instantiate(struct inode *dir,
 	inode->i_nlink = 2 + pid_entry_count_dirs(tgid_base_stuff,
 		ARRAY_SIZE(tgid_base_stuff));
 
-	dentry->d_op = &pid_dentry_operations;
+	d_set_d_op(dentry, &pid_dentry_operations);
 
 	d_add(dentry, inode);
 	/* Close the race of the process dying before we return the dentry */
@@ -3169,7 +3169,7 @@ static struct dentry *proc_task_instantiate(struct inode *dir,
 	inode->i_nlink = 2 + pid_entry_count_dirs(tid_base_stuff,
 		ARRAY_SIZE(tid_base_stuff));
 
-	dentry->d_op = &pid_dentry_operations;
+	d_set_d_op(dentry, &pid_dentry_operations);
 
 	d_add(dentry, inode);
 	/* Close the race of the process dying before we return the dentry */
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 1d607be..f766be2 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -439,7 +439,7 @@ struct dentry *proc_lookup_de(struct proc_dir_entry *de, struct inode *dir,
 out_unlock:
 
 	if (inode) {
-		dentry->d_op = &proc_dentry_operations;
+		d_set_d_op(dentry, &proc_dentry_operations);
 		d_add(dentry, inode);
 		return NULL;
 	}
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 34f1fc5..a07d7f1 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -120,7 +120,7 @@ static struct dentry *proc_sys_lookup(struct inode *dir, struct dentry *dentry,
 		goto out;
 
 	err = NULL;
-	dentry->d_op = &proc_sys_dentry_operations;
+	d_set_d_op(dentry, &proc_sys_dentry_operations);
 	d_add(dentry, inode);
 
 out:
@@ -201,7 +201,7 @@ static int proc_sys_fill_cache(struct file *filp, void *dirent,
 				dput(child);
 				return -ENOMEM;
 			} else {
-				child->d_op = &proc_sys_dentry_operations;
+				d_set_d_op(child, &proc_sys_dentry_operations);
 				d_add(child, inode);
 			}
 		} else {
diff --git a/fs/reiserfs/xattr.c b/fs/reiserfs/xattr.c
index 5d04a78..e0f0d7e 100644
--- a/fs/reiserfs/xattr.c
+++ b/fs/reiserfs/xattr.c
@@ -990,7 +990,7 @@ int reiserfs_lookup_privroot(struct super_block *s)
 				strlen(PRIVROOT_NAME));
 	if (!IS_ERR(dentry)) {
 		REISERFS_SB(s)->priv_root = dentry;
-		dentry->d_op = &xattr_lookup_poison_ops;
+		d_set_d_op(dentry, &xattr_lookup_poison_ops);
 		if (dentry->d_inode)
 			dentry->d_inode->i_flags |= S_PRIVATE;
 	} else
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 27e1102..3e076ca 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -701,7 +701,7 @@ static struct dentry * sysfs_lookup(struct inode *dir, struct dentry *dentry,
 	/* instantiate and hash dentry */
 	ret = d_find_alias(inode);
 	if (!ret) {
-		dentry->d_op = &sysfs_dentry_ops;
+		d_set_d_op(dentry, &sysfs_dentry_ops);
 		dentry->d_fsdata = sysfs_get(sd);
 		d_add(dentry, inode);
 	} else {
diff --git a/fs/sysv/namei.c b/fs/sysv/namei.c
index 7507aeb..b5e68da 100644
--- a/fs/sysv/namei.c
+++ b/fs/sysv/namei.c
@@ -48,7 +48,7 @@ static struct dentry *sysv_lookup(struct inode * dir, struct dentry * dentry, st
 	struct inode * inode = NULL;
 	ino_t ino;
 
-	dentry->d_op = dir->i_sb->s_root->d_op;
+	d_set_d_op(dentry, dir->i_sb->s_root->d_op);
 	if (dentry->d_name.len > SYSV_NAMELEN)
 		return ERR_PTR(-ENAMETOOLONG);
 	ino = sysv_inode_by_name(dentry);
diff --git a/fs/sysv/super.c b/fs/sysv/super.c
index 3d9c62b..76712ae 100644
--- a/fs/sysv/super.c
+++ b/fs/sysv/super.c
@@ -346,7 +346,7 @@ static int complete_read_super(struct super_block *sb, int silent, int size)
 	if (sbi->s_forced_ro)
 		sb->s_flags |= MS_RDONLY;
 	if (sbi->s_truncate)
-		sb->s_root->d_op = &sysv_dentry_operations;
+		d_set_d_op(sb->s_root, &sysv_dentry_operations);
 	return 1;
 }
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 2fd0b45..79a89af 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -183,6 +183,11 @@ struct dentry_operations {
 #define DCACHE_GENOCIDE		0x0200
 #define DCACHE_MOUNTED		0x0400	/* is a mountpoint */
 
+#define DCACHE_OP_HASH		0x1000
+#define DCACHE_OP_COMPARE	0x2000
+#define DCACHE_OP_REVALIDATE	0x4000
+#define DCACHE_OP_DELETE	0x8000
+
 
 extern spinlock_t dcache_inode_lock;
 extern spinlock_t dcache_hash_lock;
@@ -233,6 +238,7 @@ extern void d_instantiate(struct dentry *, struct inode *);
 extern struct dentry * d_instantiate_unique(struct dentry *, struct inode *);
 extern struct dentry * d_materialise_unique(struct dentry *, struct inode *);
 extern void d_delete(struct dentry *);
+extern void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op);
 
 /* allocate/de-allocate */
 extern struct dentry * d_alloc(struct dentry *, const struct qstr *);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6e34f75..afab372 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2237,7 +2237,7 @@ static int cgroup_create_file(struct dentry *dentry, mode_t mode,
 		inode->i_size = 0;
 		inode->i_fop = &cgroup_file_operations;
 	}
-	dentry->d_op = &cgroup_dops;
+	d_set_d_op(dentry, &cgroup_dops);
 	d_instantiate(dentry, inode);
 	dget(dentry);	/* Extra count - pin the dentry in core */
 	return 0;
diff --git a/net/socket.c b/net/socket.c
index 425da53..2d8b4c8 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -368,7 +368,7 @@ static int sock_alloc_file(struct socket *sock, struct file **f, int flags)
 	}
 	path.mnt = mntget(sock_mnt);
 
-	path.dentry->d_op = &sockfs_dentry_operations;
+	d_set_d_op(path.dentry, &sockfs_dentry_operations);
 	d_instantiate(path.dentry, SOCK_INODE(sock));
 	SOCK_INODE(sock)->i_fop = &socket_file_ops;
 
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index 2899fe2..09f01f4 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -591,7 +591,7 @@ static struct dentry *__rpc_lookup_create(struct dentry *parent,
 		}
 	}
 	if (!dentry->d_inode)
-		dentry->d_op = &rpc_dentry_operations;
+		d_set_d_op(dentry, &rpc_dentry_operations);
 out_err:
 	return dentry;
 }
-- 
1.7.1
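
For readers following the conversion above, here is a minimal userspace sketch of the flag caching the patch relies on. It assumes that d_set_d_op() records which callbacks a dentry_operations table provides as DCACHE_OP_* bits in d_flags, so that lookup paths can test one word in the dentry instead of dereferencing d_op. The flag values are copied from the include/linux/dcache.h hunk; the *_mock types and the *_sketch and demo_* helpers are hypothetical stand-ins, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as added to include/linux/dcache.h by this patch. */
#define DCACHE_OP_HASH		0x1000
#define DCACHE_OP_COMPARE	0x2000
#define DCACHE_OP_REVALIDATE	0x4000
#define DCACHE_OP_DELETE	0x8000

struct dentry_mock;

/* Hypothetical, simplified stand-in for struct dentry_operations. */
struct dentry_ops_mock {
	int (*d_hash)(struct dentry_mock *);
	int (*d_compare)(struct dentry_mock *);
	int (*d_revalidate)(struct dentry_mock *);
	int (*d_delete)(struct dentry_mock *);
};

/* Hypothetical, simplified stand-in for struct dentry. */
struct dentry_mock {
	unsigned int d_flags;
	const struct dentry_ops_mock *d_op;
};

/* Assumed behaviour: set d_op once and cache "is this op present?" bits. */
static inline void set_d_op_sketch(struct dentry_mock *dentry,
				   const struct dentry_ops_mock *op)
{
	assert(dentry->d_op == NULL);	/* ops are expected to be set only once */
	dentry->d_op = op;
	if (!op)
		return;
	if (op->d_hash)
		dentry->d_flags |= DCACHE_OP_HASH;
	if (op->d_compare)
		dentry->d_flags |= DCACHE_OP_COMPARE;
	if (op->d_revalidate)
		dentry->d_flags |= DCACHE_OP_REVALIDATE;
	if (op->d_delete)
		dentry->d_flags |= DCACHE_OP_DELETE;
}

/* Lookup-side test: only d_flags is read, d_op is not dereferenced. */
static inline int needs_revalidate(const struct dentry_mock *dentry)
{
	return (dentry->d_flags & DCACHE_OP_REVALIDATE) != 0;
}

/* Trivial callback used only by the demo below. */
static int always_valid(struct dentry_mock *dentry)
{
	(void)dentry;
	return 1;
}

/* Demo: a table providing only d_revalidate sets exactly one flag bit. */
static inline unsigned int demo_flags_revalidate_only(void)
{
	static const struct dentry_ops_mock ops = { .d_revalidate = always_valid };
	struct dentry_mock dentry = { 0, NULL };

	set_d_op_sketch(&dentry, &ops);
	return dentry.d_flags;
}
```

Under these assumptions, a check such as `parent->d_flags & DCACHE_OP_HASH` replaces the two dependent loads in `parent->d_op && parent->d_op->d_hash`, which is the pattern visible throughout the hunks above.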



* [PATCH 37/46] fs: cache optimise dentry and inode for rcu-walk
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (34 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 36/46] fs: dcache reduce branches in lookup path Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 38/46] fs: prefetch inode data in dcache lookup Nick Piggin
                   ` (13 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Move the dentry and inode fields used by rcu-walk to the top of their data
structures.  This allows an RCU dentry lookup during a path walk to touch only
the first 56 bytes of the dentry.

We also fit 8 bytes of the inline name into the first 64 bytes, so for short
names, only 64 bytes need to be touched to perform the lookup.
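
The byte counts above can be checked with a small userspace sketch. It assumes an LP64 target (8-byte pointers, 4-byte unsigned int), a seqcount_t that is a single unsigned int, and a 64-byte cacheline; the *_mock types and helper functions are hypothetical stand-ins that mirror only the leading fields of the reordered struct dentry in this patch.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof, size_t */

static_assert(sizeof(void *) == 8, "sketch assumes an LP64 target");

#define CACHELINE_BYTES_MOCK	64	/* assumed cacheline size */
#define DNAME_INLINE_LEN_MOCK	32	/* 64-bit value from the patch */

/* Hypothetical stand-ins for struct hlist_node and struct qstr. */
struct hlist_node_mock {
	struct hlist_node_mock *next, **pprev;
};

struct qstr_mock {
	unsigned int hash;
	unsigned int len;
	const unsigned char *name;
};

/* Leading fields of the reordered dentry; later fields are omitted. */
struct dentry_mock {
	unsigned int d_flags;
	unsigned int d_seq;		/* stands in for seqcount_t */
	struct hlist_node_mock d_hash;
	struct dentry_mock *d_parent;
	struct qstr_mock d_name;
	void *d_inode;			/* stands in for struct inode * */
	unsigned char d_iname[DNAME_INLINE_LEN_MOCK];
};

/* Bytes occupied by the RCU-lookup fields that precede the inline name. */
static inline size_t rcu_lookup_bytes(void)
{
	return offsetof(struct dentry_mock, d_iname);
}

/* Inline-name bytes that share the first cacheline with those fields. */
static inline size_t inline_name_bytes_in_first_line(void)
{
	return CACHELINE_BYTES_MOCK - rcu_lookup_bytes();
}

/* Nonzero if a name of the given length is covered by the first cacheline. */
static inline int name_fits_first_line(size_t len)
{
	return rcu_lookup_bytes() + len <= CACHELINE_BYTES_MOCK;
}
```

With these assumptions, `rcu_lookup_bytes()` evaluates to 56 and `inline_name_bytes_in_first_line()` to 8, so a 6-character component fits in the first line while a 9-character one spills into the next.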

The inode is also rearranged so that an RCU lookup touches only a single
cacheline in the inode, plus one in the inode_operations (i_op) structure.

This is important for directory component lookups in RCU path walking. In the
kernel source tree, directory names average around 6 characters, so most of
them fit in the inline name bytes of the first cacheline.

When we reach the last element of the lookup, we need to lock it and take its
refcount, which requires another cacheline access.

XXX: verify memory accesses.
XXX: could cacheline align xxx_operations structure
XXX: could juggle things around differently -- i.e. lock and count
rather than name in the first 64 bytes.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/dcache.c            |    2 --
 include/linux/dcache.h |   32 +++++++++++++++-----------------
 include/linux/fs.h     |   40 ++++++++++++++++++++++------------------
 3 files changed, 37 insertions(+), 37 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 879e5b3..58faf37 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -84,8 +84,6 @@ EXPORT_SYMBOL(dcache_hash_lock);
 
 static struct kmem_cache *dentry_cache __read_mostly;
 
-#define DNAME_INLINE_LEN (sizeof(struct dentry)-offsetof(struct dentry,d_iname))
-
 /*
  * This is the single most critical data structure when it comes
  * to the dcache: the hashtable for lookups. Somebody should try
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 79a89af..3c5cafc 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -81,26 +81,30 @@ full_name_hash(const unsigned char *name, unsigned int len)
  * give reasonable cacheline footprint with larger lines without the
  * large memory footprint increase).
  */
-#ifdef CONFIG_64BIT /* XXX update */
-#define DNAME_INLINE_LEN_MIN 32 /* 192 bytes */
+#ifdef CONFIG_64BIT
+#define DNAME_INLINE_LEN 32 /* 192 bytes */
 #else
-#define DNAME_INLINE_LEN_MIN 40 /* 128 bytes */
+#define DNAME_INLINE_LEN 40 /* 128 bytes */
 #endif
 
 struct dentry {
-	unsigned int d_count;		/* protected by d_lock */
+	/* RCU lookup touched fields */
 	unsigned int d_flags;		/* protected by d_lock */
-	spinlock_t d_lock;		/* per dentry lock */
 	seqcount_t d_seq;		/* per dentry seqlock */
-	struct inode *d_inode;		/* Where the name belongs to - NULL is
-					 * negative */
-	/*
-	 * The next three fields are touched by __d_lookup.  Place them here
-	 * so they all fit in a cache line.
-	 */
 	struct hlist_node d_hash;	/* lookup hash list */
 	struct dentry *d_parent;	/* parent directory */
 	struct qstr d_name;
+	struct inode *d_inode;		/* Where the name belongs to - NULL is
+					 * negative */
+	unsigned char d_iname[DNAME_INLINE_LEN];	/* small names */
+
+	/* Ref lookup also touches following */
+	unsigned int d_count;		/* protected by d_lock */
+	spinlock_t d_lock;		/* per dentry lock */
+	const struct dentry_operations *d_op;
+	struct super_block *d_sb;	/* The root of the dentry tree */
+	unsigned long d_time;		/* used by d_revalidate */
+	void *d_fsdata;			/* fs-specific data */
 
 	struct list_head d_lru;		/* LRU list */
 	/*
@@ -112,12 +116,6 @@ struct dentry {
 	} d_u;
 	struct list_head d_subdirs;	/* our children */
 	struct list_head d_alias;	/* inode alias list */
-	unsigned long d_time;		/* used by d_revalidate */
-	const struct dentry_operations *d_op;
-	struct super_block *d_sb;	/* The root of the dentry tree */
-	void *d_fsdata;			/* fs-specific data */
-
-	unsigned char d_iname[DNAME_INLINE_LEN_MIN];	/* small names */
 };
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 03937b7..07e8a50 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -732,6 +732,20 @@ struct posix_acl;
 #define ACL_NOT_CACHED ((void *)(-1))
 
 struct inode {
+	/* RCU path lookup touches following: */
+	umode_t			i_mode;
+	uid_t			i_uid;
+	gid_t			i_gid;
+	const struct inode_operations	*i_op;
+	struct super_block	*i_sb;
+
+	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
+	unsigned int		i_flags;
+	struct mutex		i_mutex;
+
+	unsigned long		i_state;
+	unsigned long		dirtied_when;	/* jiffies of first dirtying */
+
 	struct hlist_node	i_hash;
 	struct list_head	i_wb_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
@@ -743,8 +757,6 @@ struct inode {
 	unsigned long		i_ino;
 	atomic_t		i_count;
 	unsigned int		i_nlink;
-	uid_t			i_uid;
-	gid_t			i_gid;
 	dev_t			i_rdev;
 	unsigned int		i_blkbits;
 	u64			i_version;
@@ -757,13 +769,8 @@ struct inode {
 	struct timespec		i_ctime;
 	blkcnt_t		i_blocks;
 	unsigned short          i_bytes;
-	umode_t			i_mode;
-	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
-	struct mutex		i_mutex;
 	struct rw_semaphore	i_alloc_sem;
-	const struct inode_operations	*i_op;
 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
-	struct super_block	*i_sb;
 	struct file_lock	*i_flock;
 	struct address_space	*i_mapping;
 	struct address_space	i_data;
@@ -784,11 +791,6 @@ struct inode {
 	struct hlist_head	i_fsnotify_marks;
 #endif
 
-	unsigned long		i_state;
-	unsigned long		dirtied_when;	/* jiffies of first dirtying */
-
-	unsigned int		i_flags;
-
 #ifdef CONFIG_IMA
 	/* protected by i_lock */
 	unsigned int		i_readcount; /* struct files open RO */
@@ -1548,8 +1550,15 @@ struct file_operations {
 };
 
 struct inode_operations {
-	int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
 	struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
+	void * (*follow_link) (struct dentry *, struct nameidata *);
+	int (*permission) (struct inode *, int);
+	int (*check_acl)(struct inode *, int);
+
+	int (*readlink) (struct dentry *, char __user *,int);
+	void (*put_link) (struct dentry *, struct nameidata *, void *);
+
+	int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
 	int (*link) (struct dentry *,struct inode *,struct dentry *);
 	int (*unlink) (struct inode *,struct dentry *);
 	int (*symlink) (struct inode *,struct dentry *,const char *);
@@ -1558,12 +1567,7 @@ struct inode_operations {
 	int (*mknod) (struct inode *,struct dentry *,int,dev_t);
 	int (*rename) (struct inode *, struct dentry *,
 			struct inode *, struct dentry *);
-	int (*readlink) (struct dentry *, char __user *,int);
-	void * (*follow_link) (struct dentry *, struct nameidata *);
-	void (*put_link) (struct dentry *, struct nameidata *, void *);
 	void (*truncate) (struct inode *);
-	int (*permission) (struct inode *, int);
-	int (*check_acl)(struct inode *, int);
 	int (*setattr) (struct dentry *, struct iattr *);
 	int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
 	int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
-- 
1.7.1



* [PATCH 38/46] fs: prefetch inode data in dcache lookup
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (35 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 37/46] fs: cache optimise dentry and inode for rcu-walk Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 39/46] fs: d_revalidate_rcu for rcu-walk Nick Piggin
                   ` (12 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

This gains another 5% or so on the cached git diff workload by
prefetching the important first cacheline of the inode while
we do the actual name compare and other operations on the dentry.

There was no measurable slowdown in the single file stat case, or
the creat case (where negative dentries would be common). (Actually
there was about a 5 nanosecond speedup in these cases, but I can't
say it is significant.)
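
For illustration only, here is a user-space sketch of the same idea: start
the prefetches for the name and the inode before doing the name compare, so
the memory loads overlap with the compare. The toy_* names are made up for
this example and are not kernel structures, and __builtin_prefetch is the
GCC/Clang builtin standing in for the kernel's prefetch().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for struct inode / struct dentry. */
struct toy_inode {
	unsigned int mode;		/* a hot field in the first cacheline */
};

struct toy_dentry {
	struct toy_dentry *next;	/* hash chain link */
	const char *name;
	size_t len;
	struct toy_inode *inode;	/* NULL for a negative dentry */
};

/*
 * Walk one hash chain. The prefetches are issued before the name
 * compare so that the inode's cacheline is (hopefully) already on
 * its way by the time the caller touches it.
 */
static struct toy_dentry *toy_lookup(struct toy_dentry *head,
				     const char *name, size_t len)
{
	struct toy_dentry *d;

	assert(name != NULL);
	for (d = head; d; d = d->next) {
		const char *tname = d->name;
		struct toy_inode *i = d->inode;

		__builtin_prefetch(tname);
		if (i)
			__builtin_prefetch(i);

		if (d->len == len && memcmp(tname, name, len) == 0)
			return d;
	}
	return NULL;
}
```

A prefetch is only a hint, so the lookup result is the same with or
without it; only the timing can change.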

Workload is 100 git diffs in sequence:
real		user		sys

vanilla single thread
0m9.753s	0m1.860s 	0m7.230s
0m9.752s	0m1.960s 	0m7.270s
0m9.754s	0m1.870s 	0m7.290s
0m9.749s	0m1.910s 	0m7.330s
0m9.750s	0m2.110s 	0m7.060s

scale single thread
0m7.678s	0m1.990s 	0m5.090s
0m7.682s	0m2.090s 	0m5.000s
0m7.681s	0m1.970s 	0m5.100s
0m7.679s	0m1.810s 	0m5.280s
0m7.679s	0m1.970s 	0m5.100s

Single threaded case has about 25% higher throughput. The kernel's
own throughput (judged by system time) is increased by about 45%.
This is incredibly significant for a single threaded performance
increase in core kernel code in 2010.
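
As a sanity check on these figures, a small sketch that averages the
single-thread samples from the table above and turns them into a throughput
gain (same work, so throughput goes as 1/time). The helper names are made up
for this example. With these samples it works out to roughly 27% on
wall-clock time and roughly 41% on system time, in the same ballpark as the
rounded numbers quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timing samples, in seconds. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Fixed amount of work, so the throughput gain is t_before / t_after - 1. */
static double throughput_gain(const double *before, const double *after,
			      size_t n)
{
	return mean(before, n) / mean(after, n) - 1.0;
}

/* Single-thread samples copied from the table above. */
static const double vanilla_real[] = { 9.753, 9.752, 9.754, 9.749, 9.750 };
static const double scale_real[]   = { 7.678, 7.682, 7.681, 7.679, 7.679 };
static const double vanilla_sys[]  = { 7.230, 7.270, 7.290, 7.330, 7.060 };
static const double scale_sys[]    = { 5.090, 5.000, 5.100, 5.280, 5.100 };
```

Calling throughput_gain(vanilla_real, scale_real, 5) gives the wall-clock
gain and throughput_gain(vanilla_sys, scale_sys, 5) the system-time gain.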

vanilla multi thread (preloadindex=true)
0m6.517s	0m1.430s 	0m20.200s
0m6.514s	0m1.360s	0m20.230s
0m6.521s	0m1.410s 	0m20.090s
0m6.519s	0m1.410s 	0m20.060s
0m6.521s	0m1.610s 	0m20.140s

scale multi thread (preloadindex=true)
0m3.301s	0m0.840s 	0m3.300s
0m3.304s	0m0.940s 	0m3.320s
0m3.291s	0m0.930s 	0m3.170s
0m3.292s	0m0.900s 	0m3.230s
0m3.277s	0m0.770s 	0m3.230s

Parallel case throughput is very nearly doubled, despite git being
unable to produce enough work to keep all CPUs busy (118% CPU used
over the duration of the test). System time shows that scalability
of path walk has already turned to shit in the vanilla kernel.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/dcache.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 58faf37..fa6e7a5 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1658,6 +1658,9 @@ seqretry:
 		tlen = dentry->d_name.len;
 		tname = dentry->d_name.name;
 		i = dentry->d_inode;
+		prefetch(tname);
+		if (i)
+			prefetch(i);
 		/*
 		 * This seqcount check is required to ensure name and
 		 * len are loaded atomically, so as not to walk off the
-- 
1.7.1



* [PATCH 39/46] fs: d_revalidate_rcu for rcu-walk
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (36 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 38/46] fs: prefetch inode data in dcache lookup Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 40/46] fs: provide rcu-walk aware permission i_ops Nick Piggin
                   ` (11 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

This seems to be the best way to get filesystems aware of RCU walking
in their d_revalidate routines. Needs more documentation if it works
out OK.
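
Roughly, the calling convention is: the filesystem's d_revalidate_rcu may be
called in rcu-walk mode, and returns -ECHILD if it cannot do its job without
blocking. The VFS then drops out of rcu-walk and calls it again in ref-walk
mode. Below is a user-space sketch of that fallback, modelled on the
d_revalidate() wrapper this patch adds to fs/namei.c. All rv_* names are
hypothetical stand-ins, and dropping rcu-walk always succeeds in this toy.

```c
#include <assert.h>
#include <errno.h>

#define RV_LOOKUP_RCU 0x1	/* hypothetical stand-in for LOOKUP_RCU */

struct rv_nd {
	unsigned int flags;
	int drops;		/* how many times we left rcu-walk */
};

struct rv_dentry {
	int needs_blocking;	/* revalidation would have to sleep */
	int valid;		/* result of a full revalidation */
};

/* Filesystem side: refuse with -ECHILD when rcu-walk cannot be honoured. */
static int rv_revalidate_rcu(struct rv_dentry *d, struct rv_nd *nd)
{
	if ((nd->flags & RV_LOOKUP_RCU) && d->needs_blocking)
		return -ECHILD;
	return d->valid;
}

/* Leave rcu-walk mode; returns 0 on success. */
static int rv_drop_rcu(struct rv_nd *nd)
{
	nd->flags &= ~RV_LOOKUP_RCU;
	nd->drops++;
	return 0;
}

/* VFS side: try rcu-walk first, fall back to ref-walk on -ECHILD. */
static int rv_d_revalidate(struct rv_dentry *d, struct rv_nd *nd)
{
	int status = rv_revalidate_rcu(d, nd);

	/* An "invalid" answer cannot be acted on in rcu-walk either. */
	if (!status && (nd->flags & RV_LOOKUP_RCU))
		status = -ECHILD;
	if (status == -ECHILD) {
		assert(nd->flags & RV_LOOKUP_RCU);
		if (rv_drop_rcu(nd))
			return status;
		status = rv_revalidate_rcu(d, nd);	/* retry in ref-walk */
	}
	return status;
}
```

A dentry that is valid and needs no blocking stays in rcu-walk the whole
time; anything else costs exactly one drop to ref-walk and one retry.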

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/dcache.c            |    2 +
 fs/ecryptfs/dentry.c   |   15 ++++++++----
 fs/fat/namei_vfat.c    |   14 ++++++------
 fs/fuse/dir.c          |    7 ++++-
 fs/namei.c             |   56 ++++++++++++++++++++++++++++++++++-------------
 fs/sysfs/dir.c         |   18 ++++++++++++--
 include/linux/dcache.h |    7 +++++-
 7 files changed, 85 insertions(+), 34 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index fa6e7a5..67a08d4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1243,6 +1243,8 @@ void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op)
 		dentry->d_flags |= DCACHE_OP_COMPARE;
 	if (op->d_revalidate)
 		dentry->d_flags |= DCACHE_OP_REVALIDATE;
+	if (op->d_revalidate_rcu)
+		dentry->d_flags |= DCACHE_OP_REVALIDATE_RCU;
 	if (op->d_delete)
 		dentry->d_flags |= DCACHE_OP_DELETE;
 
diff --git a/fs/ecryptfs/dentry.c b/fs/ecryptfs/dentry.c
index 906e803..f9a8be8 100644
--- a/fs/ecryptfs/dentry.c
+++ b/fs/ecryptfs/dentry.c
@@ -30,7 +30,7 @@
 #include "ecryptfs_kernel.h"
 
 /**
- * ecryptfs_d_revalidate - revalidate an ecryptfs dentry
+ * ecryptfs_d_revalidate_rcu - revalidate an ecryptfs dentry
  * @dentry: The ecryptfs dentry
  * @nd: The associated nameidata
  *
@@ -42,7 +42,7 @@
  * Returns 1 if valid, 0 otherwise.
  *
  */
-static int ecryptfs_d_revalidate(struct dentry *dentry, struct nameidata *nd)
+static int ecryptfs_d_revalidate_rcu(struct dentry *dentry, struct nameidata *nd)
 {
 	struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
 	struct vfsmount *lower_mnt = ecryptfs_dentry_to_lower_mnt(dentry);
@@ -50,13 +50,18 @@ static int ecryptfs_d_revalidate(struct dentry *dentry, struct nameidata *nd)
 	struct vfsmount *vfsmount_save;
 	int rc = 1;
 
-	if (!lower_dentry->d_op || !lower_dentry->d_op->d_revalidate)
+	if (!(lower_dentry->d_flags & DCACHE_OP_REVALIDATE_EITHER))
 		goto out;
+	if (nd->flags & LOOKUP_RCU)
+		return -ECHILD;
 	dentry_save = nd->path.dentry;
 	vfsmount_save = nd->path.mnt;
 	nd->path.dentry = lower_dentry;
 	nd->path.mnt = lower_mnt;
-	rc = lower_dentry->d_op->d_revalidate(lower_dentry, nd);
+	if (lower_dentry->d_flags & DCACHE_OP_REVALIDATE)
+		rc = lower_dentry->d_op->d_revalidate(lower_dentry, nd);
+	else
+		rc = lower_dentry->d_op->d_revalidate_rcu(lower_dentry, nd);
 	nd->path.dentry = dentry_save;
 	nd->path.mnt = vfsmount_save;
 	if (dentry->d_inode) {
@@ -91,6 +96,6 @@ static void ecryptfs_d_release(struct dentry *dentry)
 }
 
 const struct dentry_operations ecryptfs_dops = {
-	.d_revalidate = ecryptfs_d_revalidate,
+	.d_revalidate_rcu = ecryptfs_d_revalidate_rcu,
 	.d_release = ecryptfs_d_release,
 };
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index b721715..a169f24 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -31,7 +31,7 @@
  * If it happened, the negative dentry isn't actually negative
  * anymore.  So, drop it.
  */
-static int vfat_revalidate_shortname(struct dentry *dentry)
+static int vfat_revalidate_rcu_shortname(struct dentry *dentry)
 {
 	int ret = 1;
 	spin_lock(&dentry->d_lock);
@@ -41,15 +41,15 @@ static int vfat_revalidate_shortname(struct dentry *dentry)
 	return ret;
 }
 
-static int vfat_revalidate(struct dentry *dentry, struct nameidata *nd)
+static int vfat_revalidate_rcu(struct dentry *dentry, struct nameidata *nd)
 {
 	/* This is not negative dentry. Always valid. */
 	if (dentry->d_inode)
 		return 1;
-	return vfat_revalidate_shortname(dentry);
+	return vfat_revalidate_rcu_shortname(dentry);
 }
 
-static int vfat_revalidate_ci(struct dentry *dentry, struct nameidata *nd)
+static int vfat_revalidate_rcu_ci(struct dentry *dentry, struct nameidata *nd)
 {
 	/*
 	 * This is not negative dentry. Always valid.
@@ -81,7 +81,7 @@ static int vfat_revalidate_ci(struct dentry *dentry, struct nameidata *nd)
 			return 0;
 	}
 
-	return vfat_revalidate_shortname(dentry);
+	return vfat_revalidate_rcu_shortname(dentry);
 }
 
 /* returns the length of a struct qstr, ignoring trailing dots */
@@ -175,13 +175,13 @@ static int vfat_cmp(const struct dentry *parent,
 }
 
 static const struct dentry_operations vfat_ci_dentry_ops = {
-	.d_revalidate	= vfat_revalidate_ci,
+	.d_revalidate	= vfat_revalidate_rcu_ci,
 	.d_hash		= vfat_hashi,
 	.d_compare	= vfat_cmpi,
 };
 
 static const struct dentry_operations vfat_dentry_ops = {
-	.d_revalidate	= vfat_revalidate,
+	.d_revalidate	= vfat_revalidate_rcu,
 	.d_hash		= vfat_hash,
 	.d_compare	= vfat_cmp,
 };
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index c9a8a42..e409185 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -154,7 +154,7 @@ u64 fuse_get_attr_version(struct fuse_conn *fc)
  * the lookup once more.  If the lookup results in the same inode,
  * then refresh the attributes, timeouts and mark the dentry valid.
  */
-static int fuse_dentry_revalidate(struct dentry *entry, struct nameidata *nd)
+static int fuse_dentry_revalidate_rcu(struct dentry *entry, struct nameidata *nd)
 {
 	struct inode *inode = entry->d_inode;
 
@@ -169,6 +169,9 @@ static int fuse_dentry_revalidate(struct dentry *entry, struct nameidata *nd)
 		struct dentry *parent;
 		u64 attr_version;
 
+		if (nd->flags & LOOKUP_RCU)
+			return -ECHILD;
+
 		/* For negative dentries, always do a fresh lookup */
 		if (!inode)
 			return 0;
@@ -225,7 +228,7 @@ static int invalid_nodeid(u64 nodeid)
 }
 
 const struct dentry_operations fuse_dentry_operations = {
-	.d_revalidate	= fuse_dentry_revalidate,
+	.d_revalidate_rcu = fuse_dentry_revalidate_rcu,
 };
 
 int fuse_valid_type(int m)
diff --git a/fs/namei.c b/fs/namei.c
index 717ab13..d8f7ece 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -519,10 +519,30 @@ void release_open_intent(struct nameidata *nd)
 		fput(nd->intent.open.file);
 }
 
+static int d_revalidate(struct dentry *dentry, struct nameidata *nd)
+{
+	if (dentry->d_op->d_revalidate_rcu) {
+		int status;
+		status = dentry->d_op->d_revalidate_rcu(dentry, nd);
+		if (!status && (nd->flags & LOOKUP_RCU))
+			status = -ECHILD;
+		if (status == -ECHILD) {
+			BUG_ON(!(nd->flags & LOOKUP_RCU));
+			if (nameidata_dentry_drop_rcu(nd, dentry))
+				return status;
+			status = dentry->d_op->d_revalidate_rcu(dentry, nd);
+		}
+		return status;
+	} else
+		return dentry->d_op->d_revalidate(dentry, nd);
+}
+
 static inline struct dentry *
 do_revalidate(struct dentry *dentry, struct nameidata *nd)
 {
-	int status = dentry->d_op->d_revalidate(dentry, nd);
+	int status;
+
+	status = d_revalidate(dentry, nd);
 	if (unlikely(status <= 0)) {
 		/*
 		 * The dentry failed validation.
@@ -545,7 +565,7 @@ do_revalidate(struct dentry *dentry, struct nameidata *nd)
 
 static inline int need_reval_dot(struct dentry *dentry)
 {
-	if (likely(!(dentry->d_flags & DCACHE_OP_REVALIDATE)))
+	if (likely(!(dentry->d_flags & DCACHE_OP_REVALIDATE_EITHER)))
 		return 0;
 
 	if (likely(!(dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT)))
@@ -582,7 +602,7 @@ force_reval_path(struct path *path, struct nameidata *nd)
 	if (!need_reval_dot(dentry))
 		return 0;
 
-	status = dentry->d_op->d_revalidate(dentry, nd);
+	status = d_revalidate(dentry, nd);
 	if (status > 0)
 		return 0;
 
@@ -994,10 +1014,11 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
 			return -ECHILD;
 
 		nd->seq = seq;
-		if (dentry->d_flags & DCACHE_OP_REVALIDATE) {
-			/* XXX: RCU chokes here */
-			if (nameidata_dentry_drop_rcu(nd, dentry))
-				return -ECHILD;
+		if (dentry->d_flags & DCACHE_OP_REVALIDATE_EITHER) {
+			if (dentry->d_flags & DCACHE_OP_REVALIDATE) {
+				if (nameidata_dentry_drop_rcu(nd, dentry))
+					return -ECHILD;
+			}
 			goto need_revalidate;
 		}
 		path->mnt = mnt;
@@ -1050,8 +1071,11 @@ need_lookup:
 
 need_revalidate:
 	dentry = do_revalidate(dentry, nd);
-	if (!dentry)
+	if (!dentry) {
+		if (try_nameidata_drop_rcu(nd))
+			return -ECHILD;
 		goto need_lookup;
+	}
 	if (IS_ERR(dentry))
 		goto fail;
 	goto done;
@@ -1247,12 +1271,11 @@ return_reval:
 		 * We may need to check the cached dentry for staleness.
 		 */
 		if (need_reval_dot(nd->path.dentry)) {
-			if (try_nameidata_drop_rcu(nd))
-				return -ECHILD;
-			err = -ESTALE;
 			/* Note: we do not d_invalidate() */
-			if (!nd->path.dentry->d_op->d_revalidate(
-					nd->path.dentry, nd))
+			err = d_revalidate(nd->path.dentry, nd);
+			if (!err)
+				err = -ESTALE;
+			if (err < 0)
 				break;
 		}
 return_base:
@@ -1566,7 +1589,7 @@ static struct dentry *__lookup_hash(struct qstr *name,
 	 */
 	dentry = d_lookup(base, name);
 
-	if (dentry && (dentry->d_flags & DCACHE_OP_REVALIDATE))
+	if (dentry && (dentry->d_flags & DCACHE_OP_REVALIDATE_EITHER))
 		dentry = do_revalidate(dentry, nd);
 
 	if (!dentry)
@@ -2021,10 +2044,11 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 		dir = nd->path.dentry;
 	case LAST_DOT:
 		if (need_reval_dot(dir)) {
-			if (!dir->d_op->d_revalidate(dir, nd)) {
+			error = d_revalidate(nd->path.dentry, nd);
+			if (!error)
 				error = -ESTALE;
+			if (error < 0)
 				goto exit;
-			}
 		}
 		/* fallthrough */
 	case LAST_ROOT:
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 3e076ca..c3fff57 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -237,12 +237,21 @@ static int sysfs_dentry_delete(const struct dentry *dentry)
 	return !!(sd->s_flags & SYSFS_FLAG_REMOVED);
 }
 
-static int sysfs_dentry_revalidate(struct dentry *dentry, struct nameidata *nd)
+static int sysfs_dentry_revalidate_rcu(struct dentry *dentry, struct nameidata *nd)
 {
 	struct sysfs_dirent *sd = dentry->d_fsdata;
 	int is_dir;
 
-	mutex_lock(&sysfs_mutex);
+	if (nd->flags & LOOKUP_RCU) {
+		if (!mutex_trylock(&sysfs_mutex))
+			return -ECHILD;
+		/* ensure dentry still exists, now under the sysfs_mutex */
+		if (read_seqcount_retry(&dentry->d_seq, nd->seq)) {
+			mutex_unlock(&sysfs_mutex);
+			return -ECHILD;
+		}
+	} else
+		mutex_lock(&sysfs_mutex);
 
 	/* The sysfs dirent has been deleted */
 	if (sd->s_flags & SYSFS_FLAG_REMOVED)
@@ -272,6 +281,9 @@ out_bad:
 	 */
 	is_dir = (sysfs_type(sd) == SYSFS_DIR);
 	mutex_unlock(&sysfs_mutex);
+	if (nd->flags & LOOKUP_RCU)
+		return -ECHILD;
+
 	if (is_dir) {
 		/* If we have submounts we must allow the vfs caches
 		 * to lie about the state of the filesystem to prevent
@@ -294,7 +306,7 @@ static void sysfs_dentry_iput(struct dentry *dentry, struct inode *inode)
 }
 
 static const struct dentry_operations sysfs_dentry_ops = {
-	.d_revalidate	= sysfs_dentry_revalidate,
+	.d_revalidate_rcu = sysfs_dentry_revalidate_rcu,
 	.d_delete	= sysfs_dentry_delete,
 	.d_iput		= sysfs_dentry_iput,
 };
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3c5cafc..72f5f32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -132,6 +132,7 @@ enum dentry_d_lock_class
 
 struct dentry_operations {
 	int (*d_revalidate)(struct dentry *, struct nameidata *);
+	int (*d_revalidate_rcu)(struct dentry *, struct nameidata *);
 	int (*d_hash)(const struct dentry *, const struct inode *,
 			struct qstr *);
 	int (*d_compare)(const struct dentry *,
@@ -184,8 +185,12 @@ struct dentry_operations {
 #define DCACHE_OP_HASH		0x1000
 #define DCACHE_OP_COMPARE	0x2000
 #define DCACHE_OP_REVALIDATE	0x4000
-#define DCACHE_OP_DELETE	0x8000
+#define DCACHE_OP_REVALIDATE_RCU 0x8000
 
+#define DCACHE_OP_DELETE	0x10000
+
+#define DCACHE_OP_REVALIDATE_EITHER	\
+	(DCACHE_OP_REVALIDATE|DCACHE_OP_REVALIDATE_RCU)
 
 extern spinlock_t dcache_inode_lock;
 extern spinlock_t dcache_hash_lock;
-- 
1.7.1



* [PATCH 40/46] fs: provide rcu-walk aware permission i_ops
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (37 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 39/46] fs: d_revalidate_rcu for rcu-walk Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 41/46] fs: provide simple rcu-walk ACL implementation Nick Piggin
                   ` (10 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/namei.c         |  131 ++++++++++++++++++++++++++++++++++++++++------------
 include/linux/fs.h |    6 ++
 2 files changed, 107 insertions(+), 30 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index d8f7ece..7fa6119 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -169,8 +169,36 @@ EXPORT_SYMBOL(putname);
 /*
  * This does basic POSIX ACL permission checking
  */
-static inline int __acl_permission_check(struct inode *inode, int mask,
-		int (*check_acl)(struct inode *inode, int mask), int rcu)
+static int acl_permission_check_rcu(struct inode *inode, int mask, unsigned int flags,
+		int (*check_acl_rcu)(struct inode *inode, int mask, unsigned int flags))
+{
+	umode_t			mode = inode->i_mode;
+
+	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
+
+	if (current_fsuid() == inode->i_uid)
+		mode >>= 6;
+	else {
+		if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl_rcu) {
+			int error = check_acl_rcu(inode, mask, flags);
+			if (error != -EAGAIN)
+				return error;
+		}
+
+		if (in_group_p(inode->i_gid))
+			mode >>= 3;
+	}
+
+	/*
+	 * If the DACs are ok we don't need any capability check.
+	 */
+	if ((mask & ~mode) == 0)
+		return 0;
+	return -EACCES;
+}
+
+static int acl_permission_check(struct inode *inode, int mask, unsigned int flags,
+		int (*check_acl)(struct inode *inode, int mask))
 {
 	umode_t			mode = inode->i_mode;
 
@@ -180,7 +208,7 @@ static inline int __acl_permission_check(struct inode *inode, int mask,
 		mode >>= 6;
 	else {
 		if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl) {
-			if (rcu) {
+			if (flags) {
 				return -ECHILD;
 			} else {
 				int error = check_acl(inode, mask);
@@ -201,10 +229,52 @@ static inline int __acl_permission_check(struct inode *inode, int mask,
 	return -EACCES;
 }
 
-static inline int acl_permission_check(struct inode *inode, int mask,
-		int (*check_acl)(struct inode *inode, int mask))
+/**
+ * generic_permission_rcu  -  check for access rights on a Posix-like filesystem
+ * @inode:	inode to check access rights for
+ * @mask:	right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ * @check_acl_rcu: optional callback to check for Posix ACLs
+ * @flags:	IPERM_FLAG_ flags.
+ *
+ * Used to check for read/write/execute permissions on a file.
+ * We use "fsuid" for this, letting us set arbitrary permissions
+ * for filesystem access without changing the "normal" uids which
+ * are used for other things.
+ *
+ * generic_permission_rcu must be rcu-walk aware. It should return
+ * -ECHILD in case an rcu-walk request cannot be satisfied (eg.
+ * requires blocking or too much thought!). It would then be called
+ * again in ref-walk mode.
+ */
+int generic_permission_rcu(struct inode *inode, int mask, unsigned int flags,
+	int (*check_acl_rcu)(struct inode *inode, int mask, unsigned int flags))
 {
-	return __acl_permission_check(inode, mask, check_acl, 0);
+	int ret;
+
+	/*
+	 * Do the basic POSIX ACL permission checks.
+	 */
+	ret = acl_permission_check_rcu(inode, mask, flags, check_acl_rcu);
+	if (ret != -EACCES)
+		return ret;
+
+	/*
+	 * Read/write DACs are always overridable.
+	 * Executable DACs are overridable if at least one exec bit is set.
+	 */
+	if (!(mask & MAY_EXEC) || execute_ok(inode))
+		if (capable(CAP_DAC_OVERRIDE))
+			return 0;
+
+	/*
+	 * Searching includes executable on directories, else just read.
+	 */
+	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
+	if (mask == MAY_READ || (S_ISDIR(inode->i_mode) && !(mask & MAY_WRITE)))
+		if (capable(CAP_DAC_READ_SEARCH))
+			return 0;
+
+	return -EACCES;
 }
 
 /**
@@ -226,7 +296,7 @@ int generic_permission(struct inode *inode, int mask,
 	/*
 	 * Do the basic POSIX ACL permission checks.
 	 */
-	ret = acl_permission_check(inode, mask, check_acl);
+	ret = acl_permission_check(inode, mask, 0, check_acl);
 	if (ret != -EACCES)
 		return ret;
 
@@ -282,8 +352,14 @@ int inode_permission(struct inode *inode, int mask)
 
 	if (inode->i_op->permission)
 		retval = inode->i_op->permission(inode, mask);
+	else if (inode->i_op->permission_rcu)
+		retval = inode->i_op->permission_rcu(inode, mask, 0);
+	else if (inode->i_op->check_acl_rcu)
+		retval = generic_permission_rcu(inode, mask, 0,
+				inode->i_op->check_acl_rcu);
 	else
-		retval = generic_permission(inode, mask, inode->i_op->check_acl);
+		retval = generic_permission(inode, mask,
+				inode->i_op->check_acl);
 
 	if (retval)
 		return retval;
@@ -622,22 +698,26 @@ force_reval_path(struct path *path, struct nameidata *nd)
  * short-cut DAC fails, then call ->permission() to do more
  * complete permission check.
  */
-static inline int __exec_permission(struct inode *inode, int rcu)
+static inline int exec_permission(struct inode *inode, unsigned int flags)
 {
 	int ret;
 
-	if (inode->i_op->permission) {
-		if (rcu)
+	if (inode->i_op->permission_rcu) {
+		ret = inode->i_op->permission_rcu(inode, MAY_EXEC, flags);
+	} else if (inode->i_op->permission) {
+		if (flags)
 			return -ECHILD;
 		ret = inode->i_op->permission(inode, MAY_EXEC);
-		if (!ret)
-			goto ok;
-		return ret;
+	} else if (inode->i_op->check_acl_rcu) {
+		ret = acl_permission_check_rcu(inode, MAY_EXEC, flags,
+				inode->i_op->check_acl_rcu);
+	} else {
+		ret = acl_permission_check(inode, MAY_EXEC, flags,
+				inode->i_op->check_acl);
 	}
-	ret = __acl_permission_check(inode, MAY_EXEC, inode->i_op->check_acl, rcu);
-	if (!ret)
+	if (likely(!ret))
 		goto ok;
-	if (rcu && ret == -ECHILD)
+	if (ret == -ECHILD)
 		return ret;
 
 	if (capable(CAP_DAC_OVERRIDE) || capable(CAP_DAC_READ_SEARCH))
@@ -648,16 +728,6 @@ ok:
 	return security_inode_permission(inode, MAY_EXEC); /* XXX: ok for RCU? */
 }
 
-static int exec_permission(struct inode *inode)
-{
-	return __exec_permission(inode, 0);
-}
-
-static int exec_permission_rcu(struct inode *inode)
-{
-	return __exec_permission(inode, 1);
-}
-
 static __always_inline void set_root(struct nameidata *nd)
 {
 	if (!nd->root.mnt)
@@ -1126,7 +1196,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 
 		nd->flags |= LOOKUP_CONTINUE;
 		if (nd->flags & LOOKUP_RCU) {
-			err = exec_permission_rcu(nd->inode);
+			err = exec_permission(nd->inode, IPERM_FLAG_RCU);
 			if (err == -ECHILD) {
 				if (nameidata_drop_rcu(nd))
 					return -ECHILD;
@@ -1134,7 +1204,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 			}
 		} else {
 exec_again:
-			err = exec_permission(nd->inode);
+			err = exec_permission(nd->inode, 0);
 		}
  		if (err)
 			break;
@@ -1567,7 +1637,7 @@ static struct dentry *__lookup_hash(struct qstr *name,
 	struct dentry *dentry;
 	int err;
 
-	err = exec_permission(inode);
+	err = exec_permission(inode, 0);
 	if (err)
 		return ERR_PTR(err);
 
@@ -3352,6 +3422,7 @@ EXPORT_SYMBOL(vfs_follow_link);
 EXPORT_SYMBOL(vfs_link);
 EXPORT_SYMBOL(vfs_mkdir);
 EXPORT_SYMBOL(vfs_mknod);
+EXPORT_SYMBOL(generic_permission_rcu);
 EXPORT_SYMBOL(generic_permission);
 EXPORT_SYMBOL(vfs_readlink);
 EXPORT_SYMBOL(vfs_rename);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 07e8a50..490eedd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1549,11 +1549,15 @@ struct file_operations {
 	int (*setlease)(struct file *, long, struct file_lock **);
 };
 
+#define IPERM_FLAG_RCU	0x0001
+
 struct inode_operations {
 	struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
 	void * (*follow_link) (struct dentry *, struct nameidata *);
 	int (*permission) (struct inode *, int);
+	int (*permission_rcu) (struct inode *, int, unsigned int);
 	int (*check_acl)(struct inode *, int);
+	int (*check_acl_rcu)(struct inode *, int, unsigned int);
 
 	int (*readlink) (struct dentry *, char __user *,int);
 	void (*put_link) (struct dentry *, struct nameidata *, void *);
@@ -2164,6 +2168,8 @@ extern sector_t bmap(struct inode *, sector_t);
 #endif
 extern int notify_change(struct dentry *, struct iattr *);
 extern int inode_permission(struct inode *, int);
+extern int generic_permission_rcu(struct inode *, int, unsigned int,
+		int (*check_acl_rcu)(struct inode *, int, unsigned int));
 extern int generic_permission(struct inode *, int,
 		int (*check_acl)(struct inode *, int));
 
-- 
1.7.1



* [PATCH 41/46] fs: provide simple rcu-walk ACL implementation
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (38 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 40/46] fs: provide rcu-walk aware permission i_ops Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 42/46] kernel: add bl_list Nick Piggin
                   ` (9 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

This simple implementation just checks whether the inode has no ACLs.
If so, the rcu-walk may proceed; otherwise the rcu-walk fails.

This could easily be extended to put ACLs under RCU and check them
under seqlock, if need be. But this implementation is enough to show
the rcu-walk aware permissions code for path lookups is working, and
will handle cases where there are no ACLs or ACLs in just the final
element.

Convert tmpfs, ext2, btrfs. Each of these uses acl/permission code in
a different way, so convert them all to provide templates and proof of
concept.
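
For reference, the rcu-walk ACL check described above boils down to
something like the following user-space sketch. It assumes that
negative_cached_acl() reports true only when the cache records that the
inode has no ACL; the acl_* names, the ACL_UNKNOWN sentinel and the "fetch"
step are made up for this example. -EAGAIN means "no ACL decision, fall back
to the mode bits", and -ECHILD means "retry in ref-walk".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define ACL_FLAG_RCU	0x0001	/* hypothetical stand-in for IPERM_FLAG_RCU */

/* Sentinel for "not looked up yet", in the spirit of ACL_NOT_CACHED. */
#define ACL_UNKNOWN	((void *)(-1))

struct acl_inode {
	void *cached_acl;	/* NULL: known to have no ACL */
	void *on_disk_acl;	/* what a blocking fetch would find */
};

/* True only if the cache positively records "no ACL". */
static int acl_known_absent(const struct acl_inode *inode)
{
	return inode->cached_acl == NULL;
}

static int acl_check_rcu(struct acl_inode *inode, unsigned int flags)
{
	assert(inode != NULL);

	if (flags & ACL_FLAG_RCU) {
		/* Cannot block here: only the "no ACL" case is handled. */
		if (!acl_known_absent(inode))
			return -ECHILD;
		return -EAGAIN;
	}

	/* ref-walk: free to block, so fill the cache if needed. */
	if (inode->cached_acl == ACL_UNKNOWN)
		inode->cached_acl = inode->on_disk_acl;
	if (inode->cached_acl == NULL)
		return -EAGAIN;
	return 0;	/* toy rule: any ACL that is present grants access */
}
```

An inode whose ACL state is unknown or non-empty fails the rcu-walk with
-ECHILD and is then checked again in ref-walk, where the fetch is allowed.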

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/btrfs/acl.c              |   19 ++++++++++++-------
 fs/btrfs/ctree.h            |    4 ++--
 fs/btrfs/inode.c            |   14 +++++++-------
 fs/ext2/acl.c               |   11 +++++++++--
 fs/ext2/acl.h               |    8 ++++----
 fs/ext2/file.c              |    2 +-
 fs/ext2/namei.c             |    4 ++--
 fs/ext3/acl.c               |   11 +++++++++--
 fs/ext3/acl.h               |    8 ++++----
 fs/ext3/file.c              |    2 +-
 fs/ext3/namei.c             |    4 ++--
 fs/ext4/acl.c               |   11 +++++++++--
 fs/ext4/acl.h               |    4 ++--
 fs/ext4/file.c              |    2 +-
 fs/ext4/namei.c             |    4 ++--
 fs/generic_acl.c            |   20 ++++++++++++++++++++
 fs/xfs/linux-2.6/xfs_acl.c  |    8 +++++++-
 fs/xfs/linux-2.6/xfs_iops.c |    8 ++++----
 fs/xfs/xfs_acl.h            |    4 ++--
 include/linux/generic_acl.h |    1 +
 include/linux/posix_acl.h   |   19 +++++++++++++++++++
 mm/shmem.c                  |    6 +++---
 22 files changed, 123 insertions(+), 51 deletions(-)

diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
index 2222d16..c03c753 100644
--- a/fs/btrfs/acl.c
+++ b/fs/btrfs/acl.c
@@ -185,18 +185,23 @@ static int btrfs_xattr_acl_set(struct dentry *dentry, const char *name,
 	return ret;
 }
 
-int btrfs_check_acl(struct inode *inode, int mask)
+int btrfs_check_acl_rcu(struct inode *inode, int mask, unsigned int flags)
 {
 	struct posix_acl *acl;
 	int error = -EAGAIN;
 
-	acl = btrfs_get_acl(inode, ACL_TYPE_ACCESS);
+	if (flags & IPERM_FLAG_RCU) {
+		if (!negative_cached_acl(inode, ACL_TYPE_ACCESS))
+			return -ECHILD;
+	} else {
+		acl = btrfs_get_acl(inode, ACL_TYPE_ACCESS);
 
-	if (IS_ERR(acl))
-		return PTR_ERR(acl);
-	if (acl) {
-		error = posix_acl_permission(inode, acl, mask);
-		posix_acl_release(acl);
+		if (IS_ERR(acl))
+			return PTR_ERR(acl);
+		if (acl) {
+			error = posix_acl_permission(inode, acl, mask);
+			posix_acl_release(acl);
+		}
 	}
 
 	return error;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8db9234..403b069 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2544,9 +2544,9 @@ int btrfs_sync_fs(struct super_block *sb, int wait);
 
 /* acl.c */
 #ifdef CONFIG_BTRFS_FS_POSIX_ACL
-int btrfs_check_acl(struct inode *inode, int mask);
+int btrfs_check_acl_rcu(struct inode *inode, int mask, unsigned int flags);
 #else
-#define btrfs_check_acl NULL
+#define btrfs_check_acl_rcu NULL
 #endif
 int btrfs_init_acl(struct btrfs_trans_handle *trans,
 		   struct inode *inode, struct inode *dir);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index aab3087..c9844e1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7033,11 +7033,11 @@ static int btrfs_set_page_dirty(struct page *page)
 	return __set_page_dirty_nobuffers(page);
 }
 
-static int btrfs_permission(struct inode *inode, int mask)
+static int btrfs_permission_rcu(struct inode *inode, int mask, unsigned int flags)
 {
 	if ((BTRFS_I(inode)->flags & BTRFS_INODE_READONLY) && (mask & MAY_WRITE))
 		return -EACCES;
-	return generic_permission(inode, mask, btrfs_check_acl);
+	return generic_permission_rcu(inode, mask, flags, btrfs_check_acl_rcu);
 }
 
 static const struct inode_operations btrfs_dir_inode_operations = {
@@ -7056,11 +7056,11 @@ static const struct inode_operations btrfs_dir_inode_operations = {
 	.getxattr	= btrfs_getxattr,
 	.listxattr	= btrfs_listxattr,
 	.removexattr	= btrfs_removexattr,
-	.permission	= btrfs_permission,
+	.permission_rcu	= btrfs_permission_rcu,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
 	.lookup		= btrfs_lookup,
-	.permission	= btrfs_permission,
+	.permission_rcu	= btrfs_permission_rcu,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -7129,14 +7129,14 @@ static const struct inode_operations btrfs_file_inode_operations = {
 	.getxattr	= btrfs_getxattr,
 	.listxattr      = btrfs_listxattr,
 	.removexattr	= btrfs_removexattr,
-	.permission	= btrfs_permission,
+	.permission_rcu	= btrfs_permission_rcu,
 	.fallocate	= btrfs_fallocate,
 	.fiemap		= btrfs_fiemap,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
 	.getattr	= btrfs_getattr,
 	.setattr	= btrfs_setattr,
-	.permission	= btrfs_permission,
+	.permission_rcu	= btrfs_permission_rcu,
 	.setxattr	= btrfs_setxattr,
 	.getxattr	= btrfs_getxattr,
 	.listxattr	= btrfs_listxattr,
@@ -7146,7 +7146,7 @@ static const struct inode_operations btrfs_symlink_inode_operations = {
 	.readlink	= generic_readlink,
 	.follow_link	= page_follow_link_light,
 	.put_link	= page_put_link,
-	.permission	= btrfs_permission,
+	.permission_rcu	= btrfs_permission_rcu,
 	.setxattr	= btrfs_setxattr,
 	.getxattr	= btrfs_getxattr,
 	.listxattr	= btrfs_listxattr,
diff --git a/fs/ext2/acl.c b/fs/ext2/acl.c
index 2bcc043..43196d7 100644
--- a/fs/ext2/acl.c
+++ b/fs/ext2/acl.c
@@ -232,10 +232,17 @@ ext2_set_acl(struct inode *inode, int type, struct posix_acl *acl)
 }
 
 int
-ext2_check_acl(struct inode *inode, int mask)
+ext2_check_acl_rcu(struct inode *inode, int mask, unsigned int flags)
 {
-	struct posix_acl *acl = ext2_get_acl(inode, ACL_TYPE_ACCESS);
+	struct posix_acl *acl;
+
+	if (flags & IPERM_FLAG_RCU) {
+		if (!negative_cached_acl(inode, ACL_TYPE_ACCESS))
+			return -ECHILD;
+		return -EAGAIN;
+	}
 
+	acl = ext2_get_acl(inode, ACL_TYPE_ACCESS);
 	if (IS_ERR(acl))
 		return PTR_ERR(acl);
 	if (acl) {
diff --git a/fs/ext2/acl.h b/fs/ext2/acl.h
index 3ff6cbb..fb49393 100644
--- a/fs/ext2/acl.h
+++ b/fs/ext2/acl.h
@@ -54,15 +54,15 @@ static inline int ext2_acl_count(size_t size)
 #ifdef CONFIG_EXT2_FS_POSIX_ACL
 
 /* acl.c */
-extern int ext2_check_acl (struct inode *, int);
+extern int ext2_check_acl_rcu(struct inode *, int, unsigned int);
 extern int ext2_acl_chmod (struct inode *);
 extern int ext2_init_acl (struct inode *, struct inode *);
 
 #else
 #include <linux/sched.h>
-#define ext2_check_acl	NULL
-#define ext2_get_acl	NULL
-#define ext2_set_acl	NULL
+#define ext2_check_acl_rcu	NULL
+#define ext2_get_acl		NULL
+#define ext2_set_acl		NULL
 
 static inline int
 ext2_acl_chmod (struct inode *inode)
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 49eec94..a34fc5a 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -102,6 +102,6 @@ const struct inode_operations ext2_file_inode_operations = {
 	.removexattr	= generic_removexattr,
 #endif
 	.setattr	= ext2_setattr,
-	.check_acl	= ext2_check_acl,
+	.check_acl_rcu	= ext2_check_acl_rcu,
 	.fiemap		= ext2_fiemap,
 };
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index f8aecd2..5366e02 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -417,7 +417,7 @@ const struct inode_operations ext2_dir_inode_operations = {
 	.removexattr	= generic_removexattr,
 #endif
 	.setattr	= ext2_setattr,
-	.check_acl	= ext2_check_acl,
+	.check_acl_rcu	= ext2_check_acl_rcu,
 };
 
 const struct inode_operations ext2_special_inode_operations = {
@@ -428,5 +428,5 @@ const struct inode_operations ext2_special_inode_operations = {
 	.removexattr	= generic_removexattr,
 #endif
 	.setattr	= ext2_setattr,
-	.check_acl	= ext2_check_acl,
+	.check_acl_rcu	= ext2_check_acl_rcu,
 };
diff --git a/fs/ext3/acl.c b/fs/ext3/acl.c
index 8a11fe2..cff7c13 100644
--- a/fs/ext3/acl.c
+++ b/fs/ext3/acl.c
@@ -240,10 +240,17 @@ ext3_set_acl(handle_t *handle, struct inode *inode, int type,
 }
 
 int
-ext3_check_acl(struct inode *inode, int mask)
+ext3_check_acl_rcu(struct inode *inode, int mask, unsigned int flags)
 {
-	struct posix_acl *acl = ext3_get_acl(inode, ACL_TYPE_ACCESS);
+	struct posix_acl *acl;
+
+	if (flags & IPERM_FLAG_RCU) {
+		if (!negative_cached_acl(inode, ACL_TYPE_ACCESS))
+			return -ECHILD;
+		return -EAGAIN;
+	}
 
+	acl = ext3_get_acl(inode, ACL_TYPE_ACCESS);
 	if (IS_ERR(acl))
 		return PTR_ERR(acl);
 	if (acl) {
diff --git a/fs/ext3/acl.h b/fs/ext3/acl.h
index 5973346..f2536e6 100644
--- a/fs/ext3/acl.h
+++ b/fs/ext3/acl.h
@@ -54,13 +54,13 @@ static inline int ext3_acl_count(size_t size)
 #ifdef CONFIG_EXT3_FS_POSIX_ACL
 
 /* acl.c */
-extern int ext3_check_acl (struct inode *, int);
-extern int ext3_acl_chmod (struct inode *);
-extern int ext3_init_acl (handle_t *, struct inode *, struct inode *);
+extern int ext3_check_acl_rcu(struct inode *, int, unsigned int);
+extern int ext3_acl_chmod(struct inode *);
+extern int ext3_init_acl(handle_t *, struct inode *, struct inode *);
 
 #else  /* CONFIG_EXT3_FS_POSIX_ACL */
 #include <linux/sched.h>
-#define ext3_check_acl NULL
+#define ext3_check_acl_rcu NULL
 
 static inline int
 ext3_acl_chmod(struct inode *inode)
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index f55df0e..74e5ef7 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -79,7 +79,7 @@ const struct inode_operations ext3_file_inode_operations = {
 	.listxattr	= ext3_listxattr,
 	.removexattr	= generic_removexattr,
 #endif
-	.check_acl	= ext3_check_acl,
+	.check_acl_rcu	= ext3_check_acl_rcu,
 	.fiemap		= ext3_fiemap,
 };
 
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index bce9dce..9b10438 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -2464,7 +2464,7 @@ const struct inode_operations ext3_dir_inode_operations = {
 	.listxattr	= ext3_listxattr,
 	.removexattr	= generic_removexattr,
 #endif
-	.check_acl	= ext3_check_acl,
+	.check_acl_rcu	= ext3_check_acl_rcu,
 };
 
 const struct inode_operations ext3_special_inode_operations = {
@@ -2475,5 +2475,5 @@ const struct inode_operations ext3_special_inode_operations = {
 	.listxattr	= ext3_listxattr,
 	.removexattr	= generic_removexattr,
 #endif
-	.check_acl	= ext3_check_acl,
+	.check_acl_rcu	= ext3_check_acl_rcu,
 };
diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index 5e2ed45..ebc7127 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -238,10 +238,17 @@ ext4_set_acl(handle_t *handle, struct inode *inode, int type,
 }
 
 int
-ext4_check_acl(struct inode *inode, int mask)
+ext4_check_acl_rcu(struct inode *inode, int mask, unsigned int flags)
 {
-	struct posix_acl *acl = ext4_get_acl(inode, ACL_TYPE_ACCESS);
+	struct posix_acl *acl;
+
+	if (flags & IPERM_FLAG_RCU) {
+		if (!negative_cached_acl(inode, ACL_TYPE_ACCESS))
+			return -ECHILD;
+		return -EAGAIN;
+	}
 
+	acl = ext4_get_acl(inode, ACL_TYPE_ACCESS);
 	if (IS_ERR(acl))
 		return PTR_ERR(acl);
 	if (acl) {
diff --git a/fs/ext4/acl.h b/fs/ext4/acl.h
index 9d843d5..71e8da7 100644
--- a/fs/ext4/acl.h
+++ b/fs/ext4/acl.h
@@ -54,13 +54,13 @@ static inline int ext4_acl_count(size_t size)
 #ifdef CONFIG_EXT4_FS_POSIX_ACL
 
 /* acl.c */
-extern int ext4_check_acl(struct inode *, int);
+extern int ext4_check_acl_rcu(struct inode *, int, unsigned int);
 extern int ext4_acl_chmod(struct inode *);
 extern int ext4_init_acl(handle_t *, struct inode *, struct inode *);
 
 #else  /* CONFIG_EXT4_FS_POSIX_ACL */
 #include <linux/sched.h>
-#define ext4_check_acl NULL
+#define ext4_check_acl_rcu NULL
 
 static inline int
 ext4_acl_chmod(struct inode *inode)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 5a5c55d..f6be0af 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -200,7 +200,7 @@ const struct inode_operations ext4_file_inode_operations = {
 	.listxattr	= ext4_listxattr,
 	.removexattr	= generic_removexattr,
 #endif
-	.check_acl	= ext4_check_acl,
+	.check_acl_rcu	= ext4_check_acl_rcu,
 	.fallocate	= ext4_fallocate,
 	.fiemap		= ext4_fiemap,
 };
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 92203b8..6145f55 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2511,7 +2511,7 @@ const struct inode_operations ext4_dir_inode_operations = {
 	.listxattr	= ext4_listxattr,
 	.removexattr	= generic_removexattr,
 #endif
-	.check_acl	= ext4_check_acl,
+	.check_acl_rcu	= ext4_check_acl_rcu,
 	.fiemap         = ext4_fiemap,
 };
 
@@ -2523,5 +2523,5 @@ const struct inode_operations ext4_special_inode_operations = {
 	.listxattr	= ext4_listxattr,
 	.removexattr	= generic_removexattr,
 #endif
-	.check_acl	= ext4_check_acl,
+	.check_acl_rcu	= ext4_check_acl_rcu,
 };
diff --git a/fs/generic_acl.c b/fs/generic_acl.c
index 6bc9e3a..3b250e5 100644
--- a/fs/generic_acl.c
+++ b/fs/generic_acl.c
@@ -190,6 +190,26 @@ generic_acl_chmod(struct inode *inode)
 }
 
 int
+generic_check_acl_rcu(struct inode *inode, int mask, unsigned int flags)
+{
+	if (flags & IPERM_FLAG_RCU) {
+		if (!negative_cached_acl(inode, ACL_TYPE_ACCESS))
+			return -ECHILD;
+	} else {
+		struct posix_acl *acl;
+
+		acl = get_cached_acl(inode, ACL_TYPE_ACCESS);
+
+		if (acl) {
+			int error = posix_acl_permission(inode, acl, mask);
+			posix_acl_release(acl);
+			return error;
+		}
+	}
+	return -EAGAIN;
+}
+
+int
 generic_check_acl(struct inode *inode, int mask)
 {
 	struct posix_acl *acl = get_cached_acl(inode, ACL_TYPE_ACCESS);
diff --git a/fs/xfs/linux-2.6/xfs_acl.c b/fs/xfs/linux-2.6/xfs_acl.c
index b277186..70053f6 100644
--- a/fs/xfs/linux-2.6/xfs_acl.c
+++ b/fs/xfs/linux-2.6/xfs_acl.c
@@ -219,7 +219,7 @@ xfs_set_acl(struct inode *inode, int type, struct posix_acl *acl)
 }
 
 int
-xfs_check_acl(struct inode *inode, int mask)
+xfs_check_acl_rcu(struct inode *inode, int mask, unsigned int flags)
 {
 	struct xfs_inode *ip = XFS_I(inode);
 	struct posix_acl *acl;
@@ -234,6 +234,12 @@ xfs_check_acl(struct inode *inode, int mask)
 	if (!XFS_IFORK_Q(ip))
 		return -EAGAIN;
 
+	if (flags & IPERM_FLAG_RCU) {
+		if (!negative_cached_acl(inode, ACL_TYPE_ACCESS))
+			return -ECHILD;
+		return -EAGAIN;
+	}
+
 	acl = xfs_get_acl(inode, ACL_TYPE_ACCESS);
 	if (IS_ERR(acl))
 		return PTR_ERR(acl);
diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index 94d5fd6..c8751c7 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -643,7 +643,7 @@ xfs_vn_fiemap(
 }
 
 static const struct inode_operations xfs_inode_operations = {
-	.check_acl		= xfs_check_acl,
+	.check_acl_rcu		= xfs_check_acl_rcu,
 	.getattr		= xfs_vn_getattr,
 	.setattr		= xfs_vn_setattr,
 	.setxattr		= generic_setxattr,
@@ -670,7 +670,7 @@ static const struct inode_operations xfs_dir_inode_operations = {
 	.rmdir			= xfs_vn_unlink,
 	.mknod			= xfs_vn_mknod,
 	.rename			= xfs_vn_rename,
-	.check_acl		= xfs_check_acl,
+	.check_acl_rcu		= xfs_check_acl_rcu,
 	.getattr		= xfs_vn_getattr,
 	.setattr		= xfs_vn_setattr,
 	.setxattr		= generic_setxattr,
@@ -695,7 +695,7 @@ static const struct inode_operations xfs_dir_ci_inode_operations = {
 	.rmdir			= xfs_vn_unlink,
 	.mknod			= xfs_vn_mknod,
 	.rename			= xfs_vn_rename,
-	.check_acl		= xfs_check_acl,
+	.check_acl_rcu		= xfs_check_acl_rcu,
 	.getattr		= xfs_vn_getattr,
 	.setattr		= xfs_vn_setattr,
 	.setxattr		= generic_setxattr,
@@ -708,7 +708,7 @@ static const struct inode_operations xfs_symlink_inode_operations = {
 	.readlink		= generic_readlink,
 	.follow_link		= xfs_vn_follow_link,
 	.put_link		= xfs_vn_put_link,
-	.check_acl		= xfs_check_acl,
+	.check_acl_rcu		= xfs_check_acl_rcu,
 	.getattr		= xfs_vn_getattr,
 	.setattr		= xfs_vn_setattr,
 	.setxattr		= generic_setxattr,
diff --git a/fs/xfs/xfs_acl.h b/fs/xfs/xfs_acl.h
index 0135e2a..6560d56 100644
--- a/fs/xfs/xfs_acl.h
+++ b/fs/xfs/xfs_acl.h
@@ -42,7 +42,7 @@ struct xfs_acl {
 #define SGI_ACL_DEFAULT_SIZE	(sizeof(SGI_ACL_DEFAULT)-1)
 
 #ifdef CONFIG_XFS_POSIX_ACL
-extern int xfs_check_acl(struct inode *inode, int mask);
+extern int xfs_check_acl_rcu(struct inode *inode, int mask, unsigned int flags);
 extern struct posix_acl *xfs_get_acl(struct inode *inode, int type);
 extern int xfs_inherit_acl(struct inode *inode, struct posix_acl *default_acl);
 extern int xfs_acl_chmod(struct inode *inode);
@@ -52,7 +52,7 @@ extern int posix_acl_default_exists(struct inode *inode);
 extern const struct xattr_handler xfs_xattr_acl_access_handler;
 extern const struct xattr_handler xfs_xattr_acl_default_handler;
 #else
-# define xfs_check_acl					NULL
+# define xfs_check_acl_rcu					NULL
 # define xfs_get_acl(inode, type)			NULL
 # define xfs_inherit_acl(inode, default_acl)		0
 # define xfs_acl_chmod(inode)				0
diff --git a/include/linux/generic_acl.h b/include/linux/generic_acl.h
index 574bea4..b57e89c 100644
--- a/include/linux/generic_acl.h
+++ b/include/linux/generic_acl.h
@@ -11,5 +11,6 @@ extern const struct xattr_handler generic_acl_default_handler;
 int generic_acl_init(struct inode *, struct inode *);
 int generic_acl_chmod(struct inode *);
 int generic_check_acl(struct inode *inode, int mask);
+int generic_check_acl_rcu(struct inode *inode, int mask, unsigned int flags);
 
 #endif /* LINUX_GENERIC_ACL_H */
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 6760816..d68283a 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -108,6 +108,25 @@ static inline struct posix_acl *get_cached_acl(struct inode *inode, int type)
 	return acl;
 }
 
+static inline int negative_cached_acl(struct inode *inode, int type)
+{
+	struct posix_acl **p, *acl;
+	switch (type) {
+	case ACL_TYPE_ACCESS:
+		p = &inode->i_acl;
+		break;
+	case ACL_TYPE_DEFAULT:
+		p = &inode->i_default_acl;
+		break;
+	default:
+		BUG();
+	}
+	acl = ACCESS_ONCE(*p);
+	if (acl)
+		return 0;
+	return 1;
+}
+
 static inline void set_cached_acl(struct inode *inode,
 				  int type,
 				  struct posix_acl *acl)
diff --git a/mm/shmem.c b/mm/shmem.c
index 5ee67c9..8f40d25 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2485,7 +2485,7 @@ static const struct inode_operations shmem_inode_operations = {
 	.getxattr	= generic_getxattr,
 	.listxattr	= generic_listxattr,
 	.removexattr	= generic_removexattr,
-	.check_acl	= generic_check_acl,
+	.check_acl_rcu	= generic_check_acl_rcu,
 #endif
 
 };
@@ -2508,7 +2508,7 @@ static const struct inode_operations shmem_dir_inode_operations = {
 	.getxattr	= generic_getxattr,
 	.listxattr	= generic_listxattr,
 	.removexattr	= generic_removexattr,
-	.check_acl	= generic_check_acl,
+	.check_acl_rcu	= generic_check_acl_rcu,
 #endif
 };
 
@@ -2519,7 +2519,7 @@ static const struct inode_operations shmem_special_inode_operations = {
 	.getxattr	= generic_getxattr,
 	.listxattr	= generic_listxattr,
 	.removexattr	= generic_removexattr,
-	.check_acl	= generic_check_acl,
+	.check_acl_rcu	= generic_check_acl_rcu,
 #endif
 };
 
-- 
1.7.1



* [PATCH 42/46] kernel: add bl_list
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (39 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 41/46] fs: provide simple rcu-walk ACL implementation Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 43/46] bit_spinlock: add required includes Nick Piggin
                   ` (8 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Introduce a type of hlist that supports using the lowest bit of the
hlist_head as a lock bit. This will subsequently be used to implement
per-bucket bit spinlocks for the inode and dentry hashes, and may be useful
in other cases such as network hashes.
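
As a rough userspace illustration of the idea (not the code in this patch),
the sketch below keeps a lock bit in bit 0 of the head pointer and masks it
off whenever the list is walked. All names (bl_head, bl_node, bl_lock, ...)
are hypothetical, and the lock helpers are plain non-atomic stand-ins for
bit_spin_lock()/__bit_spin_unlock(), so this shows only the pointer tagging,
not real mutual exclusion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BL_LOCKMASK ((uintptr_t)1)

struct bl_node {
	struct bl_node *next, **pprev;
};

struct bl_head {
	struct bl_node *first;	/* bit 0 doubles as the lock bit */
};

/* Non-atomic stand-in for bit_spin_lock(): just sets bit 0. */
static inline void bl_lock(struct bl_head *h)
{
	assert(!((uintptr_t)h->first & BL_LOCKMASK));
	h->first = (struct bl_node *)((uintptr_t)h->first | BL_LOCKMASK);
}

/* Non-atomic stand-in for __bit_spin_unlock(): clears bit 0. */
static inline void bl_unlock(struct bl_head *h)
{
	assert((uintptr_t)h->first & BL_LOCKMASK);
	h->first = (struct bl_node *)((uintptr_t)h->first & ~BL_LOCKMASK);
}

/* Return the first node with the lock bit masked off. */
static inline struct bl_node *bl_first(const struct bl_head *h)
{
	return (struct bl_node *)((uintptr_t)h->first & ~BL_LOCKMASK);
}

/* Add at the head; the caller must hold the lock, so bit 0 stays set. */
static inline void bl_add_head(struct bl_node *n, struct bl_head *h)
{
	struct bl_node *first = bl_first(h);

	assert((uintptr_t)h->first & BL_LOCKMASK);
	n->next = first;
	if (first)
		first->pprev = &n->next;
	n->pprev = &h->first;
	h->first = (struct bl_node *)((uintptr_t)n | BL_LOCKMASK);
}

/* Unlink a node; pprev may point at head->first, so keep the lock bit. */
static inline void bl_del(struct bl_node *n)
{
	struct bl_node *next = n->next;
	struct bl_node **pprev = n->pprev;

	*pprev = (struct bl_node *)((uintptr_t)next |
				    ((uintptr_t)*pprev & BL_LOCKMASK));
	if (next)
		next->pprev = pprev;
	n->next = NULL;
	n->pprev = NULL;
}
```

The point of the design is that the head stays one pointer wide: nodes are
at least pointer-aligned, so bit 0 of a valid node address is always clear
and can carry the lock without growing the hash table.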

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 include/linux/list_bl.h    |  141 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/rculist_bl.h |  128 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 269 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/list_bl.h
 create mode 100644 include/linux/rculist_bl.h

diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
new file mode 100644
index 0000000..c2034b9
--- /dev/null
+++ b/include/linux/list_bl.h
@@ -0,0 +1,141 @@
+#ifndef _LINUX_LIST_BL_H
+#define _LINUX_LIST_BL_H
+
+#include <linux/list.h>
+#include <linux/bit_spinlock.h>
+
+/*
+ * Special version of lists, where the head of the list has a bit spinlock
+ * in the lowest bit. This is useful for scalable hash tables without
+ * increasing the memory footprint.
+ *
+ * For modification operations, bit 0 of the hlist_bl_head->first
+ * pointer must be set, i.e. the bit lock must be held.
+ */
+
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define LIST_BL_LOCKMASK	1UL
+#else
+#define LIST_BL_LOCKMASK	0UL
+#endif
+
+#ifdef CONFIG_DEBUG_LIST
+#define LIST_BL_BUG_ON(x) BUG_ON(x)
+#else
+#define LIST_BL_BUG_ON(x)
+#endif
+
+
+struct hlist_bl_head {
+	struct hlist_bl_node *first;
+};
+
+struct hlist_bl_node {
+	struct hlist_bl_node *next, **pprev;
+};
+#define INIT_HLIST_BL_HEAD(ptr) \
+	((ptr)->first = NULL)
+
+static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
+{
+	h->next = NULL;
+	h->pprev = NULL;
+}
+
+#define hlist_bl_entry(ptr, type, member) container_of(ptr,type,member)
+
+static inline int hlist_bl_unhashed(const struct hlist_bl_node *h)
+{
+	return !h->pprev;
+}
+
+static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
+{
+	return (struct hlist_bl_node *)
+		((unsigned long)h->first & ~LIST_BL_LOCKMASK);
+}
+
+static inline void hlist_bl_set_first(struct hlist_bl_head *h,
+					struct hlist_bl_node *n)
+{
+	LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+	LIST_BL_BUG_ON(!bit_spin_is_locked(0, (unsigned long *)&h->first));
+	h->first = (struct hlist_bl_node *)((unsigned long)n | LIST_BL_LOCKMASK);
+}
+
+static inline int hlist_bl_empty(const struct hlist_bl_head *h)
+{
+	return !((unsigned long)h->first & ~LIST_BL_LOCKMASK);
+}
+
+static inline void hlist_bl_add_head(struct hlist_bl_node *n,
+					struct hlist_bl_head *h)
+{
+	struct hlist_bl_node *first = hlist_bl_first(h);
+
+	n->next = first;
+	if (first)
+		first->pprev = &n->next;
+	n->pprev = &h->first;
+	hlist_bl_set_first(h, n);
+}
+
+static inline void __hlist_bl_del(struct hlist_bl_node *n)
+{
+	struct hlist_bl_node *next = n->next;
+	struct hlist_bl_node **pprev = n->pprev;
+
+	LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+
+	/* pprev may be `first`, so be careful not to lose the lock bit */
+	*pprev = (struct hlist_bl_node *)
+			((unsigned long)next |
+			 ((unsigned long)*pprev & LIST_BL_LOCKMASK));
+	if (next)
+		next->pprev = pprev;
+}
+
+static inline void hlist_bl_del(struct hlist_bl_node *n)
+{
+	__hlist_bl_del(n);
+	n->next = LIST_POISON1;
+	n->pprev = LIST_POISON2;
+}
+
+static inline void hlist_bl_del_init(struct hlist_bl_node *n)
+{
+	if (!hlist_bl_unhashed(n)) {
+		__hlist_bl_del(n);
+		INIT_HLIST_BL_NODE(n);
+	}
+}
+
+/**
+ * hlist_bl_for_each_entry	- iterate over list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_bl_node to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_bl_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry(tpos, pos, head, member)		\
+	for (pos = hlist_bl_first(head);				\
+	     pos &&							\
+		({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = pos->next)
+
+/**
+ * hlist_bl_for_each_entry_safe - iterate over list of given type safe against removal of list entry
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_bl_node to use as a loop cursor.
+ * @n:		another &struct hlist_bl_node to use as temporary storage
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_bl_node within the struct.
+ */
+#define hlist_bl_for_each_entry_safe(tpos, pos, n, head, member)	 \
+	for (pos = hlist_bl_first(head);				 \
+	     pos && ({ n = pos->next; 1; }) && 				 \
+		({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = n)
+
+#endif
diff --git a/include/linux/rculist_bl.h b/include/linux/rculist_bl.h
new file mode 100644
index 0000000..cdfb54e
--- /dev/null
+++ b/include/linux/rculist_bl.h
@@ -0,0 +1,128 @@
+#ifndef _LINUX_RCULIST_BL_H
+#define _LINUX_RCULIST_BL_H
+
+/*
+ * RCU-protected bl list version. See include/linux/list_bl.h.
+ */
+#include <linux/list_bl.h>
+#include <linux/rcupdate.h>
+#include <linux/bit_spinlock.h>
+
+static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h,
+					struct hlist_bl_node *n)
+{
+	LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
+	LIST_BL_BUG_ON(!bit_spin_is_locked(0, (unsigned long *)&h->first));
+	rcu_assign_pointer(h->first,
+		(struct hlist_bl_node *)((unsigned long)n | LIST_BL_LOCKMASK));
+}
+
+static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
+{
+	return (struct hlist_bl_node *)
+		((unsigned long)rcu_dereference(h->first) & ~LIST_BL_LOCKMASK);
+}
+
+/**
+ * hlist_bl_del_init_rcu - deletes entry from hash list with re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_bl_unhashed() on the node returns true after this. It is
+ * useful for RCU based read lockfree traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we can not poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so hlist_bl_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_bl_add_head_rcu() or
+ * hlist_bl_del_rcu(), running on this same list.  However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_bl_for_each_entry_rcu().
+ */
+static inline void hlist_bl_del_init_rcu(struct hlist_bl_node *n)
+{
+	if (!hlist_bl_unhashed(n)) {
+		__hlist_bl_del(n);
+		n->pprev = NULL;
+	}
+}
+
+/**
+ * hlist_bl_del_rcu - deletes entry from hash list without re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: hlist_bl_unhashed() on entry does not return true after this,
+ * the entry is in an undefined state. It is useful for RCU based
+ * lockfree traversal.
+ *
+ * In particular, it means that we can not poison the forward
+ * pointers that may still be used for walking the hash list.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry_rcu().
+ */
+static inline void hlist_bl_del_rcu(struct hlist_bl_node *n)
+{
+	__hlist_bl_del(n);
+	n->pprev = LIST_POISON2;
+}
+
+/**
+ * hlist_bl_add_head_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the specified hlist_bl,
+ * while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_bl_add_head_rcu()
+ * or hlist_bl_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_bl_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs.  Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static inline void hlist_bl_add_head_rcu(struct hlist_bl_node *n,
+					struct hlist_bl_head *h)
+{
+	struct hlist_bl_node *first;
+
+	/* don't need hlist_bl_first_rcu because we're under lock */
+	first = hlist_bl_first(h);
+
+	n->next = first;
+	if (first)
+		first->pprev = &n->next;
+	n->pprev = &h->first;
+
+	/* need _rcu because we can have concurrent lock free readers */
+	hlist_bl_set_first_rcu(h, n);
+}
+/**
+ * hlist_bl_for_each_entry_rcu - iterate over rcu list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_bl_node to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_bl_node within the struct.
+ *
+ */
+#define hlist_bl_for_each_entry_rcu(tpos, pos, head, member)		\
+	for (pos = hlist_bl_first_rcu(head);				\
+		pos &&							\
+		({ tpos = hlist_bl_entry(pos, typeof(*tpos), member); 1; }); \
+		pos = rcu_dereference_raw(pos->next))
+
+#endif
-- 
1.7.1



* [PATCH 43/46] bit_spinlock: add required includes
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (40 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 42/46] kernel: add bl_list Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 44/46] fs: dcache per-bucket dcache hash locking Nick Piggin
                   ` (7 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 include/linux/bit_spinlock.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
index 7113a32..e612575 100644
--- a/include/linux/bit_spinlock.h
+++ b/include/linux/bit_spinlock.h
@@ -1,6 +1,10 @@
 #ifndef __LINUX_BIT_SPINLOCK_H
 #define __LINUX_BIT_SPINLOCK_H
 
+#include <linux/kernel.h>
+#include <linux/preempt.h>
+#include <asm/atomic.h>
+
 /*
  *  bit-based spin_lock()
  *
-- 
1.7.1



* [PATCH 44/46] fs: dcache per-bucket dcache hash locking
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (41 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 43/46] bit_spinlock: add required includes Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 45/46] fs: dcache per-inode inode alias locking Nick Piggin
                   ` (6 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

We can turn the dcache hash locking from a global dcache_hash_lock into
per-bucket locking.
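
To illustrate what per-bucket locking buys, here is a minimal userspace
sketch using C11 atomics. It is not the dcache code: the table size, the
entry layout and every name below (bucket_of, bucket_lock, table_insert,
...) are assumptions made for the example. Each bucket head carries its own
lock in bit 0, so writers hashing to different buckets never contend. The
lockless lookup is only safe here because entries are never removed or
freed; in the kernel that guarantee comes from RCU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 8u			/* hypothetical size, power of two */
#define LOCKBIT  ((uintptr_t)1)

struct entry {
	struct entry *next;
	unsigned long key;
};

struct bucket {
	atomic_uintptr_t head;		/* entry pointer, bit 0 is the lock */
};

static struct bucket table[NBUCKETS];

static inline struct bucket *bucket_of(unsigned long key)
{
	return &table[key & (NBUCKETS - 1)];
}

/* Spin until we are the one who flips bit 0 from clear to set. */
static inline void bucket_lock(struct bucket *b)
{
	while (atomic_fetch_or_explicit(&b->head, LOCKBIT,
					memory_order_acquire) & LOCKBIT)
		;
}

static inline void bucket_unlock(struct bucket *b)
{
	atomic_fetch_and_explicit(&b->head, ~LOCKBIT, memory_order_release);
}

/* First entry of the chain, with the lock bit masked off. */
static inline struct entry *bucket_first(struct bucket *b)
{
	return (struct entry *)(atomic_load_explicit(&b->head,
					memory_order_acquire) & ~LOCKBIT);
}

/* Insert at the head of the entry's bucket, holding only that bucket. */
static inline void table_insert(struct entry *e)
{
	struct bucket *b = bucket_of(e->key);

	bucket_lock(b);
	e->next = bucket_first(b);
	/* publish the new head while keeping the lock bit set */
	atomic_store_explicit(&b->head, (uintptr_t)e | LOCKBIT,
			      memory_order_release);
	bucket_unlock(b);
}

/* Lockless lookup; relies on entries never being removed (see above). */
static inline struct entry *table_lookup(unsigned long key)
{
	struct entry *e;

	for (e = bucket_first(bucket_of(key)); e; e = e->next)
		if (e->key == key)
			return e;
	return NULL;
}
```

With this layout, holding the lock on one bucket does not block an insert
or lookup that hashes to any other bucket, which is the scalability gain
over a single global hash lock.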

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/dcache.c            |  131 ++++++++++++++++++++++++++++++++---------------
 fs/super.c             |    3 +-
 include/linux/dcache.h |   23 ++-------
 include/linux/fs.h     |    3 +-
 4 files changed, 97 insertions(+), 63 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 67a08d4..5e19940 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -33,13 +33,15 @@
 #include <linux/bootmem.h>
 #include <linux/fs_struct.h>
 #include <linux/hardirq.h>
+#include <linux/bit_spinlock.h>
+#include <linux/rculist_bl.h>
 #include "internal.h"
 
 /*
  * Usage:
  * dcache_inode_lock protects:
  *   - i_dentry, d_alias, d_inode
- * dcache_hash_lock protects:
+ * dcache_hash_bucket lock protects:
  *   - the dcache hash table
  * dcache_lru_lock protects:
  *   - the dcache lru lists and counters
@@ -57,7 +59,7 @@
  * dcache_inode_lock
  *   dentry->d_lock
  *     dcache_lru_lock
- *     dcache_hash_lock
+ *     dcache_hash_bucket lock
  *
  * If there is an ancestor relationship:
  * dentry->d_parent->...->d_parent->d_lock
@@ -74,13 +76,11 @@ int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_inode_lock);
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
 EXPORT_SYMBOL(rename_lock);
 EXPORT_SYMBOL(dcache_inode_lock);
-EXPORT_SYMBOL(dcache_hash_lock);
 
 static struct kmem_cache *dentry_cache __read_mostly;
 
@@ -97,13 +97,35 @@ static struct kmem_cache *dentry_cache __read_mostly;
 
 static unsigned int d_hash_mask __read_mostly;
 static unsigned int d_hash_shift __read_mostly;
-static struct hlist_head *dentry_hashtable __read_mostly;
+
+struct dcache_hash_bucket {
+	struct hlist_bl_head head;
+};
+static struct dcache_hash_bucket *dentry_hashtable __read_mostly;
 
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
+static inline struct dcache_hash_bucket *d_hash(struct dentry *parent,
+					unsigned long hash)
+{
+	hash += ((unsigned long) parent ^ GOLDEN_RATIO_PRIME) / L1_CACHE_BYTES;
+	hash = hash ^ ((hash ^ GOLDEN_RATIO_PRIME) >> D_HASHBITS);
+	return dentry_hashtable + (hash & D_HASHMASK);
+}
+
+static inline void spin_lock_bucket(struct dcache_hash_bucket *b)
+{
+	bit_spin_lock(0, (unsigned long *)b);
+}
+
+static inline void spin_unlock_bucket(struct dcache_hash_bucket *b)
+{
+	__bit_spin_unlock(0, (unsigned long *)b);
+}
+
 static struct percpu_counter nr_dentry __cacheline_aligned_in_smp;
 static struct percpu_counter nr_dentry_unused __cacheline_aligned_in_smp;
 
@@ -138,7 +160,7 @@ static void d_free(struct dentry *dentry)
 		dentry->d_op->d_release(dentry);
 
 	/* if dentry was never inserted into hash, immediate free is OK */
-	if (hlist_unhashed(&dentry->d_hash))
+	if (hlist_bl_unhashed(&dentry->d_hash))
 		__d_free(&dentry->d_u.d_rcu);
 	else
 		call_rcu(&dentry->d_u.d_rcu, __d_free);
@@ -278,6 +300,39 @@ relock:
 	return d_kill(dentry, parent);
 }
 
+void __d_drop(struct dentry *dentry)
+{
+	if (!(dentry->d_flags & DCACHE_UNHASHED)) {
+		if (unlikely(dentry->d_flags & DCACHE_DISCONNECTED)) {
+			bit_spin_lock(0, (unsigned long *)&dentry->d_sb->s_anon);
+			dentry->d_flags |= DCACHE_UNHASHED;
+			hlist_bl_del_init(&dentry->d_hash);
+			__bit_spin_unlock(0, (unsigned long *)&dentry->d_sb->s_anon);
+		} else {
+			struct dcache_hash_bucket *b;
+			b = d_hash(dentry->d_parent, dentry->d_name.hash);
+			spin_lock_bucket(b);
+			/*
+			 * We may not actually need to put DCACHE_UNHASHED
+			 * manipulations under the hash lock, but follow
+			 * the principle of least surprise.
+			 */
+			dentry->d_flags |= DCACHE_UNHASHED;
+			hlist_bl_del_rcu(&dentry->d_hash);
+			spin_unlock_bucket(b);
+		}
+	}
+}
+EXPORT_SYMBOL(__d_drop);
+
+void d_drop(struct dentry *dentry)
+{
+	spin_lock(&dentry->d_lock);
+	__d_drop(dentry);
+	spin_unlock(&dentry->d_lock);
+}
+EXPORT_SYMBOL(d_drop);
+
 /* 
  * This is dput
  *
@@ -891,8 +946,8 @@ void shrink_dcache_for_umount(struct super_block *sb)
 	spin_unlock(&dentry->d_lock);
 	shrink_dcache_for_umount_subtree(dentry);
 
-	while (!hlist_empty(&sb->s_anon)) {
-		dentry = hlist_entry(sb->s_anon.first, struct dentry, d_hash);
+	while (!hlist_bl_empty(&sb->s_anon)) {
+		dentry = hlist_bl_entry(hlist_bl_first(&sb->s_anon), struct dentry, d_hash);
 		shrink_dcache_for_umount_subtree(dentry);
 	}
 }
@@ -1196,7 +1251,7 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 	dentry->d_sb = NULL;
 	dentry->d_op = NULL;
 	dentry->d_fsdata = NULL;
-	INIT_HLIST_NODE(&dentry->d_hash);
+	INIT_HLIST_BL_NODE(&dentry->d_hash);
 	INIT_LIST_HEAD(&dentry->d_lru);
 	INIT_LIST_HEAD(&dentry->d_subdirs);
 	INIT_LIST_HEAD(&dentry->d_alias);
@@ -1387,14 +1442,6 @@ struct dentry * d_alloc_root(struct inode * root_inode)
 }
 EXPORT_SYMBOL(d_alloc_root);
 
-static inline struct hlist_head *d_hash(struct dentry *parent,
-					unsigned long hash)
-{
-	hash += ((unsigned long) parent ^ GOLDEN_RATIO_PRIME) / L1_CACHE_BYTES;
-	hash = hash ^ ((hash ^ GOLDEN_RATIO_PRIME) >> D_HASHBITS);
-	return dentry_hashtable + (hash & D_HASHMASK);
-}
-
 /**
  * d_obtain_alias - find or allocate a dentry for a given inode
  * @inode: inode to allocate the dentry for
@@ -1449,11 +1496,11 @@ struct dentry *d_obtain_alias(struct inode *inode)
 	tmp->d_sb = inode->i_sb;
 	tmp->d_inode = inode;
 	tmp->d_flags |= DCACHE_DISCONNECTED;
-	tmp->d_flags &= ~DCACHE_UNHASHED;
 	list_add(&tmp->d_alias, &inode->i_dentry);
-	spin_lock(&dcache_hash_lock);
-	hlist_add_head(&tmp->d_hash, &inode->i_sb->s_anon);
-	spin_unlock(&dcache_hash_lock);
+	bit_spin_lock(0, (unsigned long *)&tmp->d_sb->s_anon);
+	tmp->d_flags &= ~DCACHE_UNHASHED;
+	hlist_bl_add_head(&tmp->d_hash, &tmp->d_sb->s_anon);
+	__bit_spin_unlock(0, (unsigned long *)&tmp->d_sb->s_anon);
 	spin_unlock(&tmp->d_lock);
 	spin_unlock(&dcache_inode_lock);
 
@@ -1617,8 +1664,8 @@ struct dentry *__d_lookup_rcu(struct dentry *parent, struct qstr *name,
 	unsigned int len = name->len;
 	unsigned int hash = name->hash;
 	const unsigned char *str = name->name;
-	struct hlist_head *head = d_hash(parent,hash);
-	struct hlist_node *node;
+	struct dcache_hash_bucket *b = d_hash(parent, hash);
+	struct hlist_bl_node *node;
 	struct dentry *dentry;
 
  	/*
@@ -1641,7 +1688,7 @@ struct dentry *__d_lookup_rcu(struct dentry *parent, struct qstr *name,
 	 *
 	 * See Documentation/vfs/dcache-locking.txt for more details.
 	 */
-	hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
+	hlist_bl_for_each_entry_rcu(dentry, node, &b->head, d_hash) {
 		struct inode *i;
 		const char *tname;
 		int tlen;
@@ -1754,8 +1801,8 @@ struct dentry *__d_lookup(struct dentry *parent, struct qstr *name)
 	unsigned int len = name->len;
 	unsigned int hash = name->hash;
 	const unsigned char *str = name->name;
-	struct hlist_head *head = d_hash(parent,hash);
-	struct hlist_node *node;
+	struct dcache_hash_bucket *b = d_hash(parent, hash);
+	struct hlist_bl_node *node;
 	struct dentry *found = NULL;
 	struct dentry *dentry;
 
@@ -1781,7 +1828,7 @@ struct dentry *__d_lookup(struct dentry *parent, struct qstr *name)
 	 */
 	rcu_read_lock();
 	
-	hlist_for_each_entry_rcu(dentry, node, head, d_hash) {
+	hlist_bl_for_each_entry_rcu(dentry, node, &b->head, d_hash) {
 		const char *tname;
 		int tlen;
 
@@ -1932,11 +1979,12 @@ again:
 }
 EXPORT_SYMBOL(d_delete);
 
-static void __d_rehash(struct dentry * entry, struct hlist_head *list)
+static void __d_rehash(struct dentry * entry, struct dcache_hash_bucket *b)
 {
-
+	spin_lock_bucket(b);
  	entry->d_flags &= ~DCACHE_UNHASHED;
- 	hlist_add_head_rcu(&entry->d_hash, list);
+ 	hlist_bl_add_head_rcu(&entry->d_hash, &b->head);
+	spin_unlock_bucket(b);
 }
 
 static void _d_rehash(struct dentry * entry)
@@ -1954,9 +2002,7 @@ static void _d_rehash(struct dentry * entry)
 void d_rehash(struct dentry * entry)
 {
 	spin_lock(&entry->d_lock);
-	spin_lock(&dcache_hash_lock);
 	_d_rehash(entry);
-	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&entry->d_lock);
 }
 EXPORT_SYMBOL(d_rehash);
@@ -2035,6 +2081,7 @@ static void switch_names(struct dentry *dentry, struct dentry *target)
  */
 void d_move(struct dentry * dentry, struct dentry * target)
 {
+	struct dcache_hash_bucket *b;
 	if (!dentry->d_inode)
 		printk(KERN_WARNING "VFS: moving negative dcache entry\n");
 
@@ -2065,11 +2112,13 @@ void d_move(struct dentry * dentry, struct dentry * target)
 	}
 
 	/* Move the dentry to the target hash queue, if on different bucket */
-	spin_lock(&dcache_hash_lock);
-	if (!d_unhashed(dentry))
-		hlist_del_rcu(&dentry->d_hash);
+	if (!d_unhashed(dentry)) {
+		b = d_hash(dentry->d_parent, dentry->d_name.hash);
+		spin_lock_bucket(b);
+		hlist_bl_del_rcu(&dentry->d_hash);
+		spin_unlock_bucket(b);
+	}
 	__d_rehash(dentry, d_hash(target->d_parent, target->d_name.hash));
-	spin_unlock(&dcache_hash_lock);
 
 	/* Unhash the target: dput() will then get rid of it */
 	__d_drop(target);
@@ -2280,9 +2329,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 
 	spin_lock(&actual->d_lock);
 found:
-	spin_lock(&dcache_hash_lock);
 	_d_rehash(actual);
-	spin_unlock(&dcache_hash_lock);
 	spin_unlock(&actual->d_lock);
 	spin_unlock(&dcache_inode_lock);
 out_nolock:
@@ -2864,7 +2911,7 @@ static void __init dcache_init_early(void)
 
 	dentry_hashtable =
 		alloc_large_system_hash("Dentry cache",
-					sizeof(struct hlist_head),
+					sizeof(struct dcache_hash_bucket),
 					dhash_entries,
 					13,
 					HASH_EARLY,
@@ -2873,7 +2920,7 @@ static void __init dcache_init_early(void)
 					0);
 
 	for (loop = 0; loop < (1 << d_hash_shift); loop++)
-		INIT_HLIST_HEAD(&dentry_hashtable[loop]);
+		INIT_HLIST_BL_HEAD(&dentry_hashtable[loop].head);
 }
 
 static void __init dcache_init(void)
@@ -2899,7 +2946,7 @@ static void __init dcache_init(void)
 
 	dentry_hashtable =
 		alloc_large_system_hash("Dentry cache",
-					sizeof(struct hlist_head),
+					sizeof(struct dcache_hash_bucket),
 					dhash_entries,
 					13,
 					0,
@@ -2908,7 +2955,7 @@ static void __init dcache_init(void)
 					0);
 
 	for (loop = 0; loop < (1 << d_hash_shift); loop++)
-		INIT_HLIST_HEAD(&dentry_hashtable[loop]);
+		INIT_HLIST_BL_HEAD(&dentry_hashtable[loop].head);
 }
 
 /* SLAB cache for __getname() consumers */
diff --git a/fs/super.c b/fs/super.c
index ca69615..968ba01 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -30,6 +30,7 @@
 #include <linux/idr.h>
 #include <linux/mutex.h>
 #include <linux/backing-dev.h>
+#include <linux/rculist_bl.h>
 #include "internal.h"
 
 
@@ -71,7 +72,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		INIT_LIST_HEAD(&s->s_files);
 #endif
 		INIT_LIST_HEAD(&s->s_instances);
-		INIT_HLIST_HEAD(&s->s_anon);
+		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
 		INIT_LIST_HEAD(&s->s_dentry_lru);
 		init_rwsem(&s->s_umount);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 72f5f32..97c2d78 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -4,6 +4,7 @@
 #include <asm/atomic.h>
 #include <linux/list.h>
 #include <linux/rculist.h>
+#include <linux/rculist_bl.h>
 #include <linux/spinlock.h>
 #include <linux/seqlock.h>
 #include <linux/cache.h>
@@ -91,7 +92,7 @@ struct dentry {
 	/* RCU lookup touched fields */
 	unsigned int d_flags;		/* protected by d_lock */
 	seqcount_t d_seq;		/* per dentry seqlock */
-	struct hlist_node d_hash;	/* lookup hash list */
+	struct hlist_bl_node d_hash;	/* lookup hash list */
 	struct dentry *d_parent;	/* parent directory */
 	struct qstr d_name;
 	struct inode *d_inode;		/* Where the name belongs to - NULL is
@@ -193,7 +194,6 @@ struct dentry_operations {
 	(DCACHE_OP_REVALIDATE|DCACHE_OP_REVALIDATE_RCU)
 
 extern spinlock_t dcache_inode_lock;
-extern spinlock_t dcache_hash_lock;
 extern seqlock_t rename_lock;
 
 /**
@@ -211,23 +211,8 @@ extern seqlock_t rename_lock;
  *
  * __d_drop requires dentry->d_lock.
  */
-
-static inline void __d_drop(struct dentry *dentry)
-{
-	if (!(dentry->d_flags & DCACHE_UNHASHED)) {
-		dentry->d_flags |= DCACHE_UNHASHED;
-		spin_lock(&dcache_hash_lock);
-		hlist_del_rcu(&dentry->d_hash);
-		spin_unlock(&dcache_hash_lock);
-	}
-}
-
-static inline void d_drop(struct dentry *dentry)
-{
-	spin_lock(&dentry->d_lock);
- 	__d_drop(dentry);
-	spin_unlock(&dentry->d_lock);
-}
+void d_drop(struct dentry *dentry);
+void __d_drop(struct dentry *dentry);
 
 static inline int dname_external(struct dentry *dentry)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 490eedd..315d0e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -392,6 +392,7 @@ struct inodes_stat_t {
 #include <linux/capability.h>
 #include <linux/semaphore.h>
 #include <linux/fiemap.h>
+#include <linux/rculist_bl.h>
 
 #include <asm/atomic.h>
 #include <asm/byteorder.h>
@@ -1376,7 +1377,7 @@ struct super_block {
 	const struct xattr_handler **s_xattr;
 
 	struct list_head	s_inodes;	/* all inodes */
-	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
+	struct hlist_bl_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 #ifdef CONFIG_SMP
 	struct list_head __percpu *s_files;
 #else
-- 
1.7.1



* [PATCH 45/46] fs: dcache per-inode inode alias locking
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (42 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 44/46] fs: dcache per-bucket dcache hash locking Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:45 ` [PATCH 46/46] fs: improve scalability of pseudo filesystems Nick Piggin
                   ` (5 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

dcache_inode_lock can be replaced with per-inode locking. Use existing
inode->i_lock for this. This is slightly non-trivial because we sometimes
need to find the inode from the dentry, which requires d_inode to be
stabilised (either with refcount or d_lock).
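
To illustrate the ordering problem, here is a rough userspace sketch (not
code from this patch). The documented order is i_lock before d_lock, so a
path that already holds d_lock and needs the inode's lock has to trylock
and back off, much as dentry_kill() and d_delete() do in the diff below.
The struct names and lock_inode_from_dentry() are hypothetical, and pthread
mutexes stand in for spinlocks.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct inode and struct dentry. */
struct obj_inode {
	pthread_mutex_t i_lock;
};

struct obj_dentry {
	pthread_mutex_t d_lock;
	struct obj_inode *d_inode;	/* stable only while d_lock is held */
};

/*
 * Call with dentry->d_lock held.  On return d_lock is held again and,
 * if the dentry is positive, so is its inode's i_lock.
 * Returns the locked inode, or NULL for a negative dentry.
 */
static struct obj_inode *lock_inode_from_dentry(struct obj_dentry *dentry)
{
	struct obj_inode *inode;

	for (;;) {
		/* Re-read on every pass: d_inode may change while unlocked. */
		inode = dentry->d_inode;
		if (!inode)
			return NULL;
		/* Lock order is i_lock -> d_lock, so we may only trylock. */
		if (pthread_mutex_trylock(&inode->i_lock) == 0)
			return inode;
		/* Back off so the current i_lock holder can take d_lock. */
		pthread_mutex_unlock(&dentry->d_lock);
		sched_yield();
		pthread_mutex_lock(&dentry->d_lock);
	}
}
```

The negative-dentry case returns without taking any inode lock, which is
why several call sites in the diff gain an "if (inode)" check.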

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/9p/vfs_inode.c      |    4 +-
 fs/affs/amigaffs.c     |    4 +-
 fs/dcache.c            |   84 +++++++++++++++++++++++++----------------------
 fs/exportfs/expfs.c    |   12 ++++---
 fs/nfs/getroot.c       |    4 +-
 fs/notify/fsnotify.c   |    4 +-
 fs/ocfs2/dcache.c      |    4 +-
 include/linux/dcache.h |    1 -
 8 files changed, 62 insertions(+), 55 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index df8bbb3..5978298 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -277,11 +277,11 @@ static struct dentry *v9fs_dentry_from_dir_inode(struct inode *inode)
 {
 	struct dentry *dentry;
 
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	/* Directory should have only one entry. */
 	BUG_ON(S_ISDIR(inode->i_mode) && !list_is_singular(&inode->i_dentry));
 	dentry = list_entry(inode->i_dentry.next, struct dentry, d_alias);
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 	return dentry;
 }
 
diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index 600101a..3a4557e 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -128,7 +128,7 @@ affs_fix_dcache(struct dentry *dentry, u32 entry_ino)
 	void *data = dentry->d_fsdata;
 	struct list_head *head, *next;
 
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	head = &inode->i_dentry;
 	next = head->next;
 	while (next != head) {
@@ -139,7 +139,7 @@ affs_fix_dcache(struct dentry *dentry, u32 entry_ino)
 		}
 		next = next->next;
 	}
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 }
 
 
diff --git a/fs/dcache.c b/fs/dcache.c
index 5e19940..809ec46 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -39,8 +39,8 @@
 
 /*
  * Usage:
- * dcache_inode_lock protects:
- *   - i_dentry, d_alias, d_inode
+ * dentry->d_inode->i_lock protects:
+ *   - i_dentry, d_alias, d_inode of aliases
  * dcache_hash_bucket lock protects:
  *   - the dcache hash table
  * dcache_lru_lock protects:
@@ -56,7 +56,7 @@
  *   - d_alias, d_inode
  *
  * Ordering:
- * dcache_inode_lock
+ * dentry->d_inode->i_lock
  *   dentry->d_lock
  *     dcache_lru_lock
  *     dcache_hash_bucket lock
@@ -75,12 +75,10 @@
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_inode_lock);
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
 EXPORT_SYMBOL(rename_lock);
-EXPORT_SYMBOL(dcache_inode_lock);
 
 static struct kmem_cache *dentry_cache __read_mostly;
 
@@ -172,7 +170,7 @@ static void d_free(struct dentry *dentry)
  */
 static void dentry_iput(struct dentry * dentry)
 	__releases(dentry->d_lock)
-	__releases(dcache_inode_lock)
+	__releases(dentry->d_inode->i_lock)
 {
 	struct inode *inode = dentry->d_inode;
 	if (inode) {
@@ -181,7 +179,7 @@ static void dentry_iput(struct dentry * dentry)
 		write_seqcount_end(&dentry->d_seq);
 		list_del_init(&dentry->d_alias);
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_inode_lock);
+		spin_unlock(&inode->i_lock);
 		if (!inode->i_nlink)
 			fsnotify_inoderemove(inode);
 		if (dentry->d_op && dentry->d_op->d_iput)
@@ -190,7 +188,6 @@ static void dentry_iput(struct dentry * dentry)
 			iput(inode);
 	} else {
 		spin_unlock(&dentry->d_lock);
-		spin_unlock(&dcache_inode_lock);
 	}
 }
 
@@ -251,7 +248,7 @@ static void dentry_lru_move_tail(struct dentry *dentry)
 static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
 	__releases(dentry->d_lock)
 	__releases(parent->d_lock)
-	__releases(dcache_inode_lock)
+	__releases(dentry->d_inode->i_lock)
 {
 	dentry->d_parent = NULL;
 	list_del(&dentry->d_u.d_child);
@@ -275,9 +272,11 @@ static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
 static inline struct dentry *dentry_kill(struct dentry *dentry, int ref)
 	__releases(dentry->d_lock)
 {
+	struct inode *inode;
 	struct dentry *parent;
 
-	if (!spin_trylock(&dcache_inode_lock)) {
+	inode = dentry->d_inode;
+	if (inode && !spin_trylock(&inode->i_lock)) {
 relock:
 		spin_unlock(&dentry->d_lock);
 		cpu_relax();
@@ -288,7 +287,8 @@ relock:
 	else
 		parent = dentry->d_parent;
 	if (parent && !spin_trylock(&parent->d_lock)) {
-		spin_unlock(&dcache_inode_lock);
+		if (inode)
+			spin_unlock(&inode->i_lock);
 		goto relock;
 	}
 	if (ref)
@@ -560,9 +560,9 @@ struct dentry *d_find_alias(struct inode *inode)
 	struct dentry *de = NULL;
 
 	if (!list_empty(&inode->i_dentry)) {
-		spin_lock(&dcache_inode_lock);
+		spin_lock(&inode->i_lock);
 		de = __d_find_alias(inode, 0);
-		spin_unlock(&dcache_inode_lock);
+		spin_unlock(&inode->i_lock);
 	}
 	return de;
 }
@@ -576,20 +576,20 @@ void d_prune_aliases(struct inode *inode)
 {
 	struct dentry *dentry;
 restart:
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
 		if (!dentry->d_count) {
 			__dget_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&dcache_inode_lock);
+			spin_unlock(&inode->i_lock);
 			dput(dentry);
 			goto restart;
 		}
 		spin_unlock(&dentry->d_lock);
 	}
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(d_prune_aliases);
 
@@ -1334,9 +1334,11 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
 void d_instantiate(struct dentry *entry, struct inode * inode)
 {
 	BUG_ON(!list_empty(&entry->d_alias));
-	spin_lock(&dcache_inode_lock);
+	if (inode)
+		spin_lock(&inode->i_lock);
 	__d_instantiate(entry, inode);
-	spin_unlock(&dcache_inode_lock);
+	if (inode)
+		spin_unlock(&inode->i_lock);
 	security_d_instantiate(entry, inode);
 }
 EXPORT_SYMBOL(d_instantiate);
@@ -1399,9 +1401,11 @@ struct dentry *d_instantiate_unique(struct dentry *entry, struct inode *inode)
 
 	BUG_ON(!list_empty(&entry->d_alias));
 
-	spin_lock(&dcache_inode_lock);
+	if (inode)
+		spin_lock(&inode->i_lock);
 	result = __d_instantiate_unique(entry, inode);
-	spin_unlock(&dcache_inode_lock);
+	if (inode)
+		spin_unlock(&inode->i_lock);
 
 	if (!result) {
 		security_d_instantiate(entry, inode);
@@ -1483,10 +1487,10 @@ struct dentry *d_obtain_alias(struct inode *inode)
 	tmp->d_parent = tmp; /* make sure dput doesn't croak */
 
 
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	res = __d_find_alias(inode, 0);
 	if (res) {
-		spin_unlock(&dcache_inode_lock);
+		spin_unlock(&inode->i_lock);
 		dput(tmp);
 		goto out_iput;
 	}
@@ -1502,7 +1506,7 @@ struct dentry *d_obtain_alias(struct inode *inode)
 	hlist_bl_add_head(&tmp->d_hash, &tmp->d_sb->s_anon);
 	__bit_spin_unlock(0, (unsigned long *)&tmp->d_sb->s_anon);
 	spin_unlock(&tmp->d_lock);
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 
 	return tmp;
 
@@ -1533,18 +1537,18 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 	struct dentry *new = NULL;
 
 	if (inode && S_ISDIR(inode->i_mode)) {
-		spin_lock(&dcache_inode_lock);
+		spin_lock(&inode->i_lock);
 		new = __d_find_alias(inode, 1);
 		if (new) {
 			BUG_ON(!(new->d_flags & DCACHE_DISCONNECTED));
-			spin_unlock(&dcache_inode_lock);
+			spin_unlock(&inode->i_lock);
 			security_d_instantiate(new, inode);
 			d_move(new, dentry);
 			iput(inode);
 		} else {
-			/* already taking dcache_inode_lock, so d_add() by hand */
+			/* already taking inode->i_lock, so d_add() by hand */
 			__d_instantiate(dentry, inode);
-			spin_unlock(&dcache_inode_lock);
+			spin_unlock(&inode->i_lock);
 			security_d_instantiate(dentry, inode);
 			d_rehash(dentry);
 		}
@@ -1617,10 +1621,10 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 	 * Negative dentry: instantiate it unless the inode is a directory and
 	 * already has a dentry.
 	 */
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	if (!S_ISDIR(inode->i_mode) || list_empty(&inode->i_dentry)) {
 		__d_instantiate(found, inode);
-		spin_unlock(&dcache_inode_lock);
+		spin_unlock(&inode->i_lock);
 		security_d_instantiate(found, inode);
 		return found;
 	}
@@ -1631,7 +1635,7 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 	 */
 	new = list_entry(inode->i_dentry.next, struct dentry, d_alias);
 	__dget(new);
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 	security_d_instantiate(found, inode);
 	d_move(new, found);
 	iput(inode);
@@ -1951,15 +1955,17 @@ EXPORT_SYMBOL(d_validate);
  
 void d_delete(struct dentry * dentry)
 {
+	struct inode *inode;
 	int isdir = 0;
 	/*
 	 * Are we the only user?
 	 */
 again:
 	spin_lock(&dentry->d_lock);
-	isdir = S_ISDIR(dentry->d_inode->i_mode);
+	inode = dentry->d_inode;
+	isdir = S_ISDIR(inode->i_mode);
 	if (dentry->d_count == 1) {
-		if (!spin_trylock(&dcache_inode_lock)) {
+		if (inode && !spin_trylock(&inode->i_lock)) {
 			spin_unlock(&dentry->d_lock);
 			cpu_relax();
 			goto again;
@@ -2181,13 +2187,13 @@ struct dentry *d_ancestor(struct dentry *p1, struct dentry *p2)
  * This helper attempts to cope with remotely renamed directories
  *
  * It assumes that the caller is already holding
- * dentry->d_parent->d_inode->i_mutex and the dcache_inode_lock
+ * dentry->d_parent->d_inode->i_mutex and the inode->i_lock
  *
  * Note: If ever the locking in lock_rename() changes, then please
  * remember to update this too...
  */
-static struct dentry *__d_unalias(struct dentry *dentry, struct dentry *alias)
-	__releases(dcache_inode_lock)
+static struct dentry *__d_unalias(struct inode *inode,
+		struct dentry *dentry, struct dentry *alias)
 {
 	struct mutex *m1 = NULL, *m2 = NULL;
 	struct dentry *ret;
@@ -2213,7 +2219,7 @@ out_unalias:
 	d_move(alias, dentry);
 	ret = alias;
 out_err:
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 	if (m2)
 		mutex_unlock(m2);
 	if (m1)
@@ -2291,7 +2297,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 		goto out_nolock;
 	}
 
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 
 	if (S_ISDIR(inode->i_mode)) {
 		struct dentry *alias;
@@ -2313,7 +2319,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 				goto found;
 			}
 			/* Nope, but we must(!) avoid directory aliasing */
-			actual = __d_unalias(dentry, alias);
+			actual = __d_unalias(inode, dentry, alias);
 			if (IS_ERR(actual))
 				dput(alias);
 			goto out_nolock;
@@ -2331,7 +2337,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 found:
 	_d_rehash(actual);
 	spin_unlock(&actual->d_lock);
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 out_nolock:
 	if (actual == dentry) {
 		security_d_instantiate(dentry, inode);
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index f06a940..4b68257 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -43,24 +43,26 @@ find_acceptable_alias(struct dentry *result,
 		void *context)
 {
 	struct dentry *dentry, *toput = NULL;
+	struct inode *inode;
 
 	if (acceptable(context, result))
 		return result;
 
-	spin_lock(&dcache_inode_lock);
-	list_for_each_entry(dentry, &result->d_inode->i_dentry, d_alias) {
+	inode = result->d_inode;
+	spin_lock(&inode->i_lock);
+	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		dget(dentry);
-		spin_unlock(&dcache_inode_lock);
+		spin_unlock(&inode->i_lock);
 		if (toput)
 			dput(toput);
 		if (dentry != result && acceptable(context, dentry)) {
 			dput(result);
 			return dentry;
 		}
-		spin_lock(&dcache_inode_lock);
+		spin_lock(&inode->i_lock);
 		toput = dentry;
 	}
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 
 	if (toput)
 		dput(toput);
diff --git a/fs/nfs/getroot.c b/fs/nfs/getroot.c
index c3a5a11..5596c6a 100644
--- a/fs/nfs/getroot.c
+++ b/fs/nfs/getroot.c
@@ -63,11 +63,11 @@ static int nfs_superblock_set_dummy_root(struct super_block *sb, struct inode *i
 		 * This again causes shrink_dcache_for_umount_subtree() to
 		 * Oops, since the test for IS_ROOT() will fail.
 		 */
-		spin_lock(&dcache_inode_lock);
+		spin_lock(&sb->s_root->d_inode->i_lock);
 		spin_lock(&sb->s_root->d_lock);
 		list_del_init(&sb->s_root->d_alias);
 		spin_unlock(&sb->s_root->d_lock);
-		spin_unlock(&dcache_inode_lock);
+		spin_unlock(&sb->s_root->d_inode->i_lock);
 	}
 	return 0;
 }
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 9be6ec1..79b47cb 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -59,7 +59,7 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
 	/* determine if the children should tell inode about their events */
 	watched = fsnotify_inode_watches_children(inode);
 
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	/* run all of the dentries associated with this inode.  Since this is a
 	 * directory, there damn well better only be one item on this list */
 	list_for_each_entry(alias, &inode->i_dentry, d_alias) {
@@ -82,7 +82,7 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
 		}
 		spin_unlock(&alias->d_lock);
 	}
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 }
 
 /* Notify this dentry's parent about a child's events. */
diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c
index 8b26b54..b4d9421 100644
--- a/fs/ocfs2/dcache.c
+++ b/fs/ocfs2/dcache.c
@@ -169,7 +169,7 @@ struct dentry *ocfs2_find_local_alias(struct inode *inode,
 	struct list_head *p;
 	struct dentry *dentry = NULL;
 
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	list_for_each(p, &inode->i_dentry) {
 		dentry = list_entry(p, struct dentry, d_alias);
 
@@ -187,7 +187,7 @@ struct dentry *ocfs2_find_local_alias(struct inode *inode,
 		dentry = NULL;
 	}
 
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
 
 	return dentry;
 }
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 97c2d78..8ff803b 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -193,7 +193,6 @@ struct dentry_operations {
 #define DCACHE_OP_REVALIDATE_EITHER	\
 	(DCACHE_OP_REVALIDATE|DCACHE_OP_REVALIDATE_RCU)
 
-extern spinlock_t dcache_inode_lock;
 extern seqlock_t rename_lock;
 
 /**
-- 
1.7.1



* [PATCH 46/46] fs: improve scalability of pseudo filesystems
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (43 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 45/46] fs: dcache per-inode inode alias locking Nick Piggin
@ 2010-11-27  9:45 ` Nick Piggin
  2010-11-27  9:56 ` [PATCH 01/46] Revert "fs: use RCU read side protection in d_validate" Nick Piggin
                   ` (4 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

No matter how well the dcache scales, there is likely always going to be some
fundamental contention when adding or removing children under the same parent.
Pseudo filesystems do not seem to need connected dentries, because by
definition they are disconnected.
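
As a rough userspace model of the idea (not the kernel code itself): a
pseudo dentry is made its own parent and flagged as disconnected, so it is
never linked into a shared parent's child list and no parent lock is
touched. All names below (model_sb, model_dentry, model_alloc_pseudo,
MODEL_DISCONNECTED) are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_DISCONNECTED 0x1	/* stands in for DCACHE_DISCONNECTED */

struct model_sb {
	int id;
};

struct model_dentry {
	struct model_dentry *d_parent;
	struct model_sb *d_sb;
	unsigned int d_flags;
	char d_name[32];
};

/* Allocate a dentry that is not attached to any shared parent. */
static struct model_dentry *model_alloc_pseudo(struct model_sb *sb,
					       const char *name)
{
	struct model_dentry *dentry = calloc(1, sizeof(*dentry));

	if (!dentry)
		return NULL;
	snprintf(dentry->d_name, sizeof(dentry->d_name), "%s", name);
	dentry->d_sb = sb;
	dentry->d_parent = dentry;		/* self-parented */
	dentry->d_flags |= MODEL_DISCONNECTED;
	return dentry;
}

/* Mirrors the usual "is its own parent" root test. */
static int model_is_root(const struct model_dentry *dentry)
{
	return dentry->d_parent == dentry;
}
```

Because every such dentry is its own root, allocating many of them
concurrently shares no parent state in this model.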

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/anon_inodes.c       |    2 +-
 fs/dcache.c            |   12 ++++++++++++
 fs/pipe.c              |    2 +-
 include/linux/dcache.h |    1 +
 net/socket.c           |    2 +-
 5 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index aca8806..9d92b33 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -102,7 +102,7 @@ struct file *anon_inode_getfile(const char *name,
 	this.name = name;
 	this.len = strlen(name);
 	this.hash = 0;
-	path.dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	path.dentry = d_alloc_pseudo(anon_inode_mnt->mnt_sb, &this);
 	if (!path.dentry)
 		goto err_module;
 
diff --git a/fs/dcache.c b/fs/dcache.c
index 809ec46..59f25b7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1276,6 +1276,18 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 }
 EXPORT_SYMBOL(d_alloc);
 
+struct dentry *d_alloc_pseudo(struct super_block *sb, const struct qstr *name)
+{
+	struct dentry *dentry = d_alloc(NULL, name);
+	if (dentry) {
+		dentry->d_sb = sb;
+		dentry->d_parent = dentry;
+		dentry->d_flags |= DCACHE_DISCONNECTED;
+	}
+	return dentry;
+}
+EXPORT_SYMBOL(d_alloc_pseudo);
+
 struct dentry *d_alloc_name(struct dentry *parent, const char *name)
 {
 	struct qstr q;
diff --git a/fs/pipe.c b/fs/pipe.c
index e964d09..ebe9143 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -999,7 +999,7 @@ struct file *create_write_pipe(int flags)
 		goto err;
 
 	err = -ENOMEM;
-	path.dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	path.dentry = d_alloc_pseudo(pipe_mnt->mnt_sb, &name);
 	if (!path.dentry)
 		goto err_inode;
 	path.mnt = mntget(pipe_mnt);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 8ff803b..0cf2cb9 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -229,6 +229,7 @@ extern void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op
 
 /* allocate/de-allocate */
 extern struct dentry * d_alloc(struct dentry *, const struct qstr *);
+extern struct dentry * d_alloc_pseudo(struct super_block *, const struct qstr *);
 extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
 extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *);
 extern struct dentry * d_obtain_alias(struct inode *);
diff --git a/net/socket.c b/net/socket.c
index 2d8b4c8..17d9aae 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -361,7 +361,7 @@ static int sock_alloc_file(struct socket *sock, struct file **f, int flags)
 	if (unlikely(fd < 0))
 		return fd;
 
-	path.dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	path.dentry = d_alloc_pseudo(sock_mnt->mnt_sb, &name);
 	if (unlikely(!path.dentry)) {
 		put_unused_fd(fd);
 		return -ENOMEM;
-- 
1.7.1



* [PATCH 01/46] Revert "fs: use RCU read side protection in d_validate"
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (44 preceding siblings ...)
  2010-11-27  9:45 ` [PATCH 46/46] fs: improve scalability of pseudo filesystems Nick Piggin
@ 2010-11-27  9:56 ` Nick Piggin
  2010-12-08  1:16   ` Dave Chinner
  2010-11-27 15:04   ` Anca Emanuel
                   ` (3 subsequent siblings)
  49 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2010-11-27  9:56 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

This reverts commit 3825bdb7ed920845961f32f364454bee5f469abb.

The patch is broken: you can't dget() without holding any locks!
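
A minimal userspace sketch of the rule, under stated assumptions: the
reference is only taken while holding the lock that also protects the list
being searched, and the untrusted pointer is compared but never
dereferenced until it has been found. The names (node, locked_list,
validate_and_get) are hypothetical, and a pthread mutex stands in for
dcache_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct node {
	struct node *next;
	int refcount;			/* protected by list_lock */
};

struct locked_list {
	pthread_mutex_t list_lock;	/* protects head, next and refcount */
	struct node *head;
};

/*
 * Return 1 and take a reference if @candidate is on @list, else 0.
 * @candidate is only compared by address until it is found on the list.
 */
static int validate_and_get(struct locked_list *list,
			    const struct node *candidate)
{
	struct node *n;
	int found = 0;

	pthread_mutex_lock(&list->list_lock);
	for (n = list->head; n; n = n->next) {
		if (n == candidate) {
			n->refcount++;	/* safe: cannot be freed under us */
			found = 1;
			break;
		}
	}
	pthread_mutex_unlock(&list->list_lock);
	return found;
}
```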

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
 fs/dcache.c |   31 +++++++++++++++++++------------
 1 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 23702a9..cc2b938 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1491,26 +1491,33 @@ out:
  * This is used by ncpfs in its readdir implementation.
  * Zero is returned in the dentry is invalid.
  */
-int d_validate(struct dentry *dentry, struct dentry *parent)
+ 
+int d_validate(struct dentry *dentry, struct dentry *dparent)
 {
-	struct hlist_head *head = d_hash(parent, dentry->d_name.hash);
-	struct hlist_node *node;
-	struct dentry *d;
+	struct hlist_head *base;
+	struct hlist_node *lhp;
 
 	/* Check whether the ptr might be valid at all.. */
 	if (!kmem_ptr_validate(dentry_cache, dentry))
-		return 0;
-	if (dentry->d_parent != parent)
-		return 0;
+		goto out;
 
-	rcu_read_lock();
-	hlist_for_each_entry_rcu(d, node, head, d_hash) {
-		if (d == dentry) {
-			dget(dentry);
+	if (dentry->d_parent != dparent)
+		goto out;
+
+	spin_lock(&dcache_lock);
+	base = d_hash(dparent, dentry->d_name.hash);
+	hlist_for_each(lhp,base) { 
+		/* hlist_for_each_entry_rcu() not required for d_hash list
+		 * as it is parsed under dcache_lock
+		 */
+		if (dentry == hlist_entry(lhp, struct dentry, d_hash)) {
+			__dget_locked(dentry);
+			spin_unlock(&dcache_lock);
 			return 1;
 		}
 	}
-	rcu_read_unlock();
+	spin_unlock(&dcache_lock);
+out:
 	return 0;
 }
 EXPORT_SYMBOL(d_validate);
-- 
1.7.1



* [PATCH 00/46] rcu-walk and dcache scaling
@ 2010-11-27 10:15 Nick Piggin
  2010-11-27  9:44 ` [PATCH 02/46] fs: d_validate fixes Nick Piggin
                   ` (49 more replies)
  0 siblings, 50 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-27 10:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel


git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working

Here is a new set of VFS patches for review, not that there was much interest
last time they were posted. The series is structured as follows:

* preparation patches
* introduce new locks to take over dcache_lock, then remove it
* cleaning up and reworking things for new locks
* rcu-walk path walking
* start on some fine grained locking steps

Thanks,
Nick


Nick Piggin (46):
  Revert "fs: use RCU read side protection in d_validate"
  fs: d_validate fixes
  kernel: kmem_ptr_validate considered harmful
  fs: dcache documentation cleanup
  fs: change d_delete semantics
  cifs: dont overwrite dentry name in d_revalidate
  jfs: dont overwrite dentry name in d_revalidate
  fs: change d_compare for rcu-walk
  fs: change d_hash for rcu-walk
  hostfs: simplify locking
  fs: dcache scale hash
  fs: dcache scale lru
  fs: dcache scale dentry refcount
  fs: dcache scale d_unhashed
  fs: dcache scale subdirs
  fs: scale inode alias list
  fs: Use rename lock and RCU for multi-step operations
  fs: increase d_name lock coverage
  fs: dcache remove dcache_lock
  fs: dcache avoid starvation in dcache multi-step operations
  fs: dcache reduce dput locking
  fs: dcache reduce locking in d_alloc
  fs: dcache reduce dcache_inode_lock
  fs: dcache rationalise dget variants
  fs: dcache reduce d_parent locking
  fs: dcache reduce prune_one_dentry locking
  fs: reduce dcache_inode_lock width in lru scanning
  fs: use RCU in shrink_dentry_list to reduce lock nesting
  fs: consolidate dentry kill sequence
  fs: icache RCU free inodes
  fs: avoid inode RCU freeing for pseudo fs
  kernel: optimise seqlock
  fs: rcu-walk for path lookup
  fs: fs_struct use seqlock
  fs: dcache remove d_mounted
  fs: dcache reduce branches in lookup path
  fs: cache optimise dentry and inode for rcu-walk
  fs: prefetch inode data in dcache lookup
  fs: d_revalidate_rcu for rcu-walk
  fs: provide rcu-walk aware permission i_ops
  fs: provide simple rcu-walk ACL implementation
  kernel: add bl_list
  bit_spinlock: add required includes
  fs: dcache per-bucket dcache hash locking
  fs: dcache per-inode inode alias locking
  fs: improve scalability of pseudo filesystems

 Documentation/filesystems/Locking            |   23 +-
 Documentation/filesystems/dentry-locking.txt |  174 ----
 Documentation/filesystems/path-lookup.txt    |  247 ++++++
 Documentation/filesystems/porting            |   45 +-
 Documentation/filesystems/vfs.txt            |   54 +-
 arch/ia64/kernel/perfmon.c                   |    4 +-
 arch/powerpc/platforms/cell/spufs/inode.c    |   18 +-
 drivers/infiniband/hw/ipath/ipath_fs.c       |    8 +-
 drivers/infiniband/hw/qib/qib_fs.c           |    5 +-
 drivers/staging/autofs/root.c                |    2 +-
 drivers/staging/pohmelfs/inode.c             |   11 +-
 drivers/staging/pohmelfs/path_entry.c        |   17 +-
 drivers/staging/smbfs/cache.c                |   10 +-
 drivers/usb/core/inode.c                     |   12 +-
 fs/9p/vfs_dentry.c                           |    4 +-
 fs/9p/vfs_inode.c                            |   39 +-
 fs/adfs/dir.c                                |   13 +-
 fs/adfs/super.c                              |   11 +-
 fs/affs/amigaffs.c                           |    4 +-
 fs/affs/namei.c                              |   66 +-
 fs/affs/super.c                              |   11 +-
 fs/afs/dir.c                                 |    6 +-
 fs/afs/super.c                               |   10 +-
 fs/anon_inodes.c                             |    4 +-
 fs/autofs4/autofs_i.h                        |   21 +-
 fs/autofs4/expire.c                          |  143 ++--
 fs/autofs4/inode.c                           |    2 +-
 fs/autofs4/root.c                            |   78 +-
 fs/autofs4/waitq.c                           |   23 +-
 fs/befs/linuxvfs.c                           |   10 +-
 fs/bfs/inode.c                               |    9 +-
 fs/block_dev.c                               |    9 +-
 fs/btrfs/acl.c                               |   19 +-
 fs/btrfs/ctree.h                             |    4 +-
 fs/btrfs/export.c                            |    4 +-
 fs/btrfs/inode.c                             |   27 +-
 fs/ceph/dir.c                                |   21 +-
 fs/ceph/inode.c                              |   27 +-
 fs/ceph/mds_client.c                         |    2 +-
 fs/cifs/cifsfs.c                             |    9 +-
 fs/cifs/dir.c                                |   75 +-
 fs/cifs/inode.c                              |   14 +-
 fs/cifs/link.c                               |    4 +-
 fs/cifs/readdir.c                            |    6 +-
 fs/coda/cache.c                              |    4 +-
 fs/coda/dir.c                                |    8 +-
 fs/coda/inode.c                              |    9 +-
 fs/configfs/configfs_internal.h              |    4 +-
 fs/configfs/dir.c                            |   13 +-
 fs/configfs/inode.c                          |    8 +-
 fs/dcache.c                                  | 1224 ++++++++++++++++++--------
 fs/ecryptfs/dentry.c                         |   15 +-
 fs/ecryptfs/inode.c                          |    8 +-
 fs/ecryptfs/main.c                           |    4 +-
 fs/ecryptfs/super.c                          |   12 +-
 fs/efs/super.c                               |    9 +-
 fs/exofs/super.c                             |    9 +-
 fs/exportfs/expfs.c                          |   14 +-
 fs/ext2/acl.c                                |   11 +-
 fs/ext2/acl.h                                |    8 +-
 fs/ext2/file.c                               |    2 +-
 fs/ext2/namei.c                              |    4 +-
 fs/ext2/super.c                              |    9 +-
 fs/ext3/acl.c                                |   11 +-
 fs/ext3/acl.h                                |    8 +-
 fs/ext3/file.c                               |    2 +-
 fs/ext3/namei.c                              |    4 +-
 fs/ext3/super.c                              |    9 +-
 fs/ext4/acl.c                                |   11 +-
 fs/ext4/acl.h                                |    4 +-
 fs/ext4/file.c                               |    2 +-
 fs/ext4/namei.c                              |    4 +-
 fs/ext4/super.c                              |    9 +-
 fs/fat/inode.c                               |   13 +-
 fs/fat/namei_msdos.c                         |   24 +-
 fs/fat/namei_vfat.c                          |   69 +-
 fs/filesystems.c                             |    3 +
 fs/freevxfs/vxfs_inode.c                     |    9 +-
 fs/fs_struct.c                               |   10 +
 fs/fuse/dir.c                                |    9 +-
 fs/fuse/inode.c                              |   13 +-
 fs/generic_acl.c                             |   20 +
 fs/gfs2/dentry.c                             |    5 +-
 fs/gfs2/export.c                             |    4 +-
 fs/gfs2/ops_fstype.c                         |    2 +-
 fs/gfs2/ops_inode.c                          |    2 +-
 fs/gfs2/super.c                              |    9 +-
 fs/hfs/dir.c                                 |    2 +-
 fs/hfs/hfs_fs.h                              |    7 +-
 fs/hfs/string.c                              |   17 +-
 fs/hfs/super.c                               |   11 +-
 fs/hfsplus/dir.c                             |    2 +-
 fs/hfsplus/hfsplus_fs.h                      |    7 +-
 fs/hfsplus/super.c                           |   12 +-
 fs/hfsplus/unicode.c                         |   17 +-
 fs/hostfs/hostfs_kern.c                      |   37 +-
 fs/hpfs/dentry.c                             |   26 +-
 fs/hpfs/super.c                              |    9 +-
 fs/hppfs/hppfs.c                             |    9 +-
 fs/hugetlbfs/inode.c                         |    9 +-
 fs/inode.c                                   |   16 +-
 fs/isofs/inode.c                             |  127 ++--
 fs/isofs/namei.c                             |    5 +-
 fs/jffs2/super.c                             |    9 +-
 fs/jfs/namei.c                               |   60 +-
 fs/jfs/super.c                               |   12 +-
 fs/libfs.c                                   |   63 +-
 fs/locks.c                                   |    2 +-
 fs/logfs/inode.c                             |    9 +-
 fs/minix/inode.c                             |    9 +-
 fs/minix/namei.c                             |    2 +-
 fs/namei.c                                   |  855 +++++++++++++++----
 fs/namespace.c                               |   29 +-
 fs/ncpfs/dir.c                               |   66 +-
 fs/ncpfs/inode.c                             |   11 +-
 fs/ncpfs/ncplib_kernel.h                     |   16 +-
 fs/nfs/dir.c                                 |   17 +-
 fs/nfs/getroot.c                             |   10 +-
 fs/nfs/inode.c                               |    9 +-
 fs/nfs/namespace.c                           |   17 +-
 fs/nfs/unlink.c                              |    2 +-
 fs/nfsd/vfs.c                                |    5 +-
 fs/nilfs2/super.c                            |   12 +-
 fs/notify/fsnotify.c                         |    8 +-
 fs/ntfs/inode.c                              |    9 +-
 fs/ocfs2/dcache.c                            |   10 +-
 fs/ocfs2/dlmfs/dlmfs.c                       |    9 +-
 fs/ocfs2/export.c                            |    4 +-
 fs/ocfs2/namei.c                             |   10 +-
 fs/ocfs2/super.c                             |    9 +-
 fs/openpromfs/inode.c                        |    9 +-
 fs/pipe.c                                    |   10 +-
 fs/proc/base.c                               |   14 +-
 fs/proc/generic.c                            |    4 +-
 fs/proc/inode.c                              |    9 +-
 fs/proc/proc_sysctl.c                        |   18 +-
 fs/qnx4/inode.c                              |    9 +-
 fs/reiserfs/super.c                          |    9 +-
 fs/reiserfs/xattr.c                          |    2 +-
 fs/romfs/super.c                             |    9 +-
 fs/squashfs/super.c                          |    9 +-
 fs/super.c                                   |    3 +-
 fs/sysfs/dir.c                               |   22 +-
 fs/sysv/inode.c                              |    9 +-
 fs/sysv/namei.c                              |    5 +-
 fs/sysv/super.c                              |    2 +-
 fs/ubifs/super.c                             |   10 +-
 fs/udf/super.c                               |    9 +-
 fs/ufs/super.c                               |    9 +-
 fs/xfs/linux-2.6/xfs_acl.c                   |    8 +-
 fs/xfs/linux-2.6/xfs_iops.c                  |    8 +-
 fs/xfs/xfs_acl.h                             |    4 +-
 fs/xfs/xfs_iget.c                            |   13 +-
 include/linux/bit_spinlock.h                 |    4 +
 include/linux/dcache.h                       |  183 ++--
 include/linux/fs.h                           |   61 +-
 include/linux/fs_struct.h                    |    3 +
 include/linux/fsnotify.h                     |    2 -
 include/linux/fsnotify_backend.h             |   11 +-
 include/linux/generic_acl.h                  |    1 +
 include/linux/list_bl.h                      |  141 +++
 include/linux/namei.h                        |   16 +-
 include/linux/ncp_fs.h                       |    4 +-
 include/linux/posix_acl.h                    |   19 +
 include/linux/rculist_bl.h                   |  128 +++
 include/linux/seqlock.h                      |   67 ++-
 include/linux/slab.h                         |    2 -
 ipc/mqueue.c                                 |    9 +-
 kernel/cgroup.c                              |   29 +-
 mm/filemap.c                                 |    3 -
 mm/shmem.c                                   |   15 +-
 mm/slab.c                                    |   32 +-
 mm/slob.c                                    |    5 -
 mm/slub.c                                    |   40 -
 mm/util.c                                    |   21 -
 net/socket.c                                 |    5 +-
 net/sunrpc/rpc_pipe.c                        |   14 +-
 security/selinux/selinuxfs.c                 |   16 +-
 security/tomoyo/realpath.c                   |    1 +
 179 files changed, 3848 insertions(+), 1711 deletions(-)
 delete mode 100644 Documentation/filesystems/dentry-locking.txt
 create mode 100644 Documentation/filesystems/path-lookup.txt
 create mode 100644 include/linux/list_bl.h
 create mode 100644 include/linux/rculist_bl.h


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
@ 2010-11-27 15:04   ` Anca Emanuel
  2010-11-27  9:44 ` [PATCH 03/46] kernel: kmem_ptr_validate considered harmful Nick Piggin
                     ` (48 subsequent siblings)
  49 siblings, 0 replies; 107+ messages in thread
From: Anca Emanuel @ 2010-11-27 15:04 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Sat, Nov 27, 2010 at 12:15 PM, Nick Piggin <npiggin@kernel.dk> wrote:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working

I get:
  CC [M]  fs/cifs/inode.o
fs/cifs/inode.c: In function ‘inode_has_hashed_dentries’:
fs/cifs/inode.c:807: error: ‘dcache_inode_lock’ undeclared (first use
in this function)
fs/cifs/inode.c:807: error: (Each undeclared identifier is reported only once
fs/cifs/inode.c:807: error: for each function it appears in.)
make[3]: *** [fs/cifs/inode.o] Error 1
make[2]: *** [fs/cifs] Error 2
make[1]: *** [fs] Error 2

I used the latest mainline.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-11-27 15:04   ` Anca Emanuel
@ 2010-11-28  3:28     ` Nick Piggin
  -1 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-11-28  3:28 UTC (permalink / raw)
  To: Anca Emanuel, Sedat Dilek; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Sat, Nov 27, 2010 at 05:04:08PM +0200, Anca Emanuel wrote:
> On Sat, Nov 27, 2010 at 12:15 PM, Nick Piggin <npiggin@kernel.dk> wrote:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> 
> I get:
>   CC [M]  fs/cifs/inode.o
> fs/cifs/inode.c: In function ‘inode_has_hashed_dentries’:
> fs/cifs/inode.c:807: error: ‘dcache_inode_lock’ undeclared (first use
> in this function)
> fs/cifs/inode.c:807: error: (Each undeclared identifier is reported only once
> fs/cifs/inode.c:807: error: for each function it appears in.)
> make[3]: *** [fs/cifs/inode.o] Error 1
> make[2]: *** [fs/cifs] Error 2
> make[1]: *** [fs] Error 2
> 
> I used the latest mainline.

Sorry, missed a conversion, it just needs to be changed to
inode->i_lock. Pushed the fix to git.

Thanks,
Nick
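
For readers following along, the conversion described above can be
sketched in userspace. This is only a model under stated assumptions,
not the fix that was pushed to git: pthread mutexes stand in for kernel
spinlocks, and the model_* names and the hashed/is_root flags are
hypothetical. Only inode_has_hashed_dentries, dcache_inode_lock and
inode->i_lock come from this thread. The point it shows is that the
alias walk takes the lock embedded in the inode instead of one global
lock.

```c
/* Userspace model only: pthread mutexes stand in for kernel spinlocks. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry on an inode's alias list. */
struct model_dentry {
	bool hashed;                 /* assumed analogue of "dentry is hashed" */
	bool is_root;                /* assumed analogue of "dentry is a root" */
	struct model_dentry *next;   /* next alias of the same inode */
};

/* Hypothetical stand-in for an inode carrying its own lock. */
struct model_inode {
	pthread_mutex_t i_lock;      /* per-inode lock guarding the alias list */
	struct model_dentry *aliases;
};

static void model_inode_init(struct model_inode *inode)
{
	assert(inode != NULL);
	pthread_mutex_init(&inode->i_lock, NULL);
	inode->aliases = NULL;
}

static void model_inode_add_alias(struct model_inode *inode,
				  struct model_dentry *d)
{
	assert(inode != NULL && d != NULL);
	pthread_mutex_lock(&inode->i_lock);
	d->next = inode->aliases;
	inode->aliases = d;
	pthread_mutex_unlock(&inode->i_lock);
}

/* Walk the aliases under the per-inode lock rather than a global lock. */
static bool model_inode_has_hashed_dentries(struct model_inode *inode)
{
	bool found = false;
	struct model_dentry *d;

	assert(inode != NULL);
	pthread_mutex_lock(&inode->i_lock);
	for (d = inode->aliases; d != NULL; d = d->next) {
		if (d->hashed || d->is_root) {
			found = true;
			break;
		}
	}
	pthread_mutex_unlock(&inode->i_lock);
	return found;
}
```

A caller would initialise a model_inode, attach aliases with
model_inode_add_alias(), and then query it. Two different inodes can be
scanned concurrently because they no longer share a lock.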


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-11-28  3:28     ` Nick Piggin
  (?)
@ 2010-11-28  6:24     ` Sedat Dilek
  -1 siblings, 0 replies; 107+ messages in thread
From: Sedat Dilek @ 2010-11-28  6:24 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Anca Emanuel, linux-fsdevel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1292 bytes --]

On Sun, Nov 28, 2010 at 4:28 AM, Nick Piggin <npiggin@kernel.dk> wrote:
> On Sat, Nov 27, 2010 at 05:04:08PM +0200, Anca Emanuel wrote:
>> On Sat, Nov 27, 2010 at 12:15 PM, Nick Piggin <npiggin@kernel.dk> wrote:
>> >
>> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
>>
>> I get:
>>   CC [M]  fs/cifs/inode.o
>> fs/cifs/inode.c: In function ‘inode_has_hashed_dentries’:
>> fs/cifs/inode.c:807: error: ‘dcache_inode_lock’ undeclared (first use
>> in this function)
>> fs/cifs/inode.c:807: error: (Each undeclared identifier is reported only once
>> fs/cifs/inode.c:807: error: for each function it appears in.)
>> make[3]: *** [fs/cifs/inode.o] Error 1
>> make[2]: *** [fs/cifs] Error 2
>> make[1]: *** [fs] Error 2
>>
>> I used the latest mainline.
>
> Sorry, missed a conversion, it just needs to be changed to
> inode->i_lock. Pushed the fix to git.
>

I attached a patch to my posting in [1] but it was somehow "eaten"
(here in my mbox the patch is definitely attached).
Patchwork is also not listing my patch.
Anyway, it's fixed - that's good.

I have to check why I get a Call trace with systemd-v15 but not with
sysvinit package here on Debian.

- Sedat -

[1] http://lkml.org/lkml/2010/11/27/145

[-- Attachment #2: fs-cifs-inode.c-Fix-error-dcache_inode_lock-undeclared.patch --]
[-- Type: plain/text, Size: 612 bytes --]

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (46 preceding siblings ...)
  2010-11-27 15:04   ` Anca Emanuel
@ 2010-12-01 18:03 ` David Miller
  2010-12-03 16:55   ` Nick Piggin
  2010-12-07 11:25 ` Dave Chinner
  2010-12-07 21:56 ` Dave Chinner
  49 siblings, 1 reply; 107+ messages in thread
From: David Miller @ 2010-12-01 18:03 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel

From: Nick Piggin <npiggin@kernel.dk>
Date: Sat, 27 Nov 2010 21:15:58 +1100

> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working

Just want to say that I've been running this code for the past few days on
my 128 cpu box and it seems quite sturdy.

If there are any kinds of vfs benchmarks you want me to run on this
machine both with and without the scaling changes, just let me know.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-01 18:03 ` David Miller
@ 2010-12-03 16:55   ` Nick Piggin
  0 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-12-03 16:55 UTC (permalink / raw)
  To: David Miller; +Cc: npiggin, linux-fsdevel, linux-kernel

On Wed, Dec 01, 2010 at 10:03:30AM -0800, David Miller wrote:
> From: Nick Piggin <npiggin@kernel.dk>
> Date: Sat, 27 Nov 2010 21:15:58 +1100
> 
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> 
> Just want to say that I've been running this code for the past few days on
> my 128 cpu box and it seems quite sturdy.
> 
> If there are any kinds of vfs benchmarks you want me to run on this
> machine both with and without the scaling changes, just let me know.

Hi Dave, great, thanks!

I am mostly interested in single-thread performance on different ISAs
and different microarchitectures.

Microbenchmarks are easy, I attached a couple of quick ones I use in
an offline mail.

Single-threaded (preloadindex=false), cached and uncached, git diff on
an unmodified tree is slightly more real world.

And then anything else you can think of.

Thanks,
Nick
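
The microbenchmarks mentioned above were sent offline and are not part
of this thread. As a rough illustration of what a single-threaded,
cached path lookup microbenchmark might look like, here is a minimal
sketch. The function name stat_ns_per_call and its parameters are
hypothetical, and it only covers the cached case, since the first
stat() call warms the dcache before timing starts.

```c
/* Hypothetical sketch: time repeated stat() calls on one cached path. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Returns the average number of nanoseconds per stat() call on @path
 * over @iters iterations, or -1.0 if the arguments are invalid or any
 * call fails.
 */
static double stat_ns_per_call(const char *path, long iters)
{
	struct stat st;
	struct timespec t0, t1;
	double elapsed_ns;
	long i;

	assert(path != NULL);
	if (iters <= 0)
		return -1.0;

	/* One untimed call validates the path and warms the caches. */
	if (stat(path, &st) != 0)
		return -1.0;

	if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
		return -1.0;
	for (i = 0; i < iters; i++) {
		if (stat(path, &st) != 0)
			return -1.0;
	}
	if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
		return -1.0;

	elapsed_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
		     (double)(t1.tv_nsec - t0.tv_nsec);
	return elapsed_ns / (double)iters;
}
```

Running it on a deep path with and without the scaling changes, on the
same machine, gives a per-lookup cost that can be compared across ISAs
and microarchitectures. It says nothing about the uncached case or
about multi-threaded scaling.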


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (47 preceding siblings ...)
  2010-12-01 18:03 ` David Miller
@ 2010-12-07 11:25 ` Dave Chinner
  2010-12-07 15:24     ` Nick Piggin
  2010-12-07 21:56 ` Dave Chinner
  49 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2010-12-07 11:25 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> 
> Here is a new set of vfs patches for review, not that there was much interest
> last time they were posted. It is structured like:
> 
> * preparation patches
> * introduce new locks to take over dcache_lock, then remove it
> * cleaning up and reworking things for new locks
> * rcu-walk path walking
> * start on some fine grained locking steps

Just got this set of traces doing an 8-way parallel remove of 50
million inodes at about 40M inodes unlinked:

[ 5954.061633] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1081
[ 5954.062466] in_atomic(): 0, irqs_disabled(): 1, pid: 2927, name: rm
[ 5954.063122] 3 locks held by rm/2927:
[ 5954.063476]  #0:  (&sb->s_type->i_mutex_key#12/1){+.+.+.}, at: [<ffffffff8116f5e1>] do_rmdir+0x81/0x130
[ 5954.064014]  #1:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff8116d3a8>] vfs_rmdir+0x58/0xe0
[ 5954.064014]  #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff811779c0>] shrink_dentry_list+0x0/0x430
[ 5954.064014] irq event stamp: 1484376719
[ 5954.064014] hardirqs last  enabled at (1484376719): [<ffffffff810ebf07>] __call_rcu+0xd7/0x1a0
[ 5954.064014] hardirqs last disabled at (1484376718): [<ffffffff810ebe7a>] __call_rcu+0x4a/0x1a0
[ 5954.064014] softirqs last  enabled at (1484376586): [<ffffffff8108b911>] __do_softirq+0x161/0x270
[ 5954.064014] softirqs last disabled at (1484376581): [<ffffffff8103af1c>] call_softirq+0x1c/0x50
[ 5954.064014] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794
[ 5954.064014] Call Trace:
[ 5954.064014]  [<ffffffff810b95b0>] ? print_irqtrace_events+0xd0/0xe0
[ 5954.064014]  [<ffffffff81076455>] __might_sleep+0xf5/0x130
[ 5954.064014]  [<ffffffff81b1e603>] do_page_fault+0x103/0x4f0
[ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
[ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
[ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
[ 5954.064014]  [<ffffffff81b19b28>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
[ 5954.064014]  [<ffffffff81b1af25>] page_fault+0x25/0x30
[ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
[ 5954.064014]  [<ffffffff810ba1c8>] ? __bfs+0xc8/0x260
[ 5954.064014]  [<ffffffff810ba123>] ? __bfs+0x23/0x260
[ 5954.064014]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
[ 5954.064014]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
[ 5954.064014]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
[ 5954.064014]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
[ 5954.064014]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
[ 5954.064014]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
[ 5954.064014]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
[ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
[ 5954.064014]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
[ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
[ 5954.064014]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
[ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
[ 5954.064014]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
[ 5954.064014]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
[ 5954.064014]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
[ 5954.064014]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
[ 5954.064014]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
[ 5954.064014]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
[ 5954.064014]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
[ 5954.064014]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
[ 5954.064014]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
[ 5954.064014]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 5954.064014]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
[ 5954.064014]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
[ 5954.092916] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 5954.093806] IP: [<ffffffff810ba1c8>] __bfs+0xc8/0x260
[ 5954.094331] PGD 1084e5067 PUD 102368067 PMD 0 
[ 5954.094830] Oops: 0000 [#1] SMP 
[ 5954.095194] last sysfs file: /sys/devices/system/cpu/online
[ 5954.095760] CPU 6 
[ 5954.095954] Modules linked in:
[ 5954.096319] 
[ 5954.096483] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794 /Bochs
[ 5954.096665] RIP: 0010:[<ffffffff810ba1c8>]  [<ffffffff810ba1c8>] __bfs+0xc8/0x260
[ 5954.096665] RSP: 0018:ffff8801175539a8  EFLAGS: 00010046
[ 5954.096665] RAX: ffffffff8267d980 RBX: ffffffff8267d980 RCX: ffff880117553a48
[ 5954.096665] RDX: ffff8801175539d0 RSI: 0000000000000000 RDI: ffff880117553a48
[ 5954.096665] RBP: ffff880117553a08 R08: 0000000000000000 R09: 0000000000000000
[ 5954.096665] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[ 5954.096665] R13: ffffffff810b8e20 R14: ffff880117553a90 R15: 0000000000000000
[ 5954.096665] FS:  00007f4594cf3700(0000) GS:ffff8800dfa00000(0000) knlGS:0000000000000000
[ 5954.096665] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5954.096665] CR2: 00007f2f21e89c60 CR3: 0000000110b0f000 CR4: 00000000000006e0
[ 5954.096665] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5954.096665] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 5954.096665] Process rm (pid: 2927, threadinfo ffff880117552000, task ffff88010ff10b00)
[ 5954.096665] Stack:
[ 5954.096665]  ffffffff8267d868 00007fffa02d1428 ffff8800ffffffff ffff880100000000
[ 5954.096665]  000000000000b720 ffff880117553a48 ffffffff8267d868 ffff880117553a48
[ 5954.096665]  0000000000000000 ffff88010ff10b00 0000000000000000 ffffffff81dacba0
[ 5954.096665] Call Trace:
[ 5954.096665]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
[ 5954.096665]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
[ 5954.096665]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
[ 5954.096665]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
[ 5954.096665]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
[ 5954.096665]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
[ 5954.096665]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
[ 5954.096665]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
[ 5954.096665]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
[ 5954.096665]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
[ 5954.096665]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
[ 5954.096665]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
[ 5954.096665]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
[ 5954.096665]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
[ 5954.096665]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
[ 5954.096665]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
[ 5954.096665]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
[ 5954.096665]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
[ 5954.096665]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
[ 5954.096665]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
[ 5954.096665]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
[ 5954.096665]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 5954.096665]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
[ 5954.096665]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
[ 5954.096665] Code: 0a 89 05 dc 0f a8 01 48 8b 41 10 48 85 c0 0f 84 1f 01 00 00 48 8d 98 70 01 00 00 48 05 80 01 00 00 45 85 c0 48 0f 44 d8 4c 8b 3b <49> 8b 07 49 39  
[ 5954.096665] RIP  [<ffffffff810ba1c8>] __bfs+0xc8/0x260
[ 5954.096665]  RSP <ffff8801175539a8>
[ 5954.096665] CR2: 0000000000000000
[ 5954.127991] ---[ end trace 85a6727c2d4e3d90 ]---

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-07 11:25 ` Dave Chinner
@ 2010-12-07 15:24     ` Nick Piggin
  0 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-12-07 15:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-fsdevel, linux-kernel, Peter Zijlstra

On Tue, Dec 7, 2010 at 10:25 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
>>
>> Here is a new set of vfs patches for review, not that there was much interest
>> last time they were posted. It is structured like:
>>
>> * preparation patches
>> * introduce new locks to take over dcache_lock, then remove it
>> * cleaning up and reworking things for new locks
>> * rcu-walk path walking
>> * start on some fine grained locking steps
>
> Just got this set of traces doing an 8-way parallel remove of 50
> million inodes at about 40M inodes unlinked:

Thanks for testing...


> [ 5954.061633] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1081
> [ 5954.062466] in_atomic(): 0, irqs_disabled(): 1, pid: 2927, name: rm
> [ 5954.063122] 3 locks held by rm/2927:
> [ 5954.063476]  #0:  (&sb->s_type->i_mutex_key#12/1){+.+.+.}, at: [<ffffffff8116f5e1>] do_rmdir+0x81/0x130
> [ 5954.064014]  #1:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff8116d3a8>] vfs_rmdir+0x58/0xe0
> [ 5954.064014]  #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff811779c0>] shrink_dentry_list+0x0/0x430
> [ 5954.064014] irq event stamp: 1484376719
> [ 5954.064014] hardirqs last  enabled at (1484376719): [<ffffffff810ebf07>] __call_rcu+0xd7/0x1a0
> [ 5954.064014] hardirqs last disabled at (1484376718): [<ffffffff810ebe7a>] __call_rcu+0x4a/0x1a0
> [ 5954.064014] softirqs last  enabled at (1484376586): [<ffffffff8108b911>] __do_softirq+0x161/0x270
> [ 5954.064014] softirqs last disabled at (1484376581): [<ffffffff8103af1c>] call_softirq+0x1c/0x50
> [ 5954.064014] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794
> [ 5954.064014] Call Trace:
> [ 5954.064014]  [<ffffffff810b95b0>] ? print_irqtrace_events+0xd0/0xe0
> [ 5954.064014]  [<ffffffff81076455>] __might_sleep+0xf5/0x130
> [ 5954.064014]  [<ffffffff81b1e603>] do_page_fault+0x103/0x4f0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff81b19b28>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
> [ 5954.064014]  [<ffffffff81b1af25>] page_fault+0x25/0x30
> [ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
> [ 5954.064014]  [<ffffffff810ba1c8>] ? __bfs+0xc8/0x260
> [ 5954.064014]  [<ffffffff810ba123>] ? __bfs+0x23/0x260
> [ 5954.064014]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
> [ 5954.064014]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
> [ 5954.064014]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
> [ 5954.064014]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
> [ 5954.064014]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
> [ 5954.064014]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
> [ 5954.064014]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
> [ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
> [ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
> [ 5954.064014]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
> [ 5954.064014]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
> [ 5954.064014]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
> [ 5954.064014]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
> [ 5954.064014]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
> [ 5954.064014]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
> [ 5954.064014]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
> [ 5954.064014]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 5954.064014]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
> [ 5954.064014]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b

Seems that lockdep exploded.

> [ 5954.092916] BUG: unable to handle kernel NULL pointer dereference at           (null)
> [ 5954.093806] IP: [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> [ 5954.094331] PGD 1084e5067 PUD 102368067 PMD 0
> [ 5954.094830] Oops: 0000 [#1] SMP
> [ 5954.095194] last sysfs file: /sys/devices/system/cpu/online
> [ 5954.095760] CPU 6
> [ 5954.095954] Modules linked in:
> [ 5954.096319]
> [ 5954.096483] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794 /Bochs
> [ 5954.096665] RIP: 0010:[<ffffffff810ba1c8>]  [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> [ 5954.096665] RSP: 0018:ffff8801175539a8  EFLAGS: 00010046
> [ 5954.096665] RAX: ffffffff8267d980 RBX: ffffffff8267d980 RCX: ffff880117553a48
> [ 5954.096665] RDX: ffff8801175539d0 RSI: 0000000000000000 RDI: ffff880117553a48
> [ 5954.096665] RBP: ffff880117553a08 R08: 0000000000000000 R09: 0000000000000000
> [ 5954.096665] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
> [ 5954.096665] R13: ffffffff810b8e20 R14: ffff880117553a90 R15: 0000000000000000
> [ 5954.096665] FS:  00007f4594cf3700(0000) GS:ffff8800dfa00000(0000) knlGS:0000000000000000
> [ 5954.096665] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 5954.096665] CR2: 00007f2f21e89c60 CR3: 0000000110b0f000 CR4: 00000000000006e0
> [ 5954.096665] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 5954.096665] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 5954.096665] Process rm (pid: 2927, threadinfo ffff880117552000, task ffff88010ff10b00)
> [ 5954.096665] Stack:
> [ 5954.096665]  ffffffff8267d868 00007fffa02d1428 ffff8800ffffffff ffff880100000000
> [ 5954.096665]  000000000000b720 ffff880117553a48 ffffffff8267d868 ffff880117553a48
> [ 5954.096665]  0000000000000000 ffff88010ff10b00 0000000000000000 ffffffff81dacba0
> [ 5954.096665] Call Trace:
> [ 5954.096665]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
> [ 5954.096665]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
> [ 5954.096665]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
> [ 5954.096665]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
> [ 5954.096665]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
> [ 5954.096665]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
> [ 5954.096665]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
> [ 5954.096665]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.096665]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
> [ 5954.096665]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.096665]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
> [ 5954.096665]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.096665]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
> [ 5954.096665]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
> [ 5954.096665]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
> [ 5954.096665]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
> [ 5954.096665]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
> [ 5954.096665]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
> [ 5954.096665]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
> [ 5954.096665]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
> [ 5954.096665]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
> [ 5954.096665]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 5954.096665]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
> [ 5954.096665]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> [ 5954.096665] Code: 0a 89 05 dc 0f a8 01 48 8b 41 10 48 85 c0 0f 84 1f 01 00 00 48 8d 98 70 01 00 00 48 05 80 01 00 00 45 85 c0 48 0f 44 d8 4c 8b 3b <49> 8b 07 49 39
> [ 5954.096665] RIP  [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> [ 5954.096665]  RSP <ffff8801175539a8>
> [ 5954.096665] CR2: 0000000000000000
> [ 5954.127991] ---[ end trace 85a6727c2d4e3d90 ]---

So my vfs-scale-working branch may not be entirely in the clear, seeing
as it touches the code lower in the call chain. However, I don't know
what can cause lockdep to go off the rails like this.

There is a sequence I used to hack around lockdep nesting restrictions,
following this pattern:

 repeat:
    spin_lock(&parent->d_lock);
    spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
    /* do stuff */
    spin_unlock(&parent->d_lock);
    spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
    parent = dentry;
    spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
    goto repeat;

It's not directly in this call chain, but I wonder if it could have
caused any problem?

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
@ 2010-12-07 15:24     ` Nick Piggin
  0 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-12-07 15:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-fsdevel, linux-kernel, Peter Zijlstra

On Tue, Dec 7, 2010 at 10:25 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
>>
>> Here is a new set of vfs patches for review, not that there was much interest
>> last time they were posted. It is structured like:
>>
>> * preparation patches
>> * introduce new locks to take over dcache_lock, then remove it
>> * cleaning up and reworking things for new locks
>> * rcu-walk path walking
>> * start on some fine grained locking steps
>
> Just got this set of traces doing an 8-way parallel remove of 50
> million inodes at about 40M inodes unlinked:

Thanks for testing...


> [ 5954.061633] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1081
> [ 5954.062466] in_atomic(): 0, irqs_disabled(): 1, pid: 2927, name: rm
> [ 5954.063122] 3 locks held by rm/2927:
> [ 5954.063476]  #0:  (&sb->s_type->i_mutex_key#12/1){+.+.+.}, at: [<ffffffff8116f5e1>] do_rmdir+0x81/0x130
> [ 5954.064014]  #1:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff8116d3a8>] vfs_rmdir+0x58/0xe0
> [ 5954.064014]  #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff811779c0>] shrink_dentry_list+0x0/0x430
> [ 5954.064014] irq event stamp: 1484376719
> [ 5954.064014] hardirqs last  enabled at (1484376719): [<ffffffff810ebf07>] __call_rcu+0xd7/0x1a0
> [ 5954.064014] hardirqs last disabled at (1484376718): [<ffffffff810ebe7a>] __call_rcu+0x4a/0x1a0
> [ 5954.064014] softirqs last  enabled at (1484376586): [<ffffffff8108b911>] __do_softirq+0x161/0x270
> [ 5954.064014] softirqs last disabled at (1484376581): [<ffffffff8103af1c>] call_softirq+0x1c/0x50
> [ 5954.064014] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794
> [ 5954.064014] Call Trace:
> [ 5954.064014]  [<ffffffff810b95b0>] ? print_irqtrace_events+0xd0/0xe0
> [ 5954.064014]  [<ffffffff81076455>] __might_sleep+0xf5/0x130
> [ 5954.064014]  [<ffffffff81b1e603>] do_page_fault+0x103/0x4f0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff81b19b28>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
> [ 5954.064014]  [<ffffffff81b1af25>] page_fault+0x25/0x30
> [ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
> [ 5954.064014]  [<ffffffff810ba1c8>] ? __bfs+0xc8/0x260
> [ 5954.064014]  [<ffffffff810ba123>] ? __bfs+0x23/0x260
> [ 5954.064014]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
> [ 5954.064014]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
> [ 5954.064014]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
> [ 5954.064014]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
> [ 5954.064014]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
> [ 5954.064014]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
> [ 5954.064014]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
> [ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
> [ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
> [ 5954.064014]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
> [ 5954.064014]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
> [ 5954.064014]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
> [ 5954.064014]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
> [ 5954.064014]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
> [ 5954.064014]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
> [ 5954.064014]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
> [ 5954.064014]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 5954.064014]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
> [ 5954.064014]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b

Seems that lockdep exploded.

> [ 5954.092916] BUG: unable to handle kernel NULL pointer dereference at           (null)
> [ 5954.093806] IP: [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> [ 5954.094331] PGD 1084e5067 PUD 102368067 PMD 0
> [ 5954.094830] Oops: 0000 [#1] SMP
> [ 5954.095194] last sysfs file: /sys/devices/system/cpu/online
> [ 5954.095760] CPU 6
> [ 5954.095954] Modules linked in:
> [ 5954.096319]
> [ 5954.096483] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794 /Bochs
> [ 5954.096665] RIP: 0010:[<ffffffff810ba1c8>]  [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> [ 5954.096665] RSP: 0018:ffff8801175539a8  EFLAGS: 00010046
> [ 5954.096665] RAX: ffffffff8267d980 RBX: ffffffff8267d980 RCX: ffff880117553a48
> [ 5954.096665] RDX: ffff8801175539d0 RSI: 0000000000000000 RDI: ffff880117553a48
> [ 5954.096665] RBP: ffff880117553a08 R08: 0000000000000000 R09: 0000000000000000
> [ 5954.096665] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
> [ 5954.096665] R13: ffffffff810b8e20 R14: ffff880117553a90 R15: 0000000000000000
> [ 5954.096665] FS:  00007f4594cf3700(0000) GS:ffff8800dfa00000(0000) knlGS:0000000000000000
> [ 5954.096665] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 5954.096665] CR2: 00007f2f21e89c60 CR3: 0000000110b0f000 CR4: 00000000000006e0
> [ 5954.096665] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 5954.096665] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 5954.096665] Process rm (pid: 2927, threadinfo ffff880117552000, task ffff88010ff10b00)
> [ 5954.096665] Stack:
> [ 5954.096665]  ffffffff8267d868 00007fffa02d1428 ffff8800ffffffff ffff880100000000
> [ 5954.096665]  000000000000b720 ffff880117553a48 ffffffff8267d868 ffff880117553a48
> [ 5954.096665]  0000000000000000 ffff88010ff10b00 0000000000000000 ffffffff81dacba0
> [ 5954.096665] Call Trace:
> [ 5954.096665]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
> [ 5954.096665]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
> [ 5954.096665]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
> [ 5954.096665]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
> [ 5954.096665]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
> [ 5954.096665]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
> [ 5954.096665]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
> [ 5954.096665]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.096665]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
> [ 5954.096665]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.096665]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
> [ 5954.096665]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.096665]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
> [ 5954.096665]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
> [ 5954.096665]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
> [ 5954.096665]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
> [ 5954.096665]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
> [ 5954.096665]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
> [ 5954.096665]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
> [ 5954.096665]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
> [ 5954.096665]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
> [ 5954.096665]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 5954.096665]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
> [ 5954.096665]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> [ 5954.096665] Code: 0a 89 05 dc 0f a8 01 48 8b 41 10 48 85 c0 0f 84 1f 01 00 00 48 8d 98 70 01 00 00 48 05 80 01 00 00 45 85 c0 48 0f 44 d8 4c 8b 3b <49> 8b 07 49 39
> [ 5954.096665] RIP  [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> [ 5954.096665]  RSP <ffff8801175539a8>
> [ 5954.096665] CR2: 0000000000000000
> [ 5954.127991] ---[ end trace 85a6727c2d4e3d90 ]---

So my vfs-scale-working branch may not be entirely in the clear, seeing
as it touches the code lower in the call chain. However, I don't know
what can cause lockdep to go off the rails like this.

There is a sequence I used to hack around lockdep nesting restrictions,
following this pattern:

 repeat:
    spin_lock(&parent->d_lock);
    spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
    /* do stuff */
    spin_unlock(&parent->d_lock);
    spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
    parent = dentry;
    spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
    goto repeat;

It's not directly in this call chain, but I wonder if it could have
caused any problem?

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-07 15:24     ` Nick Piggin
  (?)
@ 2010-12-07 15:49     ` Peter Zijlstra
  2010-12-07 15:59       ` Nick Piggin
  -1 siblings, 1 reply; 107+ messages in thread
From: Peter Zijlstra @ 2010-12-07 15:49 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, Nick Piggin, linux-fsdevel, linux-kernel

On Wed, 2010-12-08 at 02:24 +1100, Nick Piggin wrote:
> 
>  repeat:
>     spin_lock(&parent->d_lock);
>     spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
>     /* do stuff */
>     spin_unlock(&parent->d_lock);
>     spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
>     parent = dentry;
>     spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
>     goto repeat; 

shouldn't that be s/this_parent/parent/ ?

So what you're trying to do is:

  A -> B -> C -> ...

lock A
lock B, nested
unlock A
flip B from nested to top
lock C, nested
unlock B
flip C from nested to top
lock ...

Anyway, the way to write that is something like:

  lock_set_subclass(&dentry->d_lock.dep_map, 0, _RET_IP_);

Which will reset the subclass of the held lock from DENTRY_D_LOCK_NESTED
to 0.

This is also used in double_unlock_balance(). We go into
double_lock_balance() with this_rq locked and want to lock busiest;
because of the lock ordering we might need to unlock this_rq and lock
busiest first, at which point this_rq is nested.

On unlock we thus need to map this_rq back to subclass 0 (which it had
before double_lock_balance()), because otherwise subsequent lock
operations will be done against the subclass and confuse things.
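
As a rough userspace illustration of the A -> B -> C hand-over-hand
sequence listed above, here is a minimal sketch using pthread mutexes.
The names struct node and walk_chain() are hypothetical and exist only
for this example; pthread mutexes have no lockdep annotations, so the
point where the kernel code would reset the subclass is marked with a
comment only.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry: one lock and a link to the next level. */
struct node {
    pthread_mutex_t lock;
    struct node *child;
    int visited;
};

/*
 * Walk A -> B -> C -> ... holding at most two locks at any time.
 * Returns the number of nodes visited.
 */
static int walk_chain(struct node *parent)
{
    int count = 0;

    if (!parent)
        return 0;

    pthread_mutex_lock(&parent->lock);          /* lock A (top level) */
    for (;;) {
        struct node *child = parent->child;

        parent->visited = 1;                    /* "do stuff" */
        count++;
        if (!child)
            break;

        pthread_mutex_lock(&child->lock);       /* lock B, "nested" */
        pthread_mutex_unlock(&parent->lock);    /* unlock A */
        /*
         * In the kernel, this is the point where the still-held child
         * lock would be flipped from the nested subclass back to
         * subclass 0, so that the next level can nest under it.
         */
        parent = child;
    }
    pthread_mutex_unlock(&parent->lock);        /* drop the last lock held */
    return count;
}
```

Calling walk_chain() on the head of a three-node chain visits every
node and leaves all three mutexes unlocked on return.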



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-07 15:49     ` Peter Zijlstra
@ 2010-12-07 15:59       ` Nick Piggin
  2010-12-07 16:23         ` Peter Zijlstra
  0 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2010-12-07 15:59 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Chinner, Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 8, 2010 at 2:49 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2010-12-08 at 02:24 +1100, Nick Piggin wrote:
>>
>>  repeat:
>>     spin_lock(&parent->d_lock);
>>     spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
>>     /* do stuff */
>>     spin_unlock(&parent->d_lock);
>>     spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
>>     parent = dentry;
>>     spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
>>     goto repeat;
>
> shouldn't that be s/this_parent/parent/ ?

Yes, typo in my pseudo code.


> So what you're trying to do is:
>
>  A -> B -> C -> ...
>
> lock A
> lock B, nested
> unlock A
> flip B from nested to top
> lock C, nested
> unlock B
> flip C from nested to top
> lock ...
>
> Anyway, the way to write that is something like:
>
>  lock_set_subclass(&dentry->d_lock.dep_map, 0, _RET_IP_);
>
> Which will reset the subclass of the held lock from DENTRY_D_LOCK_NESTED
> to 0.

OK, thanks. My version should not have caused any problems though,
right? Any idea what might have caused Dave's crash?


> This is also used in double_unlock_balance(), we go into
> double_lock_balance() with this_rq locked and want to lock busiest,
> because of the lock ordering we might need to unlock this_rq and lock
> busiest first, at which point this_rq is nested.
>
> On unlock we thus need to map this_rq back to subclass 0 (which it had
> before double_lock_balance()), because otherwise subsequent lock
> operations will be done against the subclass and confuse things.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-07 15:59       ` Nick Piggin
@ 2010-12-07 16:23         ` Peter Zijlstra
  0 siblings, 0 replies; 107+ messages in thread
From: Peter Zijlstra @ 2010-12-07 16:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Dave Chinner, Nick Piggin, linux-fsdevel, linux-kernel, MingLei

On Wed, 2010-12-08 at 02:59 +1100, Nick Piggin wrote:
> OK, thanks. My version should not have caused any problems though,
> right? 

I think so, yes, although looking at it again I wonder why you use
spin_acquire(.trylock=1) -- but that too shouldn't cause anything like
the explosion.

> Any idea what might have caused Dave's crash?

Not directly, no. Usually lockdep crashes indicate use after free like
things, where we try to lock a lock that's been scribbled on. But that
usually explodes a bit earlier.

Your faulting in the middle of that breadth-first search does suggest some
data corruption, but I'm not quite sure what kind; I can't remember ever
having seen something like this before.

I've CC'ed Ming Lei who wrote the bfs search, maybe he's got an idea.


Copy of the splat:
---

> [ 5954.061633] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1081
> [ 5954.062466] in_atomic(): 0, irqs_disabled(): 1, pid: 2927, name: rm
> [ 5954.063122] 3 locks held by rm/2927:
> [ 5954.063476]  #0:  (&sb->s_type->i_mutex_key#12/1){+.+.+.}, at: [<ffffffff8116f5e1>] do_rmdir+0x81/0x130
> [ 5954.064014]  #1:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff8116d3a8>] vfs_rmdir+0x58/0xe0
> [ 5954.064014]  #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff811779c0>] shrink_dentry_list+0x0/0x430
> [ 5954.064014] irq event stamp: 1484376719
> [ 5954.064014] hardirqs last  enabled at (1484376719): [<ffffffff810ebf07>] __call_rcu+0xd7/0x1a0
> [ 5954.064014] hardirqs last disabled at (1484376718): [<ffffffff810ebe7a>] __call_rcu+0x4a/0x1a0
> [ 5954.064014] softirqs last  enabled at (1484376586): [<ffffffff8108b911>] __do_softirq+0x161/0x270
> [ 5954.064014] softirqs last disabled at (1484376581): [<ffffffff8103af1c>] call_softirq+0x1c/0x50
> [ 5954.064014] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794
> [ 5954.064014] Call Trace:
> [ 5954.064014]  [<ffffffff810b95b0>] ? print_irqtrace_events+0xd0/0xe0
> [ 5954.064014]  [<ffffffff81076455>] __might_sleep+0xf5/0x130
> [ 5954.064014]  [<ffffffff81b1e603>] do_page_fault+0x103/0x4f0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff81b19b28>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
> [ 5954.064014]  [<ffffffff81b1af25>] page_fault+0x25/0x30
> [ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
> [ 5954.064014]  [<ffffffff810ba1c8>] ? __bfs+0xc8/0x260
> [ 5954.064014]  [<ffffffff810ba123>] ? __bfs+0x23/0x260
> [ 5954.064014]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
> [ 5954.064014]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
> [ 5954.064014]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
> [ 5954.064014]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
> [ 5954.064014]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
> [ 5954.064014]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
> [ 5954.064014]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
> [ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
> [ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
> [ 5954.064014]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
> [ 5954.064014]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
> [ 5954.064014]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
> [ 5954.064014]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
> [ 5954.064014]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
> [ 5954.064014]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
> [ 5954.064014]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
> [ 5954.064014]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 5954.064014]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
> [ 5954.064014]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
                   ` (48 preceding siblings ...)
  2010-12-07 11:25 ` Dave Chinner
@ 2010-12-07 21:56 ` Dave Chinner
  2010-12-08  1:47   ` Nick Piggin
  49 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2010-12-07 21:56 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> 
> Here is a new set of vfs patches for review, not that there was much interest
> last time they were posted. It is structured like:
> 
> * preparation patches
> * introduce new locks to take over dcache_lock, then remove it
> * cleaning up and reworking things for new locks
> * rcu-walk path walking
> * start on some fine grained locking steps

Stress test doing:

	single thread 50M inode create
	single thread rm -rf
	2-way 50M inode create
	2-way rm -rf
	4-way 50M inode create
	4-way rm -rf
	8-way 50M inode create
	8-way rm -rf
	8-way 250M inode create
	8-way rm -rf

Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
with a CPU stuck spinning on here:

[37372.084012] NMI backtrace for cpu 5
[37372.084012] CPU 5 
[37372.084012] Modules linked in:
[37372.084012] 
[37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
[37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
[37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
[37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
[37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
[37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
[37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
[37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
[37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
[37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
[37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
[37372.084012] Stack:
[37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
[37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
[37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
[37372.084012] Call Trace:
[37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
[37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
[37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
[37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
[37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
[37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
[37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
[37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
[37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
[37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
[37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
[37372.084012] Code: c1 c1 41 06 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 c9 c3 66 0f 1f 44 00 00 55 48 89 e5 0f b7 07 38 e0 
[37372.084012] Call Trace:
[37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
[37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
[37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
[37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
[37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
[37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
[37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
[37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
[37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
[37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
[37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 01/46] Revert "fs: use RCU read side protection in d_validate"
  2010-11-27  9:56 ` [PATCH 01/46] Revert "fs: use RCU read side protection in d_validate" Nick Piggin
@ 2010-12-08  1:16   ` Dave Chinner
  2010-12-08  9:38     ` Nick Piggin
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2010-12-08  1:16 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Sat, Nov 27, 2010 at 08:56:03PM +1100, Nick Piggin wrote:
> This reverts commit 3825bdb7ed920845961f32f364454bee5f469abb.
> 
> Patch is broken, you can't dget() without holding any locks!

I believe you can - for the same reasons we can take a reference to
an inode without holding the inode_lock. That is, as long as the
caller already holds an active reference to the dentry,
dget() can be used to take another reference without needing the
dcache_lock.

Such usage appears to be described in the comment above dget() and
there's a BUG_ON() in dget() to catch callers that don't already
have an active reference. An example of a valid unlocked dget():
d_alloc() does an unlocked dget() to take a reference to the parent
dentry which we already are guaranteed to have a reference to.

As to d_validate() - it depends on the caller behaviour as to
whether the unlocked dget() is valid or not.  From a cursory check
of the NCP and SMB readdir caches, both appear to hold an active
reference to the dentry they pass to d_validate(). If that is
the case then there is nothing wrong with the way d_validate uses
dget(). Can someone with more SMB/NCP expertise than me validate the
use of cached dentries?
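
To make the rule concrete, here is a minimal userspace sketch of
"an unlocked get is only safe if the caller already holds a
reference". struct obj, obj_get() and obj_put() are hypothetical names
for this example and are not the dcache API; the assert() plays the
role of the BUG_ON() check described above.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical refcounted object standing in for a dentry. */
struct obj {
    atomic_int count;
};

/*
 * Take an extra reference without holding any lock. This is only valid
 * if the caller already holds a reference, so the count cannot have been
 * zero; the assert catches callers that break that rule.
 */
static struct obj *obj_get(struct obj *o)
{
    int old = atomic_fetch_add(&o->count, 1);

    assert(old > 0);
    return o;
}

/* Drop a reference; returns 1 when the last reference has gone away. */
static int obj_put(struct obj *o)
{
    return atomic_fetch_sub(&o->count, 1) == 1;
}
```

If the count could already have dropped to zero, the object may be on
its way to being freed and the increment itself touches memory that is
no longer safe to use, which is the case the question about the NCP and
SMB callers is trying to rule out.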

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-07 21:56 ` Dave Chinner
@ 2010-12-08  1:47   ` Nick Piggin
  2010-12-08  3:32     ` Dave Chinner
  0 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2010-12-08  1:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
>>
>> Here is a new set of vfs patches for review, not that there was much interest
>> last time they were posted. It is structured like:
>>
>> * preparation patches
>> * introduce new locks to take over dcache_lock, then remove it
>> * cleaning up and reworking things for new locks
>> * rcu-walk path walking
>> * start on some fine grained locking steps
>
> Stress test doing:
>
>        single thread 50M inode create
>        single thread rm -rf
>        2-way 50M inode create
>        2-way rm -rf
>        4-way 50M inode create
>        4-way rm -rf
>        8-way 50M inode create
>        8-way rm -rf
>        8-way 250M inode create
>        8-way rm -rf
>
> Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> with a CPU stuck spinning on here:
>
> [37372.084012] NMI backtrace for cpu 5
> [37372.084012] CPU 5
> [37372.084012] Modules linked in:
> [37372.084012]
> [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> [37372.084012] Stack:
> [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> [37372.084012] Call Trace:
> [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b

OK good, with any luck, that's the same bug.

Is this XFS? Is there any concurrent activity happening on the same dentries?
I.e. are the rm -rf threads running on the same directories, or is there any
reclaim happening in the background?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 02/46] fs: d_validate fixes
  2010-11-27  9:44 ` [PATCH 02/46] fs: d_validate fixes Nick Piggin
@ 2010-12-08  1:53   ` Dave Chinner
  2010-12-08  6:59     ` Nick Piggin
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2010-12-08  1:53 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Sat, Nov 27, 2010 at 08:44:32PM +1100, Nick Piggin wrote:
> d_validate has been broken for a long time.
> 
> kmem_ptr_validate does not guarantee that a pointer can be dereferenced
> if it can go away at any time. Even rcu_read_lock doesn't help, because
> the pointer might be queued in RCU callbacks but not executed yet.
> 
> So the parent cannot be checked, nor the name hashed. The dentry pointer
> can not be touched until it can be verified under lock. Hashing simply
> cannot be used.
> 
> Instead, verify the parent/child relationship by traversing parent's
> d_child list. It's slow, but only ncpfs and the destaged smbfs care
> about it, at this point.

I'd drop the previous revert patch and just convert the RCU hash
traversal straight to the d_child traversal code you introduce here.
This is a much better explanation of why the d_validate mechanism
needs to be changed, and the revert is really an unnecessary extra
step...
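
For readers following along, the d_child approach described in the
quoted changelog can be sketched in userspace roughly as below. This is
only an illustration under assumptions: struct node, node_add_child()
and node_validate_child() are hypothetical names, a tiny atomic_flag
spinlock stands in for whatever lock protects the real child list, and
a plain int stands in for the dentry refcount. The point it tries to
show is that the untrusted pointer is only ever compared, and is not
dereferenced until it has been found on the trusted parent's list with
the lock held.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry; not the kernel's struct dentry. */
struct node {
	struct node *first_child;  /* head of this node's child list */
	struct node *next_sibling; /* link in the parent's child list */
	int refcount;              /* only changed with tree_lock held */
};

/* One global spinlock stands in for the lock protecting child lists. */
static atomic_flag tree_lock = ATOMIC_FLAG_INIT;

static void tree_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&tree_lock,
						 memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void tree_lock_release(void)
{
	atomic_flag_clear_explicit(&tree_lock, memory_order_release);
}

static void node_add_child(struct node *parent, struct node *child)
{
	assert(parent != NULL && child != NULL);
	tree_lock_acquire();
	child->next_sibling = parent->first_child;
	parent->first_child = child;
	tree_lock_release();
}

/*
 * Return 1 and take a reference if @candidate is currently a child of
 * @parent, else return 0. @candidate comes from an untrusted source, so
 * it is only compared against list members; the refcount is touched
 * through @child, which was reached from the known-valid parent.
 * The walk is linear in the number of children, which is the "slow"
 * part the changelog mentions.
 */
static int node_validate_child(struct node *candidate, struct node *parent)
{
	struct node *child;
	int found = 0;

	assert(parent != NULL);
	tree_lock_acquire();
	for (child = parent->first_child; child; child = child->next_sibling) {
		if (child == candidate) {
			child->refcount++; /* still on the list, lock held */
			found = 1;
			break;
		}
	}
	tree_lock_release();
	return found;
}
```

A caller would treat a zero return as "pointer not valid" and must not
touch the candidate at all in that case.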

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-07 15:24     ` Nick Piggin
  (?)
  (?)
@ 2010-12-08  3:28     ` Nick Piggin
  -1 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-12-08  3:28 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Dave Chinner, Nick Piggin, linux-fsdevel, linux-kernel, Peter Zijlstra

On Wed, Dec 08, 2010 at 02:24:23AM +1100, Nick Piggin wrote:
> On Tue, Dec 7, 2010 at 10:25 PM, Dave Chinner <david@fromorbit.com> wrote:
> > [ 5954.061633] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1081
> > [ 5954.062466] in_atomic(): 0, irqs_disabled(): 1, pid: 2927, name: rm
> > [ 5954.063122] 3 locks held by rm/2927:
> > [ 5954.063476]  #0:  (&sb->s_type->i_mutex_key#12/1){+.+.+.}, at: [<ffffffff8116f5e1>] do_rmdir+0x81/0x130
> > [ 5954.064014]  #1:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff8116d3a8>] vfs_rmdir+0x58/0xe0
> > [ 5954.064014]  #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff811779c0>] shrink_dentry_list+0x0/0x430
> > [ 5954.064014] irq event stamp: 1484376719
> > [ 5954.064014] hardirqs last  enabled at (1484376719): [<ffffffff810ebf07>] __call_rcu+0xd7/0x1a0
> > [ 5954.064014] hardirqs last disabled at (1484376718): [<ffffffff810ebe7a>] __call_rcu+0x4a/0x1a0
> > [ 5954.064014] softirqs last  enabled at (1484376586): [<ffffffff8108b911>] __do_softirq+0x161/0x270
> > [ 5954.064014] softirqs last disabled at (1484376581): [<ffffffff8103af1c>] call_softirq+0x1c/0x50
> > [ 5954.064014] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794
> > [ 5954.064014] Call Trace:
> > [ 5954.064014]  [<ffffffff810b95b0>] ? print_irqtrace_events+0xd0/0xe0
> > [ 5954.064014]  [<ffffffff81076455>] __might_sleep+0xf5/0x130
> > [ 5954.064014]  [<ffffffff81b1e603>] do_page_fault+0x103/0x4f0
> > [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> > [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> > [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> > [ 5954.064014]  [<ffffffff81b19b28>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> > [ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
> > [ 5954.064014]  [<ffffffff81b1af25>] page_fault+0x25/0x30
> > [ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
> > [ 5954.064014]  [<ffffffff810ba1c8>] ? __bfs+0xc8/0x260
> > [ 5954.064014]  [<ffffffff810ba123>] ? __bfs+0x23/0x260
> > [ 5954.064014]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
> > [ 5954.064014]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
> > [ 5954.064014]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
> > [ 5954.064014]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
> > [ 5954.064014]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
> > [ 5954.064014]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
> > [ 5954.064014]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
> > [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> > [ 5954.064014]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
> > [ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> > [ 5954.064014]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
> > [ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> > [ 5954.064014]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
> > [ 5954.064014]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
> > [ 5954.064014]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
> > [ 5954.064014]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
> > [ 5954.064014]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
> > [ 5954.064014]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
> > [ 5954.064014]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
> > [ 5954.064014]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
> > [ 5954.064014]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
> > [ 5954.064014]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [ 5954.064014]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
> > [ 5954.064014]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> 
> Seems that lockdep exploded.
> 
> > [ 5954.092916] BUG: unable to handle kernel NULL pointer dereference at           (null)
> > [ 5954.093806] IP: [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> > [ 5954.094331] PGD 1084e5067 PUD 102368067 PMD 0
> > [ 5954.094830] Oops: 0000 [#1] SMP
> > [ 5954.095194] last sysfs file: /sys/devices/system/cpu/online
> > [ 5954.095760] CPU 6
> > [ 5954.095954] Modules linked in:
> > [ 5954.096319]
> > [ 5954.096483] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794 /Bochs
> > [ 5954.096665] RIP: 0010:[<ffffffff810ba1c8>]  [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> > [ 5954.096665] RSP: 0018:ffff8801175539a8  EFLAGS: 00010046
> > [ 5954.096665] RAX: ffffffff8267d980 RBX: ffffffff8267d980 RCX: ffff880117553a48
> > [ 5954.096665] RDX: ffff8801175539d0 RSI: 0000000000000000 RDI: ffff880117553a48
> > [ 5954.096665] RBP: ffff880117553a08 R08: 0000000000000000 R09: 0000000000000000
> > [ 5954.096665] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
> > [ 5954.096665] R13: ffffffff810b8e20 R14: ffff880117553a90 R15: 0000000000000000
> > [ 5954.096665] FS:  00007f4594cf3700(0000) GS:ffff8800dfa00000(0000) knlGS:0000000000000000
> > [ 5954.096665] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [ 5954.096665] CR2: 00007f2f21e89c60 CR3: 0000000110b0f000 CR4: 00000000000006e0
> > [ 5954.096665] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 5954.096665] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [ 5954.096665] Process rm (pid: 2927, threadinfo ffff880117552000, task ffff88010ff10b00)
> > [ 5954.096665] Stack:
> > [ 5954.096665]  ffffffff8267d868 00007fffa02d1428 ffff8800ffffffff ffff880100000000
> > [ 5954.096665]  000000000000b720 ffff880117553a48 ffffffff8267d868 ffff880117553a48
> > [ 5954.096665]  0000000000000000 ffff88010ff10b00 0000000000000000 ffffffff81dacba0
> > [ 5954.096665] Call Trace:
> > [ 5954.096665]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
> > [ 5954.096665]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
> > [ 5954.096665]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
> > [ 5954.096665]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
> > [ 5954.096665]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
> > [ 5954.096665]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
> > [ 5954.096665]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
> > [ 5954.096665]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> > [ 5954.096665]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
> > [ 5954.096665]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> > [ 5954.096665]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
> > [ 5954.096665]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> > [ 5954.096665]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
> > [ 5954.096665]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
> > [ 5954.096665]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
> > [ 5954.096665]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
> > [ 5954.096665]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
> > [ 5954.096665]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
> > [ 5954.096665]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
> > [ 5954.096665]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
> > [ 5954.096665]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
> > [ 5954.096665]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [ 5954.096665]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
> > [ 5954.096665]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> > [ 5954.096665] Code: 0a 89 05 dc 0f a8 01 48 8b 41 10 48 85 c0 0f 84 1f 01 00 00 48 8d 98 70 01 00 00 48 05 80 01 00 00 45 85 c0 48 0f 44 d8 4c 8b 3b <49> 8b 07 49 39
> > [ 5954.096665] RIP  [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> > [ 5954.096665]  RSP <ffff8801175539a8>
> > [ 5954.096665] CR2: 0000000000000000
> > [ 5954.127991] ---[ end trace 85a6727c2d4e3d90 ]---
> 
> So vfs-scale-working branch may not be entirely in the clear, seeing

Ah, this may have been a stupid little bug. The list was checked for
emptiness first, and then the pointer was reloaded from the list to be used,
so the entry that got used was not necessarily covered by the check.
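
To make the check-then-reload problem concrete, here is a small
userspace sketch. It is not the dcache code: struct list_head is
re-declared locally, and last_link_racy(), last_link_once() and
concurrent_remove() are hypothetical helpers. The racy variant does the
emptiness test and the load of head->prev as two separate reads, and a
callback simulates another CPU removing the last entry in between, so
the function can hand back the list head itself. The second variant
loads head->prev once and compares that same value against the head,
which is the general shape the patch below moves to.

```c
#include <assert.h>
#include <stddef.h>

/* Local re-declaration for the sketch; a circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Hypothetical stand-in for "another CPU removes the entry right now". */
static void concurrent_remove(void *entry)
{
	list_del(entry);
}

/*
 * Racy shape: the emptiness test and the load of head->prev are two
 * separate reads, and @window runs between them. The returned link may
 * be the head itself, which is not embedded in any entry, so converting
 * it with a container_of()-style macro would yield a bogus object.
 */
static struct list_head *last_link_racy(struct list_head *head,
					void (*window)(void *), void *arg)
{
	if (head->next == head)
		return NULL;
	if (window)
		window(arg);
	return head->prev;
}

/*
 * Load-once shape: read head->prev a single time and test that same
 * value against the head. The volatile cast stops the compiler from
 * re-reading it; it is only a rough stand-in for the kernel's RCU
 * accessors. A non-NULL result is never the head, but the entry may
 * already have been removed, so the caller still has to recheck under
 * the entry's own lock.
 */
static struct list_head *last_link_once(struct list_head *head)
{
	struct list_head *prev = *(struct list_head * volatile *)&head->prev;

	assert(prev != NULL);
	return prev == head ? NULL : prev;
}
```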

What does the asm for shrink_dentry_list look like (before this patch)?

Thanks,
Nick

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c	2010-12-08 12:41:35.000000000 +1100
+++ linux-2.6/fs/dcache.c	2010-12-08 14:23:46.000000000 +1100
@@ -657,10 +657,10 @@ static void shrink_dentry_list(struct li
 	struct dentry *dentry;
 
 	rcu_read_lock();
-	while (!list_empty(list)) {
-		dentry = list_entry(list->prev, struct dentry, d_lru);
-
-		/* Don't need RCU dereference because we recheck under lock */
+	for (;;) {
+		dentry = list_entry_rcu(list->prev, struct dentry, d_lru);
+		if (&dentry->d_lru == list)
+			break;
 		spin_lock(&dentry->d_lock);
 		if (dentry != list_entry(list->prev, struct dentry, d_lru)) {
 			spin_unlock(&dentry->d_lock);


* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-08  1:47   ` Nick Piggin
@ 2010-12-08  3:32     ` Dave Chinner
  2010-12-08  4:28       ` Dave Chinner
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2010-12-08  3:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 08, 2010 at 12:47:42PM +1100, Nick Piggin wrote:
> On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> >>
> >> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> >>
> >> Here is an new set of vfs patches for review, not that there was much interest
> >> last time they were posted. It is structured like:
> >>
> >> * preparation patches
> >> * introduce new locks to take over dcache_lock, then remove it
> >> * cleaning up and reworking things for new locks
> >> * rcu-walk path walking
> >> * start on some fine grained locking steps
> >
> > Stress test doing:
> >
> >        single thread 50M inode create
> >        single thread rm -rf
> >        2-way 50M inode create
> >        2-way rm -rf
> >        4-way 50M inode create
> >        4-way rm -rf
> >        8-way 50M inode create
> >        8-way rm -rf
> >        8-way 250M inode create
> >        8-way rm -rf
> >
> > Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> > with a CPU stuck spinning on here:
> >
> > [37372.084012] NMI backtrace for cpu 5
> > [37372.084012] CPU 5
> > [37372.084012] Modules linked in:
> > [37372.084012]
> > [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> > [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> > [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> > [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> > [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> > [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> > [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> > [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> > [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> > [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> > [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> > [37372.084012] Stack:
> > [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> > [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> > [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> > [37372.084012] Call Trace:
> > [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> > [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> > [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> > [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> > [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> > [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> > [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> > [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> > [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> > [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> > [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> 
> OK good, with any luck, that's the same bug.
> 
> Is this XFS?

Yes.

> Is there any concurrent activity happening on the same dentries?

Not from an application perspective.

> Ie. are the rm -rf threads running on the same directories,

No, each thread operating on a different directory.

> or is there any reclaim happening in the background?

IIRC, kswapd was consuming about 5-10% of a CPU during parallel
unlink tests. Mainly reclaiming XFS inodes, I think, but there may
be dentry cache reclaim going as well.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-08  3:32     ` Dave Chinner
@ 2010-12-08  4:28       ` Dave Chinner
  2010-12-08  7:09           ` Nick Piggin
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2010-12-08  4:28 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 08, 2010 at 02:32:12PM +1100, Dave Chinner wrote:
> On Wed, Dec 08, 2010 at 12:47:42PM +1100, Nick Piggin wrote:
> > On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> > >>
> > >> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> > >>
> > >> Here is an new set of vfs patches for review, not that there was much interest
> > >> last time they were posted. It is structured like:
> > >>
> > >> * preparation patches
> > >> * introduce new locks to take over dcache_lock, then remove it
> > >> * cleaning up and reworking things for new locks
> > >> * rcu-walk path walking
> > >> * start on some fine grained locking steps
> > >
> > > Stress test doing:
> > >
> > >        single thread 50M inode create
> > >        single thread rm -rf
> > >        2-way 50M inode create
> > >        2-way rm -rf
> > >        4-way 50M inode create
> > >        4-way rm -rf
> > >        8-way 50M inode create
> > >        8-way rm -rf
> > >        8-way 250M inode create
> > >        8-way rm -rf
> > >
> > > Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> > > with a CPU stuck spinning on here:
> > >
> > > [37372.084012] NMI backtrace for cpu 5
> > > [37372.084012] CPU 5
> > > [37372.084012] Modules linked in:
> > > [37372.084012]
> > > [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> > > [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> > > [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> > > [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> > > [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> > > [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> > > [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> > > [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> > > [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> > > [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> > > [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> > > [37372.084012] Stack:
> > > [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> > > [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> > > [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> > > [37372.084012] Call Trace:
> > > [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> > > [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> > > [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> > > [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> > > [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> > > [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> > > [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> > > [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> > > [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> > > [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> > > [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> > 
> > OK good, with any luck, that's the same bug.
> > 
> > Is this XFS?
> 
> Yes.
> 
> > Is there any concurrent activity happening on the same dentries?
> 
> Not from an application perspective.
> 
> > Ie. are the rm -rf threads running on the same directories,
> 
> No, each thread operating on a different directory.
> 
> > or is there any reclaim happening in the background?
> 
> IIRC, kswapd was consuming about 5-10% of a CPU during parallel
> unlink tests. Mainly reclaiming XFS inodes, I think, but there may
> be dentry cache reclaim going as well.

Turns out that the kswapd peaks are upwards of 50% of a CPU for a
few seconds, then idle for 10-15s. Typical perf top output of kswapd
while it is active during unlinks is:

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _________________

            17168.00 10.2% __call_rcu                  [kernel.kallsyms]
            13223.00  7.8% kmem_cache_free             [kernel.kallsyms]
            12917.00  7.6% down_write                  [kernel.kallsyms]
            12665.00  7.5% xfs_iunlock                 [kernel.kallsyms]
            10493.00  6.2% xfs_reclaim_inode_grab      [kernel.kallsyms]
             9314.00  5.5% __lookup_tag                [kernel.kallsyms]
             9040.00  5.4% radix_tree_delete           [kernel.kallsyms]
             8694.00  5.1% is_bad_inode                [kernel.kallsyms]
             7639.00  4.5% __ticket_spin_lock          [kernel.kallsyms]
             6821.00  4.0% _raw_spin_unlock_irqrestore [kernel.kallsyms]
             5484.00  3.2% __d_drop                    [kernel.kallsyms]
             5114.00  3.0% xfs_reclaim_inode           [kernel.kallsyms]
             4626.00  2.7% __rcu_process_callbacks     [kernel.kallsyms]
             3556.00  2.1% up_write                    [kernel.kallsyms]
             3206.00  1.9% _cond_resched               [kernel.kallsyms]
             3129.00  1.9% xfs_qm_dqdetach             [kernel.kallsyms]
             2327.00  1.4% radix_tree_tag_clear        [kernel.kallsyms]
             2327.00  1.4% call_rcu_sched              [kernel.kallsyms]
             2262.00  1.3% __ticket_spin_unlock        [kernel.kallsyms]
             2215.00  1.3% xfs_ilock                   [kernel.kallsyms]
             2200.00  1.3% radix_tree_gang_lookup_tag  [kernel.kallsyms]
             1982.00  1.2% xfs_reclaim_inodes_ag       [kernel.kallsyms]
             1736.00  1.0% xfs_trans_unlocked_item     [kernel.kallsyms]
             1707.00  1.0% __ticket_spin_trylock       [kernel.kallsyms]
             1688.00  1.0% xfs_perag_get_tag           [kernel.kallsyms]
             1660.00  1.0% flat_send_IPI_mask          [kernel.kallsyms]
             1538.00  0.9% xfs_inode_item_destroy      [kernel.kallsyms]
             1312.00  0.8% __shrink_dcache_sb          [kernel.kallsyms]
              940.00  0.6% xfs_perag_put               [kernel.kallsyms]

So there is some dentry cache reclaim going on.

FWIW, it appears there is quite a lot of RCU freeing overhead (~15%
more CPU time) in the work kswapd is doing during these unlinks, too.
I just had a look at kswapd while an 8-way create was running: it runs
at 50-60% of a CPU for seconds at a time. I caught this while it was
doing pure XFS inode cache reclaim (~10s sample, kswapd reclaimed ~1M
inodes):

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _________________

            27171.00  9.0% __call_rcu                  [kernel.kallsyms]
            21491.00  7.1% down_write                  [kernel.kallsyms]
            20916.00  6.9% xfs_reclaim_inode           [kernel.kallsyms]
            20313.00  6.7% radix_tree_delete           [kernel.kallsyms]
            15828.00  5.3% kmem_cache_free             [kernel.kallsyms]
            15819.00  5.2% xfs_idestroy_fork           [kernel.kallsyms]
            14893.00  4.9% is_bad_inode                [kernel.kallsyms]
            14666.00  4.9% _raw_spin_unlock_irqrestore [kernel.kallsyms]
            14191.00  4.7% xfs_reclaim_inode_grab      [kernel.kallsyms]
            14105.00  4.7% xfs_iunlock                 [kernel.kallsyms]
            10916.00  3.6% __ticket_spin_lock          [kernel.kallsyms]
            10125.00  3.4% xfs_iflush_cluster          [kernel.kallsyms]
             8221.00  2.7% xfs_qm_dqdetach             [kernel.kallsyms]
             7639.00  2.5% xfs_trans_unlocked_item     [kernel.kallsyms]
             7028.00  2.3% xfs_synchronize_times       [kernel.kallsyms]
             6974.00  2.3% up_write                    [kernel.kallsyms]
             5870.00  1.9% call_rcu_sched              [kernel.kallsyms]
             5634.00  1.9% _cond_resched               [kernel.kallsyms]

This shows a similar amount of RCU overhead to the unlink profile above.
And this is kswapd while it was doing dentry cache reclaim (~10s sample):

            35921.00 15.7% __d_drop                      [kernel.kallsyms]
            30056.00 13.1% __ticket_spin_trylock         [kernel.kallsyms]
            29066.00 12.7% __ticket_spin_lock            [kernel.kallsyms]
            19043.00  8.3% __call_rcu                    [kernel.kallsyms]
            10098.00  4.4% iput                          [kernel.kallsyms]
             7013.00  3.1% __shrink_dcache_sb            [kernel.kallsyms]
             6774.00  3.0% __percpu_counter_add          [kernel.kallsyms]
             6708.00  2.9% radix_tree_tag_set            [kernel.kallsyms]
             5362.00  2.3% xfs_inactive                  [kernel.kallsyms]
             5130.00  2.2% __ticket_spin_unlock          [kernel.kallsyms]
             4884.00  2.1% call_rcu_sched                [kernel.kallsyms]
             4621.00  2.0% dentry_lru_del                [kernel.kallsyms]
             3735.00  1.6% bit_waitqueue                 [kernel.kallsyms]
             3727.00  1.6% dentry_iput                   [kernel.kallsyms]
             3473.00  1.5% shrink_icache_memory          [kernel.kallsyms]
             3279.00  1.4% kfree                         [kernel.kallsyms]
             3101.00  1.4% xfs_perag_get                 [kernel.kallsyms]
             2516.00  1.1% kmem_cache_free               [kernel.kallsyms]
             2272.00  1.0% shrink_dentry_list            [kernel.kallsyms]

I've never really seen any significant dentry cache reclaim overhead
in profiles of these workloads before, so this was a bit of a
surprise....
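
As a rough cross-check of the "~15%" RCU figure given earlier in this
message for the unlink profile, here is one way to total it. This
assumes the figure is simply the sum of the symbols whose names contain
"rcu" in the first perf top listing, which the message does not
actually state. The row values are copied from that listing in tenths
of a percent, only a few rows are reproduced, and rcu_share_tenths() is
a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct perf_row {
	const char *symbol;
	int pcnt_tenths; /* 10.2% is stored as 102 to avoid float rounding */
};

/* Selected rows of the first (unlink) kswapd profile in this message. */
static const struct perf_row unlink_profile[] = {
	{ "__call_rcu",              102 },
	{ "kmem_cache_free",          78 },
	{ "down_write",               76 },
	{ "xfs_iunlock",              75 },
	{ "__rcu_process_callbacks",  27 },
	{ "call_rcu_sched",           14 },
};

/* Sum the percentages of all rows whose symbol name contains "rcu". */
static int rcu_share_tenths(const struct perf_row *rows, size_t n)
{
	int total = 0;
	size_t i;

	assert(rows != NULL);
	for (i = 0; i < n; i++) {
		if (strstr(rows[i].symbol, "rcu"))
			total += rows[i].pcnt_tenths;
	}
	return total;
}
```

With these rows the three RCU symbols add up to 143 tenths, i.e. 14.3%
of the kswapd samples, which is in the same range as the quoted figure.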

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 02/46] fs: d_validate fixes
  2010-12-08  1:53   ` Dave Chinner
@ 2010-12-08  6:59     ` Nick Piggin
  2010-12-09  0:50         ` Dave Chinner
  0 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2010-12-08  6:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 08, 2010 at 12:53:44PM +1100, Dave Chinner wrote:
> On Sat, Nov 27, 2010 at 08:44:32PM +1100, Nick Piggin wrote:
> > d_validate has been broken for a long time.
> > 
> > kmem_ptr_validate does not guarantee that a pointer can be dereferenced
> > if it can go away at any time. Even rcu_read_lock doesn't help, because
> > the pointer might be queued in RCU callbacks but not executed yet.
> > 
> > So the parent cannot be checked, nor the name hashed. The dentry pointer
> > can not be touched until it can be verified under lock. Hashing simply
> > cannot be used.
> > 
> > Instead, verify the parent/child relationship by traversing parent's
> > d_child list. It's slow, but only ncpfs and the destaged smbfs care
> > about it, at this point.
> 
> I'd drop the previous revert patch and just convert the RCU hash
> traversal straight to the d_child traversal code you introduce here.
> This is a much better explanation of why the d_validate mechanism
> needs to be changed, and the revert is really an unnecessary extra
> step...

It has to be backported, though. The patch that is to be reverted obviously
adds more brokenness, and it is a good example of why you cannot dget() under
RCU read protection even if the rest of the surrounding function is bug-free.
I wouldn't have thought it's a big deal.
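
A generic userspace illustration of why taking a reference under
RCU-style read protection alone is unsafe: the read-side protection
only keeps the memory from being reused while the reader looks at it,
it does not stop the refcount from having already dropped to zero with
the free queued. This is a sketch with C11 atomics, not the dcache code
and not a claim about what this series does; struct obj, get_plain()
and get_unless_zero() are hypothetical names. A plain increment will
"resurrect" an object that is already on its way to being freed,
whereas an increment-unless-zero refuses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; 0 means "dying, free already queued". */
struct obj {
	atomic_int refcount;
};

/* Unsafe under RCU alone: bumps the count even if the object is dying. */
static void get_plain(struct obj *o)
{
	atomic_fetch_add(&o->refcount, 1);
}

/* Take a reference only if at least one other reference still exists. */
static bool get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	assert(old >= 0);
	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}
```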





* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-08  4:28       ` Dave Chinner
@ 2010-12-08  7:09           ` Nick Piggin
  0 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-12-08  7:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 08, 2010 at 03:28:16PM +1100, Dave Chinner wrote:
> On Wed, Dec 08, 2010 at 02:32:12PM +1100, Dave Chinner wrote:
> > On Wed, Dec 08, 2010 at 12:47:42PM +1100, Nick Piggin wrote:
> > > On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > > On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> > > >>
> > > >> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> > > >>
> > > >> Here is an new set of vfs patches for review, not that there was much interest
> > > >> last time they were posted. It is structured like:
> > > >>
> > > >> * preparation patches
> > > >> * introduce new locks to take over dcache_lock, then remove it
> > > >> * cleaning up and reworking things for new locks
> > > >> * rcu-walk path walking
> > > >> * start on some fine grained locking steps
> > > >
> > > > Stress test doing:
> > > >
> > > >        single thread 50M inode create
> > > >        single thread rm -rf
> > > >        2-way 50M inode create
> > > >        2-way rm -rf
> > > >        4-way 50M inode create
> > > >        4-way rm -rf
> > > >        8-way 50M inode create
> > > >        8-way rm -rf
> > > >        8-way 250M inode create
> > > >        8-way rm -rf
> > > >
> > > > Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> > > > with a CPU stuck spinning on here:
> > > >
> > > > [37372.084012] NMI backtrace for cpu 5
> > > > [37372.084012] CPU 5
> > > > [37372.084012] Modules linked in:
> > > > [37372.084012]
> > > > [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> > > > [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> > > > [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> > > > [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> > > > [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> > > > [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> > > > [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> > > > [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> > > > [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> > > > [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> > > > [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> > > > [37372.084012] Stack:
> > > > [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> > > > [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> > > > [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> > > > [37372.084012] Call Trace:
> > > > [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> > > > [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> > > > [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> > > > [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> > > > [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> > > > [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> > > > [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> > > > [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> > > > [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> > > > [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> > > > [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> > > 
> > > OK good, with any luck, that's the same bug.
> > > 
> > > Is this XFS?
> > 
> > Yes.
> > 
> > > Is there any concurrent activity happening on the same dentries?
> > 
> > Not from an application perspective.
> > 
> > > Ie. are the rm -rf threads running on the same directories,
> > 
> > No, each thread operating on a different directory.

This is probably fixed by the same patch as the lockdep splat trace.


> > > or is there any reclaim happening in the background?
> > 
> > IIRC, kswapd was consuming about 5-10% of a CPU during parallel
> > unlink tests. Mainly reclaiming XFS inodes, I think, but there may
> > be dentry cache reclaim going as well.
> 
> Turns out that the kswapd peaks are upwards of 50% of a CPU for a
> few seconds, then idle for 10-15s. Typical perf top output of kswapd
> while it is active during unlinks is:
> 
>              samples  pcnt function                    DSO
>              _______ _____ ___________________________ _________________
> 
>             17168.00 10.2% __call_rcu                  [kernel.kallsyms]
>             13223.00  7.8% kmem_cache_free             [kernel.kallsyms]
>             12917.00  7.6% down_write                  [kernel.kallsyms]
>             12665.00  7.5% xfs_iunlock                 [kernel.kallsyms]
>             10493.00  6.2% xfs_reclaim_inode_grab      [kernel.kallsyms]
>              9314.00  5.5% __lookup_tag                [kernel.kallsyms]
>              9040.00  5.4% radix_tree_delete           [kernel.kallsyms]
>              8694.00  5.1% is_bad_inode                [kernel.kallsyms]
>              7639.00  4.5% __ticket_spin_lock          [kernel.kallsyms]
>              6821.00  4.0% _raw_spin_unlock_irqrestore [kernel.kallsyms]
>              5484.00  3.2% __d_drop                    [kernel.kallsyms]
>              5114.00  3.0% xfs_reclaim_inode           [kernel.kallsyms]
>              4626.00  2.7% __rcu_process_callbacks     [kernel.kallsyms]
>              3556.00  2.1% up_write                    [kernel.kallsyms]
>              3206.00  1.9% _cond_resched               [kernel.kallsyms]
>              3129.00  1.9% xfs_qm_dqdetach             [kernel.kallsyms]
>              2327.00  1.4% radix_tree_tag_clear        [kernel.kallsyms]
>              2327.00  1.4% call_rcu_sched              [kernel.kallsyms]
>              2262.00  1.3% __ticket_spin_unlock        [kernel.kallsyms]
>              2215.00  1.3% xfs_ilock                   [kernel.kallsyms]
>              2200.00  1.3% radix_tree_gang_lookup_tag  [kernel.kallsyms]
>              1982.00  1.2% xfs_reclaim_inodes_ag       [kernel.kallsyms]
>              1736.00  1.0% xfs_trans_unlocked_item     [kernel.kallsyms]
>              1707.00  1.0% __ticket_spin_trylock       [kernel.kallsyms]
>              1688.00  1.0% xfs_perag_get_tag           [kernel.kallsyms]
>              1660.00  1.0% flat_send_IPI_mask          [kernel.kallsyms]
>              1538.00  0.9% xfs_inode_item_destroy      [kernel.kallsyms]
>              1312.00  0.8% __shrink_dcache_sb          [kernel.kallsyms]
>               940.00  0.6% xfs_perag_put               [kernel.kallsyms]
> 
> So there is some dentry cache reclaim going on. 
> 
> FWIW, it appears there is quite a lot of RCU freeing overhead (~15%
> more CPU time) in the work kswapd is doing during these unlinks, too.
> I just had a look at kswapd when an 8-way create is running - it's running at
> 50-60% of a cpu for seconds at a time. I caught this while it was doing pure
> XFS inode cache reclaim (~10s sample, kswapd reclaimed ~1M inodes):
> 
>              samples  pcnt function                    DSO
>              _______ _____ ___________________________ _________________
> 
>             27171.00  9.0% __call_rcu                  [kernel.kallsyms]
>             21491.00  7.1% down_write                  [kernel.kallsyms]
>             20916.00  6.9% xfs_reclaim_inode           [kernel.kallsyms]
>             20313.00  6.7% radix_tree_delete           [kernel.kallsyms]
>             15828.00  5.3% kmem_cache_free             [kernel.kallsyms]
>             15819.00  5.2% xfs_idestroy_fork           [kernel.kallsyms]
>             14893.00  4.9% is_bad_inode                [kernel.kallsyms]
>             14666.00  4.9% _raw_spin_unlock_irqrestore [kernel.kallsyms]
>             14191.00  4.7% xfs_reclaim_inode_grab      [kernel.kallsyms]
>             14105.00  4.7% xfs_iunlock                 [kernel.kallsyms]
>             10916.00  3.6% __ticket_spin_lock          [kernel.kallsyms]
>             10125.00  3.4% xfs_iflush_cluster          [kernel.kallsyms]
>              8221.00  2.7% xfs_qm_dqdetach             [kernel.kallsyms]
>              7639.00  2.5% xfs_trans_unlocked_item     [kernel.kallsyms]
>              7028.00  2.3% xfs_synchronize_times       [kernel.kallsyms]
>              6974.00  2.3% up_write                    [kernel.kallsyms]
>              5870.00  1.9% call_rcu_sched              [kernel.kallsyms]
>              5634.00  1.9% _cond_resched               [kernel.kallsyms]
> 
> Which is showing a similar amount of RCU overhead as the unlink as above.
> And this while it was doing dentry cache reclaim (~10s sample):
> 
>             35921.00 15.7% __d_drop                      [kernel.kallsyms]
>             30056.00 13.1% __ticket_spin_trylock         [kernel.kallsyms]
>             29066.00 12.7% __ticket_spin_lock            [kernel.kallsyms]
>             19043.00  8.3% __call_rcu                    [kernel.kallsyms]
>             10098.00  4.4% iput                          [kernel.kallsyms]
>              7013.00  3.1% __shrink_dcache_sb            [kernel.kallsyms]
>              6774.00  3.0% __percpu_counter_add          [kernel.kallsyms]
>              6708.00  2.9% radix_tree_tag_set            [kernel.kallsyms]
>              5362.00  2.3% xfs_inactive                  [kernel.kallsyms]
>              5130.00  2.2% __ticket_spin_unlock          [kernel.kallsyms]
>              4884.00  2.1% call_rcu_sched                [kernel.kallsyms]
>              4621.00  2.0% dentry_lru_del                [kernel.kallsyms]
>              3735.00  1.6% bit_waitqueue                 [kernel.kallsyms]
>              3727.00  1.6% dentry_iput                   [kernel.kallsyms]
>              3473.00  1.5% shrink_icache_memory          [kernel.kallsyms]
>              3279.00  1.4% kfree                         [kernel.kallsyms]
>              3101.00  1.4% xfs_perag_get                 [kernel.kallsyms]
>              2516.00  1.1% kmem_cache_free               [kernel.kallsyms]
>              2272.00  1.0% shrink_dentry_list            [kernel.kallsyms]
> 
> I've never really seen any significant dentry cache reclaim overhead
> in profiles of these workloads before, so this was a bit of a
> surprise....

call_rcu shouldn't be doing much, except for disabling irqs and linking
the object into the list. I have a patch somewhere to reduce the irq
disable overhead a bit, but it really shouldn't be doing a lot of work.

Sometimes you find that touching the rcu head field needs to get a
cacheline exclusive, so a bit of work gets transferred there....
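
To make the "disable irqs and link the object into the list" point concrete,
here is a minimal userspace sketch of the enqueue side. It is a toy model,
not the kernel implementation: toy_rcu_head, toy_rcu_queue, toy_call_rcu,
toy_process_callbacks and toy_count_cb are all hypothetical names, there is
no interrupt disabling, and no grace-period tracking.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an RCU head: one link plus a callback. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *head);
};

/* Toy callback queue; tail points at the last 'next' slot. */
struct toy_rcu_queue {
	struct toy_rcu_head *head;
	struct toy_rcu_head **tail;
	unsigned long qlen;
};

void toy_queue_init(struct toy_rcu_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
	q->qlen = 0;
}

/*
 * Enqueue side: a handful of pointer updates, O(1). The two stores into
 * 'head' touch the object being freed, which is where a cacheline may
 * have to be pulled in exclusive.
 */
void toy_call_rcu(struct toy_rcu_queue *q, struct toy_rcu_head *head,
		  void (*func)(struct toy_rcu_head *head))
{
	assert(func != NULL);
	head->func = func;
	head->next = NULL;
	*q->tail = head;
	q->tail = &head->next;
	q->qlen++;
}

/* Pretend a grace period has elapsed: detach the list and run it. */
unsigned long toy_process_callbacks(struct toy_rcu_queue *q)
{
	unsigned long ran = 0;
	struct toy_rcu_head *list = q->head;

	toy_queue_init(q);
	while (list) {
		struct toy_rcu_head *next = list->next;

		list->func(list);
		list = next;
		ran++;
	}
	return ran;
}

/* Example callback that only counts invocations. */
int toy_freed;

void toy_count_cb(struct toy_rcu_head *head)
{
	(void)head;
	toy_freed++;
}
```

In this model the enqueue cost is constant per object, so a large share of
profile time there would have to come from something the sketch leaves out,
such as irq handling or cache misses on the object itself.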

But it may also be something going a bit wrong in RCU. I blew it up
once already, after the files_lock splitup that enabled all CPUs to
create and destroy files :)


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 01/46] Revert "fs: use RCU read side protection in d_validate"
  2010-12-08  1:16   ` Dave Chinner
@ 2010-12-08  9:38     ` Nick Piggin
  2010-12-09  0:44       ` Dave Chinner
  0 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2010-12-08  9:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 08, 2010 at 12:16:56PM +1100, Dave Chinner wrote:
> On Sat, Nov 27, 2010 at 08:56:03PM +1100, Nick Piggin wrote:
> > This reverts commit 3825bdb7ed920845961f32f364454bee5f469abb.
> > 
> > Patch is broken, you can't dget() without holding any locks!
> 
> I believe you can - for the same reasons we can take a reference to
> an inode without holding the inode_lock. That is, as long as the
> caller already holds an active reference to the dentry,
> dget() can be used to take another reference without needing the
> dcache_lock.
> 
> Such usage appears to be described in the comment above dget() and
> there's a BUG_ON() in dget() to catch callers that don't already
> have an active reference. An example of a valid unlocked dget():
> d_alloc() does an unlocked dget() to take a reference to the parent
> dentry which we already are guaranteed to have a reference to.

Of course you can dget if you already have a reference :)

 
> As to d_validate() - it depends on the caller behaviour as to
> whether the unlocked dget() is valid or not.  From a cursory check
> of the NCP and SMB readdir caches, both appear to hold an active
> reference to the dentry it is passing to d_validate().

I don't see where? Can you point to where the refcount is taken?
AFAIKS it drops the reference 3 lines after it puts the pointer
into cache.


> If that is
> the case then there is nothing wrong with the way d_validate uses
> dget(). Can someone with more SMB/NCP expertise than me validate the
> use of cached dentries?

Then why would it have to use d_validate if it has a reference?
That is supposed to be for an "untrusted" pointer (which is why
it had all the crazy checks that it's in kmem and in the right
slab etc).


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 01/46] Revert "fs: use RCU read side protection in d_validate"
  2010-12-08  9:38     ` Nick Piggin
@ 2010-12-09  0:44       ` Dave Chinner
  2010-12-09  4:38         ` Nick Piggin
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2010-12-09  0:44 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 08, 2010 at 08:38:24PM +1100, Nick Piggin wrote:
> On Wed, Dec 08, 2010 at 12:16:56PM +1100, Dave Chinner wrote:
> > On Sat, Nov 27, 2010 at 08:56:03PM +1100, Nick Piggin wrote:
> > > This reverts commit 3825bdb7ed920845961f32f364454bee5f469abb.
> > > 
> > > Patch is broken, you can't dget() without holding any locks!
> > 
> > I believe you can - for the same reasons we can take a reference to
> > an inode without holding the inode_lock. That is, as long as the
> > caller already holds an active reference to the dentry,
> > dget() can be used to take another reference without needing the
> > dcache_lock.
> > 
> > Such usage appears to be described in the comment above dget() and
> > there's a BUG_ON() in dget() to catch callers that don't already
> > have an active reference. An example of a valid unlocked dget():
> > d_alloc() does an unlocked dget() to take a reference to the parent
> > dentry which we already are guaranteed to have a reference to.
> 
> Of course you can dget if you already have a reference :)

Right, so the commit message is wrong. Can you update it to tell us why
dget() can't be used there - the commit message from the second
patch explained it far better....

> > As to d_validate() - it depends on the caller behaviour as to
> > whether the unlocked dget() is valid or not.  From a cursory check
> > of the NCP and SMB readdir caches, both appear to hold an active
> > reference to the dentry it is passing to d_validate().
> 
> I don't see where? Can you point to where the refcount is taken?
> AFAIKS it drops the reference 3 lines after it puts the pointer
> into cache.

Yeah, you're right, I missed that one - I spent more time checking
the validation part of the code than the initial insertion. Hence
my request:

> > If that is
> > the case then there is nothing wrong with the way d_validate uses
> > dget(). Can someone with more SMB/NCP expertise than me validate the
> > use of cached dentries?
> 
> Then why would it have to use d_validate if it has a reference?
> That is supposed to be for an "untrusted" pointer (which is why
> it had all the crazy checks that it's in kmem and in the right
> slab etc).

Code changes. It may not be doing what it was originally
needed/intended to be doing - I don't need to waste time on code
archeology and second guessing when there are others around that can
tell me this off the top of their head. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 02/46] fs: d_validate fixes
  2010-12-08  6:59     ` Nick Piggin
@ 2010-12-09  0:50         ` Dave Chinner
  0 siblings, 0 replies; 107+ messages in thread
From: Dave Chinner @ 2010-12-09  0:50 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Wed, Dec 08, 2010 at 05:59:55PM +1100, Nick Piggin wrote:
> On Wed, Dec 08, 2010 at 12:53:44PM +1100, Dave Chinner wrote:
> > On Sat, Nov 27, 2010 at 08:44:32PM +1100, Nick Piggin wrote:
> > > d_validate has been broken for a long time.
> > > 
> > > kmem_ptr_validate does not guarantee that a pointer can be dereferenced
> > > if it can go away at any time. Even rcu_read_lock doesn't help, because
> > > the pointer might be queued in RCU callbacks but not executed yet.
> > > 
> > > So the parent cannot be checked, nor the name hashed. The dentry pointer
> > > can not be touched until it can be verified under lock. Hashing simply
> > > cannot be used.
> > > 
> > > Instead, verify the parent/child relationship by traversing parent's
> > > d_child list. It's slow, but only ncpfs and the destaged smbfs care
> > > about it, at this point.
> > 
> > I'd drop the previous revert patch and just convert the RCU hash
> > traversal straight to the d_child traversal code you introduce here.
> > This is a much better explanation of why the d_validate mechanism
> > needs to be changed, and the revert is really an unnecessary extra
> > step...
> 
> Has to be backported, though.

Backported where? The d_validate() change only got included in .37-rc1.

> Patch that is to be reverted obviously
> adds more brokenness and is a good example that you cannot dget() under
> rcu read protection even if the rest of the surrounding function is
> bugfree. I wouldn't have thought it's a big deal.

Reverting something broken to something already broken just to fix
to the less broken version seems like an unnecessary step. Just
fix the brokenness in a single patch - no need to indirect the real
fix through a revert. One less patch to worry about.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 01/46] Revert "fs: use RCU read side protection in d_validate"
  2010-12-09  0:44       ` Dave Chinner
@ 2010-12-09  4:38         ` Nick Piggin
  2010-12-09  5:16           ` Nick Piggin
  0 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2010-12-09  4:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, Nick Piggin, linux-fsdevel, linux-kernel

On Thu, Dec 09, 2010 at 11:44:13AM +1100, Dave Chinner wrote:
> On Wed, Dec 08, 2010 at 08:38:24PM +1100, Nick Piggin wrote:
> > On Wed, Dec 08, 2010 at 12:16:56PM +1100, Dave Chinner wrote:
> > > On Sat, Nov 27, 2010 at 08:56:03PM +1100, Nick Piggin wrote:
> > > > This reverts commit 3825bdb7ed920845961f32f364454bee5f469abb.
> > > > 
> > > > Patch is broken, you can't dget() without holding any locks!
> > > 
> > > I believe you can - for the same reasons we can take a reference to
> > > an inode without holding the inode_lock. That is, as long as the
> > > caller already holds an active reference to the dentry,
> > > dget() can be used to take another reference without needing the
> > > dcache_lock.
> > > 
> > > Such usage appears to be described in the comment above dget() and
> > > there's a BUG_ON() in dget() to catch callers that don't already
> > > have an active reference. An example of a valid unlocked dget():
> > > d_alloc() does an unlocked dget() to take a reference to the parent
> > > > dentry which we already are guaranteed to have a reference to.
> > 
> > Of course you can dget if you already have a reference :)
> 
> Right, so the commit message is wrong. Can you update it to tell us why
> dget() can't be used there - the commit message from the second
> patch explained it far better....

I suppose if you're not reading it in the context of d_validate,
then yes. And as an historical record, I'll clarify.

Obviously if we do have a reference, then we can take another,
and if we don't, then we need more than RCU because RCU only
provides persistence guarantee for the memory, not any persistence
or validity guarantee for the object.
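
The distinction can be sketched in userspace C11. This is only an
illustration of the general rule, under the assumption that something else
(RCU, in the kernel) keeps the memory from being reused; it is not what the
dcache does, and it is not a fix for d_validate, where the pointer may
already have been freed. struct toy_obj, toy_get, toy_get_unless_zero and
toy_put are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; count == 0 means "dead, awaiting free". */
struct toy_obj {
	atomic_int count;
};

/* Valid only when the caller already holds a reference (count >= 1). */
void toy_get(struct toy_obj *o)
{
	int old = atomic_fetch_add(&o->count, 1);

	assert(old > 0);	/* same idea as the BUG_ON() in dget() */
	(void)old;
}

/*
 * For a caller without a reference: the memory is assumed to still be
 * there, but the object may already be dead, so take a reference only if
 * the count has not dropped to zero.
 */
bool toy_get_unless_zero(struct toy_obj *o)
{
	int old = atomic_load(&o->count);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
bool toy_put(struct toy_obj *o)
{
	return atomic_fetch_sub(&o->count, 1) == 1;
}
```

An unconditional toy_get() on an object whose count has already reached
zero would resurrect something that is queued for freeing, which is the
pattern of bug being reverted here.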

 
> > > As to d_validate() - it depends on the caller behaviour as to
> > > whether the unlocked dget() is valid or not.  From a cursory check
> > > of the NCP and SMB readdir caches, both appear to hold an active
> > > reference to the dentry it is passing to d_validate().
> > 
> > I don't see where? Can you point to where the refcount is taken?
> > AFAIKS it drops the reference 3 lines after it puts the pointer
> > into cache.
> 
> Yeah, you're right, I missed that one - I spent more time checking
> the validation part of the code than the initial insertion. Hence
> my request:

Yes, I'm pretty sure it doesn't have any references.

 
> > > If that is
> > > the case then there is nothing wrong with the way d_validate uses
> > > dget(). Can someone with more SMB/NCP expertise than me validate the
> > > use of cached dentries?
> > 
> > Then why would it have to use d_validate if it has a reference?
> > That is supposed to be for an "untrusted" pointer (which is why
> > it had all the crazy checks that it's in kmem and in the right
> > slab etc).
> 
> Code changes. It may not be doing what it was originally
> needed/intended to be doing - I don't need to waste time on code
> archeology and second guessing when there are others around that can
> tell me this off the top of their head. ;)

Well the d_validate API is meant to provide that, so it's broken
whether or not its callers use it correctly. It's also exported
to external modules...
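
For reference, the approach described in patch 02/46 (verify the
parent/child relationship by walking the parent's child list) can be
modelled roughly as below. This is a userspace sketch with made-up names
(toy_dentry, toy_add_child, toy_validate) and a singly linked list instead
of the kernel's list_head; locking is not modelled at all, whereas the real
code has to hold the appropriate lock across the walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy dentry: children hang off the parent in a singly linked list. */
struct toy_dentry {
	struct toy_dentry *parent;
	struct toy_dentry *first_child;
	struct toy_dentry *next_sibling;
	int refcount;
};

void toy_add_child(struct toy_dentry *parent, struct toy_dentry *child)
{
	child->parent = parent;
	child->first_child = NULL;
	child->next_sibling = parent->first_child;
	child->refcount = 0;
	parent->first_child = child;
}

/*
 * 'candidate' is untrusted: it is only ever compared by value and never
 * dereferenced. A reference is taken through 'd', which was reached from
 * the trusted parent's list.
 */
bool toy_validate(const void *candidate, struct toy_dentry *parent)
{
	struct toy_dentry *d;

	for (d = parent->first_child; d != NULL; d = d->next_sibling) {
		if ((const void *)d == candidate) {
			d->refcount++;
			return true;
		}
	}
	return false;
}
```

The cost is linear in the number of children, which is why the patch marks
the function as slow for big directories and deprecated.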

Yes we should remove smbfs and rip the cache out of ncpfs and
remove d_validate entirely when possible (or, provide a more
reasonable API and caching library entirely in the dcache code
that a filesystem might use). But this is the right first step.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 02/46] fs: d_validate fixes
  2010-12-09  0:50         ` Dave Chinner
@ 2010-12-09  4:50           ` Nick Piggin
  -1 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-12-09  4:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Thu, Dec 09, 2010 at 11:50:29AM +1100, Dave Chinner wrote:
> On Wed, Dec 08, 2010 at 05:59:55PM +1100, Nick Piggin wrote:
> > On Wed, Dec 08, 2010 at 12:53:44PM +1100, Dave Chinner wrote:
> > > On Sat, Nov 27, 2010 at 08:44:32PM +1100, Nick Piggin wrote:
> > > > d_validate has been broken for a long time.
> > > > 
> > > > kmem_ptr_validate does not guarantee that a pointer can be dereferenced
> > > > if it can go away at any time. Even rcu_read_lock doesn't help, because
> > > > the pointer might be queued in RCU callbacks but not executed yet.
> > > > 
> > > > So the parent cannot be checked, nor the name hashed. The dentry pointer
> > > > can not be touched until it can be verified under lock. Hashing simply
> > > > cannot be used.
> > > > 
> > > > Instead, verify the parent/child relationship by traversing parent's
> > > > d_child list. It's slow, but only ncpfs and the destaged smbfs care
> > > > about it, at this point.
> > > 
> > > I'd drop the previous revert patch and just convert the RCU hash
> > > traversal straight to the d_child traversal code you introduce here.
> > > This is a much better explanation of why the d_validate mechanism
> > > needs to be changed, and the revert is really an unnecessary extra
> > > step...
> > 
> > Has to be backported, though.
> 
> Backported where? The d_validate() change only got included in .37-rc1.

Backported to stable/distro kernels I suppose. I'm not sure what your
point is?

 
> > Patch that is to be reverted obviously
> > adds more brokenness and is a good example that you cannot dget() under
> > rcu read protection even if the rest of the surrounding function is
> > bugfree. I wouldn't have thought it's a big deal.
> 
> Reverting something broken to something already broken just to fix
> to the less broken version seems like an unnecessary step. Just
> fix the brokenness in a single patch - no need to indirect the real
> fix through a revert. One less patch to worry about.

OK but I disagree. Firstly, reverting that patch gives a good record of
that particular pattern of bug (that Christoph and Al both missed).
With more RCU going into the vfs, people need to be pretty clear about
the pitfalls.

Secondly, as I said, reverting means that I can use the exact same patch
for upstream and stable kernels.

And finally, it gives better bisectability. If somebody hits a bug in
my patch, I would rather have them bisect into the well-worn (if buggy)
version of the code than bisect into a different type of brokenness.

It isn't indirecting the real fix through a revert; the two are broken in
different ways. My fix is for the bug that it doesn't guarantee the
persistence of the *memory* we are using, and the revert is for the bug that
it doesn't guarantee the persistence/validity of the *object*, which is
actually more likely to be a problem if you think about it, because the
window is much larger.
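
As a rough userspace illustration of the *object* half of that distinction
(not kernel code, and not what either patch does; struct obj and both
helpers are made up for the example): a blind increment will happily take a
count from zero back to one on an object that is already being torn down,
whereas an increment-unless-zero refuses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object; this is not struct dentry. */
struct obj {
	atomic_int count;	/* 0 means the object is already being torn down */
};

/* Blind increment: "succeeds" even on a dead object, which is the bug pattern. */
void obj_get_blind(struct obj *o)
{
	atomic_fetch_add(&o->count, 1);
}

/* Take a reference only if the object is still live (count > 0). */
bool obj_get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->count);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->count, &old, old + 1))
			return true;
	}
	return false;
}
```

Note this only addresses object validity. It still assumes the memory
holding the count stays readable, which is the other bug.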

Git has no problem with lots of patches, so I don't see any advantage
to doing one patch, and you lose the advantages above.



^ permalink raw reply	[flat|nested] 107+ messages in thread


* Re: [PATCH 01/46] Revert "fs: use RCU read side protection in d_validate"
  2010-12-09  4:38         ` Nick Piggin
@ 2010-12-09  5:16           ` Nick Piggin
  0 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-12-09  5:16 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, Nick Piggin, linux-fsdevel, linux-kernel

On Thu, Dec 09, 2010 at 03:38:42PM +1100, Nick Piggin wrote:
> > Code changes. It may not be doing what it was originally
> > needed/intended to be doing - I don't need to waste time on code
> > archeology and second guessing when there are others around that can
> > tell me this off the top of their head. ;)
> 
> Well the d_validate API is meant to provide that, so it's broken
> whether or not its callers use it correctly. It's also exported
> to external modules...
> 
> Yes we should remove smbfs and rip the cache out of ncpfs and
> remove d_validate entirely when possible (or, provide a more

The reason why I don't remove the caching crap entirely is because
it is not a trivial change to properly remove it, and I don't have
the time or inclination to do it properly and have it tested when
it's easy to provide a correct, simple, and back compatible API.

The patch Christoph posted for smbfs and ncpfs just hacks out where the
dentry is used from the cache, but leaves a lot of cache infrastructure
there to rot.


> reasonable API and caching library entirely in the dcache code
> that a filesystem might use). But this is the right first step.

For the record, if anyone cares, the sanest and simplest way to do this
would be to store the dentry's hash along with its pointer, and have
interfaces to allocate, destroy, insert, and delete from cache, and
lookup based on a supplied compare function. (but obviously we shouldn't
bother unless some good numbers turn up)
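
A rough userspace sketch of what such an interface could look like, assuming
a simple singly linked list and no locking. Every name here (ptr_cache,
cache_insert and so on) is hypothetical; nothing like this exists in the
dcache code, and how lookups would tie into the real dcache hash is left
out. The point is only that lookup matches on the saved hash plus a
caller-supplied compare, and the cache itself never dereferences the cached
pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is not an existing kernel API. */
struct cache_entry {
	unsigned int hash;		/* hash saved when the pointer was cached */
	void *ptr;			/* never dereferenced by the cache itself */
	struct cache_entry *next;
};

struct ptr_cache {
	struct cache_entry *head;
};

/* Caller-supplied match: nonzero if the cached pointer is the one wanted. */
typedef int (*cache_cmp_fn)(const void *cached, const void *key);

struct ptr_cache *cache_alloc(void)
{
	return calloc(1, sizeof(struct ptr_cache));
}

int cache_insert(struct ptr_cache *c, unsigned int hash, void *ptr)
{
	struct cache_entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->hash = hash;
	e->ptr = ptr;
	e->next = c->head;
	c->head = e;
	return 0;
}

/* Remove the first entry holding ptr; returns 1 if one was removed. */
int cache_delete(struct ptr_cache *c, const void *ptr)
{
	struct cache_entry **pp;

	for (pp = &c->head; *pp; pp = &(*pp)->next) {
		if ((*pp)->ptr == ptr) {
			struct cache_entry *dead = *pp;

			*pp = dead->next;
			free(dead);
			return 1;
		}
	}
	return 0;
}

/* Match on the saved hash first, then on the supplied compare function. */
void *cache_lookup(struct ptr_cache *c, unsigned int hash,
		   cache_cmp_fn cmp, const void *key)
{
	struct cache_entry *e;

	for (e = c->head; e; e = e->next)
		if (e->hash == hash && cmp(e->ptr, key))
			return e->ptr;
	return NULL;
}

void cache_destroy(struct ptr_cache *c)
{
	while (c->head) {
		struct cache_entry *e = c->head;

		c->head = e->next;
		free(e);
	}
	free(c);
}

/* Example compare: pointer identity, so nothing is dereferenced. */
int match_ptr(const void *cached, const void *key)
{
	return cached == key;
}
```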


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 11/46] fs: dcache scale hash
  2010-11-27  9:44 ` [PATCH 11/46] fs: dcache scale hash Nick Piggin
@ 2010-12-09  6:09   ` Dave Chinner
  2010-12-09  6:28     ` Nick Piggin
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2010-12-09  6:09 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Sat, Nov 27, 2010 at 08:44:41PM +1100, Nick Piggin wrote:
> Add a new lock, dcache_hash_lock, to protect the dcache hash table from
> concurrent modification. d_hash is also protected by d_lock.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> ---
>  fs/dcache.c            |   38 +++++++++++++++++++++++++++-----------
>  include/linux/dcache.h |    3 +++
>  2 files changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 4f9ccbe..50c65c7 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -35,12 +35,27 @@
>  #include <linux/hardirq.h>
>  #include "internal.h"
>  
> +/*
> + * Usage:
> + * dcache_hash_lock protects dcache hash table
> + *
> + * Ordering:
> + * dcache_lock
> + *   dentry->d_lock
> + *     dcache_hash_lock
> + *

What locking is used to keep DCACHE_UNHASHED/d_unhashed() in check
with whether the dentry is on the hash list or not? It looks to
me that to make any hash modification, you have to hold both the
dentry->d_lock and the dcache_hash_lock to keep them in step. If
this is correct, can you add this to the comments above?

> + * if (dentry1 < dentry2)
> + *   dentry1->d_lock
> + *     dentry2->d_lock
> + */

Perhaps the places where we need to lock two dentries should use a
wrapper like we do for other objects. Such as:

void dentry_dlock_two(struct dentry *d1, struct dentry *d2)
{
	if (d1 < d2) {
		spin_lock(&d1->d_lock);
		spin_lock_nested(&d2->d_lock, DENTRY_D_LOCK_NESTED);
	} else {
		spin_lock(&d2->d_lock);
		spin_lock_nested(&d1->d_lock, DENTRY_D_LOCK_NESTED);
	}
}
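
A userspace analogue of that wrapper, as a minimal sketch: it assumes a toy
spinlock built on C11 atomics in place of spin_lock(), and struct node,
node_lock_two() and node_unlock_two() are made-up names. It compares
addresses through uintptr_t, and it also covers the case where both
arguments are the same object, which the wrapper above does not.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy spinlock standing in for spinlock_t; hypothetical, userspace only. */
struct tiny_lock {
	atomic_bool held;
};

void tiny_lock_init(struct tiny_lock *l)
{
	atomic_init(&l->held, false);
}

/* Returns true if the lock was free and is now held by the caller. */
bool tiny_lock_try(struct tiny_lock *l)
{
	return !atomic_exchange_explicit(&l->held, true, memory_order_acquire);
}

void tiny_lock_acquire(struct tiny_lock *l)
{
	while (!tiny_lock_try(l))
		;	/* spin until the holder releases */
}

void tiny_lock_release(struct tiny_lock *l)
{
	atomic_store_explicit(&l->held, false, memory_order_release);
}

struct node {
	struct tiny_lock lock;
};

/* Always take the lower-addressed lock first so two callers cannot deadlock. */
void node_lock_two(struct node *a, struct node *b)
{
	if (a == b) {
		tiny_lock_acquire(&a->lock);
		return;
	}
	if ((uintptr_t)a > (uintptr_t)b) {
		struct node *tmp = a;

		a = b;
		b = tmp;
	}
	tiny_lock_acquire(&a->lock);
	tiny_lock_acquire(&b->lock);
}

/* Release order does not matter for correctness; skip the second if a == b. */
void node_unlock_two(struct node *a, struct node *b)
{
	tiny_lock_release(&a->lock);
	if (a != b)
		tiny_lock_release(&b->lock);
}
```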

> @@ -1581,7 +1598,9 @@ void d_rehash(struct dentry * entry)
>  {
>  	spin_lock(&dcache_lock);
>  	spin_lock(&entry->d_lock);
> +	spin_lock(&dcache_hash_lock);
>  	_d_rehash(entry);
> +	spin_unlock(&dcache_hash_lock);
>  	spin_unlock(&entry->d_lock);
>  	spin_unlock(&dcache_lock);
>  }

Shouldn't we really kill _d_rehash() by replacing all the callers
with direct calls to __d_rehash() first? There doesn't seem to be much
sense to keep both methods around....

> @@ -1661,8 +1680,6 @@ static void switch_names(struct dentry *dentry, struct dentry *target)
>   */
>  static void d_move_locked(struct dentry * dentry, struct dentry * target)
>  {
> -	struct hlist_head *list;
> -
>  	if (!dentry->d_inode)
>  		printk(KERN_WARNING "VFS: moving negative dcache entry\n");
>  
> @@ -1679,14 +1696,11 @@ static void d_move_locked(struct dentry * dentry, struct dentry * target)
>  	}
>  
>  	/* Move the dentry to the target hash queue, if on different bucket */
> -	if (d_unhashed(dentry))
> -		goto already_unhashed;
> -
> -	hlist_del_rcu(&dentry->d_hash);
> -
> -already_unhashed:
> -	list = d_hash(target->d_parent, target->d_name.hash);
> -	__d_rehash(dentry, list);
> +	spin_lock(&dcache_hash_lock);
> +	if (!d_unhashed(dentry))
> +		hlist_del_rcu(&dentry->d_hash);
> +	__d_rehash(dentry, d_hash(target->d_parent, target->d_name.hash));
> +	spin_unlock(&dcache_hash_lock);
>  
>  	/* Unhash the target: dput() will then get rid of it */
>  	__d_drop(target);
> @@ -1883,7 +1897,9 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
>  found_lock:
>  	spin_lock(&actual->d_lock);
>  found:
> +	spin_lock(&dcache_hash_lock);
>  	_d_rehash(actual);
> +	spin_unlock(&dcache_hash_lock);
>  	spin_unlock(&actual->d_lock);
>  	spin_unlock(&dcache_lock);
>  out_nolock:
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 6b5760b..7ce20f5 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -181,6 +181,7 @@ struct dentry_operations {
>  
>  #define DCACHE_CANT_MOUNT	0x0100
>  
> +extern spinlock_t dcache_hash_lock;
>  extern spinlock_t dcache_lock;
>  extern seqlock_t rename_lock;
>  
> @@ -204,7 +205,9 @@ static inline void __d_drop(struct dentry *dentry)
>  {
>  	if (!(dentry->d_flags & DCACHE_UNHASHED)) {
>  		dentry->d_flags |= DCACHE_UNHASHED;
> +		spin_lock(&dcache_hash_lock);
>  		hlist_del_rcu(&dentry->d_hash);
> +		spin_unlock(&dcache_hash_lock);
>  	}
>  }

Un-inline __d_drop so you don't need to make the dcache_hash_lock
visible outside of fs/dcache.c. That happens later in the series
anyway, so may as well do it now...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 11/46] fs: dcache scale hash
  2010-12-09  6:09   ` Dave Chinner
@ 2010-12-09  6:28     ` Nick Piggin
  2010-12-09  8:17       ` Dave Chinner
  0 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2010-12-09  6:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Thu, Dec 09, 2010 at 05:09:11PM +1100, Dave Chinner wrote:
> On Sat, Nov 27, 2010 at 08:44:41PM +1100, Nick Piggin wrote:
> > Add a new lock, dcache_hash_lock, to protect the dcache hash table from
> > concurrent modification. d_hash is also protected by d_lock.
> > 
> > Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> > ---
> >  fs/dcache.c            |   38 +++++++++++++++++++++++++++-----------
> >  include/linux/dcache.h |    3 +++
> >  2 files changed, 30 insertions(+), 11 deletions(-)
> > 
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index 4f9ccbe..50c65c7 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -35,12 +35,27 @@
> >  #include <linux/hardirq.h>
> >  #include "internal.h"
> >  
> > +/*
> > + * Usage:
> > + * dcache_hash_lock protects dcache hash table
> > + *
> > + * Ordering:
> > + * dcache_lock
> > + *   dentry->d_lock
> > + *     dcache_hash_lock
> > + *
> 
> What locking is used to keep DCACHE_UNHASHED/d_unhashed() in check
> with whether the dentry is on the hash list or not? It looks to
> me that to make any hash modification, you have to hold both the
> dentry->d_lock and the dcache_hash_lock to keep them in step. If
> this is correct, can you add this to the comments above?

No, dcache_lock still does that. d_unhashed is protected with
d_lock a few patches later, which adds to the comment.

 
> > + * if (dentry1 < dentry2)
> > + *   dentry1->d_lock
> > + *     dentry2->d_lock
> > + */
> 
> Perhaps the places where we need to lock two dentries should use a
> wrapper like we do for other objects. Such as:
> 
> void dentry_dlock_two(struct dentry *d1, struct dentry *d2)
> {
> 	if (d1 < d2) {
> 		spin_lock(&d1->d_lock);
> 		spin_lock_nested(&d2->d_lock, DENTRY_D_LOCK_NESTED);
> 	} else {
> 		spin_lock(&d2->d_lock);
> 		spin_lock_nested(&d1->d_lock, DENTRY_D_LOCK_NESTED);
> 	}
> }

It only happens once in rename, so I don't think it's useful.
Nothing outside core code should be locking 2 unrelated dentries.

 
> > @@ -1581,7 +1598,9 @@ void d_rehash(struct dentry * entry)
> >  {
> >  	spin_lock(&dcache_lock);
> >  	spin_lock(&entry->d_lock);
> > +	spin_lock(&dcache_hash_lock);
> >  	_d_rehash(entry);
> > +	spin_unlock(&dcache_hash_lock);
> >  	spin_unlock(&entry->d_lock);
> >  	spin_unlock(&dcache_lock);
> >  }
> 
> Shouldn't we really kill _d_rehash() by replacing all the callers
> with direct calls to __d_rehash() first? There doesn't seem to be much
> sense to keep both methods around....

No. Several filesystems are using it, and it's an exported symbol. I'm
focusing on changes to locking, and keeping APIs the same, where
possible. I don't want just more and more dependencies on pushing
through filesystem changes before this series.

Like I said, there are infinite cleanups or improvements you can make.
It does not particularly matter that they happen before or after the
scaling work, except if there are classes of APIs that the new locking
model can no longer support.

 
> > @@ -1661,8 +1680,6 @@ static void switch_names(struct dentry *dentry, struct dentry *target)
> >   */
> >  static void d_move_locked(struct dentry * dentry, struct dentry * target)
> >  {
> > -	struct hlist_head *list;
> > -
> >  	if (!dentry->d_inode)
> >  		printk(KERN_WARNING "VFS: moving negative dcache entry\n");
> >  
> > @@ -1679,14 +1696,11 @@ static void d_move_locked(struct dentry * dentry, struct dentry * target)
> >  	}
> >  
> >  	/* Move the dentry to the target hash queue, if on different bucket */
> > -	if (d_unhashed(dentry))
> > -		goto already_unhashed;
> > -
> > -	hlist_del_rcu(&dentry->d_hash);
> > -
> > -already_unhashed:
> > -	list = d_hash(target->d_parent, target->d_name.hash);
> > -	__d_rehash(dentry, list);
> > +	spin_lock(&dcache_hash_lock);
> > +	if (!d_unhashed(dentry))
> > +		hlist_del_rcu(&dentry->d_hash);
> > +	__d_rehash(dentry, d_hash(target->d_parent, target->d_name.hash));
> > +	spin_unlock(&dcache_hash_lock);
> >  
> >  	/* Unhash the target: dput() will then get rid of it */
> >  	__d_drop(target);
> > @@ -1883,7 +1897,9 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
> >  found_lock:
> >  	spin_lock(&actual->d_lock);
> >  found:
> > +	spin_lock(&dcache_hash_lock);
> >  	_d_rehash(actual);
> > +	spin_unlock(&dcache_hash_lock);
> >  	spin_unlock(&actual->d_lock);
> >  	spin_unlock(&dcache_lock);
> >  out_nolock:
> > diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> > index 6b5760b..7ce20f5 100644
> > --- a/include/linux/dcache.h
> > +++ b/include/linux/dcache.h
> > @@ -181,6 +181,7 @@ struct dentry_operations {
> >  
> >  #define DCACHE_CANT_MOUNT	0x0100
> >  
> > +extern spinlock_t dcache_hash_lock;
> >  extern spinlock_t dcache_lock;
> >  extern seqlock_t rename_lock;
> >  
> > @@ -204,7 +205,9 @@ static inline void __d_drop(struct dentry *dentry)
> >  {
> >  	if (!(dentry->d_flags & DCACHE_UNHASHED)) {
> >  		dentry->d_flags |= DCACHE_UNHASHED;
> > +		spin_lock(&dcache_hash_lock);
> >  		hlist_del_rcu(&dentry->d_hash);
> > +		spin_unlock(&dcache_hash_lock);
> >  	}
> >  }
> 
> Un-inline __d_drop so you don't need to make the dcache_hash_lock
> visible outside of fs/dcache.c. That happens later in the series
> anyway, so may as well do it now...

Yeah that makes sense.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 12/46] fs: dcache scale lru
  2010-11-27  9:44 ` [PATCH 12/46] fs: dcache scale lru Nick Piggin
@ 2010-12-09  7:22   ` Dave Chinner
  2010-12-09 12:34     ` Nick Piggin
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2010-12-09  7:22 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Sat, Nov 27, 2010 at 08:44:42PM +1100, Nick Piggin wrote:
> Add a new lock, dcache_lru_lock, to protect the dcache LRU list from concurrent
> modification. d_lru is also protected by d_lock.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> ---
>  fs/dcache.c |  112 ++++++++++++++++++++++++++++++++++++++++++++---------------
>  1 files changed, 84 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 50c65c7..aa410b6 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -37,11 +37,19 @@
>  
>  /*
>   * Usage:
> - * dcache_hash_lock protects dcache hash table
> + * dcache_hash_lock protects:
> + *   - the dcache hash table
> + * dcache_lru_lock protects:
> + *   - the dcache lru lists and counters
> + * d_lock protects:
> + *   - d_flags

Which bit of d_flags does it protect? Why in this patch and not in
the hash scaling patch which needs DCACHE_UNHASHED to be in sync with
whether it is in the hash list?

> + *   - d_name
> + *   - d_lru

Why is the d_lock required to protect d_lru? I can't see any reason
in this patch that requires d_lock to protect anything that
dcache_lru_lock does not protect....

>  /**
> @@ -186,6 +206,8 @@ static void dentry_lru_move_tail(struct dentry *dentry)
>   * The dentry must already be unhashed and removed from the LRU.
>   *
>   * If this is the root of the dentry tree, return NULL.
> + *
> + * dcache_lock and d_lock must be held by caller, are dropped by d_kill.
>   */
>  static struct dentry *d_kill(struct dentry *dentry)
>  	__releases(dentry->d_lock)
> @@ -341,10 +363,19 @@ int d_invalidate(struct dentry * dentry)
>  EXPORT_SYMBOL(d_invalidate);
>  
>  /* This should be called _only_ with dcache_lock held */
> +static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
> +{
> +	atomic_inc(&dentry->d_count);
> +	dentry_lru_del(dentry);
> +	return dentry;
> +}
> +
>  static inline struct dentry * __dget_locked(struct dentry *dentry)
>  {
>  	atomic_inc(&dentry->d_count);
> +	spin_lock(&dentry->d_lock);
>  	dentry_lru_del(dentry);
> +	spin_unlock(&dentry->d_lock);
>  	return dentry;
>  }

Why do we need to call dentry_lru_del() in __dget_* functions? The
shrinkers already handle lazy deletion just fine, and this just
seems like a method for bouncing the dcache_lru_lock around.

The new lazy inode lru code which was modelled on the dcache code
does not remove inodes from the LRU when taking new references to an
inode and it seems to function just fine, so I'm thinking that we
should be making the dcache LRU even more lazy by removing these
calls to dentry_lru_del() here.

I think this is important as lockstat is telling me
the dcache_lru_lock is the most heavily contended lock in the system
in my testing....

> @@ -474,21 +505,31 @@ static void shrink_dentry_list(struct list_head *list)
>  
>  	while (!list_empty(list)) {
>  		dentry = list_entry(list->prev, struct dentry, d_lru);
> -		dentry_lru_del(dentry);
> +
> +		if (!spin_trylock(&dentry->d_lock)) {
> +			spin_unlock(&dcache_lru_lock);
> +			cpu_relax();
> +			spin_lock(&dcache_lru_lock);
> +			continue;
> +		}

Wouldn't it be better to move the entry to the tail of the list
here so we don't get stuck spinning on the same dentry for a length
of time?

> @@ -509,32 +550,36 @@ static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
>  	int cnt = *count;
>  
>  	spin_lock(&dcache_lock);
> +relock:
> +	spin_lock(&dcache_lru_lock);
>  	while (!list_empty(&sb->s_dentry_lru)) {
>  		dentry = list_entry(sb->s_dentry_lru.prev,
>  				struct dentry, d_lru);
>  		BUG_ON(dentry->d_sb != sb);
>  
> +		if (!spin_trylock(&dentry->d_lock)) {
> +			spin_unlock(&dcache_lru_lock);
> +			cpu_relax();
> +			goto relock;
> +		}

Same again - if the dentry is locked, then it is likely to be
newly referenced so move it to the tail of the list and continue....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 11/46] fs: dcache scale hash
  2010-12-09  6:28     ` Nick Piggin
@ 2010-12-09  8:17       ` Dave Chinner
  2010-12-09 12:53         ` Nick Piggin
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2010-12-09  8:17 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Thu, Dec 09, 2010 at 05:28:01PM +1100, Nick Piggin wrote:
> On Thu, Dec 09, 2010 at 05:09:11PM +1100, Dave Chinner wrote:
> > On Sat, Nov 27, 2010 at 08:44:41PM +1100, Nick Piggin wrote:
> > > Add a new lock, dcache_hash_lock, to protect the dcache hash table from
> > > concurrent modification. d_hash is also protected by d_lock.
> > > 
> > > Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> > > ---
> > >  fs/dcache.c            |   38 +++++++++++++++++++++++++++-----------
> > >  include/linux/dcache.h |    3 +++
> > >  2 files changed, 30 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/fs/dcache.c b/fs/dcache.c
> > > index 4f9ccbe..50c65c7 100644
> > > --- a/fs/dcache.c
> > > +++ b/fs/dcache.c
> > > @@ -35,12 +35,27 @@
> > >  #include <linux/hardirq.h>
> > >  #include "internal.h"
> > >  
> > > +/*
> > > + * Usage:
> > > + * dcache_hash_lock protects dcache hash table
> > > + *
> > > + * Ordering:
> > > + * dcache_lock
> > > + *   dentry->d_lock
> > > + *     dcache_hash_lock
> > > + *
> > 
> > What locking is used to keep DCACHE_UNHASHED/d_unhashed() in check
> > > with whether the dentry is on the hash list or not? It looks to
> > me that to make any hash modification, you have to hold both the
> > dentry->d_lock and the dcache_hash_lock to keep them in step. If
> > this is correct, can you add this to the comments above?
> 
> No, dcache_lock still does that. d_unhashed is protected with
> d_lock a few patches later, which adds to the comment.

d_unhashed() is just a flag bit in ->d_flags, which is apparently
protected by ->d_lock in the next patch, the LRU one. I say apparently
because that patch doesn't appear to protect anything to do with
d_flags. I'm struggling to work out what is being protected in each
patch because it is not clear exactly what is being done.

It makes much more sense to me to start by making all access to
d_flags atomic (i.e. protected by ->d_lock) in a separate patch than
to do it hodge-podge across multiple patches as you currently are.
It's hard to follow when things that are intimately related are
separated by multiple patches doing different things...


> > > + * if (dentry1 < dentry2)
> > > + *   dentry1->d_lock
> > > + *     dentry2->d_lock
> > > + */
> > 
> > Perhaps the places where we need to lock two dentries should use a
> > wrapper like we do for other objects. Such as:
> > 
> > void dentry_dlock_two(struct dentry *d1, struct dentry *d2)
> > {
> > 	if (d1 < d2) {
> > 		spin_lock(&d1->d_lock);
> > 		spin_lock_nested(&d2->d_lock, DENTRY_D_LOCK_NESTED);
> > 	} else {
> > 		spin_lock(&d2->d_lock);
> > 		spin_lock_nested(&d1->d_lock, DENTRY_D_LOCK_NESTED);
> > 	}
> > }
> 
> It only happens once in rename, so I don't think it's useful.

It is self documenting code, which does have value...

> Nothing outside core code should be locking 2 unrelated dentries.

So it is static.

> 
>  
> > > @@ -1581,7 +1598,9 @@ void d_rehash(struct dentry * entry)
> > >  {
> > >  	spin_lock(&dcache_lock);
> > >  	spin_lock(&entry->d_lock);
> > > +	spin_lock(&dcache_hash_lock);
> > >  	_d_rehash(entry);
> > > +	spin_unlock(&dcache_hash_lock);
> > >  	spin_unlock(&entry->d_lock);
> > >  	spin_unlock(&dcache_lock);
> > >  }
> > 
> > Shouldn't we really kill _d_rehash() by replacing all the callers
> > with direct calls to __d_rehash() first? There doesn't seem to be much
> > sense to keep both methods around....
> 
> No. Several filesystems are using it, and it's an exported symbol.

That's __d_rehash().

_d_rehash() (single underscore) is static and only called by
d_rehash() and d_materialise_unique(), and is one line of code
calling __d_rehash(). Kill it, please.

> I'm
> focusing on changes to locking, and keeping APIs the same, where
> possible. I don't want just more and more dependencies on pushing
> through filesystem changes before this series.
> 
> Like I said, there are infinite cleanups or improvements you can make.
> It does not particularly matter that they happen before or after the
> scaling work, except if there are classes of APIs that the new locking
> model can no longer support.

We do plenty of cleanups when changing code when the result gives us
simpler and easier to understand code. It's a trivial change that,
IMO, makes the code more consistent and easier to follow.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 12/46] fs: dcache scale lru
  2010-12-09  7:22   ` Dave Chinner
@ 2010-12-09 12:34     ` Nick Piggin
  0 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-12-09 12:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

Damn, forgot cc, sorry.

On Thu, Dec 9, 2010 at 6:22 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Nov 27, 2010 at 08:44:42PM +1100, Nick Piggin wrote:
>> Add a new lock, dcache_lru_lock, to protect the dcache LRU list from concurrent
>> modification. d_lru is also protected by d_lock.
>>
>> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
>> ---
>>  fs/dcache.c |  112 ++++++++++++++++++++++++++++++++++++++++++++---------------
>>  1 files changed, 84 insertions(+), 28 deletions(-)
>>
>> diff --git a/fs/dcache.c b/fs/dcache.c
>> index 50c65c7..aa410b6 100644
>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -37,11 +37,19 @@
>>
>>  /*
>>   * Usage:
>> - * dcache_hash_lock protects dcache hash table
>> + * dcache_hash_lock protects:
>> + *   - the dcache hash table
>> + * dcache_lru_lock protects:
>> + *   - the dcache lru lists and counters
>> + * d_lock protects:
>> + *   - d_flags
>
> Which bit of d_flags does it protect?

All of them.

> Why in this patch and not in
> the hash scaling patch which needs DCACHE_UNHASHED to be in sync with
> whether it is in the hash list?

It's unrelated to this patch (same as d_name, below).

>> + *   - d_name
>> + *   - d_lru
>
> Why is the d_lock required to protect d_lru? I can't see any reason
> in this patch that requires d_lock to protect anything that
> dcache_lru_lock does not protect....

It's used in future patches.


>>  /**
>> @@ -186,6 +206,8 @@ static void dentry_lru_move_tail(struct dentry *dentry)
>>   * The dentry must already be unhashed and removed from the LRU.
>>   *
>>   * If this is the root of the dentry tree, return NULL.
>> + *
>> + * dcache_lock and d_lock must be held by caller, are dropped by d_kill.
>>   */
>>  static struct dentry *d_kill(struct dentry *dentry)
>>       __releases(dentry->d_lock)
>> @@ -341,10 +363,19 @@ int d_invalidate(struct dentry * dentry)
>>  EXPORT_SYMBOL(d_invalidate);
>>
>>  /* This should be called _only_ with dcache_lock held */
>> +static inline struct dentry * __dget_locked_dlock(struct dentry *dentry)
>> +{
>> +     atomic_inc(&dentry->d_count);
>> +     dentry_lru_del(dentry);
>> +     return dentry;
>> +}
>> +
>>  static inline struct dentry * __dget_locked(struct dentry *dentry)
>>  {
>>       atomic_inc(&dentry->d_count);
>> +     spin_lock(&dentry->d_lock);
>>       dentry_lru_del(dentry);
>> +     spin_unlock(&dentry->d_lock);
>>       return dentry;
>>  }
>
> Why do we need to call dentry_lru_del() in __dget_* functions? The
> shrinkers already handle lazy deletion just fine, and this just
> seems like a method for bouncing the dcache_lru_lock around.

This is how the code works before the locking series. The intention
is that places which already hold dcache_lock for other reasons can
take the dentry off the list cheaply.

Where possible, I am avoiding making functional changes along with
locking changes (like I have said many times). After dcache_lock goes
away, there are patches to make it more lazy.

>
> The new lazy inode lru code which was modelled on the dcache code
> does not remove inodes from the LRU when taking new references to an
> inode and it seems to function just fine,

I know, didn't I implement it? :)

> so I'm thinking that we
> should be making the dcache LRU even more lazy by removing these
> calls to dentry_lru_del() here.

Not in this patch, but later yes.

>
> I think this is important as lockstat is telling me
> the dcache_lru_lock is the most heavily contended lock in the system
> in my testing....
>
>> @@ -474,21 +505,31 @@ static void shrink_dentry_list(struct list_head *list)
>>
>>       while (!list_empty(list)) {
>>               dentry = list_entry(list->prev, struct dentry, d_lru);
>> -             dentry_lru_del(dentry);
>> +
>> +             if (!spin_trylock(&dentry->d_lock)) {
>> +                     spin_unlock(&dcache_lru_lock);
>> +                     cpu_relax();
>> +                     spin_lock(&dcache_lru_lock);
>> +                     continue;
>> +             }
>
> Wouldn't it be better to move the entry to the tail of the list
> here so we don't get stuck spinning on the same dentry for a length
> of time?

Again, I'm avoiding functional changes in this part of the series. It
makes things slightly more clunky in intermediate stages, but makes
verification and bisecting much easier.

This particular one gets done completely without the lru lock in a future patch.
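
The trylock-and-back-off loop in the quoted hunks can be sketched in user
space as follows. This is only an illustration, assuming the per-object
lock nests outside the LRU lock (as the quoted __dget_locked hunk
suggests); lru_lock, struct item, lock_item_from_lru() and prune_one()
are hypothetical stand-ins, and sched_yield() stands in for cpu_relax().

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical stand-ins: lru_lock models dcache_lru_lock,
 * item->lock models dentry->d_lock. */
struct item {
	pthread_mutex_t lock;
	int on_lru;
};

pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Assumed lock order: item->lock first, then lru_lock.  A scanner that
 * already holds lru_lock is taking the locks in the opposite order, so
 * it may only *try* the item lock.  Returns 1 with both locks held, or
 * 0 after dropping and re-taking lru_lock so the caller can retry.
 */
int lock_item_from_lru(struct item *it)
{
	if (pthread_mutex_trylock(&it->lock) == 0)
		return 1;
	pthread_mutex_unlock(&lru_lock);
	sched_yield();			/* give the lock holder a chance to run */
	pthread_mutex_lock(&lru_lock);
	return 0;
}

/* Take one item off the LRU; returns how many retries were needed. */
int prune_one(struct item *it)
{
	int retries = 0;

	pthread_mutex_lock(&lru_lock);
	while (!lock_item_from_lru(it))
		retries++;
	assert(it->on_lru);
	it->on_lru = 0;
	pthread_mutex_unlock(&it->lock);
	pthread_mutex_unlock(&lru_lock);
	return retries;
}
```

Dropping the LRU lock before retrying is what lets the holder of the
item lock take the LRU lock and make progress. The cost is that the
scanner may spin on the same entry, which is the concern raised in the
review comment above.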


>> @@ -509,32 +550,36 @@ static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
>>       int cnt = *count;
>>
>>       spin_lock(&dcache_lock);
>> +relock:
>> +     spin_lock(&dcache_lru_lock);
>>       while (!list_empty(&sb->s_dentry_lru)) {
>>               dentry = list_entry(sb->s_dentry_lru.prev,
>>                               struct dentry, d_lru);
>>               BUG_ON(dentry->d_sb != sb);
>>
>> +             if (!spin_trylock(&dentry->d_lock)) {
>> +                     spin_unlock(&dcache_lru_lock);
>> +                     cpu_relax();
>> +                     goto relock;
>> +             }
>
> Same again - if the dentry is locked, then it is likely to be
> newly referenced so move it to the tail of the list and continue....

Not this patch.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 11/46] fs: dcache scale hash
  2010-12-09  8:17       ` Dave Chinner
@ 2010-12-09 12:53         ` Nick Piggin
  2010-12-09 23:42           ` Dave Chinner
  0 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2010-12-09 12:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Thu, Dec 9, 2010 at 7:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Dec 09, 2010 at 05:28:01PM +1100, Nick Piggin wrote:
>> On Thu, Dec 09, 2010 at 05:09:11PM +1100, Dave Chinner wrote:
>> > On Sat, Nov 27, 2010 at 08:44:41PM +1100, Nick Piggin wrote:
>> > > Add a new lock, dcache_hash_lock, to protect the dcache hash table from
>> > > concurrent modification. d_hash is also protected by d_lock.
>> > >
>> > > Signed-off-by: Nick Piggin <npiggin@kernel.dk>
>> > > ---
>> > >  fs/dcache.c            |   38 +++++++++++++++++++++++++++-----------
>> > >  include/linux/dcache.h |    3 +++
>> > >  2 files changed, 30 insertions(+), 11 deletions(-)
>> > >
>> > > diff --git a/fs/dcache.c b/fs/dcache.c
>> > > index 4f9ccbe..50c65c7 100644
>> > > --- a/fs/dcache.c
>> > > +++ b/fs/dcache.c
>> > > @@ -35,12 +35,27 @@
>> > >  #include <linux/hardirq.h>
>> > >  #include "internal.h"
>> > >
>> > > +/*
>> > > + * Usage:
>> > > + * dcache_hash_lock protects dcache hash table
>> > > + *
>> > > + * Ordering:
>> > > + * dcache_lock
>> > > + *   dentry->d_lock
>> > > + *     dcache_hash_lock
>> > > + *
>> >
>> > What locking is used to keep DCACHE_UNHASHED/d_unhashed() in check
>> > with the whether the dentry is on the hash list or not? It looks to
>> > me that to make any hash modification, you have to hold both the
>> > dentry->d_lock and the dcache_hash_lock to keep them in step. If
>> > this is correct, can you add this to the comments above?
>>
>> No, dcache_lock still does that. d_unhashed is protected with
>> d_lock a few patches later, which adds to the comment.
>
> d_unhashed() is just a flag bit in ->d_flags, which is apparently
> protected by ->d_lock in the next (LRU) patch.

No, d_flags in the upstream kernel is protected by d_lock (and also you
can't protect one bit in a word with a lock but not another).
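
The point about bits sharing a word can be shown with a hand-replayed
interleaving. This is a single-threaded sketch, not kernel code: the
flag names and values are hypothetical, and the "threads" A and B are
just the order of the statements.

```c
#include <assert.h>

#define FLAG_UNHASHED   0x1u	/* hypothetical bit values */
#define FLAG_REFERENCED 0x2u

/*
 * Writer A updates its bit without taking the lock; writer B updates a
 * different bit "under the lock".  Setting a bit is a load, a modify
 * and a store of the whole word, so B's update can land between A's
 * load and A's store and be overwritten.
 */
unsigned int lost_update_demo(void)
{
	unsigned int flags = 0;
	unsigned int a_tmp;

	a_tmp = flags;			/* A: load, no lock held */
	flags |= FLAG_REFERENCED;	/* B: complete locked update */
	a_tmp |= FLAG_UNHASHED;		/* A: modify its stale copy */
	flags = a_tmp;			/* A: store, wiping out B's bit */
	return flags;
}

/* Same two updates, but each one runs to completion before the next. */
unsigned int serialized_demo(void)
{
	unsigned int flags = 0;

	flags |= FLAG_UNHASHED;		/* A: whole update under the lock */
	flags |= FLAG_REFERENCED;	/* B: whole update under the lock */
	assert(flags == (FLAG_UNHASHED | FLAG_REFERENCED));
	return flags;
}
```

In the first function only A's bit survives, even though B followed the
locking rule. That is why the lock has to cover every writer of the
word, not just the writers of one bit.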


> I say apparently
> because that patch doesn't appear to protect anything to do with
> d_flags. I'm struggling to work out what is being protected in each
> patch because it is not clear exactly what is being done.

Well, hopefully the code and changelog are clearer. The lockorder
doc changes probably aren't a good source for a running commentary.
Not that it's wrong, it just may have a few nits like this where an existing
lock order or role is commented in another patch. Meh.


> It makes much more sense to me to start by making all access to
> d_flags atomic (i.e. protected by ->d_lock) in a separate patch than
> to do it hodge-podge across multiple patches as you currently are.
> It's hard to follow when things that are intimately related are
> separated by multiple patches doing different things...

It might be easier to follow if you know d_flags is already protected
by d_lock.

The d_unhashed patch uses d_lock to keep _both_ the d_flags and
the hash list membership status in sync. It does not make the d_flags
itself atomic.
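
Keeping a flag and list membership in step under one lock could look
roughly like this in user space. It is a sketch under assumptions, not
the code from the series: struct node, N_UNHASHED, hash_lock and the
helper names are all hypothetical, and only the lock order (per-object
lock, then hash lock) is taken from the comment quoted in this thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N_UNHASHED 0x1u		/* hypothetical flag, models DCACHE_UNHASHED */

struct node {
	pthread_mutex_t lock;	/* models dentry->d_lock */
	unsigned int flags;
	struct node *next;	/* hash chain link */
};

pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;	/* models dcache_hash_lock */
struct node *chain;		/* a single hash bucket */

void node_init(struct node *n)
{
	pthread_mutex_init(&n->lock, NULL);
	n->flags = N_UNHASHED;
	n->next = NULL;
}

/* The flag and the chain both change while n->lock is held. */
void node_hash(struct node *n)
{
	pthread_mutex_lock(&n->lock);
	if (n->flags & N_UNHASHED) {
		assert(n->next == NULL);
		pthread_mutex_lock(&hash_lock);
		n->next = chain;
		chain = n;
		pthread_mutex_unlock(&hash_lock);
		n->flags &= ~N_UNHASHED;
	}
	pthread_mutex_unlock(&n->lock);
}

void node_unhash(struct node *n)
{
	struct node **pp;

	pthread_mutex_lock(&n->lock);
	if (!(n->flags & N_UNHASHED)) {
		pthread_mutex_lock(&hash_lock);
		for (pp = &chain; *pp; pp = &(*pp)->next) {
			if (*pp == n) {
				*pp = n->next;
				break;
			}
		}
		pthread_mutex_unlock(&hash_lock);
		n->next = NULL;
		n->flags |= N_UNHASHED;
	}
	pthread_mutex_unlock(&n->lock);
}

/* Walk the bucket under hash_lock and report whether n is linked in. */
int node_on_chain(struct node *n)
{
	struct node *p;
	int found = 0;

	pthread_mutex_lock(&hash_lock);
	for (p = chain; p; p = p->next)
		if (p == n)
			found = 1;
	pthread_mutex_unlock(&hash_lock);
	return found;
}
```

Anyone holding n->lock sees the flag and the chain membership agree.
The hash lock only protects the chain itself, which is shared between
nodes.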


>> > > + * if (dentry1 < dentry2)
>> > > + *   dentry1->d_lock
>> > > + *     dentry2->d_lock
>> > > + */
>> >
>> > Perhaps the places where we need to lock two dentries should use a
>> > wrapper like we do for other objects. Such as:
>> >
>> > void dentry_dlock_two(struct dentry *d1, struct dentry *d2)
>> > {
>> >     if (d1 < d2) {
>> >             spin_lock(&d1->d_lock);
>> >             spin_lock_nested(&d2->d_lock, DENTRY_D_LOCK_NESTED);
>> >     } else {
>> >             spin_lock(&d2->d_lock);
>> >             spin_lock_nested(&d1->d_lock, DENTRY_D_LOCK_NESTED);
>> >     }
>> > }
>>
>> It only happens once in rename, so I don't think it's useful.
>
> It is self documenting code, which does have value...
>
>> Nothing outside core code should be locking 2 unrelated dentries.
>
> So it is static.

Anyway, cleanup. It can equally well be done before or after, and seeing as
we're being nice and not breaking my tree with minutiae, can you base it on
top please?
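
For reference, the address-ordered locking rule being discussed (lock
the object at the lower address first) can be sketched in user space
like this. struct obj, lock_two() and unlock_two() are hypothetical
names. pthreads has no counterpart to spin_lock_nested(), which is a
lockdep annotation, so the sketch leaves it out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct obj {
	pthread_mutex_t lock;
};

/*
 * Lock two distinct objects in address order, so that two threads
 * locking the same pair always take the locks in the same sequence and
 * cannot deadlock against each other.  Returns the object that was
 * locked first.
 */
struct obj *lock_two(struct obj *a, struct obj *b)
{
	/* Compare as integers: relational comparison of pointers into
	 * unrelated objects is not defined by the C standard. */
	struct obj *first = (uintptr_t)a < (uintptr_t)b ? a : b;
	struct obj *second = (first == a) ? b : a;

	assert(a != b);
	pthread_mutex_lock(&first->lock);
	pthread_mutex_lock(&second->lock);
	return first;
}

/* Unlock order does not matter for correctness. */
void unlock_two(struct obj *a, struct obj *b)
{
	pthread_mutex_unlock(&a->lock);
	pthread_mutex_unlock(&b->lock);
}
```

Calling lock_two(x, y) and lock_two(y, x) acquires the locks in the
same order, which is the property the ordering comment in the patch
relies on.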


>> > > @@ -1581,7 +1598,9 @@ void d_rehash(struct dentry * entry)
>> > >  {
>> > >   spin_lock(&dcache_lock);
>> > >   spin_lock(&entry->d_lock);
>> > > + spin_lock(&dcache_hash_lock);
>> > >   _d_rehash(entry);
>> > > + spin_unlock(&dcache_hash_lock);
>> > >   spin_unlock(&entry->d_lock);
>> > >   spin_unlock(&dcache_lock);
>> > >  }
>> >
>> > Shouldn't we really kill _d_rehash() by replacing all the callers
>> > with direct calls to __d_rehash() first? There doesn't seem to be much
>> > sense to keep both methods around....
>>
>> No. Several filesystems are using it, and it's an exported symbol.
>
> That's __d_rehash().

Oh I beg your pardon.


> _d_rehash() (single underscore) is static and only called by
> d_rehash() and d_materialise_unique(), and is one line of code
> calling __d_rehash(). Kill it, please.

That doesn't belong in this patch either, it's shuffling. Anyway I like
_d_rehash, so I'm certainly not going to send a patch to kill it.


>> I'm
>> focusing on changes to locking, and keeping APIs the same, where
>> possible. I don't want just more and more dependencies on pushing
>> through filesystem changes before this series.
>>
>> Like I said, there are infinite cleanups or improvements you can make.
>> It does not particularly matter that they happen before or after the
>> scaling work, except if there are classes of APIs that the new locking
>> model can no longer support.
>
> We do plenty of cleanups when changing code when the result gives us
> simpler and easier to understand code. It's a trivial change that,
> IMO, makes the code more consistent and easier to follow.

Unrelated "cleanups" in the same patch as non trivial locking change
is stupid.

Necessary changes to prevent bad ugliness resulting, or preventing
repeated steps for the particular changes, etc. of course. Killing
unrelated functions, no.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 11/46] fs: dcache scale hash
  2010-12-09 12:53         ` Nick Piggin
@ 2010-12-09 23:42           ` Dave Chinner
  2010-12-10  2:35             ` Nick Piggin
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2010-12-09 23:42 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Thu, Dec 09, 2010 at 11:53:27PM +1100, Nick Piggin wrote:
> >> Like I said, there are infinite cleanups or improvements you can make.
> >> It does not particularly matter that they happen before or after the
> >> scaling work, except if there are classes of APIs that the new locking
> >> model can no longer support.
> >
> > We do plenty of cleanups when changing code when the result gives us
> > simpler and easier to understand code. It's a trivial change that,
> > IMO, makes the code more consistent and easier to follow.
> 
> Unrelated "cleanups" in the same patch as non trivial locking change
> is stupid.

So put it in another preparatory patch. It makes the locking changes
easier to understand...

> Necessary changes to prevent bad ugliness resulting, or preventing
> repeated steps for the particular changes, etc. of course. Killing
> unrelated functions, no.

Ok, I get the picture. You don't want a code review, you want a
rubber stamp. Find someone else to get it from.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 11/46] fs: dcache scale hash
  2010-12-09 23:42           ` Dave Chinner
@ 2010-12-10  2:35             ` Nick Piggin
  2010-12-10  9:01               ` Dave Chinner
  0 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2010-12-10  2:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, Nick Piggin, linux-fsdevel, linux-kernel

On Fri, Dec 10, 2010 at 10:42:58AM +1100, Dave Chinner wrote:
> On Thu, Dec 09, 2010 at 11:53:27PM +1100, Nick Piggin wrote:
> > >> Like I said, there are infinite cleanups or improvements you can make.
> > >> It does not particularly matter that they happen before or after the
> > >> scaling work, except if there are classes of APIs that the new locking
> > >> model can no longer support.
> > >
> > > We do plenty of cleanups when changing code when the result gives us
> > > simpler and easier to understand code. It's a trivial change that,
> > > IMO, makes the code more consistent and easier to follow.
> > 
> > Unrelated "cleanups" in the same patch as non trivial locking change
> > is stupid.
> 
> So put it in another preparatory patch. It makes the locking changes
> easier to understand...

I didn't change that, though, the ordering of locking unrelated
dentries and the code is already in rename code and is not touched
during this patch set.


> > Necessary changes to prevent bad ugliness resulting, or preventing
> > repeated steps for the particular changes, etc. of course. Killing
> > unrelated functions, no.
> 
> Ok, I get the picture. You don't want a code review, you want a
> rubber stamp. Find someone else to get it from.

Of course I want code review. I am not going to just do everything
you say that I don't agree with, but I will explain why every time
(as I have done to all your points).

I would prefer more in-depth review than from someone who doesn't know
d_lock protects d_flags, but any and all help is welcome. Even minor
nitpicking or cleanups are welcome if they are relevant to the patches.

Thanks,
Nick

PS. don't accuse me of not wanting a code review, because you're just
projecting. To paraphrase you:

 I don't have to justify myself to you, nick, only the maintainers, so
 I'm not answering.

In response to my questions.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 11/46] fs: dcache scale hash
  2010-12-10  2:35             ` Nick Piggin
@ 2010-12-10  9:01               ` Dave Chinner
  2010-12-13  4:48                 ` Nick Piggin
  2010-12-13  5:05                 ` Nick Piggin
  0 siblings, 2 replies; 107+ messages in thread
From: Dave Chinner @ 2010-12-10  9:01 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Fri, Dec 10, 2010 at 01:35:20PM +1100, Nick Piggin wrote:
> On Fri, Dec 10, 2010 at 10:42:58AM +1100, Dave Chinner wrote:
> > On Thu, Dec 09, 2010 at 11:53:27PM +1100, Nick Piggin wrote:
> > > Necessary changes to prevent bad ugliness resulting, or preventing
> > > repeated steps for the particular changes, etc. of course. Killing
> > > unrelated functions, no.
> > 
> > Ok, I get the picture. You don't want a code review, you want a
> > rubber stamp. Find someone else to get it from.
> 
> Of course I want code review. I am not going to just do everything
> you say that I don't agree with, but I will explain why every time
> (as I have done to all your points).

Which generally comes down to "I disagree with you". That's hard to
argue against because you aren't willing to compromise.

So, to address your next comment, I'll restate what I was proposing.
That is, to ensure all the d_flags accesses are protected by d_lock as
an initial patch rather than cleaning it up in an ad-hoc fashion
later on, such as this later patch in your series:

[PATCH 14/46] fs: dcache scale d_unhashed

which has the description:

	Protect d_unhashed(dentry) condition with d_lock.

which illustrates my point that not all accesses to d_flags are
currently protected by d_lock as you are asserting. Hence:

> I would prefer more in-depth review than from someone who doesn't know
> d_lock protects d_flags,

Your implication about my competence is incorrect and entirely
inappropriate.  Ad hominem attacks don't improve your argument or
encourage other people to review your code.

> but any and all help is welcome. Even minor
> nitpicking or cleanups are welcome if they are relevant to the patches.

If _you_ decide they are relevant.

Nick, in the past couple of months you've burnt everyone who has
tried to review your changes in any meaningful way. Nobody wants to
engage with you because you've aggressively disagreed with every
significant change that has been requested. You have shown no desire
to compromise, instead you argue that you are right until you've had
the last word, and you have frequently resorted to condescending and
disrespectful attacks on reviewers. You would do well to keep that
in mind next time you wonder why nobody is stepping up to review
your code.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-08  7:09           ` Nick Piggin
  (?)
@ 2010-12-10 20:32           ` Paul E. McKenney
  2010-12-12 14:54               ` Paul E. McKenney
  -1 siblings, 1 reply; 107+ messages in thread
From: Paul E. McKenney @ 2010-12-10 20:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 08, 2010 at 06:09:09PM +1100, Nick Piggin wrote:
> On Wed, Dec 08, 2010 at 03:28:16PM +1100, Dave Chinner wrote:
> > On Wed, Dec 08, 2010 at 02:32:12PM +1100, Dave Chinner wrote:
> > > On Wed, Dec 08, 2010 at 12:47:42PM +1100, Nick Piggin wrote:
> > > > On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > > > On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> > > > >>
> > > > >> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> > > > >>
> > > > >> Here is a new set of vfs patches for review, not that there was much interest
> > > > >> last time they were posted. It is structured like:
> > > > >>
> > > > >> * preparation patches
> > > > >> * introduce new locks to take over dcache_lock, then remove it
> > > > >> * cleaning up and reworking things for new locks
> > > > >> * rcu-walk path walking
> > > > >> * start on some fine grained locking steps
> > > > >
> > > > > Stress test doing:
> > > > >
> > > > >        single thread 50M inode create
> > > > >        single thread rm -rf
> > > > >        2-way 50M inode create
> > > > >        2-way rm -rf
> > > > >        4-way 50M inode create
> > > > >        4-way rm -rf
> > > > >        8-way 50M inode create
> > > > >        8-way rm -rf
> > > > >        8-way 250M inode create
> > > > >        8-way rm -rf
> > > > >
> > > > > Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> > > > > with a CPU stuck spinning on here:
> > > > >
> > > > > [37372.084012] NMI backtrace for cpu 5
> > > > > [37372.084012] CPU 5
> > > > > [37372.084012] Modules linked in:
> > > > > [37372.084012]
> > > > > [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> > > > > [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> > > > > [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> > > > > [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> > > > > [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> > > > > [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> > > > > [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> > > > > [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> > > > > [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> > > > > [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > > [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> > > > > [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > > [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> > > > > [37372.084012] Stack:
> > > > > [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> > > > > [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> > > > > [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> > > > > [37372.084012] Call Trace:
> > > > > [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> > > > > [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> > > > > [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> > > > > [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> > > > > [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> > > > > [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> > > > > [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> > > > > [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> > > > > [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> > > > > [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> > > > > [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> > > > 
> > > > OK good, with any luck, that's the same bug.
> > > > 
> > > > Is this XFS?
> > > 
> > > Yes.
> > > 
> > > > Is there any concurrent activity happening on the same dentries?
> > > 
> > > Not from an application perspective.
> > > 
> > > > Ie. are the rm -rf threads running on the same directories,
> > > 
> > > No, each thread operating on a different directory.
> 
> This is probably fixed by the same patch as the lockdep splat trace.
> 
> 
> > > > or is there any reclaim happening in the background?
> > > 
> > > IIRC, kswapd was consuming about 5-10% of a CPU during parallel
> > > unlink tests. Mainly reclaiming XFS inodes, I think, but there may
> > > be dentry cache reclaim going as well.
> > 
> > Turns out that the kswapd peaks are upwards of 50% of a CPU for a
> > few seconds, then idle for 10-15s. Typical perf top output of kswapd
> > while it is active during unlinks is:
> > 
> >              samples  pcnt function                    DSO
> >              _______ _____ ___________________________ _________________
> > 
> >             17168.00 10.2% __call_rcu                  [kernel.kallsyms]
> >             13223.00  7.8% kmem_cache_free             [kernel.kallsyms]
> >             12917.00  7.6% down_write                  [kernel.kallsyms]
> >             12665.00  7.5% xfs_iunlock                 [kernel.kallsyms]
> >             10493.00  6.2% xfs_reclaim_inode_grab      [kernel.kallsyms]
> >              9314.00  5.5% __lookup_tag                [kernel.kallsyms]
> >              9040.00  5.4% radix_tree_delete           [kernel.kallsyms]
> >              8694.00  5.1% is_bad_inode                [kernel.kallsyms]
> >              7639.00  4.5% __ticket_spin_lock          [kernel.kallsyms]
> >              6821.00  4.0% _raw_spin_unlock_irqrestore [kernel.kallsyms]
> >              5484.00  3.2% __d_drop                    [kernel.kallsyms]
> >              5114.00  3.0% xfs_reclaim_inode           [kernel.kallsyms]
> >              4626.00  2.7% __rcu_process_callbacks     [kernel.kallsyms]
> >              3556.00  2.1% up_write                    [kernel.kallsyms]
> >              3206.00  1.9% _cond_resched               [kernel.kallsyms]
> >              3129.00  1.9% xfs_qm_dqdetach             [kernel.kallsyms]
> >              2327.00  1.4% radix_tree_tag_clear        [kernel.kallsyms]
> >              2327.00  1.4% call_rcu_sched              [kernel.kallsyms]
> >              2262.00  1.3% __ticket_spin_unlock        [kernel.kallsyms]
> >              2215.00  1.3% xfs_ilock                   [kernel.kallsyms]
> >              2200.00  1.3% radix_tree_gang_lookup_tag  [kernel.kallsyms]
> >              1982.00  1.2% xfs_reclaim_inodes_ag       [kernel.kallsyms]
> >              1736.00  1.0% xfs_trans_unlocked_item     [kernel.kallsyms]
> >              1707.00  1.0% __ticket_spin_trylock       [kernel.kallsyms]
> >              1688.00  1.0% xfs_perag_get_tag           [kernel.kallsyms]
> >              1660.00  1.0% flat_send_IPI_mask          [kernel.kallsyms]
> >              1538.00  0.9% xfs_inode_item_destroy      [kernel.kallsyms]
> >              1312.00  0.8% __shrink_dcache_sb          [kernel.kallsyms]
> >               940.00  0.6% xfs_perag_put               [kernel.kallsyms]
> > 
> > So there is some dentry cache reclaim going on. 
> > 
> > FWIW, it appears there is quite a lot of RCU freeing overhead (~15%
> > more CPU time) in the work kswapd is doing during these unlinks, too.
> > I just had a look at kswapd when a 8-way create is running - it's running at
> > 50-60% of a cpu for seconds at a time. I caught this while it was doing pure
> > XFS inode cache reclaim (~10s sample, kswapd reclaimed ~1M inodes):
> > 
> >              samples  pcnt function                    DSO
> >              _______ _____ ___________________________ _________________
> > 
> >             27171.00  9.0% __call_rcu                  [kernel.kallsyms]
> >             21491.00  7.1% down_write                  [kernel.kallsyms]
> >             20916.00  6.9% xfs_reclaim_inode           [kernel.kallsyms]
> >             20313.00  6.7% radix_tree_delete           [kernel.kallsyms]
> >             15828.00  5.3% kmem_cache_free             [kernel.kallsyms]
> >             15819.00  5.2% xfs_idestroy_fork           [kernel.kallsyms]
> >             14893.00  4.9% is_bad_inode                [kernel.kallsyms]
> >             14666.00  4.9% _raw_spin_unlock_irqrestore [kernel.kallsyms]
> >             14191.00  4.7% xfs_reclaim_inode_grab      [kernel.kallsyms]
> >             14105.00  4.7% xfs_iunlock                 [kernel.kallsyms]
> >             10916.00  3.6% __ticket_spin_lock          [kernel.kallsyms]
> >             10125.00  3.4% xfs_iflush_cluster          [kernel.kallsyms]
> >              8221.00  2.7% xfs_qm_dqdetach             [kernel.kallsyms]
> >              7639.00  2.5% xfs_trans_unlocked_item     [kernel.kallsyms]
> >              7028.00  2.3% xfs_synchronize_times       [kernel.kallsyms]
> >              6974.00  2.3% up_write                    [kernel.kallsyms]
> >              5870.00  1.9% call_rcu_sched              [kernel.kallsyms]
> >              5634.00  1.9% _cond_resched               [kernel.kallsyms]
> > 
> > Which is showing a similar amount of RCU overhead to the unlink profile above.
> > And this while it was doing dentry cache reclaim (~10s sample):
> > 
> >             35921.00 15.7% __d_drop                      [kernel.kallsyms]
> >             30056.00 13.1% __ticket_spin_trylock         [kernel.kallsyms]
> >             29066.00 12.7% __ticket_spin_lock            [kernel.kallsyms]
> >             19043.00  8.3% __call_rcu                    [kernel.kallsyms]
> >             10098.00  4.4% iput                          [kernel.kallsyms]
> >              7013.00  3.1% __shrink_dcache_sb            [kernel.kallsyms]
> >              6774.00  3.0% __percpu_counter_add          [kernel.kallsyms]
> >              6708.00  2.9% radix_tree_tag_set            [kernel.kallsyms]
> >              5362.00  2.3% xfs_inactive                  [kernel.kallsyms]
> >              5130.00  2.2% __ticket_spin_unlock          [kernel.kallsyms]
> >              4884.00  2.1% call_rcu_sched                [kernel.kallsyms]
> >              4621.00  2.0% dentry_lru_del                [kernel.kallsyms]
> >              3735.00  1.6% bit_waitqueue                 [kernel.kallsyms]
> >              3727.00  1.6% dentry_iput                   [kernel.kallsyms]
> >              3473.00  1.5% shrink_icache_memory          [kernel.kallsyms]
> >              3279.00  1.4% kfree                         [kernel.kallsyms]
> >              3101.00  1.4% xfs_perag_get                 [kernel.kallsyms]
> >              2516.00  1.1% kmem_cache_free               [kernel.kallsyms]
> >              2272.00  1.0% shrink_dentry_list            [kernel.kallsyms]
> > 
> > I've never really seen any significant dentry cache reclaim overhead
> > in profiles of these workloads before, so this was a bit of a
> > surprise....
> 
> call_rcu shouldn't be doing much, except for disabling irqs and linking
> the object into the list. I have a patch somewhere to reduce the irq
> disable overhead a bit, but it really shouldn't be doing a lot of work.

Could you please enable CONFIG_RCU_TRACE, mount debugfs somewhere, and
look at rcu/rcudata?  There will be a "ql=" number printed for each
CPU, and if that number is too large, __call_rcu() does take what it
considers to be corrective action, which can incur some overhead.

If this is the problem, then increasing the value of the qhimark module
parameter might help.

If this is not the problem, I could make a patch that disables some of
__call_rcu()'s grace-period acceleration code if you are willing to try
it out.

> Sometimes you find that touching the rcu head field needs to get a
> cacheline exclusive, so a bit of work gets transferred there....
> 
> But it may also be something going a bit wrong in RCU. I blew it up
> once already, after the files_lock splitup that enabled all CPUs to
> create and destroy files :)

I would certainly like the opportunity to fix any bugs that might be
in RCU...  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-10 20:32           ` Paul E. McKenney
@ 2010-12-12 14:54               ` Paul E. McKenney
  0 siblings, 0 replies; 107+ messages in thread
From: Paul E. McKenney @ 2010-12-12 14:54 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, Nick Piggin, linux-fsdevel, linux-kernel

On Fri, Dec 10, 2010 at 12:32:57PM -0800, Paul E. McKenney wrote:
> On Wed, Dec 08, 2010 at 06:09:09PM +1100, Nick Piggin wrote:
> > On Wed, Dec 08, 2010 at 03:28:16PM +1100, Dave Chinner wrote:
> > > On Wed, Dec 08, 2010 at 02:32:12PM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 08, 2010 at 12:47:42PM +1100, Nick Piggin wrote:
> > > > > On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > > > > On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> > > > > >>
> > > > > >> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> > > > > >>
> > > > > >> Here is a new set of vfs patches for review, not that there was much interest
> > > > > >> last time they were posted. It is structured like:
> > > > > >>
> > > > > >> * preparation patches
> > > > > >> * introduce new locks to take over dcache_lock, then remove it
> > > > > >> * cleaning up and reworking things for new locks
> > > > > >> * rcu-walk path walking
> > > > > >> * start on some fine grained locking steps
> > > > > >
> > > > > > Stress test doing:
> > > > > >
> > > > > >        single thread 50M inode create
> > > > > >        single thread rm -rf
> > > > > >        2-way 50M inode create
> > > > > >        2-way rm -rf
> > > > > >        4-way 50M inode create
> > > > > >        4-way rm -rf
> > > > > >        8-way 50M inode create
> > > > > >        8-way rm -rf
> > > > > >        8-way 250M inode create
> > > > > >        8-way rm -rf
> > > > > >
> > > > > > Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> > > > > > with a CPU stuck spinning on here:
> > > > > >
> > > > > > [37372.084012] NMI backtrace for cpu 5
> > > > > > [37372.084012] CPU 5
> > > > > > [37372.084012] Modules linked in:
> > > > > > [37372.084012]
> > > > > > [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> > > > > > [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> > > > > > [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> > > > > > [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> > > > > > [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> > > > > > [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> > > > > > [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> > > > > > [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> > > > > > [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> > > > > > [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > > > [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> > > > > > [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > > > [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> > > > > > [37372.084012] Stack:
> > > > > > [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> > > > > > [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> > > > > > [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> > > > > > [37372.084012] Call Trace:
> > > > > > [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> > > > > > [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> > > > > > [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> > > > > > [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> > > > > > [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> > > > > > [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> > > > > > [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> > > > > > [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> > > > > > [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> > > > > > [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> > > > > > [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> > > > > 
> > > > > OK good, with any luck, that's the same bug.
> > > > > 
> > > > > Is this XFS?
> > > > 
> > > > Yes.
> > > > 
> > > > > Is there any concurrent activity happening on the same dentries?
> > > > 
> > > > Not from an application perspective.
> > > > 
> > > > > Ie. are the rm -rf threads running on the same directories,
> > > > 
> > > > No, each thread operating on a different directory.
> > 
> > This is probably fixed by the same patch as the lockdep splat trace.
> > 
> > 
> > > > > or is there any reclaim happening in the background?
> > > > 
> > > > IIRC, kswapd was consuming about 5-10% of a CPU during parallel
> > > > unlink tests. Mainly reclaiming XFS inodes, I think, but there may
> > > > be dentry cache reclaim going as well.
> > > 
> > > Turns out that the kswapd peaks are upwards of 50% of a CPU for a
> > > few seconds, then idle for 10-15s. Typical perf top output of kswapd
> > > while it is active during unlinks is:
> > > 
> > >              samples  pcnt function                    DSO
> > >              _______ _____ ___________________________ _________________
> > > 
> > >             17168.00 10.2% __call_rcu                  [kernel.kallsyms]
> > >             13223.00  7.8% kmem_cache_free             [kernel.kallsyms]
> > >             12917.00  7.6% down_write                  [kernel.kallsyms]
> > >             12665.00  7.5% xfs_iunlock                 [kernel.kallsyms]
> > >             10493.00  6.2% xfs_reclaim_inode_grab      [kernel.kallsyms]
> > >              9314.00  5.5% __lookup_tag                [kernel.kallsyms]
> > >              9040.00  5.4% radix_tree_delete           [kernel.kallsyms]
> > >              8694.00  5.1% is_bad_inode                [kernel.kallsyms]
> > >              7639.00  4.5% __ticket_spin_lock          [kernel.kallsyms]
> > >              6821.00  4.0% _raw_spin_unlock_irqrestore [kernel.kallsyms]
> > >              5484.00  3.2% __d_drop                    [kernel.kallsyms]
> > >              5114.00  3.0% xfs_reclaim_inode           [kernel.kallsyms]
> > >              4626.00  2.7% __rcu_process_callbacks     [kernel.kallsyms]
> > >              3556.00  2.1% up_write                    [kernel.kallsyms]
> > >              3206.00  1.9% _cond_resched               [kernel.kallsyms]
> > >              3129.00  1.9% xfs_qm_dqdetach             [kernel.kallsyms]
> > >              2327.00  1.4% radix_tree_tag_clear        [kernel.kallsyms]
> > >              2327.00  1.4% call_rcu_sched              [kernel.kallsyms]
> > >              2262.00  1.3% __ticket_spin_unlock        [kernel.kallsyms]
> > >              2215.00  1.3% xfs_ilock                   [kernel.kallsyms]
> > >              2200.00  1.3% radix_tree_gang_lookup_tag  [kernel.kallsyms]
> > >              1982.00  1.2% xfs_reclaim_inodes_ag       [kernel.kallsyms]
> > >              1736.00  1.0% xfs_trans_unlocked_item     [kernel.kallsyms]
> > >              1707.00  1.0% __ticket_spin_trylock       [kernel.kallsyms]
> > >              1688.00  1.0% xfs_perag_get_tag           [kernel.kallsyms]
> > >              1660.00  1.0% flat_send_IPI_mask          [kernel.kallsyms]
> > >              1538.00  0.9% xfs_inode_item_destroy      [kernel.kallsyms]
> > >              1312.00  0.8% __shrink_dcache_sb          [kernel.kallsyms]
> > >               940.00  0.6% xfs_perag_put               [kernel.kallsyms]
> > > 
> > > So there is some dentry cache reclaim going on. 
> > > 
> > > FWIW, it appears there is quite a lot of RCU freeing overhead (~15%
> > > more CPU time) in the work kswapd is doing during these unlinks, too.
> > > I just had a look at kswapd when an 8-way create is running - it's running at
> > > 50-60% of a cpu for seconds at a time. I caught this while it was doing pure
> > > XFS inode cache reclaim (~10s sample, kswapd reclaimed ~1M inodes):
> > > 
> > >              samples  pcnt function                    DSO
> > >              _______ _____ ___________________________ _________________
> > > 
> > >             27171.00  9.0% __call_rcu                  [kernel.kallsyms]
> > >             21491.00  7.1% down_write                  [kernel.kallsyms]
> > >             20916.00  6.9% xfs_reclaim_inode           [kernel.kallsyms]
> > >             20313.00  6.7% radix_tree_delete           [kernel.kallsyms]
> > >             15828.00  5.3% kmem_cache_free             [kernel.kallsyms]
> > >             15819.00  5.2% xfs_idestroy_fork           [kernel.kallsyms]
> > >             14893.00  4.9% is_bad_inode                [kernel.kallsyms]
> > >             14666.00  4.9% _raw_spin_unlock_irqrestore [kernel.kallsyms]
> > >             14191.00  4.7% xfs_reclaim_inode_grab      [kernel.kallsyms]
> > >             14105.00  4.7% xfs_iunlock                 [kernel.kallsyms]
> > >             10916.00  3.6% __ticket_spin_lock          [kernel.kallsyms]
> > >             10125.00  3.4% xfs_iflush_cluster          [kernel.kallsyms]
> > >              8221.00  2.7% xfs_qm_dqdetach             [kernel.kallsyms]
> > >              7639.00  2.5% xfs_trans_unlocked_item     [kernel.kallsyms]
> > >              7028.00  2.3% xfs_synchronize_times       [kernel.kallsyms]
> > >              6974.00  2.3% up_write                    [kernel.kallsyms]
> > >              5870.00  1.9% call_rcu_sched              [kernel.kallsyms]
> > >              5634.00  1.9% _cond_resched               [kernel.kallsyms]
> > > 
> > > Which is showing a similar amount of RCU overhead to the unlink profile above.
> > > And this while it was doing dentry cache reclaim (~10s sample):
> > > 
> > >             35921.00 15.7% __d_drop                      [kernel.kallsyms]
> > >             30056.00 13.1% __ticket_spin_trylock         [kernel.kallsyms]
> > >             29066.00 12.7% __ticket_spin_lock            [kernel.kallsyms]
> > >             19043.00  8.3% __call_rcu                    [kernel.kallsyms]
> > >             10098.00  4.4% iput                          [kernel.kallsyms]
> > >              7013.00  3.1% __shrink_dcache_sb            [kernel.kallsyms]
> > >              6774.00  3.0% __percpu_counter_add          [kernel.kallsyms]
> > >              6708.00  2.9% radix_tree_tag_set            [kernel.kallsyms]
> > >              5362.00  2.3% xfs_inactive                  [kernel.kallsyms]
> > >              5130.00  2.2% __ticket_spin_unlock          [kernel.kallsyms]
> > >              4884.00  2.1% call_rcu_sched                [kernel.kallsyms]
> > >              4621.00  2.0% dentry_lru_del                [kernel.kallsyms]
> > >              3735.00  1.6% bit_waitqueue                 [kernel.kallsyms]
> > >              3727.00  1.6% dentry_iput                   [kernel.kallsyms]
> > >              3473.00  1.5% shrink_icache_memory          [kernel.kallsyms]
> > >              3279.00  1.4% kfree                         [kernel.kallsyms]
> > >              3101.00  1.4% xfs_perag_get                 [kernel.kallsyms]
> > >              2516.00  1.1% kmem_cache_free               [kernel.kallsyms]
> > >              2272.00  1.0% shrink_dentry_list            [kernel.kallsyms]
> > > 
> > > I've never really seen any significant dentry cache reclaim overhead
> > > in profiles of these workloads before, so this was a bit of a
> > > surprise....
> > 
> > call_rcu shouldn't be doing much, except for disabling irqs and linking
> > the object into the list. I have a patch somewhere to reduce the irq
> > disable overhead a bit, but it really shouldn't be doing a lot of work.
> 
> Could you please enable CONFIG_RCU_TRACE, mount debugfs somewhere, and
> look at rcu/rcudata?  There will be a "ql=" number printed for each
> CPU, and if that number is too large, __call_rcu() does take what it
> considers to be corrective action, which can incur some overhead.
> 
> If this is the problem, then increasing the value of the qhimark module
> parameter might help.
> 
> If this is not the problem, I could make a patch that disables some of
> __call_rcu()'s grace-period acceleration code if you are willing to try
> it out.

Another thing that might help is to reduce the value of CONFIG_RCU_FANOUT
to something like 16.  If this does help, then there is a reasonably
straightforward change I can make to RCU.

							Thanx, Paul
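
For readers following the "ql=" and qhimark discussion, here is a minimal
userspace sketch of the idea, assuming a single-threaded model. It is not
the kernel's RCU implementation: every toy_* name is hypothetical, the
threshold value is arbitrary, and the "corrective action" is reduced to a
counter. The kernel performs the enqueue with interrupts disabled, which
this model leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* All toy_* names are hypothetical; this is not the kernel's RCU code. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *head);
};

struct toy_rcu_cpu {
	struct toy_rcu_head *head;   /* oldest queued callback */
	struct toy_rcu_head **tail;  /* link slot for the next callback */
	long qlen;                   /* queue length, the "ql=" idea */
	long qhimark;                /* threshold for the slow path */
	long slow_path_hits;         /* how often the threshold was exceeded */
};

static long toy_invoked;             /* callbacks run so far */

/* Example callback: only counts that it ran. */
static void toy_count_cb(struct toy_rcu_head *head)
{
	(void)head;
	toy_invoked++;
}

static void toy_rcu_cpu_init(struct toy_rcu_cpu *c, long qhimark)
{
	assert(qhimark > 0);
	c->head = NULL;
	c->tail = &c->head;
	c->qlen = 0;
	c->qhimark = qhimark;
	c->slow_path_hits = 0;
}

/*
 * Fast path: link the callback at the tail of the per-CPU list.
 * Slow path: once the queue is longer than qhimark, record that extra
 * work would be done (the real corrective action is not modelled).
 */
static void toy_call_rcu(struct toy_rcu_cpu *c, struct toy_rcu_head *h,
			 void (*func)(struct toy_rcu_head *head))
{
	h->func = func;
	h->next = NULL;
	*c->tail = h;
	c->tail = &h->next;
	c->qlen++;
	if (c->qlen > c->qhimark)
		c->slow_path_hits++;
}

/* Run every queued callback, as if a grace period had elapsed. */
static long toy_rcu_invoke_all(struct toy_rcu_cpu *c)
{
	struct toy_rcu_head *h = c->head;
	long n = 0;

	c->head = NULL;
	c->tail = &c->head;
	c->qlen = 0;
	while (h) {
		struct toy_rcu_head *next = h->next;

		h->func(h);
		h = next;
		n++;
	}
	return n;
}
```

In this model a larger qhimark only means the slow path is entered later;
whether that helps the workload above is exactly what the rcu/rcudata
numbers would need to show.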

> > Sometimes you find that touching the rcu head field needs to get a
> > cacheline exclusive, so a bit of work gets transferred there....
> > 
> > But it may also be something going a bit wrong in RCU. I blew it up
> > once already, after the files_lock splitup that enabled all CPUs to
> > create and destroy files :)
> 
> I would certainly like the opportunity to fix any bugs that might be
> in RCU...  ;-)
> 
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 11/46] fs: dcache scale hash
  2010-12-10  9:01               ` Dave Chinner
@ 2010-12-13  4:48                 ` Nick Piggin
  2010-12-13  5:05                 ` Nick Piggin
  1 sibling, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-12-13  4:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, Nick Piggin, linux-fsdevel, linux-kernel

On Fri, Dec 10, 2010 at 08:01:26PM +1100, Dave Chinner wrote:
> On Fri, Dec 10, 2010 at 01:35:20PM +1100, Nick Piggin wrote:
> > On Fri, Dec 10, 2010 at 10:42:58AM +1100, Dave Chinner wrote:
> > > On Thu, Dec 09, 2010 at 11:53:27PM +1100, Nick Piggin wrote:
> > > > Necessary changes to prevent bad ugliness resulting, or preventing
> > > > repeated steps for the particular changes, etc. of course. Killing
> > > > unrelated functions, no.
> > > 
> > > Ok, I get the picture. You don't want a code review, you want a
> > > rubber stamp. Find someone else to get it from.
> > 
> > Of course I want code review. I am not going to just do everything
> > you say that I don't agree with, but I will explain why every time
> > (as I have done to all your points).
> 
> Which generally comes down to "I disagree with you". That's hard to
> argue against because you aren't willing to compromise.

I have been trying to explain my reasoning. For example, on the suggestions
to change _d_rehash and to put by-address ordering of locking 2 dentries
in its own function, I simply said that I don't want to pull in such
changes because they're not related to, or really touched by, the patches.

I think that's reasonable, and so if I have a reasonable objection to a
minor issue, then I think we should get past it.


> So, to address your next comment, I'll restate what I was proposing.
> That is, to ensure all the d_flags accesses protected by d_lock as
> an initial patch rather than cleaning it up in an ad-hoc fashion
> later on, such as this later patch in your series:
> 
> [PATCH 14/46] fs: dcache scale d_unhashed
> 
> which has the description:
> 
> 	Protect d_unhashed(dentry) condition with d_lock.
> 
> which illustrates my point that not all accesses to d_flags are
> currently protected by d_lock as you are asserting. Hence:

It depends what you mean by accesses to d_flags.

No, not all of them are, because there are in fact some cases
where d_flags is read without any locking, when races don't matter
or aren't applicable.

But all writes to d_flags, in code where the dentry is live and
there can be concurrent writes to d_flags *are* protected by d_lock.

d_unhashed() is defined to:
 Returns true if the dentry passed is not currently hashed.

So by what I have called the d_unhashed condition, I mean the combination
of DCACHE_UNHASHED and dentry membership on the hash list.
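
The rule described above (writers update the flag and the hash-list
membership together under d_lock, so the "d_unhashed condition" is one
consistent state) can be sketched in userspace C11 as follows. This is an
illustration under stated assumptions, not the dcache code: struct
toy_dentry, TOY_UNHASHED and the toy_* helpers are hypothetical, a boolean
stands in for real hash-list linkage, and a trivial spinlock stands in for
d_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_UNHASHED 0x1u     /* hypothetical bit standing in for DCACHE_UNHASHED */

struct toy_dentry {
	atomic_bool lock;     /* stands in for d_lock */
	unsigned int flags;   /* stands in for d_flags */
	bool on_hash_list;    /* stands in for hash-list membership */
};

static void toy_lock(struct toy_dentry *d)
{
	/* Spin until we are the one who flipped the lock from false to true. */
	while (atomic_exchange_explicit(&d->lock, true, memory_order_acquire))
		;
}

static void toy_unlock(struct toy_dentry *d)
{
	atomic_store_explicit(&d->lock, false, memory_order_release);
}

static void toy_dentry_init(struct toy_dentry *d)
{
	atomic_init(&d->lock, false);
	d->flags = TOY_UNHASHED;
	d->on_hash_list = false;
}

/* Writers change the flag and the list membership under the same lock. */
static void toy_rehash(struct toy_dentry *d)
{
	toy_lock(d);
	d->flags &= ~TOY_UNHASHED;
	d->on_hash_list = true;
	toy_unlock(d);
}

static void toy_drop(struct toy_dentry *d)
{
	toy_lock(d);
	d->flags |= TOY_UNHASHED;
	d->on_hash_list = false;
	toy_unlock(d);
}

/* A locked read sees the flag and the membership agree. */
static bool toy_unhashed(struct toy_dentry *d)
{
	bool unhashed;

	toy_lock(d);
	unhashed = (d->flags & TOY_UNHASHED) != 0;
	assert(unhashed == !d->on_hash_list);
	toy_unlock(d);
	return unhashed;
}
```

A lockless read of flags alone would still compile here; the point of the
sketch is only that the two pieces of state never disagree for a reader
that holds the lock.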

I'll improve that changelog because, now that you've brought it to my
attention, I agree it's not very good.


> > I would prefer more in-depth review than from someone who doesn't know
> > d_lock protects d_flags,
> 
> Your implication about my competence is incorrect and entirely

Well, you said that my patch adds d_flags protection in bits and
pieces in a random manner. d_flags is already protected by d_lock
upstream, which I explained (nicely).


> inappropriate.  Ad hominem attacks don't improve your argument or
> encourage other people to review your code.

Well, you keep escalating it too, like when you swore at me when I tried
several times to explain an issue.

http://marc.info/?l=linux-fsdevel&m=129193745921777&w=2


> > but any and all help is welcome. Even minor
> > nitpicking or cleanups are welcome if they are relevant to the patches.
> 
> If _you_ decide they are relevant.

There is give and take.

 
> Nick, in the past couple of months you've burnt everyone who has
> tried to review your changes in any meaningful way. Nobody wants to
> engage with you because you've aggressively disagreed with every
> significant change that has been requested. You have shown no desire
> to compromise, instead you argue that you are right until you've had
> the last word, and you have frequently resorted to condescending and
> disrespectful attacks on reviewers. You would do well to keep that
> in mind next time you wonder why nobody is stepping up to review
> your code.

This is exactly what you and Christoph did, to me, actually. And you're
wrong, nobody was reviewing my code long before that little episode. I
certainly did compromise with Al, regarding the merging of the inode
lock stuff, and although I disagreed with some parts, I said OK fine.

You can't seem to concede a single time that I am right or have a valid
point. The best you can possibly manage is to go silent (and then maybe
bring it up again a few weeks later). This is perhaps why I appear so
insolent, because when I disagree with you, I'm wrong so my reasoning
must be irrelevant. It just keeps happening (recently again with the vfs
percpu counters thread).

So as far as I can see, there never was a bridge there to begin with. I
wish we could work together because I don't in fact question your
competence or intelligence, but it seems you do mine.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 11/46] fs: dcache scale hash
  2010-12-10  9:01               ` Dave Chinner
  2010-12-13  4:48                 ` Nick Piggin
@ 2010-12-13  5:05                 ` Nick Piggin
  1 sibling, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2010-12-13  5:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, Nick Piggin, linux-fsdevel, linux-kernel

On Fri, Dec 10, 2010 at 08:01:26PM +1100, Dave Chinner wrote:
> > I would prefer more in-depth review than from someone who doesn't know
> > d_lock protects d_flags,
> 
> Your implication about my competence is incorrect and entirely
> inappropriate.  Ad hominem attacks don't improve your argument or
> encourage other people to review your code.

I'll also just add that if you don't like ad hominem attacks, you
shouldn't have made implications about my integrity and honesty by
accusing me of wanting a rubber stamp, rather than a real review.
Which was before I suggested that you were confused about d_flags
locking, you'll note.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations
  2010-11-27  9:44 ` [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations Nick Piggin
@ 2011-01-18 22:32   ` Yehuda Sadeh Weinraub
  2011-01-18 22:42     ` Nick Piggin
  0 siblings, 1 reply; 107+ messages in thread
From: Yehuda Sadeh Weinraub @ 2011-01-18 22:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel, Sage Weil

On Sat, Nov 27, 2010 at 1:44 AM, Nick Piggin <npiggin@kernel.dk> wrote:
> The remaining usages for dcache_lock is to allow atomic, multi-step read-side
> operations over the directory tree by excluding modifications to the tree.
> Also, to walk in the leaf->root direction in the tree where we don't have
> a natural d_lock ordering.
>
> This could be accomplished by taking every d_lock, but this would mean a
> huge number of locks and actually gets very tricky.
>
> Solve this instead by using the rename seqlock for multi-step read-side
> operations, retry in case of a rename so we don't walk up the wrong parent.
> Concurrent dentry insertions are not serialised against.  Concurrent deletes
> are tricky when walking up the directory: our parent might have been deleted
> when dropping locks so also need to check and retry for that.
>
> We can also use the rename lock in cases where livelock is a worry (and it
> is introduced in subsequent patch).
>
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
..
> @@ -237,6 +238,7 @@ static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
>        __releases(dcache_inode_lock)
>        __releases(dcache_lock)
>  {
> +       dentry->d_parent = NULL;
>        list_del(&dentry->d_u.d_child);
>        if (parent)
>                spin_unlock(&parent->d_lock);

There's an issue with ceph as it references the
dentry->d_parent(->d_inode) at dentry_release(), so setting
dentry->d_parent to NULL here doesn't work with ceph. Though there is
some workaround for it, we would like to be sure that this one is
really required so that we don't exacerbate the ugliness. The
workaround is to keep a pointer to the parent inode in the private
dentry structure, which will be referenced only at the .release()
callback. This is clearly not ideal.

Yehuda
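
The workaround described above (caching the parent inode in the
filesystem-private dentry data so the release hook no longer needs
d_parent) might look roughly like the following userspace sketch. It is
not ceph's code: the toy_* types and functions are hypothetical, and the
sketch does not model how a real filesystem would keep the parent inode
alive until release.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_inode {
	int children_released;           /* bumped by the release hook */
};

struct toy_dentry {
	struct toy_dentry *d_parent;
	struct toy_inode *d_inode;
	void *d_fsdata;                  /* filesystem-private data */
};

/* Hypothetical private structure holding the cached parent inode. */
struct toy_dentry_info {
	struct toy_inode *parent_inode;
};

/* Link child under parent and remember the parent's inode privately. */
static int toy_attach(struct toy_dentry *child, struct toy_dentry *parent)
{
	struct toy_dentry_info *di;

	assert(parent->d_inode != NULL);
	di = malloc(sizeof(*di));
	if (!di)
		return -1;
	di->parent_inode = parent->d_inode;
	child->d_parent = parent;
	child->d_fsdata = di;
	return 0;
}

/* Mirrors the quoted hunk: the parent pointer is cleared on kill. */
static void toy_d_kill(struct toy_dentry *dentry)
{
	dentry->d_parent = NULL;
}

/* Release hook: uses the cached inode and never touches d_parent. */
static void toy_release(struct toy_dentry *dentry)
{
	struct toy_dentry_info *di = dentry->d_fsdata;

	if (!di)
		return;
	di->parent_inode->children_released++;
	free(di);
	dentry->d_fsdata = NULL;
}
```

The extra pointer per dentry and the lifetime question it raises are the
cost of this approach.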

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations
  2011-01-18 22:32   ` Yehuda Sadeh Weinraub
@ 2011-01-18 22:42     ` Nick Piggin
  2011-01-19 22:27       ` Yehuda Sadeh Weinraub
  0 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2011-01-18 22:42 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub; +Cc: Nick Piggin, linux-fsdevel, linux-kernel, Sage Weil

On Wed, Jan 19, 2011 at 9:32 AM, Yehuda Sadeh Weinraub
<yehudasa@gmail.com> wrote:
> On Sat, Nov 27, 2010 at 1:44 AM, Nick Piggin <npiggin@kernel.dk> wrote:
>> The remaining usages for dcache_lock is to allow atomic, multi-step read-side
>> operations over the directory tree by excluding modifications to the tree.
>> Also, to walk in the leaf->root direction in the tree where we don't have
>> a natural d_lock ordering.
>>
>> This could be accomplished by taking every d_lock, but this would mean a
>> huge number of locks and actually gets very tricky.
>>
>> Solve this instead by using the rename seqlock for multi-step read-side
>> operations, retry in case of a rename so we don't walk up the wrong parent.
>> Concurrent dentry insertions are not serialised against.  Concurrent deletes
>> are tricky when walking up the directory: our parent might have been deleted
>> when dropping locks so also need to check and retry for that.
>>
>> We can also use the rename lock in cases where livelock is a worry (and it
>> is introduced in subsequent patch).
>>
>> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> ..
>> @@ -237,6 +238,7 @@ static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
>>        __releases(dcache_inode_lock)
>>        __releases(dcache_lock)
>>  {
>> +       dentry->d_parent = NULL;
>>        list_del(&dentry->d_u.d_child);
>>        if (parent)
>>                spin_unlock(&parent->d_lock);
>
> There's an issue with ceph as it references the
> dentry->d_parent(->d_inode) at dentry_release(), so setting
> dentry->d_parent to NULL here doesn't work with ceph. Though there is
> some workaround for it, we would like to be sure that this one is
> really required so that we don't exacerbate the ugliness. The
> workaround is to keep a pointer to the parent inode in the private
> dentry structure, which will be referenced only at the .release()
> callback. This is clearly not ideal.

Hmm, I'll have to think about it. Probably we can check for
d_count == 0 rather than parent != NULL I think?

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations
  2011-01-18 22:42     ` Nick Piggin
@ 2011-01-19 22:27       ` Yehuda Sadeh Weinraub
  2011-01-19 22:32         ` Nick Piggin
  0 siblings, 1 reply; 107+ messages in thread
From: Yehuda Sadeh Weinraub @ 2011-01-19 22:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, Sage Weil, ceph-devel

On Tue, Jan 18, 2011 at 2:42 PM, Nick Piggin <npiggin@gmail.com> wrote:
> On Wed, Jan 19, 2011 at 9:32 AM, Yehuda Sadeh Weinraub

>> There's an issue with ceph as it references the
>> dentry->d_parent(->d_inode) at dentry_release(), so setting
>> dentry->d_parent to NULL here doesn't work with ceph. Though there is
>> some workaround for it, we would like to be sure that this one is
>> really required so that we don't exacerbate the ugliness. The
>> workaround is to keep a pointer to the parent inode in the private
>> dentry structure, which will be referenced only at the .release()
>> callback. This is clearly not ideal.
>
> Hmm, I'll have to think about it. Probably we can check for
> d_count == 0 rather than parent != NULL I think?
>

That'll solve ceph's problem, don't know about how'd affect other
stuff. We'll need to know whether this is the solution, or whether
we'd need to introduce some other band aid fix.

Thanks,
Yehuda

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations
  2011-01-19 22:27       ` Yehuda Sadeh Weinraub
@ 2011-01-19 22:32         ` Nick Piggin
  2011-01-25 22:10           ` Yehuda Sadeh Weinraub
  0 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2011-01-19 22:32 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, Sage Weil, ceph-devel

On Thu, Jan 20, 2011 at 9:27 AM, Yehuda Sadeh Weinraub
<yehudasa@gmail.com> wrote:
> On Tue, Jan 18, 2011 at 2:42 PM, Nick Piggin <npiggin@gmail.com> wrote:
>> On Wed, Jan 19, 2011 at 9:32 AM, Yehuda Sadeh Weinraub
>
>>> There's an issue with ceph as it references the
>>> dentry->d_parent(->d_inode) at dentry_release(), so setting
>>> dentry->d_parent to NULL here doesn't work with ceph. Though there is
>>> some workaround for it, we would like to be sure that this one is
>>> really required so that we don't exacerbate the ugliness. The
>>> workaround is to keep a pointer to the parent inode in the private
>>> dentry structure, which will be referenced only at the .release()
>>> callback. This is clearly not ideal.
>>
>> Hmm, I'll have to think about it. Probably we can check for
>> d_count == 0 rather than parent != NULL I think?
>>
>
> That'll solve ceph's problem, don't know about how'd affect other
> stuff. We'll need to know whether this is the solution, or whether
> we'd need to introduce some other band aid fix.

No I think it will work fine. Basically we just need to know whether
we have been deleted, and if so then we restart rather than walking
back up the parent.

I'll send a patch in a few days. For the meantime, it's a rather
small window for ceph to worry about. So we'll have something
before -rc2 which should be OK.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations
  2011-01-19 22:32         ` Nick Piggin
@ 2011-01-25 22:10           ` Yehuda Sadeh Weinraub
  2011-01-27  5:18             ` Nick Piggin
  0 siblings, 1 reply; 107+ messages in thread
From: Yehuda Sadeh Weinraub @ 2011-01-25 22:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, Sage Weil, ceph-devel

On Wed, Jan 19, 2011 at 2:32 PM, Nick Piggin <npiggin@gmail.com> wrote:
> On Thu, Jan 20, 2011 at 9:27 AM, Yehuda Sadeh Weinraub
> <yehudasa@gmail.com> wrote:
>> On Tue, Jan 18, 2011 at 2:42 PM, Nick Piggin <npiggin@gmail.com> wrote:
>>> On Wed, Jan 19, 2011 at 9:32 AM, Yehuda Sadeh Weinraub
>>
>>>> There's an issue with ceph as it references the
>>>> dentry->d_parent(->d_inode) at dentry_release(), so setting
>>>> dentry->d_parent to NULL here doesn't work with ceph. Though there is
>>>> some workaround for it, we would like to be sure that this one is
>>>> really required so that we don't exacerbate the ugliness. The
>>>> workaround is to keep a pointer to the parent inode in the private
>>>> dentry structure, which will be referenced only at the .release()
>>>> callback. This is clearly not ideal.
>>>
>>> Hmm, I'll have to think about it. Probably we can check for
>>> d_count == 0 rather than parent != NULL I think?
>>>
>>
>> That'll solve ceph's problem, don't know about how'd affect other
>> stuff. We'll need to know whether this is the solution, or whether
>> we'd need to introduce some other band aid fix.
>
> No I think it will work fine. Basically we just need to know whether
> we have been deleted, and if so then we restart rather than walking
> back up the parent.
>
> I'll send a patch in a few days. For the meantime, it's a rather
> small window for ceph to worry about. So we'll have something
> before -rc2 which should be OK.
>

I guess that it's a bit late for -rc2, should we assume that it'll be on -rc3?

Thanks,
Yehuda

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations
  2011-01-25 22:10           ` Yehuda Sadeh Weinraub
@ 2011-01-27  5:18             ` Nick Piggin
  2011-02-07 18:52               ` Jim Schutt
  2011-02-14 17:57               ` Yehuda Sadeh Weinraub
  0 siblings, 2 replies; 107+ messages in thread
From: Nick Piggin @ 2011-01-27  5:18 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, Sage Weil, ceph-devel

On Wed, Jan 26, 2011 at 9:10 AM, Yehuda Sadeh Weinraub
<yehudasa@gmail.com> wrote:
> On Wed, Jan 19, 2011 at 2:32 PM, Nick Piggin <npiggin@gmail.com> wrote:
>> On Thu, Jan 20, 2011 at 9:27 AM, Yehuda Sadeh Weinraub
>> <yehudasa@gmail.com> wrote:
>>> On Tue, Jan 18, 2011 at 2:42 PM, Nick Piggin <npiggin@gmail.com> wrote:
>>>> On Wed, Jan 19, 2011 at 9:32 AM, Yehuda Sadeh Weinraub
>>>
>>>>> There's an issue with ceph as it references the
>>>>> dentry->d_parent(->d_inode) at dentry_release(), so setting
>>>>> dentry->d_parent to NULL here doesn't work with ceph. Though there is
>>>>> some workaround for it, we would like to be sure that this one is
>>>>> really required so that we don't exacerbate the ugliness. The
>>>>> workaround is to keep a pointer to the parent inode in the private
>>>>> dentry structure, which will be referenced only at the .release()
>>>>> callback. This is clearly not ideal.
>>>>
>>>> Hmm, I'll have to think about it. Probably we can check for
>>>> d_count == 0 rather than parent != NULL I think?
>>>>
>>>
>>> That'll solve ceph's problem, don't know about how'd affect other
>>> stuff. We'll need to know whether this is the solution, or whether
>>> we'd need to introduce some other band aid fix.
>>
>> No I think it will work fine. Basically we just need to know whether
>> we have been deleted, and if so then we restart rather than walking
>> back up the parent.
>>
>> I'll send a patch in a few days. For the meantime, it's a rather
>> small window for ceph to worry about. So we'll have something
>> before -rc2 which should be OK.
>>
>
> I guess that it's a bit late for -rc2, should we assume that it'll be on -rc3?

Yeah, I'm sorry I've been travelling and a bit disconnected.

NFS folk are having a similar problem, and it looks like a similar
proposed fix will do it.

http://marc.info/?l=linux-fsdevel&m=129599823927039&w=2

So I think it is the best way to go to restore behaviour back to what
filesystems already expect, to avoid more surprises in future.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations
  2011-01-27  5:18             ` Nick Piggin
@ 2011-02-07 18:52               ` Jim Schutt
  2011-02-07 21:04                   ` Yehuda Sadeh Weinraub
  2011-02-14 17:57               ` Yehuda Sadeh Weinraub
  1 sibling, 1 reply; 107+ messages in thread
From: Jim Schutt @ 2011-02-07 18:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Yehuda Sadeh Weinraub, Nick Piggin, linux-fsdevel, linux-kernel,
	Sage Weil, ceph-devel


On Wed, 2011-01-26 at 22:18 -0700, Nick Piggin wrote:
> On Wed, Jan 26, 2011 at 9:10 AM, Yehuda Sadeh Weinraub
> <yehudasa@gmail.com> wrote:
> > On Wed, Jan 19, 2011 at 2:32 PM, Nick Piggin <npiggin@gmail.com> wrote:
> >> On Thu, Jan 20, 2011 at 9:27 AM, Yehuda Sadeh Weinraub
> >> <yehudasa@gmail.com> wrote:
> >>> On Tue, Jan 18, 2011 at 2:42 PM, Nick Piggin <npiggin@gmail.com> wrote:
> >>>> On Wed, Jan 19, 2011 at 9:32 AM, Yehuda Sadeh Weinraub
> >>>
> >>>>> There's an issue with ceph as it references the
> >>>>> dentry->d_parent(->d_inode) at dentry_release(), so setting
> >>>>> dentry->d_parent to NULL here doesn't work with ceph. Though there is
> >>>>> some workaround for it, we would like to be sure that this one is
> >>>>> really required so that we don't exacerbate the ugliness. The
> >>>>> workaround is to keep a pointer to the parent inode in the private
> >>>>> dentry structure, which will be referenced only at the .release()
> >>>>> callback. This is clearly not ideal.
> >>>>
> >>>> Hmm, I'll have to think about it. Probably we can check for
> >>>> d_count == 0 rather than parent != NULL I think?
> >>>>
> >>>
> >>> That'll solve ceph's problem, don't know about how'd affect other
> >>> stuff. We'll need to know whether this is the solution, or whether
> >>> we'd need to introduce some other band aid fix.
> >>
> >> No I think it will work fine. Basically we just need to know whether
> >> we have been deleted, and if so then we restart rather than walking
> >> back up the parent.
> >>
> >> I'll send a patch in a few days. For the meantime, it's a rather
> >> small window for ceph to worry about. So we'll have something
> >> before -rc2 which should be OK.
> >>
> >
> > I guess that it's a bit late for -rc2, should we assume that it'll be on -rc3?
> 
> Yeah, I'm sorry I've been travelling and a bit disconnected.
> 
> NFS folk are having a similar problem and looks like similar
> proposed fix will do it.
> 
> http://marc.info/?l=linux-fsdevel&m=129599823927039&w=2
> 
> So I think it is the best way to go to restore behaviour back to what
> filesystems already expect, to avoid more surprises in future.

I think the following BUG indicates I'm hitting this problem?
All I have to do to cause it is unlink a file.

My ceph client kernel is 8dbdea8444 (master branch) from 
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
+ e41cdbb6c5 (master branch) + a3f5274e53 (unstable branch)
  from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git

Are there any patches available for this I can test?

Thanks -- Jim

[ 1471.018973] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 1471.019909] IP: [<ffffffffa0748275>] ceph_dentry_release+0x31/0x148 [ceph]
[ 1471.019909] PGD 121fb9067 PUD 120520067 PMD 0 
[ 1471.019909] Oops: 0000 [#1] SMP 
[ 1471.019909] last sysfs file: /sys/block/md0/range
[ 1471.019909] CPU 1 
[ 1471.019909] Modules linked in: ceph libceph ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp i2c_dev i2c_core ext3 jbd be2iscsi iscsi_boot_sysfs iscsi]
[ 1471.019909] 
[ 1471.019909] Pid: 20, comm: kworker/1:1 Not tainted 2.6.38-rc3-00247-g4a9cd22 #13 0UR033/PowerEdge 1950
[ 1471.019909] RIP: 0010:[<ffffffffa0748275>]  [<ffffffffa0748275>] ceph_dentry_release+0x31/0x148 [ceph]
[ 1471.019909] RSP: 0018:ffff88012b09ba20  EFLAGS: 00010286
[ 1471.019909] RAX: 0000000000000000 RBX: ffff880129e3f0c0 RCX: ffff88011d448280
[ 1471.019909] RDX: 000000000000cbc0 RSI: 0000000000000001 RDI: ffff880129e3f0c0
[ 1471.019909] RBP: ffff88012b09ba60 R08: 0000000000000000 R09: ffff88012b09b9e0
[ 1471.019909] R10: 000001000000fa40 R11: ffff88012b09ba20 R12: ffff88011d448840
[ 1471.019909] R13: 0000000000000000 R14: ffff880129e3f0c0 R15: ffff88011d416000
[ 1471.019909] FS:  0000000000000000(0000) GS:ffff8800cfc40000(0000) knlGS:0000000000000000
[ 1471.019909] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1471.019909] CR2: 0000000000000030 CR3: 0000000128a1b000 CR4: 00000000000006e0
[ 1471.019909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1471.019909] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1471.019909] Process kworker/1:1 (pid: 20, threadinfo ffff88012b09a000, task ffff88012b0a8000)
[ 1471.019909] Stack:
[ 1471.019909]  ffff88011d448840 ffff880128f89800 fffffffffffffffe ffff880129e3f0c0
[ 1471.019909]  ffff88011d448840 ffff880129fa70c0 0000000000000001 ffff880120f86000
[ 1471.019909]  ffff88012b09ba80 ffffffff81104007 ffff880129e3f0c0 ffff880129fa70c0
[ 1471.019909] Call Trace:
[ 1471.019909]  [<ffffffff81104007>] d_free+0x37/0x5c
[ 1471.019909]  [<ffffffff811048d4>] dentry_kill+0x11a/0x126
[ 1471.019909]  [<ffffffff8110523d>] dput+0xbc/0xc9
[ 1471.019909]  [<ffffffffa076099c>] ceph_mdsc_release_request+0xa9/0x117 [ceph]
[ 1471.019909]  [<ffffffffa07608f3>] ? ceph_mdsc_release_request+0x0/0x117 [ceph]
[ 1471.019909]  [<ffffffff811ab542>] kref_put+0x43/0x4f
[ 1471.019909]  [<ffffffffa075bab9>] ceph_mdsc_put_request+0x1c/0x1e [ceph]
[ 1471.019909]  [<ffffffffa075fc26>] dispatch+0xbdc/0x1282 [ceph]
[ 1471.019909]  [<ffffffff81192c70>] ? chksum_update+0x15/0x1d
[ 1471.019909]  [<ffffffff8118cf30>] ? crypto_shash_update+0x1f/0x21
[ 1471.019909]  [<ffffffff812cf943>] ? kernel_recvmsg+0x3a/0x46
[ 1471.019909]  [<ffffffffa0704c69>] ? ceph_tcp_recvmsg+0x4e/0x5b [libceph]
[ 1471.019909]  [<ffffffffa07066ce>] try_read+0x1363/0x1508 [libceph]
[ 1471.019909]  [<ffffffff81030af3>] ? should_resched+0xe/0x2f
[ 1471.019909]  [<ffffffffa0707318>] con_work+0xec/0x1426 [libceph]
[ 1471.019909]  [<ffffffff81030adb>] ? need_resched+0x23/0x2d
[ 1471.019909]  [<ffffffff8136f43a>] ? schedule+0x68d/0x6a7
[ 1471.019909]  [<ffffffff8104e9d5>] ? add_timer+0x1c/0x1e
[ 1471.019909]  [<ffffffff81058341>] ? queue_delayed_work_on+0xde/0xf2
[ 1471.019909]  [<ffffffff81056dc5>] process_one_work+0x16e/0x26a
[ 1471.019909]  [<ffffffffa070722c>] ? con_work+0x0/0x1426 [libceph]
[ 1471.019909]  [<ffffffff8105852d>] ? worker_thread+0x0/0x183
[ 1471.019909]  [<ffffffff810585f0>] worker_thread+0xc3/0x183
[ 1471.019909]  [<ffffffff8105be62>] kthread+0x72/0x7a
[ 1471.019909]  [<ffffffff81003914>] kernel_thread_helper+0x4/0x10
[ 1471.019909]  [<ffffffff8105bdf0>] ? kthread+0x0/0x7a
[ 1471.019909]  [<ffffffff81003910>] ? kernel_thread_helper+0x0/0x10
[ 1471.019909] Code: 41 56 41 55 41 54 53 48 83 ec 18 0f 1f 44 00 00 48 8b 47 18 45 31 ed 4c 8b 7f 78 49 89 fe 48 c7 45 d0 fe ff ff ff 48 39 c7 74 14 <4c> 8b 68 30 4d 85 ed 74 0b 49 8b 85 08 fd ff ff 48 89 45 d0 80 
[ 1471.019909] RIP  [<ffffffffa0748275>] ceph_dentry_release+0x31/0x148 [ceph]
[ 1471.019909]  RSP <ffff88012b09ba20>
[ 1471.019909] CR2: 0000000000000030
[ 1471.455942] ---[ end trace 782e52b3ca82de3c ]---
[ 1471.460581] BUG: unable to handle kernel paging request at fffffffffffffff8
[ 1471.461551] IP: [<ffffffff8105bb0e>] kthread_data+0x10/0x16
[ 1471.461551] PGD 1805067 PUD 1806067 PMD 0 
[ 1471.461551] Oops: 0000 [#2] SMP 
[ 1471.461551] last sysfs file: /sys/block/md0/range
[ 1471.461551] CPU 1 
[ 1471.461551] Modules linked in: ceph libceph ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp i2c_dev i2c_core ext3 jbd be2iscsi iscsi_boot_sysfs iscsi]
[ 1471.461551] 
[ 1471.461551] Pid: 20, comm: kworker/1:1 Tainted: G      D     2.6.38-rc3-00247-g4a9cd22 #13 0UR033/PowerEdge 1950
[ 1471.461551] RIP: 0010:[<ffffffff8105bb0e>]  [<ffffffff8105bb0e>] kthread_data+0x10/0x16
[ 1471.461551] RSP: 0018:ffff88012b09b5b8  EFLAGS: 00010092
[ 1471.461551] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88012b0a8000
[ 1471.461551] RDX: 000000000000cbc0 RSI: 0000000000000001 RDI: ffff88012b0a8000
[ 1471.461551] RBP: ffff88012b09b5b8 R08: ffff8800cfc54f40 R09: dead000000200200
[ 1471.461551] R10: dead000000200200 R11: 0000000000000002 R12: 00007ffffffff000
[ 1471.461551] R13: 0000000000000001 R14: ffff8800cfc51cc0 R15: 0000000000000001
[ 1471.461551] FS:  0000000000000000(0000) GS:ffff8800cfc40000(0000) knlGS:0000000000000000
[ 1471.461551] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1471.461551] CR2: fffffffffffffff8 CR3: 0000000128a1b000 CR4: 00000000000006e0
[ 1471.461551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1471.461551] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1471.461551] Process kworker/1:1 (pid: 20, threadinfo ffff88012b09a000, task ffff88012b0a8000)
[ 1471.461551] Stack:
[ 1471.461551]  ffff88012b09b5e8 ffffffff810583d7 ffff88012b09b5f8 0000000000000001
[ 1471.461551]  00007ffffffff000 0000000000011cc0 ffff88012b09b6f8 ffffffff8136ef22
[ 1471.461551]  ffff88012b09b618 ffff88012b0a8000 ffff88012b6b0850 ffff88012b0a83a0
[ 1471.461551] Call Trace:
[ 1471.461551]  [<ffffffff810583d7>] wq_worker_sleeping+0x1a/0x87
[ 1471.461551]  [<ffffffff8136ef22>] schedule+0x175/0x6a7
[ 1471.461551]  [<ffffffff8108b8d0>] ? call_rcu_sched+0x15/0x17
[ 1471.461551]  [<ffffffff810e8b0e>] ? __slab_free+0x52/0xe9
[ 1471.461551]  [<ffffffff8108b8d0>] ? call_rcu_sched+0x15/0x17
[ 1471.461551]  [<ffffffff810443f2>] ? release_task+0x32b/0x343
[ 1471.461551]  [<ffffffff810601df>] ? switch_task_namespaces+0x1d/0x51
[ 1471.461551]  [<ffffffff810456cf>] do_exit+0x678/0x692
[ 1471.461551]  [<ffffffff81371b59>] oops_end+0xb7/0xbf
[ 1471.461551]  [<ffffffff81026de1>] no_context+0x1fa/0x209
[ 1471.461551]  [<ffffffff81027076>] __bad_area_nosemaphore+0x187/0x1aa
[ 1471.461551]  [<ffffffff810b517d>] ? __pagevec_free+0x70/0x8c
[ 1471.461551]  [<ffffffff810b8ab2>] ? hpage_nr_pages+0x1a/0x2c
[ 1471.461551]  [<ffffffff81027123>] bad_area_nosemaphore+0x13/0x18
[ 1471.461551]  [<ffffffff81373aa5>] do_page_fault+0x175/0x325
[ 1471.461551]  [<ffffffff810afe20>] ? find_get_pages+0x44/0xbb
[ 1471.461551]  [<ffffffffa075070b>] ? list_add+0x11/0x13 [ceph]
[ 1471.461551]  [<ffffffffa0753773>] ? ceph_put_cap+0xf6/0x12d [ceph]
[ 1471.461551]  [<ffffffff810b9611>] ? pagevec_lookup+0x24/0x2d
[ 1471.461551]  [<ffffffff813710df>] page_fault+0x1f/0x30
[ 1471.461551]  [<ffffffffa0748275>] ? ceph_dentry_release+0x31/0x148 [ceph]
[ 1471.461551]  [<ffffffff81104007>] d_free+0x37/0x5c
[ 1471.461551]  [<ffffffff811048d4>] dentry_kill+0x11a/0x126
[ 1471.461551]  [<ffffffff8110523d>] dput+0xbc/0xc9
[ 1471.461551]  [<ffffffffa076099c>] ceph_mdsc_release_request+0xa9/0x117 [ceph]
[ 1471.461551]  [<ffffffffa07608f3>] ? ceph_mdsc_release_request+0x0/0x117 [ceph]
[ 1471.461551]  [<ffffffff811ab542>] kref_put+0x43/0x4f
[ 1471.461551]  [<ffffffffa075bab9>] ceph_mdsc_put_request+0x1c/0x1e [ceph]
[ 1471.461551]  [<ffffffffa075fc26>] dispatch+0xbdc/0x1282 [ceph]
[ 1471.461551]  [<ffffffff81192c70>] ? chksum_update+0x15/0x1d
[ 1471.461551]  [<ffffffff8118cf30>] ? crypto_shash_update+0x1f/0x21
[ 1471.461551]  [<ffffffff812cf943>] ? kernel_recvmsg+0x3a/0x46
[ 1471.461551]  [<ffffffffa0704c69>] ? ceph_tcp_recvmsg+0x4e/0x5b [libceph]
[ 1471.461551]  [<ffffffffa07066ce>] try_read+0x1363/0x1508 [libceph]
[ 1471.461551]  [<ffffffff81030af3>] ? should_resched+0xe/0x2f
[ 1471.461551]  [<ffffffffa0707318>] con_work+0xec/0x1426 [libceph]
[ 1471.461551]  [<ffffffff81030adb>] ? need_resched+0x23/0x2d
[ 1471.461551]  [<ffffffff8136f43a>] ? schedule+0x68d/0x6a7
[ 1471.461551]  [<ffffffff8104e9d5>] ? add_timer+0x1c/0x1e
[ 1471.461551]  [<ffffffff81058341>] ? queue_delayed_work_on+0xde/0xf2
[ 1471.461551]  [<ffffffff81056dc5>] process_one_work+0x16e/0x26a
[ 1471.461551]  [<ffffffffa070722c>] ? con_work+0x0/0x1426 [libceph]
[ 1471.461551]  [<ffffffff8105852d>] ? worker_thread+0x0/0x183
[ 1471.461551]  [<ffffffff810585f0>] worker_thread+0xc3/0x183
[ 1471.461551]  [<ffffffff8105be62>] kthread+0x72/0x7a
[ 1471.461551]  [<ffffffff81003914>] kernel_thread_helper+0x4/0x10
[ 1471.461551]  [<ffffffff8105bdf0>] ? kthread+0x0/0x7a
[ 1471.461551]  [<ffffffff81003910>] ? kernel_thread_helper+0x0/0x10
[ 1471.461551] Code: e5 0f 1f 44 00 00 65 48 8b 04 25 80 b5 00 00 48 8b 80 48 03 00 00 8b 40 f0 c9 c3 55 48 89 e5 0f 1f 44 00 00 48 8b 87 48 03 00 00 <48> 8b 40 f8 c9 c3 55 48 89 e5 0f 1f 44 00 00 48 8d 47 08 c7 07 
[ 1471.461551] RIP  [<ffffffff8105bb0e>] kthread_data+0x10/0x16
[ 1471.461551]  RSP <ffff88012b09b5b8>
[ 1471.461551] CR2: fffffffffffffff8
[ 1471.461551] ---[ end trace 782e52b3ca82de3d ]---
[ 1471.461551] Fixing recursive fault but reboot is needed!

> 
> Thanks,
> Nick



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations
  2011-02-07 18:52               ` Jim Schutt
@ 2011-02-07 21:04                   ` Yehuda Sadeh Weinraub
  0 siblings, 0 replies; 107+ messages in thread
From: Yehuda Sadeh Weinraub @ 2011-02-07 21:04 UTC (permalink / raw)
  To: Jim Schutt
  Cc: Nick Piggin, Nick Piggin, linux-fsdevel, linux-kernel, Sage Weil,
	ceph-devel

On Mon, Feb 7, 2011 at 10:52 AM, Jim Schutt <jaschut@sandia.gov> wrote:
>
> On Wed, 2011-01-26 at 22:18 -0700, Nick Piggin wrote:
>> On Wed, Jan 26, 2011 at 9:10 AM, Yehuda Sadeh Weinraub
>> <yehudasa@gmail.com> wrote:
>> > On Wed, Jan 19, 2011 at 2:32 PM, Nick Piggin <npiggin@gmail.com> wrote:
>> >> On Thu, Jan 20, 2011 at 9:27 AM, Yehuda Sadeh Weinraub
>> >> <yehudasa@gmail.com> wrote:
>> >>> On Tue, Jan 18, 2011 at 2:42 PM, Nick Piggin <npiggin@gmail.com> wrote:
>> >>>> On Wed, Jan 19, 2011 at 9:32 AM, Yehuda Sadeh Weinraub
>> >>>
>> >>>>> There's an issue with ceph as it references the
>> >>>>> dentry->d_parent(->d_inode) at dentry_release(), so setting
>> >>>>> dentry->d_parent to NULL here doesn't work with ceph. Though there is
>> >>>>> some workaround for it, we would like to be sure that this one is
>> >>>>> really required so that we don't exacerbate the ugliness. The
>> >>>>> workaround is to keep a pointer to the parent inode in the private
>> >>>>> dentry structure, which will be referenced only at the .release()
>> >>>>> callback. This is clearly not ideal.
>> >>>>
>> >>>> Hmm, I'll have to think about it. Probably we can check for
>> >>>> d_count == 0 rather than parent != NULL I think?
>> >>>>
>> >>>
>> >>> That'll solve ceph's problem, don't know about how'd affect other
>> >>> stuff. We'll need to know whether this is the solution, or whether
>> >>> we'd need to introduce some other band aid fix.
>> >>
>> >> No I think it will work fine. Basically we just need to know whether
>> >> we have been deleted, and if so then we restart rather than walking
>> >> back up the parent.
>> >>
>> >> I'll send a patch in a few days. For the meantime, it's a rather
>> >> small window for ceph to worry about. So we'll have something
>> >> before -rc2 which should be OK.
>> >>
>> >
>> > I guess that it's a bit late for -rc2, should we assume that it'll be on -rc3?
>>
>> Yeah, I'm sorry I've been travelling and a bit disconnected.
>>
>> NFS folk are having a similar problem and looks like similar
>> proposed fix will do it.
>>
>> http://marc.info/?l=linux-fsdevel&m=129599823927039&w=2
>>
>> So I think it is the best way to go to restore behaviour back to what
>> filesystems already expect, to avoid more surprises in future.
>
> I think the following BUG indicates I'm hitting this problem?
> All I have to do to cause it is unlink a file.
>
> My ceph client kernel is 8dbdea8444 (master branch) from
>  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> + e41cdbb6c5 (master branch) + a3f5274e53 (unstable branch)
>  from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
>
> Are there any patches available for this I can test?
>
> Thanks -- Jim
>

It does look like this specific problem.
You can try cherry-pick commit 9c3db35 off the ceph git. It is just a
temporary workaround, and it wasn't tested too much. Hopefully Nick
will push his fix soon so that it wouldn't be needed.

Thanks,
Yehuda

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations
  2011-02-07 21:04                   ` Yehuda Sadeh Weinraub
  (?)
@ 2011-02-07 21:31                   ` Jim Schutt
  2011-02-07 21:35                     ` Gregory Farnum
  -1 siblings, 1 reply; 107+ messages in thread
From: Jim Schutt @ 2011-02-07 21:31 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub
  Cc: Nick Piggin, Nick Piggin, linux-fsdevel, linux-kernel, Sage Weil,
	ceph-devel


On Mon, 2011-02-07 at 14:04 -0700, Yehuda Sadeh Weinraub wrote:
> On Mon, Feb 7, 2011 at 10:52 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> >
> > On Wed, 2011-01-26 at 22:18 -0700, Nick Piggin wrote:
> >> On Wed, Jan 26, 2011 at 9:10 AM, Yehuda Sadeh Weinraub
> >> <yehudasa@gmail.com> wrote:
> >> > On Wed, Jan 19, 2011 at 2:32 PM, Nick Piggin <npiggin@gmail.com> wrote:
> >> >> On Thu, Jan 20, 2011 at 9:27 AM, Yehuda Sadeh Weinraub
> >> >> <yehudasa@gmail.com> wrote:
> >> >>> On Tue, Jan 18, 2011 at 2:42 PM, Nick Piggin <npiggin@gmail.com> wrote:
> >> >>>> On Wed, Jan 19, 2011 at 9:32 AM, Yehuda Sadeh Weinraub
> >> >>>
> >> >>>>> There's an issue with ceph as it references the
> >> >>>>> dentry->d_parent(->d_inode) at dentry_release(), so setting
> >> >>>>> dentry->d_parent to NULL here doesn't work with ceph. Though there is
> >> >>>>> some workaround for it, we would like to be sure that this one is
> >> >>>>> really required so that we don't exacerbate the ugliness. The
> >> >>>>> workaround is to keep a pointer to the parent inode in the private
> >> >>>>> dentry structure, which will be referenced only at the .release()
> >> >>>>> callback. This is clearly not ideal.
> >> >>>>
> >> >>>> Hmm, I'll have to think about it. Probably we can check for
> >> >>>> d_count == 0 rather than parent != NULL I think?
> >> >>>>
> >> >>>
> >> >>> That'll solve ceph's problem, don't know about how'd affect other
> >> >>> stuff. We'll need to know whether this is the solution, or whether
> >> >>> we'd need to introduce some other band aid fix.
> >> >>
> >> >> No I think it will work fine. Basically we just need to know whether
> >> >> we have been deleted, and if so then we restart rather than walking
> >> >> back up the parent.
> >> >>
> >> >> I'll send a patch in a few days. For the meantime, it's a rather
> >> >> small window for ceph to worry about. So we'll have something
> >> >> before -rc2 which should be OK.
> >> >>
> >> >
> >> > I guess that it's a bit late for -rc2, should we assume that it'll be on -rc3?
> >>
> >> Yeah, I'm sorry I've been travelling and a bit disconnected.
> >>
> >> NFS folk are having a similar problem and looks like similar
> >> proposed fix will do it.
> >>
> >> http://marc.info/?l=linux-fsdevel&m=129599823927039&w=2
> >>
> >> So I think it is the best way to go to restore behaviour back to what
> >> filesystems already expect, to avoid more surprises in future.
> >
> > I think the following BUG indicates I'm hitting this problem?
> > All I have to do to cause it is unlink a file.
> >
> > My ceph client kernel is 8dbdea8444 (master branch) from
> >  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> > + e41cdbb6c5 (master branch) + a3f5274e53 (unstable branch)
> >  from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
> >
> > Are there any patches available for this I can test?
> >
> > Thanks -- Jim
> >
> 
> It does look like this specific problem.
> You can try cherry-pick commit 9c3db35 off the ceph git. It is just a
> temporary workaround, and it wasn't tested too much. Hopefully Nick
> will push his fix soon so that it wouldn't be needed.

That commit doesn't seem to be in
  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git

Maybe there's another tree I should be looking at?

Thanks -- Jim

> 
> Thanks,
> Yehuda
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations
  2011-02-07 21:31                   ` Jim Schutt
@ 2011-02-07 21:35                     ` Gregory Farnum
  0 siblings, 0 replies; 107+ messages in thread
From: Gregory Farnum @ 2011-02-07 21:35 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Yehuda Sadeh Weinraub, ceph-devel

On Mon, Feb 7, 2011 at 1:31 PM, Jim Schutt <jaschut@sandia.gov> wrote:
>
> On Mon, 2011-02-07 at 14:04 -0700, Yehuda Sadeh Weinraub wrote:
>> It does look like this specific problem.
>> You can try cherry-pick commit 9c3db35 off the ceph git. It is just a
>> temporary workaround, and it wasn't tested too much. Hopefully Nick
>> will push his fix soon so that it wouldn't be needed.
>
> That commit doesn't seem to be in
>  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
>
> Maybe there's another tree I should be looking at?
>
> Thanks -- Jim
Yep:
ceph.newdream.net:/git/ceph-client.git

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations
  2011-02-07 21:04                   ` Yehuda Sadeh Weinraub
  (?)
  (?)
@ 2011-02-07 22:25                   ` Jim Schutt
  -1 siblings, 0 replies; 107+ messages in thread
From: Jim Schutt @ 2011-02-07 22:25 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub
  Cc: Nick Piggin, Nick Piggin, linux-fsdevel, linux-kernel, Sage Weil,
	ceph-devel


On Mon, 2011-02-07 at 14:04 -0700, Yehuda Sadeh Weinraub wrote:
> On Mon, Feb 7, 2011 at 10:52 AM, Jim Schutt <jaschut@sandia.gov> wrote:
> >
> > On Wed, 2011-01-26 at 22:18 -0700, Nick Piggin wrote:
> >> On Wed, Jan 26, 2011 at 9:10 AM, Yehuda Sadeh Weinraub
> >> <yehudasa@gmail.com> wrote:
> >> > On Wed, Jan 19, 2011 at 2:32 PM, Nick Piggin <npiggin@gmail.com> wrote:
> >> >> On Thu, Jan 20, 2011 at 9:27 AM, Yehuda Sadeh Weinraub
> >> >> <yehudasa@gmail.com> wrote:
> >> >>> On Tue, Jan 18, 2011 at 2:42 PM, Nick Piggin <npiggin@gmail.com> wrote:
> >> >>>> On Wed, Jan 19, 2011 at 9:32 AM, Yehuda Sadeh Weinraub
> >> >>>
> >> >>>>> There's an issue with ceph as it references the
> >> >>>>> dentry->d_parent(->d_inode) at dentry_release(), so setting
> >> >>>>> dentry->d_parent to NULL here doesn't work with ceph. Though there is
> >> >>>>> some workaround for it, we would like to be sure that this one is
> >> >>>>> really required so that we don't exacerbate the ugliness. The
> >> >>>>> workaround is to keep a pointer to the parent inode in the private
> >> >>>>> dentry structure, which will be referenced only at the .release()
> >> >>>>> callback. This is clearly not ideal.
> >> >>>>
> >> >>>> Hmm, I'll have to think about it. Probably we can check for
> >> >>>> d_count == 0 rather than parent != NULL I think?
> >> >>>>
> >> >>>
> >> >>> That'll solve ceph's problem, don't know about how'd affect other
> >> >>> stuff. We'll need to know whether this is the solution, or whether
> >> >>> we'd need to introduce some other band aid fix.
> >> >>
> >> >> No I think it will work fine. Basically we just need to know whether
> >> >> we have been deleted, and if so then we restart rather than walking
> >> >> back up the parent.
> >> >>
> >> >> I'll send a patch in a few days. For the meantime, it's a rather
> >> >> small window for ceph to worry about. So we'll have something
> >> >> before -rc2 which should be OK.
> >> >>
> >> >
> >> > I guess that it's a bit late for -rc2, should we assume that it'll be on -rc3?
> >>
> >> Yeah, I'm sorry I've been travelling and a bit disconnected.
> >>
> >> NFS folk are having a similar problem and looks like similar
> >> proposed fix will do it.
> >>
> >> http://marc.info/?l=linux-fsdevel&m=129599823927039&w=2
> >>
> >> So I think it is the best way to go to restore behaviour back to what
> >> filesystems already expect, to avoid more surprises in future.
> >
> > I think the following BUG indicates I'm hitting this problem?
> > All I have to do to cause it is unlink a file.
> >
> > My ceph client kernel is 8dbdea8444 (master branch) from
> >  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
> > + e41cdbb6c5 (master branch) + a3f5274e53 (unstable branch)
> >  from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
> >
> > Are there any patches available for this I can test?
> >
> > Thanks -- Jim
> >
> 
> It does look like this specific problem.
> You can try cherry-pick commit 9c3db35 off the ceph git. It is just a
> temporary workaround, and it wasn't tested too much. Hopefully Nick
> will push his fix soon so that it wouldn't be needed.


That commit fixes my unlink issue, thanks.
I'm happy to use it while things get resolved.

-- Jim

> 
> Thanks,
> Yehuda
> 



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations
  2011-01-27  5:18             ` Nick Piggin
  2011-02-07 18:52               ` Jim Schutt
@ 2011-02-14 17:57               ` Yehuda Sadeh Weinraub
  1 sibling, 0 replies; 107+ messages in thread
From: Yehuda Sadeh Weinraub @ 2011-02-14 17:57 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, Sage Weil, ceph-devel, jaschut

On Wed, Jan 26, 2011 at 9:18 PM, Nick Piggin <npiggin@gmail.com> wrote:
> On Wed, Jan 26, 2011 at 9:10 AM, Yehuda Sadeh Weinraub
> <yehudasa@gmail.com> wrote:
>> On Wed, Jan 19, 2011 at 2:32 PM, Nick Piggin <npiggin@gmail.com> wrote:
>>> On Thu, Jan 20, 2011 at 9:27 AM, Yehuda Sadeh Weinraub
>>> <yehudasa@gmail.com> wrote:
>>>> On Tue, Jan 18, 2011 at 2:42 PM, Nick Piggin <npiggin@gmail.com> wrote:
>>>>> On Wed, Jan 19, 2011 at 9:32 AM, Yehuda Sadeh Weinraub
>>>>
>>>>>> There's an issue with ceph as it references the
>>>>>> dentry->d_parent(->d_inode) at dentry_release(), so setting
>>>>>> dentry->d_parent to NULL here doesn't work with ceph. Though there is
>>>>>> some workaround for it, we would like to be sure that this one is
>>>>>> really required so that we don't exacerbate the ugliness. The
>>>>>> workaround is to keep a pointer to the parent inode in the private
>>>>>> dentry structure, which will be referenced only at the .release()
>>>>>> callback. This is clearly not ideal.
>>>>>
>>>>> Hmm, I'll have to think about it. Probably we can check for
>>>>> d_count == 0 rather than parent != NULL I think?
>>>>>
>>>>
>>>> That'll solve ceph's problem, don't know about how'd affect other
>>>> stuff. We'll need to know whether this is the solution, or whether
>>>> we'd need to introduce some other band aid fix.
>>>
>>> No I think it will work fine. Basically we just need to know whether
>>> we have been deleted, and if so then we restart rather than walking
>>> back up the parent.
>>>
>>> I'll send a patch in a few days. For the meantime, it's a rather
>>> small window for ceph to worry about. So we'll have something
>>> before -rc2 which should be OK.
>>>
>>
>> I guess that it's a bit late for -rc2, should we assume that it'll be on -rc3?
>
> Yeah, I'm sorry I've been travelling and a bit disconnected.
>
> NFS folk are having a similar problem and looks like similar
> proposed fix will do it.
>
> http://marc.info/?l=linux-fsdevel&m=129599823927039&w=2
>
> So I think it is the best way to go to restore behaviour back to what
> filesystems already expect, to avoid more surprises in future.
>

Hi Nick,
   -rc4 is out and that issue is still broken. Do you have that patch
ready or should we push our workaround?

Thanks,
Yehuda
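
[Editorial sketch, not part of the original message.] The check discussed
in the quoted exchange above can be illustrated in user space. This is only
a minimal sketch under stated assumptions: `struct toy_dentry` and
`toy_try_ascend()` are hypothetical names, not kernel code, and a reference
count of zero is taken to mean "this entry has been killed", as suggested in
the thread. The point it shows is that the parent pointer can stay intact
for a release callback (the ceph case) while the walker still detects
deletion and restarts instead of walking back up the parent.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a dentry; NOT the kernel's struct dentry. */
struct toy_dentry {
    struct toy_dentry *parent; /* left intact on deletion so a release hook can still read it */
    int count;                 /* reference count; 0 is assumed here to mean "has been killed" */
};

/*
 * Decide how a tree walk continues once it has finished with `child`.
 * Returns the parent to ascend to, or NULL to tell the caller to restart
 * the walk from the top because `child` was deleted or moved meanwhile.
 */
static struct toy_dentry *toy_try_ascend(struct toy_dentry *child,
                                         struct toy_dentry *expected_parent)
{
    /* Deleted while we were not holding the lock: do not trust the parent link. */
    if (child->count == 0)
        return NULL;

    /* Renamed under a different parent: the walk position is stale. */
    if (child->parent != expected_parent)
        return NULL;

    return child->parent;
}
```

A caller would loop: on a NULL return it goes back to the root of the walk
and starts over; otherwise it continues with the returned parent. The
actual fix merged upstream may use a different test than the reference
count; the thread only proposes it as a likely option.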

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-11-27 19:20 Sedat Dilek
@ 2010-11-27 20:53 ` Sedat Dilek
  0 siblings, 0 replies; 107+ messages in thread
From: Sedat Dilek @ 2010-11-27 20:53 UTC (permalink / raw)
  To: Nick Piggin, LKML

On Sat, Nov 27, 2010 at 8:20 PM, Sedat Dilek <sedat.dilek@googlemail.com> wrote:
> Hi,
>
> I wanted to give your patchset a try (on top of latest linux-next).
>
> Unfortunately, the build breaks here:
>
> /home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/fs/cifs/inode.c:807:
> error: ‘dcache_inode_lock’ undeclared (first use in this function)
> /home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/fs/cifs/inode.c:807:
> error: (Each undeclared identifier is reported only once
> /home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/fs/cifs/inode.c:807:
> error: for each function it appears in.)
>
> Attached patch "fs-cifs-inode.c-Fix-error-dcache_inode_lock-undeclared.patch"
> should fix it.
>
> Kind Regards,
> - Sedat -
>
> Compile-tested-by: me
>
[ ... ]

Next breakage:

/home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/drivers/staging/pohmelfs/inode.c:834:
error: ‘psb’ undeclared (first use in this function)
/home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/drivers/staging/pohmelfs/inode.c:834:
error: (Each undeclared identifier is reported only once
/home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/drivers/staging/pohmelfs/inode.c:834:
error: for each function it appears in.)

- Sedat -

P.S.: Note to myself: Unset CONFIG_POHMELFS for testing-purposes.

[ debian/config/i386/none/config.686 ]
...
##
## file: drivers/staging/pohmelfs/Kconfig
##
# CONFIG_POHMELFS is not set

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
@ 2010-11-27 19:20 Sedat Dilek
  2010-11-27 20:53 ` Sedat Dilek
  0 siblings, 1 reply; 107+ messages in thread
From: Sedat Dilek @ 2010-11-27 19:20 UTC (permalink / raw)
  To: Nick Piggin, LKML

[-- Attachment #1: Type: text/plain, Size: 1310 bytes --]

Hi,

I wanted to give your patchset a try (on top of latest linux-next).

Unfortunately, the build breaks here:

/home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/fs/cifs/inode.c:807:
error: ‘dcache_inode_lock’ undeclared (first use in this function)
/home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/fs/cifs/inode.c:807:
error: (Each undeclared identifier is reported only once
/home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/fs/cifs/inode.c:807:
error: for each function it appears in.)

Attached patch "fs-cifs-inode.c-Fix-error-dcache_inode_lock-undeclared.patch"
should fix it.

Kind Regards,
- Sedat -

Compile-tested-by: me

$ grep cifs build_linux-next_next20101126.dileks.4.log
  LD      fs/cifs/built-in.o
  CC [M]  fs/cifs/cifsfs.o
  CC [M]  fs/cifs/cifssmb.o
  CC [M]  fs/cifs/cifs_debug.o
  CC [M]  fs/cifs/connect.o
  CC [M]  fs/cifs/dir.o
  CC [M]  fs/cifs/file.o
  CC [M]  fs/cifs/inode.o
  CC [M]  fs/cifs/link.o
  CC [M]  fs/cifs/misc.o
  CC [M]  fs/cifs/netmisc.o
  CC [M]  fs/cifs/smbdes.o
  CC [M]  fs/cifs/smbencrypt.o
  CC [M]  fs/cifs/transport.o
  CC [M]  fs/cifs/asn1.o
  CC [M]  fs/cifs/md4.o
  CC [M]  fs/cifs/md5.o
  CC [M]  fs/cifs/cifs_unicode.o
  CC [M]  fs/cifs/nterr.o

[-- Attachment #2: fs-cifs-inode.c-Fix-error-dcache_inode_lock-undeclared.patch --]
[-- Type: plain/text, Size: 612 bytes --]

^ permalink raw reply	[flat|nested] 107+ messages in thread

end of thread, other threads:[~2011-02-14 17:57 UTC | newest]

Thread overview: 107+ messages (download: mbox.gz / follow: Atom feed)
2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
2010-11-27  9:44 ` [PATCH 02/46] fs: d_validate fixes Nick Piggin
2010-12-08  1:53   ` Dave Chinner
2010-12-08  6:59     ` Nick Piggin
2010-12-09  0:50       ` Dave Chinner
2010-12-09  0:50         ` Dave Chinner
2010-12-09  4:50         ` Nick Piggin
2010-12-09  4:50           ` Nick Piggin
2010-11-27  9:44 ` [PATCH 03/46] kernel: kmem_ptr_validate considered harmful Nick Piggin
2010-11-27  9:44 ` [PATCH 04/46] fs: dcache documentation cleanup Nick Piggin
2010-11-27  9:44 ` [PATCH 05/46] fs: change d_delete semantics Nick Piggin
2010-11-27  9:44 ` [PATCH 06/46] cifs: dont overwrite dentry name in d_revalidate Nick Piggin
2010-11-27  9:44 ` [PATCH 07/46] jfs: " Nick Piggin
2010-11-27  9:44 ` [PATCH 08/46] fs: change d_compare for rcu-walk Nick Piggin
2010-11-27  9:44 ` [PATCH 09/46] fs: change d_hash " Nick Piggin
2010-11-27  9:44 ` [PATCH 10/46] hostfs: simplify locking Nick Piggin
2010-11-27  9:44 ` [PATCH 11/46] fs: dcache scale hash Nick Piggin
2010-12-09  6:09   ` Dave Chinner
2010-12-09  6:28     ` Nick Piggin
2010-12-09  8:17       ` Dave Chinner
2010-12-09 12:53         ` Nick Piggin
2010-12-09 23:42           ` Dave Chinner
2010-12-10  2:35             ` Nick Piggin
2010-12-10  9:01               ` Dave Chinner
2010-12-13  4:48                 ` Nick Piggin
2010-12-13  5:05                 ` Nick Piggin
2010-11-27  9:44 ` [PATCH 12/46] fs: dcache scale lru Nick Piggin
2010-12-09  7:22   ` Dave Chinner
2010-12-09 12:34     ` Nick Piggin
2010-11-27  9:44 ` [PATCH 13/46] fs: dcache scale dentry refcount Nick Piggin
2010-11-27  9:44 ` [PATCH 14/46] fs: dcache scale d_unhashed Nick Piggin
2010-11-27  9:44 ` [PATCH 15/46] fs: dcache scale subdirs Nick Piggin
2010-11-27  9:44 ` [PATCH 16/46] fs: scale inode alias list Nick Piggin
2010-11-27  9:44 ` [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations Nick Piggin
2011-01-18 22:32   ` Yehuda Sadeh Weinraub
2011-01-18 22:42     ` Nick Piggin
2011-01-19 22:27       ` Yehuda Sadeh Weinraub
2011-01-19 22:32         ` Nick Piggin
2011-01-25 22:10           ` Yehuda Sadeh Weinraub
2011-01-27  5:18             ` Nick Piggin
2011-02-07 18:52               ` Jim Schutt
2011-02-07 21:04                 ` Yehuda Sadeh Weinraub
2011-02-07 21:04                   ` Yehuda Sadeh Weinraub
2011-02-07 21:31                   ` Jim Schutt
2011-02-07 21:35                     ` Gregory Farnum
2011-02-07 22:25                   ` Jim Schutt
2011-02-14 17:57               ` Yehuda Sadeh Weinraub
2010-11-27  9:44 ` [PATCH 18/46] fs: increase d_name lock coverage Nick Piggin
2010-11-27  9:44 ` [PATCH 19/46] fs: dcache remove dcache_lock Nick Piggin
2010-11-27  9:44 ` [PATCH 20/46] fs: dcache avoid starvation in dcache multi-step operations Nick Piggin
2010-11-27  9:44 ` [PATCH 21/46] fs: dcache reduce dput locking Nick Piggin
2010-11-27  9:44 ` [PATCH 22/46] fs: dcache reduce locking in d_alloc Nick Piggin
2010-11-27  9:44 ` [PATCH 23/46] fs: dcache reduce dcache_inode_lock Nick Piggin
2010-11-27  9:44 ` [PATCH 24/46] fs: dcache rationalise dget variants Nick Piggin
2010-11-27  9:44 ` [PATCH 25/46] fs: dcache reduce d_parent locking Nick Piggin
2010-11-27  9:44 ` [PATCH 26/46] fs: dcache reduce prune_one_dentry locking Nick Piggin
2010-11-27  9:44 ` [PATCH 27/46] fs: reduce dcache_inode_lock width in lru scanning Nick Piggin
2010-11-27  9:44 ` [PATCH 28/46] fs: use RCU in shrink_dentry_list to reduce lock nesting Nick Piggin
2010-11-27  9:44 ` [PATCH 29/46] fs: consolidate dentry kill sequence Nick Piggin
2010-11-27  9:45 ` [PATCH 30/46] fs: icache RCU free inodes Nick Piggin
2010-11-27  9:45 ` [PATCH 31/46] fs: avoid inode RCU freeing for pseudo fs Nick Piggin
2010-11-27  9:45 ` [PATCH 32/46] kernel: optimise seqlock Nick Piggin
2010-11-27  9:45 ` [PATCH 33/46] fs: rcu-walk for path lookup Nick Piggin
2010-11-27  9:45 ` [PATCH 34/46] fs: fs_struct use seqlock Nick Piggin
2010-11-27  9:45 ` [PATCH 35/46] fs: dcache remove d_mounted Nick Piggin
2010-11-27  9:45 ` [PATCH 36/46] fs: dcache reduce branches in lookup path Nick Piggin
2010-11-27  9:45 ` [PATCH 37/46] fs: cache optimise dentry and inode for rcu-walk Nick Piggin
2010-11-27  9:45 ` [PATCH 38/46] fs: prefetch inode data in dcache lookup Nick Piggin
2010-11-27  9:45 ` [PATCH 39/46] fs: d_revalidate_rcu for rcu-walk Nick Piggin
2010-11-27  9:45 ` [PATCH 40/46] fs: provide rcu-walk aware permission i_ops Nick Piggin
2010-11-27  9:45 ` [PATCH 41/46] fs: provide simple rcu-walk ACL implementation Nick Piggin
2010-11-27  9:45 ` [PATCH 42/46] kernel: add bl_list Nick Piggin
2010-11-27  9:45 ` [PATCH 43/46] bit_spinlock: add required includes Nick Piggin
2010-11-27  9:45 ` [PATCH 44/46] fs: dcache per-bucket dcache hash locking Nick Piggin
2010-11-27  9:45 ` [PATCH 45/46] fs: dcache per-inode inode alias locking Nick Piggin
2010-11-27  9:45 ` [PATCH 46/46] fs: improve scalability of pseudo filesystems Nick Piggin
2010-11-27  9:56 ` [PATCH 01/46] Revert "fs: use RCU read side protection in d_validate" Nick Piggin
2010-12-08  1:16   ` Dave Chinner
2010-12-08  9:38     ` Nick Piggin
2010-12-09  0:44       ` Dave Chinner
2010-12-09  4:38         ` Nick Piggin
2010-12-09  5:16           ` Nick Piggin
2010-11-27 15:04 ` [PATCH 00/46] rcu-walk and dcache scaling Anca Emanuel
2010-11-27 15:04   ` Anca Emanuel
2010-11-28  3:28   ` Nick Piggin
2010-11-28  3:28     ` Nick Piggin
2010-11-28  6:24     ` Sedat Dilek
2010-12-01 18:03 ` David Miller
2010-12-03 16:55   ` Nick Piggin
2010-12-07 11:25 ` Dave Chinner
2010-12-07 15:24   ` Nick Piggin
2010-12-07 15:24     ` Nick Piggin
2010-12-07 15:49     ` Peter Zijlstra
2010-12-07 15:59       ` Nick Piggin
2010-12-07 16:23         ` Peter Zijlstra
2010-12-08  3:28     ` Nick Piggin
2010-12-07 21:56 ` Dave Chinner
2010-12-08  1:47   ` Nick Piggin
2010-12-08  3:32     ` Dave Chinner
2010-12-08  4:28       ` Dave Chinner
2010-12-08  7:09         ` Nick Piggin
2010-12-08  7:09           ` Nick Piggin
2010-12-10 20:32           ` Paul E. McKenney
2010-12-12 14:54             ` Paul E. McKenney
2010-12-12 14:54               ` Paul E. McKenney
2010-11-27 19:20 Sedat Dilek
2010-11-27 20:53 ` Sedat Dilek
